AlphaStar was trained using a combination of supervised imitation learning and reinforcement learning:
How AlphaStar is trained
AlphaStar’s behaviour is generated by a deep neural network that receives input data from the raw game interface (a list of units and their properties), and outputs a sequence of instructions that constitute an action within the game. More specifically, the neural network architecture applies a transformer torso to the units, combined with a deep LSTM core, an auto-regressive policy head with a pointer network, and a centralised value baseline. We believe that this advanced model will help with many other challenges in machine learning research that involve long-term sequence modelling and large output spaces such as translation, language modelling and visual representations.
AlphaStar also uses a novel multi-agent learning algorithm. The neural network was initially trained by supervised learning from anonymised human games released by Blizzard. This allowed AlphaStar to learn, by imitation, the basic micro and macro-strategies used by players on the StarCraft ladder. This initial agent defeated the built-in “Elite” level AI - around gold level for a human player - in 95% of games.
The AlphaStar league. Agents are initially trained from human game replays, and then trained against other competitors in the league. At each iteration, new competitors are branched, original competitors are frozen, and the matchmaking probabilities and hyperparameters determining the learning objective for each agent may be adapted, increasing the difficulty while preserving diversity. The parameters of the agent are updated by reinforcement learning from the game outcomes against competitors. The final agent is sampled (without replacement) from the Nash distribution of the league.These were then used to seed a multi-agent reinforcement learning process. A continuous league was created, with the agents of the league - competitors - playing games against each other, akin to how humans experience the game of StarCraft by playing on the StarCraft ladder. New competitors were dynamically added to the league, by branching from existing competitors; each agent then learns from games against other competitors. This new form of training takes the ideas of population-based and multi-agent reinforcement learning further, creating a process that continually explores the huge strategic space of StarCraft gameplay, while ensuring that each competitor performs well against the strongest strategies, and does not forget how to defeat earlier ones.
Estimate of the Match Making Rating (MMR) - an approximate measure of a player’s skill - for competitors in the AlphaStar league, throughout training, in comparison to Blizzard’s online leagues.
The dataset of anonymized human games contains 500,000+ games. Yoav Goldberg, a machine learning researcher, estimated on Twitter that an AlphaStar-like system could be trained in 2 weeks for around $4 million using Google Cloud TPUs.
I find AlphaStar particularly intriguing because of a report that Tesla is using supervised imitation learning for autonomous driving. Later this year, Tesla will have the technical capability to upload hundreds of millions or billions of miles of human state/action pairs for imitation learning.
AlphaStar was able to get quite far with just supervised imitation learning. If StarCraft II’s built-in Elite AI (a bot that is part of the game, not one designed by DeepMind) is actually equivalent to a human player in the Gold league, then imitation learning achieved roughly Gold-level performance. DeepMind’s estimated MMR would also put this version of AlphaStar in Gold league. Edit: The info I originally posted about this is outdated. Gold league comprises the 30th to 50th percentile of players.
Waymo’s ChauffeurNet is a more direct comparison for imitation learning applied to autonomous driving, but 1) ChauffeurNet was only trained on around 25,000 to 75,000 miles of driving and 2) there is no direct way to compare ChauffeurNet’s performance to human performance.
AlphaStar is also a cool example of bootstrapping with imitation learning and then passing the baton to reinforcement learning. This is an approach that Waymo suggests in their ChauffeurNet paper, but doesn’t actually try. I have a hunch that Tesla will try the same approach because it just makes sense to do. Once you reach the limits of imitation learning, reinforcement learning offers a path for further improvement.
Last night I was re-reading Amir Efrati’s article on Tesla’s AI team. Something struck me that I had overlooked before (emphasis added):
But Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team.
I wonder if Tesla would be able to create a ranking system for Tesla drivers. I know insurance companies have apps and devices that try to measure driving quality using data from accelerometers and gyroscopes. Tesla has access to the same data and more.
What if Tesla made a league system? Diamond drivers would be those in the top 20%. What if Tesla only used supervised imitation learning on the state/action pairs of Diamond drivers? If the end result were that a fully autonomous Tesla drove like a Diamond driver, that alone would be enough to get to above-average driving.