DeepMind’s AlphaStar beats a top player in StarCraft II


AlphaStar was trained using a combination of supervised imitation learning and reinforcement learning:

How AlphaStar is trained

AlphaStar’s behaviour is generated by a deep neural network that receives input data from the raw game interface (a list of units and their properties), and outputs a sequence of instructions that constitute an action within the game. More specifically, the neural network architecture applies a transformer torso to the units, combined with a deep LSTM core, an auto-regressive policy head with a pointer network, and a centralised value baseline. We believe that this advanced model will help with many other challenges in machine learning research that involve long-term sequence modelling and large output spaces such as translation, language modelling and visual representations.

AlphaStar also uses a novel multi-agent learning algorithm. The neural network was initially trained by supervised learning from anonymised human games released by Blizzard. This allowed AlphaStar to learn, by imitation, the basic micro and macro-strategies used by players on the StarCraft ladder. This initial agent defeated the built-in “Elite” level AI - around gold level for a human player - in 95% of games.

The AlphaStar league. Agents are initially trained from human game replays, and then trained against other competitors in the league. At each iteration, new competitors are branched, original competitors are frozen, and the matchmaking probabilities and hyperparameters determining the learning objective for each agent may be adapted, increasing the difficulty while preserving diversity. The parameters of the agent are updated by reinforcement learning from the game outcomes against competitors. The final agent is sampled (without replacement) from the Nash distribution of the league.

These were then used to seed a multi-agent reinforcement learning process. A continuous league was created, with the agents of the league - competitors - playing games against each other, akin to how humans experience the game of StarCraft by playing on the StarCraft ladder. New competitors were dynamically added to the league, by branching from existing competitors; each agent then learns from games against other competitors. This new form of training takes the ideas of population-based and multi-agent reinforcement learning further, creating a process that continually explores the huge strategic space of StarCraft gameplay, while ensuring that each competitor performs well against the strongest strategies, and does not forget how to defeat earlier ones.
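To make the league idea concrete, here is a toy sketch of the branch, freeze and play loop described above. It is not DeepMind's implementation. The `Competitor` class, the scalar `skill` standing in for network weights, the Elo-style win probability, the uniform matchmaking and the fixed skill bump per game are all assumptions made up for illustration. The final Nash-based selection is left out.

```python
import random


class Competitor:
    """Toy league member; `skill` is a stand-in for real network parameters."""

    def __init__(self, name, skill, frozen=False):
        self.name = name
        self.skill = skill
        self.frozen = frozen


def play(a, b, rng):
    """Return True if `a` beats `b`, using an Elo-style logistic win chance."""
    p_win = 1.0 / (1.0 + 10 ** ((b.skill - a.skill) / 400.0))
    return rng.random() < p_win


def run_league(iterations=5, games_per_iteration=200, seed=0):
    rng = random.Random(seed)
    # The league is seeded with one agent, standing in for the supervised one.
    league = [Competitor("supervised-0", skill=1000.0)]
    for it in range(1, iterations + 1):
        # Branch a new competitor from each active one, then freeze the original.
        for parent in [c for c in league if not c.frozen]:
            league.append(Competitor(f"{parent.name}>it{it}", parent.skill))
            parent.frozen = True
        # Each unfrozen competitor learns from games against the rest of the league.
        for learner in [c for c in league if not c.frozen]:
            opponents = [c for c in league if c is not learner]
            for _ in range(games_per_iteration):
                opponent = rng.choice(opponents)  # uniform matchmaking (assumption)
                won = play(learner, opponent, rng)
                # Made-up update: a small gain per game, larger after a loss.
                learner.skill += 0.5 if won else 1.5
    return league


if __name__ == "__main__":
    for c in run_league():
        print(f"{c.name:40s} skill={c.skill:7.1f} frozen={c.frozen}")
```

Because frozen competitors stay in the opponent pool, the current learner keeps playing against earlier versions, which is the "does not forget how to defeat earlier ones" part of the quote.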

Estimate of the Match Making Rating (MMR) - an approximate measure of a player’s skill - for competitors in the AlphaStar league, throughout training, in comparison to Blizzard’s online leagues.

The dataset of anonymized human games contains 500,000+ games. Yoav Goldberg, a machine learning researcher, estimated on Twitter that an AlphaStar-like system could be trained in 2 weeks for around $4 million using Google Cloud TPUs.

I find AlphaStar particularly intriguing because of a report that Tesla is using supervised imitation learning for autonomous driving. Later this year, Tesla will have the technical capability to upload hundreds of millions or billions of miles of human state/action pairs for imitation learning.

AlphaStar was able to get quite far with just supervised imitation learning. If StarCraft II’s built-in Elite AI (a bot that is part of the game, not one designed by DeepMind) is actually equivalent to a human player in the Gold league, then imitation learning achieved roughly Gold-level performance. DeepMind’s estimated MMR would also put this version of AlphaStar in Gold league. Edit: The info I originally posted about this is outdated. Gold league comprises the 30th to 50th percentile of players.

Waymo’s ChauffeurNet is a more direct comparison for imitation learning applied to autonomous driving, but 1) ChauffeurNet was only trained on around 25,000 to 75,000 miles of driving and 2) there is no direct way to compare ChauffeurNet’s performance to human performance.

AlphaStar is also a cool example of bootstrapping with imitation learning and then passing the baton to reinforcement learning. This is an approach that Waymo suggests in their ChauffeurNet paper, but doesn’t actually try. I have a hunch that Tesla will try the same approach because it just makes sense to do. Once you reach the limits of imitation learning, reinforcement learning offers a path for further improvement.

Last night I was re-reading Amir Efrati’s article on Tesla’s AI team. Something struck me that I had overlooked before (emphasis added):

But Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team.

I wonder if Tesla would be able to create a ranking system for Tesla drivers. I know insurance companies have apps and devices that try to measure driving quality using data from accelerometers and gyroscopes. Tesla has access to the same data and more.

What if Tesla made a league system? Diamond drivers would be those in the top 20%. What if Tesla only used supervised imitation learning on the state/action pairs of Diamond drivers? If the end result were that a fully autonomous Tesla drove like a Diamond driver, that alone would be enough to get to above-average driving.
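As a rough sketch of what that filtering could look like, assume each driver already has a single quality score. How to compute that score is the open question; the names `driver_scores`, `pairs_by_driver` and `diamond_pairs` are hypothetical and nothing here reflects how Tesla actually handles its data.

```python
def diamond_pairs(driver_scores, pairs_by_driver, top_fraction=0.20):
    """Keep state/action pairs only from drivers ranked in the top fraction.

    driver_scores:   dict of driver id -> quality score (higher is better)
    pairs_by_driver: dict of driver id -> list of (state, action) pairs
    """
    ranked = sorted(driver_scores, key=driver_scores.get, reverse=True)
    n_keep = max(1, int(len(ranked) * top_fraction))  # always keep at least one
    return [pair for d in ranked[:n_keep] for pair in pairs_by_driver.get(d, [])]


if __name__ == "__main__":
    # Made-up scores and placeholder state/action pairs.
    scores = {"a": 0.91, "b": 0.55, "c": 0.78, "d": 0.40, "e": 0.66}
    pairs = {d: [(f"state-{d}-{i}", f"action-{d}-{i}") for i in range(2)] for d in scores}
    print(diamond_pairs(scores, pairs))
```

The imitation learner would then be trained only on the returned pairs.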


DeepMind says AlphaStar runs on “a single desktop GPU”, which is crazy to me. Edit: On second thought, an Nvidia Titan V does 110 deep learning teraops and costs $3,000. A single desktop GPU can be pretty beefy. I was thinking of a typical desktop GPU.

I’m looking forward to DeepMind’s paper on AlphaStar. It will probably include which GPU AlphaStar runs on.

Here’s a point of comparison. Mobileye says that ~99% of the computation in its autonomous vehicles will be used for perception and only ~1% will be used for driving policy, which is trained using reinforcement learning. Mobileye plans to use three EyeQ5 chips in each vehicle. An EyeQ5 can do 24 deep learning teraops, so that’s 72 teraops total. The second and third EyeQ5 might be purely for redundancy, but here I’ll assume they’re all being used for just one instance of the autonomous driving software. ~1% of 72 teraops is 720 gigaops. That’s a lot less than a typical GPU for a gaming PC.
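The arithmetic, spelled out using only the figures quoted in this thread. The Titan V comparison at the end is my own addition, since no number for a "typical" gaming GPU is given here.

```python
EYEQ5_TERAOPS = 24         # per chip, as quoted above
CHIPS_PER_VEHICLE = 3      # assumes all three chips run one software instance
POLICY_SHARE_PERCENT = 1   # ~1% of compute goes to driving policy
TITAN_V_TERAOPS = 110      # figure quoted earlier in the thread


def policy_budget_gigaops():
    """Compute left for driving policy, in gigaops."""
    total_gigaops = EYEQ5_TERAOPS * CHIPS_PER_VEHICLE * 1000
    return total_gigaops * POLICY_SHARE_PERCENT // 100


if __name__ == "__main__":
    budget = policy_budget_gigaops()
    ratio = TITAN_V_TERAOPS * 1000 / budget
    print(f"policy budget: {budget} gigaops")
    print(f"Titan V is about {ratio:.0f}x that budget")
```

So on these numbers the policy budget is 720 gigaops, roughly 150 times less than the Titan V figure.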

The slide is from 26:15 in this video.


An analysis of AlphaStar’s matches against MaNa, from a player in the top 5%. It goes over mechanical superiority vs. strategic intelligence.


MaNa talks about playing against AlphaStar:


I wrote a post about AlphaStar and self-driving cars:

The big, lingering question:

Is it harder for a neural network to learn to drive a car through imitation learning and/or reinforcement learning than to play StarCraft or Dota at an expert level?


Interesting collection of articles - thanks for sharing.

These really big RL systems are very hard to train. There are a lot of choices to make and most of them don’t lead to very good results. With the AlphaGo / AlphaGo Zero / AlphaZero effort there’s a clear progression from imitating experts to, eventually, being completely independent of them. Starting from imitation gives you a base to work on, then you augment that with self-play and study the parameters needed to get self-play to learn effectively. Once you have self-play working well, you retrain from the start but with less imitation and more self-learning. By repeating this process you reduce the need for imitation to zero, step by step, by finding the training characteristics that let the system start from scratch and train effectively all the way up to high performance.

Here you see the process repeating with a different game. Other complex RL efforts contain similar development themes. It seems like such an approach might work with vehicles: start with imitation, add RL, explore the parameter space for self-learning, gradually reduce the dependency on imitation. At the end of the process you have a methodology that lets the system start from raw data without imitation and produce a highly functional system.

It’s probably helpful to have reasonably high quality human behavior samples to start from, but that’s just an accelerant. Even relatively low quality human imitation probably works in theory - if the development team is willing to curate the set to remove excessive amounts of negative examples. AlphaGo started with a database of 600k ‘serious amateur’ games. I don’t think they ever had anything higher quality than that and the team could not curate the games themselves because they didn’t have any good players.


One of the lead AlphaStar developers, Oriol Vinyals, said on Reddit:

Driving a car is harder. The lack of (perfect) simulators doesn’t allow training for as much time as would be needed for Deep RL to really shine.

I asked him about imitation learning but he hasn’t replied (yet).


Certainly true. Thinking about that, I wonder whether imperfect simulation could be managed the way OpenAI handled it with Dactyl (training over a range of perturbations of the imperfectly known parameters), and whether the result might resemble what happens when you use DRL in imperfect-information games (of which StarCraft is one).


Yeah, domain randomization!

From what I’ve read I think the main reality gap for driving simulators (i.e. the main difference between simulation and reality) is the behaviour of road users. Randomizing that might help. But with Dactyl, the physics of the simulations weren’t completely random; if I recall correctly, they were all within a small range of each other. So the problem still remains of how to create a human-like agent for simulations.
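For anyone who hasn't seen domain randomization, here is a minimal sketch of the idea applied to the behaviour of other road users. The parameter names and ranges are invented for illustration and are not taken from Dactyl or from any real driving simulator. Each episode simply draws a fresh set of simulator parameters from fixed ranges before training on it.

```python
import random

# Hypothetical behaviour parameters for simulated road users, as (low, high).
PARAM_RANGES = {
    "reaction_time_s": (0.5, 2.0),
    "following_gap_s": (0.8, 3.0),
    "speed_offset_mps": (-3.0, 3.0),
}


def sample_sim_params(rng, ranges=PARAM_RANGES):
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def training_episodes(n_episodes, seed=0):
    """Yield (episode index, randomized parameters) for each training episode."""
    rng = random.Random(seed)
    for episode in range(n_episodes):
        # A real setup would configure the simulator with these values
        # and then run the policy in it; here we only yield them.
        yield episode, sample_sim_params(rng)


if __name__ == "__main__":
    for episode, params in training_episodes(3):
        print(episode, {k: round(v, 2) for k, v in params.items()})
```

The width of the ranges is the design choice that matters: as noted above, this only perturbs parameters of a fixed behaviour model, so it does not by itself produce human-like agents.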

There may be other important reality gaps, but I haven’t come across mention of them yet. I’m still not entirely sure whether physical control is a solved problem. When a Tesla on Autopilot goes over the centre line on a tight corner, is that a perception problem, a path planning problem, or a control problem?

I don’t see why you couldn’t use HD maps of real world locations to create simulation environments. You could also play back mid-level representations recorded from real world driving, so what the virtual car in the simulator sees is exactly what a real world car saw. I read a paper where the researchers (one of whom was Yann LeCun) did this, but based on video feeds from overhead traffic cameras. The problem is that you need the other cars in the simulation to actually respond to your behaviour, not just ignore you like a ghost. So that’s why there need to be smart, human-like agents in the simulator.