OpenAI Five defeats the world champion Dota team 2-0

It was also revealed today that OpenAI Five has beaten two other pro teams 2-0. About nine months ago, OpenAI Five lost two matches against two different teams of pro players.

Blog post from OpenAI:

Commentary by RL researcher Alex Irpan:

@thenonconsensus and I were talking about this result, and asking the question:

If pure reinforcement learning in simulation is so successful with Dota, will it be successful for self-driving cars?

Talking it through helped me express my intuition on this more clearly than I ever have before. Here’s why I’m skeptical pure RL in sim will work.

One of the things Waymo has struggled with is unprotected left turns: turning left on a green light when there is oncoming traffic. Unprotected left turns seem like a hard task to solve with a hand-coded if-then-else rule. One way to solve them would be imitation learning: just observe how humans make unprotected left turns, and do it that way. There are all kinds of things you could capture through imitation learning that would be hard, if not impossible, to know and specify exactly in a hand-coded rule. Such as:

  1. What is the likely speed, acceleration, and overall trajectory of oncoming vehicles at an intersection like this one? (Real world behaviour, not just the posted speed limit.)

  2. If I start to turn left, will oncoming vehicles slow down at all? What percent will slow down and by how much? If they do slow down, how much time do human drivers need to react?

  3. Is nudging left slowly a helpful signal that will get oncoming cars to slow down, or do I have to turn left aggressively to get the desired response?

These are empirical questions that can only be answered with empirical data. The sort of real-world experimentation you would need to do to get this data would obviously be unsafe. Imitation learning is a safe alternative. Through imitation learning, you can (at least in theory) copy the behaviour of cars trying to make unprotected left turns and the behaviour of the oncoming cars. Once you’ve done that, you can simulate that behaviour and do reinforcement learning in simulation. So, pure RL in sim probably won’t work, but RL bootstrapped with IL could work!
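To make the two-stage idea concrete, here’s a minimal sketch of “RL bootstrapped with IL”: behaviour-clone a policy from logged human (state, action) pairs, then fine-tune it with a simple policy gradient in simulation. Everything in it is a placeholder I made up for illustration (the 8-d state, the 2-d steer/accelerate action, the toy simulator, the reward), not anyone’s actual stack:

```python
# Minimal sketch: pretrain a driving policy with behaviour cloning (IL),
# then fine-tune it with policy-gradient RL in simulation.
# All dimensions, the toy simulator, and the reward are illustrative placeholders.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2  # e.g. ego pose/speed + nearest oncoming car; steer + accel

policy = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.Tanh(),
    nn.Linear(64, ACTION_DIM),
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# --- Stage 1: imitation learning on logged human left turns ----------------
# Stand-in for real logs: (state, action) pairs recorded from human drivers.
states = torch.randn(4096, STATE_DIM)
actions = torch.randn(4096, ACTION_DIM)

for epoch in range(20):
    bc_loss = ((policy(states) - actions) ** 2).mean()  # behaviour cloning = regression
    opt.zero_grad(); bc_loss.backward(); opt.step()

# --- Stage 2: RL fine-tuning in simulation, starting from the IL policy ----
def sim_step(state, action):
    """Toy stand-in for a simulator whose other agents replay IL-learned
    human behaviour, so the ego policy's expectations stay realistic."""
    next_state = state + 0.1 * torch.randn(STATE_DIM)
    reward = -float(action.abs().sum())  # placeholder: penalise harsh inputs
    return next_state, reward

for episode in range(100):
    state, log_probs, rewards = torch.randn(STATE_DIM), [], []
    for t in range(50):
        dist = torch.distributions.Normal(policy(state), 0.1)
        action = dist.sample()
        log_probs.append(dist.log_prob(action).sum())
        state, reward = sim_step(state, action)
        rewards.append(reward)
    # REINFORCE: scale the episode's log-probs by its total return
    pg_loss = -torch.stack(log_probs).sum() * sum(rewards)
    opt.zero_grad(); pg_loss.backward(); opt.step()
```

The point is just the ordering: the IL stage anchors the policy (and, in a full system, the simulated traffic around it) to empirically observed human behaviour before RL starts optimizing against it.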

What would happen if you took the OpenAI Five approach and did pure multi-agent RL from scratch? An important difference between Dota and driving is that Dota is competitive and driving is cooperative. Multi-agent RL with Dota means that as other agents get better and better, the task gets harder and harder. With driving, the reverse is true: as other agents get better and better at driving, driving gets easier and easier. If you copied the OpenAI Five approach with a driving simulator, here are two possibilities of what you might end up with:

A. Oncoming cars cautiously slow down to let a car make an unprotected left.

B. Cars don’t slow down at all but just weave around the left-turning car with superhuman reflexes.

With either A or B, the behaviour would be successful in simulation, but it would be a disaster in the real world. The self-driving car would have expectations about how oncoming traffic behaves that are totally unrealistic.

To make unprotected left turns in the real world, you need to know how real-world human drivers empirically behave, including how they react (or don’t react) to your actions, such as initiating a left turn. If you just do pure RL in sim, you will only know how other RL agents behave, and they might behave totally unlike human drivers.
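Here’s the failure mode in miniature, as a toy calculation with made-up numbers: a self-play-trained ego policy has learned that oncoming traffic always yields, and then it meets human drivers who only yield some of the time.

```python
import random

# Toy illustration of the distribution shift. All numbers are invented.
def turn_outcome(ego_commits: bool, oncoming_yields: bool) -> float:
    """Reward for one unprotected-left-turn attempt."""
    if not ego_commits:
        return 0.0  # wait forever: safe, but makes no progress
    return 1.0 if oncoming_yields else -100.0  # a collision is catastrophic

def expected_return(commit_prob: float, yield_prob: float, n: int = 100_000) -> float:
    total = 0.0
    for _ in range(n):
        total += turn_outcome(random.random() < commit_prob,
                              random.random() < yield_prob)
    return total / n

# In self-play, the oncoming agents learned to always yield (scenario A),
# so the ego policy learned to always commit:
print(expected_return(commit_prob=1.0, yield_prob=1.0))  # ~ +1.0 in sim
# Against human drivers who yield, say, 60% of the time, the same policy:
print(expected_return(commit_prob=1.0, yield_prob=0.6))  # ~ -39.4: a disaster
```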

I’m not claiming that pure RL in sim certainly won’t work; this is just the reason I’m skeptical it will work. Maybe there are good tricks to overcome this problem, like randomizing the behaviour of agents. I’m open to hearing ideas, and criticisms of what I’ve said above.
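For example, here’s roughly what “randomizing the behaviour of agents” might look like: every simulated oncoming driver draws a fresh behaviour profile each episode, so the learner can’t overfit to one style of traffic. The parameters and ranges below are invented for illustration, not from any real simulator.

```python
import random
from dataclasses import dataclass

@dataclass
class DriverProfile:
    reaction_time_s: float   # how long before they respond to our nudge
    yield_prob: float        # chance they slow down for a left-turner at all
    speed_over_limit: float  # m/s above the posted limit they actually drive
    aggression: float        # 0 = timid, 1 = weaves around obstacles

def sample_driver() -> DriverProfile:
    """Draw a fresh behaviour profile per episode, so the learner sees
    everything from cautious yielders to drivers who never slow down."""
    return DriverProfile(
        reaction_time_s=random.uniform(0.5, 2.5),
        yield_prob=random.uniform(0.0, 1.0),
        speed_over_limit=random.uniform(-2.0, 5.0),
        aggression=random.uniform(0.0, 1.0),
    )

# Per episode: every oncoming car gets its own randomized profile.
oncoming_cars = [sample_driver() for _ in range(6)]
```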


New blog post from OpenAI:

In total, the current version of OpenAI Five has consumed 800 petaflop/s-days and experienced about 45,000 years of Dota self-play over 10 realtime months (up from about 10,000 years over 1.5 realtime months as of The International), for an average of 250 years of simulated experience per day.

“800 petaflop/s-days” is a confusing term (to the unfamiliar). Do they mean the equivalent of running 1 petaflop/s for 800 days? If so, that would be about 70 million petaflop-seconds (roughly 7 × 10^22 floating-point operations) of total computation.

Yes. It’s 800 * 10^15 * 24 * 3600 ≈ 6.9 × 10^22 operations.

So, to train OpenAI Five in 12 months, you would need 70 million petaflop-seconds divided by the number of seconds in a year, which comes to roughly 2.2 petaflop/s of sustained compute. Tried to figure out how much this would cost but gave up quickly. :stuck_out_tongue_closed_eyes:
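For anyone checking the units, here’s the arithmetic spelled out:

```python
# "800 petaflop/s-days" = 10^15 floating-point ops per second, times 800 days.
PFLOP = 1e15
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

total_ops = 800 * PFLOP * SECONDS_PER_DAY  # ~ 6.9e22 operations
petaflop_seconds = total_ops / PFLOP       # ~ 69.1 million petaflop-seconds

# Sustained rate needed to replicate the training run in 12 months:
rate_pflops = total_ops / SECONDS_PER_YEAR / PFLOP  # ~ 2.19 petaflop/s
print(f"{total_ops:.3g} ops total, {petaflop_seconds/1e6:.1f}M petaflop-seconds, "
      f"{rate_pflops:.2f} PFLOP/s sustained for a year")
```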

~99.2% win rate against human players so far

One of the discussion threads here:

Sounds like the bots are great at team fighting but not so good at handling split-push strategies, which is a slightly different way of playing that doesn’t involve ever grouping up as a team.
