OpenAI recently released some updates on their quest to build a Dota AI. They revealed that they trained an agent that can effectively play 5v5 Dota 2 and beat semi-pro teams (albeit on a restricted variant of the original game).
A lot of interesting things are said in the post, so you should go check it out yourself. But here are a few that jumped out at me:
Dollar value cost
Almost always, the first thing I ask when I read something like this is: how much does it cost? Impressive feats of reinforcement learning (RL), like AlphaGo Zero, can cost on the order of millions to tens of millions of dollars to actually run.
OpenAI Five is not that different (emphasis mine):
OpenAI Five plays 180 years worth of games against itself every day, learning via self-play. It trains using a scaled-up version of Proximal Policy Optimization running on 256 GPUs and 128,000 CPU cores — a larger-scale version of the system we built to play the much-simpler solo variant of the game last year.
Given the current (as of June 2018) cloud prices for the hardware mentioned, the whole system costs around $25,000 per day to run. About 90% of that cost is dedicated to simulating matches (as opposed to running gradient updates).
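As a sanity check, the daily figure can be reproduced with back-of-the-envelope arithmetic. The hourly prices below are my own assumptions (roughly preemptible cloud rates circa mid-2018), not numbers from the post:

```python
# Back-of-the-envelope estimate of OpenAI Five's daily training cost.
# Hourly prices are assumptions (rough mid-2018 preemptible cloud rates).
GPU_PRICE_PER_HOUR = 0.45    # $ per GPU-hour (assumed)
CPU_PRICE_PER_HOUR = 0.0075  # $ per CPU-core-hour (assumed)

gpus, cpu_cores, hours = 256, 128_000, 24

gpu_cost = gpus * GPU_PRICE_PER_HOUR * hours       # optimizer GPUs
cpu_cost = cpu_cores * CPU_PRICE_PER_HOUR * hours  # rollout CPUs
total = gpu_cost + cpu_cost

print(f"total = ${total:,.0f}/day, CPUs are {cpu_cost / total:.0%} of it")
```

With these assumed rates the estimate comes out to roughly $25,800/day, with about 89% of it going to the CPU rollout machines — consistent with the ~$25,000 and ~90% figures above.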
Caveats (or, why you shouldn’t be afraid of AI domination in Dota 2 just yet)
The blog post mentions the following restrictions on the game that OpenAI Five was trained on (numbered labels mine):
(1) Mirror match of Necrophos, Sniper, Viper, Crystal Maiden, and Lich
(2) No warding
(3) No Roshan
(4) No invisibility (consumables and relevant items)
(5) No summons/illusions
(6) No Divine Rapier, Bottle, Quelling Blade, Boots of Travel, Tome of Knowledge, Infused Raindrop
(7) 5 invulnerable couriers, no exploiting them by scouting or tanking
(8) No Scan
In addition, they hard-code item and skill builds, as well as courier management (9).
These restrictions highlight the limitations on what reinforcement learning can do:
- RL is not very good at attributing credit to actions that only pay off on a very long time-scale (2, 3).
- It’s hard to model complex domains and hidden state efficiently (4, 5, 6, 8).
- Getting RL agents to explore massive policy (strategy) spaces is difficult (1, 6, 7, 9).
Note that by having rule (1), OpenAI Five avoids having to select a team of five heroes from a pool of around 100. This is a crucial aspect of matches between human players; selecting the right heroes to play against an opponent can decide the fate of matches before they even begin.
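To get a feel for the size of that drafting space, here's a quick count, assuming a round pool of 100 heroes (the exact number varies by patch):

```python
import math

POOL = 100  # approximate hero pool size; the exact number varies by patch

# Ways to pick your team of 5, then an opposing team of 5 from the rest.
our_picks = math.comb(POOL, 5)
their_picks = math.comb(POOL - 5, 5)

print(f"{our_picks:,} possible teams")
print(f"{our_picks * their_picks:,} possible full drafts")
```

That's over 75 million possible teams before even considering the opponent's picks — a space the mirror-match restriction collapses down to one.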
In a two-sided game like chess, Go, or Dota, the only reward signal that ultimately matters is whether you win or lose. However, the reward signal for a typical Dota game is much sparser than that for chess or Go, because many more moves are made before the end result is seen (~20k for Dota, vs ~100 for chess or Go).
This makes assigning credit to an action insanely challenging in an RL setting. How does the algorithm determine that the gank at 5’ was crucial to winning the match? How does it determine that killing enemy heroes is generally useful? How does it determine that last-hitting creeps is important?
To deal with this, rewards are assigned for things that happen often over the course of a game, such as gaining experience, collecting gold, or staying alive (this link has the full list for OpenAI Five). This method of assigning rewards more closely models how humans playing Dota assign reward. It also allows the rollout workers in this diagram to push meaningful reward signals to the optimizer every 60 seconds, rather than waiting until a match is complete.
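A minimal sketch of the idea — the stat names and weights here are illustrative placeholders, not OpenAI's actual values (those are in the list they published):

```python
# Dense, shaped reward computed from per-tick stat deltas.
# Stat names and weights are illustrative, not OpenAI's actual values.
REWARD_WEIGHTS = {
    "gold": 0.006,      # reward per gold earned
    "xp": 0.002,        # reward per experience point gained
    "last_hits": 0.16,  # reward per creep last-hit
    "deaths": -1.0,     # penalty per death
}

def shaped_reward(prev_stats: dict, curr_stats: dict) -> float:
    """Reward for one time step, from the change in each tracked stat."""
    return sum(
        weight * (curr_stats[stat] - prev_stats[stat])
        for stat, weight in REWARD_WEIGHTS.items()
    )

prev = {"gold": 600, "xp": 1200, "last_hits": 10, "deaths": 0}
curr = {"gold": 700, "xp": 1500, "last_hits": 12, "deaths": 0}
print(shaped_reward(prev, curr))  # a dense signal, no need to wait for win/loss
```

Because every stat changes frequently, the agent gets feedback on a time-scale of seconds instead of a single win/loss bit after ~20k actions.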
In addition, the reward signal is post-processed by subtracting the average reward for the opposing team’s heroes. This has two benefits:
- it prevents “positive sum situations” (for example, a scenario where both teams decide it’s more profitable to stay idle rather than destroy their opponent’s ancient)
- it smooths out credit assignment across the whole team (for example, when an enemy hero is killed, players not directly involved in the kill still receive a small reward).
On the other hand, this kind of reward potentially leaks information about the opposing team’s status (health of units, amount of gold farmed), so it may only make sense for it to be used during self-play.
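The adjustment itself is simple to sketch. Assuming each hero's shaped reward has already been computed, subtracting the opposing team's average makes the combined signal zero-sum across the two teams (my sketch, not OpenAI's code):

```python
def zero_sum_rewards(team_a, team_b):
    """Subtract the opposing team's mean reward from each hero's reward."""
    mean_a = sum(team_a) / len(team_a)
    mean_b = sum(team_b) / len(team_b)
    adj_a = [r - mean_b for r in team_a]
    adj_b = [r - mean_a for r in team_b]
    return adj_a, adj_b

# If both teams idle for equal shaped reward, the adjusted reward is zero:
a, b = zero_sum_rewards([1.0] * 5, [1.0] * 5)
print(a, b)  # no "positive sum" payoff for mutual passivity
```

The adjusted rewards always sum to zero across both teams, so the only way for one team to gain is at the other's expense.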
Encouraging exploration by randomizing game properties
As someone who’s interested in applying RL to game-playing, this idea stood out the most to me as novel (emphasis mine):
In March 2017, our first agent defeated bots but got confused against humans. To force exploration in strategy space, during training (and only during training) we randomized the properties (health, speed, start level, etc.) of the units, and it began beating humans. Later on, when a test player was consistently beating our 1v1 bot, we increased our training randomizations and the test player started to lose. (Our robotics team concurrently applied similar randomization techniques to physical robots to transfer from simulation to the real world.)
The act of randomizing health, start level, and other properties forces an RL agent to consider game states it may never have encountered on its own. You could also imagine that randomizing properties such as unit stats helps an RL agent to learn the effects of a hero’s stats, rather than solely pattern matching on the hero’s identity.
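In RL terms this is a form of domain randomization. A sketch of the training-time mechanism, using a hypothetical environment interface (made-up property names and ranges, not the actual Dota API):

```python
import random

class RandomizedUnitEnv:
    """Wrapper that randomizes unit properties at the start of each
    training episode. Property names and ranges are made-up examples."""

    def __init__(self, base_health=600, base_speed=300, seed=None):
        self.base_health = base_health
        self.base_speed = base_speed
        self.rng = random.Random(seed)

    def reset(self):
        # Perturb each property by up to +/-30% -- only during training;
        # evaluation would use the unperturbed base values.
        scale = lambda base: base * self.rng.uniform(0.7, 1.3)
        return {
            "health": scale(self.base_health),
            "speed": scale(self.base_speed),
            "start_level": self.rng.randint(1, 3),
        }

env = RandomizedUnitEnv(seed=0)
print(env.reset())  # a different perturbed starting state each episode
```

Each reset drops the agent into a slightly different game, so a policy that merely memorizes the standard starting conditions gets punished, while one that reads the actual unit stats generalizes.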
I hadn’t seen this anywhere else, but if there is precedent for doing something like this, let me know!
One of the things that bugs me about projects like this is that they aren’t accessible to people who don’t have lots of resources (read: I can’t go ahead and train my own Dota bot without burning a lot of cash). I hope that there will be more research on sample efficiency in reinforcement learning in the future.
Nevertheless, this is all very exciting! I can’t wait to see what’s next – will OpenAI Five beat a team of pros?