Nash Q-Learning for General-Sum Stochastic Games

Vidal's library

Title:	Nash Q-Learning for General-Sum Stochastic Games
Author:	Junling Hu and Michael P. Wellman
Journal:	Journal of Machine Learning Research
Volume:	4
Pages:	1039--1069
Year:	2003
Abstract:	We extend Q-learning to a noncooperative multiagent context, using the framework of generalsum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.

Cited by 57 - Google Scholar

@Article{hu03a,
  author =	 {Junling Hu and Michael P. Wellman},
  title =	 {Nash Q-Learning for General-Sum Stochastic Games},
  journal =	 {Journal of Machine Learning Research},
  year =	 2003,
  volume =	 4,
  pages =	 {1039--1069},
  abstract =	 {We extend Q-learning to a noncooperative multiagent
                  context, using the framework of generalsum
                  stochastic games. A learning agent maintains
                  Q-functions over joint actions, and performs updates
                  based on assuming Nash equilibrium behavior over the
                  current Q-values. This learning protocol provably
                  converges given certain restrictions on the stage
                  games (defined by Q-values) that arise during
                  learning. Experiments with a pair of two-player grid
                  games suggest that such restrictions on the game
                  structure are not necessarily required. Stage games
                  encountered during learning in both grid
                  environments violate the conditions. However,
                  learning consistently converges in the first grid
                  game, which has a unique equilibrium Q-function, but
                  sometimes fails to converge in the second, which has
                  three different equilibrium Q-functions. In a
                  comparison of offline learning performance in both
                  games, we find agents are more likely to reach a
                  joint optimal path with Nash Q-learning than with a
                  single-agent Q-learning method. When at least one
                  agent adopts Nash Q-learning, the performance of
                  both agents is better than using single-agent
                  Q-learning. We have also implemented an online
                  version of Nash Q-learning that balances exploration
                  with exploitation, yielding improved performance.},
  keywords =     {multiagent reinforcement learning},
  googleid = 	 {GF9Y4c6NGIEJ:scholar.google.com/},
  url = 	 {http://jmvidal.cse.sc.edu/library/hu03a.pdf},
  cluster = 	 {9302340950017203992}
}

Last modified: Wed Mar 9 10:16:02 EST 2011