Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition

Vidal's library

Title:	Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition
Author:	Thomas G. Dietterich
Journal:	Journal of Artificial Intelligence Research
Volume:	13
Pages:	227--303
Year:	2000
Abstract:	This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics---as a subroutine hierarchy---and a declarative semantics---as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.

Cited by 221 - Google Scholar

@Article{dietterich00a,
  author =	 {Thomas G. Dietterich},
  title =	 {Hierarchical Reinforcement Learning with the {MAXQ}
                  Value Function Decomposition},
  journal =	 {Journal of Artificial Intelligence Research},
  year =	 2000,
  volume =	 13,
  pages =	 {227--303},
  abstract =	 {This paper presents a new approach to hierarchical
                  reinforcement learning based on decomposing the
                  target Markov decision process (MDP) into a
                  hierarchy of smaller MDPs and decomposing the value
                  function of the target MDP into an additive
                  combination of the value functions of the smaller
                  MDPs. The decomposition, known as the MAXQ
                  decomposition, has both a procedural semantics---as
                  a subroutine hierarchy---and a declarative
                  semantics---as a representation of the value
                  function of a hierarchical policy. MAXQ unifies and
                  extends previous work on hierarchical reinforcement
                  learning by Singh, Kaelbling, and Dayan and
                  Hinton. It is based on the assumption that the
                  programmer can identify useful subgoals and define
                  subtasks that achieve these subgoals. By defining
                  such subgoals, the programmer constrains the set of
                  policies that need to be considered during
                  reinforcement learning. The MAXQ value function
                  decomposition can represent the value function of
                  any policy that is consistent with the given
                  hierarchy. The decomposition also creates
                  opportunities to exploit state abstractions, so that
                  individual MDPs within the hierarchy can ignore
                  large parts of the state space. This is important
                  for the practical application of the method. This
                  paper defines the MAXQ hierarchy, proves formal
                  results on its representational power, and
                  establishes five conditions for the safe use of
                  state abstractions. The paper presents an online
                  model-free learning algorithm, MAXQ-Q, and proves
                  that it converges with probability 1 to a kind of
                  locally-optimal policy known as a recursively
                  optimal policy, even in the presence of the five
                  kinds of state abstraction. The paper evaluates the
                  MAXQ representation and MAXQ-Q through a series of
                  experiments in three domains and shows
                  experimentally that MAXQ-Q (with state abstractions)
                  converges to a recursively optimal policy much
                  faster than flat Q learning. The fact that MAXQ
                  learns a representation of the value function has an
                  important benefit: it makes it possible to compute
                  and execute an improved, non-hierarchical policy via
                  a procedure similar to the policy improvement step
                  of policy iteration. The paper demonstrates the
                  effectiveness of this non-hierarchical execution
                  experimentally. Finally, the paper concludes with a
                  comparison to related work and a discussion of the
                  design tradeoffs in hierarchical reinforcement
                  learning.},
  keywords =     {reinforcement learning},
  url = 	 {http://jmvidal.cse.sc.edu/library/dietterich00a.pdf},
  googleid = 	 {Sc71WfvQLxYJ:scholar.google.com/},
  cluster = 	 {1598726170704465481}
}

Last modified: Wed Mar 9 10:14:58 EST 2011