On Multiagent Q-Learning in a Semi-competitive Domain

Authors: Tuomas W. Sandholm and Robert H. Crites
{sandholm, crites}@cs.umass.edu
University of Massachusetts at Amherst
Computer Science Department
Amherst, MA 01003

Abstract: Q-learning is a recent reinforcement learning (RL) algorithm that does not need a model of its environment and can be used on-line. Therefore it is well-suited for use in repeated games against an unknown opponent. Most RL research has been confined to single agent settings or to multiagent settings where the agents have totally positively correlated payoffs (team problems) or totally negatively correlated payoffs (zero-sum games). This paper is an empirical study of reinforcement learning in the iterated prisoner's dilemma (IPD), where the agents' payoffs are neither totally positively nor totally negatively correlated. RL is considerably more difficult in such a domain. This paper investigates the ability of a variety of Q-learning agents to play the IPD game against an unknown opponent. In some experiments, the opponent is the fixed strategy Tit-for-Tat, while in others it is another Q-learner. All the Q-learners learned to play optimally against Tit-for-Tat. Playing against another learner was more difficult because the other learner's concurrent adaptation makes the environment nonstationary, in ways that are detailed in the paper. The learners that were studied varied along three dimensions: the length of history they received as context, the type of memory they employed (lookup tables based on restricted history windows or recurrent neural networks (RNNs) that can theoretically store features from arbitrarily deep in the past), and the exploration schedule they followed. Although all the learners faced difficulties when playing against other learners, agents with longer history windows, lookup table memories, and longer exploration schedules fared best in the IPD games.
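To make the experimental setup described in the abstract concrete, the following is a minimal sketch (not the paper's implementation) of a lookup-table Q-learner playing the IPD against Tit-for-Tat, with a history window of recent joint actions as its state and a decaying epsilon-greedy exploration schedule. The payoff values, learning rate, discount factor, and epsilon schedule below are illustrative assumptions, not the values used in the paper.

```python
import random
from collections import defaultdict

# Standard prisoner's dilemma payoffs for the row player (an assumption;
# the paper's exact payoff values may differ). C = cooperate, D = defect.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
ACTIONS = ['C', 'D']


class TabularQAgent:
    """Lookup-table Q-learner whose state is a window of recent joint actions."""

    def __init__(self, history_len=1, alpha=0.2, gamma=0.95,
                 epsilon=1.0, epsilon_decay=0.999, epsilon_min=0.01):
        self.history_len = history_len
        self.alpha, self.gamma = alpha, gamma
        self.epsilon, self.epsilon_decay, self.epsilon_min = epsilon, epsilon_decay, epsilon_min
        self.q = defaultdict(float)   # maps (state, action) -> Q-value
        self.history = tuple()        # sequence of past joint actions

    def state(self):
        # Restricted history window: only the last history_len joint actions.
        return self.history[-self.history_len:]

    def act(self):
        # Epsilon-greedy action selection with a decaying exploration schedule.
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        s = self.state()
        return max(ACTIONS, key=lambda a: self.q[(s, a)])

    def update(self, my_action, opp_action, reward):
        s = self.state()
        self.history = self.history + ((my_action, opp_action),)
        s_next = self.state()
        best_next = max(self.q[(s_next, a)] for a in ACTIONS)
        td_target = reward + self.gamma * best_next
        self.q[(s, my_action)] += self.alpha * (td_target - self.q[(s, my_action)])
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


def tit_for_tat(opponent_last_action):
    """Cooperate on the first move, then mirror the opponent's previous move."""
    return opponent_last_action if opponent_last_action else 'C'


def run(iterations=20000):
    agent = TabularQAgent(history_len=1)
    agent_last = None
    for _ in range(iterations):
        a = agent.act()
        b = tit_for_tat(agent_last)   # the fixed-strategy opponent
        agent.update(a, b, PAYOFF[(a, b)])
        agent_last = a
    return agent


if __name__ == '__main__':
    learned = run()
    print(sorted(learned.q.items(), key=lambda kv: str(kv[0])))
```

The self-play condition studied in the paper would replace the fixed Tit-for-Tat opponent with a second TabularQAgent updated simultaneously, which is what makes the environment nonstationary from each learner's point of view.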