This is a summary of the book “Multi-Agent Reinforcement Learning: Foundations and Modern Approaches” by Stefano Albrecht, Filippos Christianos, and Lukas Schäfer, published by MIT Press in 2024. The book presents a systematic treatment of multi-agent reinforcement learning (MARL) by placing it at the intersection of reinforcement learning, game theory, and modern machine learning. It focuses on how multiple autonomous agents can learn, adapt, and coordinate in shared and potentially non-stationary environments.
A multi-agent system consists of several agents interacting with a common environment while pursuing individual or collective objectives. Each agent is capable of observing its surroundings, selecting actions according to a policy, and updating that policy based on feedback from the environment and the behavior of other agents. Unlike single-agent reinforcement learning, where the environment is typically assumed to be stationary, MARL settings are inherently dynamic: the environment evolves not only due to external factors but also as a direct consequence of other agents learning and changing their policies concurrently.
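The interaction pattern described above can be sketched as a minimal loop. The toy environment, its dynamics, and the random policies below are illustrative assumptions, not taken from the book; the point is the interface: every step takes a joint action and returns per-agent observations and rewards.

```python
import random

class TwoAgentGridEnv:
    """Hypothetical toy environment: two agents on a 5-cell line."""
    def __init__(self):
        self.positions = [0, 4]

    def step(self, joint_action):
        # Each action is -1 (left) or +1 (right); clamp to the grid.
        for i, a in enumerate(joint_action):
            self.positions[i] = max(0, min(4, self.positions[i] + a))
        # Shared reward for meeting on the same cell: a collective goal.
        reward = 1.0 if self.positions[0] == self.positions[1] else 0.0
        observations = list(self.positions)  # here: fully observable
        return observations, [reward, reward]

random.seed(0)
env = TwoAgentGridEnv()
total = [0.0, 0.0]
for _ in range(50):
    joint_action = [random.choice([-1, 1]) for _ in range(2)]  # random policies
    obs, rewards = env.step(joint_action)
    total = [t + r for t, r in zip(total, rewards)]
print(obs, total)
```

With shared rewards the two return totals are identical by construction; with individual or conflicting objectives the reward list would differ per agent, which is where the coordination problems discussed below originate.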
MARL extends reinforcement learning by replacing individual actions with joint actions and individual rewards with reward structures that depend on the combined behavior of multiple agents. Agents learn through repeated interaction over episodes, collecting experience about state transitions, rewards, and the strategies of others. Coordination is a central challenge, particularly in settings where agents have partial observability, conflicting goals, or limited communication. In some cases, agents must learn explicit or implicit communication protocols to align their behavior.
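The shift from individual to joint actions can be made concrete with a stateless learner that values action *combinations* rather than its own action alone. The coordination game, payoffs, and learning rate below are illustrative assumptions; this is a minimal sketch, not the book's algorithm.

```python
import random

ACTIONS = [0, 1]  # each of two agents picks action 0 or 1

# A simple coordination game: both agents are rewarded for matching.
def reward(a1, a2):
    return 1.0 if a1 == a2 else 0.0

# The learner keeps a Q-table over JOINT actions (a1, a2),
# not just its own action.
Q = {(a1, a2): 0.0 for a1 in ACTIONS for a2 in ACTIONS}
alpha = 0.1  # learning rate

random.seed(0)
for _ in range(2000):
    a1 = random.choice(ACTIONS)  # exploratory play by both agents
    a2 = random.choice(ACTIONS)
    r = reward(a1, a2)
    # Stateless update: move the joint-action value toward the observed reward.
    Q[(a1, a2)] += alpha * (r - Q[(a1, a2)])

print(Q)  # matching joint actions valued near 1, mismatches near 0
```

The joint-action table makes explicit that the value of an agent's own action 0 is undefined in isolation: it is worth 1 if the partner matches and 0 otherwise, which is exactly why coordination is the central challenge.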
The theoretical foundations of MARL are closely tied to game theory. Multi-agent environments are commonly modeled as games, ranging from fully observable, deterministic settings to stochastic and partially observable games. In these models, agents assign probabilities to actions, and joint actions induce state transitions and rewards. Depending on the assumptions about observability, dynamics, and information availability, different classes of games—such as stochastic games or partially observable stochastic games—are used to formalize agent interaction.
Within these frameworks, multiple solution concepts may apply. The book discusses equilibrium notions such as minimax equilibrium in zero-sum games, Nash equilibrium in general-sum games, and correlated equilibrium, along with refinements including Pareto optimality, social welfare, fairness, and no-regret criteria. A key distinction from single-agent learning is that multi-agent systems may admit multiple optimal or stable policies, and convergence is often defined in terms of equilibrium behavior rather than a single optimal policy.
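For small matrix games, pure-strategy Nash equilibria can be found by exhaustive best-response checks: a joint action is an equilibrium if neither player gains by deviating unilaterally. The Prisoner's-Dilemma-style payoffs below are a standard illustrative example, not taken from the book.

```python
# payoffs[a1][a2] = (reward to player 1, reward to player 2)
payoffs = [
    [(3, 3), (0, 5)],
    [(5, 0), (1, 1)],
]

def pure_nash(payoffs):
    """Return all joint actions where both players play mutual best responses."""
    n1, n2 = len(payoffs), len(payoffs[0])
    equilibria = []
    for a1 in range(n1):
        for a2 in range(n2):
            u1, u2 = payoffs[a1][a2]
            # a1 must be a best response to a2, and vice versa.
            best1 = all(payoffs[b][a2][0] <= u1 for b in range(n1))
            best2 = all(payoffs[a1][b][1] <= u2 for b in range(n2))
            if best1 and best2:
                equilibria.append((a1, a2))
    return equilibria

print(pure_nash(payoffs))  # the mutual-defection outcome (1, 1)
```

The example also illustrates the refinements mentioned above: the unique Nash equilibrium (1, 1) yields payoff (1, 1), while the non-equilibrium outcome (0, 0) yields (3, 3) and is Pareto-superior — equilibrium behavior and social welfare can pull in different directions.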
Training instability is a defining difficulty in MARL. Because agents learn simultaneously, the learning problem faced by any one agent changes as others update their policies, violating the stationarity assumptions underlying many reinforcement learning algorithms. Credit assignment further complicates learning, as rewards must be attributed appropriately across agents whose actions jointly influence outcomes. Performance is often evaluated by whether agents converge to a stable joint policy or to stable distributions over policies.
The book surveys a range of algorithmic approaches developed to address these challenges. Joint action learning explicitly models the value of joint actions, while agent modeling techniques attempt to predict the behavior of other agents based on observed histories. Policy-based methods optimize parameterized policies directly, and no-regret learning algorithms, such as regret matching, aim to eliminate systematically poor decisions over time. For specific classes of problems, such as zero-sum stochastic games, value iteration methods can be used to compute optimal state values with respect to joint actions.
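Regret matching can be illustrated on rock-paper-scissors in self-play; the game and iteration count below are illustrative assumptions. Each player accumulates, for every action, the regret of not having played it, then plays in proportion to positive regret. In zero-sum games the players' *average* strategies are known to approach a minimax equilibrium — here, uniform play.

```python
import random

N = 3        # rock, paper, scissors
T = 20000    # number of repeated plays

def payoff(a, b):
    """Payoff to the first player: +1 win, 0 tie, -1 loss."""
    if a == b:
        return 0.0
    return 1.0 if (a - b) % 3 == 1 else -1.0

def strategy(regrets):
    """Play in proportion to positive cumulative regret; else uniformly."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / N] * N

random.seed(0)
regrets = [[0.0] * N, [0.0] * N]
strat_sum = [[0.0] * N, [0.0] * N]
for _ in range(T):
    sigma1, sigma2 = strategy(regrets[0]), strategy(regrets[1])
    a = random.choices(range(N), weights=sigma1)[0]
    b = random.choices(range(N), weights=sigma2)[0]
    for alt in range(N):
        # Regret: how much better alt would have done against the
        # opponent's realized action.
        regrets[0][alt] += payoff(alt, b) - payoff(a, b)
        regrets[1][alt] += payoff(alt, a) - payoff(b, a)
    for i in range(N):
        strat_sum[0][i] += sigma1[i]
        strat_sum[1][i] += sigma2[i]

avg = [s / T for s in strat_sum[0]]
print(avg)  # close to the uniform strategy [1/3, 1/3, 1/3]
```

Note that the *current* strategies keep cycling; it is the time average that eliminates systematically poor decisions, which is exactly the no-regret guarantee described above.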
Scalability and partial observability motivate the use of function approximation. Deep learning plays a central role in modern MARL by enabling agents to approximate value functions, policies, and belief states in high-dimensional and continuous environments. Neural network architectures such as multilayer perceptrons, convolutional neural networks, and recurrent neural networks are employed depending on whether the inputs are structured, visual, or sequential. These models are trained via gradient-based optimization to generalize beyond the limited set of states encountered during interaction.
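A value-function approximator of the kind described above can be sketched as a one-hidden-layer MLP mapping a continuous state vector to a scalar value estimate, trained by gradient descent on a squared error. The layer sizes, the target, and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, (8, 4))   # hidden layer: 4-dim state -> 8 units
b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (1, 8))   # output layer: 8 units -> scalar value
b2 = np.zeros(1)

def value(state):
    """Forward pass; also returns the hidden activations for backprop."""
    h = np.tanh(W1 @ state + b1)
    return (W2 @ h + b2)[0], h

state = np.array([0.1, -0.3, 0.7, 0.0])
target = 1.0   # e.g. a bootstrapped return estimate for this state
lr = 0.05

for _ in range(200):
    v, h = value(state)
    err = v - target
    # Backpropagate the squared-error loss 0.5 * err**2 by hand.
    grad_W2 = err * h[None, :]
    grad_b2 = np.array([err])
    grad_h = err * W2[0]
    grad_pre = grad_h * (1 - h ** 2)        # tanh derivative
    grad_W1 = np.outer(grad_pre, state)
    grad_b1 = grad_pre
    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

print(value(state)[0])  # close to the target value
```

The same gradient-based machinery generalizes to the convolutional and recurrent architectures mentioned above when inputs are visual or sequential; the point of approximation is that nearby, unseen states receive similar value estimates.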
The book distinguishes between different training and execution paradigms. Centralized training and execution assumes shared observations and policies but scales poorly and obscures individual responsibility for outcomes. Decentralized training and execution allows agents to learn independently but suffers from non-stationarity and limited coordination. A hybrid approach—centralized training with decentralized execution—seeks to combine the advantages of both by learning joint representations during training while allowing agents to act independently at deployment.
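The centralized-training, decentralized-execution split is as much a structural pattern as an algorithm. The classes below are a purely structural sketch with placeholder logic (all names and shapes are illustrative assumptions): the centralized critic conditions on the joint observation and joint action during training, while each actor maps only its local observation to an action and is all that is needed at deployment.

```python
class Actor:
    """Decentralized policy: local observation -> action."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, local_obs):
        # Placeholder rule; a real actor would be a learned network.
        return 0 if local_obs[0] < 0.5 else 1

class CentralCritic:
    """Centralized value estimate, used only during training."""
    def value(self, joint_obs, joint_action):
        # Placeholder; a real critic conditions on all agents' inputs.
        return 0.0

actors = [Actor(n_actions=2) for _ in range(3)]
critic = CentralCritic()

joint_obs = [[0.1], [0.5], [0.9]]   # one local observation per agent
# Execution needs only each actor's own observation...
joint_action = [actor.act(obs) for actor, obs in zip(actors, joint_obs)]
# ...while the training-time critic scores the joint behavior.
td_value = critic.value(joint_obs, joint_action)
print(joint_action, td_value)
```

Because the critic exists only at training time, it can exploit information that would be unavailable to any single agent at deployment, which is how this paradigm mitigates the non-stationarity faced by fully decentralized learners.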
Overall, the book provides a detailed and technically grounded account of MARL, covering its theoretical foundations, algorithmic methods, and practical challenges, with an emphasis on learning and coordination in complex multi-agent environments.