Multi-Agent Multi-Armed Bandits

An empirical study of the fair MA-MAB algorithms of Hossain et al. (NeurIPS 2021)

Overview


The standard multi-armed bandit (MAB) problem models a single agent choosing actions to maximize its reward over time. However, in many real-world scenarios, actions affect multiple agents—raising questions of fairness in decision making. For example, a city planning a new transportation policy must consider how different options impact all residents.

This project explores algorithms for fairness in the multi-agent MAB setting (MA-MAB), as formalized by Hossain et al. (NeurIPS 2021). We provide an empirical analysis of the proposed algorithms and evaluate their performance across different fairness and utility trade-offs.

What is the MA-MAB Problem?


The multi-agent MAB problem generalizes the standard bandit setting by introducing a population of agents. Each round, a central decision-maker selects one arm, and every agent receives a stochastic reward from that choice. The goal is to learn a policy, i.e. a distribution over the arms, that is fair across agents; fairness is quantified by the Nash Social Welfare (NSW), the product of the agents' expected rewards under the policy.
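
Concretely, if p is a distribution over the K arms and μ is the N×K matrix of mean rewards (one row per agent), the NSW of p is the product over agents of their expected reward under p. A minimal NumPy sketch (variable names are ours, purely illustrative):

```python
import numpy as np

def nash_social_welfare(p, mu):
    """Nash Social Welfare of a policy.

    p  : (K,)   distribution over arms (the policy).
    mu : (N, K) mean reward of each arm for each of the N agents.
    """
    expected_rewards = mu @ p          # (N,) expected reward per agent
    return np.prod(expected_rewards)   # product across agents = NSW

# Example: 2 agents, 3 arms -- a single-agent-friendly arm vs. a fair arm.
mu = np.array([[0.9, 0.1, 0.5],
               [0.1, 0.9, 0.5]])
print(nash_social_welfare(np.array([1.0, 0.0, 0.0]), mu))  # 0.09: favors agent 1 only
print(nash_social_welfare(np.array([0.0, 0.0, 1.0]), mu))  # 0.25: fairer arm, higher NSW
```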

Algorithms Studied


We evaluate several classic MAB algorithms, adapted to this setting by Hossain et al.:

  • Explore Then Commit – allocates an initial exploration phase to estimate each agent's rewards before committing to the policy that looks best under those estimates
  • Epsilon Greedy – mostly exploits the policy that looks best so far while occasionally exploring a uniformly random arm (see the sketch after this list)
  • UCB – selects arms optimistically, using upper confidence bounds on the estimated rewards

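To make the adaptation concrete, here is a minimal sketch of one Epsilon Greedy round in this setting. It assumes that exploitation means sampling an arm from the policy that maximizes NSW under the current mean estimates, solved here with a generic SciPy optimizer; function and variable names are ours, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def nsw_optimal_policy(mu_hat):
    """Policy maximizing the estimated NSW, i.e. argmax_p prod_i (mu_hat[i] @ p).
    Found by minimizing the negative log-NSW over the probability simplex."""
    n_arms = mu_hat.shape[1]
    neg_log_nsw = lambda p: -np.sum(np.log(np.clip(mu_hat @ p, 1e-12, None)))
    res = minimize(neg_log_nsw, np.full(n_arms, 1.0 / n_arms),
                   bounds=[(0.0, 1.0)] * n_arms,
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
    p = np.clip(res.x, 0.0, None)
    return p / p.sum()

def epsilon_greedy_round(mu_hat, epsilon, rng):
    """One round: explore a uniformly random arm w.p. epsilon,
    otherwise sample from the NSW-optimal policy for the current estimates."""
    n_arms = mu_hat.shape[1]
    if rng.random() < epsilon:
        return rng.integers(n_arms)                           # explore
    return rng.choice(n_arms, p=nsw_optimal_policy(mu_hat))   # exploit

# After pulling arm `a` and observing the reward vector `r` (one entry per agent):
#   counts[a] += 1
#   mu_hat[:, a] += (r - mu_hat[:, a]) / counts[a]   # running-mean update
```
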
For each algorithm, we provide a proof sketch of the main regret bound from Hossain et al., validate that bound empirically, and analyze the policy trajectory during rollouts.
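
For the empirical validation, a small helper like the following can track regret over a rollout. This is a sketch under the assumption that regret is measured as the cumulative gap between the optimal NSW and the NSW of the policies actually played; names are illustrative.

```python
import numpy as np

def nsw(p, mu):
    """Nash Social Welfare of policy p under true mean rewards mu (agents x arms)."""
    return np.prod(mu @ p)

def cumulative_nsw_regret(policies, mu, p_star):
    """Cumulative NSW regret of one rollout.

    policies : (T, K) array, the policy played at each round.
    mu       : (N, K) true mean rewards (known to the evaluator, not the learner).
    p_star   : (K,)   NSW-optimal policy for mu.
    Returns a length-T array: cumulative regret after each round.
    """
    gaps = nsw(p_star, mu) - np.array([nsw(p_t, mu) for p_t in policies])
    return np.cumsum(gaps)
```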

Key Results


Our findings show:

  • As expected, the empirical regret of every algorithm stays within the bounds proved by Hossain et al.
  • In practice, UCB performs much better than its regret bound suggests.
  • Epsilon Greedy can reach the optimal policy and then drift away from it again, whereas UCB does not suffer from this behavior.

Figure: Epsilon-Greedy analysis. Left: expected regret vs. horizon. Right: Nash Social Welfare with μ*, the optimal policy p*, and example policy trajectories for T = 10^4.

More Details


The full write-up can be found here and the implementation is available here.