Multi-armed bandit problems


Published 2022-05-07.

In computer science, a multi-armed bandit problem is an optimization problem in which one faces a tradeoff between exploration and exploitation. The name is a bit confusing. It comes from the fact that a slot machine is sometimes referred to as a “one-armed bandit”, because it is operated by a single handle (or “arm”) and it “steals” money from losing players. So a “multi-armed bandit” is a set of slot machines, and a multi-armed bandit problem is one where the goal is to play these machines and maximize income.

Exploration, in this context, refers to playing a machine you know little about in order to learn its win probability. Exploitation refers to playing the slot machine you currently believe has the highest win probability. The question at any given moment is: which machine should I play?
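To make the tradeoff concrete, here is a small sketch of one common strategy, epsilon-greedy: with some small probability you explore a machine at random, and otherwise you exploit the machine with the best observed win rate so far. The win probabilities below are made up purely for illustration.

```python
import random

# Made-up win probabilities for three slot machines (unknown to the gambler).
true_win_probs = [0.2, 0.5, 0.35]

counts = [0] * len(true_win_probs)   # plays per machine
wins = [0] * len(true_win_probs)     # wins per machine
epsilon = 0.1                        # fraction of plays spent exploring

for _ in range(10_000):
    if random.random() < epsilon or sum(counts) == 0:
        # Explore: play a machine at random.
        arm = random.randrange(len(true_win_probs))
    else:
        # Exploit: play the machine with the best observed win rate.
        arm = max(range(len(true_win_probs)),
                  key=lambda i: wins[i] / counts[i] if counts[i] else 0.0)
    counts[arm] += 1
    if random.random() < true_win_probs[arm]:
        wins[arm] += 1

print("Estimated win rates:",
      [round(wins[i] / counts[i], 3) if counts[i] else None
       for i in range(len(true_win_probs))])
```

After enough plays, the estimates converge toward the true probabilities and most plays go to the best machine, while the epsilon fraction of exploratory plays is the price paid for continuing to learn.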

Since multi-armed bandit problems involve statistical inference, utility maximization, and repeated decision-making under uncertainty, they bear some similarities to many problems we commonly face in life. For example, how should we choose the food we eat, the music we listen to, the books we read, the films we watch, the places we live, the people we spend time with, etc.? In each of these cases, we often choose what to do by inferring how satisfying an experience will be based on its similarity to experiences we’ve had previously. Or we deliberately seek new experiences because we know we might discover something even more satisfying than what we’ve had before. Which meal should I order? In many ways, this question is similar to that of a gambler asking themselves which slot machine to play.

However, a true multi-armed bandit problem involves several assumptions that don’t always hold in seemingly analogous real-life problems. In a multi-armed bandit problem:

  • There is one, and only one, “gambler” who can only play one machine at a time.
  • This gambler is not supposed to be a real person; they are essentially an income-maximizing computer.
  • The gambler only cares about one thing: maximizing income.
  • That one thing—income—is unidimensional, absolute, and quantifiable. It is a situation with unnaturally high value clarity.
  • The gambler doesn’t care about playing slot machines per se; they would just as happily accept a check for a large sum of money. In C. Thi Nguyen’s game-playing framework, it is extrinsic achievement play.
  • The slot machines each have different winning probabilities. (If they don’t, then there is no true tradeoff between exploration and exploitation.)
  • The exploration-exploitation tradeoff is fairly strict, in that the only value of trying a new slot machine is that it might turn out to be better than the one the gambler currently believes is best. In other words, the gambler has no inherent interest in novelty or variety among slot machines.

I should keep these assumptions in mind when applying insights from multi-armed bandit problems to real-life decisions. As counterpoints, I should remember that:

  • Human values are complex.
  • Human values change.
  • Human values are hard to commensurate.
  • Novelty is inherently satisfying.
  • Much of my time is spent in striving play.
  • I don’t need to, and in fact I cannot, rely purely on direct experience as a source of information.
