Member lower bounds?

Multi-armed bandit algorithms are becoming more and more important in the field of machine learning (at least to me, since I started a PhD on this topic :D). This funny name derives from the one-armed bandit, a name used for a lever operated slot machine (and apparently also for a Belgian rock album).


Bandit since it steals your money

In the multi-armed bandit setting we have an agent which acts in rounds. Each round the agent selects one of the possible $K$ arms and collects a reward. His objective is to maximize the rewards collected over $n$ rounds. However, to characterize the behaviour of any strategy, we derive its (pseudo)-regret with respect to the an (unknown) optimal strategy which always selects the best arm. More specifically, if we denote by $X_{i,1}, X_{i, 2}, \ldots$ the rewards associated to arms $i=1, \ldots, K$ in $t=1,2,\ldots,n$ and the arm played by our strategy denoted by the index $I_t$, then we can define the pseudo-regret as follows:

\[R_n = \max_{i=1,\ldots,K} \mathbb{E} \Big[ \sum_{t=1}^n X_{i, t} - \sum_{t=1}^n X_{I_t, t} \Big]\]

Note that the we compare the reward our strategy to the one of the best arm in expectation (there may be some realizations where the best arm doesn’t give the highest reward). We usually distinguish between 2 set of bandit scenarios: stochastic and adversarial. In this post I will focus on the former.

Stochastic Bandits

In the stochastic multi-armed bandit problem, each arm is parametrized by an unknown probability distribution $\nu_i$. For each round $t=1,\ldots,n$, the algorithm selects one arm $I_t \in {1, \ldots, K}$ and collects a reward $X_{I_t} \sim \nu_{I_t}$ independent from past rewards. If we denote by $\mu^* $ the mean of the best arm, then the regret can be rewritten as $R_n = n\mu^* - \mathbb{E}[\sum_{t=1}^n X_t]$. We can introduce the gap $\Delta_i = \mu^* - \mu_i $, where $\mu_i$ is the mean of arm $i$, and $T_i(n) = \sum_{t=1}^n \mathbb{1}\{ I_t = i \}$ as the number of times the algorithm selected arm $i$ on the first $n$ rounds. In general $T_i(n)$ is a random quantity since in each round $t$ it depends on $I_t$, which in turn depends on the previous random rewards observed. Then the regret can be rewritten as $R_n = \sum_{i=1}^K \mathbb{E}[T_i(n)] \Delta_i$. To see this:

\[\begin{align*} R_n &= T\mu^* - \mathbb{E}[S_n] \\ &= \sum_{i=1}^K\sum_{t=1}^n \mathbb{E}[(\mu^*-X_t)\mathbb{1}\{I_t = i\}] \\ &= \sum_{i=1}^K\sum_{t=1}^n \mathbb{E}[(\mu^*-X_t)\mathbb{1}\{I_t = i\}|I_t] \\ &= \sum_{i=1}^K\sum_{t=1}^n \mathbb{1}\{I_t = i\}\mathbb{E}[(\mu^*-X_t)|I_t] \\ &= \sum_{i=1}^K\sum_{t=1}^n \mathbb{1}\{I_t = i\}(\mu^*-\mu_{I_t}) \\ &= \sum_{i=1}^K \mathbb{E}[T_i(n)] \Delta_i \tag{1}\label{1} \end{align*}\]

Rewriting the regret in this way suggests that the goal of any algorithm is to quickly learn the arms with large $\Delta_i$ and discard them. Indeed, for arms whose $\Delta_i$ is large, pulling them even a few times causes a high regret.

Now, what is the minimum regret any algorithm can attain? In the case of stochastic bandits, there’s a fundamental lower bound that we are going to describe next, limited to the case where the probability distributions are Bernoulli.

Theorem: Consider a strategy that satisfies $\mathbb{E}[T_i(n)] = o(n^a)$ for any set of Bernoulli reward distributions, any arm $i$ with $\Delta_i > 0$, and any $a > 0$. Then, for any set of Bernoulli reward distributions the following holds:

\[\lim_{n\to +\infty} \inf\frac{R_n}{\ln n}\geq \sum_{i: \Delta_i > 0} \frac{\Delta_i}{\text{kl}(\mu_i, \mu^*)}\]


For simplicity, we only consider 2 arms, i.e. $K=2$.

1. Notation

Wlog, assume $\mu_2 < \mu_1 < 1$. We now define a modified bandit where arm 2 is the unique optimal arm. Let $\varepsilon > 0$. We can find bla bla such that:

\[\text{kl}(\mu_2, \mu_2') \leq (1 + \varepsilon)\text{kl}(\mu_2, \mu_1) \tag{2}\label{2}\]

We want to have a lower bound on the number of times the optimal arm is played..

If we denote by $X_{2, 1}, \ldots, X_{2, n}$ the sequence of random variables obtained when pulling arm 2 $n$ times, then we can define the empirical estimate of the $\text{kl}(\mu_2, \mu_2’)$ at time $n$ as follows:

\[\hat{\text{kl}}_s = \sum_{t=1}^s \ln \frac{\mu_2 X_{2,t} + (1-\mu_2)(1-X_{2,t})}{\mu_2' X_{2,t} + (1-\mu_2')(1-X_{2,t})}\]

Where $s \in {1, \ldots, n}$.

An important property for $ \hat{\text{kl}} $ is the following: for any event $A$ in the $\sigma$-algebra generated by $X_{2,1} , \ldots , X_{2,n}$ the following change-of-measure identity holds:

\[\mathbb{P}'(A) = \mathbb{E}[\mathbb{1}_A \exp(-\hat{\text{kl}}_{T_2(n)})] \tag{3}\label{3}\]

We can indeed write the probability $ \mathbb{P}’(\omega) $ of any individual bit sequence $ \omega $ (remind that the samples of a Bernoulli distribution are either 0 or 1) as $\mathbb{P}(\omega)f(\omega)$, where by definition, $f(\omega) = \exp(-\hat{\text{kl}}_{T_2(n)})$. To see this, notice that:

\[\begin{align} \exp(-\hat{\text{kl}}_m) & = \exp(-\sum_m \ln(\dots)) \\ & = \exp(-\ln(g(X_{2,1}))) \ldots \exp(-\ln(g(X_{2,m})) \\ & = \prod_{t=1}^m \frac{\mu_2' X_{2,t} + (1-\mu_2')(1-X_{2,t})}{\mu_2 X_{2,t} + (1-\mu_2)(1-X_{2,t})} \end{align}\]


\[\begin{align} \mathbb{E}[\mathbb{1}_A \exp(-\hat{\text{kl}}_{T_2(n)})] & = \sum_{\omega \in A} \mathbb{P}(\omega)f(\omega) \\ & = \sum_{\omega \in A} \prod_{X_2 \in \omega} \mu_2' X_{2,t} + (1-\mu_2')(1-X_{2,t}) \\ & = \mathbb{P}'(A) \end{align}\]

We finally define the event:

\[C_n = \Big\{ T_2(n) < \frac{1 - \varepsilon}{\text{kl}(\mu_2, \mu_2')} \ln(n) \quad \text{and} \quad \hat{\text{kl}}_{T_2(n)} \leq \Big(1 - \frac{\varepsilon}{2} \Big) \ln(n) \Big\} \tag{4}\label{4}\]

$ \mathbf{2. \quad \mathbb{P}(C_n) = o(1)} $

Combining the previous $\eqref{3}$ and $\eqref{4}$, we get the following:

\[\begin{align} \mathbb{P}'(C_n) & = \mathbb{E} \Big[\mathbb{1}_{C_n} \exp(-\hat{\text{kl}}_{T_2(n)}) \Big] \\ & \geq \mathbb{E} \Big[e^{-(1-\varepsilon/2) \ln(n)} \mathbb{1}_{C_n} \Big] \\ & = e^{-(1-\varepsilon/2) \ln(n) } \mathbb{E}[\mathbb{1}_{C_n}] \\ & = n^{-(1 - \varepsilon/2)} \mathbb{P}(C_n) \end{align}\]

Where we used the fact that the expected value of the indicator function of an event $A$ is indeed the probability of that event happening. Thus we have obtained that $ \mathbb{P}’(C_n) \geq n^{-(1 - \varepsilon/2)} \mathbb{P}(C_n) $. If we introduce the following shorthand:

\[f_n = \frac{1 - \varepsilon}{\text{kl}(\mu_2, \mu_2')}\]


\[\begin{align} \mathbb{P}(C_n) & \leq n^{(1 - \varepsilon/2)} \mathbb{P}'(C_n) \newline & \leq n^{(1 - \varepsilon/2)} \mathbb{P}'(T_2(n) < f_n) \newline & = n^{(1 - \varepsilon/2)} \mathbb{P}'(n - T_2(n) > n - f_n ) \newline                   & \leq n^{(1 - \varepsilon/2)} \frac{\mathbb{E}'[n - T_2(n)]}{n - f_n} \qquad \text{using Markov inequality} \end{align}\]

Now, because of the assumption at the beginning:

\[\mathbb{P}(C_n) \leq n^{(1 - \varepsilon/2)} \frac{\mathbb{E}'[n - T_2(n)]}{n - f_n} = n^{(1 - \varepsilon/2)} \frac{n - \mathbb{E}'[T_2(n)]}{n - f_n} = o(1)\]

$ \mathbf{3. \quad \mathbb{P}(T_2(n) < f_n) = o(1)} $

\[\begin{align} \mathbb{P}(C_n) & \geq \mathbb{P} \bigg(T_2(n) < f_n \quad \text{and} \quad \max_{s \leq f_n} \hat{\text{kl}} \leq \Big(1 - \frac{\varepsilon}{2} \Big) \ln(n) \bigg) \\ & = \mathbb{P} \bigg(T_2(n) < f_n \quad \text{and} \quad \frac{\text{kl}(\mu_2, \mu_2')}{(1-\varepsilon)\ln(n)} \times \max_{s \leq f_n} \hat{\text{kl}}_s \leq \frac{1 - \varepsilon/2}{1 - \varepsilon} \text{kl}(\mu_2, \mu_2') \bigg) \end{align}\]

We can now use the maximal version of the strong law of large numbers, which says that for any sequence $(X_t)$ of independent real random variables with positive mean $\mu > 0$ we have that:

\[\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n X_t = \mu \quad a.s. \quad \text{implies} \quad \lim_{n \to \infty} \max_{s=1,\ldots,n} \sum_{t=1}^s X_t = \mu \quad a.s.\]

see e.g. here, Lemma 10.5. If we apply this result to our empirical $\hat{\text{kl}}$ divergence, then we get:

\[\lim_{n \to \infty} \max_{s=1,\ldots, f_n} \sum_{t=1}^s \hat{\text{kl}}_t = \text{kl}(\mu_2, \mu_2') \quad a.s.\]

Now, since the kl-divergence is a quantity between 0 and 1 while $ \frac{1-\varepsilon/2}{1-\varepsilon} > 1$, we have that:

\[\lim_{n \to \infty} \mathbb{P} \bigg( \frac{\text{kl}(\mu_2, \mu_2')}{(1-\varepsilon)\ln(n)} \times \max_{s \leq f_n} \hat{\text{kl}}_s \leq \frac{1 - \varepsilon/2}{1 - \varepsilon} \text{kl}(\mu_2, \mu_2') \bigg) = 1\]

since the first term goes to zero asymptotically while the second term is greater than 0.

Thus, by using the result in the second step and this last one we get:

\[\mathbb{P}(T_2(n) < f_n) = \mathbb{P}\big(T_2(n) < \frac{1 - \varepsilon}{\text{kl}(\mu_2, \mu_2')} \ln(n) \big) = o(1)\]

By using $\eqref{2}$, we obtain:

\[\mathbb{E}[T_2(n)] \geq (1 + o(1)) \frac{1 + \varepsilon}{1 - \varepsilon} \frac{\ln(n)}{\text{kl}(\mu_2, \mu_1)}\]

Now, by using the regret decomposition in $\eqref{1}$, the theorem is proved.

Furthermore, by using the fact that $\ln x \leq x - 1$:

\[\text{kl}(p,q) = p \ln\frac{p}{q} + (1-p) \frac{1-p}{1-q} \leq\frac{(p-q)^2}{q(1-q)}\]

In our case $p$ and $q$ are the parameters of the 2 Bernoulli $\mu^* $ and $\mu_i$, thus (by reversing everything) we get:

\[\frac{\Delta_i}{\text{kl}(\mu_i, \mu^* )} \geq \frac{\mu^* (1 - \mu^* )}{2\Delta_i}\]

So, up to a variance-like term this lower bound is $ \sum_{i: \Delta_i > 0} \frac{\ln(n)}{2 \Delta_i}$. This bound holds more generally than just for Bernoulli distributions, as shown for example in Burnetas and Katehakis [1996].

Written on January 6, 2018