Nested Collection Policies in Off-Policy Evaluation

Off-policy evaluation allows you to estimate the reward of a policy using feedback data collected from a different policy. The standard method is inverse propensity scoring (IPS), based on importance-sampling reweighting, i.e., reweighting each logged reward by the ratio between the target policy's and the collection policy's probability of the logged action.
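For concreteness, here is a minimal sketch of that reweighting in code (the function name and array-based interface are my own, not from any particular library):

```python
import numpy as np

def ips_estimate(target_probs, logged_propensities, rewards):
    """Standard IPS estimate of the target policy's average reward.

    target_probs[n]        : probability the target policy assigns to the logged action a_n
    logged_propensities[n] : probability the collection policy assigned to a_n (as logged)
    rewards[n]             : observed reward r_n
    """
    importance_weights = target_probs / logged_propensities
    return np.mean(importance_weights * rewards)
```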

IPS requires the policies to be stochastic, and the collection policy must assign non-zero probability to every action to which the target policy assigns non-zero probability. But what happens when the collection policy has additional noise?

For example, let's say a ghost haunts your production pipeline and at random times during collection it tweaks the collection policy -- assume you have still logged the correct propensity; it is just different from what was intended.

It might seem that you would need to correct for this additional noise in the logged propensity somehow. But in fact, counter-intuitively, the standard IPS estimator works fine in this case too.

Nested Collection Policy

Let's assume the ghost in the machine ensures that the collection policy is actually a mixture of S sub-policies:

$$\pi_c = \sum_{s=1}^S w_s \pi_c^s$$

In words, for each action the bandit first randomly chooses one of the S sub-policies according to the weights \( w \), where \( \sum_s w_s = 1 \) and \( w_s \ge 0 \; \forall s \), then randomly selects an action according to the chosen sub-policy \( \pi_c^s \).
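A minimal sketch of this sampling scheme, assuming discrete actions (names and interface are mine):

```python
import numpy as np

def sample_from_nested_policy(sub_policies, weights, rng):
    """Draw one action from a nested (mixture) collection policy.

    sub_policies : array of shape (S, A); row s is the distribution pi_c^s over A actions.
    weights      : mixture weights w of shape (S,), non-negative and summing to 1.

    Returns (action, propensity, s), where `propensity` is the probability of the
    action under the sub-policy that actually fired -- i.e., what gets logged.
    """
    s = rng.choice(len(weights), p=weights)                   # pick a sub-policy
    a = rng.choice(sub_policies.shape[1], p=sub_policies[s])  # pick an action from it
    return a, sub_policies[s, a], s
```

Note that the logged propensity here is \( \pi_c^s(a) \), not the marginal \( \pi_c(a) = \sum_s w_s \pi_c^s(a) \) -- exactly the situation described above.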

As usual, we are interested in estimating the average reward (r.v. \( R \)) under the target policy \( \pi_t \), i.e. \( \mathbb{E}_{\pi_t}[R] \), using data logged from the collection policy, \( (x_n, a_n, r_n) \sim F_X \pi_c F_R \). How do we show that the standard IPS estimator (which uses whatever propensity was actually logged for each action) equals \( \mathbb{E}_{\pi_t}[R] \) in expectation? That is, how do we show:

$$ \mathbb{E} \left[ \frac{1}{N} \sum_{n=1}^N \frac{\pi_t(a_n)}{\pi_c^{s_n}(a_n)} r_n \right] = \mathbb{E}_{\pi_t}[R],$$ where I have introduced \( s_n \) as the index of the sub-policy used for action \( n \).

The trick here is to separate the data from the different policies into S "sub-experiments", then notice that we have an unbiased estimator within each sub-experiment, and finally use the fact that \( \sum_s w_s = 1 \):

$$ \mathbb{E} \left[ \frac{1}{N} \sum_{n=1}^N \frac{\pi_t(a_n)}{\pi_c^{s_n}(a_n)} r_n \right] \\ = \mathbb{E} \left[\sum_{s=1}^S \frac{1}{N} \sum_{n: s_n = s} \frac{\pi_t(a_n)}{\pi_c^{s}(a_n)} r_n \right] \\ = \mathbb{E} \left[ \sum_{s=1}^S \frac{\sum_{n: s_n = s} 1}{N} \, \mathbb{E}_{\pi_t}[ R ] \right ]\\ = \sum_{s=1}^S w_s \, \mathbb{E}_{\pi_t}[ R ]\\ = \mathbb{E}_{\pi_t}[ R ]. $$

The second equality holds because, conditional on which actions were routed to sub-policy \( s \), each term \( \frac{\pi_t(a_n)}{\pi_c^{s}(a_n)} r_n \) has expectation \( \mathbb{E}_{\pi_t}[R] \) (the usual IPS argument applied to \( \pi_c^s \) alone); the third holds because the expected fraction of actions routed to sub-policy \( s \) is \( w_s \).
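To sanity-check the algebra, here is a small simulation; the setup (context-free actions, Bernoulli rewards, Dirichlet-sampled policies) is my own toy construction, not from the derivation above:

```python
import numpy as np

rng = np.random.default_rng(1)
A, S, N = 4, 3, 200_000                            # actions, sub-policies, logged rounds

pi_t = np.array([0.1, 0.2, 0.3, 0.4])              # target policy
sub_policies = rng.dirichlet(np.ones(A), size=S)   # collection sub-policies pi_c^s
w = rng.dirichlet(np.ones(S))                      # mixture weights
mean_reward = rng.uniform(size=A)                  # E[R | a] (context-free for simplicity)

true_value = pi_t @ mean_reward                    # E_{pi_t}[R]

# Log N rounds from the nested collection policy, recording the realized propensity.
s = rng.choice(S, size=N, p=w)                                  # ghost picks a sub-policy
a = np.array([rng.choice(A, p=sub_policies[si]) for si in s])   # sub-policy picks an action
r = rng.binomial(1, mean_reward[a])                             # Bernoulli rewards
logged_propensity = sub_policies[s, a]                          # pi_c^{s_n}(a_n)

# Standard IPS with whatever propensity was actually logged.
ips = np.mean(pi_t[a] / logged_propensity * r)
print(f"true value: {true_value:.4f}   IPS estimate: {ips:.4f}")
```

With enough logged rounds, the two printed numbers agree closely, as the derivation predicts.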

Needless to say, all of this requires that none of the sub-policies violates the support assumption: each \( \pi_c^s \) must place non-zero mass wherever the target policy places non-zero mass.

Acknowledgements: thanks to Alois Gruson, Christophe Charbuillet, and Damien Tardieu for the helpful discussion.

James McInerney