
Challenge of A/B testing in the presence of network and spillover effects

Gaurav Sood writes:

There is a fun problem that I recently discovered:

Say that you are building a news recommender that decides which news items to list in each person’s news feed. Say that your first version of the news recommender is a rules-based system that takes signals like how many people in your network have seen the news item, how many people in total have read it, the freshness of the news, etc., and sums up the signals with arbitrary weights to rank news items. Your second version uses the same signals but uses a supervised model to learn the optimal weights.
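The two versions can be sketched as follows; the signal names and weight values are illustrative assumptions, not taken from the post:

```python
# Sketch of the two ranking schemes described above. The signal names
# and weights are illustrative assumptions, not from the post.

def score_rules(item):
    # Version 1: hand-picked weights, summed in an essentially arbitrary way.
    return (1.0 * item["network_views"]
            + 0.5 * item["total_views"]
            + 2.0 * item["freshness"])

def score_learned(item, weights):
    # Version 2: the same signals, but weights fit by a supervised model
    # (e.g., trained on historical clicks).
    return sum(weights[k] * item[k] for k in weights)

items = [
    {"network_views": 3, "total_views": 100, "freshness": 0.9},
    {"network_views": 10, "total_views": 40, "freshness": 0.2},
]

ranked_v1 = sorted(items, key=score_rules, reverse=True)
learned = {"network_views": 0.8, "total_views": 0.01, "freshness": 1.5}
ranked_v2 = sorted(items, key=lambda it: score_learned(it, learned), reverse=True)
```

With these made-up weights the two systems rank the same items in different orders, which is the situation the A/B test is meant to adjudicate.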

Say that you find that the recommendations vary a fair bit between the two systems. But which one is better? To suss that out, you conduct an A/B test. But a naive experiment will produce biased estimates of the effect and of the s.e. because:

1. The signals on which your control group’s ranking system is based are influenced by the kinds of news articles that people in the treatment group see. And vice versa.

2. There is an additional source of stochasticity in recommendations that people see: the order in which people arrive matters.

The effect of the first concern is that our estimates are likely attenuated. To resolve the first issue, show people in the control group news articles ranked on either predicted views from historical data or pro-rated views from people assigned to the control group alone. (This adds a bit of noise to the control-group estimates.) And keep a separate table of input data for the treatment group, applying the ML model to the pro-rated data from that table.
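The separate-tables fix can be sketched as follows; the variable names and the assumption of a 50/50 traffic split are mine, not from the post:

```python
from collections import defaultdict

# Split-world bookkeeping: each arm keeps its own table of view counts,
# so treatment exposures never leak into the control signals (and vice
# versa). The 50/50 traffic split is an assumption for illustration.
CONTROL_SHARE = 0.5

views = {"control": defaultdict(int), "treatment": defaultdict(int)}

def record_view(arm, item_id):
    views[arm][item_id] += 1

def prorated_control_views(item_id):
    # The control arm only observes its share of traffic, so scale its
    # counts up to a full-population estimate before ranking on them.
    return views["control"][item_id] / CONTROL_SHARE

record_view("control", "story_1")
record_view("treatment", "story_1")
record_view("treatment", "story_1")
estimate = prorated_control_views("story_1")  # 1 view / 0.5 share = 2.0
```

The pro-rating is what introduces the extra noise mentioned above: the control-only count is an unbiased but noisier stand-in for the full-population count.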

The consequence of the second issue is that our s.e. is very plausibly much larger than what we would get with split-world testing (where each condition gets its own table of counts for views, etc.). The sequence in which people arrive matters because it interacts with the “social influence world.” To resolve the second issue, you need to estimate how the sequence of arrival affects outcomes. But given the number of pathways, the best we can probably do is bound the effect. For instance, we could estimate the effect of ranking the least-downloaded item first as a way to bound the order effects.
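To see why arrival order generates extra variance, here is a toy rich-get-richer simulation (a Pólya-urn-style sketch of my own; none of the parameters come from the post):

```python
import random

# Sketch of why arrival order matters under social influence: each
# arriving user views an item with probability proportional to its
# current view count, so early accidents of arrival snowball.
def simulate(n_users, n_items, rng):
    counts = [0] * n_items
    for _ in range(n_users):
        weights = [c + 1 for c in counts]  # +1 so unseen items can still be picked
        choice = rng.choices(range(n_items), weights=weights)[0]
        counts[choice] += 1
    return counts

rng = random.Random(0)
winners = set()
for _ in range(50):
    counts = simulate(200, 3, rng)
    winners.add(counts.index(max(counts)))
# Re-running the identical system crowns different "most viewed" items,
# purely because of who happened to arrive (and view) first; that is
# the extra stochasticity the s.e. has to account for.
```

Nothing about the items differs across runs; the spread in final counts comes entirely from the arrival sequence.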

The phrase ‘social influence world’ is linked to tomorrow’s post, “The State of the Art.”


  1. Anoneuoid says:

    The information density online has been dropping lower and lower. At this point we need to download like 5 mb of content to read a couple hundred kb of info. I blame AB testing.

    Compare to: same info on each page with progressively more bloat.

  2. In the related (or same, depending on who you ask) field of information retrieval, folks have been using a technique called interleaving to do A/B tests of different ranking functions. The idea is to run the query with two different algorithms and then interleave the results in a fair way and keep track of which ranker’s results are clicked on to measure user preference. I think this technique is worth consideration for this problem, or at least good food for thought.

    Further reading:

    Radlinski, F., & Craswell, N. (2013). Optimized interleaving for online retrieval evaluation. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (pp. 245–254). New York, NY, USA: ACM.

    Chapelle, O., Joachims, T., Radlinski, F., & Yue, Y. (2012). Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 30(1), 1–41.
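For concreteness, here is a sketch of team-draft interleaving, a simpler relative of the optimized scheme in the Radlinski & Craswell paper; the function names are mine, and both rankings are assumed to be permutations of the same set of documents:

```python
import random

# Team-draft interleaving: the two rankers alternately "draft" their top
# remaining result into one shared list (a coin flip decides who drafts
# first each round), and clicks are credited to whichever ranker
# contributed the clicked item.
def team_draft(ranking_a, ranking_b, rng):
    interleaved, team_of = [], {}
    picks = {"A": 0, "B": 0}
    while len(interleaved) < len(ranking_a):
        # The ranker with fewer picks drafts next; a coin flip breaks ties.
        if picks["A"] < picks["B"]:
            team = "A"
        elif picks["B"] < picks["A"]:
            team = "B"
        else:
            team = rng.choice(["A", "B"])
        source = ranking_a if team == "A" else ranking_b
        doc = next(d for d in source if d not in team_of)  # top remaining result
        interleaved.append(doc)
        team_of[doc] = team
        picks[team] += 1
    return interleaved, team_of

def credit(clicked_docs, team_of):
    # A click counts as a vote for the ranker that contributed the item.
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        wins[team_of[doc]] += 1
    return wins

rng = random.Random(42)
mixed, team_of = team_draft(["x", "y", "z"], ["z", "y", "x"], rng)
wins = credit(["z"], team_of)  # suppose the user clicked only on "z"
```

Across many sessions, the ranker with more credited clicks is inferred to be the one users prefer, which sidesteps splitting the population into two worlds at all.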
