Article originally posted here at DoorDash, reposted with permission.
In this post, we introduce a method we call CUPAC (Control Using Predictions As Covariates) that we successfully deployed to reduce extraneous noise in online controlled experiments, thereby accelerating our experimental velocity.
Rapid experimentation is essential to helping DoorDash push key performance metrics forward. Whether improving our store feed ranking, optimizing the efficiency of our logistics system, or evaluating new marketing campaigns, it’s critical for DoorDash to maintain a robust experimentation methodology to ensure new releases improve business metrics in production. Without such rigorous validation, the practice of “test and learn” can quickly devolve into “ship and pray.”
To complicate matters, the metrics we care most about are often very noisy. One such metric is the total time it takes to prepare and deliver an order, which we call ASAP. ASAP is a key metric for us to monitor as delivery speed drives both customer satisfaction and retention. ASAP is very noisy as it varies by merchant type (e.g. quick-service restaurants prepare food quicker than steakhouses), the customer’s distance from the merchant, and current traffic conditions. Such variation lowers the probability of detecting improvements (i.e. the power of the test) driven by new product features and models in an experiment. This makes it difficult for us to conclude whether observed changes in ASAP are real or are merely fluctuations driven by random chance.
To mitigate this issue we developed and deployed CUPAC. CUPAC is inspired by the CUPED methodology pioneered at Microsoft (Deng, Xu, Kohavi, & Walker, 2013), extending it to leverage machine learning predictions built from inputs unaffected by the experimental intervention. This approach has proved powerful in practice, allowing us to shorten our switchback tests by more than 25% while maintaining experimental power.
What is Variance Reduction?
Strategies that attempt to reduce the variance of a target metric are known as variance reduction techniques. Common variance reduction techniques include stratification, post-stratification, and covariate control. CUPAC falls into the covariate control category. To more clearly illustrate what variance reduction seeks to accomplish, consider the distributions for the test and control observations in Figure 1 below:
In this example we see substantial overlap between the treatment and control distributions of ASAP. All else held equal, such overlap will make it difficult to detect whether a product change meaningfully reduces delivery time.
In Figure 2, we can see that if we were able to explain away a portion of the variation in ASAP using factors that have nothing to do with our experimental intervention, this overlap would decrease. Such an improvement in the signal-to-noise ratio makes it much easier to identify the treatment effect:
In the case of ASAP, the features we can use to explain such irrelevant variability include historical Dasher availability, historical merchant food preparation times, and the expected travel time between the merchant and the consumer given typical traffic conditions.
Reducing Variance Using Linear Models
The standard t-test for a difference in population averages can be generalized to a regression setting using an approach known as ANOVA (Analysis of Variance). In the most basic version of this approach, the outcome variable Y is regressed on the treatment indicator T. The estimated coefficient on T, typically denoted β̂, is then compared to its own standard error. We conclude that the treatment effect is statistically significant if this ratio is sufficiently large.
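As a minimal sketch of this setup, the regression below reproduces the two-sample t-test on simulated data (the column names and parameter values are illustrative, not DoorDash internals):

```python
# Regression form of the two-sample t-test on simulated delivery data.
# All column names and parameter values here are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 10_000
travel_time = rng.normal(900, 500, n)      # seconds; unaffected by treatment
treatment = rng.integers(0, 2, n)          # randomized 0/1 assignment
asap = 1500 + travel_time - 15 * treatment + rng.normal(0, 200, n)
df = pd.DataFrame({"asap": asap, "treatment": treatment,
                   "travel_time": travel_time})

# Regress the outcome Y (asap) on the treatment indicator T.
base = smf.ols("asap ~ treatment", data=df).fit()
print(base.params["treatment"])   # beta_hat: estimated treatment effect
print(base.tvalues["treatment"])  # beta_hat divided by its standard error
```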
This regression approach can then be extended to include control variables that help explain the variation in Y not due to the treatment T. The extension is straightforward: we add X, a vector of covariates, to the regression of Y on T. Note that for our measurement of the treatment effect to remain valid under this extension, each of these control variables must be independent of our treatment T.
To make this concrete, let’s again consider measuring changes in ASAP times. A potential control variable for such a test would be the travel time between the restaurant and the consumer (as estimated at the time of order creation). Under the assumption that our treatment is assigned randomly and that it does not meaningfully affect road conditions, this variable should be independent of our treatment T. In addition, it should have significant explanatory power over the ASAP time, Y. As a result, the standard error on β̂, the coefficient on T, will be significantly smaller after the introduction of this covariate, making it easier for our test to achieve statistical significance.
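Continuing the simulated sketch above, adding travel_time as a covariate absorbs variance unrelated to T and visibly shrinks the standard error on the treatment coefficient:

```python
# Continuing the sketch above: control for the pre-treatment travel_time.
adjusted = smf.ols("asap ~ treatment + travel_time", data=df).fit()
print(base.bse["treatment"])      # SE without the covariate
print(adjusted.bse["treatment"])  # noticeably smaller SE with it
```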
Using Predictions as Covariates in Linear Models
CUPED (Controlled-experiment Using Pre-Experiment Data) delivers variance reduction by including control variables defined using pre-experiment data. The key insight behind this approach is that pre-experiment data is collected before the randomly assigned treatment is introduced. Such variables are therefore uncorrelated with the treatment assignment and are permissible to include as controls.
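For reference, the core CUPED adjustment in Deng et al. (2013) subtracts the portion of the outcome explained by a pre-experiment covariate X:

Yᵢ* = Yᵢ − θ(Xᵢ − X̄), with θ = Cov(X, Y) / Var(X),

which reduces the variance to Var(Y*) = (1 − ρ²) · Var(Y), where ρ is the correlation between X and Y.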
To build on this approach, we developed a variance reduction method we call CUPAC that uses the output of a machine learning model as a control variable. CUPAC involves using pre-experiment data to build a model of our outcome variable Y from observation-level features. As long as these features are uncorrelated with the treatment T during the experimental period, the resulting estimator ŷ, being a function of these features, will also be uncorrelated with T. As a result, it is permissible to include this estimator as a covariate.
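A minimal end-to-end sketch of this procedure on simulated data might look as follows (the feature names, model choice, and data-generating process are illustrative assumptions, not our production setup):

```python
# CUPAC sketch: train a model of the outcome on pre-experiment data,
# then use its predictions as a covariate in the experiment analysis.
# Feature names, model choice, and data generation are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
FEATURES = ["dasher_availability", "prep_time", "travel_time"]

def simulate(n, treated):
    df = pd.DataFrame(rng.normal(size=(n, 3)), columns=FEATURES)
    df["treatment"] = rng.integers(0, 2, n) if treated else 0
    # Outcome depends nonlinearly on the features; treatment saves 15s.
    df["asap"] = (2400 + 300 * df["travel_time"]
                  + 120 * df["prep_time"] * (df["dasher_availability"] < 0)
                  - 15 * df["treatment"] + rng.normal(0, 300, n))
    return df

pre_df = simulate(20_000, treated=False)  # pre-experiment period
exp_df = simulate(20_000, treated=True)   # experiment period

# 1. Fit only on pre-experiment data, so the model cannot encode T.
model = GradientBoostingRegressor().fit(pre_df[FEATURES], pre_df["asap"])

# 2. Score experiment deliveries and include the prediction as a covariate.
exp_df["cupac"] = model.predict(exp_df[FEATURES])
fit = smf.ols("asap ~ treatment + cupac", data=exp_df).fit()
print(fit.params["treatment"], fit.bse["treatment"])
```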
The amount of variance reduced by CUPAC scales with the square of its out-of-sample partial correlation with the outcome variable Y, given the other control variables. When improving model performance (hyperparameter tuning, feature engineering, etc.), we therefore recommend maximizing the partial correlation between the prediction covariate (CUPAC) and the target metric.
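Concretely, under the standard result for linear covariate adjustment, the variance of the treatment effect estimate satisfies Var(β̂ with CUPAC) ≈ (1 − ρ²) · Var(β̂ without), where ρ is this out-of-sample partial correlation. A prediction achieving ρ = 0.6, for example, would remove roughly 36% of the variance.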
Using an ML-based covariate represents an improvement over competing control strategies for multiple reasons. First, an ML-based encoding of our outcome variable generalizes CUPED to situations where the relevant pre-experiment data is not clearly defined. For example, in logistics experiments at DoorDash pre-experiment data does not exist at the delivery level in the same way that it does for a market or customer-level test. While there are multiple pre-experiment aggregates that can be used as proxies (e.g. the average value of the outcome Y during the same hour the week before), all such averages can be viewed as simple approximations to an ML-based control model specifically designed to maximize the partial correlation with the outcome variable.
Second, an ML-based encoding of our outcome variable will capture complex relationships between multiple factors that a set of linear covariates will miss. In nature, nonlinearities and complex dependencies are not the exception, but the rule. ML models such as ensembles of gradient-boosted trees are uniquely suited to capture such complex interaction effects.
Finally, an ML-based approach to control can reduce the computational complexity of variance reduction. Prior to implementing CUPAC, we had used multiple categorical covariates in our experiment analysis. Because regression analysis requires categorical variables to be one-hot encoded, excessive cardinality can make the required calculations computationally expensive. By replacing a large number of variables with a single ML-based covariate, we are able to significantly reduce runtime.
Increasing Experimental Power Using CUPAC
In offline simulations, CUPAC consistently delivers power improvements across a variety of effect sizes. Figure 3 below shows how CUPAC reduces the time required to detect a five-second ASAP change with 80% power, relative to a baseline model with no controls. We include results for each of the four random subsets of markets we currently use for switchback testing. On average, CUPAC drives a nearly 40% reduction in required test length versus baseline and a 15-20% improvement compared to alternative control methods. While the magnitude of these effects varies across market groups, in each instance CUPAC proves to be the most powerful control method.
In Figure 4, we build on the above analysis by plotting simulated confidence intervals for an A/A switchback test run on similar DoorDash data. As above, we see that CUPAC leads to lower uncertainty in our estimate of the experimental effect than a model with one-hot encoded regional features. However, the difference between the confidence intervals of the two methods looks smaller than the improvement in Figure 3 might suggest. This is because the amount of data required to measure a given effect varies approximately in proportion to the sample variance, whereas the width of the confidence interval varies with the sample variance's square root. This nonlinear relationship implies that variance reduction can greatly accelerate the rate of testing even when its impact on confidence interval width is modest.
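To see the arithmetic: suppose a control method removes 36% of the variance, so σ² falls to 0.64σ². The required test length falls by 36%, since the required sample size is proportional to σ², but the confidence interval narrows only by a factor of √0.64 = 0.8, i.e. by 20%.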
Conclusion
As DoorDash grows, being able to measure incremental improvements in our logistics system becomes increasingly important. Through variance reduction, CUPAC allows us to detect smaller effect sizes and conclude our experiments more quickly. We currently use CUPAC in all experiments on our dispatch system and are looking to expand it to additional areas of the business.
In our next post, we’ll deep dive into important aspects of building and maintaining such models, such as identifying and handling feature endogeneity and monitoring model deterioration. Stay tuned!
Acknowledgments
We wish to thank Sifeng Lin and Caixia Huang for their significant contributions to this project. We are also grateful to Alok Gupta, Raghav Ramesh, and Ezra Berger for their detailed feedback on drafts of this post.
References
Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM '13).
Tang, Y., & Huang, C. (2019). Cluster Robust Standard Error in Switchback Experiments.