
Experimentation is at the core of how Uber improves the customer experience. Uber applies several experimental methodologies to use cases ranging from testing out a new feature to enhancing our app design.
Uber’s Experimentation Platform (XP) plays an important role in this process, enabling us to launch, debug, measure, and monitor the effects of new ideas, product features, marketing campaigns, promotions, and even machine learning models. The platform supports experiments across our driver, rider, Uber Eats, and Uber Freight apps and is widely used to run A/B/N, causal inference, and multi-armed bandit (MAB)-based continuous experiments.
There are over 1,000 experiments running on our platform at any given time. For example, before Uber launched our new driver app, completely redesigned with our driver-partners in mind, it went through extensive hypothesis testing via a series of experiments conducted with our XP.
At a high level, Uber’s XP allows engineers and data scientists to monitor treatment effects to ensure they do not cause regressions of any key metrics. The platform also lets users configure the universal holdout, used to measure the long-term effects of all experiments for a specific domain.

Below is a chart outlining the types of experimentation methodologies that the Experimentation Platform team uses:

Figure 2. Uber's Experimentation Platform runs both randomized experiments and observational studies.
There are various factors that determine which statistical methodology we should apply to a given use case. Broadly, we use four types of statistical methodologies: fixed horizon A/B/N tests (t-test, chi-squared, and rank-sum tests), sequential probability ratio tests (SPRT), causal inference tests (synthetic control and diff-in-diff tests), and continuous A/B/N tests using bandit algorithms (Thompson sampling, upper confidence bounds, and Bayesian optimization with contextual multi-armed bandit tests, to name a few). We also apply block bootstrap and delta methods to estimate standard errors, as well as regression-based methods for bias correction when calculating the probability of Type I and Type II errors in our statistical analyses.
In this article, we discuss how each of these statistical methods is used by Uber's Experimentation Platform to improve our services.
Classic A/B testing
Randomized A/B or A/B/N tests are considered the gold standard in many quantitative scientific fields for evaluating treatment effects. Uber applies this technique to make objective, data-driven, and scientifically rigorous product and business decisions. In essence, classic A/B testing enables us to randomly split users into control and treatment groups to compare the decision metrics between these groups and determine the experiment’s treatment effects.

Figure 3. Uber's Experimentation Platform team leverages A/B/N testing to run randomized experiments and determine lift.
A common use case for this methodology is feature release experiments. Suppose a product manager wants to evaluate whether a new feature increases user satisfaction with Uber’s platform. The product manager could use our XP to glean the following metrics: the average values of the metric in both the treatment and control groups, the lift (treatment effect), whether the lift is significant, and whether the sample sizes are large enough to achieve high statistical power.

Figure 4. Our XP analytics dashboard makes it easy for data scientists and other users to access and interpret their A/B test results.
Statistics engine
One of our team’s main goals is to deliver one-size-fits-most methodologies of hypothesis testing that can be applied to use cases across the company. To accomplish this, we collaborated with multiple stakeholders to build a statistics engine.
When we analyze a randomized experiment, the first step is to pick a decision metric (e.g., rider gross bookings). This choice relates directly to the hypothesis being tested. Our XP enables experimenters to easily reuse pre-defined metrics and automatically handles data gathering and data validation. Depending on the metrics type, our statistics engine applies different statistical hypothesis testing procedures and generates easy-to-read reports. At Uber, we invest heavily in the research and validation of methodologies and are constantly improving the robustness and effectiveness of our statistics engine.
Figure 5, below, offers a high-level overview of this powerful tool:

Figure 5. Uber's statistics engine for A/B/N experiments, driven by fixed-horizon hypothesis testing methodologies.
Key components and statistical methodologies
After gathering data, our XP’s analytic platform validates the data and detects two major issues that experimenters should watch for, helping them keep a healthy skepticism about their A/B experiments:
- Sample size imbalance, meaning that the sample size ratio in the control and treatment groups is significantly different from what was expected. In these scenarios, experimenters must double check their randomization mechanisms (a detection sketch follows this list).
- Flickers, which refers to users who have switched between control and treatment groups. For example, a rider might purchase a new Android cell phone to replace an old iPhone while the treatment of the experiment was only configured for iOS; the rider would switch from the treatment group to the control group. The existence of such users might contaminate the experiment results, so we exclude these users (flickers) from our analyses.
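To make the first check concrete, here is a minimal sample-ratio-mismatch test on summarized counts, assuming a nominal 50/50 split; the function name, alpha threshold, and counts are illustrative assumptions rather than our production implementation.

```python
# A chi-squared test for sample size imbalance on summarized group counts.
from scipy.stats import chisquare

def check_sample_imbalance(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Flag an experiment whose observed split deviates from the expected one."""
    total = n_control + n_treatment
    expected = [total * (1 - expected_ratio), total * expected_ratio]
    _, p_value = chisquare([n_control, n_treatment], f_exp=expected)
    return p_value < alpha  # True -> likely a randomization bug; investigate

# Example: a nominal 50/50 experiment that drifted.
print(check_sample_imbalance(100_000, 102_500))  # True at alpha=0.001
```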
Most of our use cases are randomized experiments and most of the time summarized data is sufficient for performing fixed horizon A/B tests. At the user level, there are three distinct types of metrics:
- Continuous metrics contain one numeric value column, e.g., gross bookings per user.
- Proportion metrics contain one binary indicator value column, e.g., the proportion of users who complete any trips after sign-up.
- Ratio metrics contain two numeric value columns, the numerator values and the denominator values, e.g., the trip completion ratio, where the numerator values are the number of completed trips and the denominator values are the number of total trip requests.
Three types of data preprocessing are applied to improve the robustness and effectiveness of our A/B analyses:
- Outlier detection removes irregularities in data and improves the robustness of analytic results. We use a clustering-based algorithm to perform outlier detection and removal.
- Variance reduction helps increase the statistical power of hypothesis testing, which is especially helpful when the experiment has a small user base or when we need to end the experiment prematurely without sacrificing scientific rigor. The CUPED method leverages extra pre-experiment information to reduce the variance in decision metrics (see the sketch after this list).
- Pre-experiment bias is a big challenge at Uber because of the diversity of our users. Sometimes, constructing a robust counterfactual via randomization alone just doesn’t cut it. Difference in differences (diff-in-diff) is a well-accepted method in quantitative research, and we use it to correct pre-experiment bias between groups so as to produce reliable treatment effect estimates.
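As a rough illustration of the variance-reduction step, the sketch below applies a CUPED-style adjustment using each user's pre-experiment metric value as the covariate; the simulated data and variable names are assumptions for illustration only.

```python
# CUPED: subtract the part of the metric explained by pre-experiment data.
import numpy as np

def cuped_adjust(post, pre):
    """Variance-reduced metric: Y - theta * (X - mean(X)), theta = cov(X, Y) / var(X)."""
    theta = np.cov(post, pre)[0, 1] / np.var(pre, ddof=1)
    return post - theta * (pre - pre.mean())

rng = np.random.default_rng(0)
pre = rng.gamma(2.0, 10.0, size=10_000)           # pre-experiment bookings per user
post = 0.8 * pre + rng.normal(0, 5, size=10_000)  # correlated in-experiment metric
print(post.var(), cuped_adjust(post, pre).var())  # adjusted variance is much smaller
```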
The p-value calculation is central to our statistics engine. The p-value directly determines whether the XP reports that a result is significant. We compare the p-value to the false positive rate (Type-I error) we desire (0.05) in a common A/B test. Our XP leverages various procedures for p-value calculation, including:
- Welch’s t-test, the default test used for continuous metrics, e.g., completed trips.
- The Mann-Whitney U test, a nonparametric rank sum test used when we detect severe skewness in the data. It requires weaker assumptions than the t-test and performs better with skewed data.
- The Chi-squared test, used for proportion metrics, e.g., rider retention rate.
- The Delta method (Deng et al. 2011) and bootstrap methods, used for standard error estimation whenever suitable, to generate robust results for experiments with ratio metrics or small sample sizes, e.g., the ratio of trips cancelled by riders. (Both Welch’s t-test and the delta method are sketched after this list.)
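To illustrate two of these procedures, the sketch below runs Welch's t-test on a simulated continuous metric and computes a delta-method standard error for a ratio metric; the data and the ratio_se helper are illustrative assumptions, not our statistics engine's code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.gamma(2.0, 10.0, 5_000)    # e.g., completed trips per user
treatment = rng.gamma(2.0, 10.3, 5_000)

# Welch's t-test: does not assume equal variances across groups.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

def ratio_se(num, den):
    """Delta-method standard error for the ratio metric mean(num) / mean(den)."""
    n, mx, my = len(num), num.mean(), den.mean()
    vx, vy = num.var(ddof=1), den.var(ddof=1)
    cxy = np.cov(num, den)[0, 1]
    return np.sqrt((vx / my**2 - 2 * mx * cxy / my**3 + mx**2 * vy / my**4) / n)

requests = rng.poisson(10, 5_000) + 1    # trip requests per user
completed = rng.binomial(requests, 0.9)  # completed trips per user
print(p_value, ratio_se(completed.astype(float), requests.astype(float)))
```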
On top of these calculations, we use multiple comparison correction (the Benjamini-Hochberg procedure) to control the overall false discovery rate (FDR) when there are two or more treatment groups (e.g., in an A/B/C test or an A/B/N test).
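A minimal sketch of this correction, using statsmodels' standard implementation; the p-values below are placeholders, one per treatment-versus-control comparison.

```python
# Benjamini-Hochberg FDR control across multiple treatment comparisons.
from statsmodels.stats.multitest import multipletests

p_values = [0.003, 0.041, 0.120, 0.049]  # one p-value per treatment group
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(reject)      # which comparisons stay significant after correction
print(p_adjusted)  # BH-adjusted p-values
```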
The power calculation provides additional information about the level of confidence users should place in their analysis. An experiment with low power will suffer from high false negative rates (Type-II error) and high FDRs. In the power calculations our XP conducts, a t-test is always assumed. Conversely, the required sample size calculation inverts the power calculation, estimating how many users the experiment needs in order to achieve high power (0.8).
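Both directions of this calculation can be sketched with statsmodels under the same t-test assumption; the effect size and sample size below are placeholders.

```python
# Power for a given sample size, and required sample size for 0.8 power.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
mde = 0.05  # minimum detectable effect as a standardized (Cohen's d) effect size

power = analysis.solve_power(effect_size=mde, nobs1=10_000, alpha=0.05)
n_required = analysis.solve_power(effect_size=mde, power=0.8, alpha=0.05)
print(power, n_required)  # nobs1 and n_required are per-group sample sizes
```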
Metrics management
As the number of metrics used by the XP’s analytics component grows (incorporating 1,000+ metrics), it becomes more and more challenging for users to determine the proper metrics to evaluate the performance of an experiment. To make it easier for new users of our analytics tool to uncover these metrics, we built a recommendation engine that facilitates the discovery of metrics available on our platform.
At Uber, there are two common collaborative filtering methods used for content recommendation: item-based and user-based methods. We primarily use an item-based recommendation engine, since the characteristics of the experimenter do not typically have a strong influence on their project. For instance, if an experimenter switches to the Uber Eats team from the Rider team, it’s not necessary for the algorithm to review that experimenter’s previous, Rider-inspired choices when selecting metrics to evaluate.
Recommendation engine methodology
To determine how correlated two metrics are to each other, we add their popularity and absolute scores, enabling us to better understand their relationship. The two basic approaches to calculating these scores are:
- Popularity score: The more frequently two metrics are selected together across experiments, the higher the score assigned to their relationship. We use the Jaccard Index to help users discover the most relevant metric once they select their initial metric. This score accounts for the experimenters’ metrics selection from past experiments.
- Absolute score: Using our XP, we can generate a pool of user samples from our metrics and calculate the Pearson correlation score of the two metrics. This accounts for serendipitous discovery; namely, the experimenter may not have considered adding a metric to the experiment since it is not directly related, but it might be moving with the user-selected metric.
After calculating these two scores, we combine them with relative weights on each term and recommend the metrics with the highest combined score to the experimenter, based on their first choice of metrics, as in the sketch below.
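A hedged sketch of that combination appears below; the helper names, the equal default weighting, and the inputs are illustrative assumptions rather than our production values.

```python
# Combined metric-to-metric score: weighted popularity + absolute correlation.
import numpy as np

def jaccard(experiments_a, experiments_b):
    """Popularity: how often two metrics were co-selected across past experiments."""
    a, b = set(experiments_a), set(experiments_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_score(experiments_a, experiments_b, samples_a, samples_b, w_pop=0.5):
    """Weighted sum of the popularity score and the absolute (Pearson) score."""
    popularity = jaccard(experiments_a, experiments_b)
    absolute = abs(np.corrcoef(samples_a, samples_b)[0, 1])
    return w_pop * popularity + (1 - w_pop) * absolute
```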
Insights discovery
As Uber continues to scale, it becomes more and more challenging to mine our metrics knowledge base. Our recommendation engine enables both global and local teams to access the information they need quickly and easily, allowing them to improve our services accordingly.
For example, if an experimenter wants to measure the treatment effect on driver-partner supply hours, it may not be obvious to the experimenter to also add the number of trips taken by new riders as a metric, since this experiment focuses on the driver side of the trip equation. However, both metrics are important for this experiment because of the dynamics of our marketplace. Our recommendation engine helps data scientists and other users discover important metrics that may not have been obvious.
Sequential testing
While repeatedly evaluating a traditional A/B test (for example, a t-test) on accumulating subsamples inflates Type-I error, sequential testing offers a way to continuously monitor key business metrics.
One use case where a sequential test comes in handy for our team is identifying outages caused by the experiments running on our platform. We cannot wait until a traditional A/B test collects sufficient sample sizes to determine the cause of an outage; we want to detect, as early as possible during the experimentation period, any experiment that degrades key business metrics. Therefore, we built a monitoring system powered by a sequential testing algorithm that adjusts the confidence intervals accordingly without inflating Type-I error.
Using our XP, we conduct periodic comparisons of these business metrics, such as app crash rates and trip frequency rates, between treatment and control groups for ongoing experiments. Experiments continue if there are no significant degradations; otherwise they trigger an alert or are even paused. The workflow for this monitoring system is shown in Figure 6, below:

Figure 6. The workflow of our XP's outage monitoring system, into which we integrated sequential testing methods.
Methodologies
We leverage two main methodologies to perform sequential testing for metrics monitoring: the mixture sequential probability ratio test (mSPRT) and variance estimation with FDR control.
Mixture Sequential Probability Ratio Test
The most common method we use for monitoring is mSPRT. This test builds on the likelihood ratio test by incorporating an extra specification of a mixing distribution $H$. Suppose we are testing the metric difference with the null hypothesis being $H_0: \theta = \theta_0$; then the test statistic can be written as

$$\Lambda_n^{H,\theta_0} = \int_\Theta \prod_{i=1}^{n} \frac{f_\theta(x_i)}{f_{\theta_0}(x_i)} \, dH(\theta).$$

Since we have large sample sizes and the central limit theorem can be applied in most cases, we use a normal distribution as our mixing distribution, $H = N(0, \tau^2)$. This leads to easy computation and a closed-form expression for $\Lambda_n^{H,\theta_0}$. Another useful property of this method is that, under the null hypothesis, $\Lambda_n^{H,\theta_0}$ is proven to be a martingale: $\mathbb{E}\big[\Lambda_{n+1}^{H,\theta_0} \mid \Lambda_n^{H,\theta_0}\big] = \Lambda_n^{H,\theta_0}$. Following this, we can construct an always-valid $(1-\alpha)$ confidence interval.
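Under this normal approximation the mixture likelihood ratio has a closed form, sketched below; the mixing variance tau is a tuning choice, the formula follows the published always-valid-inference literature, and the function names are illustrative.

```python
# mSPRT with a N(0, tau^2) mixing distribution over the treatment effect.
import numpy as np

def msprt_likelihood_ratio(theta_hat, var_n, tau=1.0, theta_0=0.0):
    """Closed-form mixture likelihood ratio for an estimated difference theta_hat
    whose sampling distribution is approximately N(theta, var_n)."""
    return np.sqrt(var_n / (var_n + tau**2)) * np.exp(
        tau**2 * (theta_hat - theta_0) ** 2 / (2 * var_n * (var_n + tau**2))
    )

def always_valid_p(lr_history):
    """p_n = min(p_{n-1}, 1 / Lambda_n): valid at any data-dependent stopping time."""
    return np.minimum.accumulate(1.0 / np.maximum(np.asarray(lr_history), 1.0))

# Example: a fixed true difference with variance shrinking as samples accumulate.
lrs = [msprt_likelihood_ratio(theta_hat=0.3, var_n=4.0 / n) for n in range(1, 1000)]
print(always_valid_p(lrs)[-1])  # becomes small as evidence accumulates
```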
Variance estimation with FDR control
To apply sequential testing correctly, we need to estimate variance as accurately as possible. Since we monitor the cumulative difference between our control and treatment groups on a daily basis, observations from the same users introduce correlations that violate the assumptions of the mSPRT test. For example, if we are monitoring click-through rates, the metric from one user across multiple days may be correlated. To overcome this, we use delete-a-group jackknife variance estimation and block bootstrap methods to generalize the mSPRT test to correlated data, as sketched below.
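A minimal sketch of the delete-a-group jackknife, assuming user_metrics maps each user to an array of that user's daily observations so that a user's correlated days always stay in the same group; the group count and the statistic are illustrative.

```python
# Delete-a-group jackknife: variance of a statistic under within-user correlation.
import numpy as np

def jackknife_variance(user_metrics, n_groups=20, stat=np.mean, seed=0):
    rng = np.random.default_rng(seed)
    users = rng.permutation(list(user_metrics))
    estimates = []
    for group in np.array_split(users, n_groups):
        dropped = set(group)
        kept = np.concatenate([v for u, v in user_metrics.items() if u not in dropped])
        estimates.append(stat(kept))  # statistic with one bucket of users deleted
    estimates = np.asarray(estimates)
    return (n_groups - 1) / n_groups * np.sum((estimates - estimates.mean()) ** 2)

demo = np.random.default_rng(3)
user_metrics = {f"user{i}": demo.normal(0.1, 1.0, size=7) for i in range(500)}
print(jackknife_variance(user_metrics))
```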
Since our monitoring system is meant to evaluate the overall health of an ongoing experiment, we monitor many business metrics at the same time, potentially leading to false alarms. In theory, either the Bonferroni or the BH correction could be applied in this scenario. However, since the potential loss from missing business degradations can be substantial, we apply the BH correction here and also tune parameters (MDE, power, tolerance for practical significance, etc.) for metrics with varying levels of importance and sensitivity.
Use cases
Suppose we want to monitor a key business metric for a specific experiment, as depicted in Figure 7, below:
Figure 7. The sequential test methodology indicates a significant difference between our treatment and control groups, as identified in Plot B. In contrast, no significant difference is identified in Plot A.
The red lines in Plots A and B signify the observed cumulative relative difference between our treatment and control groups. The red band is the $(1-\alpha)$ confidence interval for this cumulative relative difference.
As time passes, we accumulate more samples and the confidence interval narrows. In Plot B, the confidence interval consistently deviates from zero starting on a given date, in this example, November 21. With an extra threshold (in other words, a tolerance for our monitoring system) imposed for practical significance, the metric degradation is detected to be both statistically and practically significant after a certain date. In contrast, Plot A’s confidence interval shrinks but always includes 0. Thus, we did not detect any regressions for the crash rate monitored in Plot A.
Continuous experiments

To accelerate innovation and learning, the data science team at Uber is always looking to optimize driver, rider, eater, restaurant, and delivery-partner experiences through continuous experiments. Our team has implemented bandit and optimization-focused reinforcement learning methods to learn iteratively and rapidly from the continuous evaluation of related metric performance.
Recently, we completed an experiment that used bandit techniques for content optimization; compared to classic hypothesis testing methods, the approach delivered greater customer engagement. Figure 9, below, outlines Uber’s various continuous experiment use cases, including content optimization, hyper-parameter tuning, spend optimization, and automated feature rollouts:

Figure 9. Uber's XP leverages continuous experiments for a variety of use cases, including hyper-parameter tuning and automated feature rollouts.
In Case Study 1, we outline how bandits have helped optimize email campaigns and enhance rider engagement at Uber. Here, the Uber Eats Customer Relationship Management (CRM) team in Europe, the Middle East, and Africa (EMEA) launched an email campaign to encourage order momentum early in the customer life cycle. The experimenters planned to run a campaign with ten different email subject lines and identify the best subject line in terms of open rate and the number of opened emails. Figure 10, below, details this case study:
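To make the mechanics concrete, here is a minimal Thompson sampling sketch for this use case, assuming a Beta-Bernoulli model of open rate; the ten arms match the ten subject lines, while the simulated open rates and send count are illustrative.

```python
# Thompson sampling over ten email subject lines with a Beta-Bernoulli model.
import numpy as np

rng = np.random.default_rng(2)
n_arms = 10
alpha = np.ones(n_arms)                  # Beta posterior: 1 + opens per arm
beta = np.ones(n_arms)                   # Beta posterior: 1 + non-opens per arm
true_open_rates = rng.uniform(0.05, 0.25, n_arms)  # unknown in a real campaign

for _ in range(50_000):                  # one iteration per email sent
    samples = rng.beta(alpha, beta)      # sample a plausible open rate per arm
    arm = int(np.argmax(samples))        # send the subject line that wins the draw
    opened = rng.random() < true_open_rates[arm]
    alpha[arm] += opened
    beta[arm] += 1 - opened

print(int(np.argmax(true_open_rates)), int(np.argmax(alpha / (alpha + beta))))
```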
A second example of how we leverage continuous experiments is parameter tuning. Unlike the first case, the second case study uses a more advanced bandit algorithm, the contextual multi-armed bandit technique, which combines statistical experiments and machine learning modeling. We use contextual MAB to choose the best parameters in a machine learning model.
As depicted in Figure 11, below, the Uber Eats Data Science team leveraged MAB testing to create a linear programming model, called the multiple-objective optimization (MOO), that ranks restaurants on the main feed of the Uber Eats app:
The algorithm behind MOO incorporates several metrics, such as session conversion rate, gross booking fee, and user retention rate. However, the mathematical solution contains a set of parameters that we need to supply to the algorithm.
These experiments contain many candidate parameters for use with our ranking algorithms. The ranking results depend on the hyper-parameters we choose for the MOO model. Therefore, to improve the performance of the MOO model, we aim to find the best hyper-parameters via a multi-armed bandit algorithm. The traditional A/B test framework is too time-intensive to handle each test, so we decided to utilize the MAB method for these experiments; MAB provides a framework to quickly tune these parameters.
We chose the contextual MAB and Bayesian optimization methods to find the maximizers of a black-box optimization problem. Figure 12, below, outlines the setup of this experiment:

Figure 12. Our XP leverages contextual MABs for hyper-parameter tuning.
As shown above, contextual Bayesian optimization works well with both personalized information and exploration-exploitation trade-offs.
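As a rough sketch of this setup, the loop below runs Bayesian optimization over a single MOO weight with a Gaussian process surrogate and an upper-confidence-bound acquisition rule; the objective function, parameter range, and kernel are stand-ins for the real ranking metrics.

```python
# Bayesian optimization of one hyper-parameter with a GP surrogate and UCB.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(w):  # placeholder for an observed metric, e.g., session conversion
    return -(w - 0.6) ** 2 + 0.01 * np.random.randn()

X = np.array([[0.1], [0.5], [0.9]])          # initial hyper-parameter trials
y = np.array([objective(x[0]) for x in X])
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-4, normalize_y=True)

for _ in range(20):                          # fit surrogate, then pick next trial
    gp.fit(X, y)
    grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
    mu, sigma = gp.predict(grid, return_std=True)
    next_w = grid[np.argmax(mu + 2.0 * sigma)]   # UCB: explore where uncertain
    X = np.vstack([X, [next_w]])
    y = np.append(y, objective(next_w[0]))

print(X[np.argmax(y)])                       # best weight found so far
```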
Moving Forward
As a result of its scale and global impact, Uber’s problem space poses unique challenges. As our methodologies evolve, we aspire to build an ever more intelligent experimentation platform. In the future, this platform will provide insights gleaned not only from current experiments, but also previous ones, and, over time, proactively predict metrics.
Uber’s Experimentation Platform team is hiring. If you are passionate about experimentation and machine learning, please apply for this role.

Anirban Deb
Anirban Deb is the former tech lead of the Experimentation, Segmentation, Personalization and Mobile App Development Platform data science teams at Uber and currently heads the Uber Freight data science organization.

Suman Bhattacharya
Suman Bhattacharya is a senior data scientist on Uber’s Experimentation Platform team.

Jeremy Gu
Jeremy Gu is a data scientist on Uber’s Experimentation Platform team.

Tianxia Zhou
Tianxia Zhou is a data scientist on Uber's Experimentation Platform team.

Eva Feng
Eva Feng is a data scientist on Uber's Experimentation Platform team.

Mandie Liu
Mandie Liu is a data scientist on Uber’s Experimentation Platform team.
Posted by Anirban Deb, Suman Bhattacharya, Jeremy Gu, Tianxia Zhou, Eva Feng, Mandie Liu