Causal Inference Paper Reading (1): A Comparison of Approaches to Advertising Measurement (Facebook's paper)
Paper Information
Gordon, B. R., Zettelmeyer, F., Bhargava, N., & Chapsky, D. (2017). A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook. SSRN. https://doi.org/10.2139/ssrn.3033144
- Link: https://www.kellogg.northwestern.edu/faculty/gordon_b/files/fb_comparison.pdf
- Slides: https://www.ftc.gov/system/files/documents/public_events/945353/zettelmeyer_fb_fcc_11-3-2016_fz_slides_0.pdf
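Summary
When I read this paper I had not yet read the Imbens & Rubin causal inference textbook, yet I could still get through it smoothly, which shows how well the authors write and explain.
Some takeaways:
- Section 4, Observational Approaches, gives precise definitions, pros, and cons for many classic observational methods and works well as a reference manual
- Some tricks
- User-activity variables are all converted into deciles. The paper does not explain why; my guess is that activity metrics tend to follow long-tailed distributions, and leaving them untransformed can cause problems when estimating the propensity score, much like the feature preprocessing needed before fitting a logistic regression in machine learning
- Some variables, such as age, are one-hot encoded
- LASSO is used to drop some variables (the approach in the Imbens & Rubin textbook)
- Samples with propensity score < 0.05 or > 0.95 are dropped (the approach in the Imbens & Rubin textbook)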
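To make the tricks above concrete, here is a minimal standard-library Python sketch of three of them: decile binning, one-hot encoding, and propensity-score trimming. The helper names `to_deciles`, `one_hot`, and `trim` are my own and hypothetical; the paper does not publish its pipeline, so this only illustrates the idea (LASSO selection is left out).

```python
from bisect import bisect_right
from statistics import quantiles


def to_deciles(values):
    """Map each value to a decile index 1..10 using the sample's own cut points."""
    cuts = quantiles(values, n=10)  # nine cut points between the ten deciles
    return [bisect_right(cuts, v) + 1 for v in values]


def one_hot(labels):
    """One-hot encode categorical labels; columns follow sorted label order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


def trim(rows, scores, lo=0.05, hi=0.95):
    """Keep only the rows whose estimated propensity score lies in [lo, hi]."""
    return [row for row, score in zip(rows, scores) if lo <= score <= hi]


# Hypothetical long-tailed activity metric: one extreme user among 100.
activity = list(range(1, 100)) + [10**6]
deciles = to_deciles(activity)
```

The extreme user simply lands in decile 10 instead of dominating the scale of the feature, which is the reason I guessed for the transformation.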
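Some reflections:
- Causal inference is not an all-purpose fortune-telling machine; its ability to infer causality, or rather "experimental results", is quite limited
- In causal inference, I feel that finding covariates that satisfy unconfoundedness is harder than building sophisticated inference algorithms
Excerpts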
Summary of Contributions
- Shed light on whether - as is thought in the industry - observational methods using good individual-level data are “good enough” for ad measurement, or whether even good data prove inadequate to yield reliable estimates of advertising effects. Our results support the latter.
- Generally, the observational methods overestimate ad effectiveness relative to the RCT, although in some cases, they significantly underestimate effectiveness.
- These biases persist even after conditioning on a rich set of observables and using a variety of flexible estimation methods.
- Characterize the nature of the unobservable needed to use observational methods successfully to estimate ad effectiveness.
- Obtaining such data is likely not trivial.
- The third contribution of our paper is to the literature on observational versus experimental approaches to causal measurement.
- We analyzed whether the improvements in observational methods for causal inference are sufficient for replicating experimentally generated results in a large industry where such methods are commonly used. We found they do not—at least not with the data at our disposal.
Experiment Setup
- User groups
- Three user groups: control-unexposed, test-unexposed, and test-exposed
- Control group: the second-place ad will be shown
- Determinants of Advertising Exposure
- Summary: compliance is perfect for users in the control group; there is selection bias between exposed and unexposed test-group users
- User-induced endogeneity: In our context, activity bias arises because a user must visit Facebook during the campaign to be exposed.
- Targeting-induced endogeneity
- Bias: Assessing ad effectiveness by comparing exposed versus unexposed consumers will, therefore, overstate the effectiveness of advertising because exposed users were specifically chosen based on their higher conversion rates.
- Don’t worry: Note that the implementation of this system at Facebook does not invalidate experimentation, because the upweighting or downweighting of bids is applied equally to users in the test and control group.
- Competition-induced endogeneity
- Addressing the selection bias
- In the RCT, we address potential selection bias by leveraging the random-assignment mechanism and information on whether a user receives treatment.
- For the observational models, we discard the randomized control group and address the selection bias by relying solely on the treatment status and observables in the test group.
Analysis of the RCT
- Definitions
- $Z_i \in \{0,1\}$: control/test
- $W_i(Z_i) \in \{0,1\}$: unexposed/exposed
- $Y_i^{obs}=Y_i(Z_i, W_i^{obs})$: observed outcome
- Assumptions (ITT requires 1+2; ATT requires 1+2+3)
- Stable Unit Treatment Value Assumption (SUTVA): A user can receive only one version of the treatment, and a user’s treatment assignment does not interfere with another user’s outcomes.
- Assignment to treatment is random, i.e., the distribution of $Z_i$ is independent of all potential outcomes $Y_i(Z_i,W_i(Z_i))$ and both potential treatments $W_i(Z_i)$.
- Assignment affects a user’s outcome only through receipt of the treatment.
- Remarks:
- ITT: intent-to-treat
- ATT: average treatment effect on the treated
- Causal Effects in the RCT
- ITT on the outcome Y: $ITT_Y=E[Y(1, W(1)) - Y(0, W(0))]$
- ATT: $ATT = E[Y(1, W(1)) - Y(0, W(0)) | W(1)=1]$
- If the sample contains no “always-takers” and no “defiers,” which is true in our experimental design with one-sided non-compliance, the LATE is equal to the ATT.
- LATE: local average treatment effect
- $\tau = ATT = ITT_{Y, co} = ITT_Y / \pi_{co}$, where $\pi_{co}=E[W(1)]$ is the share of compliers (e.g., exposed test group users)
- Lift
- Def: incremental conversion rate among treated users, expressed relative to the conversion rate they would have had without treatment
- Equation: $\tau_{\ell}=\frac{\tau}{E[Y^{obs}|Z=1, W^{obs}=1] - \tau}$
- Pros: Reporting the lift facilitates comparison of advertising effects across studies because it normalizes the results according to the treated group’s baseline conversion rate, which can vary significantly with study characteristics.
- Cons: One downside of using lift is that differences between methods can seem large when the treated group’s baseline conversion rate is small.
- Other choices: ROI, for example
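As a worked example of the RCT formulas in this section, the following minimal sketch computes $ITT_Y$, $\pi_{co}$, the ATT, and the lift from aggregate counts. The function name `rct_estimates` and all the counts are hypothetical numbers I made up for illustration; they are not from the paper.

```python
def rct_estimates(n_test, n_control, conv_test, conv_control, n_exposed, conv_exposed):
    """ITT, ATT, and lift under one-sided noncompliance, from aggregate counts."""
    itt = conv_test / n_test - conv_control / n_control  # ITT_Y
    pi_co = n_exposed / n_test                 # share of exposed test users, E[W(1)]
    att = itt / pi_co                          # tau = ITT_Y / pi_co
    exposed_rate = conv_exposed / n_exposed    # E[Y_obs | Z=1, W_obs=1]
    lift = att / (exposed_rate - att)          # tau_l
    return itt, att, lift


# Hypothetical study: 100,000 users per arm, 40% of the test group exposed.
itt, att, lift = rct_estimates(
    n_test=100_000, n_control=100_000,
    conv_test=1_200, conv_control=1_000,
    n_exposed=40_000, conv_exposed=800,
)
```

With these made-up counts the ITT is 0.2 percentage points, the ATT is 0.5 percentage points, and the lift is about 33%: exposed users convert at 2%, of which 1.5% is the implied baseline.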
Observational Approaches
- Definitions
- $Y_i^{obs}=Y_i(W_i^{obs})$: observed outcome
- Using observational method $m$:
- ATT obtained: $\tau^m = E[Y(1)-Y(0)|W=1]$
- Lift: $\tau_{\ell}^m = \frac{\tau^m}{E[Y^{obs}|W=1] - \tau^m}$
- Assumptions
- SUTVA
- Unconfoundedness: the most controversial assumption and is untestable without an experiment
- Overlap: A positive probability of receiving treatment for all values of the observables
- Remarks
- Strong ignorability := unconfoundedness + overlap
- Propensity score: $e(x) = \Pr(W_i=1|X_i=x)$
- Under strong ignorability, treatment assignment and the potential outcomes are independent, conditional on the propensity score
- Observational methods (refer to the paper for equations)
- Exact matching (EM)
- Propensity score matching (PSM)
- Stratification (STRAT)
- Regression adjustment (RA)
- Inverse-probability-weighted regression adjustment (IPWRA)
- Stratification and Regression (STRATREG)
- Discussion
- Challenges: A critique of these observational methods is that sophisticated ad-targeting systems aim for ad exposure that is deterministic and based on a machine-learning algorithm.
- In the limit, such ad-targeting systems would completely eliminate any random variation in exposure, in which case, the observational methods we have discussed in section 4.2 would fail.
- As of now, ad-targeting systems have not eliminated all exogenous reasons a given person would be exposed to an ad campaign whereas a probabilistically equivalent person would not. As a result, the observational methods we have discussed in section 4.2 need not fail.
- However, as ad-targeting systems become more sophisticated, such failure is increasingly likely.
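As one concrete instance of the observational methods listed in this section, here is a minimal sketch of an ATT estimate by stratifying on the propensity score. It assumes the propensity scores are already estimated and uses equal-width strata on [0, 1]; both the function name `strat_att` and the equal-width choice are my own simplifications, and the paper's STRAT estimator may choose strata differently (see the paper for its equations).

```python
def strat_att(outcomes, treated, scores, n_strata=5):
    """ATT by stratification on the propensity score (equal-width strata).

    Within each stratum, take the difference in mean outcomes between exposed
    and unexposed users, then average the differences weighted by the number
    of exposed users. Strata missing either group are skipped.
    """
    strata = {}
    for y, w, e in zip(outcomes, treated, scores):
        k = min(int(e * n_strata), n_strata - 1)
        # index 0 collects unexposed outcomes, index 1 exposed outcomes
        strata.setdefault(k, ([], []))[w].append(y)
    num = den = 0.0
    for unexposed, exposed in strata.values():
        if unexposed and exposed:
            diff = sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)
            num += len(exposed) * diff
            den += len(exposed)
    if den == 0:
        raise ValueError("no stratum contains both exposed and unexposed users")
    return num / den


# Hypothetical toy data: high-propensity users convert more even when unexposed.
outcomes = [1, 0, 0, 1, 1, 1, 0, 1, 0]
treated = [1, 0, 0, 1, 1, 1, 1, 0, 0]
scores = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
att_strat = strat_att(outcomes, treated, scores)
naive = 4 / 5 - 1 / 4  # raw exposed mean minus raw unexposed mean
```

In this toy data the naive exposed-versus-unexposed comparison is larger than the stratified estimate, the same direction of bias the paper attributes to selection into exposure.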
Data
- Fifteen advertising studies
- Variables
- FB Variables: age, gender, how long users have been on Facebook, num of friends, phone OS, etc.
- Census Variables: estimated zip code -> 40 variables drawn from the most recent Census and American Community Survey (ACS)
- User-Activity Variables: transformed into deciles
- Match Score: Including this variable, and functions of it, in estimating our propensity score allows us to condition on a summary statistic for data beyond which we had direct access and to move beyond concerns that a more flexible propensity-score model might change the results
Identification and Estimation
- Identification
- Estimation and Inference (4 steps)
- Variable selection: LASSO, retain a subset of variables for prediction
- Analysis using the Bootstrap: sample users with replacement, repeat 2000 times, and for each bootstrap sample:
- estimate the RCT ATT $\tau$ and lift $\tau_{\ell}$
- discard the control group
- trimming: keep observations where propensity score is in [0.05, 0.95]
- use observational model m and the trimmed data to estimate ATT $\tau^m$ and lift $\tau_{\ell}^m$
- Inference: Calculate standard errors and confidence intervals using the bootstrap samples of $(\tau, \tau^m, \tau_{\ell}, \tau_{\ell}^m)$
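The bootstrap loop in the estimation steps above can be sketched as follows. This is a simplified illustration, not the authors' code: `bootstrap_se` and `rct_att` are hypothetical names, users are represented as `(z, w, y)` tuples, and only the RCT ATT is estimated. In the paper's procedure, each replicate would also discard the control group, trim on the propensity score, and run observational method $m$; that logic would live inside the `estimator` callable.

```python
import random
from statistics import mean, stdev


def rct_att(sample):
    """RCT ATT = ITT_Y / pi_co for users given as (z, w, y) tuples."""
    test = [u for u in sample if u[0] == 1]
    control = [u for u in sample if u[0] == 0]
    itt = mean(y for _, _, y in test) - mean(y for _, _, y in control)
    pi_co = mean(w for _, w, _ in test)  # share of exposed test users
    return itt / pi_co


def bootstrap_se(users, estimator, n_boot=2000, seed=0):
    """Bootstrap standard error and 95% percentile interval of estimator(users)."""
    rng = random.Random(seed)
    n = len(users)
    draws = []
    for _ in range(n_boot):
        # resample users with replacement, keeping the sample size fixed
        sample = [users[rng.randrange(n)] for _ in range(n)]
        draws.append(estimator(sample))
    draws.sort()
    lo = draws[int(0.025 * (n_boot - 1))]
    hi = draws[int(0.975 * (n_boot - 1))]
    return stdev(draws), (lo, hi)


# Hypothetical data: 200 test users (100 exposed) and 200 control users.
users = ([(1, 1, 1)] * 30 + [(1, 1, 0)] * 70 + [(1, 0, 1)] * 10 + [(1, 0, 0)] * 90
         + [(0, 0, 1)] * 20 + [(0, 0, 0)] * 180)
se, (lo, hi) = bootstrap_se(users, rct_att, n_boot=200)
```

Resampling whole users, rather than outcomes alone, keeps each user's assignment, exposure, and outcome together, so every replicate is a plausible re-draw of the study.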
- Assessing Balance
- Density of estimated propensity scores (before/after, trimmed)
- Standardized differences in means (should not exceed 0.25)
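For the balance check, a minimal sketch of the standardized difference in means for a single covariate is shown below. I am assuming the Imbens & Rubin style definition (difference in means divided by the square root of the average of the two group variances); the function name `standardized_difference` is hypothetical, and the paper should be consulted for the exact formula it uses.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(x_treated, x_control):
    """Difference in covariate means scaled by the pooled standard deviation."""
    pooled_sd = sqrt((variance(x_treated) + variance(x_control)) / 2)
    return (mean(x_treated) - mean(x_control)) / pooled_sd


# Hypothetical covariate values for exposed and unexposed users.
d = standardized_difference([1, 2, 3], [2, 3, 4])
balanced = abs(d) <= 0.25  # rule of thumb from the notes above
```

Because the difference is scaled by the standard deviation rather than the standard error, the statistic does not shrink mechanically as the sample grows, which makes a fixed threshold such as 0.25 meaningful across studies of different sizes.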
Results
Summary of All 15 Studies
- The observational methods we study mostly overestimate the RCT lift, although in some cases, they can significantly underestimate RCT lift.
- Generally, we find that more information helps, but adding census data and activity variables helps less than the Facebook match variable.
- We do not find that one method consistently dominates: in some cases, a given approach performs better than another for one study but not the other.
Assessing the Role of Unobservables in Reducing Bias
- Questions: “If we could obtain new observables, how much better would they need to be to eliminate the bias between the observational and RCT estimates?”
- Results: Our results show that for some studies, observational methods would require additional covariates that exceed considerably our combined observables’ explanatory power. This suggests that eliminating bias from observational methods would be hard, even for industry insiders with access to additional data.