因果推断学习笔记（一）：经典方法尝试之Propensity Score Matching (Lalonde's Dataset)

谨以此文纪念一个8点下班买了一杯肥宅快乐奶茶学习直到半夜的美好夜晚。—— 2018/11/15 2:30am超困的碎碎念

入门因果推断是希望推测Causal Effect。仔细来说，假设已知道 counfounding variables，希望推测 treatment 对 outcome 的影响。

这个文档的目的是收集和尝试各种经典、基础的因果推断方式，感受一下不同方法的差异。数据集使用了Lalonde，感觉各种入门的tutorial都在用这个数据集。

持续更新中，加油！

以下内容由Rmarkdown转来。Git: yishilin14/causal_playground

2018/11/15：新增所有MatchIt带的方法 (CEM除外，Mac的XQuartz没装好)
2018/12/16: 重新整理了所有方法的代码，末尾增加了一块汇总表格。

博客的md用render("./causality1_playaround-with-the_lalonde_dataset.Rmd", md_document(variant = "markdown_github"))生成~

Load the dataset

首先读一下lalonde数据集。这份数据背后其实是有对应的随机实验的，ground truth ATT为1800美金左右。不过这里这一份数据是由随机实验的实验组，加上了其它方式得到的对照组，所以可以用来尝试用各种方式估算ATT并且和ground truth比较。

data(lalonde, package="MatchIt")
lalonde <- as.data.table(lalonde)

if (simulated.data) {
  # simulated ATE and ATT
  lalonde[, known_pscore := (age + educ + 10*black + 10*hispan + 10*married)/100]
  lalonde[, .(min(known_pscore), max(known_pscore))]
  lalonde[, y0 := educ + married]
  lalonde[, y1 := y0 + 10 * age + 300 * married]
  lalonde[, treat := sapply(lalonde$known_pscore, function(x){sample(c(0,1), prob = c(1-x, x))[1]})]
  lalonde[, re78 := ifelse(treat==1, y1, y0)]
  cat("ATT (expectation): ", lalonde[,sum((y1-y0)*known_pscore) / sum(known_pscore)], "\n")
  cat("ATT (sample): ", lalonde[treat==1, mean(y1)-mean(y0)], "\n")
  cat("ATE (expectation&sample): ", lalonde[, mean(y1)-mean(y0)], "\n")
}

kable(head(lalonde))

treat	age	educ	black	hispan	married	nodegree	re78
1	37	11	1	0	1	1	9930.0460
1	22	9	0	1	0	1	3595.8940
1	30	12	1	0	0	0	24909.4500
1	27	11	1	0	0	1	7506.1460
1	33	8	1	0	0	1	289.7899
1	22	9	1	0	0	1	4056.4940

Setup

为了公平起见，大家都用一样的方式matching和估算causal effect。

ps.fml <- treat ~ age + educ + black + hispan + married + nodegree + re74 + re75 
ate.fml <- re78 ~ treat + age + educ + black + hispan + married + nodegree + re74 + re75

estimate.causal.effect <- function(m.data, estimand = "causal effect", method.name = "method name") {
  # formula setup (must be inside the function)
  ate.fml <- re78 ~ treat + age + educ + black + hispan + married + nodegree + re74 + re75
  ate.fml.onlytreat <- re78 ~ treat 
  
  res <- lm(ate.fml, data = m.data, weights = m.data$weights)
  res.reg.cov <- coef(res)['treat']
  cat("[Regression on treat + covariates] ", estimand, ": ", res.reg.cov, "\n")
  
  res <- lm(ate.fml.onlytreat, data = m.data, weights = m.data$weights)
  res.reg.treat <- coef(res)['treat']
  cat("[Regression on treated] ", estimand, ": ", res.reg.treat, "\n")
  
  y.treat <- weighted.mean(m.data$re78[m.data$treat == 1], m.data$weights[m.data$treat == 1])
  y.cont <- weighted.mean(m.data$re78[m.data$treat == 0], m.data$weights[m.data$treat == 0])
  res.direct.compare <- y.treat - y.cont
  cat("[Direct comparison] ", estimand, ":", y.treat - y.cont, "\n")
  
  return(data.table(
    `Method name` = method.name,
    `Estimand` = estimand,
    `Regression on treat + covariates` = res.reg.cov,
    `Regression on treat` = res.reg.treat,
    `Direct comparison` = res.direct.compare
    ))
}

原始数据

直接看数据，显得treat(就业培训)反而对re78(收入)产生了副作用。treat导致re78下降635美刀。不过直接用所有数据进行regression on treat+covariates，效果已经不错了，之后会再写一篇不使用propensity score而只用regression的博客。

1
2
3

lalonde.copy <- lalonde
lalonde.copy$weights <- 1
causal.effect.results.all <- estimate.causal.effect(lalonde.copy, "Causal effect", "Raw data")

## [Regression on treat + covariates]  Causal effect :  1548.244 
## [Regression on treated]  Causal effect :  -635.0262 
## [Direct comparison]  Causal effect : -635.0262

Matching via MatchIt

用MatchIt这个包来完成一些matching工作。部分代码来源于

MatchIt的文档：https://r.iq.harvard.edu/docs/matchit/2.4-20/matchit.pdf
MatchIt的Demo：https://github.com/kosukeimai/MatchIt/tree/master/demo
http://www.stephenpettigrew.com/teaching/gov2001/section11_2014.pdf

Exact Matching

This technique matches each treated unit to all possible control units with exactly the same values on all the covariates, forming subclasses such that within each subclass all units (treatment and control) have the same covariate values.

配出来只有15个人了，放弃治疗。匹配后的Treated样本比原来少，Treated+Control也不是原来的样子了，感觉估算的causal effect是介于ATE和ATT间的。

1 2	m.out <- matchit(ps.fml, data = lalonde, method = "exact") print(summary(m.out, standarize=T)$nn)

##           Control Treated
## All           429     185
## Matched        12      13
## Discarded     417     172

causal.effect.results.all <- rbind(
  causal.effect.results.all,
  estimate.causal.effect(match.data(m.out), "Causal effect", "Exact Matching")
)

## [Regression on treat + covariates]  Causal effect :  -1106.311 
## [Regression on treated]  Causal effect :  -1106.311 
## [Direct comparison]  Causal effect : -1106.311

Subclassification

The goal of subclassification is to form subclasses, such that in each the distribution (rather than the exact values) of covariates for the treated and control groups are as similar as possible.

用regression计算ATT得到$1600左右，接近ground truth。

m.out <- matchit(ps.fml, data = lalonde, method = "subclass", subclass=6)
if (!simulated.data) {
  print(summary(m.out, standarize=T))
}

## 
## Call:
## matchit(formula = ps.fml, data = lalonde, method = "subclass", 
##     subclass = 6, sub.by = "treat")
## Summary of balance for all data:
##          Means Treated Means Control  Mean Diff   eQQ Med  eQQ Mean
## distance        0.5774        0.1822     0.3952    0.5176    0.3955
## age            25.8162       28.0303    -2.2141    1.0000    3.2649
## educ           10.3459       10.2354     0.1105    1.0000    0.7027
## black           0.8432        0.2028     0.6404    1.0000    0.6432
## hispan          0.0595        0.1422    -0.0827    0.0000    0.0811
## married         0.1892        0.5128    -0.3236    0.0000    0.3243
## nodegree        0.7081        0.5967     0.1114    0.0000    0.1135
## re74         2095.5737     5619.2365 -3523.6628 2425.5720 3620.9240
## re75         1532.0553     2466.4844  -934.4291  981.0968 1060.6582
##            eQQ Max
## distance    0.5966
## age        10.0000
## educ        4.0000
## black       1.0000
## hispan      1.0000
## married     1.0000
## nodegree    1.0000
## re74     9216.5000
## re75     6795.0100
## 
## 
## Summary of balance by subclasses:
## , , Subclass 1
## 
##          Means Treated Means Control  Mean Diff    eQQ Med   eQQ Mean
## distance        0.1376        0.0763     0.0613     0.0570     0.0573
## age            25.5806       28.4595    -2.8789     2.0000     3.9355
## educ           10.4839       10.2543     0.2295     0.0000     0.7742
## black           0.0968        0.0145     0.0823     0.0000     0.0645
## hispan          0.3226        0.1734     0.1492     0.0000     0.1290
## married         0.2903        0.5780    -0.2877     0.0000     0.2903
## nodegree        0.5806        0.5809    -0.0003     0.0000     0.0000
## re74         3445.7771     6311.1947 -2865.4176  3015.3120  3607.4284
## re75         2080.7043     2637.3832  -556.6789   447.2700   882.0608
##             eQQ Max
## distance     0.1008
## age         14.0000
## educ         4.0000
## black        1.0000
## hispan       1.0000
## married      1.0000
## nodegree     0.0000
## re74      9177.7500
## re75      6810.6600
## 
## , , Subclass 2
## 
##          Means Treated Means Control  Mean Diff    eQQ Med   eQQ Mean
## distance        0.5081        0.4669     0.0413     0.0317     0.0388
## age            26.9032       32.2500    -5.3468     6.5000     6.5417
## educ            9.6129        9.3750     0.2379     0.0000     0.5417
## black           0.9677        0.9583     0.0094     0.0000     0.0000
## hispan          0.0323        0.0417    -0.0094     0.0000     0.0000
## married         0.4194        0.6667    -0.2473     0.0000     0.2500
## nodegree        0.6129        0.6250    -0.0121     0.0000     0.0417
## re74         6437.9113     6609.9134  -172.0021   786.1610  1165.8371
## re75         3089.0804     3401.3890  -312.3086   653.6978  1101.5569
##             eQQ Max
## distance     0.0737
## age         18.0000
## educ         3.0000
## black        0.0000
## hispan       0.0000
## married      1.0000
## nodegree     1.0000
## re74      4011.3600
## re75      4199.6200
## 
## , , Subclass 3
## 
##          Means Treated Means Control  Mean Diff    eQQ Med   eQQ Mean
## distance        0.6262        0.6292    -0.0031     0.0024     0.0039
## age            25.5517       25.1765     0.3753     2.0000     3.2941
## educ           10.2759       10.0000     0.2759     1.0000     1.0000
## black           1.0000        1.0000     0.0000     0.0000     0.0000
## hispan          0.0000        0.0000     0.0000     0.0000     0.0000
## married         0.4138        0.0588     0.3550     0.0000     0.3529
## nodegree        0.6897        0.4706     0.2191     0.0000     0.1765
## re74          806.7050     2110.7067 -1304.0016   133.2301  1267.0213
## re75          628.5508      940.9724  -312.4216     0.0000   302.3409
##             eQQ Max
## distance     0.0111
## age          9.0000
## educ         3.0000
## black        0.0000
## hispan       0.0000
## married      1.0000
## nodegree     1.0000
## re74      7110.3400
## re75      1878.2374
## 
## , , Subclass 4
## 
##          Means Treated Means Control  Mean Diff    eQQ Med   eQQ Mean
## distance        0.6823        0.6824    -0.0001     0.0031     0.0037
## age            23.5000       23.8571    -0.3571     3.0000     4.4286
## educ           10.1875       10.4762    -0.2887     0.0000     0.4762
## black           1.0000        1.0000     0.0000     0.0000     0.0000
## hispan          0.0000        0.0000     0.0000     0.0000     0.0000
## married         0.0312        0.1429    -0.1116     0.0000     0.0952
## nodegree        0.6250        0.6667    -0.0417     0.0000     0.0476
## re74         1099.4532     1134.9752   -35.5221     2.1510   707.1054
## re75         1446.2640     1704.0458  -257.7818   277.5000  1205.1226
##             eQQ Max
## distance     0.0101
## age         17.0000
## educ         1.0000
## black        0.0000
## hispan       0.0000
## married      1.0000
## nodegree     1.0000
## re74      7268.3960
## re75     12746.0500
## 
## , , Subclass 5
## 
##          Means Treated Means Control  Mean Diff    eQQ Med   eQQ Mean
## distance        0.7296        0.7346    -0.0049     0.0034     0.0056
## age            24.0000       22.1111     1.8889     1.5000     2.4444
## educ           10.2903       10.8889    -0.5986     0.5000     0.6111
## black           1.0000        1.0000     0.0000     0.0000     0.0000
## hispan          0.0000        0.0000     0.0000     0.0000     0.0000
## married         0.0000        0.0000     0.0000     0.0000     0.0000
## nodegree        0.8710        0.8333     0.0376     0.0000     0.0000
## re74          525.2020      432.2360    92.9660     0.0000   311.7185
## re75          754.4831      357.4678   397.0153     0.0000   432.7898
##             eQQ Max
## distance     0.0186
## age          7.0000
## educ         2.0000
## black        0.0000
## hispan       0.0000
## married      0.0000
## nodegree     0.0000
## re74      1557.2690
## re75      3772.3020
## 
## , , Subclass 6
## 
##          Means Treated Means Control  Mean Diff    eQQ Med   eQQ Mean
## distance        0.7805        0.7774     0.0031     0.0076     0.0256
## age            29.4194       25.6667     3.7527     7.0000     5.0000
## educ           11.2258       10.6667     0.5591     1.0000     2.0000
## black           1.0000        1.0000     0.0000     0.0000     0.0000
## hispan          0.0000        0.0000     0.0000     0.0000     0.0000
## married         0.0000        0.0000     0.0000     0.0000     0.0000
## nodegree        0.8710        1.0000    -0.1290     0.0000     0.3333
## re74          207.3736      281.4813   -74.1077     0.0000   454.1653
## re75         1137.7261     1912.6611  -774.9350  1367.8060  2215.0418
##             eQQ Max
## distance     0.0640
## age          7.0000
## educ         5.0000
## black        0.0000
## hispan       0.0000
## married      0.0000
## nodegree     1.0000
## re74      1362.4960
## re75      4387.5290
## 
## 
## Sample sizes by subclasses:
##         Subclass 1 Subclass 2 Subclass 3 Subclass 4 Subclass 5 Subclass 6
## Treated         31         31         29         32         31         31
## Control        346         24         17         21         18          3
## Total          377         55         46         53         49         34
## 
## Summary of balance across subclasses
##          Means Treated Means Control Mean Diff  eQQ Med  eQQ Mean
## distance        0.5774        0.5610    0.0124   0.0176    0.0226
## age            25.8162       26.2522    1.2403   3.6811    4.2855
## educ           10.3459       10.2809    0.1621   0.4081    0.8972
## black           0.8432        0.8279    0.0139   0.0000    0.0108
## hispan          0.0595        0.0360    0.0250   0.0000    0.0216
## married         0.1892        0.2425    0.0867   0.0000    0.1623
## nodegree        0.7081        0.6984    0.0417   0.0000    0.0987
## re74         2095.5737     2811.9421  523.0624 658.2604 1249.1060
## re75         1532.0553     1834.5600  192.6517 461.6864 1031.9283
##            eQQ Max
## distance    0.0466
## age        12.0595
## educ        2.9892
## black       0.1676
## hispan      0.1676
## married     0.6649
## nodegree    0.6649
## re74     5071.1549
## re75     5711.4374
## 
## Percent Balance Improvement:
##          Mean Diff.   eQQ Med eQQ Mean  eQQ Max
## distance    95.8506   96.5957  94.2885  92.1958
## age         80.3096 -268.1081 -31.2605 -20.5946
## educ        41.1584   59.1892 -27.6725  25.2703
## black       97.5999  100.0000  98.3193  83.2432
## hispan      71.6922    0.0000  73.3333  83.2432
## married     83.5264    0.0000  49.9451  33.5135
## nodegree    91.2511    0.0000  13.0169  33.5135
## re74        79.6698   72.8616  65.5031  44.9774
## re75        67.6268   52.9418   2.7087  15.9466

m.data <- match.data(m.out)

# regression
result <- aggregate(weights ~ subclass, data=m.data, sum)
result$att <- sapply(1:6, function(x){return(coef(lm(ate.fml, data = m.data[m.data$subclass==x, ]))["treat"])})
res.reg.cov <- weighted.mean(result$att, weights=result$weights)
cat("[Regression] ATT:", res.reg.cov)

## [Regression] ATT: 1642.246

causal.effect.results.all <- rbind(
  causal.effect.results.all,
  data.table(
    `Method name` = "Subclassification",
    `Estimand` = "ATT",
    `Regression on treat + covariates` = res.reg.cov,
    `Regression on treat` = NA,
    `Direct comparison` = NA
    ))

Nearest Neighbor Matching

Matches are chosen for each treated unit one at a time, with the order specified by the m.order command (default=largest to smallest). At each matching step we choose the control unit that is not yet matched but is closest to the treated unit on the distance measure.

使用MatchIt的默认参数进行配平。可以看到配平后一部分对照组的样本被去掉了。之前在Coursera上看“A Crash Course in Causality”的时候，prof说SMD<0.1认为可以接受，SMD在[0.1,0.3]之间就要很谨慎了。而这里black和hispan配平后的SMD都远远超过了0.3。

直接比较两组得到的ATT是900左右，通过regression得到的是1350左右。

# Propensity score matching
m.out <- matchit(ps.fml, data = lalonde, method = "nearest")
if (!simulated.data) {
  print(summary(m.out))
}

## 
## Call:
## matchit(formula = ps.fml, data = lalonde, method = "nearest")
## 
## Summary of balance for all data:
##          Means Treated Means Control SD Control  Mean Diff   eQQ Med
## distance        0.5774        0.1822     0.2295     0.3952    0.5176
## age            25.8162       28.0303    10.7867    -2.2141    1.0000
## educ           10.3459       10.2354     2.8552     0.1105    1.0000
## black           0.8432        0.2028     0.4026     0.6404    1.0000
## hispan          0.0595        0.1422     0.3497    -0.0827    0.0000
## married         0.1892        0.5128     0.5004    -0.3236    0.0000
## nodegree        0.7081        0.5967     0.4911     0.1114    0.0000
## re74         2095.5737     5619.2365  6788.7508 -3523.6628 2425.5720
## re75         1532.0553     2466.4844  3291.9962  -934.4291  981.0968
##           eQQ Mean   eQQ Max
## distance    0.3955    0.5966
## age         3.2649   10.0000
## educ        0.7027    4.0000
## black       0.6432    1.0000
## hispan      0.0811    1.0000
## married     0.3243    1.0000
## nodegree    0.1135    1.0000
## re74     3620.9240 9216.5000
## re75     1060.6582 6795.0100
## 
## 
## Summary of balance for matched data:
##          Means Treated Means Control SD Control Mean Diff  eQQ Med
## distance        0.5774        0.3629     0.2533    0.2145   0.1646
## age            25.8162       25.3027    10.5864    0.5135   3.0000
## educ           10.3459       10.6054     2.6582   -0.2595   0.0000
## black           0.8432        0.4703     0.5005    0.3730   0.0000
## hispan          0.0595        0.2162     0.4128   -0.1568   0.0000
## married         0.1892        0.2108     0.4090   -0.0216   0.0000
## nodegree        0.7081        0.6378     0.4819    0.0703   0.0000
## re74         2095.5737     2342.1076  4238.9757 -246.5339 131.2709
## re75         1532.0553     1614.7451  2632.3533  -82.6898 152.1774
##          eQQ Mean    eQQ Max
## distance   0.2146     0.4492
## age        3.3892     9.0000
## educ       0.4541     3.0000
## black      0.3730     1.0000
## hispan     0.1568     1.0000
## married    0.0216     1.0000
## nodegree   0.0703     1.0000
## re74     545.1182 13121.7500
## re75     349.5371 11365.7100
## 
## Percent Balance Improvement:
##          Mean Diff.   eQQ Med eQQ Mean  eQQ Max
## distance    45.7140   68.1921  45.7536  24.7011
## age         76.8070 -200.0000  -3.8079  10.0000
## educ      -134.7737  100.0000  35.3846  25.0000
## black       41.7636  100.0000  42.0168   0.0000
## hispan     -89.4761    0.0000 -93.3333   0.0000
## married     93.3191    0.0000  93.3333   0.0000
## nodegree    36.9046    0.0000  38.0952   0.0000
## re74        93.0035   94.5880  84.9453 -42.3724
## re75        91.1508   84.4891  67.0453 -67.2655
## 
## Sample sizes:
##           Control Treated
## All           429     185
## Matched       185     185
## Unmatched     244       0
## Discarded       0       0

causal.effect.results.all <- rbind(
  causal.effect.results.all,
  estimate.causal.effect(match.data(m.out), "ATT", "Nearest Neighbor Matching")
)

## [Regression on treat + covariates]  ATT :  1351.96 
## [Regression on treated]  ATT :  908.2022 
## [Direct comparison]  ATT : 908.2022

Full matching

A fully matched sample is composed of matched sets, where each matched set contains one treated unit and one or more controls (or one control unit and one or more treated units).

估计的到的ATT都是1800左右，非常接近ground truth。

1	m.out <- matchit(ps.fml, data = lalonde, method = "full")

## Warning in optmatch::fullmatch(d, ...): Without 'data' argument the order of the match is not guaranteed
##     to be the same as your original data.

1
2
3

if (!simulated.data) {
  print(summary(m.out, standarize=T))
}

## 
## Call:
## matchit(formula = ps.fml, data = lalonde, method = "full")
## 
## Summary of balance for all data:
##          Means Treated Means Control  Mean Diff   eQQ Med  eQQ Mean
## distance        0.5774        0.1822     0.3952    0.5176    0.3955
## age            25.8162       28.0303    -2.2141    1.0000    3.2649
## educ           10.3459       10.2354     0.1105    1.0000    0.7027
## black           0.8432        0.2028     0.6404    1.0000    0.6432
## hispan          0.0595        0.1422    -0.0827    0.0000    0.0811
## married         0.1892        0.5128    -0.3236    0.0000    0.3243
## nodegree        0.7081        0.5967     0.1114    0.0000    0.1135
## re74         2095.5737     5619.2365 -3523.6628 2425.5720 3620.9240
## re75         1532.0553     2466.4844  -934.4291  981.0968 1060.6582
##            eQQ Max
## distance    0.5966
## age        10.0000
## educ        4.0000
## black       1.0000
## hispan      1.0000
## married     1.0000
## nodegree    1.0000
## re74     9216.5000
## re75     6795.0100
## 
## 
## Summary of balance for matched data:
##          Means Treated Means Control Mean Diff  eQQ Med eQQ Mean   eQQ Max
## distance        0.5774        0.5764    0.0011   0.0034   0.0069     0.096
## age            25.8162       24.7815    1.0347   3.0000   3.1652     9.000
## educ           10.3459       10.2991    0.0469   0.0000   0.5128     4.000
## black           0.8432        0.8347    0.0086   0.0000   0.0080     1.000
## hispan          0.0595        0.0594    0.0001   0.0000   0.0008     1.000
## married         0.1892        0.1290    0.0601   0.0000   0.0648     1.000
## nodegree        0.7081        0.7083   -0.0002   0.0000   0.0188     1.000
## re74         2095.5737     2177.3149  -81.7412 172.4155 576.0290 13121.750
## re75         1532.0553     1503.4044   28.6509 215.5100 530.5839 12746.050
## 
## Percent Balance Improvement:
##          Mean Diff.   eQQ Med eQQ Mean  eQQ Max
## distance    99.7289   99.3438  98.2480  83.9052
## age         53.2686 -200.0000   3.0526  10.0000
## educ        57.6040  100.0000  27.0246   0.0000
## black       98.6582  100.0000  98.7563   0.0000
## hispan      99.8925    0.0000  99.0133   0.0000
## married     81.4165    0.0000  80.0200   0.0000
## nodegree    99.8266    0.0000  83.4381   0.0000
## re74        97.6802   92.8918  84.0917 -42.3724
## re75        96.9339   78.0338  49.9760 -87.5796
## 
## Sample sizes:
##           Control Treated
## All           429     185
## Matched       429     185
## Unmatched       0       0
## Discarded       0       0

causal.effect.results.all <- rbind(
  causal.effect.results.all,
  estimate.causal.effect(match.data(m.out), "ATT", "Full Matching")
)

## [Regression on treat + covariates]  ATT :  1848.929 
## [Regression on treated]  ATT :  1803.443 
## [Direct comparison]  ATT : 1803.443

Optimal

In contrast, “optimal” matching finds the matched samples with the smallest average absolute distance across all the matched pairs.

通过regression得到的ATT为1400，直接比较的ATT为1000。

1	m.out <- matchit(ps.fml, data = lalonde, method = "optimal")

## Warning in optmatch::fullmatch(d, min.controls = ratio, max.controls = ratio, : Without 'data' argument the order of the match is not guaranteed
##     to be the same as your original data.

1
2
3

if (!simulated.data) {
  print(summary(m.out, standarize=T))
}

## 
## Call:
## matchit(formula = ps.fml, data = lalonde, method = "optimal")
## 
## Summary of balance for all data:
##          Means Treated Means Control SD Control  Mean Diff   eQQ Med
## distance        0.5774        0.1822     0.2295     0.3952    0.5176
## age            25.8162       28.0303    10.7867    -2.2141    1.0000
## educ           10.3459       10.2354     2.8552     0.1105    1.0000
## black           0.8432        0.2028     0.4026     0.6404    1.0000
## hispan          0.0595        0.1422     0.3497    -0.0827    0.0000
## married         0.1892        0.5128     0.5004    -0.3236    0.0000
## nodegree        0.7081        0.5967     0.4911     0.1114    0.0000
## re74         2095.5737     5619.2365  6788.7508 -3523.6628 2425.5720
## re75         1532.0553     2466.4844  3291.9962  -934.4291  981.0968
##           eQQ Mean   eQQ Max
## distance    0.3955    0.5966
## age         3.2649   10.0000
## educ        0.7027    4.0000
## black       0.6432    1.0000
## hispan      0.0811    1.0000
## married     0.3243    1.0000
## nodegree    0.1135    1.0000
## re74     3620.9240 9216.5000
## re75     1060.6582 6795.0100
## 
## 
## Summary of balance for matched data:
##          Means Treated Means Control SD Control Mean Diff  eQQ Med
## distance        0.5774        0.3629     0.2533    0.2145   0.1646
## age            25.8162       25.2486    10.6451    0.5676   3.0000
## educ           10.3459       10.5622     2.6961   -0.2162   0.0000
## black           0.8432        0.4703     0.5005    0.3730   0.0000
## hispan          0.0595        0.2108     0.4090   -0.1514   0.0000
## married         0.1892        0.2000     0.4011   -0.0108   0.0000
## nodegree        0.7081        0.6432     0.4803    0.0649   0.0000
## re74         2095.5737     2282.3340  4173.1175 -186.7603 133.2301
## re75         1532.0553     1543.9354  2592.3666  -11.8801 133.5970
##          eQQ Mean    eQQ Max
## distance   0.2146     0.4492
## age        3.4432    10.0000
## educ       0.4649     3.0000
## black      0.3730     1.0000
## hispan     0.1514     1.0000
## married    0.0108     1.0000
## nodegree   0.0649     1.0000
## re74     535.0728 13121.7500
## re75     328.4209 11365.7100
## 
## Percent Balance Improvement:
##          Mean Diff.   eQQ Med eQQ Mean  eQQ Max
## distance    45.7145   68.1921  45.7508  24.7011
## age         74.3656 -200.0000  -5.4636   0.0000
## educ       -95.6447  100.0000  33.8462  25.0000
## black       41.7636  100.0000  42.0168   0.0000
## hispan     -82.9424    0.0000 -86.6667   0.0000
## married     96.6595    0.0000  96.6667   0.0000
## nodegree    41.7581    0.0000  42.8571   0.0000
## re74        94.6998   94.5073  85.2228 -42.3724
## re75        98.7286   86.3829  69.0361 -67.2655
## 
## Sample sizes:
##           Control Treated
## All           429     185
## Matched       185     185
## Unmatched     244       0
## Discarded       0       0

causal.effect.results.all <- rbind(
  causal.effect.results.all,
  estimate.causal.effect(match.data(m.out), "ATT", "Optimal Matching")
)

## [Regression on treat + covariates]  ATT :  1394.992 
## [Regression on treated]  ATT :  1022.93 
## [Direct comparison]  ATT : 1022.93

Genetic Matching

Genetic matching automates the process of finding a good matching solution (Diamond and Sekhon 2005). The idea is to use a genetic search algorithm to find a set of weights for each covariate such that the a version of optimal balance is achieved after matching.

Genetic matching略慢，这里只保留注释掉的代码。

# m.out <- matchit(ps.fml, data = lalonde, method = "genetic")
# if (!simulated.data) {
#   print(summary(m.out, standarize=T))
# }
# estimate.causal.effect(match.data(m.out), "ATT")

手写

IPW (lm)

部分参考：

基础inverse propensity weighting (IPW)。得到的ATT大约是1210-1240。

lalonde.copy <- as.data.table(lalonde)
ps.model <- glm(ps.fml, data=lalonde.copy, family=binomial())
lalonde.copy[, pscore := predict(ps.model, newdata = lalonde.copy, type = "response")]
lalonde.copy[, weights := treat + (1-treat) *pscore / (1-pscore)]

causal.effect.results.all <- rbind(
  causal.effect.results.all,
  estimate.causal.effect(lalonde.copy, "ATT", "Inverse PS Weighting")
)

## [Regression on treat + covariates]  ATT :  1237.405 
## [Regression on treated]  ATT :  1214.071 
## [Direct comparison]  ATT : 1214.071

检查平衡。为了不让输出太刷屏，注释掉了几个。可以看到其实不是很平衡。

# balance.check.col <- c("age", "educ", "black", "hispan", "nodegree", "re74", "re75")
balance.check.col <- c("re74", "re75")
for (v in balance.check.col) {
  print(v)
  fml <- paste0(v, "~treat")
  print(summary(lm(fml, data=lalonde.copy, weights=pscore)))
}

## [1] "re74"
## 
## Call:
## lm(formula = fml, data = lalonde.copy, weights = pscore)
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
##  -2494  -1241   -492   1343  15286 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2808.0      270.0  10.400  < 2e-16 ***
## treat        -1233.7      355.3  -3.472 0.000553 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2387 on 612 degrees of freedom
## Multiple R-squared:  0.01932,    Adjusted R-squared:  0.01772 
## F-statistic: 12.06 on 1 and 612 DF,  p-value: 0.0005528
## 
## [1] "re75"
## 
## Call:
## lm(formula = fml, data = lalonde.copy, weights = pscore)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -1543.1  -803.2  -292.2   496.2 19756.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1777.5      188.1   9.451   <2e-16 ***
## treat         -425.1      247.5  -1.718   0.0864 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1663 on 612 degrees of freedom
## Multiple R-squared:  0.004797,   Adjusted R-squared:  0.003171 
## F-statistic:  2.95 on 1 and 612 DF,  p-value: 0.08639

删去propensity score小于0.1和大于0.9的样本后，balance提升了很多，但是ATT值变化不大，变成了1230到1260。

lalonde.copy2 <- lalonde.copy[pscore >= .1 & pscore <=.9,]
balance.check.col <- c("re74", "re75")
for (v in balance.check.col) {
  print(v)
  fml <- paste0(v, "~treat")
  print(summary(lm(fml, data=lalonde.copy2, weights=pscore)))
}

## [1] "re74"
## 
## Call:
## lm(formula = fml, data = lalonde.copy2, weights = pscore)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -2064.6 -1338.4 -1027.5  -219.3 15285.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2324.1      363.3   6.398 5.23e-10 ***
## treat         -749.2      462.1  -1.621    0.106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2942 on 339 degrees of freedom
## Multiple R-squared:  0.007695,   Adjusted R-squared:  0.004768 
## F-statistic: 2.629 on 1 and 339 DF,  p-value: 0.1059
## 
## [1] "re75"
## 
## Call:
## lm(formula = fml, data = lalonde.copy2, weights = pscore)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -1407.6 -1123.7  -645.7   241.4 19754.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1621.5      263.6   6.150 2.18e-09 ***
## treat         -267.7      335.3  -0.798    0.425    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2135 on 339 degrees of freedom
## Multiple R-squared:  0.001877,   Adjusted R-squared:  -0.001068 
## F-statistic: 0.6374 on 1 and 339 DF,  p-value: 0.4252

causal.effect.results.all <- rbind(
  causal.effect.results.all,
  estimate.causal.effect(lalonde.copy2, "ATT", "Inverse PS Weighting + Trimming")
)

## [Regression on treat + covariates]  ATT :  1233.417 
## [Regression on treated]  ATT :  1261.842 
## [Direct comparison]  ATT : 1261.842

汇总结果

个人感觉：Full Matching > Subclassification > Nearest Neighbor，然后它们都应该再配上regression来估计ATT。另外，IPW挺好的。

1	kable(causal.effect.results.all)

Method name	Estimand	Regression on treat + covariates	Regression on treat	Direct comparison
Raw data	Causal effect	1548.244	-635.0262	-635.0262
Exact Matching	Causal effect	-1106.311	-1106.3111	-1106.3111
Subclassification	ATT	1642.246	NA	NA
Nearest Neighbor Matching	ATT	1351.960	908.2022	908.2022
Full Matching	ATT	1848.929	1803.4427	1803.4427
Optimal Matching	ATT	1394.992	1022.9297	1022.9297
Inverse PS Weighting	ATT	1237.405	1214.0712	1214.0712
Inverse PS Weighting + Trimming	ATT	1233.417	1261.8423	1261.8423