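Causal Inference Study Notes (2): Trying Classic Methods, Regression (Lalonde's Dataset)
Chapter 3 of Mostly Harmless Econometrics covers how to use regression to estimate causal effects. One of its sections uses Lalonde's dataset (NSW + PSID + CPS data) to try out different regression specifications. The goal of this post is to reproduce the textbook's results as a hands-on exercise.
The content below was converted from R Markdown. Git: yishilin14/causal_playground
The Markdown for this post was generated with `render("./causality2_playaround_with_the_lalonde_dataset_regression.Rmd", md_document(variant = "markdown_github"))`.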
Reading the data
Dataset
- Paper: Dehejia R H, Wahba S. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 1999, 94(448): 1053-1062. (http://www.uh.edu/~adkugler/Dehejia&Wahba_JASA.pdf)
- Download: http://users.nber.org/~rdehejia/data/nswdata2.html
```r
# Load the datasets
```
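Reproduce Table 1 of Dehejia-Wahba (1999) to look at the sample mean of each group. The means match the paper. The standard errors of the sample means are close to the paper's values but not identical, and I do not know how the paper computed them. Each cell below shows the mean with the standard error of the mean in parentheses.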
```r
get.mean.se <- function(x) {
```
Dataset | treat | no. obs | age | educ | black | hispan | married | nodegree | re74 | re75 | re78 |
---|---|---|---|---|---|---|---|---|---|---|---|
NSW | 1 | 185 | 25.82(0.53) | 10.35(0.15) | 0.84(0.03) | 0.06(0.02) | 0.19(0.03) | 0.71(0.03) | 2095.57(359.27) | 1532.06(236.68) | 6349.14(578.42) |
NSW | 0 | 260 | 25.05(0.44) | 10.09(0.1) | 0.83(0.02) | 0.11(0.02) | 0.15(0.02) | 0.83(0.02) | 2107.03(352.75) | 1266.91(192.44) | 4554.8(340.09) |
CPS-1 | 0 | 15992 | 33.23(0.09) | 12.03(0.02) | 0.07(0) | 0.07(0) | 0.71(0) | 0.3(0) | 14016.8(75.67) | 13650.8(73.31) | 14846.66(76.29) |
CPS-3 | 0 | 429 | 28.03(0.52) | 10.24(0.14) | 0.2(0.02) | 0.14(0.02) | 0.51(0.02) | 0.6(0.02) | 5619.24(327.76) | 2466.48(158.94) | 6984.17(352.17) |
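The `mean(se)` cells in the table above can be produced by a small helper. The sketch below is an illustration only, not the post's `get.mean.se` (its body is not shown here): the name `mean_se` is hypothetical, the data are synthetic, and the standard error is assumed to be `sd(x) / sqrt(n)`, which may differ from whatever formula the paper used.

```r
# Hypothetical helper: format "mean(se)" for one numeric vector.
# Assumption: SE of the sample mean is sd(x) / sqrt(n).
mean_se <- function(x, digits = 2) {
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))
  paste0(round(m, digits), "(", round(se, digits), ")")
}

# Synthetic stand-in for one group (NOT the real NSW data).
set.seed(1)
toy <- data.frame(
  age  = rnorm(185, mean = 25, sd = 7),
  educ = rnorm(185, mean = 10, sd = 2)
)

# One formatted cell per column, as in a single row of the table.
summary_row <- sapply(toy, mean_se)
print(summary_row)
```

Applying the helper per group (for example with `split()` on the treatment indicator) would give one such row for each dataset and treatment arm.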
Regression
The goal is to reproduce Table 3.3.3 of Mostly Harmless Econometrics (page 68).
Raw Difference
The book's standard error uses the pooled variance. When computing the SE, I consulted the following notes on Comparing Two Population Means: Independent Samples:
- https://newonlinecourses.science.psu.edu/stat500/node/50/
- https://newonlinecourses.science.psu.edu/stat800/node/52/
```r
raw.difference <- function(dt) {
```
Dataset | ATT | SE | No. Obs. |
---|---|---|---|
NSW | 1794 | 633 | 185/260 |
CPS-1 | -8498 | 712 | 185/15992 |
CPS-3 | -635 | 657 | 185/429 |
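For reference, the pooled-variance standard error of a difference in means can be computed as follows. This is a minimal sketch under stated assumptions: the function name `raw_difference_pooled` is hypothetical, the data are synthetic, and it is not the post's `raw.difference` function, whose body is not shown here.

```r
# Difference in means with a pooled-variance standard error.
# Hypothetical helper; y is the outcome, treat is a 0/1 indicator.
raw_difference_pooled <- function(y, treat) {
  y1 <- y[treat == 1]
  y0 <- y[treat == 0]
  n1 <- length(y1)
  n0 <- length(y0)
  # Pooled variance: weighted average of the two group variances.
  pooled_var <- ((n1 - 1) * var(y1) + (n0 - 1) * var(y0)) / (n1 + n0 - 2)
  c(diff = mean(y1) - mean(y0),
    se   = sqrt(pooled_var * (1 / n1 + 1 / n0)))
}

# Synthetic data with the NSW group sizes (NOT the real earnings).
set.seed(7)
treat <- rep(c(1, 0), times = c(185, 260))
y <- 4500 + 1800 * treat + rnorm(length(treat), sd = 6000)

res <- raw_difference_pooled(y, treat)
# OLS of y on treat alone gives the same estimate and (homoskedastic) SE.
ols <- summary(lm(y ~ treat))$coefficients["treat", ]
print(round(res))
```

The comparison with `lm` is a handy cross-check: with a single binary regressor, the default OLS standard error coincides with the pooled-variance formula.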
Demographic controls
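The results reproduced here do not match the book's numbers exactly, but the overall pattern is similar. The book does not spell out the exact regression specification, for example whether quadratic terms are included. The results in the following sections that use demographic controls likewise do not match the book exactly.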
```r
# P-score Screened Samples
```
Dataset | ATT | SE | No. Obs. |
---|---|---|---|
NSW | 1671 | 638 | 185/260 |
CPS-1 | -2973 | 716 | 185/15992 |
CPS-3 | 1164 | 812 | 185/429 |
CPS-1-Subset | -2675 | 826 | 123/415 |
CPS-3-Subset | 1383 | 882 | 172/156 |
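The "Subset" rows come from propensity-score screening. The code for that step is not shown here, so the sketch below is an assumption about how it could work: fit a logit of `treat` on the covariates, then keep observations whose estimated score lies strictly between 0.1 and 0.9. The cutoffs, the logit link, the covariate list, and the synthetic data are all illustrative choices, not confirmed details of the post or the book.

```r
# Synthetic data (NOT the Lalonde data): treatment depends on age and educ.
set.seed(42)
n <- 2000
toy <- data.frame(age = rnorm(n, 30, 8), educ = rnorm(n, 11, 2))
toy$treat <- rbinom(
  n, 1,
  plogis(-2 - 0.08 * (toy$age - 30) - 0.3 * (toy$educ - 11))
)

# Step 1: estimate the propensity score with a logit (assumed link).
ps_model <- glm(treat ~ age + educ, data = toy, family = binomial())
toy$pscore <- fitted(ps_model)

# Step 2: keep only observations with 0.1 < pscore < 0.9 (assumed cutoffs).
screened <- toy[toy$pscore > 0.1 & toy$pscore < 0.9, ]

# Treated / control counts before and after screening.
print(table(toy$treat))
print(table(screened$treat))
```

A probit could be swapped in with `binomial(link = "probit")`. The subset sizes in the tables change from one specification to the next, which suggests the score is re-estimated with each specification's covariates.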
1975 Earnings
```r
fml_att <- re78 ~ treat + re75
```
Dataset | ATT | SE | No. Obs. |
---|---|---|---|
NSW | 1750 | 632 | 185/260 |
CPS-1 | -78 | 537 | 185/15992 |
CPS-3 | -91 | 641 | 185/429 |
CPS-3-Subset | 162 | 643 | 183/426 |
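The regression specifications in this post differ only in the formula, so one helper can serve all of them. The sketch below assumes the reported ATT is the OLS coefficient on `treat` and that the SE is the default homoskedastic one from `lm`; the helper name `estimate_att` and the synthetic data are hypothetical, and the post's actual code may differ (for instance by using robust standard errors).

```r
# Hypothetical helper: fit OLS and report the coefficient on `treat`.
estimate_att <- function(fml, data) {
  fit   <- lm(fml, data = data)
  coefs <- summary(fit)$coefficients
  data.frame(
    ATT    = round(coefs["treat", "Estimate"]),
    SE     = round(coefs["treat", "Std. Error"]),
    # Treated / control counts, as in the "No. Obs." column.
    No.Obs = paste0(sum(data$treat == 1), "/", sum(data$treat == 0))
  )
}

# Synthetic data (NOT the Lalonde data) with a true effect of 1500.
set.seed(123)
n <- 1000
toy <- data.frame(
  treat = rbinom(n, 1, 0.4),
  re75  = rexp(n, rate = 1 / 3000)
)
toy$re78 <- 1000 + 1500 * toy$treat + 0.5 * toy$re75 + rnorm(n, sd = 1000)

fml_att <- re78 ~ treat + re75
result <- estimate_att(fml_att, toy)
print(result)
```

Swapping in another formula, such as one that adds the demographic covariates, reuses the same helper unchanged.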
Demographics, 1975 Earnings
```r
fml_att <- re78 ~ treat + age + educ + black + hispan + nodegree + married + re75
```
Dataset | ATT | SE | No. Obs. |
---|---|---|---|
NSW | 1636 | 638 | 185/260 |
CPS-1 | 632 | 557 | 185/15992 |
CPS-3 | 1221 | 794 | 185/429 |
CPS-1-Subset | 1421 | 675 | 147/360 |
CPS-3-Subset | 1262 | 883 | 171/160 |
Demographics, 1974 and 1975 Earnings
```r
fml_att <- re78 ~ treat + age + educ + black + hispan + nodegree + married + re74 + re75
```
Dataset | ATT | SE | No. Obs. |
---|---|---|---|
NSW | 1676 | 639 | 185/260 |
CPS-1 | 699 | 548 | 185/15992 |
CPS-3 | 1548 | 781 | 185/429 |
CPS-1-Subset | 1653 | 668 | 145/359 |
CPS-3-Subset | 1230 | 849 | 175/166 |
Summary
```r
results.all <- results.demographic[, .(Dataset)]
```
Dataset | Raw Difference | Demographic controls | 1975 Earnings | Demographics, 1975 Earnings | Demographics, 1974 and 1975 Earnings |
---|---|---|---|---|---|
NSW | 1794 | 1671 | 1750 | 1636 | 1676 |
CPS-1 | -8498 | -2973 | -78 | 632 | 699 |
CPS-3 | -635 | 1164 | -91 | 1221 | 1548 |
CPS-1-Subset | | -2675 | no obs. | 1421 | 1653 |
CPS-3-Subset | | 1383 | 162 | 1262 | 1230 |
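A summary table like this can be assembled by outer-joining the per-specification results on `Dataset`. The sketch below uses base R data frames and a few of the ATT values reported above; it is an illustration, not the post's `data.table` code, and the column names `raw_diff` and `demographic` are hypothetical. Combinations that were never estimated come out as `NA`, which would explain the empty Raw Difference cells for the screened subsets.

```r
# Two per-specification result tables, using ATT values reported above.
res_raw <- data.frame(
  Dataset  = c("NSW", "CPS-1", "CPS-3"),
  raw_diff = c(1794, -8498, -635)
)
res_demo <- data.frame(
  Dataset     = c("NSW", "CPS-1", "CPS-3", "CPS-1-Subset", "CPS-3-Subset"),
  demographic = c(1671, -2973, 1164, -2675, 1383)
)

# Outer join: keep every dataset that appears in either table.
wide <- merge(res_raw, res_demo, by = "Dataset", all = TRUE)
print(wide)
```

With more specifications, `Reduce()` over a list of such tables applies the same merge repeatedly.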