library(dplyr)
library(ggplot2)
Motivated by discussion with Tim Cadman relating to this Renzi et al (2019). Long-Term PM10 Exposure and Cause-Specific Mortality in the Latium Region (Italy): A Difference-in-Differences Approach.
For argument’s sake suppose that air pollution (PM10) isn’t causal for deaths. Simulate a situation where wealth varies by region, and each region has a different wealth trajectory over time. Some info:
- Wealth, PM10 and deaths are measured for 10 years in 300 regions.
- Wealth causes PM10 and deaths.
- There is a global confounder for the start point of wealth and deaths
- Time has an additional effect on both deaths and wealth - i.e. it’s also a global confounder
- The causal effect of wealth on death = 1. We want our model to get that right.
<- 300
nregion <- 1:300
region <- rnorm(nregion)
wealth_int <- rnorm(nregion)
wealth_slope <- rnorm(nregion, sd=10)
global_confounder <- 10
nyear
<- lapply(1:nregion, function(i)
dat
{tibble(
region=i,
year=1:nyear,
# wealth goes up due to global confounder, year, random error
wealth = global_confounder[i] + wealth_int[i] + year * wealth_slope[i] + rnorm(nyear),
# PM10 only related to wealth and random error
pm10 = wealth + rnorm(nyear, sd=5),
# deaths go up due to global confounders, wealth, year and random term
deaths = global_confounder[i] + wealth + rnorm(nyear) + year
)%>%
}) bind_rows()
dat
# A tibble: 3,000 × 5
region year wealth pm10 deaths
<int> <int> <dbl> <dbl> <dbl>
1 1 1 2.25 2.28 5.18
2 1 2 1.81 0.393 5.79
3 1 3 2.25 4.05 9.20
4 1 4 2.89 12.0 11.6
5 1 5 4.02 7.55 12.7
6 1 6 3.15 3.08 12.7
7 1 7 1.67 1.51 11.3
8 1 8 2.83 4.92 13.4
9 1 9 5.08 8.42 16.2
10 1 10 4.96 10.1 18.4
# … with 2,990 more rows
Plot it showing change in deaths over time by region
ggplot(dat, aes(x=year, y=deaths)) +
geom_point(aes(group=as.factor(region))) +
geom_line(aes(group=as.factor(region)))
Try again but just regression lines per region
ggplot(dat, aes(x=year, y=deaths)) +
geom_point(aes(group=as.factor(region))) +
geom_smooth(method="lm", aes(group=as.factor(region)), se=FALSE)
`geom_smooth()` using formula = 'y ~ x'
Show that PM10 is closely coupled with wealth
plot(pm10 ~ wealth, dat)
Use regression to test for influence of wealth on deaths - this gives a very confounded result because of the global confounder.
summary(lm(deaths ~ wealth, dat))
Call:
lm(formula = deaths ~ wealth, data = dat)
Residuals:
Min 1Q Median 3Q Max
-17.6518 -4.4416 -0.7553 3.7309 25.9024
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.760562 0.115301 49.96 <2e-16 ***
wealth 1.715797 0.009365 183.21 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.315 on 2998 degrees of freedom
Multiple R-squared: 0.918, Adjusted R-squared: 0.918
F-statistic: 3.357e+04 on 1 and 2998 DF, p-value: < 2.2e-16
Same will be true for PM10
summary(lm(deaths ~ pm10, dat))
Call:
lm(formula = deaths ~ pm10, data = dat)
Residuals:
Min 1Q Median 3Q Max
-33.668 -6.799 -0.196 6.836 38.319
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.88951 0.18634 31.6 <2e-16 ***
pm10 1.46754 0.01399 104.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.21 on 2998 degrees of freedom
Multiple R-squared: 0.7858, Adjusted R-squared: 0.7857
F-statistic: 1.1e+04 on 1 and 2998 DF, p-value: < 2.2e-16
So PM10 is confounded at two levels - the between-region (global) level and the within-region level.
Now do a DiD model for wealth - it should give us an unbiased estimate of 1. Note that this long form of DiD is sometimes called a fixed-effects model.
summary(lm(deaths ~ wealth + as.factor(region) + as.factor(year), dat)) %>%
%>% as_tibble %>% slice(n=2) coefficients
# A tibble: 1 × 4
Estimate `Std. Error` `t value` `Pr(>|t|)`
<dbl> <dbl> <dbl> <dbl>
1 0.993 0.00590 168. 0
It does. What about DiD model for PM10? This should be less confounded because it eliminates the global confounder, but still confounded by the structural confounding that happens at all areas
summary(lm(deaths ~ pm10 + as.factor(region) + as.factor(year), dat)) %>%
%>% as_tibble %>% slice(n=2) coefficients
# A tibble: 1 × 4
Estimate `Std. Error` `t value` `Pr(>|t|)`
<dbl> <dbl> <dbl> <dbl>
1 0.293 0.00931 31.4 4.94e-185
If we control for wealth, the effect of PM10 will be unbiased because the within-region bias has been removed, and the global bias acted via wealth anyway so that’s been removed also
summary(lm(deaths ~ pm10 + wealth, dat)) %>%
%>% as_tibble %>% slice(n=2) coefficients
# A tibble: 1 × 4
Estimate `Std. Error` `t value` `Pr(>|t|)`
<dbl> <dbl> <dbl> <dbl>
1 -0.0224 0.0231 -0.969 0.333
We could try to simplify by doing the more explicit difference in difference estimate. Compare the change in deaths with the change in wealth (or PM10) over the 10 year period.
<- group_by(dat, region) %>%
dat2 summarise(
delta_deaths = deaths[nyear] - deaths[1],
delta_wealth = wealth[nyear] - wealth[1],
delta_pm10 = pm10[nyear] - pm10[1]
) dat2
# A tibble: 300 × 4
region delta_deaths delta_wealth delta_pm10
<int> <dbl> <dbl> <dbl>
1 1 13.2 2.71 7.80
2 2 19.6 12.9 18.6
3 3 20.6 10.6 6.65
4 4 4.24 -3.10 -10.7
5 5 9.08 -0.285 -9.31
6 6 17.7 9.94 8.22
7 7 10.7 2.15 10.4
8 8 16.7 6.61 -3.01
9 9 6.55 -2.51 13.1
10 10 17.9 9.83 10.8
# … with 290 more rows
Do the DiD estimates using these - should recapitulate what we got above
summary(lm(delta_deaths ~ delta_wealth, dat2))
Call:
lm(formula = delta_deaths ~ delta_wealth, data = dat2)
Residuals:
Min 1Q Median 3Q Max
-5.6788 -0.9511 0.1077 1.0252 3.6020
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.063688 0.082771 109.5 <2e-16 ***
delta_wealth 0.990565 0.008896 111.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.434 on 298 degrees of freedom
Multiple R-squared: 0.9765, Adjusted R-squared: 0.9764
F-statistic: 1.24e+04 on 1 and 298 DF, p-value: < 2.2e-16
summary(lm(delta_deaths ~ delta_pm10, dat2))
Call:
lm(formula = delta_deaths ~ delta_pm10, data = dat2)
Residuals:
Min 1Q Median 3Q Max
-14.0715 -3.9080 -0.3154 3.7670 16.6385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.09954 0.33647 27.04 <2e-16 ***
delta_pm10 0.61677 0.02844 21.69 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.827 on 298 degrees of freedom
Multiple R-squared: 0.6122, Adjusted R-squared: 0.6109
F-statistic: 470.5 on 1 and 298 DF, p-value: < 2.2e-16
Weirdly, it works for wealth but we do get a slightly difference answer for PM10. There is probably a lot of literature on what makes a fixed effects model (the first version of the DiD we did above) different from an explicit version like this one.
To summarise - the DiD model is useful to account for unmeasured global confounders (including time), but it might not be too surprising that it doesn’t control for all confounders - you really do need some sort of experiment / randomisation that specifically mimics the exact intervention you want to make to get to completely unconfounded effects.
sessionInfo()
R version 4.2.1 Patched (2022-09-06 r82817)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.4.0 dplyr_1.0.10
loaded via a namespace (and not attached):
[1] pillar_1.8.1 compiler_4.2.1 tools_4.2.1 digest_0.6.31
[5] lattice_0.20-45 nlme_3.1-158 jsonlite_1.8.4 evaluate_0.19
[9] lifecycle_1.0.3 tibble_3.1.8 gtable_0.3.1 mgcv_1.8-40
[13] pkgconfig_2.0.3 rlang_1.0.6 Matrix_1.4-1 DBI_1.1.3
[17] cli_3.5.0 yaml_2.3.6 xfun_0.36 fastmap_1.1.0
[21] withr_2.5.0 stringr_1.5.0 knitr_1.41 generics_0.1.3
[25] vctrs_0.5.1 htmlwidgets_1.5.4 grid_4.2.1 tidyselect_1.2.0
[29] glue_1.6.2 R6_2.5.1 fansi_1.0.3 rmarkdown_2.16
[33] farver_2.1.1 magrittr_2.0.3 scales_1.2.1 htmltools_0.5.4
[37] splines_4.2.1 assertthat_0.2.1 colorspace_2.0-3 labeling_0.4.2
[41] utf8_1.2.2 stringi_1.7.8 munsell_0.5.0