Rsq in simulations

Author

Gibran Hemani

Published

December 6, 2023

Background

What determines the \(R^2\) between X and Y in a linear model under confounding? In a simple linear model, the slope and the correlation are related as follows:

\[ \begin{aligned} Y &= a + bX + E \\ b &= cov(X, Y) / var(X) \\ R &= cov(X, Y) / [sd(X)sd(Y)] \\ &= b \cdot sd(X) / sd(Y) \end{aligned} \]
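The relationship between the slope and the correlation can be checked numerically. This is a minimal sketch on simulated data; the sample size and the effect size of 0.5 are arbitrary choices for illustration, not values from the rest of this post.

```r
# Arbitrary illustrative data: y depends linearly on x plus noise
set.seed(1)
x <- rnorm(1000)
y <- 0.5 * x + rnorm(1000)

# Slope from the covariance formula
b <- cov(x, y) / var(x)

# Correlation recovered from the slope and the two standard deviations
r_from_b <- b * sd(x) / sd(y)

# Compare against the directly computed correlation
all.equal(r_from_b, cor(x, y))
```

Because the relationship is an algebraic identity, it holds in any sample and not only in expectation.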

Under unmeasured confounding, the OLS estimate of b differs from the IV estimate. Let G be an instrument for X and U an unmeasured confounder, and assume G, U, \(\epsilon\) and E are mutually independent:

\[ \begin{aligned} X &= a + \beta_1 G + \beta_2 U + \epsilon \\ Y &= a + b_1 X + b_2 U + E \\ b_{OLS} &= cov(X, Y) / var(X) \\ &= cov(\beta_1 G + \beta_2 U + \epsilon, b_1(\beta_1 G + \beta_2 U + \epsilon) + b_2 U)/var(X) \\ &= cov(\beta_1 G + \beta_2 U + \epsilon, b_1\beta_1 G + b_1\beta_2 U + b_1\epsilon + b_2 U)/var(X) \\ &= [b_1\beta_1^2 var(G) + (b_1\beta_2^2 + b_2\beta_2) var(U) + b_1 var(\epsilon)]/var(X) \\ &= b_1 + b_2\beta_2 var(U)/var(X) \end{aligned} \]

and

\[ R_{OLS} = b_{OLS} sd(X)/sd(Y) \]

Therefore the OLS \(R^2\) of X and Y is

\[ R^2 = \left[ \frac{b_1\beta_1^2 var(G) + (b_1\beta_2^2 + b_2\beta_2) var(U) + b_1 var(\epsilon)}{sd(X) sd(Y)} \right]^2 \]
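As a worked example, take the parameter values used in the Check section (\(b_1 = 0.2\), \(b_2 = 3\), \(\beta_1 = 4\), \(\beta_2 = 5\)) and assume unit variances for G, U and \(\epsilon\). The numbers below are population values under those assumptions, using \(cov(X, Y) = b_1 var(X) + b_2\beta_2 var(U)\), which follows from substituting \(Y = a + b_1 X + b_2 U + E\) with E independent of X. Sample estimates will differ slightly.

\[ \begin{aligned} var(X) &= 4^2 + 5^2 + 1 = 42 \\ cov(X, Y) &= 0.2 \times 42 + 3 \times 5 = 23.4 \\ b_{OLS} &= 23.4 / 42 \approx 0.557 \end{aligned} \]

So confounding moves the OLS slope away from the causal value of 0.2 by \(b_2\beta_2 var(U)/var(X) = 15/42 \approx 0.357\).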

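Note that, because X and U are correlated (\(cov(X, U) = \beta_2 var(U)\)), the variance of Y includes a covariance term:

\[ \begin{aligned} var(X) &= sd(X)^2 = \beta_1^2var(G) + \beta_2^2var(U) + var(\epsilon) \\ var(Y) &= sd(Y)^2 = b_1^2var(X) + b_2^2var(U) + 2b_1b_2\beta_2var(U) + var(E) \\ \end{aligned} \]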

So if you want to hold \(R^2\) fixed across different effect sizes, you should be able to scale the residual variances \(var(\epsilon)\) and \(var(E)\) according to these formulae.
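One way to do this scaling is to fix \(var(\epsilon)\) and solve for \(var(E)\). The sketch below assumes G, U, \(\epsilon\) and E are mutually independent, so that \(cov(X, Y) = b_1 var(X) + b_2\beta_2 var(U)\) and \(var(Y) = b_1^2 var(X) + b_2^2 var(U) + 2b_1b_2\beta_2 var(U) + var(E)\), where the third term arises from the correlation between X and U. The function name `solve_var_E`, its arguments, and the target \(R^2\) of 0.5 are hypothetical choices for illustration.

```r
# Hypothetical helper: residual variance of Y needed to reach a target OLS R^2.
# Assumes G, U, epsilon and E are mutually independent.
solve_var_E <- function(target_rsq, b1, b2, beta1, beta2,
                        var_g = 1, var_u = 1, var_eps = 1) {
  var_x <- beta1^2 * var_g + beta2^2 * var_u + var_eps
  cov_xy <- b1 * var_x + b2 * beta2 * var_u
  # Variance of Y due to X and U, including the X-U covariance term
  var_y_structural <- b1^2 * var_x + b2^2 * var_u + 2 * b1 * b2 * beta2 * var_u
  # R^2 = cov_xy^2 / (var_x * var_y), rearranged for var(E)
  var_e <- cov_xy^2 / (target_rsq * var_x) - var_y_structural
  if (var_e < 0) stop("target R^2 is not attainable with these effect sizes")
  var_e
}

# Simulate with the solved residual variance and check the realised R^2
set.seed(1)
n <- 100000
b1 <- 0.2; b2 <- 3; beta1 <- 4; beta2 <- 5
ve <- solve_var_E(0.5, b1, b2, beta1, beta2)
u <- rnorm(n)
g <- rnorm(n)
x <- beta1 * g + beta2 * u + rnorm(n)
y <- b1 * x + b2 * u + rnorm(n, sd = sqrt(ve))
cor(x, y)^2  # should be close to the target of 0.5, up to sampling noise
```

The helper stops with an error when the requested \(R^2\) exceeds what the effect sizes allow with \(var(E) = 0\), since no non-negative residual variance can reach it.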

By contrast, the variance in Y explained by the causal effect of X is

\[ R^2_{IV, x,y} = b^2_1var(X) / var(Y) \]
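The contrast between the two quantities can be seen in a simulation. This is a sketch only: it uses the ratio (Wald) estimator \(cov(G, Y)/cov(G, X)\) as the IV estimate, which is chosen here for illustration, and it gives Y residual noise with a standard deviation of 1, which is an assumption. The effect sizes match those in the Check section.

```r
# Illustrative simulation; sd = 1 for the residual of y is an assumption
set.seed(1)
n <- 100000
b1 <- 0.2; b2 <- 3; beta1 <- 4; beta2 <- 5
u <- rnorm(n)  # unmeasured confounder
g <- rnorm(n)  # instrument
x <- beta1 * g + beta2 * u + rnorm(n)
y <- b1 * x + b2 * u + rnorm(n, sd = 1)

# Ratio (Wald) IV estimate of the causal effect of x on y
b_iv <- cov(g, y) / cov(g, x)

# Variance in y explained by the causal effect, versus the OLS R^2
rsq_causal <- b_iv^2 * var(x) / var(y)
rsq_ols <- cor(x, y)^2
c(b_iv = b_iv, rsq_causal = rsq_causal, rsq_ols = rsq_ols)
```

With these effect sizes the confounder contributes most of the covariance between x and y, so the OLS \(R^2\) should come out much larger than the causal one.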

Check

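set.seed(1)
# Causal effect of x on y (b1) and confounder effect on y (b2)
b1 <- 0.2
b2 <- 3
# Instrument effect on x (beta1) and confounder effect on x (beta2)
beta1 <- 4
beta2 <- 5
n <- 10000
u <- rnorm(n)  # unmeasured confounder
g <- rnorm(n)  # instrument
x <- u * beta2 + g * beta1 + rnorm(n)
# sd=0 means y has no residual noise, i.e. var(E) = 0
y <- u * b2 + x * b1 + rnorm(n, sd=0)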

Beta

cov(x, y)/var(x)
[1] 0.562716
summary(lm(y ~ x))$coef[2,1]
[1] 0.562716
( b1*beta1^2*var(g) + (b1*beta2^2 + b2*beta2) * var(u) ) / var(x)
[1] 0.5557103

Correlation

((( b1*beta1^2*var(g) + (b1*beta2^2 + b2*beta2) * var(u) ) / var(x)) * sd(x) / sd(y))^2
[1] 0.768337
cor(x, y)^2
[1] 0.7878316
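The formula-based values above (0.5557103 and 0.768337) do not exactly match the empirical ones (0.562716 and 0.7878316). Part of the gap appears to come from the residual of x: because y depends on x, cov(x, y) also contains b1 times the variance of that residual, which the expression above leaves out. The rest is sampling noise, since g, u and the residual are not exactly uncorrelated in a finite sample. The sketch below should reproduce the same draws, but stores the residual as `e` (a name introduced here) so that both effects can be inspected.

```r
set.seed(1)
b1 <- 0.2; b2 <- 3; beta1 <- 4; beta2 <- 5
n <- 10000
u <- rnorm(n)
g <- rnorm(n)
e <- rnorm(n)  # residual of x, stored so that its variance can be used
x <- u * beta2 + g * beta1 + e
y <- u * b2 + x * b1  # no residual noise in y, matching sd=0 above

b_hat <- cov(x, y) / var(x)

# Expression used above, without the residual term
b_no_eps <- (b1 * beta1^2 * var(g) + (b1 * beta2^2 + b2 * beta2) * var(u)) / var(x)

# Same expression with the b1 * var(e) contribution added
b_with_eps <- b_no_eps + b1 * var(e) / var(x)

# In-sample identity: holds exactly here because y has no residual noise
b_exact <- b1 + b2 * cov(x, u) / var(x)

c(b_hat = b_hat, b_no_eps = b_no_eps, b_with_eps = b_with_eps, b_exact = b_exact)
```

The last expression uses the sample covariance between x and u directly, so it absorbs the chance correlations among g, u and the residual that the variance-only formulae ignore.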

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.3 compiler_4.3.2    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.3.2       htmltools_0.5.7   yaml_2.3.7        rmarkdown_2.25   
 [9] knitr_1.45        jsonlite_1.8.7    xfun_0.41         digest_0.6.33    
[13] rlang_1.1.2       evaluate_0.23