MR and imperfect clumping

Author

Gibran Hemani

Published

July 13, 2023

Background

4000 instruments for educational attainment using clumping r2 = 0.1, and se doubles when using r2 = 0.001.

That smaller standard error is either due to the R2 in the exposure being higher or the non-independence of effects artificially increasing precision, or a mixture of both.

So the question is the impact of the latter – if we have some true correlation structure with realistic F stats at a specific locus, and then we try to clump at r2 = 0.001 vs 0.1, how many instruments do we retain (it should be 1) and if more than 1, what is that impact on the standard error

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(simulateGP)
library(TwoSampleMR)
TwoSampleMR version 0.5.7 
[>] New: Option to use non-European LD reference panels for clumping etc
[>] Some studies temporarily quarantined to verify effect allele
[>] See news(package='TwoSampleMR') and https://gwas.mrcieu.ac.uk for further details

Attaching package: 'TwoSampleMR'
The following objects are masked from 'package:simulateGP':

    allele_frequency, contingency, get_population_allele_frequency
library(purrr)
library(ggplot2)

Simulate causal snp g + another that is correlated with it

n <- 100000
g <- correlated_binomial(n, p1=0.5, p2=0.5, rho=sqrt(0.1))
# x caused by just snp 1
x <- g[,1] * 0.5 + rnorm(n)
y <- x * 0.5 + rnorm(n)

# MR using both SNPs, treating as if they are independent
get_effs(x, y, g) %>% mr(method="mr_ivw") %>% str
Analysing 'X' on 'Y'
'data.frame':   1 obs. of  9 variables:
 $ id.exposure: chr "X"
 $ id.outcome : chr "Y"
 $ outcome    : chr "Y"
 $ exposure   : chr "X"
 $ method     : chr "Inverse variance weighted"
 $ nsnp       : int 2
 $ b          : num 0.497
 $ se         : num 0.00956
 $ pval       : num 0
# MR using just the causal SNP
get_effs(x, y, g[,1, drop=F]) %>% mr(method=c("mr_ivw", "mr_wald_ratio")) %>% str
Analysing 'X' on 'Y'
'data.frame':   1 obs. of  9 variables:
 $ id.exposure: chr "X"
 $ id.outcome : chr "Y"
 $ outcome    : chr "Y"
 $ exposure   : chr "X"
 $ method     : chr "Wald ratio"
 $ nsnp       : num 1
 $ b          : num 0.496
 $ se         : num 0.0101
 $ pval       : num 0

There’s hardly any difference in the SE here. Try over a range of scenarios

param <- expand.grid(
    r2=seq(0, 1, by=0.02),
    bgx=seq(0,1, by=0.2),
    bxy=seq(0,1, by=0.2),
    n=100000
)
param$sim <- 1:nrow(param)
dim(param)
[1] 1836    5
res <- map(1:nrow(param), \(i){
    g <- correlated_binomial(param$n[i], p1=0.5, p2=0.5, rho=sqrt(param$r2[i]))
    # x caused by just snp 1
    x <- g[,1] * param$bgx[i] + rnorm(n)
    y <- x * param$bxy[i] + rnorm(n)

    bind_rows(
        get_effs(x, y, g) %>% {suppressMessages(mr(., method="mr_ivw"))},
        get_effs(x, y, g[,1, drop=F]) %>% {suppressMessages(mr(., method="mr_wald_ratio"))}
    ) %>% mutate(sim=param$sim[i]) %>% return()
}) %>% bind_rows %>% inner_join(param, ., by="sim")
Warning in summary.lm(stats::lm(b_out ~ -1 + b_exp, weights = 1/se_out^2)):
essentially perfect fit: summary may be unreliable

Standard errors across all scenarios

ggplot(res, aes(x=r2, y=se)) +
geom_point(aes(colour=as.factor(nsnp))) +
geom_line(aes(colour=as.factor(nsnp))) +
facet_grid(bgx ~ bxy, labeller=label_both, scale="free_y")

Bias across all scenarios:

ggplot(res, aes(x=r2, y=b)) +
geom_point(aes(colour=as.factor(nsnp))) +
geom_line(aes(colour=as.factor(nsnp))) +
facet_grid(bgx ~ bxy, labeller=label_both)

Look at just one

ggplot(res %>% filter(bgx == 0.2, bxy == 0.2), aes(x=r2, y=se)) +
geom_point(aes(colour=as.factor(nsnp))) +
geom_line(aes(colour=as.factor(nsnp))) +
facet_grid(bgx ~ bxy, labeller=label_both, scale="free_y")

Summary

  • Relaxed r2 e.g. from 0 to 0.1 doesn’t seem to have a huge impact on standard errors
  • In the one SNP situation relaxed r2 has no impact on bias, and could only plausibly change things under substantial heterogeneity which correlates with overrepresentation.
  • More realistic simulations would look at whether this changes when the p-value at the second locus is very large, and would also look at the probability of erroneously keeping multiple loci for a single causal variant
  • Some instability in SEs when correlated SNPs used

sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Rome
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.2     purrr_1.0.1       TwoSampleMR_0.5.7 simulateGP_0.1.2 
[5] dplyr_1.1.2      

loaded via a namespace (and not attached):
 [1] vctrs_0.6.2       cli_3.6.1         knitr_1.43        rlang_1.1.1      
 [5] xfun_0.39         generics_0.1.3    jsonlite_1.8.5    labeling_0.4.2   
 [9] glue_1.6.2        colorspace_2.1-0  plyr_1.8.8        htmltools_0.5.5  
[13] scales_1.2.1      fansi_1.0.4       rmarkdown_2.22    grid_4.3.0       
[17] munsell_0.5.0     evaluate_0.21     tibble_3.2.1      fastmap_1.1.1    
[21] yaml_2.3.7        lifecycle_1.0.3   compiler_4.3.0    Rcpp_1.0.10      
[25] htmlwidgets_1.6.2 pkgconfig_2.0.3   rstudioapi_0.14   farver_2.1.1     
[29] digest_0.6.31     R6_2.5.1          tidyselect_1.2.0  utf8_1.2.3       
[33] pillar_1.9.0      magrittr_2.0.3    withr_2.5.0       gtable_0.3.3     
[37] tools_4.3.0