Gibran Hemani’s lab book - Converting z to beta

Background

Summary imputation returns Z scores, which need to be converted back to beta and se. Approach:

\(se(\beta) = \sqrt{var(y)} / \sqrt{2p(1-p)n}\)
\(\beta = z \cdot se(\beta)\)

where \(p\) is the effect allele frequency and \(n\) is the sample size.

Assumes no inbreeding or departure from Hardy-Weinberg equilibrium, and constant N across SNPs. Also assumes collapsibility of effects, which is probably ok for small effects in logistic regression studies.
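As a rough check of the approximation, the sketch below simulates a single SNP under Hardy-Weinberg equilibrium and compares the standard error from a linear regression with the closed-form value. All parameter values (`n`, `p`, `b`) are hypothetical, and the trait is simulated with variance close to 1, so the trait-scale term is approximately 1 here.

```r
# Simulate one SNP under Hardy-Weinberg equilibrium and compare the
# regression standard error with the closed-form approximation.
set.seed(1)
n <- 10000      # sample size (hypothetical)
p <- 0.3        # effect allele frequency (hypothetical)
b <- 0.05       # true per-allele effect (hypothetical, small)

g <- rbinom(n, 2, p)          # genotypes coded 0/1/2
y <- b * g + rnorm(n)         # continuous trait, residual sd = 1

fit <- summary(lm(y ~ g))$coefficients
se_lm <- fit["g", "Std. Error"]
z <- fit["g", "t value"]

# Approximation: sd(y) / sqrt(2 p (1 - p) n)
se_approx <- sd(y) / sqrt(2 * p * (1 - p) * n)
beta_approx <- z * se_approx

c(se_lm = se_lm, se_approx = se_approx)
c(beta_lm = fit["g", "Estimate"], beta_approx = beta_approx)
```

The two standard errors should agree to within sampling error in the genotype variance; the agreement would be expected to worsen for large effects or when genotype frequencies depart from Hardy-Weinberg proportions.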

Instead of trying to estimate \(var(y)\), we could use a subset of known betas to obtain a correction factor for the imputed values, e.g.

1. Fit a linear model of the imputed betas on the known betas
2. Divide the imputed betas (and their standard errors) by the slope coefficient from (1)
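A minimal sketch of the correction-factor idea on simulated summary statistics. Everything here is hypothetical (the trait standard deviation `sd_y`, the number of SNPs `m`, the effect size distribution); the point is only that the slope of the unscaled betas on the known betas recovers the missing scale term, here `1 / sd_y`.

```r
set.seed(1)
m <- 1000       # number of SNPs (hypothetical)
sd_y <- 2       # trait standard deviation, treated as unknown below
n <- 50000      # constant sample size across SNPs

eaf <- runif(m, 0.05, 0.95)
se <- sd_y / sqrt(2 * eaf * (1 - eaf) * n)    # "true" reported se
beta <- rnorm(m, 0, 0.02) + rnorm(m) * se     # estimated effects
z <- beta / se

# Pretend beta is only known for a random 5% of SNPs
known <- sample(m, m * 0.05)
beta_known <- rep(NA_real_, m)
beta_known[known] <- beta[known]

# Convert z to beta assuming var(y) = 1
se_unscaled <- 1 / sqrt(2 * eaf * (1 - eaf) * n)
beta_unscaled <- z * se_unscaled

# Slope of unscaled betas on known betas estimates the scale factor
correction <- unname(coef(lm(beta_unscaled ~ beta_known))[2])
beta_corrected <- beta_unscaled / correction
se_corrected <- se_unscaled / correction

correction                          # 1 / sd_y in this idealised setting
all.equal(beta_corrected, beta)
```

In this idealised simulation the recovery is exact because the only thing missing is a single multiplicative constant; with real data, varying sample sizes or allele frequency mismatches would leave residual scatter.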

Continuous trait example

library(ieugwasr)

library(ggplot2)
library(dplyr)


# Setup data to have Z scores and some known betas
setup_data <- function(a) {
    a$z <- a$beta / a$se
    a$beta_known <- NA
    a$se_known <- NA
    # Treat beta and se as known for a random 5% of SNPs
    index <- sample(1:nrow(a), nrow(a)*0.05)
    a$beta_known[index] <- a$beta[index]
    a$se_known[index] <- a$se[index]
    return(a)
}

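# Function to convert Z to beta
make_beta <- function(dat) {
    # Provisional se assuming var(y) = 1; eaf is the effect allele frequency
    dat$senew <- 1 / sqrt(2 * dat$eaf * (1 - dat$eaf) * dat$n)
    dat$betanew <- dat$z * dat$senew
    # Slope of the provisional betas on the known betas gives the scale factor
    correction <- lm(dat$betanew ~ dat$beta_known)$coef[2]
    # Rescale betas and standard errors onto the scale of the known betas
    dat$betanew <- dat$betanew / correction
    dat$senew <- dat$senew / correction
    return(dat)
}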

Get LDL cholesterol region as an example

a <- associations("1:11000000-12000000", "ieu-a-300") %>% setup_data()
a <- make_beta(a)

cor(a$beta, a$betanew, use="pair")

[1] 0.9751509

summary(lm(a$beta ~ a$betanew))


Call:
lm(formula = a$beta ~ a$betanew)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.044206 -0.000829  0.000218  0.000906  0.014450 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.456e-04  9.237e-05  -4.824  1.7e-06 ***
a$betanew    8.591e-01  7.047e-03 121.903  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.002518 on 767 degrees of freedom
  (65 observations deleted due to missingness)
Multiple R-squared:  0.9509,    Adjusted R-squared:  0.9509 
F-statistic: 1.486e+04 on 1 and 767 DF,  p-value: < 2.2e-16

plot(a$beta, a$betanew)

plot(a$se, a$senew)

Case control study example

Use coronary heart disease (CHD) as the example

b <- associations("1:11000000-12000000", "ieu-a-7") %>% setup_data()
b <- make_beta(b)

cor(b$beta, b$betanew, use="pair")

[1] 0.9868891

summary(lm(b$beta ~ b$betanew))


Call:
lm(formula = b$beta ~ b$betanew)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.035710 -0.001275  0.000141  0.001254  0.040604 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.644e-04  6.768e-05  -2.428   0.0152 *  
b$betanew    1.049e+00  3.073e-03 341.322   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.003779 on 3116 degrees of freedom
Multiple R-squared:  0.974, Adjusted R-squared:  0.9739 
F-statistic: 1.165e+05 on 1 and 3116 DF,  p-value: < 2.2e-16

plot(b$beta, b$betanew)

plot(b$se, b$senew)

Deviation from expectation?

There is some deviation, most likely due to sample sizes differing across SNPs. We can check that the deviation doesn’t track with allele frequency:

b$maf <- b$eaf
b$maf[b$maf > 0.5] <- 1 - b$maf[b$maf > 0.5]
ggplot(b, aes(x=beta, y=betanew)) + geom_point(aes(colour=maf)) + geom_smooth(method="lm")

`geom_smooth()` using formula = 'y ~ x'

ggplot(b, aes(x=se, y=senew)) + geom_point(aes(colour=maf)) + geom_smooth(method="lm")

`geom_smooth()` using formula = 'y ~ x'
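To put a number on the visual check, one option is to regress the log ratio of reported to converted standard errors on MAF and on sample size. The sketch below does this on simulated data where a single sample size (`n_reported`) is assumed for all SNPs while the per-SNP sample size (`n_true`) actually varies; both names and all values are hypothetical and are not taken from the datasets above.

```r
set.seed(1)
m <- 2000
maf <- runif(m, 0.01, 0.5)
n_true <- round(runif(m, 40000, 60000))   # per-SNP sample size actually used
n_reported <- 60000                       # single value assumed for all SNPs

# Reported se (with sd(y) = 1) and se converted under the assumed n
se <- 1 / sqrt(2 * maf * (1 - maf) * n_true)
senew <- 1 / sqrt(2 * maf * (1 - maf) * n_reported)

log_ratio <- log(se / senew)

# Slope on MAF should be near zero if the deviation is unrelated to frequency
slope_maf <- unname(coef(lm(log_ratio ~ maf))["maf"])

# Slope on log(n) is -0.5 by construction, since se scales with 1 / sqrt(n)
slope_n <- unname(coef(lm(log_ratio ~ log(n_true)))[2])

c(slope_maf = slope_maf, slope_n = slope_n)
```

With real data, the same two regressions could be run on `se`, `senew` and `maf`; a slope on MAF that is clearly non-zero would point to an allele frequency problem rather than a sample size one.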

sessionInfo()

R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4    ggplot2_3.5.1  ieugwasr_1.0.0

loaded via a namespace (and not attached):
 [1] Matrix_1.7-0      gtable_0.3.5      jsonlite_1.8.8    compiler_4.4.0   
 [5] tidyselect_1.2.1  splines_4.4.0     scales_1.3.0      yaml_2.3.8       
 [9] fastmap_1.2.0     lattice_0.22-6    R6_2.5.1          labeling_0.4.3   
[13] generics_0.1.3    curl_5.2.1        knitr_1.47        htmlwidgets_1.6.4
[17] tibble_3.2.1      munsell_0.5.1     pillar_1.9.0      rlang_1.1.3      
[21] utf8_1.2.4        xfun_0.44         cli_3.6.2         withr_3.0.0      
[25] magrittr_2.0.3    mgcv_1.9-1        digest_0.6.35     grid_4.4.0       
[29] lifecycle_1.0.4   nlme_3.1-164      vctrs_0.6.5       evaluate_0.23    
[33] glue_1.7.0        farver_2.1.2      fansi_1.0.6       colorspace_2.1-0 
[37] rmarkdown_2.27    httr_1.4.7        tools_4.4.0       pkgconfig_2.0.3  
[41] htmltools_0.5.8.1