library(simulateGP)
library(tidyverse)
library(jsonlite)
set.seed(12345)
config <- read_json("generate_ldobj_config.json")
datadir <- file.path(config$dl_dir, "AFR_1kg_hm3")
We’d like to be able to simulate realistic GWAS summary data for a
large number (e.g. millions) of SNPs, representing arbitrarily large
sample sizes. The simplest way to do this is to take a genetic reference
dataset (e.g. UK Biobank, 500,000 samples), generate a phenotype based
on the SNPs that we want to be causal, and then run a GWAS on the
simulated phenotype to obtain the GWAS summary data. However, this
would be very slow. The simulateGP package has a few methods to rapidly
simulate GWAS summary data directly, without going via analysis of
individual-level datasets.
Features of GWAS:
We use the following model to describe the distribution of effect sizes for each SNP \(j = 1, \dots, M\) across the genome:
\[ \beta_j \sim N(0, [2p_j (1-p_j)]^S \sigma^2_\beta)\pi + \vartheta (1 - \pi) \]
where \(\beta_j\) is the effect size, \(p_j\) is the allele frequency, \(S\) is the selection coefficient acting on the trait, \(\pi\) is the fraction of SNPs that have an effect and \(\vartheta()\) represents a null value for all SNPs that have no effect. If the phenotypic variance is 1 then \(M \sigma^2_\beta\) is the heritability.
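To make the model concrete, here is a minimal hand-rolled sketch of drawing effect sizes from this mixture. It is illustrative only - the package functions used below implement this properly - and the scaling of \(\sigma^2_\beta\) simply follows the statement above that \(M \sigma^2_\beta\) gives the heritability.
# Illustrative sketch only - generate_gwas_params() below does this for you
M <- 10000                               # number of SNPs
pi_frac <- 0.01                          # fraction of SNPs with an effect
h2 <- 0.4                                # target heritability
S <- -0.4                                # selection coefficient
p <- runif(M, 0.01, 0.99)                # allele frequencies
causal <- rbinom(M, 1, pi_frac) == 1     # which SNPs have an effect
sigma2_b <- h2 / sum(causal)             # assumes M * sigma^2_beta = h2 (see text)
beta <- numeric(M)
beta[causal] <- rnorm(sum(causal), mean = 0,
                      sd = sqrt((2 * p[causal] * (1 - p[causal]))^S * sigma2_b))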
We can generate effect sizes for an arbitrary number of SNPs by starting with a set of allele frequencies and supplying the relevant parameters described above.
First create a map of all the SNPs to be simulated, containing at least the SNP names and the effect allele frequencies:
nsnp <- 1000000
map <- arbitrary_map(runif(nsnp, 0.01, 0.99))
Next, generate effect sizes for each SNP
params <- generate_gwas_params(map=map, h2=0.4, S=-0.4, Pi=0.005)
Here there is negative selection acting on a small number of causal variants, and the overall heritability explained by all causal variants is 0.4. Observe the relationship between allele frequency and effect size for this selection coefficient:
ggplot(params[params$beta != 0, ], aes(af, abs(beta))) +
  geom_point() +
  geom_smooth()
Finally, generate sample effect estimates, standard errors and p-values for an arbitrary sample size, e.g. 10 million samples here:
nid <- 10000000
ss <- generate_gwas_ss(params, nid=nid)
ss
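A couple of quick sanity checks on the output. The column names bhat, se and pval used here are assumptions - check names(ss) against what generate_gwas_ss actually returns.
# Column names bhat/se/pval are assumed - inspect names(ss) first
summary(ss$se)                          # standard errors should be tiny at n = 10 million
table(ss$pval < 5e-8)                   # how many SNPs reach genome-wide significance
idx <- sample(nrow(ss), 10000)          # subsample for plotting
plot(params$beta[idx], ss$bhat[idx])    # estimates should track the true effects closely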
Note that you can generate more complex underlying genetic architectures manually. For example, if there is a mixture of distributions where a few SNPs have large effects and many have small effects:
map <- rbind(
  tibble(af = runif(10, 0.4, 0.5), group = 1),
  tibble(af = runif(10000, 0.01, 0.5), group = 2),
  tibble(af = runif(1000000 - 10 - 10000, 0.01, 0.5), group = 3)
) %>%
  mutate(snp = 1:n(), pos = snp, chr = 99, ea = "1", oa = "2")
param <- rbind(
  generate_gwas_params(subset(map, group == 1), h2 = 0.1, S = 0),
  generate_gwas_params(subset(map, group == 2), h2 = 0.3, S = 0),
  generate_gwas_params(subset(map, group == 3), h2 = 0, S = 0)
)
# Generate GWAS summary stats
res <- generate_gwas_ss(param, 450000)
res
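We can check that each group contributes roughly the intended amount of heritability, using the per-SNP contribution \(2 p_j (1 - p_j) \beta_j^2\). This assumes the group and af columns from map are carried through by generate_gwas_params - if not, join them back on by SNP.
# Variance explained per group (assumes `group` and `af` are retained in param)
param %>%
  group_by(group) %>%
  summarise(h2 = sum(2 * af * (1 - af) * beta^2))
# expect roughly 0.1, 0.3 and 0 for groups 1, 2 and 3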
The above effects assume all SNPs are completely independent. In order for the characteristic patterns of LD around a causal variant to be generated, we need to transform the distribution of effects according to the correlation matrix amongst the SNPs.
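Concretely, with standardised genotypes the expected marginal (GWAS) effect at each SNP is the LD-weighted sum of the causal effects of the SNPs it is correlated with, i.e. the causal effect vector multiplied by the LD correlation matrix. A toy three-SNP sketch (not the package implementation):
# Toy illustration of how LD spreads a causal effect to correlated SNPs
R <- matrix(c(1,   0.8, 0.2,
              0.8, 1,   0.3,
              0.2, 0.3, 1), nrow = 3)    # LD correlation matrix
b_causal <- c(0.1, 0, 0)                 # only SNP 1 is causal
b_marginal <- as.vector(R %*% b_causal)  # SNPs 2 and 3 pick up association through LD
b_marginal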
We need a set of LD matrices to generate realistic LD patterns across a large set of SNPs. Pre-computed LD matrices are available for download for three major human populations.
The above are large (~5Gb) .tar files. Download one and extract it; it should contain a directory called <POP>_1kg_hm3/, which contains a map.rds file and a number of ldobj_<region>.rds files. We will use the map.rds as our genome scaffold - it contains the SNPs, chromosomes, positions, alleles and allele frequencies for the specific population. We can then select causal variants for the trait as above, and introduce the LD transformation before sampling the effect estimates / SEs / p-values etc. from the LD-aware effect sizes.
Set the LD matrix data directory:
file.exists(datadir)
Begin with a small number of SNPs for illustration, e.g. let’s use the first region in the set of regions downloaded:
fn <- list.files(datadir, full.names = TRUE) %>%
  grep("ldobj_", ., value = TRUE) %>%
  {.[1]}
ldobj <- readRDS(fn)
str(ldobj)
We see it contains a list with two items: the map and the ld objects. The map is a data frame containing information about the SNPs in the region, and the ld object is a matrix of LD correlations amongst all the SNPs in the region. To use this to transform our data, let’s simulate a single causal variant in the region:
params <- ldobj$map %>%
generate_gwas_params(h2=0.01, S=0, Pi=1/nrow(.))
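A quick check that the parameters contain exactly one causal variant (using the beta column as in the earlier examples):
sum(params$beta != 0)    # should be 1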
We can now sample effect estimates for either the raw effects or the LD-aware effects
ss_raw <- params %>%
generate_gwas_ss(nid=10000)
ss_ld <- params %>%
generate_gwas_ss(nid=10000, ldobj=ldobj)
Compare the associations in the region. With no LD:
ggplot(ss_raw, aes(x=pos, y=-log10(pval))) +
geom_point()
Now when LD is added:
ggplot(ss_ld, aes(x=pos, y=-log10(pval))) +
geom_point()
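We can also check how the top association relates to the true causal variant. The snp and pval column names are assumptions - adjust them to the actual output.
# Which SNP is truly causal, and which is the top hit in each simulation?
causal_snp <- params$snp[params$beta != 0]
causal_snp
ss_raw$snp[which.min(ss_raw$pval)]   # without LD this should usually be the causal SNP
ss_ld$snp[which.min(ss_ld$pval)]     # with LD the top hit can be a correlated neighbour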
The same process can be achieved for more complex architectures across the entire genome. It takes about 2 minutes to generate the LD-aware effect sizes and effect estimates.
map <- readRDS(file.path(datadir, "map.rds"))
params <- map %>%
generate_gwas_params(h2=0.5, S=-0.3, Pi=1000/nrow(.)) %>%
generate_gwas_ss(nid=1000000, ldobjdir=datadir)
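As a quick summary of the genome-wide results (note that the pipeline above stores the summary statistics in params; the chr and pval column names are assumptions):
# Count genome-wide significant SNPs per chromosome (column names assumed)
params %>%
  filter(pval < 5e-8) %>%
  count(chr)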
Finally, we can draw joint estimates from two different GWASs that share
some amount of sample overlap, using the summary_set function.
First, for a set of variants, generate the true effect sizes for each of two different datasets. Here we make a simple example using the single small region as above; trait 1 has a causal effect of 0.3 on trait 2.
params1 <- ldobj$map %>%
generate_gwas_params(h2=0.01, S=0, Pi=3/nrow(.))
params2 <- params1
params2$beta <- params1$beta * 0.3
Now sample the effect estimates jointly. We specify the sample sizes of each GWAS, and the number of overlapping samples. We also specify an observational correlation, which represents the total association due to the causal relationship and any confounding.
ss <- summary_set(
  beta_gx = params1$beta,
  beta_gy = params2$beta,
  af = params1$af,
  n_gx = 100000,
  n_gy = 10000,
  n_overlap = 5000,
  cor_xy = 0.5
)
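As a rough check, the simulated causal effect should be recoverable from the joint summary set via Wald ratios at the causal SNPs. The bhat_x and bhat_y column names below are hypothetical placeholders - check names(ss) for what summary_set actually returns.
# Wald ratios at the causal SNPs (bhat_x / bhat_y are placeholder names)
iv <- which(params1$beta != 0)
ss$bhat_y[iv] / ss$bhat_x[iv]    # each ratio should scatter around 0.3, noisily given n_gy = 10000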
In principle each of these three components - the genetic architecture of effect sizes, the LD structure, and the joint sampling of effect estimates with sample overlap - is an independent, modular aspect of the data generating process. This means that we can combine them freely.
As an illustration, assume a polygenic model for both trait X and trait Y, where trait X also has a causal effect on trait Y.
# Define the causal effect
b_xy <- 0.3
# Start with a variant map
map <- readRDS(file.path(datadir, "map.rds"))
# Generate genetic effects on trait 1
paramsx <- map %>%
generate_gwas_params(h2=0.5, S=-0.3, Pi=1000/nrow(.))
# Generate genetic effects on trait 2
paramsy <- map %>%
generate_gwas_params(h2=0.5, S=-0.3, Pi=2000/nrow(.))
paramsy$beta <- paramsy$beta + paramsx$beta * b_xy
# Introduce LD
paramsx <- paramsx %>%
add_ld_to_params(ldobjdir=datadir)
paramsy <- paramsy %>%
add_ld_to_params(ldobjdir=datadir)
# Create sample estimates where there is some non-genetic confounding of X and Y and partial overlap of the G-X and G-Y datasets
ss <- summary_set(
  beta_gx = paramsx$beta,
  beta_gy = paramsy$beta,
  af = paramsx$af,
  n_gx = 100000,
  n_gy = 100000,
  n_overlap = 50000,
  cor_xy = 0.5
)
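Finally, a hedged sketch of one way to check this simulation: an IVW-style regression of the trait Y effect estimates on the trait X effect estimates at the SNPs that influence X should give a slope near b_xy, subject to bias from LD, confounding and sample overlap. The bhat_x, bhat_y and se_y column names are hypothetical placeholders for whatever summary_set returns - substitute the actual names.
# IVW-style slope (column names are placeholders - substitute the real ones)
iv <- which(paramsx$beta != 0)
summary(lm(ss$bhat_y[iv] ~ 0 + ss$bhat_x[iv], weights = 1 / ss$se_y[iv]^2))$coefficients
# the slope should be approximately b_xy = 0.3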