Gibran Hemani’s lab book - Differences in cancer prevalence and heritability

Background

Richard Peto points out, and later George Davey Smith, that cancer prevalence is very different across geographical regions and across time periods within geographical regions. Such systematic differences are too rapid to be explained by genetic factors, and too structured to be explained by chance. When comparing the rates of cancer between the highest 20% and lowest 20% prevalence, they suggest that ~80% of cancer is due to modifiable risk factors. Similar rates for mortality differences, meaning that it’s plausible that this isn’t due to differential rates of cancer detection.

How does this square with heritability? It implies that 80% of the variance in cancer is due to a set of modifiable factors. But if h2 is higher than 20% then is this possible?

Consider smoking and lung cancer. If smoking is partially genetic and partially environmental, what happens in this high vs low prevalence calculation?

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)

Positive control, lets make it so that 80% of the trait is due to environment and 20% random chance

n <- 10000

dat <- bind_rows(
    tibble(
        envir = rbinom(n, 1, 0) * 10,
        cancer = rbinom(n, 1, plogis(-1.9 + envir)),
        cat="low"
    ),
    tibble(
        envir = rbinom(n, 1, 0.8) * 10,
        cancer = rbinom(n, 1, plogis(-1.9 + envir)),
        cat="high"
    )
) 

cor(dat$envir[dat$cat=="high"], dat$cancer[dat$cat=="high"])^2

[1] 0.8200225

cor(dat$envir, dat$cancer)^2

[1] 0.7139614

dat %>%
    group_by(cat) %>%
    summarise(prev_cancer = sum(cancer) / n(), mean_env = mean(envir))

# A tibble: 2 × 3
  cat   prev_cancer mean_env
  <chr>       <dbl>    <dbl>
1 high        0.824     7.94
2 low         0.135     0

Now introduce genetic variance

There is a heritability of 50% which is independent of environmental effect

dat <- bind_rows(
    tibble(
        envir = rbinom(n, 1, 0) * 10,
        g = rnorm(n),
        cancer = rbinom(n, 1, plogis(-1.9 + envir + g*10)),
        cat="low"
    ),
    tibble(
        envir = rbinom(n, 1, 0.8) * 10,
        g = rnorm(n),
        cancer = rbinom(n, 1, plogis(-1.9 + envir + g*10)),
        cat="high"
    )
) 
cor(dat$g[dat$cat=="high"], dat$cancer[dat$cat=="high"])^2

[1] 0.4576331

dat %>%
    group_by(cat) %>%
    summarise(prev_cancer = sum(cancer) / n(), mean_env = mean(envir))

# A tibble: 2 × 3
  cat   prev_cancer mean_env
  <chr>       <dbl>    <dbl>
1 high        0.714     8.03
2 low         0.436     0

That raises the prevalence in the low group. Now lets make the genetic factor mediated through behaviour

dat <- bind_rows(
    tibble(
        g = rnorm(n),
        e = rnorm(n, sd=1),
        envir = rbinom(n, 1, plogis(g + e)) * 10,
        cancer = rbinom(n, 1, plogis(-1.9 + envir)),
        cat="low"
    ),
    tibble(
        g = rnorm(n),
        e = rnorm(n, sd=10),
        envir = rbinom(n, 1, plogis(g + e)) * 10,
        cancer = rbinom(n, 1, plogis(-1.9 + envir)),
        cat="high"
    )
) 
cor(dat$g[dat$cat=="high"], dat$cancer[dat$cat=="high"])^2

[1] 0.002550874

dat %>%
    group_by(cat) %>%
    summarise(prev_cancer = sum(cancer) / n(), mean_env = mean(envir))

# A tibble: 2 × 3
  cat   prev_cancer mean_env
  <chr>       <dbl>    <dbl>
1 high        0.571     5.04
2 low         0.566     4.94

I think you need a G x E interaction. You might have a high liability to smoke but the availability of cigarettes is very low, so that genetic variance cannot manifest. How does that appear as a statistical model? It will be the genetic risk on outcome + environmental risk on outcome, multiplied by accessibility to the outcome

dat <- bind_rows(
    tibble(
        g = rbeta(n, 3, 1),
        accessibility = rbinom(n, 1, 0.001),
        envir = rbinom(n, 1, g * accessibility),
        cancer = rbinom(n, 1, plogis(-1.9 + envir * 10)),
        cat="low"
    ),
    tibble(
        g = rbeta(n, 3, 1),
        accessibility = rbinom(n, 1, 0.999),
        envir = rbinom(n, 1, g * accessibility),
        cancer = rbinom(n, 1, plogis(-1.9 + envir * 10)),
        cat="high"
    )
) 
cor(dat$g[dat$cat=="high"], dat$cancer[dat$cat=="high"])^2

[1] 0.1662182

cor(dat$g[dat$cat=="low"], dat$cancer[dat$cat=="low"])^2

[1] 2.929504e-05

dat %>%
    group_by(cat) %>%
    summarise(prev_cancer = sum(cancer) / n(), mean_env = mean(envir), mean_acc=mean(accessibility), mean_env_p = mean(plogis(envir)))

# A tibble: 2 × 5
  cat   prev_cancer mean_env mean_acc mean_env_p
  <chr>       <dbl>    <dbl>    <dbl>      <dbl>
1 high        0.779   0.749    0.999       0.673
2 low         0.132   0.0003   0.0005      0.500

But here the heritability is limited to the non-genetic variance.

Allow liability variance to change

Simulate twins
Estimate heritability in high and low prevalence groups
Estimate fraction of cases change

15-30x more likely to get lung cancer if you smoke 1.5% of non-smokers get lung cancer 12% of smokers get lung cancer

n <- 1000000
dat <- bind_rows(
    tibble(
        g = rnorm(n, sd=1),
        e = rnorm(n, sd=1),
        smoking = 2,
        liability = -4 + g + e + smoking,
        cancer = rbinom(n, 1, plogis(liability)),
        cat="high"
    ),
    tibble(
        g = rnorm(n, sd=1),
        e = rnorm(n, sd=1),
        smoking = 0,
        liability = -4 + g + e + smoking,
        cancer = rbinom(n, 1, plogis(liability)),
        cat="low"
    ),
)

group_by(dat, cat) %>%
    summarise(
        prev_cancer = sum(cancer) / n(),
        mean_g = mean(g),
        mean_e = mean(e),
        mean_smoking = mean(smoking),
        mean_liability = mean(liability),
        h2 = cor(g, liability)^2,
        h2obs = cor(g, cancer)^2,
    )

# A tibble: 2 × 8
  cat   prev_cancer    mean_g    mean_e mean_smoking mean_liability    h2  h2obs
  <chr>       <dbl>     <dbl>     <dbl>        <dbl>          <dbl> <dbl>  <dbl>
1 high       0.184  -0.00131  -0.000625            2          -2.00 0.499 0.0904
2 low        0.0405  0.000666 -0.000439            0          -4.00 0.499 0.0311

cor(dat$cancer, dat$g)^2

[1] 0.05730214

cor(dat$smoking, dat$cancer)^2

[1] 0.05152482

ggplot(dat, aes(x=liability)) +
geom_density(aes(fill=cat), alpha=0.5)

Twin model

make_families <- function(af, nfam, beta) {
    nsnp <- length(af)
    dads <- matrix(0, nfam, nsnp)
    mums <- matrix(0, nfam, nsnp)
    sibs1 <- matrix(0, nfam, nsnp)
    sibs2 <- matrix(0, nfam, nsnp)
    sibs3 <- matrix(0, nfam, nsnp)
    for(i in 1:nsnp)
    {
        dad1 <- rbinom(nfam, 1, af[i]) + 1
        dad2 <- (rbinom(nfam, 1, af[i]) + 1) * -1
        mum1 <- rbinom(nfam, 1, af[i]) + 1
        mum2 <- (rbinom(nfam, 1, af[i]) + 1) * -1

        dadindex <- sample(c(TRUE, FALSE), nfam, replace=TRUE)
        dadh <- rep(NA, nfam)
        dadh[dadindex] <- dad1[dadindex]
        dadh[!dadindex] <- dad2[!dadindex]

        mumindex <- sample(c(TRUE, FALSE), nfam, replace=TRUE)
        mumh <- rep(NA, nfam)
        mumh[mumindex] <- mum1[mumindex]
        mumh[!mumindex] <- mum2[!mumindex]

        sib1 <- cbind(dadh, mumh)

        dadindex <- sample(c(TRUE, FALSE), nfam, replace=TRUE)
        dadh <- rep(NA, nfam)
        dadh[dadindex] <- dad1[dadindex]
        dadh[!dadindex] <- dad2[!dadindex]

        mumindex <- sample(c(TRUE, FALSE), nfam, replace=TRUE)
        mumh <- rep(NA, nfam)
        mumh[mumindex] <- mum1[mumindex]
        mumh[!mumindex] <- mum2[!mumindex]

        sib2 <- cbind(dadh, mumh)

        sibs1[,i] <- rowSums(abs(sib1) - 1)
        sibs2[,i] <- rowSums(abs(sib2) - 1)
    }

    sdat <- bind_rows(
        tibble(fid = 1:nfam, iid = paste0(1:nfam, "a"), prs = drop(sibs1 %*% beta)),
        tibble(fid = 1:nfam, iid = paste0(1:nfam, "b"), prs = drop(sibs2 %*% beta))
    )
    return(sdat)
}

make_twins <- function(nmz, ndz, nsnp, bsd) {
    af <- runif(nsnp, 0.01, 0.99)
    b <- rnorm(nsnp, sd=bsd)
    g <- sapply(af, function(x) rbinom(nmz, 2, x))
    prs <- drop(g %*% b)
    dim(g)
    mz <- tibble(
        fid = c(1:nmz, 1:nmz),
        iid = c(paste0(1:nmz, "a"), paste0(1:nmz, "b")),
        prs = c(prs, prs),
        what = "mz"
    )

    dz <- make_families(af, ndz, b) %>% mutate(what="dz")
    bind_rows(mz, dz)
}

gx_to_gp <- function(gx, h2x, prev) {
    x_prime <- qnorm(prev, 0, 1, lower.tail=FALSE)
    p <- pnorm(x_prime, mean=gx, sd = 1 - sqrt(h2x), lower.tail=FALSE)
    return(p)
}
rbinom(100000, 1, gx_to_gp(rnorm(100000), 0.5, 0.014)) %>% table %>% prop.table

.
      0       1 
0.98331 0.01669

make_twins2 <- function(nmz, ndz, prev, h2) {
    mzg <- rnorm(nmz)
    dzg <- mvtnorm::rmvnorm(ndz, sigma=matrix(c(1,0.5,0.5,1), 2, 2))
    mz <- tibble(
        fid = paste(c(1:nmz, 1:nmz), "mz", sep="_"),
        iid = c(paste0(1:nmz, "a"), paste0(1:nmz, "b")),
        g = c(mzg, mzg),
        what = "mz"
    )
    dz <- tibble(
        fid = paste(c(1:ndz, 1:ndz), "dz", sep="_"),
        iid = c(paste0(1:ndz, "a"), paste0(1:ndz, "b")),
        g = c(dzg[,1], dzg[,2]),
        what = "dz"
    )
    d <- bind_rows(mz, dz) %>%
        mutate(
            p = gx_to_gp(g, h2, prev),
            cancer = rbinom(n(), 1, p)
        )
    return(d)
}

get_twin_h2 <- function(x) {
    a <- subset(x, grepl("a", iid))
    b <- subset(x, grepl("b", iid))
    ab <- inner_join(a, b, by=c("fid", "what"))
    ab %>% group_by(what) %>%
        summarise(
            cor = cor(cancer.x, cancer.y)
        ) %>% pivot_wider(names_from=what, values_from=cor) %>%
        mutate(h2 = 0.5 * (mz - dz))
}

get_twin_h2_binary <- function(x) {
    a <- subset(x, grepl("a", iid))
    b <- subset(x, grepl("b", iid))
    ab <- inner_join(a, b, by=c("fid", "what"))
    temp <- group_by(ab, what, cancer.x, cancer.y) %>% summarise(n=n()) %>% ungroup()
    temp %>% group_by(what) %>%
        mutate(
            cl = paste(cancer.x, cancer.y),
        ) %>%
        summarise(
            t = (n[cl=="1 1"] * n[cl=="0 0"] - n[cl=="1 0"] * n[cl == "0 1"]) / (n[cl=="1 1"] * n[cl=="0 0"] + n[cl=="1 0"] * n[cl == "0 1"])
        ) %>% ungroup() %>%
        summarise(h2 = 2 * (t[2]-t[1]))
}

get_twin_h2_falconer <- function(x) {
    qg <- subset(x, grepl("a", iid)) %>% summarise(prev = sum(cancer) / n()) %>% {.$prev}

    xg <- qnorm(1 - qg)

    fids_mz <- subset(x, what == "mz" & cancer == 1 & grepl("a", iid))$fid
    fids_dz <- subset(x, what == "dz" & cancer == 1 & grepl("a", iid))$fid

    x1 <- subset(x, (what == "mz" & fid %in% fids_mz) | (what == "dz" & fid %in% fids_dz))
    p <- x1 %>% group_by(what, fid) %>%
        summarise(c = sum(cancer)) %>%
        group_by(what, c) %>%
        summarise(n=n())
    print(p)
    
    xr <- p %>% group_by(what) %>%
        summarise(qr = n[2] / (n[2]+n[1]), xr = qnorm(1 - qr), N = sum(n))

    zg <- dnorm(xg)
    ag <- zg / qg
    
    xr <- xr %>% 
        mutate(
            xg,
            qg,
            zg,
            ag,
            b = (xg - xr) / ag, 
            r = case_when(what=="mz" ~ 1, what == "dz" ~ 0.5, TRUE ~ NA), 
            h2 = b / r,
            h2se = 1 / (b * ag^2) * sqrt((1-qr) / (qr * N))
        )
    xr
}

twins <- make_twins(100000, 100000, 100, 1)
twins

# A tibble: 400,000 × 4
     fid iid      prs what 
   <int> <chr>  <dbl> <chr>
 1     1 1a    21.3   mz   
 2     2 2a    11.0   mz   
 3     3 3a     9.17  mz   
 4     4 4a    12.5   mz   
 5     5 5a     5.23  mz   
 6     6 6a     4.72  mz   
 7     7 7a    13.3   mz   
 8     8 8a     9.81  mz   
 9     9 9a    -0.181 mz   
10    10 10a    9.09  mz   
# ℹ 399,990 more rows

twins %>% group_by(what) %>% 
    summarise(mean_prs = mean(prs), sd_prs = sd(prs))

# A tibble: 2 × 3
  what  mean_prs sd_prs
  <chr>    <dbl>  <dbl>
1 dz        11.2   5.48
2 mz        11.2   5.47

twins %>% filter(what == "mz") %>%
    mutate(sib = gsub("[0-9]", "", iid)) %>%
    select(fid, sib, g=prs) %>%
    pivot_wider(values_from=g, names_from=sib) %>% {cov(.[,-1])}

         a        b
a 29.86894 29.86894
b 29.86894 29.86894

twins %>% filter(what == "dz") %>%
    mutate(sib = gsub("[0-9]", "", iid)) %>%
    select(fid, sib, g=prs) %>%
    pivot_wider(values_from=g, names_from=sib) %>% {cov(.[,-1])}

         a        b
a 29.94604 14.90168
b 14.90168 30.03234

sd(twins$prs[1:1000])

[1] 5.49012

twdat <- make_twins(1000000, 1000000, 100, 0.3)
twins <- twdat %>%
    rename(g=prs) %>%
    mutate(
        g = (g - mean(g))*2,
        e = rnorm(n(), sd=sd(g[1:1000])),
        cat = sample(0:1, n(), replace=TRUE),
        smoking = cat * 3.5,
        liability = -7 + g + e + smoking,
        cancer = rbinom(n(), 1, plogis(liability))
    )

twins

# A tibble: 4,000,000 × 9
     fid iid        g what       e   cat smoking liability cancer
   <int> <chr>  <dbl> <chr>  <dbl> <int>   <dbl>     <dbl>  <int>
 1     1 1a    -0.120 mz    -2.55      0     0       -9.67      0
 2     2 2a    -0.243 mz    -2.30      0     0       -9.54      0
 3     3 3a    -1.74  mz    -3.37      1     3.5     -8.61      0
 4     4 4a     4.76  mz     3.51      0     0        1.27      1
 5     5 5a     4.30  mz    -5.00      0     0       -7.69      0
 6     6 6a     0.219 mz    -3.85      1     3.5     -7.13      0
 7     7 7a    -5.26  mz     3.62      0     0       -8.64      0
 8     8 8a     1.59  mz     2.68      0     0       -2.74      0
 9     9 9a     7.55  mz    -4.92      0     0       -4.38      0
10    10 10a   -3.38  mz     0.762     0     0       -9.62      0
# ℹ 3,999,990 more rows

twins %>% filter(grepl("a", iid)) %>%
    group_by(cat) %>%
    summarise(prev = sum(cancer) / n())

# A tibble: 2 × 2
    cat  prev
  <int> <dbl>
1     0 0.113
2     1 0.272

group_by(twins, cat) %>%
    summarise(
        prev_cancer = sum(cancer) / n(),
        mean_g = mean(g),
        mean_e = mean(e),
        mean_smoking = mean(smoking),
        mean_liability = mean(liability),
        h2 = cor(g, liability)^2,
        h2obs = cor(g, cancer)^2
    ) %>% str

tibble [2 × 8] (S3: tbl_df/tbl/data.frame)
 $ cat           : int [1:2] 0 1
 $ prev_cancer   : num [1:2] 0.114 0.273
 $ mean_g        : num [1:2] 0.000756 -0.000756
 $ mean_e        : num [1:2] 0.002363 0.000533
 $ mean_smoking  : num [1:2] 0 3.5
 $ mean_liability: num [1:2] -7 -3.5
 $ h2            : num [1:2] 0.492 0.492
 $ h2obs         : num [1:2] 0.162 0.247

group_by(twins, cat) %>%
    do({
        tibble(
            h2twin = get_twin_h2(.)$h2,
            h2twin2 = get_twin_h2_binary(.)$h2
        )
    })

`summarise()` has grouped output by 'what', 'cancer.x'. You can override using
the `.groups` argument.

Warning: There was 1 warning in `summarise()`.
ℹ In argument: `t = `/`(...)`.
ℹ In group 2: `what = "mz"`.
Caused by warning in `n[cl == "1 1"] * n[cl == "0 0"] + n[cl == "1 0"] * n[cl == "0 1"]`:
! NAs produced by integer overflow

`summarise()` has grouped output by 'what', 'cancer.x'. You can override using
the `.groups` argument.

Warning: There were 4 warnings in `summarise()`.
The first warning was:
ℹ In argument: `t = `/`(...)`.
ℹ In group 1: `what = "dz"`.
Caused by warning in `n[cl == "1 1"] * n[cl == "0 0"]`:
! NAs produced by integer overflow
ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.

# A tibble: 2 × 3
# Groups:   cat [2]
    cat h2twin h2twin2
  <int>  <dbl>   <dbl>
1     0 0.0660      NA
2     1 0.0710      NA

get_twin_h2_binary(twins)

`summarise()` has grouped output by 'what', 'cancer.x'. You can override using
the `.groups` argument.

Warning: There were 8 warnings in `summarise()`.
The first warning was:
ℹ In argument: `t = `/`(...)`.
ℹ In group 1: `what = "dz"`.
Caused by warning in `n[cl == "1 1"] * n[cl == "0 0"]`:
! NAs produced by integer overflow
ℹ Run `dplyr::last_dplyr_warnings()` to see the 7 remaining warnings.

# A tibble: 1 × 1
     h2
  <dbl>
1    NA

twins %>%
    summarise(
        prev_cancer = sum(cancer) / n(),
        mean_g = mean(g),
        mean_e = mean(e),
        mean_smoking = mean(smoking),
        mean_liability = mean(liability),
        h2 = cor(g, liability)^2,
        h2obs = cor(g, cancer)^2,
        h2twin = get_twin_h2(.)$h2,
        h2twin2 = get_twin_h2_binary(.)$h2,
    ) %>% str

`summarise()` has grouped output by 'what', 'cancer.x'. You can override using
the `.groups` argument.

Warning: There was 1 warning in `summarise()`.
ℹ In argument: `t = `/`(...)`.
Caused by warning:
! There were 8 warnings in `summarise()`.
The first warning was:
ℹ In argument: `t = `/`(...)`.
ℹ In group 1: `what = "dz"`.
Caused by warning in `n[cl == "1 1"] * n[cl == "0 0"]`:
! NAs produced by integer overflow
ℹ Run `dplyr::last_dplyr_warnings()` to see the 7 remaining warnings.

tibble [1 × 9] (S3: tbl_df/tbl/data.frame)
 $ prev_cancer   : num 0.193
 $ mean_g        : num 8.59e-16
 $ mean_e        : num 0.00145
 $ mean_smoking  : num 1.75
 $ mean_liability: num -5.25
 $ h2            : num 0.447
 $ h2obs         : num 0.196
 $ h2twin        : num 0.0613
 $ h2twin2       : num NA

group_by(twins, cat) %>%
    do({
        tibble(
            h2twin = get_twin_h2(.)$h2,
            h2twin2 = get_twin_h2_binary(.)$h2
        )
    })

`summarise()` has grouped output by 'what', 'cancer.x'. You can override using
the `.groups` argument.

Warning: There was 1 warning in `summarise()`.
ℹ In argument: `t = `/`(...)`.
ℹ In group 2: `what = "mz"`.
Caused by warning in `n[cl == "1 1"] * n[cl == "0 0"] + n[cl == "1 0"] * n[cl == "0 1"]`:
! NAs produced by integer overflow

`summarise()` has grouped output by 'what', 'cancer.x'. You can override using
the `.groups` argument.

Warning: There were 4 warnings in `summarise()`.
The first warning was:
ℹ In argument: `t = `/`(...)`.
ℹ In group 1: `what = "dz"`.
Caused by warning in `n[cl == "1 1"] * n[cl == "0 0"]`:
! NAs produced by integer overflow
ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.

# A tibble: 2 × 3
# Groups:   cat [2]
    cat h2twin h2twin2
  <int>  <dbl>   <dbl>
1     0 0.0660      NA
2     1 0.0710      NA

group_by(twins, cat) %>%
    do({
        print(.)
        get_twin_h2_falconer(.)
    }) %>% str

# A tibble: 2,000,246 × 9
     fid iid        g what       e   cat smoking liability cancer
   <int> <chr>  <dbl> <chr>  <dbl> <int>   <dbl>     <dbl>  <int>
 1     1 1a    -0.120 mz    -2.55      0       0     -9.67      0
 2     2 2a    -0.243 mz    -2.30      0       0     -9.54      0
 3     4 4a     4.76  mz     3.51      0       0      1.27      1
 4     5 5a     4.30  mz    -5.00      0       0     -7.69      0
 5     7 7a    -5.26  mz     3.62      0       0     -8.64      0
 6     8 8a     1.59  mz     2.68      0       0     -2.74      0
 7     9 9a     7.55  mz    -4.92      0       0     -4.38      0
 8    10 10a   -3.38  mz     0.762     0       0     -9.62      0
 9    11 11a   -2.60  mz     0.221     0       0     -9.38      0
10    12 12a   -0.900 mz    -4.27      0       0    -12.2       0
# ℹ 2,000,236 more rows

`summarise()` has grouped output by 'what'. You can override using the
`.groups` argument.
`summarise()` has grouped output by 'what'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   what [2]
  what      c     n
  <chr> <int> <int>
1 dz        1 51362
2 dz        2  5526
3 mz        1 47773
4 mz        2  8763
# A tibble: 1,999,754 × 9
     fid iid        g what       e   cat smoking liability cancer
   <int> <chr>  <dbl> <chr>  <dbl> <int>   <dbl>     <dbl>  <int>
 1     3 3a    -1.74  mz    -3.37      1     3.5    -8.61       0
 2     6 6a     0.219 mz    -3.85      1     3.5    -7.13       0
 3    13 13a    3.83  mz     0.956     1     3.5     1.29       0
 4    17 17a    0.939 mz    -4.94      1     3.5    -7.50       0
 5    18 18a    3.17  mz    -6.75      1     3.5    -7.08       0
 6    22 22a    3.20  mz    -5.96      1     3.5    -6.26       0
 7    24 24a   -0.115 mz     3.10      1     3.5    -0.516      0
 8    29 29a    2.76  mz     4.90      1     3.5     4.17       1
 9    30 30a   -0.282 mz     6.79      1     3.5     3.01       1
10    33 33a   -0.960 mz     1.41      1     3.5    -3.05       0
# ℹ 1,999,744 more rows

`summarise()` has grouped output by 'what'. You can override using the
`.groups` argument.
`summarise()` has grouped output by 'what'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   what [2]
  what      c      n
  <chr> <int>  <int>
1 dz        1 110944
2 dz        2  24796
3 mz        1 104552
4 mz        2  32143
gropd_df [4 × 13] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ cat : int [1:4] 0 0 1 1
 $ what: chr [1:4] "dz" "mz" "dz" "mz"
 $ qr  : num [1:4] 0.0971 0.155 0.1827 0.2351
 $ xr  : num [1:4] 1.298 1.015 0.905 0.722
 $ N   : int [1:4] 56888 56536 135740 136695
 $ xg  : num [1:4] 1.209 1.209 0.605 0.605
 $ qg  : num [1:4] 0.113 0.113 0.272 0.272
 $ zg  : num [1:4] 0.192 0.192 0.332 0.332
 $ ag  : num [1:4] 1.69 1.69 1.22 1.22
 $ b   : num [1:4] -0.0527 0.1141 -0.2461 -0.0958
 $ r   : num [1:4] 0.5 1 0.5 1
 $ h2  : num [1:4] -0.1055 0.1141 -0.4921 -0.0958
 $ h2se: num [1:4] -0.0844 0.03 -0.0157 -0.0343
 - attr(*, "groups")= tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ cat  : int [1:2] 0 1
  ..$ .rows: list<int> [1:2] 
  .. ..$ : int [1:2] 1 2
  .. ..$ : int [1:2] 3 4
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE

get_twin_h2_falconer(twins) %>% str

`summarise()` has grouped output by 'what'. You can override using the
`.groups` argument.
`summarise()` has grouped output by 'what'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   what [2]
  what      c      n
  <chr> <int>  <int>
1 dz        1 139109
2 dz        2  53519
3 mz        1 120469
4 mz        2  72762
tibble [2 × 12] (S3: tbl_df/tbl/data.frame)
 $ what: chr [1:2] "dz" "mz"
 $ qr  : num [1:2] 0.278 0.377
 $ xr  : num [1:2] 0.589 0.315
 $ N   : int [1:2] 192628 193231
 $ xg  : num [1:2] 0.867 0.867
 $ qg  : num [1:2] 0.193 0.193
 $ zg  : num [1:2] 0.274 0.274
 $ ag  : num [1:2] 1.42 1.42
 $ b   : num [1:2] 0.196 0.389
 $ r   : num [1:2] 0.5 1
 $ h2  : num [1:2] 0.391 0.389
 $ h2se: num [1:2] 0.00931 0.00373

get_twin_h2_falconer(twins %>% filter(cat == 1)) %>% str

`summarise()` has grouped output by 'what'. You can override using the
`.groups` argument.
`summarise()` has grouped output by 'what'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   what [2]
  what      c      n
  <chr> <int>  <int>
1 dz        1 110944
2 dz        2  24796
3 mz        1 104552
4 mz        2  32143
tibble [2 × 12] (S3: tbl_df/tbl/data.frame)
 $ what: chr [1:2] "dz" "mz"
 $ qr  : num [1:2] 0.183 0.235
 $ xr  : num [1:2] 0.905 0.722
 $ N   : int [1:2] 135740 136695
 $ xg  : num [1:2] 0.605 0.605
 $ qg  : num [1:2] 0.272 0.272
 $ zg  : num [1:2] 0.332 0.332
 $ ag  : num [1:2] 1.22 1.22
 $ b   : num [1:2] -0.2461 -0.0958
 $ r   : num [1:2] 0.5 1
 $ h2  : num [1:2] -0.4921 -0.0958
 $ h2se: num [1:2] -0.0157 -0.0343

twins$fid <- paste(twins$fid, twins$what, sep="_")
kid <- twins %>% subset(cancer & !duplicated(fid)) %>% {.$fid}
twins_proband <- twins %>% filter(fid %in% kid) %>% arrange(desc(cancer)) %>% mutate(proband = !duplicated(fid)) 
twins_proband %>% group_by(proband) %>% summarise(n=n(), nc=sum(cancer))

# A tibble: 2 × 3
  proband      n     nc
  <lgl>    <int>  <int>
1 FALSE   385859 126281
2 TRUE    385859 385859

mean(twins$liability)

[1] -5.248767

mean(twins_proband$liability)

[1] -0.02932395

mean(twins_proband$liability[twins_proband$proband])

[1] 2.570294

mean(twins_proband$liability[!twins_proband$proband])

[1] -2.628942

sum(twins$cancer)/nrow(twins)

[1] 0.1930982

sum(twins_proband$cancer)/nrow(twins_proband)

[1] 0.6636362

sum(twins_proband$cancer[!twins_proband$proband])/sum(!twins_proband$proband)

[1] 0.3272724

tw <- bind_rows(
    make_twins2(100000, 100000, 0.005, 0.2) %>% mutate(cat = 0),
    make_twins2(100000, 100000, 0.15, 0.2) %>% mutate(cat = 1)
)
tw %>% group_by(cat) %>%
    filter(what == "mz" & !duplicated(fid)) %>%
    summarise(
        prev = sum(cancer) / n(),
        mean_g = mean(g),
        h2 = cor(g, cancer)^2
    )

# A tibble: 2 × 4
    cat   prev    mean_g     h2
  <dbl>  <dbl>     <dbl>  <dbl>
1     0 0.0119  0.00142  0.0618
2     1 0.183  -0.000322 0.360

twins_proband %>% group_by(cat, proband) %>% 
    summarise(prev = sum(cancer)/n())

`summarise()` has grouped output by 'cat'. You can override using the `.groups`
argument.

# A tibble: 4 × 3
# Groups:   cat [2]
    cat proband  prev
  <int> <lgl>   <dbl>
1     0 FALSE   0.217
2     0 TRUE    1    
3     1 FALSE   0.438
4     1 TRUE    1

twins2 <- left_join(twins, twins_proband %>% select(fid, iid, proband), by=c("fid", "iid")) %>%
    mutate(proband = case_when(proband ~ "proband", !proband ~ "relative", is.na(proband) ~ "pop", TRUE ~ NA))
table(twins2$proband, twins2$cancer)

          
                 0       1
  pop      2968029  260253
  proband        0  385859
  relative  259578  126281

twins2 %>% group_by(cat, what) %>% 
    summarise(
        qg = sum(cancer)/n(),
        xg = qnorm(1 - qg),
        qr = sum(cancer[proband=="relative"])/sum(proband=="relative"),
        xr = qnorm(1 - qr),        
    )

`summarise()` has grouped output by 'cat'. You can override using the `.groups`
argument.

# A tibble: 4 × 6
# Groups:   cat [2]
    cat what     qg    xg    qr      xr
  <int> <chr> <dbl> <dbl> <dbl>   <dbl>
1     0 dz    0.114 1.21  0.178 0.922  
2     0 mz    0.113 1.21  0.256 0.656  
3     1 dz    0.272 0.607 0.378 0.311  
4     1 mz    0.273 0.603 0.497 0.00705

Summary

Not clear what model would allow heritability to be greater than the variance explained by environmental factors.

sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.2 tidyr_1.3.0   dplyr_1.1.4  

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.2         knitr_1.45        rlang_1.1.3      
 [5] xfun_0.42         purrr_1.0.2       generics_0.1.3    jsonlite_1.8.8   
 [9] labeling_0.4.2    glue_1.7.0        colorspace_2.1-0  htmltools_0.5.7  
[13] scales_1.2.1      fansi_1.0.6       rmarkdown_2.26    grid_4.3.3       
[17] munsell_0.5.0     evaluate_0.23     tibble_3.2.1      fastmap_1.1.1    
[21] mvtnorm_1.2-2     yaml_2.3.8        lifecycle_1.0.4   compiler_4.3.3   
[25] htmlwidgets_1.6.3 pkgconfig_2.0.3   farver_2.1.1      digest_0.6.34    
[29] R6_2.5.1          tidyselect_1.2.0  utf8_1.2.4        pillar_1.9.0     
[33] magrittr_2.0.3    withr_3.0.0       gtable_0.3.3      tools_4.3.3