NAT2 and bladder cancer notebook

Author

Gibran Hemani

Published

March 5, 2026

Analysis if NAT2, smoking and bladder cancer

This is the relationship between smoking and bladder cancer risk stratified by NAT2 genotype (rapid vs slow acetylators) according to Davey Smith 2010, reproduced from Garcia-Closas 2006.

library(dplyr)
library(ggplot2)
ss <- tibble(
    Group = rep(factor(c("Nonsmoker", "Occasional smoker", "Former smoker", "Current smoker"), levels = c("Nonsmoker", "Occasional smoker", "Former smoker", "Current smoker")), times = 2),
    NAT2_group = rep(c("Rapid", "Slow"), each = 4),
    OR = c(1.0, 1.2, 2.4, 5.2, 0.9, 1.6, 4.1, 7.5)
) 
ggplot(ss, aes(x = Group, y = OR, group = NAT2_group, color = NAT2_group)) +
    geom_line() +
    geom_point() +
    theme_bw() +
    labs(y = "Risk Ratio for bladder cancer", color = "NAT2 genotype")

Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Generate a DAG that could explain this association:

NAT2 and bladder cancer risk

Note that no interaction is required, but the effect of NAT2 on carcinogen levels is a relative effect, meaning that fast metabolisers will reduce carcinogen levels at a faster rate than slow metabolisers, but the additive change in carcinogen levels will be dependent on the initial level.

The model will look like:

\[ \text{logit}(P(Y=1)) = \beta_0 + \beta_{H,Y} H + \beta_{U,Y} U + E_Y \]

Carcinogen intake:

\[ H_{intake} = \beta_{S,H} C + \beta_{U,H} U + E_H \]

Carcinogen levels:

\[ H = \beta_{G_N,H} \times N \times H_{intake} \]

NAT2 activity / levels:

\[ N = \beta_{G_N,N} G_N + E_N \]

Smoking is represented as cigarettes per day, where G_I influences ever/never smoking and G_C influences number of cigarettes smoked per day among smokers.

\[ logit(P(S=1)) = \beta_{G_I,S} G_I + \beta_{U,S} U + E_S \]

\(C_i = 0\) if \(S_i = 0\), otherwise

\[ C = \beta_{G_C,C} G_C + \beta_{U,C} U + E_C \]

where

  • \(N\) is the NAT2 gene activity level
  • \(H\) is the level of heterocyclic amines / carcinogens
  • \(S\) is smoking initiation (0 for never, 1 for ever)
  • \(C\) is the number of cigarettes smoked per day
  • \(Y\) is bladder cancer status (0/1)
  • \(G_1\) is the NAT2 genotype (2 for rapid, 1 for slow)

That describes the data generating model in the simulation below. Function for the data generating model:

dgm <- function(b_0, b_hy, b_uy, b_sh, b_nh, b_gcc, b_gis, b_gnn, b_us, b_uc, b_uh, n) {
    U <- runif(n)
    Gc <- rbinom(n, 2, 0.5)
    Gi <- rbinom(n, 2, 0.5)
    Gn <- rbinom(n, 1, 0.5) + 1
    logit_S <- b_gis * Gi + b_us * U + rnorm(n)
    S <- rbinom(n, 1, exp(logit_S) / (1 + exp(logit_S)))
    C <- ifelse(S == 0, 0, rpois(n, lambda = b_gcc * Gc + b_uc * U))
    N <- b_gnn * Gn + rnorm(n)
    H_intake <- b_sh * C + b_uh * U + rnorm(n)
    H <- H_intake * N * b_nh
    logit_p <- b_0 + b_hy * H + b_uy * U
    p <- exp(logit_p) / (1 + exp(logit_p))
    Y <- rbinom(n, 1, p)
    Ccat <- cut(C, breaks=c(-Inf, 0, 1, 2, 3, Inf), labels=c("0", "1", "2", "3", "4+"))    
    tibble(Y = Y, H = H, U = U, S = S, C = C, N = N, Gc = Gc, Gi = Gi, Gn = Gn, Ccat = Ccat)
}
draw_dag <- function(b_hy, b_uy, b_sh, b_nh, b_gcc, b_gis, b_gnn, b_us, b_uc, b_uh) {
    nodes <- tibble::tribble(
        ~node,       ~x, ~y,
        # "Gc",         1,  4,
        # "Gi",         1,  3,
        "Gn",         3,  2,
        "U",          3,  4,
        "C",          1,  3,
        "N",          5,  2,
        "Hi",   3,  3,
        "H",          5,  3,
        "Y",          7,  3
    )

    edges <- tibble::tribble(
        ~from, ~to,        ~beta,   ~label,   ~curv,
        # "Gc",  "C",        b_gcc,   "b_gcc",   0.00,
        # "Gi",  "C",        b_gis,   "b_gis",   0,
        # "U",   "C",        b_us,    "b_us",    0.20,
        "U",   "C",        b_uc,    "b_uc",   0,
        "Gn",  "N",        b_gnn,   "b_gnn",   0.00,
        "C",   "Hi", b_sh,    "b_sh",    0.00,
        "U",   "Hi", b_uh,    "b_uh",    0.00,
        "Hi", "H",   b_nh,    "b_nh",    0.00,
        "N",   "H",        b_nh,    "b_nh",    0,
        "H",   "Y",        b_hy,    "b_hy",    0.00,
        "U",   "Y",        b_uy,    "b_uy",    0
    ) %>%
        dplyr::mutate(
            sign_type = dplyr::case_when(
                beta > 0 ~ "pos",
                beta < 0 ~ "neg",
                TRUE ~ "null"
            ),
            col = dplyr::case_when(
                sign_type == "pos" ~ "red3",
                sign_type == "neg" ~ "blue3",
                TRUE ~ "grey60"
            ),
            lty = ifelse(sign_type == "null", "dotted", "solid")
        ) %>%
        dplyr::left_join(nodes, by = c("from" = "node")) %>%
        dplyr::left_join(nodes, by = c("to" = "node"), suffix = c("", "end")) %>%
        dplyr::mutate(
            # Shorten arrows to stop at node edges
            dx = xend - x,
            dy = yend - y,
            length = sqrt(dx^2 + dy^2),
            # Shorten by 0.25 units at each end (approximate node radius)
            x = x + 0.25 * dx / length,
            y = y + 0.25 * dy / length,
            xend = xend - 0.25 * dx / length,
            yend = yend - 0.25 * dy / length
        )

    # Split edges into straight and curved
    edges_straight <- edges %>% dplyr::filter(curv == 0)
    edges_curved <- edges %>% dplyr::filter(curv != 0)
    
    p <- ggplot2::ggplot()
    
    # Add straight edges
    if (nrow(edges_straight) > 0) {
        p <- p + ggplot2::geom_segment(
            data = edges_straight,
            ggplot2::aes(
                x = x, y = y, xend = xend, yend = yend,
                color = col, linetype = lty
            ),
            arrow = grid::arrow(length = grid::unit(0.4, "cm"), type = "closed"),
            linewidth = 0.9
        )
    }
    
    # Add curved edges (one layer per unique curvature value)
    if (nrow(edges_curved) > 0) {
        for (curv_val in unique(edges_curved$curv)) {
            edges_subset <- edges_curved %>% dplyr::filter(curv == curv_val)
            p <- p + ggplot2::geom_curve(
                data = edges_subset,
                ggplot2::aes(
                    x = x, y = y, xend = xend, yend = yend,
                    color = col, linetype = lty
                ),
                curvature = curv_val,
                arrow = grid::arrow(length = grid::unit(0.4, "cm"), type = "closed"),
                linewidth = 0.9
            )
        }
    }
    
    p <- p +
        ggplot2::geom_text(
            data = edges,
            ggplot2::aes(
                x = (x + xend) / 2,
                y = (y + yend) / 2,
                label = label
            ),
            size = 3, vjust = -0.6
        ) +
        ggplot2::geom_point(
            data = nodes,
            ggplot2::aes(x = x, y = y),
            size = 9, shape = 21, fill = "white", color = "black", stroke = 0.8
        ) +
        ggplot2::geom_text(
            data = nodes,
            ggplot2::aes(x = x, y = y, label = node),
            size = 3.2
        ) +
        ggplot2::scale_color_identity() +
        ggplot2::scale_linetype_identity() +
        ggplot2::theme_void() +
        ggplot2::coord_cartesian(xlim = c(0.5, 9.5), ylim = c(0.5, 4.5))
    return(p)
}

a <- draw_dag(
    b_hy = 2,
    b_uy = 0,
    b_sh = 0,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 2,
    b_uc = 2,
    b_uh = 3
)
class(a)
a
  1. 'ggplot2::ggplot'
  2. 'ggplot'
  3. 'ggplot2::gg'
  4. 'S7_object'
  5. 'gg'

And this is the estimation model - the influence of smoking on bladder cancer stratified by NAT2 genotype

# Redefine draw_dag so N points to the Hi -> H edge (interaction), not directly to H
draw_dag <- function(b_hy, b_uy, b_sh, b_nh, b_gcc, b_gis, b_gnn, b_us, b_uc, b_uh) {
    nodes <- tibble::tribble(
        ~node, ~x, ~y,
        "Gn",   2,  2,
        "U",    3,  4,
        "C",    1,  3,
        "N",    4,  2,
        "Hi",   3,  3,
        "H",    5,  3,
        "Y",    7,  3
    )

    # Main causal edges
    edges <- tibble::tribble(
        ~from, ~to,   ~beta, ~label,  ~curv,
        "U",   "C",   b_uc,  "b_uc",  0,
        "Gn",  "N",   b_gnn, "b_gnn", 0,
        "C",   "Hi",  b_sh,  "b_sh",  0,
        "U",   "Hi",  b_uh,  "b_uh",  0,
        "Hi",  "H",   b_nh,  "b_nh",  0,
        "H",   "Y",   b_hy,  "b_hy",  0,
        "U",   "Y",   b_uy,  "b_uy",  0
    ) %>%
        dplyr::mutate(
            sign_type = dplyr::case_when(beta > 0 ~ "pos", beta < 0 ~ "neg", TRUE ~ "null"),
            col = dplyr::case_when(sign_type == "pos" ~ "red3", sign_type == "neg" ~ "blue3", TRUE ~ "grey60"),
            lty = ifelse(sign_type == "null", "dotted", "solid")
        ) %>%
        dplyr::left_join(nodes, by = c("from" = "node")) %>%
        dplyr::left_join(nodes, by = c("to" = "node"), suffix = c("", "end")) %>%
        dplyr::mutate(
            dx = xend - x,
            dy = yend - y,
            length = sqrt(dx^2 + dy^2),
            x = x + 0.25 * dx / length,
            y = y + 0.25 * dy / length,
            xend = xend - 0.25 * dx / length,
            yend = yend - 0.25 * dy / length
        )

    # Interaction arrow: N -> (Hi -> H) midpoint
    n_xy <- nodes %>% dplyr::filter(node == "N")
    hi_xy <- nodes %>% dplyr::filter(node == "Hi")
    h_xy <- nodes %>% dplyr::filter(node == "H")

    x_target <- (hi_xy$x + h_xy$x) / 2
    y_target <- (hi_xy$y + h_xy$y) / 2

    dx_i <- x_target - n_xy$x
    dy_i <- y_target - n_xy$y
    len_i <- sqrt(dx_i^2 + dy_i^2)

    interaction_edge <- tibble::tibble(
        x = n_xy$x + 0.25 * dx_i / len_i,
        y = n_xy$y + 0.25 * dy_i / len_i,
        xend = x_target,
        yend = y_target,
        col = ifelse(b_nh > 0, "red3", ifelse(b_nh < 0, "blue3", "grey60")),
        lty = ifelse(b_nh == 0, "dotted", "solid")
    )

    p <- ggplot2::ggplot() +
        ggplot2::geom_segment(
            data = edges,
            ggplot2::aes(x = x, y = y, xend = xend, yend = yend, color = col, linetype = lty),
            arrow = grid::arrow(length = grid::unit(0.4, "cm"), type = "closed"),
            linewidth = 0.9
        ) +
        ggplot2::geom_segment(
            data = interaction_edge,
            ggplot2::aes(x = x, y = y, xend = xend, yend = yend, color = col, linetype = lty),
            arrow = grid::arrow(length = grid::unit(0.35, "cm"), type = "closed"),
            linewidth = 0.9
        ) +
        ggplot2::geom_text(
            data = edges,
            ggplot2::aes(x = (x + xend) / 2, y = (y + yend) / 2, label = label),
            size = 3, vjust = -0.6
        ) +
        ggplot2::geom_point(
            data = nodes,
            ggplot2::aes(x = x, y = y),
            size = 9, shape = 21, fill = "white", color = "black", stroke = 0.8
        ) +
        ggplot2::geom_text(
            data = nodes,
            ggplot2::aes(x = x, y = y, label = node),
            size = 3.2
        ) +
        ggplot2::scale_color_identity() +
        ggplot2::scale_linetype_identity() +
        ggplot2::theme_void() +
        ggplot2::coord_cartesian(xlim = c(0.5, 9.5), ylim = c(0.5, 4.5))

    return(p)
}
a <- draw_dag(
    b_hy = 2,
    b_uy = 0,
    b_sh = 0,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 2,
    b_uc = 2,
    b_uh = 3
)
class(a)
a
  1. 'ggplot2::ggplot'
  2. 'ggplot'
  3. 'ggplot2::gg'
  4. 'S7_object'
  5. 'gg'

estimation_gc2005 <- function(data) {
    model <- glm(Y ~ C * Gn, data = data, family = binomial)
    summary(model)
}
plot_rr <- function(data) {
    print(estimation_gc2005(data))
    rr <- estimation2(data)
    p <- ggplot(rr, aes(x = Ccat, y = RR, group = Gn, color = factor(Gn))) +
        geom_line() +
        geom_point() +
        theme_bw() +
        labs(y = "Risk Ratio for bladder cancer", color = "NAT2 genotype")
    return(p)
}
estimation2 <- function(data) {
    total_cases <- sum(data$Y)
    risk_baseline <- sum(data$Y[data$Ccat == 0 & data$Gn == 2]) / sum(data$Ccat == 0 & data$Gn == 2)
    group_by(data, Gn, Ccat) %>%
        do({
            cases <- sum(.$Y)
            n <- nrow(.)
            tibble(
                cases = cases,
                risk = cases / n,
                RR = risk / risk_baseline
            )
        })
}
analysis <- function(b_0, b_hy, b_uy, b_sh, b_nh, b_gcc, b_gis, b_gnn, b_us, b_uc, b_uh, n, plot=TRUE) {
    data <- dgm(b_0, b_hy, b_uy, b_sh, b_nh, b_gcc, b_gis, b_gnn, b_us, b_uc, b_uh, n)
    p1 <- draw_dag(b_hy, b_uy, b_sh, b_nh, b_gcc, b_gis, b_gnn, b_us, b_uc, b_uh)
    p2 <- plot_rr(data)
    # plot p1 and p2 side by side
    if (plot) {
        gridExtra::grid.arrange(p1, p2, nrow = 2)
    }
}

Example dataset with no confounding:

analysis(
    b_0 = -3,
    b_hy = 1.3,
    b_sh = 2,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 0,
    b_uc = 0,
    b_uh = 0,
    b_uy = 0,
    n = 1000000
)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -3.021504   0.016227 -186.199  < 2e-16 ***
C            0.259425   0.008479   30.598  < 2e-16 ***
Gn           0.048677   0.010158    4.792 1.65e-06 ***
C:Gn         0.134023   0.005237   25.589  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 538054  on 999999  degrees of freedom
Residual deviance: 507664  on 999996  degrees of freedom
AIC: 507672

Number of Fisher Scoring iterations: 5

So this generates a G x smoking interaction on breast cancer with no marginal effect that is pretty close to the empirical results

Smoking is not causal: leads to no smoking assoc and no interaction with smoking

analysis(
    b_0 = -3,
    b_hy = 1.3,
    b_sh = 0,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 0,
    b_uc = 0,
    b_uh = 0,
    b_uy = 0,
    n = 1000000
)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -3.010171   0.017188 -175.128  < 2e-16 ***
C            0.008626   0.013082    0.659  0.50965    
Gn           0.036060   0.010820    3.333  0.00086 ***
C:Gn        -0.002983   0.008234   -0.362  0.71710    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 394612  on 999999  degrees of freedom
Residual deviance: 394598  on 999996  degrees of freedom
AIC: 394606

Number of Fisher Scoring iterations: 5

Scenario: Smoking is not causal but is positively confounded with H_intake

This leads to an interaction as well as marginal effects for smoking and Gn.

analysis(
    b_0 = -3,
    b_hy = 1.3,
    b_sh = 0,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 6,
    b_uc = 6,
    b_uh = 6,
    b_uy = 0,
    n = 1000000
)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.882430   0.019336 -149.07   <2e-16 ***
C            0.060301   0.003610   16.70   <2e-16 ***
Gn           0.214897   0.011855   18.13   <2e-16 ***
C:Gn         0.023153   0.002203   10.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 666590  on 999999  degrees of freedom
Residual deviance: 656602  on 999996  degrees of freedom
AIC: 656610

Number of Fisher Scoring iterations: 5

Scenario: Smoking is causal and confounded

How different is this to a scenario where smoking is causal and there are also confounders?

The shape is basically the same as when smoking was not causal

analysis(
    b_0 = -3,
    b_hy = 1.3,
    b_sh = 0.5,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 6,
    b_uc = 6,
    b_uh = 6,
    b_uy = 0,
    n = 1000000
)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.853711   0.017321 -164.76   <2e-16 ***
C            0.133142   0.003050   43.65   <2e-16 ***
Gn           0.267966   0.010595   25.29   <2e-16 ***
C:Gn         0.033829   0.001874   18.05   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 886863  on 999999  degrees of freedom
Residual deviance: 840223  on 999996  degrees of freedom
AIC: 840231

Number of Fisher Scoring iterations: 4

Scenario: Smoking not causal but confounded with the outcome

What about if smoking is confounded with the outcome rather than Hi

This leads to no interaction

analysis(
    b_0 = -3,
    b_hy = 1.3,
    b_sh = 0,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 6,
    b_uc = 6,
    b_uh = 0,
    b_uy = 3,
    n = 1000000
)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -1.9851912  0.0144262 -137.610   <2e-16 ***
C            0.1675023  0.0026661   62.828   <2e-16 ***
Gn          -0.0009948  0.0091264   -0.109    0.913    
C:Gn         0.0004881  0.0016879    0.289    0.772    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1047761  on 999999  degrees of freedom
Residual deviance: 1007040  on 999996  degrees of freedom
AIC: 1007048

Number of Fisher Scoring iterations: 4

Scenario: Smoking is causal and it is confounded with the outcome

This induces the interaction, but it also makes the smoking - risk ratio relationship more linear

analysis(
    b_0 = -3,
    b_hy = 1.3,
    b_sh = 2,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 6,
    b_uc = 6,
    b_uh = 0,
    b_uy = 6,
    n = 1000000
)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.953839   0.011943 -79.865   <2e-16 ***
C            0.201889   0.002884  70.003   <2e-16 ***
Gn           0.016425   0.007647   2.148   0.0317 *  
C:Gn         0.100130   0.001915  52.299   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1362349  on 999999  degrees of freedom
Residual deviance: 1180900  on 999996  degrees of freedom
AIC: 1180908

Number of Fisher Scoring iterations: 4

Scenario: Smoking is causal and it is negatively confounded with the outcome

This induces the interaction, but it also makes the smoking - risk ratio relationship more non-linear

analysis(
    b_0 = -3,
    b_hy = 1.3,
    b_sh = 2,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 6,
    b_uc = 6,
    b_uh = 0,
    b_uy = -6,
    n = 1000000
)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.320992   0.028962 -149.19   <2e-16 ***
C            0.179804   0.004488   40.06   <2e-16 ***
Gn           0.320014   0.017413   18.38   <2e-16 ***
C:Gn         0.041742   0.002700   15.46   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 482286  on 999999  degrees of freedom
Residual deviance: 443162  on 999996  degrees of freedom
AIC: 443170

Number of Fisher Scoring iterations: 6

Scenario: All confounding paths and no causality

Largely similar to the confounding of smoking - Hi example.

analysis(
    b_0 = -3,
    b_hy = 1.3,
    b_sh = 0,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 6,
    b_uc = 6,
    b_uh = 6,
    b_uy = 6,
    n = 1000000
)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.213696   0.012211  -99.39   <2e-16 ***
C            0.304844   0.002899  105.17   <2e-16 ***
Gn           0.088846   0.007726   11.50   <2e-16 ***
C:Gn         0.029124   0.001862   15.64   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1378050  on 999999  degrees of freedom
Residual deviance: 1195876  on 999996  degrees of freedom
AIC: 1195884

Number of Fisher Scoring iterations: 4

Scenario: All confounding paths and causality

As above

analysis(
    b_0 = -3,
    b_hy = 1.3,
    b_sh = 2,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 0.4,
    b_us = 6,
    b_uc = 6,
    b_uh = 6,
    b_uy = 6,
    n = 1000000
)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.915009   0.011658  -78.49   <2e-16 ***
C            0.174259   0.002781   62.66   <2e-16 ***
Gn           0.112442   0.007444   15.11   <2e-16 ***
C:Gn         0.092653   0.001847   50.17   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1354870  on 999999  degrees of freedom
Residual deviance: 1202536  on 999996  degrees of freedom
AIC: 1202544

Number of Fisher Scoring iterations: 4

Summary

  • The signature of causality is that exposure associates with the outcome and the buffering mechanism interacts with exposure
  • Confounding of exposure - mediator will lead to the intercept being non-1 for one of the genotype classes
  • Unfortunately confounding of the exposure - mediator will give the same pattern whether or not the exposure is causal
  • We can separate exposure-outcome confounding from exposure-outcome causality
  • We cannot separate exposure-mediator confounding from exposure-outcome causality
  • Could investigate fitting multiple exposures