NAT2 and bladder cancer – Gibran Hemani’s lab book

Background

This is the relationship between smoking and bladder cancer risk stratified by NAT2 genotype (rapid vs slow acetylators).

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
ss <- tibble(
    Group = rep(factor(c("Nonsmoker", "Occasional smoker", "Former smoker", "Current smoker"), levels = c("Nonsmoker", "Occasional smoker", "Former smoker", "Current smoker")), times = 2),
    NAT2_group = rep(c("Rapid", "Slow"), each = 4),
    OR = c(1.0, 1.2, 2.4, 5.2, 0.9, 1.6, 4.1, 7.5)
) 
ggplot(ss, aes(x = Group, y = OR, group = NAT2_group, color = NAT2_group)) +
    geom_line() +
    geom_point() +
    theme_bw() +
    labs(y = "Odds Ratio for bladder cancer", color = "NAT2 genotype")

The hypothesised mechanism is that slow acetylators have a reduced ability to detoxify carcinogens in tobacco smoke, leading to higher levels of DNA damage and increased bladder cancer risk.

Here we will develop a simulation to examine if an interaction is required at all to explain the finding. Our model will be

\[ \text{logit}(P(Y=1)) = \beta_0 + \beta_{H,Y} H + E_Y \]

\[ H = \beta_{S,H} C + \beta_{N,Y} N + \beta_{U,H} U + E_H \]

\[ N = \beta_{G_N,N} G_N + E_N \]

Smoking is represented as cigarettes per day, where G_I influences ever/never smoking and G_C influences number of cigarettes smoked per day among smokers.

\[ logit(P(S=1)) = \beta_{G_I,S} G_I + \beta_{U,S} U + E_S \]

\(C_i = 0\) if \(S_i = 0\), otherwise

\[ C = \beta_{G_C,C} G_C + \beta_{U,C} U + E_C \]

where

\(N\) is the NAT2 gene activity level
\(H\) is the level of heterocyclic amines / carcinogens
\(S\) is smoking initiation (0 for never, 1 for ever)
\(C\) is the number of cigarettes smoked per day
\(Y\) is bladder cancer status (0/1)
\(G_1\) is the NAT2 genotype (0 for rapid, 1 for slow)

The 8:18415371:A:G variant (rs1495741) is an eQTL for NAT2 expression and is also associated with bladder cancer risk (https://www.ebi.ac.uk/gwas/variants/rs1495741).

dgm <- function(b_0, b_hy, b_uy, b_sh, b_nh, b_gcc, b_gis, b_gnn, b_us, b_uc, b_uh, n) {
    U <- runif(n)
    Gc <- rbinom(n, 2, 0.5)
    Gi <- rbinom(n, 2, 0.5)
    Gn <- rbinom(n, 1, 0.5) + 1
    logit_S <- b_gis * Gi + b_us * U + rnorm(n)
    S <- rbinom(n, 1, exp(logit_S) / (1 + exp(logit_S)))
    C <- ifelse(S == 0, 0, rpois(n, lambda = b_gcc * Gc + b_uc * U))
    N <- b_gnn * Gn + rnorm(n)
    H_intake <- b_sh * C + b_uh * U + rnorm(n) 
    H <- H_intake * N * b_nh
    logit_p <- b_0 + b_hy * H + b_uy * U
    p <- exp(logit_p) / (1 + exp(logit_p))
    Y <- rbinom(n, 1, p)
    Ccat <- cut(C, breaks=c(-Inf, 0, 1, 2, 3, Inf), labels=c("0", "1", "2", "3", "4+"))    
    tibble(Y = Y, H = H, U = U, S = S, C = C, N = N, Gc = Gc, Gi = Gi, Gn = Gn, Ccat = Ccat)
}

estimation_gc2005 <- function(data) {
    # Estimate the effect of smoking on bladder cancer risk stratified by NAT2 genotype
    model <- glm(Y ~ C * Gn, data = data, family = binomial)
    summary(model)
}

dat <- dgm(
    b_0 = -3,
    b_hy = 0.5,
    b_uy = 0,
    b_sh = 0.4,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 1.0,
    b_us = 0,
    b_uc = 0,
    b_uh = 0,
    n = 1000000
)

dat

# A tibble: 1,000,000 × 10
       Y       H     U     S     C     N    Gc    Gi    Gn Ccat 
   <int>   <dbl> <dbl> <int> <dbl> <dbl> <int> <int> <dbl> <fct>
 1     0 -0.0510 0.782     1     0 0.932     0     2     2 0    
 2     0  0.0204 0.640     0     0 1.05      0     1     1 0    
 3     0  0.438  0.780     1     1 1.95      1     1     2 1    
 4     0  0.112  0.306     1     1 1.31      1     2     1 1    
 5     0 -0.193  0.659     1     0 1.48      0     1     1 0    
 6     0  0.790  0.541     0     0 2.42      2     1     2 0    
 7     0 -0.0316 0.951     1     0 0.669     0     1     1 0    
 8     0  0.119  0.346     1     0 0.437     1     1     1 0    
 9     0  0.180  0.791     0     0 1.78      2     1     1 0    
10     0 -0.161  0.154     0     0 1.26      0     0     2 0    
# ℹ 999,990 more rows

estimation_gc2005(dat)


Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -3.004566   0.017180 -174.886  < 2e-16 ***
C            0.013845   0.012373    1.119   0.2631    
Gn           0.020208   0.010826    1.867   0.0619 .  
C:Gn         0.033749   0.007703    4.381 1.18e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 401845  on 999999  degrees of freedom
Residual deviance: 401525  on 999996  degrees of freedom
AIC: 401533

Number of Fisher Scoring iterations: 5

table(dat$Ccat)


     0      1      2      3     4+ 
635304 173044 109178  51544  30930

plot_dat <- function(dat) {
    ggplot(dat, aes(x = Ccat, y = Y, group = Gn, color = factor(Gn))) +
        geom_line(stat = "summary", fun = "mean") +
        geom_point(stat = "summary", fun = "mean") +
        labs(x = "Cigarettes per day (categorised)", y = "Proportion with bladder cancer", color = "NAT2 genotype") +
        theme_bw()
}

dat <- dgm(
    b_0 = -3,
    b_hy = 0.5,
    b_uy = 0,
    b_sh = 0,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 1.0,
    b_us = 0.4,
    b_uc = 0.4,
    b_uh = 0.8,
    n = 5000000
)
estimation_gc2005(dat)


Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -2.9973872  0.0078890 -379.945   <2e-16 ***
C           -0.0012968  0.0052537   -0.247    0.805    
Gn           0.0515205  0.0049547   10.398   <2e-16 ***
C:Gn        -0.0004443  0.0032996   -0.135    0.893    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2017456  on 4999999  degrees of freedom
Residual deviance: 2017296  on 4999996  degrees of freedom
AIC: 2017304

Number of Fisher Scoring iterations: 5

ggplot(dat, aes(x = Ccat, y = Y, group = Gn, color = factor(Gn))) +
    geom_line(stat = "summary", fun = "mean") +
    geom_point(stat = "summary", fun = "mean") +
    labs(x = "Cigarettes per day (categorised)", y = "Proportion with bladder cancer", color = "NAT2 genotype") +
    theme_bw()

sessionInfo()

R version 4.6.0 (2026-04-24)
Platform: aarch64-apple-darwin23
Running under: macOS Sequoia 15.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_4.0.3 dplyr_1.2.1  

loaded via a namespace (and not attached):
 [1] vctrs_0.7.3        cli_3.6.6          knitr_1.51         rlang_1.2.0       
 [5] xfun_0.57          otel_0.2.0         generics_0.1.4     S7_0.2.2          
 [9] jsonlite_2.0.0     labeling_0.4.3     glue_1.8.1         htmltools_0.5.9   
[13] scales_1.4.0       rmarkdown_2.31     grid_4.6.0         evaluate_1.0.5    
[17] tibble_3.3.1       fastmap_1.2.0      yaml_2.3.12        lifecycle_1.0.5   
[21] compiler_4.6.0     RColorBrewer_1.1-3 htmlwidgets_1.6.4  pkgconfig_2.0.3   
[25] farver_2.1.2       digest_0.6.39      R6_2.6.1           utf8_1.2.6        
[29] tidyselect_1.2.1   pillar_1.11.1      magrittr_2.0.5     withr_3.0.2       
[33] tools_4.6.0        gtable_0.3.6