NAT2 and bladder cancer

Author

Gibran Hemani

Published

March 5, 2026

Background

The plot below shows odds ratios for bladder cancer by smoking status, stratified by NAT2 genotype (rapid vs slow acetylators).

NAT2 and bladder cancer risk
library(dplyr)

library(ggplot2)
ss <- tibble(
    Group = rep(factor(c("Nonsmoker", "Occasional smoker", "Former smoker", "Current smoker"), levels = c("Nonsmoker", "Occasional smoker", "Former smoker", "Current smoker")), times = 2),
    NAT2_group = rep(c("Rapid", "Slow"), each = 4),
    OR = c(1.0, 1.2, 2.4, 5.2, 0.9, 1.6, 4.1, 7.5)
) 
ggplot(ss, aes(x = Group, y = OR, group = NAT2_group, color = NAT2_group)) +
    geom_line() +
    geom_point() +
    theme_bw() +
    labs(y = "Odds Ratio for bladder cancer", color = "NAT2 genotype")

The hypothesised mechanism is that slow acetylators have a reduced ability to detoxify carcinogens in tobacco smoke, leading to higher levels of DNA damage and increased bladder cancer risk.
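
Whether the diverging lines indicate an interaction depends on the scale. As a rough check, assuming the values in `ss` are all expressed relative to rapid-acetylator nonsmokers (an assumption, since the reference group is not stated), the odds ratios expected for slow acetylators with no interaction on the log-odds scale are the rapid-group values multiplied by the slow-group nonsmoker odds ratio. The names `or_rapid`, `or_slow`, `expected_slow` and `interaction_ratio` are introduced here for illustration only.

```r
# Odds ratios copied from the `ss` table above, in the order
# Nonsmoker, Occasional smoker, Former smoker, Current smoker.
or_rapid <- c(1.0, 1.2, 2.4, 5.2)
or_slow  <- c(0.9, 1.6, 4.1, 7.5)

# Assumption: all ORs share one reference group (rapid nonsmokers).
# With no interaction on the log-odds scale, the genotype effect is a
# constant multiplier, taken here from the nonsmoker column.
expected_slow <- or_slow[1] * or_rapid

# Ratio of observed to expected; values away from 1 suggest departure
# from a purely multiplicative model (ignoring sampling error).
interaction_ratio <- or_slow / expected_slow

data.frame(
    Group = c("Nonsmoker", "Occasional smoker", "Former smoker", "Current smoker"),
    observed = or_slow,
    expected = expected_slow,
    ratio = round(interaction_ratio, 2)
)
```

Even if every ratio were 1, the two lines would still move apart on a linear odds-ratio axis, so the visual gap on its own does not establish an interaction.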

Here we develop a simulation to examine whether an interaction is needed at all to explain this pattern. The model is

\[ \text{logit}(P(Y=1)) = \beta_0 + \beta_{H,Y} H + \beta_{U,Y} U \]

\[ H = \beta_{N,H} N \left( \beta_{S,H} C + \beta_{U,H} U + E_H \right) \]

\[ N = \beta_{G_N,N} G_N + E_N \]

Smoking is represented as cigarettes per day, where \(G_I\) influences ever/never smoking and \(G_C\) influences the number of cigarettes smoked per day among smokers.

\[ \text{logit}(P(S=1)) = \beta_{G_I,S} G_I + \beta_{U,S} U + E_S \]

\(C_i = 0\) if \(S_i = 0\), otherwise

\[ C = \beta_{G_C,C} G_C + \beta_{U,C} U + E_C \]

where

  • \(N\) is the NAT2 gene activity level
  • \(H\) is the level of heterocyclic amines / carcinogens
  • \(S\) is smoking initiation (0 for never, 1 for ever)
  • \(C\) is the number of cigarettes smoked per day
  • \(Y\) is bladder cancer status (0/1)
  • \(G_N\) is the NAT2 genotype (0 for rapid, 1 for slow)

The 8:18415371:A:G variant (rs1495741) is an eQTL for NAT2 expression and is also associated with bladder cancer risk (https://www.ebi.ac.uk/gwas/variants/rs1495741).

dgm <- function(b_0, b_hy, b_uy, b_sh, b_nh, b_gcc, b_gis, b_gnn, b_us, b_uc, b_uh, n) {
    # Unmeasured confounder and genotypes
    U <- runif(n)
    Gc <- rbinom(n, 2, 0.5)      # cigarettes-per-day variant
    Gi <- rbinom(n, 2, 0.5)      # smoking-initiation variant
    Gn <- rbinom(n, 1, 0.5) + 1  # NAT2 genotype, coded 1 or 2
    # Smoking initiation (ever/never)
    logit_S <- b_gis * Gi + b_us * U + rnorm(n)
    S <- rbinom(n, 1, plogis(logit_S))
    # Cigarettes per day: 0 for never-smokers, Poisson among ever-smokers
    C <- ifelse(S == 0, 0, rpois(n, lambda = b_gcc * Gc + b_uc * U))
    # NAT2 activity
    N <- b_gnn * Gn + rnorm(n)
    # Carcinogen level: intake scaled by NAT2 activity
    H_intake <- b_sh * C + b_uh * U + rnorm(n)
    H <- H_intake * N * b_nh
    # Bladder cancer status
    logit_p <- b_0 + b_hy * H + b_uy * U
    p <- plogis(logit_p)
    Y <- rbinom(n, 1, p)
    Ccat <- cut(C, breaks = c(-Inf, 0, 1, 2, 3, Inf), labels = c("0", "1", "2", "3", "4+"))
    tibble(Y = Y, H = H, U = U, S = S, C = C, N = N, Gc = Gc, Gi = Gi, Gn = Gn, Ccat = Ccat)
}

estimation_gc2005 <- function(data) {
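    # Logistic regression of bladder cancer on cigarettes per day, NAT2 genotype
    # and their interaction; the C:Gn term tests whether the effect of smoking
    # on the log-odds differs by genotype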
    model <- glm(Y ~ C * Gn, data = data, family = binomial)
    summary(model)
}

dat <- dgm(
    b_0 = -3,
    b_hy = 0.5,
    b_uy = 0,
    b_sh = 0.4,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 1.0,
    b_us = 0,
    b_uc = 0,
    b_uh = 0,
    n = 1000000
)

dat
# A tibble: 1,000,000 × 10
       Y       H     U     S     C     N    Gc    Gi    Gn Ccat 
   <int>   <dbl> <dbl> <int> <dbl> <dbl> <int> <int> <dbl> <fct>
 1     0 -0.0510 0.782     1     0 0.932     0     2     2 0    
 2     0  0.0204 0.640     0     0 1.05      0     1     1 0    
 3     0  0.438  0.780     1     1 1.95      1     1     2 1    
 4     0  0.112  0.306     1     1 1.31      1     2     1 1    
 5     0 -0.193  0.659     1     0 1.48      0     1     1 0    
 6     0  0.790  0.541     0     0 2.42      2     1     2 0    
 7     0 -0.0316 0.951     1     0 0.669     0     1     1 0    
 8     0  0.119  0.346     1     0 0.437     1     1     1 0    
 9     0  0.180  0.791     0     0 1.78      2     1     1 0    
10     0 -0.161  0.154     0     0 1.26      0     0     2 0    
# ℹ 999,990 more rows
estimation_gc2005(dat)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -3.004566   0.017180 -174.886  < 2e-16 ***
C            0.013845   0.012373    1.119   0.2631    
Gn           0.020208   0.010826    1.867   0.0619 .  
C:Gn         0.033749   0.007703    4.381 1.18e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 401845  on 999999  degrees of freedom
Residual deviance: 401525  on 999996  degrees of freedom
AIC: 401533

Number of Fisher Scoring iterations: 5
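
In `dgm`, `H` is the product of `H_intake` and `N`, so the log-odds slope for `C` at a given `N` is `b_hy * b_nh * b_sh * N`. As a rough sanity check on the fit above, the sketch below evaluates that slope at the expected value of `N` for each genotype (`b_gnn * Gn`) and sets it beside the genotype-specific slopes implied by the printed coefficients. Agreement is expected to be approximate only, because the fitted model averages over the noise in `N` and `H_intake`. The variable names are illustrative.

```r
# Parameters used in the first call to dgm() above.
b_hy <- 0.5
b_nh <- 0.2
b_sh <- 0.4
b_gnn <- 1.0
Gn <- c(1, 2)

# Slope of the log-odds in C, evaluated at E[N | Gn] = b_gnn * Gn.
slope_dgm <- b_hy * b_nh * b_sh * b_gnn * Gn

# Genotype-specific slopes from the printed glm coefficients:
# beta_C + beta_{C:Gn} * Gn.
beta_c <- 0.013845
beta_c_gn <- 0.033749
slope_fit <- beta_c + beta_c_gn * Gn

data.frame(Gn = Gn, slope_dgm = slope_dgm, slope_fit = slope_fit)
```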
table(dat$Ccat)

     0      1      2      3     4+ 
635304 173044 109178  51544  30930 
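
The `0` category mixes never-smokers with ever-smokers who draw zero cigarettes: in `dgm` the Poisson rate is `b_gcc * Gc + b_uc * U`, which is exactly 0 when `Gc = 0` and `b_uc = 0`. A minimal sketch of the expected share of ever-smokers with `C = 0` under the first parameter set follows, together with the proportion of ever-smokers this implies given the table above. The names `p_zero_given_smoker` and `p_ever_smoker` are introduced here for illustration.

```r
# First parameter set: b_gcc = 1, b_uc = 0, so lambda = b_gcc * Gc.
b_gcc <- 1.0
gc_vals <- 0:2
gc_prob <- dbinom(gc_vals, size = 2, prob = 0.5)  # Gc ~ Binomial(2, 0.5)

# P(C = 0 | S = 1): average the Poisson zero probability over Gc.
# dpois(0, lambda = 0) is 1, so every smoker with Gc = 0 has C = 0.
p_zero_given_smoker <- sum(gc_prob * dpois(0, lambda = b_gcc * gc_vals))

# Counts from the table above: P(C > 0) = P(S = 1) * (1 - P(C = 0 | S = 1)).
n_zero <- 635304
n_total <- 1000000
p_ever_smoker <- (1 - n_zero / n_total) / (1 - p_zero_given_smoker)

c(p_zero_given_smoker = p_zero_given_smoker, p_ever_smoker = p_ever_smoker)
```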
plot_dat <- function(dat) {
    ggplot(dat, aes(x = Ccat, y = Y, group = Gn, color = factor(Gn))) +
        geom_line(stat = "summary", fun = "mean") +
        geom_point(stat = "summary", fun = "mean") +
        labs(x = "Cigarettes per day (categorised)", y = "Proportion with bladder cancer", color = "NAT2 genotype") +
        theme_bw()
}
dat <- dgm(
    b_0 = -3,
    b_hy = 0.5,
    b_uy = 0,
    b_sh = 0,
    b_nh = 0.2,
    b_gcc = 1.0,
    b_gis = 1.0,
    b_gnn = 1.0,
    b_us = 0.4,
    b_uc = 0.4,
    b_uh = 0.8,
    n = 5000000
)
estimation_gc2005(dat)

Call:
glm(formula = Y ~ C * Gn, family = binomial, data = data)

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -2.9973872  0.0078890 -379.945   <2e-16 ***
C           -0.0012968  0.0052537   -0.247    0.805    
Gn           0.0515205  0.0049547   10.398   <2e-16 ***
C:Gn        -0.0004443  0.0032996   -0.135    0.893    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2017456  on 4999999  degrees of freedom
Residual deviance: 2017296  on 4999996  degrees of freedom
AIC: 2017304

Number of Fisher Scoring iterations: 5
plot_dat(dat)
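
The simulated plot shows the proportion of cases, whereas the opening figure is on the odds-ratio scale. To compare the two, the cell proportions can be converted to odds ratios relative to a reference cell. The helper `or_by_group` below is hypothetical (not part of the original analysis) and is demonstrated on a small toy data frame rather than on the simulated data.

```r
# Hypothetical helper: odds ratio for each (Ccat, Gn) cell relative to
# the reference cell (first Ccat level, lowest Gn value).
# Cells with a case proportion of exactly 0 or 1 give odds of 0 or Inf.
or_by_group <- function(data) {
    p <- aggregate(Y ~ Ccat + Gn, data = data, FUN = mean)
    p$odds <- p$Y / (1 - p$Y)
    ref <- p$odds[p$Ccat == levels(data$Ccat)[1] & p$Gn == min(p$Gn)]
    p$OR <- p$odds / ref
    p
}

# Toy data with known cell proportions (10 observations per cell).
toy <- data.frame(
    Ccat = factor(rep(c("0", "0", "1", "1"), each = 10)),
    Gn = rep(c(1, 2, 1, 2), each = 10),
    Y = c(
        rep(c(1, 0), c(2, 8)),  # Ccat 0, Gn 1: 2 cases in 10
        rep(c(1, 0), c(2, 8)),  # Ccat 0, Gn 2: 2 cases in 10
        rep(c(1, 0), c(5, 5)),  # Ccat 1, Gn 1: 5 cases in 10
        rep(c(1, 0), c(8, 2))   # Ccat 1, Gn 2: 8 cases in 10
    )
)

toy_or <- or_by_group(toy)
toy_or
```

Applied to the simulated data, the same call would be `or_by_group(dat)`, since `dat` already carries the `Y`, `Ccat` and `Gn` columns.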


sessionInfo()
R version 4.6.0 (2026-04-24)
Platform: aarch64-apple-darwin23
Running under: macOS Sequoia 15.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_4.0.3 dplyr_1.2.1  

loaded via a namespace (and not attached):
 [1] vctrs_0.7.3        cli_3.6.6          knitr_1.51         rlang_1.2.0       
 [5] xfun_0.57          otel_0.2.0         generics_0.1.4     S7_0.2.2          
 [9] jsonlite_2.0.0     labeling_0.4.3     glue_1.8.1         htmltools_0.5.9   
[13] scales_1.4.0       rmarkdown_2.31     grid_4.6.0         evaluate_1.0.5    
[17] tibble_3.3.1       fastmap_1.2.0      yaml_2.3.12        lifecycle_1.0.5   
[21] compiler_4.6.0     RColorBrewer_1.1-3 htmlwidgets_1.6.4  pkgconfig_2.0.3   
[25] farver_2.1.2       digest_0.6.39      R6_2.6.1           utf8_1.2.6        
[29] tidyselect_1.2.1   pillar_1.11.1      magrittr_2.0.5     withr_3.0.2       
[33] tools_4.6.0        gtable_0.3.6