Regression with non i.i.d. samples

statistics
Author

Gibran Hemani

Published

October 25, 2022

If the individuals in my dataset are correlated with known correlation structure, how can I perform regression whilst accounting for that correlation structure?

\[ \beta = (X^T \rho^{-1} X)^{-1} X^T\rho^{-1}Y \]

The variance of the estimate will be

\[ \begin{aligned} Var(\beta) &= \sigma^2(X^T \rho^{-1} X) \\ &= \frac{(\hat{e}^T \rho^{-1} \hat{e})(X^T \rho^{-1} X)}{n-r} \end{aligned} \]

Adding to the complication, in this case it is the h2 estimates of a set of traits being correlated against number of GWAS hits for a set of traits, where the traits are in some way correlated. So the estimates of e.g. h2 are estimated with sampling error, so to account for that perhaps need to parametric bootstrap.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
library(mvtnorm)

#' Regression with samples that are not independent
#'
#' For our analysis we're regressing estimates against estimates,
#' and so it's a bit more complicated because the estimates each have an SE.
#' For now let's just ignore that but I think if we don't account for it we
#' will get a bit of regression dilution bias.
#'
#' @param x Vector of x values
#' @param y Vector of y values
#' @param rho correlation matrix
#' @param se_x SE of x evalues (Ignore)
#' @param se_y SE of y values (Ignore)
#' @param nboot Number of bootstraps to get standard error (Ignore)
#'
#' @return
#' @export
reg_nonind <- function(x, y, rho, se_x=NULL, se_y=NULL, nboot=NULL)
{
  X <- cbind(rep(1, length(x)), x)
  rho_inv <- solve(rho)
  #beta <- solve(t(X) %*% rho_inv %*% X) %*% t(X) %*% rho_inv %*% y
  beta <- ginv(t(X)%*%rho_inv %*%X)%*%t(X)%*%rho_inv %*%y
  yhat <- X %*% beta
  yres <- as.numeric(y - yhat)
  se <- as.numeric((t(yres) %*% rho_inv %*% yres) / (length(x)-qr(X)$rank)) * solve(t(X) %*% rho_inv %*% X)
  se <- sqrt(diag(se))
  return(tibble(
    param=c("intercept", "slope"), beta=beta, se=se, pval=pnorm(abs(beta)/se, lower.tail=FALSE)
  ))
  
  # get standard error via parametric bootstrap
  # betaboot <- matrix(0, nboot, 2)
  # for(i in 1:nboot)
  # {
  #   X[,2] <- rnorm(length(x), mean=x, sd=se_x)
  #   Y <- rnorm(length(y), mean=y, sd=se_y)
  #   betaboot[i,] <- solve(t(X) %*% rho_inv %*% X) %*% t(X) %*% rho_inv %*% Y %>% as.numeric()
  # }
  # se_boot <- apply(betaboot, 2, sd)
}

x <- runif(300)
y <- runif(300)
rho <- diag(300)
rho[lower.tri(rho)] <- rnorm(sum(lower.tri(rho)), sd=0.01)
rho[upper.tri(rho)] <- t(rho)[upper.tri(rho)]

reg_nonind(x, y, rho, 10)
# A tibble: 2 × 4
  param     beta[,1]     se pval[,1]
  <chr>        <dbl>  <dbl>    <dbl>
1 intercept   0.507  0.0332 8.42e-53
2 slope      -0.0522 0.0579 1.84e- 1
summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50453 -0.23397 -0.01499  0.22869  0.51654 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.51135    0.03319  15.407   <2e-16 ***
x           -0.05653    0.05748  -0.983    0.326    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2824 on 298 degrees of freedom
Multiple R-squared:  0.003235,  Adjusted R-squared:  -0.0001103 
F-statistic: 0.967 on 1 and 298 DF,  p-value: 0.3262