Linear mixed-effects models

Continuous case with continuous measurement error (Neyman-Scott problem)

Some historical comments: the Fisher and Pearson arguments about the method of moments (MOM) versus maximum likelihood (MLE), and the Fisher and Neyman arguments about testing and p-values. The Neyman-Scott problem is an example where Fisher's claims about the consistency and efficiency of the MLE break down.

Motivation

Animal breeding example (linear mixed models, LMM): Consider the model underlying one-way analysis of variance (ANOVA).

$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, $i = 1, 2, \ldots, n$ and $j = 1, 2$, $\varepsilon_{ij} \sim Normal(0, \sigma_{\varepsilon}^2)$.

There are two offspring per individual. We want to know something about the genetic potential of the individuals so that we can turn some into hamburgers and let the others have fun as studs/dams.

Parameters

These are the unknown quantities that we want to learn about from the data. In this model, the parameters are ($\mu, \alpha_1, \alpha_2, \ldots, \alpha_n, \sigma_{\varepsilon}^2$):

  • number of parameters: $n + 2$,
  • number of observations: $2 n$.

Question

It is easy to write down the likelihood function. What happens if we compute the MLEs of these parameters?
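The likelihood is simply a product of independent normal densities:

\[L(\mu, \alpha_1, \ldots, \alpha_n, \sigma_{\varepsilon}^2) = \prod_{i=1}^{n} \prod_{j=1}^{2} \frac{1}{\sqrt{2 \pi \sigma_{\varepsilon}^2}} \exp\left\{ -\frac{(Y_{ij} - \mu - \alpha_i)^2}{2 \sigma_{\varepsilon}^2} \right\}\]

Maximizing it (with the usual identifiability constraint $\sum_{i} \alpha_i = 0$) gives familiar quantities from ANOVA, except for the estimate of the variance: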

\[\hat{\mu} = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{2} Y_{ij}\]
\[\hat{\alpha}_{i} = \frac{1}{2} (Y_{i1} + Y_{i2}) - \hat{\mu}\]
\[\hat{\sigma}_{\varepsilon}^2 = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{2} (Y_{ij} - \hat{\mu} - \hat{\alpha}_{i})^2\]

One of the most important results from Neyman and Scott (1948) is that as the sample size increases, $\hat{\sigma}_{\varepsilon}^2 \rightarrow \sigma_{\varepsilon}^2 / 2$. It is also almost obvious that we cannot estimate the $\alpha_i$ consistently (although $\hat{\alpha}_{i}$ is an unbiased estimator): each $\alpha_i$ is informed by only two observations, no matter how large $n$ becomes.
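A quick simulation sketch illustrates this (the mean 2.5 and the spread of the $\alpha_i$ below are arbitrary illustrative choices): the MLE of the error variance settles near $\sigma_{\varepsilon}^2 / 2$, not $\sigma_{\varepsilon}^2$, however large $n$ gets.

## simulation check of the Neyman-Scott result:
## the MLE of the error variance converges to sigma_sq / 2
set.seed(1)
sigma_sq <- 1                                  # true error variance
for (n in c(100, 1000, 10000)) {
    alpha <- rnorm(n)                          # individual effects
    Y <- matrix(2.5 + alpha, n, 2) +           # mu + alpha_i in both columns
        matrix(rnorm(2 * n, sd = sqrt(sigma_sq)), n, 2)
    sigma_sq_hat <- mean((Y - rowMeans(Y))^2)  # MLE based on within-individual residuals
    cat("n =", n, "  sigma_sq_hat =", round(sigma_sq_hat, 3), "\n")
}
## the estimates approach sigma_sq / 2 = 0.5 rather than sigma_sq = 1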

Increasing the sample size does not guarantee good estimators. What is important is that the information in the sample about the parameter should converge to infinity.

In this model, the number of parameters is increasing at the same rate as the sample size (the information). Hence we have limited information about any particular parameter. We are spending the information profligately. This kind of situation is not unusual in practice.

  • Logistic regression and multi-center clinical studies: Each hospital has only a few patients and we want to combine information across the hospitals.
  • Combining data across large geographic areas in abundance surveys: a random effect for each area and the effects of the covariates in a Poisson regression.

What can we do? We need more information but more data are not going to give us more information. In mathematics the solution always exists: assume!

These assumptions should, in effect, reduce the number of parameters, ideally to the point where the information increases sufficiently faster than the number of parameters. Usually we make assumptions such that the final number of parameters does not change with the sample size, but this is not necessary.

Smoothness assumptions:

  • regression assumption,
  • random effects assumption (written out below for the animal breeding example).
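For the animal breeding example, the random effects assumption replaces the $n$ free parameters $\alpha_1, \ldots, \alpha_n$ by a single distributional assumption,

\[\alpha_i \sim Normal(0, \tau^2), \quad i = 1, \ldots, n,\]

so the unknown parameters reduce to $(\mu, \sigma_{\varepsilon}^2, \tau^2)$: a fixed number, no matter how large $n$ is. The $\alpha_i$ then become latent variables rather than parameters (see below).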

WARNING: This has nothing to do with Bayesian thinking. These are simply modeling assumptions. The effect of these assumptions (e.g. the distribution of the latent variable) does not go away as we increase the sample size. On the other hand, as we have seen repeatedly, the effect of the prior (which is also an assumption) vanishes as the information in the data increases.

There is no such thing as a “Bayesian model” or a “frequentist model”. There are stochastic models; there are deterministic models; there are descriptive models. Some Bayesians claim that specifying the prior on the parameters is on the same level as specifying a stochastic model for the data, and hence they consider all hierarchical models to be “Bayesian models”.

We do not agree with this. As we have pointed out, the effect of the modeling assumption does not vanish as we increase the sample size, whereas the effect of the prior does.

Unknown quantities in the stochastic models: We have come across two different types of unknown quantities in the hierarchical models: latent variables and parameters.

  • Parameters: These are quantities in the model that can be estimated with perfect certainty as the information in the data increases to infinity.
  • Latent variables: No amount of data can determine these with certainty. The uncertainty does not go to zero.

Analogy with estimation and prediction in time series or regression: If we have a large amount of data (and the model is correct), then the standard errors of the parameter estimates go to zero but the prediction error does not. Latent variables in hierarchical models are similar to predictions of unobserved time points.

Imposing a distributional assumption on the latent variables is qualitatively the same as imposing a regression model. It is not a ‘prior distribution’ of any kind; calling it one is misleading terminology commonly used in the Bayesian literature.

Prior distributions are smoothness assumptions on the parameters and their effect goes to zero as the information increases.

Data generation

set.seed(1234)
n <- 100                  # number of individuals
m <- 1                    # number of offspring (observations) per individual
mu <- 2.5                 # overall mean
sigma_sq <- 0.2^2         # measurement error variance
eps <- rnorm(n * m, mean = 0, sd = sqrt(sigma_sq))
tau_sq <- 0.5^2           # variance of the individual effects
alpha <- rnorm(n, mean = 0, sd = sqrt(tau_sq))
Y <- mu + alpha + eps     # alpha is recycled across the m columns
dim(Y) <- c(n, m)
summary(Y)

Without a smoothing assumption:

library(dclone)
model <- custommodel("model {
    for (i in 1:n) {
        for (j in 1:m) {
            Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
        }
        alpha[i] ~ dnorm(0, 0.001)
    }
    log_sigma ~ dnorm(0, 0.001)
    sigma_sq <- exp(log_sigma)^2
    mu ~ dnorm(0, 0.1)
}")
dat <- list(Y = Y, n = n, m = m)
fit <- jags.fit(data = dat, params = c("mu", "sigma_sq","alpha"), 
    model = model, n.update = 30000)
summary(fit[,c("mu","sigma_sq")])
plot(fit[,c("mu","sigma_sq")])
pairs(fit[,c("mu","sigma_sq")])

plot(fit[,c("alpha[1]","alpha[100]")])

With the smoothing assumption:

library(dclone)
model <- custommodel("model {
    for (i in 1:n) {
        for (j in 1:m) {
            Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
        }
        alpha[i] ~ dnorm(0, 1 / tau_sq)
    }
    log_sigma ~ dnorm(0, 0.001)
    sigma_sq <- exp(log_sigma)^2
    mu ~ dnorm(0, 0.1)
    log_tau ~ dnorm(0, 0.001)
    tau_sq <- exp(log_tau)^2
}")
dat <- list(Y = Y, n = n, m = m)
fit <- jags.fit(data = dat, 
    params = c("mu", "sigma_sq", "tau_sq", "alpha"), 
    model = model, n.update = 30000)
summary(fit[,c("mu","sigma_sq", "tau_sq")])
plot(fit[,c("mu","sigma_sq", "tau_sq")])
pairs(fit[,c("mu","sigma_sq", "tau_sq")])

plot(fit[,c("alpha[1]","alpha[100]")])

## true random effects vs. their posterior means
plot(alpha[1:n], coef(fit)[grep("alpha", varnames(fit))], col = 2)
abline(0, 1)

Data cloning (DC) with the smoothing assumption:

model <- custommodel("model {
    for (k in 1:K) {
        for (i in 1:n) {
            for (j in 1:m) {
                Y[i,j,k] ~ dnorm(mu + alpha[i,k], 1 / sigma_sq)
            }
            alpha[i,k] ~ dnorm(0, 1 / tau_sq)
        }
    }
    log_sigma ~ dnorm(0, 0.001)
    sigma_sq <- exp(log_sigma)^2
    mu ~ dnorm(0, 0.1)
    log_tau ~ dnorm(0, 0.001)
    tau_sq <- exp(log_tau)^2
    sum <- sigma_sq + tau_sq
}")
dat <- list(Y = dcdim(array(Y, c(n, m, 1))), n = n, 
    m = m, K = 1)
str(dat)
K <- c(1, 10, 25)
dcfit1 <- dc.fit(data = dat, 
    params = c("mu", "sigma_sq", "tau_sq"), 
    model = model, n.iter = 1000,
    n.clones = K,
    unchanged = c("n", "m"), multiply = "K")
dcfit2 <- dc.fit(data = dat, 
    params = c("mu", "sum"), 
    model = model, n.iter = 1000,
    n.clones = K,
    unchanged = c("n", "m"), multiply = "K")
dcdiag(dcfit1)
dcdiag(dcfit2)
plot(dcfit1[,c("sigma_sq", "tau_sq")])
pairs(dcfit1[,c("sigma_sq", "tau_sq")])
cov2cor(vcov(dcfit1))
cov2cor(vcov(dcfit2))
pairs(dcfit1)
pairs(dcfit2)

plot(dctable(dcfit1))
plot(dctable(dcfit2))
coef(dcfit1)
coef(dcfit2)
c(mu=mu, sigma_sq=sigma_sq, tau_sq=tau_sq, sum=sigma_sq + tau_sq)
## m=1 leads to error:
## Error: number of levels of each grouping factor must be < number of observations
if (m > 1) {
    library(lme4)
    g <- rep(1:n, m)
    Yvec <- as.numeric(Y)
    mod.lm <- lmer(Yvec ~ 1 + (1|g))
    summary(mod.lm)
    plot(alpha[1:n], ranef(mod.lm)$g[,1], col = 2)
    abline(0, 1)
}
  • Run things with m = 1, check diagnostics, estimates.
  • Run things with m = 2, check diagnostics, estimates (see the sketch below for regenerating the data with m = 2).
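A minimal sketch for the m = 2 run, re-using n, mu, sigma_sq, and tau_sq from the data generation block above; after regenerating Y, the fits above (including the lme4 comparison, which now runs since m > 1) can be repeated unchanged:

set.seed(1234)
m <- 2                    # now two offspring per individual
eps <- rnorm(n * m, mean = 0, sd = sqrt(sigma_sq))
alpha <- rnorm(n, mean = 0, sd = sqrt(tau_sq))
Y <- mu + alpha + eps     # alpha is recycled across the m columns
dim(Y) <- c(n, m)
summary(Y)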

How do we predict $\alpha_i$ based on data cloning? In the cloned model, each alpha[i] is replicated across the $K$ clones (alpha[i,k]), so the cloned fit does not directly give a prediction for the original $\alpha_i$. We need a separate run for prediction, where the parameters are drawn from the multivariate normal approximation to the sampling distribution of the DC estimates:

model <- custommodel("model {
    for (i in 1:n) {
        for (j in 1:m) {
            Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
        }
        alpha[i] ~ dnorm(0, 1 / tau_sq)
    }
    param[1:3] ~ dmnorm(cf[1:3], V[1:3,1:3])
    mu <- param[1]
    log_sigma <- param[2]
    sigma_sq <- exp(log_sigma)^2
    log_tau <- param[3]
    tau_sq <- exp(log_tau)^2
}")
## we need parameters on log scale
## calculate covariance matrix by hand
## create a matrix of the posterior samples
pos <- as.matrix(dcfit1)
head(pos)
pos[,"sigma_sq"] <- log(sqrt(pos[,"sigma_sq"]))
pos[,"tau_sq"] <- log(sqrt(pos[,"tau_sq"]))
colnames(pos)[2:3] <- c("log_sigma", "log_tau")
head(pos)
(V <- cov(pos) * nclones(dcfit1))
(cf <- colMeans(pos))
dat <- list(Y = Y, n = n, m = m,
    cf = cf, V = solve(V)) # precision matrix
pred <- jags.fit(data = dat, 
    params = c("alpha"), 
    model = model)

alpha_b <- cbind(est=coef(fit)[grep("alpha", varnames(fit))],
    t(quantile(fit[,grep("alpha", varnames(fit))], 
    probs = c(0.025, 0.975))))
alpha_dc <- cbind(est=coef(pred),
    t(quantile(pred, probs = c(0.025, 0.975))))
head(alpha_b)
head(alpha_dc)

## sorted true alpha (line), with Bayesian (red) and DC (blue)
## predictions and their 95% intervals
plot(1:n, alpha[order(alpha)], type = "l", 
    ylim = range(alpha, alpha_b, alpha_dc))
points(1:n - 0.2, alpha_b[order(alpha),1], 
    col = 2, pch = 19, cex = 0.5)
segments(x0 = 1:n - 0.2, x1 = 1:n - 0.2,
    y0 = alpha_b[order(alpha),2], 
    y1 = alpha_b[order(alpha),3], col = 2)
points(1:n + 0.2, alpha_dc[order(alpha),1], 
    col=4, pch = 19, cex = 0.5)
segments(x0 = 1:n + 0.2, x1 = 1:n + 0.2,
    y0 = alpha_dc[order(alpha),2], 
    y1 = alpha_dc[order(alpha),3], col = 4)

## coverage check: a row sum of 0 means the true alpha_i is inside
## the 95% interval; -2 means below the lower limit, 2 means above
table(rowSums(sign(alpha - alpha_b[,-1])))
table(rowSums(sign(alpha - alpha_dc[,-1])))

To check sensitivity to the prior specification, refit the Bayesian model with a different set of priors and compare the resulting predictions of $\alpha_i$:

model <- custommodel("model {
    for (i in 1:n) {
        for (j in 1:m) {
            Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
        }
        alpha[i] ~ dnorm(0, 1 / tau_sq)
    }
    sigma_sq ~ dunif(0.001, 5)
    mu ~ dnorm(10, 1)
    tau_sq ~ dgamma(0.01, 0.01)
}")
dat <- list(Y = Y, n = n, m = m)
fit2 <- jags.fit(data = dat, 
    params = c("mu", "sigma_sq", "tau_sq", "alpha"), 
    model = model, n.update = 30000)
alpha_b2 <- cbind(est=coef(fit2)[grep("alpha", varnames(fit2))],
    t(quantile(fit2[,grep("alpha", varnames(fit2))], 
    probs = c(0.025, 0.975))))

## sorted true alpha (line), with Bayesian predictions under the
## original priors (red) and the alternative priors (green)
plot(1:n, alpha[order(alpha)], type = "l", 
    ylim = range(alpha, alpha_b, alpha_b2))
points(1:n - 0.2, alpha_b[order(alpha),1], 
    col = 2, pch = 19, cex = 0.5)
segments(x0 = 1:n - 0.2, x1 = 1:n - 0.2,
    y0 = alpha_b[order(alpha),2], 
    y1 = alpha_b[order(alpha),3], col = 2)
points(1:n + 0.2, alpha_b2[order(alpha),1], 
    col=3, pch = 19, cex = 0.5)
segments(x0 = 1:n + 0.2, x1 = 1:n + 0.2,
    y0 = alpha_b2[order(alpha),2], 
    y1 = alpha_b2[order(alpha),3], col = 3)