Hierarchical models

Introduction

We are now familiar with the basic concepts of statistical inference and the two philosophies that are commonly adopted to make the inferential statements. In this lecture, we will look at making inferential statements about realistic and hence complex ecological models. In the rest of the course, we will write the description of the model but will not discuss the philosophical aspects in detail. We will mostly use a graphical model and its JAGS version. We will provide tools to obtain either Bayesian or Frequentist inferential statements. We will discuss pros and cons of these inferential statements. The choice of the inferential statement will be left to the scientist.

Occupancy models with detection error

Let us continue with the simple occupancy model we used previously. Most applied ecologists are aware that the occupancy and abundance surveys have some level of detection error. Even if the species is present, for various reasons we may not observe its presence. Similarly we may not be able to count all the individuals that are present at a location. Let us look at how to model such a situation. We will discuss the model and then show how it can be looked upon as a hierarchical model.

Notation

  • $W_i$: this denotes the observed status at the location $i$, can be 0 or 1,
  • $Y_i$: this denotes the true status at the location $i$, can be 0 or 1; this status is unknown.

Assumptions

  1. The observed status depends on the true status. If there is no dependence between the two variables, obviously we cannot do any inference.
  2. There are no ``phantom’’ individuals. That is, if the true status is 0, we will observe 0 with probability 1.
  3. True status at one location is independent of status of other locations.
  4. Observation at one location is not affected by what we observed anywhere else (or, at other times at that location). Surveys are independent of each other.

We can extend the Bernoulli model from Lecture 1 as follows:

  • True status: \(Y_i \sim Bernoulli(\varphi)\).
  • Observed status: \((W_i \mid Y_i = y_i) \sim Bernoulli(p^{y_i} (1 - p)^{1 - y_i})\).

An important thing to note here is that we only observe $W$’s and not the true statuses ($Y$) which are unknown. We can use the standard probability rules to compute:

\[\begin{align} P(W_i = 1) & = P(W_i = 1 \mid Y_i = 1) P(Y_i = 1) + P(W_i = 1 \mid Y_i = 0) P(Y_i = 0) \\ & = p \varphi + 0 \cdot (1 - \varphi) \\ & = p \varphi \end{align}\] \[\begin{align} P(W_i = 0) & = P(W_i = 0 \mid Y_i = 1) P(Y_i = 1) + P(W_i = 0 \mid Y_i = 0) P(Y_i = 0) \\ & = 1 - p \varphi \end{align}\]

This is called the marginal distribution of $W$. We can write down the likelihood function as a function of parameters $(p, \varphi)$.

\[\begin{align} L(p, \varphi; w_{1}, w_{2}, \ldots, w_{n}) & = \prod_{i=1}^{n} P(W_i = w_i; p, \varphi) \\ & = \prod_{i=1}^{n} (p \varphi)^{w_i} (1 - p \varphi)^{1 - w_i} \end{align}\]

Cautionary note

Just because one can write down the likelihood function, it does not mean one can estimate the parameters.

This is a simple situation with two parameters and hence we can plot the likelihood function as a contour plot.

R code for data generation:

set.seed(4321)
n <- 100
p <- 0.6
phi <- 0.4
y <- rbinom(n = n, size = 1, prob = phi)
w <- rbinom(n = n, size = y, prob = p)
table(Y = y, W = w)

Given the data, plot the likelihood contours.

## setting up the grid for p and phi
grid <- expand.grid(p = seq(0, 1, by = 0.01),
    phi = seq(0, 1, by = 0.01), 
    L = NA)
## the likelihood function
L_fun <- function(w, p, phi) {
    prod((p * phi)^w * (1 - p * phi)^(1 - w))
}
## calculating the likelihood for the grid
for (i in 1:nrow(grid)) {
    grid$L[i] <- L_fun(w = w, p = grid$p[i], phi = grid$phi[i])
}
## plot the likelihood surface
dcpal_reds <- colorRampPalette(c("#f9f2f4", "#c7254e"))
L_mat <- matrix(grid$L, sqrt(nrow(grid)))
image(L_mat, 
    xlab = "p", ylab = expression(varphi),
    col = dcpal_reds(12))
abline(h = phi, v = p, col = "#f9f2f4", lwd = 3)
abline(h = phi, v = p, col = "#c7254e", lwd = 1)
curve((p * phi) / x, 0, 1, add = TRUE, 
    col = "#18bc9c", lwd = 2)

Likelihood surface

We can see that the likelihood function looks like a mountain with a ridge tracing a curve corresponding to the product $p \varphi = c$.

library(rgl)
open3d()
bg3d("white")
material3d(col = "black")
dcpal_grbu <- colorRampPalette(c("#18bc9c", "#3498db"))
Col <- rev(dcpal_grbu(12))[cut(L_mat, breaks = 12)]
persp3d(L_mat / max(L_mat), col = Col,
    theta=50, phi=25, expand=0.75, ticktype="detailed",
    xlab = "p", ylab = "phi", zlab = "L")
  • Likelihood function does not have a unique maximum. All values along this curve have equal support in the data. We can estimate the product but not the individual components of the product.
  • The placement of the curve depends on the data. So there is information in the data only about the product but not the components.

When the likelihood function attains maximum at more than one parameter combination, we call the parameters non-estimable. There are various reasons for such non-estimability (Reference: Campbell and Lele, 2013 and a couple of references from that paper).

Structural problems with the model: it might be that the structure of the problem is such that no matter what, you cannot estimate the parameters. This is called non-identifiability.

Sometimes there are no structural issues but the observed data combination is such that the likelihood is problematic. This is called non-estimability. An example will be collinear covariates in regression.

Consequences of non-identifiability: management decisions can be based only on identifiable components of the model.

For models with more than two parameters, it is very difficult to plot the likelihood function. It is nearly impossible to diagnose non-identifiability and non-estimability of the parameters. Data cloning method provides a very simple approach to diagnose non-estimability for general hierarchical models.

We can skip all the mathematical details in the calculation of the likelihood function and use JAGS and MCMC to do almost all of the above analysis.

Bayesian model in JAGS

library(dclone)
library(rjags)
model <- custommodel("model {
    for (i in 1:n) {
        Y[i] ~ dbern(phi)
        W[i] ~ dbern(Y[i] * p)
    }
    #p ~ dunif(0.001, 0.999)
    #phi ~ dunif(0.001, 0.999)
    p ~ dbeta(1, 1)
    phi ~ dbeta(0.5, 0.5)
}")
dat <- list(W = w, n = n)
#ini <- list(Y = w)
ini <- list(Y = rep(1, n))
fit <- jags.fit(data = dat, params = c("p", "phi"), 
    model = model, init = ini)
summary(fit)
plot(fit)
pairs(fit)

MCMC output

Bayesian inference

Observe what happens to convergence diagnostics.

Data cloning

To make sure that both locations and klones are independent (i.i.d.), it is safest to include and extra dimension and the corresponding loop.

library(dclone)
library(rjags)
model <- custommodel("model {
    for (k in 1:K) {
        for (i in 1:n) {
            Y[i,k] ~ dbern(phi)
            W[i,k] ~ dbern(Y[i,k] * p)
        }
    }
    #p ~ dunif(0.001, 0.999)
    #phi ~ dunif(0.001, 0.999)
    p ~ dbeta(1, 1)
    phi ~ dbeta(0.5, 0.5)
}")
dat <- list(W = dcdim(data.matrix(w)), n = n, K = 1)
ini <- list(Y = dcdim(data.matrix(w)))
ifun <- function(model, n.clones) {
    dclone(list(Y = dcdim(data.matrix(w))),
        n.clones)
}
dcfit <- dc.fit(data = dat, params = c("p", "phi"), 
    model = model, inits = ini,
    n.clones = c(1,2,4,8), unchanged = "n", multiply = "K",
    initsfun = ifun, n.iter = 10000)
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
pairs(dcfit)

Modification

If locations are treated as i.i.d., it is possible to replicate the vector, so that length becomes n * K.

model <- custommodel("model {
    for (i in 1:n) {
        Y[i] ~ dbern(p)
        W[i] ~ dbern(Y[i] * phi)
    }
    p ~ dunif(0.001, 0.999)
    phi ~ dunif(0.001, 0.999)
}")
dat <- list(W = w, n = n)
ini <- list(Y = w)
ifun <- function(model, n.clones) {
    dclone(list(Y = w), n.clones)
}
dcfit <- dc.fit(data = dat, params = c("p", "phi"), 
    model = model, inits = ini,
    n.clones = c(1,2,4,8), multiply = "n",
    initsfun = ifun)

Data cloning inference

Observe what happens to the standard errors as we increase the number of clones. It does not converge to 0 as it did before. This indicates non-estimabilty of the parameters.

Can we do something about this non-identifiability?

Suppose we go to the same location more than once, say $T$ times. Then sometimes we will observe the species and sometimes we will not. These changes may help us learn about the detection error process.

The occupancy model with replicate visits is:

  • True status: \(Y_i \sim Bernoulli(\varphi)\).
  • Observed status: \((W_{i,t} \mid Y_i = 1) \sim Bernoulli(p)\) and \(W_{i,t} \mid Y_i = 0\) equals 0 with probability 1.

The likelihood function is:

\[L(p, \varphi; w_{1,1}, \ldots, w_{n,T}) = \prod_{i=1}^{n} \left[ \varphi \left( \binom{Y}{w_{i \cdot}} p^{w_{i \cdot}} (1 - p)^{T - w_{i \cdot}} \right) + (1 - \varphi) I(w_{i \cdot} = 0)\right]\]

where \(w_{i \cdot} = \sum^{t=1}_{T} w_{i,t}\) and \(I( w_{i \cdot} = 0 )\) is an indicator function that is equal to one if \(w_{i \cdot} = 0\).

Assumptions

  1. Closed population assumption: there is colonization or extinction, that is the true status remains the same over the visits.
  2. Independent survey assumption: replicate visits are independent of each other.

R code for data generation:

set.seed(1234)
n <- 50
T <- 5
p <- 0.6
phi <- 0.4
y <- rbinom(n = n, size = 1, prob = phi)
w <- matrix(NA, n, T)
for (t in 1:T)
    w[,t] <- rbinom(n = n, size = y, prob = p)

Given the data, plot the likelihood contours.

## setting up the grid for p and phi
grid <- expand.grid(p = seq(0, 1, by = 0.01),
    phi = seq(0, 1, by = 0.01), 
    L = NA)
## the likelihood function
L_fun <- function(w, p, phi) {
    wdot <- rowSums(w)
    T <- ncol(w)
    prod(phi * (choose(T, wdot) * p^wdot * (1 - p)^(T - wdot)) + 
        (1 - phi) * (wdot == 0))
}
## calculating the likelihood for the grid
for (i in 1:nrow(grid)) {
    grid$L[i] <- L_fun(w = w, p = grid$p[i], phi = grid$phi[i])
}
## plot the likelihood surface
dcpal_reds <- colorRampPalette(c("#f9f2f4", "#c7254e"))
L_mat <- matrix(grid$L, sqrt(nrow(grid)))
image(L_mat, 
    xlab = "p", ylab = expression(varphi),
    col = dcpal_reds(12))
abline(h = phi, v = p, col = "#f9f2f4", lwd = 3)
abline(h = phi, v = p, col = "#c7254e", lwd = 1)

Likelihood surface 2

library(rgl)
open3d()
bg3d("white")
material3d(col = "black")
dcpal_grbu <- colorRampPalette(c("#18bc9c", "#3498db"))
Col <- rev(dcpal_grbu(12))[cut(L_mat, breaks = 12)]
persp3d(L_mat / max(L_mat), col = Col,
    theta=50, phi=25, expand=0.75, ticktype="detailed",
    ylab = "p", xlab = "phi", zlab = "L")

Bayesian model in JAGS

library(dclone)
library(rjags)
model <- custommodel("model {
    for (i in 1:n) {
        Y[i] ~ dbern(phi)
        for (t in 1:T) {
            W[i,t] ~ dbern(Y[i] * p)
        }
    }
    p ~ dunif(0.001, 0.999)
    phi ~ dunif(0.001, 0.999)
}")
dat <- list(W = w, n = n, T = T)
ini <- list(Y = ifelse(rowSums(w) > 0, 1, 0))
fit <- jags.fit(data = dat, params = c("p", "phi"), 
    model = model, inits = ini)
summary(fit)
plot(fit)
pairs(fit)

MCMC output 2

Bayesian inference

Effect of priors on the estimation and prediction of the occupancy proportion:

model <- custommodel("model {
    for (i in 1:n) {
        Y[i] ~ dbern(p)
        for (t in 1:T) {
            W[i,t] ~ dbern(Y[i] * phi)
        }
    }
    p <- ilogit(logit_p)
    phi <- ilogit(logit_phi)
    logit_p ~ dnorm(-2, 0.01)
    logit_phi ~ dnorm(2, 0.01)
}")
dat <- list(W = w, n = n, T = T)
ini <- list(Y = ifelse(rowSums(w) > 0, 1, 0))
fit2 <- jags.fit(data = dat, params = c("p", "phi"), 
    model = model, inits = ini)
summary(fit2)
plot(fit2)
pairs(fit2)

Data cloning

Frequentist inference: Identifiability check, independence from the specification of the prior check, confidence intervals and predictions for the occupancy proportion.

library(dclone)
library(rjags)
model <- custommodel("model {
    for (k in 1:K) {
    for (i in 1:n) {
        Y[i,k] ~ dbern(phi)
        for (t in 1:T) {
            W[i,t,k] ~ dbern(Y[i,k] * p)
        }
    }
    }
    p ~ dunif(0.001, 0.999)
    phi ~ dunif(0.001, 0.999)
}")
dat <- list(W = dcdim(array(w, c(n,T,1))), n = n, T = T, K = 1)
ini <- list(Y = data.matrix(rep(1, n)))
ifun <- function(model, n.clones) {
    list(Y = dclone(dcdim(data.matrix(rep(1, n))), n.clones))
}
dcfit <- dc.fit(data = dat, params = c("p", "phi"), 
    model = model, inits = ini,
    n.clones = c(1,2,4,8), multiply = "K", unchanged = c("n","T"),
    initsfun = ifun)
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
pairs(dcfit)

Generalization to take into account covariates

$p$ and $\varphi$ can be a function of independent variables with values varying across the $n$ location, for example:

  • \(p_i = \frac{exp(\theta_0 + \theta_1 z_i)}{1 + exp(\theta_0 + \theta_1 z_i)}\).
  • \(\varphi_i = \frac{exp(\beta_0 + \beta_1 x_i)}{1 + exp(\beta_0 + \beta_1 x_i)}\),

R code for data generation:

set.seed(1234)
n <- 1000
x <- rnorm(n)
z <- rnorm(n)
beta <- c(0, 1)
theta <- c(0.2, -0.5)
p <- exp(theta[1] + theta[2] * z) / (1 + exp(theta[1] + theta[2] * z))
phi <- exp(beta[1] + beta[2] * x) / (1 + exp(beta[1] + beta[2] * x))
#p <- plogis(model.matrix(~z) %*% theta)
#phi <- plogis(model.matrix(~x) %*% beta)
y <- rbinom(n = n, size = 1, prob = phi)
w <- rbinom(n = n, size = y, prob = p)
table(Y = y, W = w)
naive <- glm(w ~ x, family = binomial("logit"))
summary(naive)
library(detect)
m <- svocc(w ~ x | z)
summary(m)
model <- custommodel("model {
    for (i in 1:n) {
        W[i] ~ dbin(p[i] * phi[i], K)
        logit(p[i]) <- inprod(Z[i,], theta)
        logit(phi[i]) <- inprod(X[i,], beta)
    }
    beta[1] ~ dnorm(0, 0.001)
    beta[2] ~ dnorm(0, 0.001)
    theta[1] ~ dnorm(0, 0.001)
    theta[2] ~ dnorm(0, 0.001)
}")
dat <- list(W = w, n = n, K = 1, 
    X = model.matrix(~x), Z = model.matrix(~z))
dcfit <- dc.fit(data = dat, 
    params = c("beta", "theta"), model = model,
    n.clones = c(1, 10), n.iter = 2000,
    unchanged = c("W", "n", "X", "Z"), multiply = "K")
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
pairs(dcfit)

For a quasi-Bayesian approach, see here how to utilize the naive estimator to stabilize single visit based estimates:

model <- custommodel("model {
    for (i in 1:n) {
        W[i] ~ dbin(p[i] * phi[i], K)
        logit(p[i]) <- inprod(Z[i,], theta)
        logit(phi[i]) <- inprod(X[i,], beta)
    }
    beta[1] ~ dnorm(naive[1], penalty)
    beta[2] ~ dnorm(naive[2], penalty)
    theta[1] ~ dnorm(0, 0.001)
    theta[2] ~ dnorm(0, 0.001)
}")
dat <- list(W = w, n = n, K = 1, 
    X = model.matrix(~x), Z = model.matrix(~z),
    naive = coef(naive), penalty = 0.5)
dcfit <- dc.fit(data = dat, 
    params = c("beta", "theta"), model = model,
    n.clones = c(1, 10, 100), 
    n.update = 5000, n.iter = 2000,
    unchanged = c("W", "n", "X", "Z", "naive", "penalty"), 
    multiply = "K")
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
pairs(dcfit)

Abundance surveys

We can easily generalize this to model abundance surveys. The N-mixture model is the simplest (though unrealistic in practice).

Assumptions

  • Replicate surveys,
  • independence,
  • closed population.

Specification of the hierarchical model

  • True abundance model: \(N_i \sim Poisson(\lambda)\) for locations $i=1,2, \ldots, n$.
  • Observation model: \((Y_{i,t} \mid N_i) \sim Binomial(N_i, p)\) for visits $t=1,2, \ldots, T$.
set.seed(1234)
n <- 200
T <- 1
p <- 0.6
lambda <- 4.2
N <- rpois(n = n, lambda = lambda)
Y <- matrix(NA, n, T)
for (t in 1:T) {
    Y[,t] <- rbinom(n = n, size = N, prob = p)
}
table(N = N, Y = apply(Y, 1, max))
library(dclone)
library(rjags)
model <- custommodel("model {
    for (i in 1:n) {
        N[i] ~ dpois(lambda)
        for (t in 1:T) {
            Y[i,t] ~ dbin(p, N[i])
        }
    }
    p ~ dunif(0.001, 0.999)
    lambda ~ dlnorm(0, 0.001)
}")
dat <- list(Y = Y, n = n, T = T)
ini <- list(N = apply(Y, 1, max) + 1)
fit <- jags.fit(data = dat, params = c("p", "lambda"), 
    n.update = 10000,
    model = model, inits = ini)
summary(fit)
gelman.diag(fit)
plot(fit)
pairs(fit)
ifun <- function(model, n.clones) {
    dclone(list(N = apply(Y, 1, max) + 1), n.clones)
}
dcfit <- dc.fit(data = dat, 
    params = c("p", "lambda"), model = model,
    inits = ini, initsfun = ifun,
    n.clones = c(1, 2, 4, 8), 
    unchanged = "T", multiply = "n")
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
pairs(dcfit)

As before it is easy to include covariates in the models. There are various extensions and modifications proposed to this basic model. See Lele et al., Solymos et al, Solymos and Lele, Dail and Madsen. (Here we can advertise our work on single survey method and the poster.)

Single visit abundance model with covariates:

set.seed(1234)
n <- 200
x <- rnorm(n)
z <- rnorm(n)
beta <- c(0.9, 0.5)
theta <- c(0.8, -0.5)
Z <- model.matrix(~z)
X <- model.matrix(~x)
p <- plogis(Z %*% theta)
lambda <- exp(X %*% beta)
N <- rpois(n = n, lambda = lambda)
Y <- rbinom(n = n, size = N, prob = p)
table(N = N, Y = Y)
## naive abundance parameter estimates
m <- glm(Y ~ x, family = poisson("log"))
coef(m)

library(detect)
md <- svabu(Y ~ x | z, zeroinfl = FALSE)
coef(md)
model <- custommodel("model {
    for (i in 1:n) {
        N[i] ~ dpois(lambda[i])
        Y[i] ~ dbin(p[i], N[i])
        log(lambda[i]) <- inprod(X[i,], beta)
        logit(p[i]) <- inprod(Z[i,], theta)
    }
    for (j in 1:px) {
        beta[j] ~ dnorm(naive[j], 0.1)
    }
    for (j in 1:pz) {
        theta[j] ~ dnorm(0, 0.01)
    }
}")
dat <- list(Y = Y, n = n, X = X, Z = Z, 
    px = ncol(X), pz = ncol(Z), naive = coef(m))
ini <- list(N = Y + 1)
fit <- jags.fit(data = dat, params = c("beta", "theta"), 
    n.update = 5000,
    model = model, inits = ini)
## DC
ifun <- function(model, n.clones) {
    dclone(list(N = Y + 1), n.clones)
}
dcfit <- dc.fit(data = dat, 
    params = c("beta", "theta"), model = model,
    inits = ini, initsfun = ifun,
    n.clones = c(1, 2, 4, 8), 
#    n.update = 5000,
    unchanged = c("px", "pz", "naive"), multiply = "n")

Learning with DC

model <- custommodel("model {
    for (i in 1:n) {
        N[i] ~ dpois(lambda[i])
        Y[i] ~ dbin(p[i], N[i])
        log(lambda[i]) <- inprod(X[i,], beta)
        logit(p[i]) <- inprod(Z[i,], theta)
    }

    cf[1:(px + pz)] ~ dmnorm(pr[,1], pr[,2:(px + pz + 1)])
    beta <- cf[1:px]
    theta <- cf[(px + 1):(px + pz)]
}")
dat <- list(Y = Y, n = n, X = X, Z = Z, 
    px = ncol(X), pz = ncol(Z), 
    pr = unname(cbind(c(coef(m), rep(0, ncol(Z))),
        diag(0.01, ncol(X) + ncol(Z)))))
ini <- list(N = Y + 1)
ifun <- function(model, n.clones) {
    dclone(list(N = Y + 1), n.clones)
}
ufun <- function(model, n.clones) {
    cbind(coef(model), solve(vcov(model)))
}
dcfit <- dc.fit(data = dat, 
    params = c("beta", "theta"), model = model,
    inits = ini, initsfun = ifun, 
    update = "pr", updatefun = ufun,
    n.clones = c(1, 2, 4, 8), 
    n.update = 5000,
    unchanged = c("px", "pz", "pr"), multiply = "n")

Zero-inflated Poisson latent process

This really becomes an issue when $T = 1$. With $T > 1$ it is much easier to distinguish non occupied ($O_i = 0$ or $N_i = 0 | O_i = 1$) locations when all the detection history is 0, and non-detections when some of the detection history is >0 if $p$ is not too small.

set.seed(1234)
n <- 100
T <- 2
p <- 0.6
lambda <- 3.5
q <- 0.25
O <- rbinom(n, size = 1, prob = q)
N <- O * rpois(n = n, lambda = lambda)
Y <- matrix(NA, n, T)
for (t in 1:T) {
    Y[,t] <- rbinom(n = n, size = N, prob = p)
}
table(N = N, Y = apply(Y, 1, max))
library(dclone)
model <- custommodel("model {
    for (i in 1:n) {
        O[i] ~ dbern(q)
        N[i] ~ dpois(lambda * O[i])
        for (t in 1:T) {
            Y[i,t] ~ dbin(p, N[i])
        }
    }
    p ~ dunif(0.001, 0.999)
    lambda ~ dlnorm(0, 0.001)
    q ~ dunif(0.001, 0.999)
}")
dat <- list(Y = Y, n = n, T = T)
ini <- list(N = ifelse(rowSums(Y) > 0, 1, 0) * (apply(Y, 1, max) + 1),
    O = ifelse(rowSums(Y) > 0, 1, 0))
fit <- jags.fit(data = dat, params = c("p", "lambda", "q"), 
    n.update = 10000,
    model = model, inits = ini)
summary(fit)
gelman.diag(fit)
plot(fit)
pairs(fit)

Continuous case with continuous measurement error (Neyman-Scott problem)

Some historical comments: Fisher and Pearson arguments about MOM and MLE; Fisher and Neyman arguments about testing and p-values. This is an example where Fisher is wrong about consistency and efficiency of the MLE.

Motivation

Animal breeding example (Linear mixed models, LMM): Consider the model underlying one way analysis of variance (ANOVA).

\(Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}\), $i = 1, 2, \ldots, n$ and $j = 1, 2$, \(\varepsilon_{ij} \sim Normal(0, \sigma_{\varepsilon}^2)\).

There are two offsprings per individual. We want to know something about the genetic potential of the individuals so that we can turn some into hamburgers and some can have fun as studs/dams.

Parameters

These are the unknown quantities that we want to learn about from the data. In this model, the parameters are (\(\mu, \alpha_1, \alpha_2, \ldots, \alpha_n, \sigma_{\varepsilon}^2\)):

  • number of parameters: $n + 2$,
  • number of observations: $2 n$.

Question

It is easy to write down the likelihood function. What happens if we compute MLE for these parameters? These are familiar quantities from ANOVA, except for the estimate of the variance.

\[\hat{\mu} = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{2} Y_{ij}\] \[\hat{\alpha}_{i} = \frac{1}{2} (Y_{i1} + Y_{i2})\] \[\hat{\sigma}_{\varepsilon}^2 = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{2} (Y_{ij} - \hat{\mu} - \hat{\alpha}_{i})^2\]

One of the most important results from Neyman and Scott (1949) is that as the sample size increases, \(\hat{\sigma}_{\varepsilon}^2 \rightarrow \sigma_{\varepsilon}^2 / 2\). It is almost obvious that we cannot estimate consistently (although it is an unbiased estimator).

Moral of the story

Increasing the sample size does not guarantee good estimators. What is important is that the information in the sample about the parameter should converge to infinity.

In this model, the number of parameters is increasing at the same rate as the sample size (the information). Hence we have limited information about any particular parameter. We are spending the information profligately. This kind of situation is not unusual in practice.

  • Logistic regression and multi-center clinical studies: Each hospital has only a few patients and we want to combine information across the hospitals.
  • Combining data across large geographic areas in abundance surveys: Random effect for the area and the effect of the covariates in the Poisson regression.

What can we do?

We need more information but more data are not going to give us more information.

In mathematics the solution always exists: assume!

These assumptions should, in effect, reduce the number of parameters. Hopefully we can reduce them to the level that information increases sufficiently faster than the number of parameters. Usually we make assumptions so that the final number of parameters is unchanging with the sample size but this is not necessary.

Smoothness assumptions:

  • regression assumption,
  • random effects assumption.
Warning
This has nothing to do with the Bayesian thinking. These are simply modeling assumptions. The effect of these assumptions (e.g. distribution of the latent variable) does not go away as we increase the sample size. On the other hand, as we have seen repeatedly, the effect of the prior (which is also an assumption) vanishes as the information in the data increases.

There is no such thing as “Bayesian model” or “Frequentist model”. There are stochastic models; there are deterministic models; there are descriptive models. Some Bayesians claim that specification of the prior on the parameters is on the same level as specifying a stochastic model for the data. Hence they consider all hierarchical models as “Bayesian models”. Some historical lesson might be useful here.

We do not agree with this. As we have pointed out, the effect of the modeling assumption does not vanish as we increase the sample size, whereas the effect of the prior does.

Unknown quantities in the stochastic models: We have come across two different types of unknown quantities in the hierarchical models: latent variables and parameters.

  • Parameters: These are quantities in the model that can be estimated with perfect certainty as the information in the data increases to infinity.
  • Latent variables: No amount of data can determine these with certainty. The uncertainty does not go to zero.

Analogy with estimation and prediction in time series or regression: If we have large amount of data (and, the model is correct), then the standard error for the parameter estimates goes to zero but the prediction error does not. Latent variables in the hierarchical models are similar to the prediction of unobserved time points.

Imposing a distributional assumption is qualitatively same as imposing regression model. This is not a ‘prior distribution’ of any kind. This is a misleading terminology commonly used in the Bayesian literature.

Prior distributions are smoothness assumptions on the parameters and their effect goes to zero as the information increases.

Data generation

set.seed(1234)
n <- 100
m <- 2
mu <- 2.5
sigma_sq <- 0.2^2
eps <- rnorm(n * m, mean = 0, sd = sqrt(sigma_sq))
tau_sq <- 0.5^2
alpha <- rnorm(n, mean = 0, sd = sqrt(tau_sq))
Y <- mu + alpha + eps
dim(Y) <- c(n, m)
summary(Y)
library(dclone)
model <- custommodel("model {
    for (i in 1:n) {
        for (j in 1:m) {
            Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
        }
        alpha[i] ~ dnorm(0, 0.001)
    }
    log_sigma ~ dnorm(0, 0.001)
    sigma_sq <- exp(log_sigma)^2
    mu ~ dnorm(0, 001)
}")
dat <- list(Y = Y, n = n, m = m)
fit <- jags.fit(data = dat, params = c("mu", "sigma_sq","alpha"), 
    model = model, n.update = 30000)
summary(fit[,c("mu","sigma_sq")])
plot(fit[,c("mu","sigma_sq")])
pairs(fit[,c("mu","sigma_sq")])

plot(fit[,c("alpha[1]","alpha[100]")])
library(dclone)
model <- custommodel("model {
    for (i in 1:n) {
        for (j in 1:m) {
            Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
        }
        alpha[i] ~ dnorm(0, 1 / tau_sq)
    }
    log_sigma ~ dnorm(0, 0.001)
    sigma_sq <- exp(log_sigma)^2
    mu ~ dnorm(0, 001)
    log_tau ~ dnorm(0, 0.001)
    tau_sq <- exp(log_tau)^2
}")
dat <- list(Y = Y, n = n, m = m)
fit <- jags.fit(data = dat, 
    params = c("mu", "sigma_sq", "tau_sq", "alpha"), 
    model = model, n.update = 30000)
summary(fit[,c("mu","sigma_sq", "tau_sq")])
plot(fit[,c("mu","sigma_sq", "tau_sq")])
pairs(fit[,c("mu","sigma_sq", "tau_sq")])

plot(fit[,c("alpha[1]","alpha[100]")])
plot(alpha[1:n], coef(fit)[grep("alpha", varnames(fit))])
model <- custommodel("model {
    for (k in 1:K) {
        for (i in 1:n) {
            for (j in 1:m) {
                Y[i,j,k] ~ dnorm(mu + alpha[i,k], 1 / sigma_sq)
            }
            alpha[i,k] ~ dnorm(0, 1 / tau_sq)
        }
    }
    log_sigma ~ dnorm(0, 0.001)
    sigma_sq <- exp(log_sigma)^2
    mu ~ dnorm(0, 001)
    log_tau ~ dnorm(0, 0.001)
    tau_sq <- exp(log_tau)^2
    sum <- sigma_sq + tau_sq
}")
dat <- list(Y = dcdim(array(Y, c(n, m, 1))), n = n, 
    m = m, K = 1)
str(dat)
K <- c(1, 10, 25, 50)
dcfit1 <- dc.fit(data = dat, 
    params = c("mu", "sigma_sq", "tau_sq"), 
    model = model, n.iter = 1000,
    n.clones = K,
    unchanged = c("n", "m"), multiply = "K")
dcfit2 <- dc.fit(data = dat, 
    params = c("mu", "sum"), 
    model = model, n.iter = 1000,
    n.clones = K,
    unchanged = c("n", "m"), multiply = "K")
dcdiag(dcfit1)
dcdiag(dcfit2)
plot(dcfit1[,c("sigma_sq", "tau_sq")])
pairs(dcfit1[,c("sigma_sq", "tau_sq")])
cov2cor(vcov(dcfit1))
cov2cor(vcov(dcfit2))
pairs(dcfit1)
pairs(dcfit2)

plot(dctable(dcfit1))
plot(dctable(dcfit2))
coef(dcfit1)
coef(dcfit2)
c(sigma_sq, tau_sq, sigma_sq + tau_sq)
library(lme4)
g <- rep(1:n, m)
Yvec <- as.numeric(Y)
mod.lm <- lmer(Yvec ~ 1 + (1|g))
summary(mod.lm)
plot(alpha[1:n], ranef(mod.lm)$g[,1])
  • Run things with m = 1, check diagnostics, estimates
  • Run things with m = 2, check diagnostics, estimates

How do we predict $\alpha_i$ based on data cloning? alpha is now a vector of length K. We need a separate run for prediction:

model <- custommodel("model {
    for (i in 1:n) {
        for (j in 1:m) {
            Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
        }
        alpha[i] ~ dnorm(0, 1 / tau_sq)
    }
    param[1:3] ~ dmnorm(cf[1:3], V[1:3,1:3])
    mu <- param[1]
    log_sigma <- param[2]
    sigma_sq <- exp(log_sigma)^2
    log_tau <- param[3]
    tau_sq <- exp(log_tau)^2
}")
## we need parameters on log scale
## calculate covariance matrix by hand
## create a matrix of the posterior samples
pos <- as.matrix(dcfit1)
head(pos)
pos[,"sigma_sq"] <- log(sqrt(pos[,"sigma_sq"]))
pos[,"tau_sq"] <- log(sqrt(pos[,"tau_sq"]))
colnames(pos)[2:3] <- c("log_sigma", "log_tau")
head(pos)
(V <- cov(pos) * nclones(dcfit1))
(cf <- colMeans(pos))
dat <- list(Y = Y, n = n, m = m,
    cf = cf, V = solve(V)) # precision matrix
pred <- jags.fit(data = dat, 
    params = c("alpha"), 
    model = model)

alpha_b <- cbind(est=coef(fit)[grep("alpha", varnames(fit))],
    t(quantile(fit[,grep("alpha", varnames(fit))], 
    probs = c(0.025, 0.975))))
alpha_dc <- cbind(est=coef(pred),
    t(quantile(pred, probs = c(0.025, 0.975))))
head(alpha_b)
head(alpha_dc)

plot(1:n, alpha[order(alpha)], type = "l", 
    ylim = range(alpha, alpha_b, alpha_dc))
points(1:n - 0.2, alpha_b[order(alpha),1], 
    col = 2, pch = 19, cex = 0.5)
segments(x0 = 1:n - 0.2, x1 = 1:n - 0.2,
    y0 = alpha_b[order(alpha),2], 
    y1 = alpha_b[order(alpha),3], col = 2)
points(1:n + 0.2, alpha_dc[order(alpha),1], 
    col=4, pch = 19, cex = 0.5)
segments(x0 = 1:n + 0.2, x1 = 1:n + 0.2,
    y0 = alpha_dc[order(alpha),2], 
    y1 = alpha_dc[order(alpha),3], col = 4)

table(rowSums(sign(alpha - alpha_b[,-1])))
table(rowSums(sign(alpha - alpha_dc[,-1])))

model <- custommodel("model {
    for (i in 1:n) {
        for (j in 1:m) {
            Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
        }
        alpha[i] ~ dnorm(0, 1 / tau_sq)
    }
    sigma_sq ~ dunif(0.001, 5)
    mu ~ dnorm(10, 1)
    tau_sq ~ dgamma(0.01, 0.01)
}")
dat <- list(Y = Y, n = n, m = m)
fit2 <- jags.fit(data = dat, 
    params = c("mu", "sigma_sq", "tau_sq", "alpha"), 
    model = model, n.update = 30000)
alpha_b2 <- cbind(est=coef(fit2)[grep("alpha", varnames(fit))],
    t(quantile(fit2[,grep("alpha", varnames(fit))], 
    probs = c(0.025, 0.975))))

plot(1:n, alpha[order(alpha)], type = "l", 
    ylim = range(alpha, alpha_b, alpha_dc))
points(1:n - 0.2, alpha_b[order(alpha),1], 
    col = 2, pch = 19, cex = 0.5)
segments(x0 = 1:n - 0.2, x1 = 1:n - 0.2,
    y0 = alpha_b[order(alpha),2], 
    y1 = alpha_b[order(alpha),3], col = 2)
points(1:n + 0.2, alpha_b2[order(alpha),1], 
    col=3, pch = 19, cex = 0.5)
segments(x0 = 1:n + 0.2, x1 = 1:n + 0.2,
    y0 = alpha_b2[order(alpha),2], 
    y1 = alpha_b2[order(alpha),3], col = 3)

Binomial GLMM

The model

\(Y_{ij} \sim Bernoulli(p_{ij})\), $i=1, 2, \ldots, n$ clusters, $j=1, 2, \ldots, m$ observetions within clusters, \(logit(p_{ij}) = \beta_{0} + \beta_{1} X_{ij} + \alpha_{i}\), \(\alpha_{i} \sim Normal(0, \sigma^2)\).

The problem

In these Neyman-Scott type problems, there are two types of asymptotics:

  • increase the number of clusters or
  • increase the number of observations in a cluster (keeping the number of clusters constant).

Of course, one can have both clusters and cluster size going to infinity but that is quite simple, unrealistic.

  • The first asymptotics leads to MLE for $\sigma^2$ that is consistent.
  • The second type of asymptotics does not lead to consistent MLE of $\sigma^2$. That was precisely the point in Kiefer and Wolfowitz1.

The second type of asymptotics where observations within a cluster increases but not the number of clusters, there are only fixed effects corresponding to $\alpha_{i}$. In this asymptotics, the information about $\sigma^2$ is not increasing and hence one cannot obtain consistent estimator.

Data generation

set.seed(150467)
n <- 100
m <- 2
beta <- c(1.5, -1)
sigma_sq <- 1
x <- runif(n * m, min = 0, max = 1)
g <- rep(1:n, each = 2)
alpha <- rnorm(n, mean = 0, sd = sqrt(sigma_sq))[g]
mu <- alpha + beta[1] + beta[2] * x
p <- exp(mu) / (1 + exp(mu))
Y <- rbinom(n * m, 1, p)
dim(Y) <- c(n, m)
summary(Y)

Cloning clusters

Cloning goes with n.

model <- custommodel("model {
    for (i in 1:n) {
        for (j in 1:m) {
            Y[i,j] ~ dbern(p[i,j])
            logit(p[i,j]) <- alpha[i] + beta[1] + beta[2] * X[i,j]
        }
        alpha[i] ~ dnorm(0, 1 / sigma_sq)
    }
    beta[1] ~ dnorm(0, 0.01)
    beta[2] ~ dnorm(0, 0.01)
    sigma_sq ~ dgamma(0.001, 0.001)
}")
dat <- list(Y = Y, 
    X = matrix(x, n, m), 
    n = n, m = m)
dcfit1 <- dc.fit(data = dat, 
    params = c("beta", "sigma_sq"), 
    model = model, n.iter = 1000,
    n.clones = c(1, 2, 4, 8),
    unchanged = "m", multiply = "n")

Alternative data cloning approach:

model <- custommodel("model {
  for (k in 1:K) {
    for (i in 1:n) {
      for (j in 1:m) {
        Y[i,j,k] ~ dbern(p[i,j,k])
        logit(p[i,j,k]) <- alpha[i,k] + beta[1] + beta[2] * X[i,j]
      }
      alpha[i,k] ~ dnorm(0, 1 / sigma_sq)
    }
  }
  beta[1] ~ dnorm(0, 0.01)
  beta[2] ~ dnorm(0, 0.01)
  sigma_sq ~ dgamma(0.001, 0.001)
}")
dat <- list(Y = dcdim(array(Y, c(n, m, 1))), 
    X = matrix(x, n, m), 
    n = n, m = m, K = 1)
dcfit2 <- dc.fit(data = dat, 
    params = c("beta", "sigma_sq"), 
    model = model, n.iter = 1000,
    n.clones = c(1, 2, 4, 8),
    unchanged = c("n", "m", "X"), multiply = "K")

Cloning replicates within clusters

Cloning goes with m, matrices transposed.

model <- custommodel("model {
    for (i in 1:n) {
        for (j in 1:m) {
            Y[j,i] ~ dbern(p[j,i])
            logit(p[j,i]) <- alpha[i] + beta[1] + beta[2] * X[j,i]
        }
        alpha[i] ~ dnorm(0, 1 / sigma_sq)
    }
    beta[1] ~ dnorm(0, 0.01)
    beta[2] ~ dnorm(0, 0.01)
    sigma_sq ~ dgamma(0.001, 0.001)
}")
dat <- list(Y = t(Y), 
    X = t(matrix(x, n, m)), 
    n = n, m = m)
dcfit3 <- dc.fit(data = dat, 
    params = c("beta", "sigma_sq"), 
    model = model, n.iter = 1000,
    n.clones = c(1, 2, 4, 8),
    unchanged = "n", multiply = "m")

Compare the outcomes

dcdiag(dcfit1)
dcdiag(dcfit2)
dcdiag(dcfit3)

Poisson GLMM

The model: \((Y_{i} \mid \lambda_i) \sim Poisson(\lambda_{i})\), $i = 1, 2, \ldots, n$, \(log(\lambda_i) = \alpha_i + \mathbf{X}^{\top}_{i} \mathbf{\beta}\), and \(\alpha_i \sim Normal(0, \sigma^2)\).

Data generation

set.seed(1234)
n <- 20
beta <- c(1, -1)
sigma_sq <- 0.5
x <- rnorm(n)
X <- model.matrix(~x)
alpha <- rnorm(n, mean = 0, sd = sqrt(sigma_sq))
lambda <- exp(alpha + X %*% beta)
Y <- rpois(n, lambda = lambda)
table(Y)

Bayesian analysis

model <- custommodel("model {
    for (i in 1:n) {
        Y[i] ~ dpois(lambda[i])
        log(lambda[i]) <- alpha[i] + inprod(X[i,], beta)
        alpha[i] ~ dnorm(0, 1 / sigma_sq)
    }
    for (j in 1:px) {
        beta[j] ~ dnorm(0, 0.001)
    }
    log_sigma ~ dnorm(0, 0.001)
    sigma_sq <- exp(log_sigma)^2
}")
dat <- list(Y = Y, X = X, n = n, px = ncol(X))
fit <- jags.fit(data = dat, 
    params = c("beta", "sigma_sq"), 
    model = model)

Data cloning

dcfit <- dc.fit(data = dat, 
    params = c("beta", "sigma_sq"), 
    model = model, n.iter = 1000,
    n.clones = c(1, 2, 4, 8),
    unchanged = "px", multiply = "n")
dcdiag(dcfit)
coef(dcfit)

Prediction

model <- custommodel("model {
    for (i in 1:n) {
        Y[i] ~ dpois(lambda[i])
        log(lambda[i]) <- alpha[i] + inprod(X[i,], beta)
        alpha[i] ~ dnorm(0, 1 / sigma_sq)
    }
    for (j in 1:px) {
        beta[j] ~ dnorm(0, 0.1)
    }
    log_sigma ~ dnorm(10, 1)
    sigma_sq <- exp(log_sigma)^2
}")
dat <- list(Y = Y, X = X, n = n, px = ncol(X))
pred1 <- jags.fit(data = dat, 
    params = "alpha", 
    model = model)
model <- custommodel("model {
    for (i in 1:n) {
        Y[i] ~ dpois(lambda[i])
        log(lambda[i]) <- alpha[i] + inprod(X[i,], beta)
        alpha[i] ~ dnorm(0, 1 / sigma_sq)
    }
    param[1:(px + 1)] ~ dmnorm(cf[], V[,])
    for (j in 1:px) {
        beta[j] <- param[j]
    }
    log_sigma <- param[px + 1]
    sigma_sq <- exp(log_sigma)^2
}")
dat <- list(Y = Y, X = X, n = n, px = ncol(X),
    cf = coef(dcfit), V = solve(vcov(dcfit)))
pred2 <- jags.fit(data = dat, 
    params = "alpha", 
    model = model)

alpha_b <- cbind(est=coef(pred1),
    t(quantile(pred1, probs = c(0.025, 0.975))))
alpha_dc <- cbind(est=coef(pred2),
    t(quantile(pred2, probs = c(0.025, 0.975))))

plot(1:n, alpha[order(alpha)], type = "l", 
    ylim = range(alpha, alpha_b, alpha_dc))
points(1:n - 0.2, alpha_b[order(alpha),1], 
    col = 2, pch = 19, cex = 0.5)
segments(x0 = 1:n - 0.2, x1 = 1:n - 0.2,
    y0 = alpha_b[order(alpha),2], 
    y1 = alpha_b[order(alpha),3], col = 2)
points(1:n + 0.2, alpha_dc[order(alpha),1], 
    col=4, pch = 19, cex = 0.5)
segments(x0 = 1:n + 0.2, x1 = 1:n + 0.2,
    y0 = alpha_dc[order(alpha),2], 
    y1 = alpha_dc[order(alpha),3], col = 4)

table(rowSums(sign(alpha - alpha_b[,-1]))) / n
table(rowSums(sign(alpha - alpha_dc[,-1]))) / n

Use different prior distributions on the parameters and see how they affect the coverage.

What have we learnt?

  • Hierarchical models: Linear mixed models, measurement error.
  • Latent variables versus parameters.
  • Estimation and inference for the parameters.
  • Prediction and inference (coverage) for the latent variables.
  1. Kiefer, J. and Wolfowitz, J., 1956. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist., 27, 887–906. PDF from Project Euclid