We are now familiar with the basic concepts of statistical inference and the two philosophies commonly adopted for making inferential statements. In this lecture, we will look at making inferential statements about realistic, and hence more complex, ecological models. In the rest of the course, we will write down the description of each model but will not discuss the philosophical aspects in detail. We will mostly use a graphical model and its JAGS version, provide tools to obtain either Bayesian or Frequentist inferential statements, and discuss the pros and cons of these statements. The choice of the inferential statement will be left to the scientist.
Let us continue with the simple occupancy model we used previously. Most applied ecologists are aware that occupancy and abundance surveys involve some level of detection error. Even if the species is present, for various reasons we may not observe its presence. Similarly, we may not be able to count all the individuals that are present at a location. Let us look at how to model such a situation. We will discuss the model and then show how it can be viewed as a hierarchical model.
We can extend the Bernoulli model from Lecture 1 as follows: the true occupancy status of location $i$ is \(Y_i \sim Bernoulli(\varphi)\), and the observed status is \((W_i \mid Y_i) \sim Bernoulli(Y_i \, p)\), so that \(P(W_i = 1 \mid Y_i = 1) = p\) and \(P(W_i = 1 \mid Y_i = 0) = 0\).
An important thing to note here is that we only observe $W$’s and not the true statuses ($Y$) which are unknown. We can use the standard probability rules to compute:
\[\begin{align} P(W_i = 1) & = P(W_i = 1 \mid Y_i = 1) P(Y_i = 1) + P(W_i = 1 \mid Y_i = 0) P(Y_i = 0) \\ & = p \varphi + 0 \cdot (1 - \varphi) \\ & = p \varphi \end{align}\]

\[\begin{align} P(W_i = 0) & = P(W_i = 0 \mid Y_i = 1) P(Y_i = 1) + P(W_i = 0 \mid Y_i = 0) P(Y_i = 0) \\ & = 1 - p \varphi \end{align}\]

This is called the marginal distribution of $W$. We can write down the likelihood function as a function of the parameters $(p, \varphi)$.

\[\begin{align} L(p, \varphi; w_{1}, w_{2}, \ldots, w_{n}) & = \prod_{i=1}^{n} P(W_i = w_i; p, \varphi) \\ & = \prod_{i=1}^{n} (p \varphi)^{w_i} (1 - p \varphi)^{1 - w_i} \end{align}\]

Cautionary note
Just because one can write down the likelihood function, it does not mean one can estimate the parameters.
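To see why, write $\psi = p \varphi$ and note that the likelihood above depends on $(p, \varphi)$ only through this product:

\[L(p, \varphi; w_{1}, \ldots, w_{n}) = \psi^{\sum_{i} w_i} (1 - \psi)^{n - \sum_{i} w_i}\]

Any pair $(p, \varphi)$ with the same product gives exactly the same likelihood value. The data can tell us about $\psi$ (its MLE is $\hat{\psi} = \bar{w}$), but not about $p$ and $\varphi$ separately.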
This is a simple situation with two parameters and hence we can plot the likelihood function as a contour plot.
R code for data generation:
set.seed(4321)
n <- 100
p <- 0.6
phi <- 0.4
y <- rbinom(n = n, size = 1, prob = phi)
w <- rbinom(n = n, size = y, prob = p)
table(Y = y, W = w)
Given the data, plot the likelihood contours.
## setting up the grid for p and phi
grid <- expand.grid(p = seq(0, 1, by = 0.01),
phi = seq(0, 1, by = 0.01),
L = NA)
## the likelihood function
L_fun <- function(w, p, phi) {
prod((p * phi)^w * (1 - p * phi)^(1 - w))
}
## calculating the likelihood for the grid
for (i in 1:nrow(grid)) {
grid$L[i] <- L_fun(w = w, p = grid$p[i], phi = grid$phi[i])
}
## plot the likelihood surface
dcpal_reds <- colorRampPalette(c("#f9f2f4", "#c7254e"))
L_mat <- matrix(grid$L, sqrt(nrow(grid)))
image(L_mat,
xlab = "p", ylab = expression(varphi),
col = dcpal_reds(12))
abline(h = phi, v = p, col = "#f9f2f4", lwd = 3)
abline(h = phi, v = p, col = "#c7254e", lwd = 1)
curve((p * phi) / x, 0, 1, add = TRUE,
col = "#18bc9c", lwd = 2)
We can see that the likelihood function looks like a mountain with a ridge along the curve $p \varphi = c$: the likelihood is constant whenever the product $p \varphi$ is the same.
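As a quick numerical check (a small sketch reusing the L_fun and w objects defined above, with 0.24 being the true product $p \varphi = 0.6 \times 0.4$ plotted as the green curve), the likelihood is identical at every point along this ridge:
## evaluate the likelihood at several points with the same product p * phi
ridge <- data.frame(p = c(0.3, 0.4, 0.6, 0.8))
ridge$phi <- 0.24 / ridge$p
ridge$L <- sapply(seq_len(nrow(ridge)), function(i)
L_fun(w = w, p = ridge$p[i], phi = ridge$phi[i]))
ridge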
library(rgl)
open3d()
bg3d("white")
material3d(col = "black")
dcpal_grbu <- colorRampPalette(c("#18bc9c", "#3498db"))
Col <- rev(dcpal_grbu(12))[cut(L_mat, breaks = 12)]
persp3d(L_mat / max(L_mat), col = Col,
theta=50, phi=25, expand=0.75, ticktype="detailed",
xlab = "p", ylab = "phi", zlab = "L")
When the likelihood function attains its maximum at more than one parameter combination, we call the parameters non-estimable. There are various reasons for such non-estimability (see Campbell and Lele, 2013, and references therein).
Structural problems with the model: it might be that the structure of the model is such that, no matter what data we observe, we cannot estimate the parameters. This is called non-identifiability.
Sometimes there are no structural issues, but the particular data set observed is such that the likelihood is problematic. This is called non-estimability. An example would be collinear covariates in regression.
Consequences of non-identifiability: management decisions can be based only on identifiable components of the model.
For models with more than two parameters, it is very difficult to plot the likelihood function, and it becomes nearly impossible to diagnose non-identifiability and non-estimability of the parameters visually. The data cloning method provides a very simple approach to diagnosing non-estimability for general hierarchical models.
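Briefly, and as a rough sketch of the standard data cloning argument: with $K$ copies (clones) of the data the posterior is proportional to the likelihood raised to the power $K$ times the prior,

\[\pi_K(\theta \mid y) \propto \left[ L(\theta; y) \right]^{K} \pi(\theta)\]

For estimable parameters this distribution concentrates around the MLE and its variance shrinks roughly like $1/K$; for non-estimable parameters the suitably scaled posterior variance does not shrink to zero. The data cloning diagnostics used below (such as lambda.max reported by dcdiag) exploit this behaviour.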
We can skip all the mathematical details in the calculation of the likelihood function and use JAGS and MCMC to do almost all of the above analysis.
library(dclone)
library(rjags)
model <- custommodel("model {
for (i in 1:n) {
Y[i] ~ dbern(phi)
W[i] ~ dbern(Y[i] * p)
}
#p ~ dunif(0.001, 0.999)
#phi ~ dunif(0.001, 0.999)
p ~ dbeta(1, 1)
phi ~ dbeta(0.5, 0.5)
}")
dat <- list(W = w, n = n)
#ini <- list(Y = w)
ini <- list(Y = rep(1, n))
fit <- jags.fit(data = dat, params = c("p", "phi"),
model = model, inits = ini)
summary(fit)
plot(fit)
pairs(fit)
Bayesian inference
Observe what happens to convergence diagnostics.
To make sure that both locations and clones are independent (i.i.d.), it is safest to include an extra dimension and the corresponding loop.
library(dclone)
library(rjags)
model <- custommodel("model {
for (k in 1:K) {
for (i in 1:n) {
Y[i,k] ~ dbern(phi)
W[i,k] ~ dbern(Y[i,k] * p)
}
}
#p ~ dunif(0.001, 0.999)
#phi ~ dunif(0.001, 0.999)
p ~ dbeta(1, 1)
phi ~ dbeta(0.5, 0.5)
}")
dat <- list(W = dcdim(data.matrix(w)), n = n, K = 1)
ini <- list(Y = dcdim(data.matrix(w)))
ifun <- function(model, n.clones) {
dclone(list(Y = dcdim(data.matrix(w))),
n.clones)
}
dcfit <- dc.fit(data = dat, params = c("p", "phi"),
model = model, inits = ini,
n.clones = c(1,2,4,8), unchanged = "n", multiply = "K",
initsfun = ifun, n.iter = 10000)
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
pairs(dcfit)
If locations are treated as i.i.d., it is possible to replicate the vector so that its length becomes n * K.
model <- custommodel("model {
for (i in 1:n) {
Y[i] ~ dbern(phi)
W[i] ~ dbern(Y[i] * p)
}
p ~ dunif(0.001, 0.999)
phi ~ dunif(0.001, 0.999)
}")
dat <- list(W = w, n = n)
ini <- list(Y = w)
ifun <- function(model, n.clones) {
dclone(list(Y = w), n.clones)
}
dcfit <- dc.fit(data = dat, params = c("p", "phi"),
model = model, inits = ini,
n.clones = c(1,2,4,8), multiply = "n",
initsfun = ifun)
Data cloning inference
Observe what happens to the standard errors as we increase the number of clones: they do not shrink towards 0 the way they would for estimable parameters. This indicates non-estimability of the parameters.
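One way to inspect this numerically (a sketch that assumes dctable returns columns named n.clones, mean and sd, as in current dclone versions) is to scale the posterior standard deviations by $\sqrt{K}$: for estimable parameters this scaled sd stabilizes, while here it keeps growing.
dct <- dctable(dcfit)
## posterior sd by number of clones, and sd scaled by sqrt(K)
lapply(dct, function(z)
cbind(z[, c("n.clones", "mean", "sd")],
scaled_sd = z$sd * sqrt(z$n.clones)))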
Suppose we go to the same location more than once, say $T$ times. Then sometimes we will observe the species and sometimes we will not. These changes may help us learn about the detection error process.
The occupancy model with replicate visits is: \(Y_i \sim Bernoulli(\varphi)\) for the true occupancy status, and \((W_{i,t} \mid Y_i) \sim Bernoulli(Y_i \, p)\) for the observations on visits $t = 1, 2, \ldots, T$.
The likelihood function is:
\[L(p, \varphi; w_{1,1}, \ldots, w_{n,T}) = \prod_{i=1}^{n} \left[ \varphi \binom{T}{w_{i \cdot}} p^{w_{i \cdot}} (1 - p)^{T - w_{i \cdot}} + (1 - \varphi) I(w_{i \cdot} = 0)\right]\]

where \(w_{i \cdot} = \sum_{t=1}^{T} w_{i,t}\) and \(I( w_{i \cdot} = 0 )\) is an indicator function that is equal to one if \(w_{i \cdot} = 0\) and zero otherwise.
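A quick way to see why replicate visits help: the marginal distribution of $w_{i \cdot}$ now depends on $(p, \varphi)$ through more than just the product. For example,

\[E(w_{i \cdot}) = T p \varphi, \qquad P(w_{i \cdot} = 0) = \varphi (1 - p)^{T} + (1 - \varphi)\]

and for $T > 1$ these two quantities jointly carry information about $p$ and $\varphi$ separately, unlike the single-visit case where only the product $p \varphi$ enters the likelihood.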
R code for data generation:
set.seed(1234)
n <- 50
T <- 5
p <- 0.6
phi <- 0.4
y <- rbinom(n = n, size = 1, prob = phi)
w <- matrix(NA, n, T)
for (t in 1:T)
w[,t] <- rbinom(n = n, size = y, prob = p)
Given the data, plot the likelihood contours.
## setting up the grid for p and phi
grid <- expand.grid(p = seq(0, 1, by = 0.01),
phi = seq(0, 1, by = 0.01),
L = NA)
## the likelihood function
L_fun <- function(w, p, phi) {
wdot <- rowSums(w)
T <- ncol(w)
prod(phi * (choose(T, wdot) * p^wdot * (1 - p)^(T - wdot)) +
(1 - phi) * (wdot == 0))
}
## calculating the likelihood for the grid
for (i in 1:nrow(grid)) {
grid$L[i] <- L_fun(w = w, p = grid$p[i], phi = grid$phi[i])
}
## plot the likelihood surface
dcpal_reds <- colorRampPalette(c("#f9f2f4", "#c7254e"))
L_mat <- matrix(grid$L, sqrt(nrow(grid)))
image(L_mat,
xlab = "p", ylab = expression(varphi),
col = dcpal_reds(12))
abline(h = phi, v = p, col = "#f9f2f4", lwd = 3)
abline(h = phi, v = p, col = "#c7254e", lwd = 1)
library(rgl)
open3d()
bg3d("white")
material3d(col = "black")
dcpal_grbu <- colorRampPalette(c("#18bc9c", "#3498db"))
Col <- rev(dcpal_grbu(12))[cut(L_mat, breaks = 12)]
persp3d(L_mat / max(L_mat), col = Col,
theta=50, phi=25, expand=0.75, ticktype="detailed",
xlab = "p", ylab = "phi", zlab = "L")
library(dclone)
library(rjags)
model <- custommodel("model {
for (i in 1:n) {
Y[i] ~ dbern(phi)
for (t in 1:T) {
W[i,t] ~ dbern(Y[i] * p)
}
}
p ~ dunif(0.001, 0.999)
phi ~ dunif(0.001, 0.999)
}")
dat <- list(W = w, n = n, T = T)
ini <- list(Y = ifelse(rowSums(w) > 0, 1, 0))
fit <- jags.fit(data = dat, params = c("p", "phi"),
model = model, inits = ini)
summary(fit)
plot(fit)
pairs(fit)
Effect of priors on the estimation and prediction of the occupancy proportion:
model <- custommodel("model {
for (i in 1:n) {
Y[i] ~ dbern(phi)
for (t in 1:T) {
W[i,t] ~ dbern(Y[i] * p)
}
}
p <- ilogit(logit_p)
phi <- ilogit(logit_phi)
logit_p ~ dnorm(-2, 0.01)
logit_phi ~ dnorm(2, 0.01)
}")
dat <- list(W = w, n = n, T = T)
ini <- list(Y = ifelse(rowSums(w) > 0, 1, 0))
fit2 <- jags.fit(data = dat, params = c("p", "phi"),
model = model, inits = ini)
summary(fit2)
plot(fit2)
pairs(fit2)
Frequentist inference: identifiability check, checking that the results do not depend on the prior specification, confidence intervals, and predictions for the occupancy proportion.
library(dclone)
library(rjags)
model <- custommodel("model {
for (k in 1:K) {
for (i in 1:n) {
Y[i,k] ~ dbern(phi)
for (t in 1:T) {
W[i,t,k] ~ dbern(Y[i,k] * p)
}
}
}
p ~ dunif(0.001, 0.999)
phi ~ dunif(0.001, 0.999)
}")
dat <- list(W = dcdim(array(w, c(n,T,1))), n = n, T = T, K = 1)
ini <- list(Y = data.matrix(rep(1, n)))
ifun <- function(model, n.clones) {
list(Y = dclone(dcdim(data.matrix(rep(1, n))), n.clones))
}
dcfit <- dc.fit(data = dat, params = c("p", "phi"),
model = model, inits = ini,
n.clones = c(1,2,4,8), multiply = "K", unchanged = c("n","T"),
initsfun = ifun)
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
pairs(dcfit)
$p$ and $\varphi$ can be functions of independent variables whose values vary across the $n$ locations, for example \(logit(p_i) = \theta_0 + \theta_1 z_i\) and \(logit(\varphi_i) = \beta_0 + \beta_1 x_i\):
R code for data generation:
set.seed(1234)
n <- 1000
x <- rnorm(n)
z <- rnorm(n)
beta <- c(0, 1)
theta <- c(0.2, -0.5)
p <- exp(theta[1] + theta[2] * z) / (1 + exp(theta[1] + theta[2] * z))
phi <- exp(beta[1] + beta[2] * x) / (1 + exp(beta[1] + beta[2] * x))
#p <- plogis(model.matrix(~z) %*% theta)
#phi <- plogis(model.matrix(~x) %*% beta)
y <- rbinom(n = n, size = 1, prob = phi)
w <- rbinom(n = n, size = y, prob = p)
table(Y = y, W = w)
naive <- glm(w ~ x, family = binomial("logit"))
summary(naive)
library(detect)
m <- svocc(w ~ x | z)
summary(m)
model <- custommodel("model {
for (i in 1:n) {
W[i] ~ dbin(p[i] * phi[i], K)
logit(p[i]) <- inprod(Z[i,], theta)
logit(phi[i]) <- inprod(X[i,], beta)
}
beta[1] ~ dnorm(0, 0.001)
beta[2] ~ dnorm(0, 0.001)
theta[1] ~ dnorm(0, 0.001)
theta[2] ~ dnorm(0, 0.001)
}")
dat <- list(W = w, n = n, K = 1,
X = model.matrix(~x), Z = model.matrix(~z))
dcfit <- dc.fit(data = dat,
params = c("beta", "theta"), model = model,
n.clones = c(1, 10), n.iter = 2000,
unchanged = c("W", "n", "X", "Z"), multiply = "K")
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
pairs(dcfit)
For a quasi-Bayesian approach, here is how the naive estimator can be used to stabilize the single-visit based estimates:
model <- custommodel("model {
for (i in 1:n) {
W[i] ~ dbin(p[i] * phi[i], K)
logit(p[i]) <- inprod(Z[i,], theta)
logit(phi[i]) <- inprod(X[i,], beta)
}
beta[1] ~ dnorm(naive[1], penalty)
beta[2] ~ dnorm(naive[2], penalty)
theta[1] ~ dnorm(0, 0.001)
theta[2] ~ dnorm(0, 0.001)
}")
dat <- list(W = w, n = n, K = 1,
X = model.matrix(~x), Z = model.matrix(~z),
naive = coef(naive), penalty = 0.5)
dcfit <- dc.fit(data = dat,
params = c("beta", "theta"), model = model,
n.clones = c(1, 10, 100),
n.update = 5000, n.iter = 2000,
unchanged = c("W", "n", "X", "Z", "naive", "penalty"),
multiply = "K")
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
pairs(dcfit)
We can easily generalize this approach to model abundance surveys. The N-mixture model is the simplest such model (though its assumptions are often unrealistic in practice).
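For reference, the basic N-mixture model matching the simulation below (constant abundance rate $\lambda$ and detection probability $p$) is:

\[N_i \sim Poisson(\lambda), \qquad (Y_{i,t} \mid N_i) \sim Binomial(N_i, p), \qquad i = 1, \ldots, n, \; t = 1, \ldots, T\]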
set.seed(1234)
n <- 200
T <- 1
p <- 0.6
lambda <- 4.2
N <- rpois(n = n, lambda = lambda)
Y <- matrix(NA, n, T)
for (t in 1:T) {
Y[,t] <- rbinom(n = n, size = N, prob = p)
}
table(N = N, Y = apply(Y, 1, max))
library(dclone)
library(rjags)
model <- custommodel("model {
for (i in 1:n) {
N[i] ~ dpois(lambda)
for (t in 1:T) {
Y[i,t] ~ dbin(p, N[i])
}
}
p ~ dunif(0.001, 0.999)
lambda ~ dlnorm(0, 0.001)
}")
dat <- list(Y = Y, n = n, T = T)
ini <- list(N = apply(Y, 1, max) + 1)
fit <- jags.fit(data = dat, params = c("p", "lambda"),
n.update = 10000,
model = model, inits = ini)
summary(fit)
gelman.diag(fit)
plot(fit)
pairs(fit)
ifun <- function(model, n.clones) {
dclone(list(N = apply(Y, 1, max) + 1), n.clones)
}
dcfit <- dc.fit(data = dat,
params = c("p", "lambda"), model = model,
inits = ini, initsfun = ifun,
n.clones = c(1, 2, 4, 8),
unchanged = "T", multiply = "n")
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
pairs(dcfit)
As before, it is easy to include covariates in these models. Various extensions and modifications of this basic model have been proposed; see Lele et al., Solymos et al., Solymos and Lele, and Dail and Madsen on single-visit and related methods.
Single visit abundance model with covariates:
set.seed(1234)
n <- 200
x <- rnorm(n)
z <- rnorm(n)
beta <- c(0.9, 0.5)
theta <- c(0.8, -0.5)
Z <- model.matrix(~z)
X <- model.matrix(~x)
p <- plogis(Z %*% theta)
lambda <- exp(X %*% beta)
N <- rpois(n = n, lambda = lambda)
Y <- rbinom(n = n, size = N, prob = p)
table(N = N, Y = Y)
## naive abundance parameter estimates
m <- glm(Y ~ x, family = poisson("log"))
coef(m)
library(detect)
md <- svabu(Y ~ x | z, zeroinfl = FALSE)
coef(md)
model <- custommodel("model {
for (i in 1:n) {
N[i] ~ dpois(lambda[i])
Y[i] ~ dbin(p[i], N[i])
log(lambda[i]) <- inprod(X[i,], beta)
logit(p[i]) <- inprod(Z[i,], theta)
}
for (j in 1:px) {
beta[j] ~ dnorm(naive[j], 0.1)
}
for (j in 1:pz) {
theta[j] ~ dnorm(0, 0.01)
}
}")
dat <- list(Y = Y, n = n, X = X, Z = Z,
px = ncol(X), pz = ncol(Z), naive = coef(m))
ini <- list(N = Y + 1)
fit <- jags.fit(data = dat, params = c("beta", "theta"),
n.update = 5000,
model = model, inits = ini)
## DC
ifun <- function(model, n.clones) {
dclone(list(N = Y + 1), n.clones)
}
dcfit <- dc.fit(data = dat,
params = c("beta", "theta"), model = model,
inits = ini, initsfun = ifun,
n.clones = c(1, 2, 4, 8),
# n.update = 5000,
unchanged = c("px", "pz", "naive"), multiply = "n")
model <- custommodel("model {
for (i in 1:n) {
N[i] ~ dpois(lambda[i])
Y[i] ~ dbin(p[i], N[i])
log(lambda[i]) <- inprod(X[i,], beta)
logit(p[i]) <- inprod(Z[i,], theta)
}
cf[1:(px + pz)] ~ dmnorm(pr[,1], pr[,2:(px + pz + 1)])
beta <- cf[1:px]
theta <- cf[(px + 1):(px + pz)]
}")
dat <- list(Y = Y, n = n, X = X, Z = Z,
px = ncol(X), pz = ncol(Z),
pr = unname(cbind(c(coef(m), rep(0, ncol(Z))),
diag(0.01, ncol(X) + ncol(Z)))))
ini <- list(N = Y + 1)
ifun <- function(model, n.clones) {
dclone(list(N = Y + 1), n.clones)
}
ufun <- function(model, n.clones) {
cbind(coef(model), solve(vcov(model)))
}
dcfit <- dc.fit(data = dat,
params = c("beta", "theta"), model = model,
inits = ini, initsfun = ifun,
update = "pr", updatefun = ufun,
n.clones = c(1, 2, 4, 8),
n.update = 5000,
unchanged = c("px", "pz", "pr"), multiply = "n")
Distinguishing true absences from non-detections really becomes an issue when $T = 1$. With $T > 1$ it is much easier to separate unoccupied locations ($O_i = 0$, or $N_i = 0$ given $O_i = 1$), where the whole detection history is 0, from non-detections at occupied locations, where some of the detection history is $> 0$, provided $p$ is not too small.
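The zero-inflated N-mixture model simulated below adds an occupancy (suitability) indicator $O_i$ with probability $q$:

\[O_i \sim Bernoulli(q), \qquad (N_i \mid O_i) \sim Poisson(\lambda O_i), \qquad (Y_{i,t} \mid N_i) \sim Binomial(N_i, p)\]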
set.seed(1234)
n <- 100
T <- 2
p <- 0.6
lambda <- 3.5
q <- 0.25
O <- rbinom(n, size = 1, prob = q)
N <- O * rpois(n = n, lambda = lambda)
Y <- matrix(NA, n, T)
for (t in 1:T) {
Y[,t] <- rbinom(n = n, size = N, prob = p)
}
table(N = N, Y = apply(Y, 1, max))
library(dclone)
model <- custommodel("model {
for (i in 1:n) {
O[i] ~ dbern(q)
N[i] ~ dpois(lambda * O[i])
for (t in 1:T) {
Y[i,t] ~ dbin(p, N[i])
}
}
p ~ dunif(0.001, 0.999)
lambda ~ dlnorm(0, 0.001)
q ~ dunif(0.001, 0.999)
}")
dat <- list(Y = Y, n = n, T = T)
ini <- list(N = ifelse(rowSums(Y) > 0, 1, 0) * (apply(Y, 1, max) + 1),
O = ifelse(rowSums(Y) > 0, 1, 0))
fit <- jags.fit(data = dat, params = c("p", "lambda", "q"),
n.update = 10000,
model = model, inits = ini)
summary(fit)
gelman.diag(fit)
plot(fit)
pairs(fit)
Some historical comments: the arguments between Fisher and Pearson about the method of moments (MOM) versus maximum likelihood (MLE), and between Fisher and Neyman about testing and p-values. This is an example where Fisher's claims about the consistency and efficiency of the MLE break down.
Animal breeding example (Linear mixed models, LMM): Consider the model underlying one-way analysis of variance (ANOVA).
\(Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}\), $i = 1, 2, \ldots, n$ and $j = 1, 2$, \(\varepsilon_{ij} \sim Normal(0, \sigma_{\varepsilon}^2)\).
There are two offspring per individual. We want to know something about the genetic potential of the individuals so that we can turn some into hamburgers and some can have fun as studs/dams.
The genetic potentials are the unknown quantities that we want to learn about from the data. In this model, the parameters are (\(\mu, \alpha_1, \alpha_2, \ldots, \alpha_n, \sigma_{\varepsilon}^2\)).
It is easy to write down the likelihood function. What happens if we compute the MLE for these parameters? The estimators are familiar quantities from ANOVA, except for the estimate of the variance.
\[\hat{\mu} = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{2} Y_{ij}\] \[\hat{\alpha}_{i} = \frac{1}{2} (Y_{i1} + Y_{i2}) - \hat{\mu}\] \[\hat{\sigma}_{\varepsilon}^2 = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{2} (Y_{ij} - \hat{\mu} - \hat{\alpha}_{i})^2\]

One of the most important results from Neyman and Scott (1949) is that as the sample size increases, \(\hat{\sigma}_{\varepsilon}^2 \rightarrow \sigma_{\varepsilon}^2 / 2\). In fact \(E(\hat{\sigma}_{\varepsilon}^2) = \sigma_{\varepsilon}^2 / 2\), so the estimator is biased, and we cannot estimate \(\sigma_{\varepsilon}^2\) consistently no matter how many individuals we observe.
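A quick simulation sketch of this result (with $\mu$ set to 0 for simplicity, so that \(\hat{\alpha}_i = \bar{Y}_{i \cdot}\)): as the number of individuals grows, the estimator settles near $\sigma_{\varepsilon}^2 / 2 = 0.5$ rather than $\sigma_{\varepsilon}^2 = 1$.
set.seed(1)
sigma_sq <- 1
for (n in c(100, 1000, 10000)) {
alpha <- rnorm(n) # individual effects
Y <- matrix(rnorm(2 * n, mean = rep(alpha, 2), sd = sqrt(sigma_sq)), n, 2)
sigma_hat <- mean((Y - rowMeans(Y))^2) # (1/(2n)) sum (Y_ij - Ybar_i)^2
cat("n =", n, " sigma_hat =", round(sigma_hat, 3), "\n")
}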
Moral of the story
Increasing the sample size does not guarantee good estimators. What is important is that the information in the sample about the parameter should converge to infinity.
In this model, the number of parameters is increasing at the same rate as the sample size (the information). Hence we have limited information about any particular parameter. We are spending the information profligately. This kind of situation is not unusual in practice.
What can we do?
We need more information but more data are not going to give us more information.
In mathematics the solution always exists: assume!
These assumptions should, in effect, reduce the number of parameters. Hopefully we can reduce them to a level where the information increases sufficiently faster than the number of parameters. Usually we make assumptions so that the final number of parameters does not change with the sample size, but this is not necessary.
Smoothness assumptions: assume that the individual effects come from a common distribution, for example \(\alpha_i \sim Normal(0, \tau^2)\). This replaces the $n$ parameters $\alpha_1, \alpha_2, \ldots, \alpha_n$ with a single variance parameter $\tau^2$, and under this assumption the MLE becomes consistent (Kiefer and Wolfowitz, 1956).
There is no such thing as a “Bayesian model” or a “Frequentist model”. There are stochastic models; there are deterministic models; there are descriptive models. Some Bayesians claim that specifying a prior on the parameters is on the same level as specifying a stochastic model for the data, and hence they consider all hierarchical models to be “Bayesian models”. Some historical lessons might be useful here.
We do not agree with this. As we have pointed out, the effect of the modeling assumption does not vanish as we increase the sample size, whereas the effect of the prior does.
Unknown quantities in the stochastic models: We have come across two different types of unknown quantities in the hierarchical models: latent variables and parameters.
Analogy with estimation and prediction in time series or regression: if we have a large amount of data (and the model is correct), the standard error of the parameter estimates goes to zero but the prediction error does not. Latent variables in hierarchical models are analogous to predictions of unobserved time points.
Imposing a distributional assumption on the latent variables is qualitatively the same as imposing a regression model. It is not a ‘prior distribution’ of any kind; calling it one is misleading terminology that is common in the Bayesian literature.
Prior distributions are smoothness assumptions on the parameters and their effect goes to zero as the information increases.
set.seed(1234)
n <- 100
m <- 2
mu <- 2.5
sigma_sq <- 0.2^2
eps <- rnorm(n * m, mean = 0, sd = sqrt(sigma_sq))
tau_sq <- 0.5^2
alpha <- rnorm(n, mean = 0, sd = sqrt(tau_sq))
Y <- mu + alpha + eps
dim(Y) <- c(n, m)
summary(Y)
library(dclone)
model <- custommodel("model {
for (i in 1:n) {
for (j in 1:m) {
Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
}
alpha[i] ~ dnorm(0, 0.001)
}
log_sigma ~ dnorm(0, 0.001)
sigma_sq <- exp(log_sigma)^2
mu ~ dnorm(0, 0.001)
}")
dat <- list(Y = Y, n = n, m = m)
fit <- jags.fit(data = dat, params = c("mu", "sigma_sq","alpha"),
model = model, n.update = 30000)
summary(fit[,c("mu","sigma_sq")])
plot(fit[,c("mu","sigma_sq")])
pairs(fit[,c("mu","sigma_sq")])
plot(fit[,c("alpha[1]","alpha[100]")])
library(dclone)
model <- custommodel("model {
for (i in 1:n) {
for (j in 1:m) {
Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
}
alpha[i] ~ dnorm(0, 1 / tau_sq)
}
log_sigma ~ dnorm(0, 0.001)
sigma_sq <- exp(log_sigma)^2
mu ~ dnorm(0, 0.001)
log_tau ~ dnorm(0, 0.001)
tau_sq <- exp(log_tau)^2
}")
dat <- list(Y = Y, n = n, m = m)
fit <- jags.fit(data = dat,
params = c("mu", "sigma_sq", "tau_sq", "alpha"),
model = model, n.update = 30000)
summary(fit[,c("mu","sigma_sq", "tau_sq")])
plot(fit[,c("mu","sigma_sq", "tau_sq")])
pairs(fit[,c("mu","sigma_sq", "tau_sq")])
plot(fit[,c("alpha[1]","alpha[100]")])
plot(alpha[1:n], coef(fit)[grep("alpha", varnames(fit))])
model <- custommodel("model {
for (k in 1:K) {
for (i in 1:n) {
for (j in 1:m) {
Y[i,j,k] ~ dnorm(mu + alpha[i,k], 1 / sigma_sq)
}
alpha[i,k] ~ dnorm(0, 1 / tau_sq)
}
}
log_sigma ~ dnorm(0, 0.001)
sigma_sq <- exp(log_sigma)^2
mu ~ dnorm(0, 0.001)
log_tau ~ dnorm(0, 0.001)
tau_sq <- exp(log_tau)^2
sum <- sigma_sq + tau_sq
}")
dat <- list(Y = dcdim(array(Y, c(n, m, 1))), n = n,
m = m, K = 1)
str(dat)
K <- c(1, 10, 25, 50)
dcfit1 <- dc.fit(data = dat,
params = c("mu", "sigma_sq", "tau_sq"),
model = model, n.iter = 1000,
n.clones = K,
unchanged = c("n", "m"), multiply = "K")
dcfit2 <- dc.fit(data = dat,
params = c("mu", "sum"),
model = model, n.iter = 1000,
n.clones = K,
unchanged = c("n", "m"), multiply = "K")
dcdiag(dcfit1)
dcdiag(dcfit2)
plot(dcfit1[,c("sigma_sq", "tau_sq")])
pairs(dcfit1[,c("sigma_sq", "tau_sq")])
cov2cor(vcov(dcfit1))
cov2cor(vcov(dcfit2))
pairs(dcfit1)
pairs(dcfit2)
plot(dctable(dcfit1))
plot(dctable(dcfit2))
coef(dcfit1)
coef(dcfit2)
c(sigma_sq, tau_sq, sigma_sq + tau_sq)
library(lme4)
g <- rep(1:n, m)
Yvec <- as.numeric(Y)
mod.lm <- lmer(Yvec ~ 1 + (1|g))
summary(mod.lm)
plot(alpha[1:n], ranef(mod.lm)$g[,1])
- m = 1, check diagnostics, estimates
- m = 2, check diagnostics, estimates

How do we predict $\alpha_i$ based on data cloning? In the cloned model, alpha is replicated across the K clones, so we need a separate run for prediction:
model <- custommodel("model {
for (i in 1:n) {
for (j in 1:m) {
Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
}
alpha[i] ~ dnorm(0, 1 / tau_sq)
}
param[1:3] ~ dmnorm(cf[1:3], V[1:3,1:3])
mu <- param[1]
log_sigma <- param[2]
sigma_sq <- exp(log_sigma)^2
log_tau <- param[3]
tau_sq <- exp(log_tau)^2
}")
## we need parameters on log scale
## calculate covariance matrix by hand
## create a matrix of the posterior samples
pos <- as.matrix(dcfit1)
head(pos)
pos[,"sigma_sq"] <- log(sqrt(pos[,"sigma_sq"]))
pos[,"tau_sq"] <- log(sqrt(pos[,"tau_sq"]))
colnames(pos)[2:3] <- c("log_sigma", "log_tau")
head(pos)
(V <- cov(pos) * nclones(dcfit1))
(cf <- colMeans(pos))
dat <- list(Y = Y, n = n, m = m,
cf = cf, V = solve(V)) # precision matrix
pred <- jags.fit(data = dat,
params = c("alpha"),
model = model)
alpha_b <- cbind(est=coef(fit)[grep("alpha", varnames(fit))],
t(quantile(fit[,grep("alpha", varnames(fit))],
probs = c(0.025, 0.975))))
alpha_dc <- cbind(est=coef(pred),
t(quantile(pred, probs = c(0.025, 0.975))))
head(alpha_b)
head(alpha_dc)
plot(1:n, alpha[order(alpha)], type = "l",
ylim = range(alpha, alpha_b, alpha_dc))
points(1:n - 0.2, alpha_b[order(alpha),1],
col = 2, pch = 19, cex = 0.5)
segments(x0 = 1:n - 0.2, x1 = 1:n - 0.2,
y0 = alpha_b[order(alpha),2],
y1 = alpha_b[order(alpha),3], col = 2)
points(1:n + 0.2, alpha_dc[order(alpha),1],
col=4, pch = 19, cex = 0.5)
segments(x0 = 1:n + 0.2, x1 = 1:n + 0.2,
y0 = alpha_dc[order(alpha),2],
y1 = alpha_dc[order(alpha),3], col = 4)
table(rowSums(sign(alpha - alpha_b[,-1])))  ## 0 means the true alpha is inside the interval
table(rowSums(sign(alpha - alpha_dc[,-1]))) ## 0 means the true alpha is inside the interval
model <- custommodel("model {
for (i in 1:n) {
for (j in 1:m) {
Y[i,j] ~ dnorm(mu + alpha[i], 1 / sigma_sq)
}
alpha[i] ~ dnorm(0, 1 / tau_sq)
}
sigma_sq ~ dunif(0.001, 5)
mu ~ dnorm(10, 1)
tau_sq ~ dgamma(0.01, 0.01)
}")
dat <- list(Y = Y, n = n, m = m)
fit2 <- jags.fit(data = dat,
params = c("mu", "sigma_sq", "tau_sq", "alpha"),
model = model, n.update = 30000)
alpha_b2 <- cbind(est=coef(fit2)[grep("alpha", varnames(fit))],
t(quantile(fit2[,grep("alpha", varnames(fit))],
probs = c(0.025, 0.975))))
plot(1:n, alpha[order(alpha)], type = "l",
ylim = range(alpha, alpha_b, alpha_dc))
points(1:n - 0.2, alpha_b[order(alpha),1],
col = 2, pch = 19, cex = 0.5)
segments(x0 = 1:n - 0.2, x1 = 1:n - 0.2,
y0 = alpha_b[order(alpha),2],
y1 = alpha_b[order(alpha),3], col = 2)
points(1:n + 0.2, alpha_b2[order(alpha),1],
col=3, pch = 19, cex = 0.5)
segments(x0 = 1:n + 0.2, x1 = 1:n + 0.2,
y0 = alpha_b2[order(alpha),2],
y1 = alpha_b2[order(alpha),3], col = 3)
\(Y_{ij} \sim Bernoulli(p_{ij})\), $i=1, 2, \ldots, n$ clusters, $j=1, 2, \ldots, m$ observations within clusters, \(logit(p_{ij}) = \beta_{0} + \beta_{1} X_{ij} + \alpha_{i}\), \(\alpha_{i} \sim Normal(0, \sigma^2)\).
In these Neyman-Scott type problems, there are two types of asymptotics:
- the number of clusters $n$ increases while the cluster size $m$ stays fixed;
- the cluster size $m$ increases while the number of clusters $n$ stays fixed.

Of course, one can have both the number of clusters and the cluster size going to infinity, but that is a simple and rather unrealistic situation.
Under the second type of asymptotics, where the number of observations within each cluster increases but the number of clusters does not, the $\alpha_i$ behave like fixed effects: the information about $\sigma^2$ does not increase, and hence one cannot obtain a consistent estimator of it.
set.seed(150467)
n <- 100
m <- 2
beta <- c(1.5, -1)
sigma_sq <- 1
x <- runif(n * m, min = 0, max = 1)
g <- rep(1:n, each = 2)
alpha <- rnorm(n, mean = 0, sd = sqrt(sigma_sq))[g]
mu <- alpha + beta[1] + beta[2] * x
p <- exp(mu) / (1 + exp(mu))
Y <- rbinom(n * m, 1, p)
dim(Y) <- c(n, m)
summary(Y)
Cloning goes with n (the number of clusters).
model <- custommodel("model {
for (i in 1:n) {
for (j in 1:m) {
Y[i,j] ~ dbern(p[i,j])
logit(p[i,j]) <- alpha[i] + beta[1] + beta[2] * X[i,j]
}
alpha[i] ~ dnorm(0, 1 / sigma_sq)
}
beta[1] ~ dnorm(0, 0.01)
beta[2] ~ dnorm(0, 0.01)
sigma_sq ~ dgamma(0.001, 0.001)
}")
dat <- list(Y = Y,
X = matrix(x, n, m),
n = n, m = m)
dcfit1 <- dc.fit(data = dat,
params = c("beta", "sigma_sq"),
model = model, n.iter = 1000,
n.clones = c(1, 2, 4, 8),
unchanged = "m", multiply = "n")
Alternative data cloning approach:
model <- custommodel("model {
for (k in 1:K) {
for (i in 1:n) {
for (j in 1:m) {
Y[i,j,k] ~ dbern(p[i,j,k])
logit(p[i,j,k]) <- alpha[i,k] + beta[1] + beta[2] * X[i,j]
}
alpha[i,k] ~ dnorm(0, 1 / sigma_sq)
}
}
beta[1] ~ dnorm(0, 0.01)
beta[2] ~ dnorm(0, 0.01)
sigma_sq ~ dgamma(0.001, 0.001)
}")
dat <- list(Y = dcdim(array(Y, c(n, m, 1))),
X = matrix(x, n, m),
n = n, m = m, K = 1)
dcfit2 <- dc.fit(data = dat,
params = c("beta", "sigma_sq"),
model = model, n.iter = 1000,
n.clones = c(1, 2, 4, 8),
unchanged = c("n", "m", "X"), multiply = "K")
Cloning goes with m (the cluster size); the matrices are transposed.
model <- custommodel("model {
for (i in 1:n) {
for (j in 1:m) {
Y[j,i] ~ dbern(p[j,i])
logit(p[j,i]) <- alpha[i] + beta[1] + beta[2] * X[j,i]
}
alpha[i] ~ dnorm(0, 1 / sigma_sq)
}
beta[1] ~ dnorm(0, 0.01)
beta[2] ~ dnorm(0, 0.01)
sigma_sq ~ dgamma(0.001, 0.001)
}")
dat <- list(Y = t(Y),
X = t(matrix(x, n, m)),
n = n, m = m)
dcfit3 <- dc.fit(data = dat,
params = c("beta", "sigma_sq"),
model = model, n.iter = 1000,
n.clones = c(1, 2, 4, 8),
unchanged = "n", multiply = "m")
dcdiag(dcfit1)
dcdiag(dcfit2)
dcdiag(dcfit3)
The model: \((Y_{i} \mid \lambda_i) \sim Poisson(\lambda_{i})\), $i = 1, 2, \ldots, n$, \(log(\lambda_i) = \alpha_i + \mathbf{X}^{\top}_{i} \mathbf{\beta}\), and \(\alpha_i \sim Normal(0, \sigma^2)\).
set.seed(1234)
n <- 20
beta <- c(1, -1)
sigma_sq <- 0.5
x <- rnorm(n)
X <- model.matrix(~x)
alpha <- rnorm(n, mean = 0, sd = sqrt(sigma_sq))
lambda <- exp(alpha + X %*% beta)
Y <- rpois(n, lambda = lambda)
table(Y)
model <- custommodel("model {
for (i in 1:n) {
Y[i] ~ dpois(lambda[i])
log(lambda[i]) <- alpha[i] + inprod(X[i,], beta)
alpha[i] ~ dnorm(0, 1 / sigma_sq)
}
for (j in 1:px) {
beta[j] ~ dnorm(0, 0.001)
}
log_sigma ~ dnorm(0, 0.001)
sigma_sq <- exp(log_sigma)^2
}")
dat <- list(Y = Y, X = X, n = n, px = ncol(X))
fit <- jags.fit(data = dat,
params = c("beta", "sigma_sq"),
model = model)
dcfit <- dc.fit(data = dat,
params = c("beta", "sigma_sq"),
model = model, n.iter = 1000,
n.clones = c(1, 2, 4, 8),
unchanged = "px", multiply = "n")
dcdiag(dcfit)
coef(dcfit)
model <- custommodel("model {
for (i in 1:n) {
Y[i] ~ dpois(lambda[i])
log(lambda[i]) <- alpha[i] + inprod(X[i,], beta)
alpha[i] ~ dnorm(0, 1 / sigma_sq)
}
for (j in 1:px) {
beta[j] ~ dnorm(0, 0.1)
}
log_sigma ~ dnorm(10, 1)
sigma_sq <- exp(log_sigma)^2
}")
dat <- list(Y = Y, X = X, n = n, px = ncol(X))
pred1 <- jags.fit(data = dat,
params = "alpha",
model = model)
model <- custommodel("model {
for (i in 1:n) {
Y[i] ~ dpois(lambda[i])
log(lambda[i]) <- alpha[i] + inprod(X[i,], beta)
alpha[i] ~ dnorm(0, 1 / sigma_sq)
}
param[1:(px + 1)] ~ dmnorm(cf[], V[,])
for (j in 1:px) {
beta[j] <- param[j]
}
log_sigma <- param[px + 1]
sigma_sq <- exp(log_sigma)^2
}")
## put sigma_sq on the log_sigma scale expected by the prediction model
pos <- as.matrix(dcfit)
pos[,"sigma_sq"] <- log(sqrt(pos[,"sigma_sq"]))
cf <- colMeans(pos)
V <- cov(pos) * nclones(dcfit)
dat <- list(Y = Y, X = X, n = n, px = ncol(X),
cf = cf, V = solve(V)) # precision matrix
pred2 <- jags.fit(data = dat,
params = "alpha",
model = model)
alpha_b <- cbind(est=coef(pred1),
t(quantile(pred1, probs = c(0.025, 0.975))))
alpha_dc <- cbind(est=coef(pred2),
t(quantile(pred2, probs = c(0.025, 0.975))))
plot(1:n, alpha[order(alpha)], type = "l",
ylim = range(alpha, alpha_b, alpha_dc))
points(1:n - 0.2, alpha_b[order(alpha),1],
col = 2, pch = 19, cex = 0.5)
segments(x0 = 1:n - 0.2, x1 = 1:n - 0.2,
y0 = alpha_b[order(alpha),2],
y1 = alpha_b[order(alpha),3], col = 2)
points(1:n + 0.2, alpha_dc[order(alpha),1],
col=4, pch = 19, cex = 0.5)
segments(x0 = 1:n + 0.2, x1 = 1:n + 0.2,
y0 = alpha_dc[order(alpha),2],
y1 = alpha_dc[order(alpha),3], col = 4)
table(rowSums(sign(alpha - alpha_b[,-1]))) / n
table(rowSums(sign(alpha - alpha_dc[,-1]))) / n
Use different prior distributions on the parameters and see how they affect the coverage.
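As a sketch of how to summarize this more directly (using the alpha, alpha_b and alpha_dc objects created above), the empirical coverage of the 95% intervals is simply the proportion of intervals containing the true $\alpha_i$:
## proportion of 95% intervals that contain the true alpha values
c(Bayes = mean(alpha >= alpha_b[, 2] & alpha <= alpha_b[, 3]),
DC = mean(alpha >= alpha_dc[, 2] & alpha <= alpha_dc[, 3]))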
What have we learnt?
- Hierarchical models: Linear mixed models, measurement error.
- Latent variables versus parameters.
- Estimation and inference for the parameters.
- Prediction and inference (coverage) for the latent variables.
Kiefer, J. and Wolfowitz, J., 1956. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist., 27, 887–906.