This document is a collection of tutorials for the one-day short course ‘Hierarchical models for conservation biologists made easy’ at the NACCB congress in Madison, WI, on July 16th. The tutorials are intended to help participants make sure that the necessary software is installed ahead of the course. Please spend some time reading the R quickstart and statistical concepts sections so that we can jump right into the fun part of hierarchical modeling on July 16th.
This introduction is provided in PDF and Rmd file formats. Use RStudio (see the installation guide below) to open the Rmd (R markdown) file and directly access the R code that is used in the PDF.
We aim this training course towards conservation professionals who need to understand and feel comfortable with modern statistical and computational tools used to address conservation issues. Conservation science needs to be transparent and credible to be able to make an impact and translate information and knowledge into action.
The congress program is now available online. Congress registration is closed.
Communicating scientific methods and results requires a full understanding of concepts, assumptions and implications. However, most ecological data used in conservation decision making are inherently noisy, due both to intrinsic stochasticity found in nature and to extrinsic factors of the observation process. We are often faced with the need to combine multiple studies across different spatial and temporal resolutions. Natural processes are often hierarchical. Missing data, measurement error, and soft data provided by expert opinion need to be accommodated during the analysis. Data are often limited (rare species, emerging threats), so small sample corrections are important for properly quantifying uncertainty.
Hierarchical models are useful in such situations. Fitting these models to data, however, is difficult. Advances in the last couple of decades in statistical theory and software development have fortunately made the data analysis easier, although not trivial. In this course, we propose to introduce statistical and computational tools for the analysis of hierarchical models (including tools for small sample inference) specifically in the context of conservation issues.
We will teach both Bayesian and likelihood-based approaches to these models using freely available software developed by the tutors. By presenting both approaches, participants will be able to go beyond the rhetoric of statistical philosophy and use the tools with a full understanding of their assumptions and implications. This will help ensure that when they use these statistical techniques, be they Bayesian or frequentist, they will be able to explain and communicate the results to managers and the general public appropriately.
Participants should bring their own laptops. Open source and free software, and electronic course material will be provided by organizers.
Morning coffee, afternoon iced tea and lemonade, plus a light snack of chips and salsa will be provided.
Information and course notes are going to be posted on the course website at http://datacloning.org/courses/2016/madison/.
Please follow these steps to install the necessary software on the computer you will use at the course, so that we can spend more time talking about modeling.
Follow the instructions at the R website to download and install the most up-to-date base R version suitable for your operating system (the latest R version at the time of writing these instructions is 3.3.1).
Having RStudio is not absolutely necessary, but our course material will follow a syntax that is close to RStudio’s R markdown notation, so having RStudio will make our life easier. RStudio is also available for different operating systems. Pick the open source desktop edition from here (the latest RStudio Desktop version at the time of writing these instructions is 0.99.902).
We will use JAGS during the course because it is robust, easy to install, and available cross-platform. Download the latest version suitable for your operating system from here (the latest JAGS version at the time of writing these instructions is 4.2.0).
Note: due to recent changes in R’s Windows toolchain (this affects Windows installations only), pay attention to matching versions: use JAGS-4.2.0-Rtools33.exe with R 3.3.0 or later (built with the new Rtools 3.3 toolchain), and JAGS-4.2.0.exe with earlier R versions.
Once R/RStudio and JAGS are installed, run the following commands in R/RStudio to install the necessary R packages (rjags, dclone, coda, snow, rlecuyer):
install.packages(c("rjags","dclone","coda","snow","rlecuyer"))
Because there are dependencies and version requirements, it is best to check that everything works. Please run the following code and follow the prompts:
check_if_ready_for_the_course <- function() {
    if (getRversion() < "2.15.1")
        stop("R >= 2.15.1 required")
    cat("--- R version is",
        as.character(getRversion()), "--- OK\n")
    if (!require(dclone))
        stop("dclone package not installed")
    if (packageVersion("dclone") < "2.1.1")
        stop("dclone >= 2.1.1 required")
    cat("--- dclone package version is",
        as.character(packageVersion("dclone")), "--- OK\n")
    if (!require(coda))
        stop("coda package not installed")
    if (packageVersion("coda") < "0.13")
        stop("coda >= 0.13 required")
    cat("--- coda package version is",
        as.character(packageVersion("coda")), "--- OK\n")
    if (!require(rjags))
        stop("rjags package not installed")
    if (packageVersion("rjags") < "4.4")
        stop("rjags >= 4.4 required")
    cat("--- rjags package version is",
        as.character(packageVersion("rjags")), "--- OK\n")
    if (!require(snow))
        stop("snow package not installed")
    cat("--- snow package version is",
        as.character(packageVersion("snow")), "--- OK\n")
    if (!require(rlecuyer))
        stop("rlecuyer package not installed")
    cat("--- rlecuyer version is",
        as.character(packageVersion("rlecuyer")), "--- OK\n")
    cat("\n--- YOU ARE READY TO GO!\n\n")
    invisible(NULL)
}
check_if_ready_for_the_course()
Congratulations! Now your computer is ready for the course.
Contact Peter Solymos if you still have problems.
This section covers the data structures and operators we will use during the short course.
R is a great calculator:
1 + 2
Assign a value and print an object:
print(x <- 2)
(x = 2) # shorthand for print
x == 2 # logical operator, not assignment
y <- x + 0.5
y # another way to print
Logical operators:
x == y # equal
x != y # not equal
x < y # smaller than
x >= y # greater than or equal
Vectors and sequences:
x <- c(1, 2, 3)
x
1:3
seq(1, 3, by = 1)
rep(1, 5)
rep(1:2, 5)
rep(1:2, each = 5)
Vector operations, recycling:
x + 0.5
x * c(10, 11, 12, 13)
Indexing vectors, ordering:
x[1]
x[c(1, 1, 1)] # a way of repeating values
x[1:2]
x[x != 2]
x[x == 2]
x[x > 1 & x < 3]
order(x, decreasing=TRUE)
x[order(x, decreasing=TRUE)]
rev(x) # reverse
Character vectors, NA values, and sorting:
z <- c("b", "a", "c", NA)
z[z == "a"]
z[!is.na(z) & z == "a"]
z[is.na(z) | z == "a"]
is.na(z)
which(is.na(z))
sort(z)
sort(z, na.last=TRUE)
Matrices and arrays:
(m <- matrix(1:12, 4, 3))
matrix(1:12, 4, 3, byrow=TRUE)
array(1:12, c(2, 2, 3))
Attributes:
dim(m)
dim(m) <- NULL
m
dim(m) <- c(4, 3)
m
dimnames(m) <- list(letters[1:4], LETTERS[1:3])
m
attributes(m)
Matrix indices:
m[1:2,]
m[1,2]
m[,2]
m[,2,drop=FALSE]
m[2]
m[rownames(m) == "c",]
m[rownames(m) != "c",]
m[rownames(m) %in% c("a", "c", "e"),]
m[!(rownames(m) %in% c("a", "c", "e")),]
Lists and indexing:
l <- list(m = m, x = x, z = z)
l
l$ddd <- sqrt(l$x)
l[2:3]
l[["ddd"]]
Data frames:
d <- data.frame(x = x, sqrt_x = sqrt(x))
d
Structure:
str(x)
str(z)
str(m)
str(l)
str(d)
str(as.data.frame(m))
str(as.list(d))
Summary:
summary(x)
summary(z)
summary(m)
summary(l)
summary(d)
Note that a matrix has a dim attribute and all its elements are of the same mode.
Random numbers can be generated from a distribution using the r* functions, where the wildcard (*) stands for the abbreviated name of the distribution, for example rnorm.
n <- 1000
## draw n random numbers from Normal(0, 1)
hist(rnorm(n, mean = 0, sd = 1))
## Uniform (not as 'run-if' but as 'r-unif')
hist(runif(n, min = 0, max = 1))
## Poisson
plot(table(rpois(n, lambda = 5)))
## Binomial
plot(table(rbinom(n, size = 10, prob = 0.25)))
The coda R package defines MCMC list objects (mcmc.list) such that the length of the mcmc.list object equals the number of parallel chains. For example, if we are monitoring 2 variables, a normally and a uniformly distributed one, the structure might look like this:
mcmc <- replicate(3,
structure(cbind(a = rnorm(n), b = runif(n)),
mcpar = c(1, n, 1), class = "mcmc"),
simplify = FALSE)
class(mcmc) <- "mcmc.list"
str(mcmc)
Some basic methods for such mcmc.list objects are defined in the coda and dclone packages:
library(dclone)
summary(mcmc)
str(as.matrix(mcmc))
varnames(mcmc)
start(mcmc)
end(mcmc)
thin(mcmc)
plot(mcmc)
traceplot(mcmc)
densplot(mcmc)
pairs(mcmc)
Science, as we envision it, is an interplay between inductive and deductive processes. Francis Bacon, the father of what is known as the scientific method, emphasized the roles of observations, alternative explanations and tests in choosing among various explanations. Bacon saw science as an inductive process, moving from the particular to the general.
Karl Popper proposed the doctrine of falsification, which defines what is acceptable as a scientific hypothesis: if a statement cannot be falsified, then it is not a scientific hypothesis. This is intrinsically a deductive process. What is common to these different views is that theories need to be probed to assess their correctness. Observations play an important role in such probing.
In most scientific situations, we are interested in understanding the natural processes that have given rise to the observations. Such understanding generally leads to prediction and possibly control of the processes. Traditionally, we formulate our understanding of the processes in terms of mathematical models. These mathematical models may be deterministic or may be stochastic.
It is widely accepted, at least by statisticians, that stochastic models represent nature more effectively than purely deterministic models. Aside from the natural stochasticity in the system, the observations themselves are often subject to measurement error, making it necessary to model the observation process stochastically as well.
One of the consequences of stochastic models is that Popper’s theory of falsification does not strictly apply. No data are strictly inconsistent with a stochastic model, except in artificial situations or trivially improper models. Thus, one can only say that the observed data are more likely under one model than the other; or that the strength of evidence for one hypothesis is larger than for an alternative. We cannot outright accept or reject a hypothesis.
Given a set of stochastic models (or, equivalently a set of alternative descriptions of the underlying process), the goal of statistical inference is to choose the model that is best supported by the data. Thus, statistical inference is both deductive (it makes some predictions) and inductive (data determines which model is best supported). An important feature that we demand from all our statistical inference procedures is that with infinite amount of information, the probability of choosing the correct model converges to one.
Another important feature of statistical inference is that it is uncertain. We want to know whether or not our inferential statement, namely the choice of the model, is trustworthy. Quantifying the uncertainty in the inferential statements is a critical issue and has led to different statistical philosophies of inference, in particular the frequentist philosophy and the Bayesian philosophy. Just as numbers without understanding the units are meaningless, statistical inferential statements without proper understanding of the uncertainty are meaningless.
We will discuss the differences in the two approaches to quantify uncertainty in the statistical inference in detail in the context of a simple example later. For the interested researcher, there are several resources available that discuss these issues in depth.
We also do not intend to give a detailed tutorial on the basics of statistical inference. There are many standard reference books for such introduction.
Let us start with a simple occupancy model. We will use this model to introduce various important concepts that will be used throughout the course. We will use it also to introduce some basic commands for analyzing data using the R package dclone.
In conservation biology, one of the first things we want to do is monitor the current status of the population. This can be done in terms of simple presence-absence data, answering the question: what is the proportion of occupied sites? If this proportion is high, it may imply that we should not worry too much about the species (if it is something we want to maintain), or maybe that we want to apply some biological control (if it is an invasive species). A simple monitoring procedure would consist of the following steps: select a number of sites at random, survey each site, and record whether the species is present (1) or absent (0).
It is critical that we state the assumptions underlying the statistical model. In practice, however, we may or may not be able to know whether all the assumptions are fulfilled or not.
Mathematically we write this as follows:
\(Y_{i} \sim Binomial(1, p)\) (this is also known as the Bernoulli distribution) are independent, identically distributed (i.i.d.) random variables.
The probability mass function of the Bernoulli random variable is written as: $P(Y=y) = p^y (1-p)^{1-y}$, where $p \in (0,1)$ and $y=0,1$.
We can now write down the likelihood function. This is proportional to the probability of observing the data at hand:
\[L(p; y_{1}, y_{2}, \ldots, y_{n}) = \prod_{i=1}^{n} p^{y_{i}} (1-p)^{1-y_{i}}\]
We take the product because the observations are assumed to be independent of each other.
To demonstrate this, we simulate a single data set, vary the parameter value, and obtain the function represented below:
## random numbers from Binomial distribution
## Binomial with size=1 is a Bernoulli distribution
## p value set as 0.3
p <- 0.3
set.seed(1234) # set random seed for reproducibility
(y <- rbinom(n = 1000, size = 1, p = p))
y1 <- y[1:10] # take only the 1st 10 elements of y
## pt is the value of p for which we want to calculate the likelihood
pt <- 0.3
## the Likelihood is based on the formula from above
(L <- prod(pt^y1 * (1 - pt)^(1-y1)))
## the following statement is equivalent to typing in the formula
## take advantage of built-in density functions
prod(dbinom(y1, size = 1, prob = pt))
## now pt is a vector between 0 and 1
pt <- seq(0, 1, by = 0.01)
## use the sapply function to calculate the likelihood
## using one element of the vector at a time (argument z becomes prob=z)
## by fixing the data y1
L <- sapply(pt, function(z) prod(dbinom(y1, size = 1, prob = z)))
Now that we calculated the likelihood function, let us plot it:
op <- par(las=1) # always horizontal axis, store old settings in op
## color palettes for nice figures
flatly <- list(
"red"="#c7254e",
"palered"="#f9f2f4",
"primary"="#2c3e50",
"success"="#18bc9c",
"info"="#3498db",
"warning"="#f39c12",
"danger"="#e74c3c",
"pre_col"="#7b8a8b",
"pre_bg"="#ecf0f1",
"pre_border"= "#cccccc")
dcpal_reds <- colorRampPalette(c("#f9f2f4", "#c7254e"))
dcpal_grbu <- colorRampPalette(c("#18bc9c", "#3498db"))
## now we plot the Likelihood function
plot(pt, L, type = "l", col = flatly$info,
main = paste("n =", length(y1)))
abline(v = p, lwd = 2, col = flatly$red) # true value
abline(v = pt[which.max(L)], lwd = 2, col = flatly$success) # ML estimate
As we change the data, the likelihood function changes.
The following code demonstrates how the likelihood function changes when we change the data, simulating repeatedly under the same parameter value while keeping everything else (e.g. sample size) the same.
## function f has a single argument, n: sample size
f <- function(n) {
y <- rbinom(n = n, size = 1, prob = p)
L <- sapply(pt, function(z) prod(dbinom(y, size = 1, prob = z)))
L / max(L)
}
## create a blank plot
plot(0, type = "n", main = "n constant, y changes",
ylim = c(0, 1), xlim = c(0, 1),
xlab = "pt", ylab = "L / max(L)")
## we simulate an n = 25 data set 100 times and
## plot the scaled likelihood function [L / max(L)]
tmp <- replicate(100,
lines(pt, f(25),
col = flatly$info))
abline(v = p, lwd = 2, col = flatly$red)
As we increase the sample size, the likelihood becomes concentrated around the true value. This property of the maximum likelihood estimator is called consistency.
## try different sample sizes, data is fixed
## so samples can be nested
nvals <- c(10, 100, 250, 500, 1000)
## scaled likelihood function using different sample sizes
Lm <- sapply(nvals,
function(n) {
L <- sapply(pt, function(z)
prod(dbinom(y[1:n], size = 1, prob = z)))
L / max(L)
})
## plot the results
matplot(pt, Lm, type = "l",
lty = 1, ylab = "L / max(L)", main = "n increases",
col = unlist(flatly)[3:7])
abline(v = p, lwd = 2, col = flatly$red)
text(apply(Lm, 2, function(z) pt[which.max(z)]),
0.5+ 0.1*c(0:4), paste("n =", nvals),
col = unlist(flatly)[3:7])
The likelihood value represents the support in the data for a particular parameter value. This is intrinsically a relative concept: how much more support do we have for this parameter value vs. another parameter value? The likelihood ratio is therefore a more fundamental concept than the likelihood function itself.
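As a quick sketch of this idea (reusing the y1 data simulated above; the object names lik1 and lik2 are just for this example), we can compare the support for two candidate values of p directly via their likelihood ratio:
## likelihoods of two candidate parameter values given the same data
lik1 <- prod(dbinom(y1, size = 1, prob = 0.3))
lik2 <- prod(dbinom(y1, size = 1, prob = 0.5))
## likelihood ratio: support for p = 0.3 relative to p = 0.5
lik1 / lik2
## on the log scale the ratio becomes a difference
log(lik1) - log(lik2)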
We can now summarize the goals of statistical inference:
Check out the following functions and play around to see how sample size affects the likelihood function. The easiest way to access them is to install a half-baked (i.e. not yet well documented) package from GitHub:
library(devtools)
install_github("datacloning/dcapps")
Play with the parameter settings for these distributions:
library(dcapps)
op <- par(mfrow = c(2, 2))
distr("Bernoulli", binom_p = 0.3)
distr("Binomial", binom_p = 0.3, binom_size = 10)
distr("Poisson", poisson_lambda = 5)
distr("Uniform", unif_a = -1, unif_b = 1)
par(op)
op <- par(mfrow = c(2, 2))
distr("Normal", normal_mu = 0, normal_var = 1)
distr("Lognormal", normal_mu = 0, normal_var = 1)
distr("Beta", beta_shape1 = 1, beta_shape2 = 1)
distr("Gamma", gamma_shape = 1, gamma_rate = 1)
par(op)
Vary the sample size (n) and the random seed (seed) to see how they impact the maximum likelihood estimator (MLE):
op <- par(mfrow = c(2, 2))
mle(p = 0.3, n = 10, seed = 0)
mle(p = 0.3, n = 10, seed = 1234)
mle(p = 0.3, n = 100, seed = 0)
mle(p = 0.3, n = 1000, seed = 0)
par(op)
Which parameter value has the largest support in the data?
We can use numerical optimization to get the value of a parameter where the likelihood function is maximal. Such a parameter value is called a (point) estimate, while the function we are using to do the estimation (in this case the likelihood function, but there might be other functions, too) is called an estimator.
In numerical optimization, we often find the minimum of the negative of a function instead of finding the maximum. We also work with the log likelihood, because the product becomes a sum on the log scale, which is much easier to compute. That is why we often find that programs define the negative log likelihood function, as we do below.
## this functions simulates observations
sim_fun <- function(n, p) {
rbinom(n = n, size = 1, p = p)
}
## this function returns the negative log likelihood value
nll_fun <- function(p, y) {
-sum(dbinom(y, size = 1, prob = p, log = TRUE))
}
We use $n=100$ and $p=0.3$ to simulate the observation vector $y$. Then we use the one-dimensional optimization function, optimize. (For multivariate optimization problems, see the optim function.) What is different between using optimization and manually setting up a grid of values is that optimization starts with a sparse grid to see which region of the parameter space is of interest. The search for the minimum (or maximum) is then continued with more intensity in this region, i.e. until the difference between subsequent candidate estimates reaches a pre-defined tolerance threshold (the tol argument in optimize).
n <- 100
p <- 0.3
y <- sim_fun(n, p)
optimize(nll_fun, interval = c(0, 1), y = y)
Once we can write down the likelihood, we can in principle write a program to calculate the value of the (negative log) likelihood function given some parameter value and the data.
Let us revisit now what happened when we kept the sample size fixed but changed the data. In this case, we get different parameter estimates (MLEs) for different data sets. A natural question to ask would be: How much would the answers vary if we have different samples?
In the following program we pre-allocate a vector of length $B$, simulate the data $B$ times, and store the corresponding MLEs in the object res:
B <- 1000
res <- numeric(B)
for (i in 1:B) {
y <- sim_fun(n, p)
res[i] <- optimize(nll_fun, interval = c(0, 1), y = y)$minimum
}
Some summary statistics reveal interesting things: the $B$ estimates now have a sampling distribution that can be characterized by its mean and various percentiles:
summary(res)
quantile(res, c(0.025, 0.975))
The bias is defined as the deviation between the estimate and the true parameter value. When the bias converges to 0 with increasing sample size, we say that the estimator is consistent:
mean(res - p)
Precision is the corresponding feature of an estimate.
The 2.5% and 97.5% percentiles of the sampling distribution correspond to the 95% analytical confidence intervals around the true parameter value.
level <- 0.95
a <- (1 - level) / 2
a <- c(a, 1 - a)
(ci0 <- quantile(res, a))
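For comparison, here is a minimal sketch (the object name ci_true is just for this example) that evaluates the asymptotic normal-approximation interval at the true parameter value, using the binomial variance $p(1-p)/n$, and places it next to the percentiles of the sampling distribution:
## analytical (asymptotic) 95% interval at the true p
(ci_true <- qnorm(a, mean = p, sd = sqrt(p * (1 - p) / n)))
## compare with the percentiles of the sampling distribution
rbind(simulation = ci0, analytical = ci_true)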
An estimator is called efficient when the variation in the sampling distribution and the confidence interval gets smaller with increasing sample size. Accuracy is the corresponding feature of an estimate. When the percentiles of the sampling distribution are close to the corresponding analytical confidence intervals, we say that the estimator has nominal coverage.
The following plot shows the sampling distribution of the estimates, the true parameter value, the mean of the estimates, the analytical confidence intervals and the quantiles of the sampling distribution. The values overlap perfectly, that is why the red lines are not visible:
hist(res, col = flatly$red, border = flatly$pre_border,
    xlim = c(0, 1), main = "Sampling distribution")
rug(res + runif(B, -0.01, 0.01), col = flatly$info, lwd = 2)
## true value and analytical confidence limits (red, drawn first)
abline(v = p, lwd = 2, col = flatly$red)
abline(v = qnorm(c(0.025, 0.975), mean = p, sd = sqrt(p * (1 - p) / n)),
    lwd = 2, col = flatly$red, lty = 2)
## sampling distribution based summary statistics (drawn on top)
abline(v = mean(res), lwd = 2, col = flatly$success)
abline(v = quantile(res, c(0.025, 0.975)),
    lwd = 2, col = flatly$success, lty = 2)
Of course, in real life, we do not have the luxury of conducting such repeated experiments. So what good are these ideas?
One way to quantify the uncertainty in the estimate is to use asymptotic confidence intervals, as we saw above. We called that interval analytical because for this particular model we could calculate it analytically; this might not be the case in every situation. One can estimate the asymptotic standard error and confidence interval of an estimate as follows:
## our data
y <- sim_fun(n, p)
## MLE
(est <- optimize(nll_fun, interval = c(0, 1), y = y)$minimum)
(ci1 <- qnorm(a, mean = est,
sd = sqrt(est * (1-est) / n)))
We have the MLE, and the MLE is reasonably close to the true parameter value. So suppose we pretend that the MLE is the true parameter value: we can then get the sampling distribution and the confidence interval. This is the idea behind parametric bootstrap confidence intervals.
B <- 1000
pbres <- numeric(B)
for (i in 1:B) {
yb <- sim_fun(n, est) # treat est as 'true' value and estimate
pbres[i] <- optimize(nll_fun, interval = c(0, 1), y = yb)$minimum
}
(ci2 <- quantile(pbres, a))
The non-parametric bootstrap is based on a similar principle, but instead of simulating data sets under our initial estimate, we mimic the experiment by resampling the original data set with replacement $B$ times:
## we use the same settings and data as for the parametric bootstrap
npbres <- numeric(B)
for (i in 1:B) {
yb <- sample(y, replace = TRUE)
npbres[i] <- optimize(nll_fun, interval = c(0, 1), y = yb)$minimum
}
(ci3 <- quantile(npbres, a))
Let us compare the true CI with the estimated CIs:
rbind(true = ci0, asy = ci1, pb = ci2, npb = ci3)
We repeat the above experiment multiple times: generate the data, estimate the 95 percent confidence intervals, and check whether each interval contains the true value. If the true value is contained between the confidence limits in at least 95 percent of the cases, we say the coverage of the estimator is nominal. Here is the code:
R <- 100
ci_res <- list()
for (j in 1:R) {
# cat("run", j, "of", R, "\n") # print run number if you like
flush.console()
## our data
y <- sim_fun(n, p)
## asymptotic CI
est <- optimize(nll_fun, interval = c(0, 1), y = y)$minimum
ci1 <- qnorm(a, mean = est,
sd = sqrt(est * (1-est) / n))
## parametric bootstrap
pbres <- numeric(B)
for (i in 1:B) {
yb <- sim_fun(n, est)
pbres[i] <- optimize(nll_fun,
interval = c(0, 1), y = yb)$minimum
}
ci2 <- quantile(pbres, a)
## non-parametric bootstrap
npbres <- numeric(B)
for (i in 1:B) {
yb <- sample(y, replace = TRUE)
npbres[i] <- optimize(nll_fun,
interval = c(0, 1), y = yb)$minimum
}
ci3 <- quantile(npbres, a)
## store the results
ci_res[[j]] <- rbind(asy = ci1, pb = ci2, npb = ci3)
}
Calculating coverage for the 3 different methods of obtaining the 95% confidence intervals:
## a single run
ci_res[[1]]
## compare with true p value
ci_res[[1]] - p
## we expect lower CL to be negative
## and upper CL to be positive
sign(ci_res[[1]] - p)
## so row sum is 0 if the true value is within CI
coverage <- t(sapply(ci_res, function(z) rowSums(sign(z - p))))
## turn non-0 values into 1
coverage[coverage != 0] <- 1
## take the complement
coverage <- 1 - coverage
colMeans(coverage)
Visualizing the results:
op <- par(mfrow = c(3,1), mar = c(4, 4, 2, 1) + 0.1)
plot(0, type = "n", xlim = c(1, R), ylim = c(0, 1),
main = "Coverage: asymptotic", xlab = "", ylab = "p")
segments(1:R, sapply(ci_res, "[", 1, 1),
1:R, sapply(ci_res, "[", 1, 2),
lwd = 2,
col = ifelse(coverage[,1] == 1, flatly$success, flatly$warning))
abline(h = p, col = flatly$red)
plot(0, type = "n", xlim = c(1, R), ylim = c(0, 1),
main = "Coverage: parametric bootstrap", xlab = "", ylab = "p")
segments(1:R, sapply(ci_res, "[", 2, 1),
1:R, sapply(ci_res, "[", 2, 2),
lwd = 2,
col = ifelse(coverage[,2] == 1, flatly$success, flatly$warning))
abline(h = p, col = flatly$red)
plot(0, type = "n", xlim = c(1, R), ylim = c(0, 1),
main = "Coverage: non-parametric bootstrap",
xlab = "Replicates", ylab = "p")
segments(1:R, sapply(ci_res, "[", 3, 1),
1:R, sapply(ci_res, "[", 3, 2),
lwd = 2,
col = ifelse(coverage[,3] == 1, flatly$success, flatly$warning))
abline(h = p, col = flatly$red)
par(op)
This kind of analysis is called the frequentist analysis. We are studying the properties of the inferential statement under the hypothetical replication of the experiment. This analysis tells us about the reliability of the procedure.
The implicit logic is that if the procedure is reliable, we could rely on the inferential statements obtained from only one data set. We choose a procedure that is most reliable.
This is similar to relying more on the blood pressure results from a machine that has small measurement error instead of one with large measurement error.
One major criticism of the frequentist approach is that we do not actually repeat the experiment. What we want to know is: what do the data at hand tell us? The Bayesian approach does not quite answer that question, but answers a different one: given these data, how do I change my beliefs?
Our goal is to infer about the true parameter value (the true occupancy proportion). Prior distribution, $\pi(\theta)$: This quantifies in probabilistic terms our personal beliefs about the true occupancy rate.
We may believe that it is most likely to be 0.7, in which case we consider a distribution with its mode at 0.7. We cannot determine the entire distribution from such information alone, but that is what Bayesian inference demands. It is a very difficult task, but a necessary one if you want to use the Bayesian approach.
Posterior distribution: This quantifies the beliefs as modified by the data. The mathematical formulation is as follows:
\[\pi(\theta \mid y) = \frac{L(\theta;y) \pi(\theta)}{ \int L(\theta;y) \pi(\theta) d\theta }\]
Here $\pi(\theta)$ is the prior distribution, $L(\theta;y)$ is the likelihood function, and $\pi(\theta \mid y)$ is the posterior distribution.
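For the Bernoulli occupancy model the integral in the denominator is one-dimensional, so the posterior can be sketched numerically on a grid. This is a minimal illustration assuming a Beta(1, 1), i.e. uniform, prior and reusing the y1 data from above; the object names p_grid, like, prior and post are just for this example:
## grid of candidate values for p
p_grid <- seq(0.001, 0.999, by = 0.001)
## likelihood of the first 10 observations (y1) at each grid value
like <- sapply(p_grid, function(z) prod(dbinom(y1, size = 1, prob = z)))
## Beta(1, 1), i.e. uniform, prior evaluated on the grid
prior <- dbeta(p_grid, 1, 1)
## posterior is proportional to likelihood x prior,
## normalized so that it integrates to 1 (grid width is 0.001)
post <- like * prior / sum(like * prior * 0.001)
## posterior (solid) and prior (dashed)
plot(p_grid, post, type = "l", xlab = "p", ylab = "density")
lines(p_grid, prior, lty = 2)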
Use the dcapps package to play around with different priors for the Bernoulli model. First, try Beta prior:
op <- par(mfrow = c(2,2))
beta_prior(p = 0.3, n = 5, a = 1, b = 1, scale = "prob")
beta_prior(p = 0.3, n = 5, a = 1, b = 1, scale = "logit")
beta_prior(p = 0.3, n = 5, a = 0.5, b = 0.5, scale = "prob")
beta_prior(p = 0.3, n = 5, a = 0.5, b = 0.5, scale = "logit")
par(op)
op <- par(mfrow = c(2,2))
beta_prior(p = 0.3, n = 100, a = 1, b = 1, scale = "prob")
beta_prior(p = 0.3, n = 100, a = 1, b = 1, scale = "logit")
beta_prior(p = 0.3, n = 100, a = 0.5, b = 0.5, scale = "prob")
beta_prior(p = 0.3, n = 100, a = 0.5, b = 0.5, scale = "logit")
par(op)
Comparing these Beta priors, including the Uniform (Beta(1, 1)) and the Jeffreys prior (Beta(0.5, 0.5)), we can see that the choice of prior influences the posterior, and that this influence diminishes as the sample size grows from n = 5 to n = 100.
It is clear that priors have an effect on the Bayesian inference. And, they should have an effect. However, this subjectivity is bothersome to many scientists. There is an approach that is euphemistically called an objective Bayesian approach. In this approach, we try to come up with priors that have least influence on the inference (e.g. Bernardo, 1980). There is no unique definition of what we mean by non-informative priors. Various priors have been suggested in the literature. The most commonly used non-informative priors are: Uniform priors and the large variance priors. Other priors are nearly impossible to specify for the complex models that ecologists are faced with.
Do these priors affect the inference?
The answer is that they most certainly do. There is nothing wrong with the priors affecting the inference, as long as the researchers can justify their priors.
Use the apps with the Beta, Normal, and bimodal priors and check the effect of the probability vs. logit scale on the prior and posterior. It is obvious that we get different answers for the same data; only the Jeffreys prior is not affected.
There is also a normal_prior and a bimodal_prior function:
n <- 5
op <- par(mfrow = c(1,2))
normal_prior(p = 0.3, n = n, mu = 0, var = 1, scale = "prob")
bimodal_prior(p = 0.3, n = n,
mu1 = -2, var1 = 1, mu2 = 1, var2 = 2, scale = "prob")
par(op)
Fundamentally, there is no such thing as ‘objective’ Bayesian inference. Those who want to use Bayesian inference should simply accept that the inferences are affected by the choice of the priors.
In certain cases (when conjugate prior distributions are used) one can interpret the prior as a set of pseudo-observations. This means that, for example, in the Bernoulli model with a Beta($a$, $b$) prior distribution, the number of pseudo-observations is $a + b - 2$.
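As a small sketch of this conjugacy (standard Beta-Bernoulli updating, reusing the y vector simulated above; the object names a0, b0, a_post and b_post are just for this example): with a Beta(a, b) prior and n Bernoulli observations, the posterior is Beta(a + sum(y), b + n - sum(y)), so a and b act like prior counts of successes and failures:
## conjugate Beta-Bernoulli updating
a0 <- 2
b0 <- 1                           # prior pseudo-observations: a0 + b0 - 2 = 1
a_post <- a0 + sum(y)             # prior 'successes' + observed 1s
b_post <- b0 + length(y) - sum(y) # prior 'failures' + observed 0s
## posterior mean vs. the maximum likelihood estimate (sample mean)
c(post_mean = a_post / (a_post + b_post), mle = mean(y))
## posterior (solid) and prior (dashed) densities
curve(dbeta(x, a_post, b_post), 0, 1, xlab = "p", ylab = "density")
curve(dbeta(x, a0, b0), add = TRUE, lty = 2)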
Difference between Frequentist and Bayesian inferential statements: summarize the philosophical differences between the inferential statements made by a Frequentist and a Bayesian in the context of the occupancy model.
If the sample size is large, the numerical differences between the Frequentist and Bayesian answers vanish. However, their interpretation is different.
Imagine a hypothetical situation where an experiment is repeated by $k$ different observers, and all $k$ experiments happen to result in exactly the same set of observations, $y^{(k)} = \left(y,y,\ldots,y\right)$. The likelihood function based on the combination of the data from these $k$ experiments is $L(\theta, y^{\left(k\right)}) = \left[L\left(\theta, y\right)\right]^k$. The location of the maximum of $L(\theta,y^{(k)})$ exactly equals the location of the maximum of the function $L\left(\theta, y\right)$, and the Fisher information matrix based on this likelihood is $k$ times the Fisher information matrix based on $L\left(\theta, y\right)$.
One can use MCMC methods to calculate the posterior distribution of the model parameters ($\theta$) conditional on the data. Under regularity conditions, if $k$ is large, the posterior distribution corresponding to $k$ clones of the observations is approximately normal with mean $\hat{\theta}$ and variance $1/k$ times the inverse of the Fisher information matrix. When $k$ is large, the mean of this posterior distribution is the maximum likelihood estimate and $k$ times the posterior variance is the corresponding asymptotic variance of the maximum likelihood estimate if the parameter space is continuous.
Data cloning is a computational algorithm to compute maximum likelihood estimates and the inverse of the Fisher information matrix, and is related to simulated annealing. By using data cloning, the statistical accuracy of the estimator remains a function of the sample size and not of the number of cloned copies. Data cloning does not improve the statistical accuracy of the estimator by artificially increasing the sample size. The data cloning procedure avoids the analytical or numerical evaluation of high dimensional integrals, numerical optimization of the likelihood function, and numerical computation of the curvature of the likelihood function.
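To see the effect of cloning without any MCMC, here is a minimal sketch building on the conjugate Beta example above (it reuses a0, b0 and y from there; k, n_obs, post_mean and post_var are names used only in this sketch). Cloning the data k times multiplies the sufficient statistics by k, so the posterior mean approaches the MLE and k times the posterior variance approaches the asymptotic variance of the MLE:
## posterior from k clones of the data: Beta(a0 + k*sum(y), b0 + k*(n - sum(y)))
k <- c(1, 10, 100)
n_obs <- length(y)
post_mean <- (a0 + k * sum(y)) / (a0 + b0 + k * n_obs)
post_var <- post_mean * (1 - post_mean) / (a0 + b0 + k * n_obs + 1)
## compare with the MLE and its asymptotic variance, mean(y) * (1 - mean(y)) / n
cbind(k = k, post_mean = post_mean, k_x_post_var = k * post_var,
    mle = mean(y), asy_var = mean(y) * (1 - mean(y)) / n_obs)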
Use the following app to see what happens when we clone the data instead of increasing sample size:
op <- par(mfrow = c(2, 2))
data_cloning(p = 0.3, n = 10, a = 2, b = 1, K = 1, seed = 0)
data_cloning(p = 0.3, n = 10, a = 0.5, b = 0.5, K = 1, seed = 0)
data_cloning(p = 0.3, n = 10, a = 2, b = 1, K = 100, seed = 0)
data_cloning(p = 0.3, n = 10, a = 0.5, b = 0.5, K = 100, seed = 0)
par(op)
MCMC stands for Markov chain Monte Carlo. Before fitting the model with MCMC, note that the maximum likelihood estimate of p can also be obtained by fitting an intercept-only binomial GLM and back-transforming the intercept from the logit scale:
m <- glm(formula = y ~ 1, family=binomial("logit"))
summary(m)
coef(m)
exp(coef(m)) / (1 + exp(coef(m)))
plogis(coef(m))
mean(y)
confint(m)
library(dclone)
library(rjags)
model <- custommodel("model {
for (i in 1:n) {
#Y[i] ~ dbin(p, 1) # Binomial(N,p)
Y[i] ~ dbern(p) # Bernoulli(p)
}
#p ~ dunif(0.001, 0.999)
p ~ dbeta(1, 3)
}")
dat <- list(Y = y, n = n)
fit <- jags.fit(data = dat, params = "p", model = model)
summary(fit)
plot(fit)
To make sure that both locations and clones are independent (i.i.d.), it is safest to include an extra dimension and the corresponding loop.
model <- custommodel("model {
for (k in 1:K) {
for (i in 1:n) {
Y[i,k] ~ dbin(p, 1)
}
}
logit_p <- log(p/(1-p))
p ~ dunif(0.001, 0.999)
}")
dat <- list(Y = dcdim(data.matrix(y)), n = n, K = 1)
dcfit <- dc.fit(data = dat, params = "logit_p", model = model,
n.clones = c(1,100), unchanged = "n", multiply = "K")
summary(dcfit)
plot(dcfit)
dctable(dcfit)
plot(dctable(dcfit))
dcdiag(dcfit)
plot(dcdiag(dcfit))
More in person. Safe travels and see you in Madison!