options(tidyverse.quiet = TRUE)
library(tidyverse)
library(gganimate)
set.seed(2021)
Computational Bayes (R)
Repo for this notebook.
This notebook was developed to provide an intuitive understanding of different Bayesian inference techniques and the actual computations behind them. The focus is on the what and how rather than the why. Mathematical derivations of why a technique works are left for further research; instead, I hope to demonstrate what each inference technique aims to accomplish and how to accomplish it by implementing simple solutions with minimal tooling, making the mechanics visible and easily understood. I also hope to show how the techniques relate to one another, providing a unified picture of what we are trying to accomplish in Bayesian inference.
This notebook is composed of two main sections representing two common problems faced in Bayesian inference:
- Parameter inference (global parameters)
- Latent variable inference (local parameters)
Suppose we have a random variable \(X\) that is distributed
\[ \begin{aligned} & \mu = 2 \\ & \sigma = 1 \\ & X \sim Normal(\mu, \sigma ) \end{aligned} \]
N <- 60
mu <- 2
sigma <- 1
x <- rnorm(N, mu, sigma)

hist(x)
If \(\mu\) is unknown, how do we perform inference? In Bayesian inference we hope to find the posterior of a parameter given some data, \(p(\mu|X=x)\). We can accomplish this using Bayes' rule:
\[ p(\mu|x) = \frac{p(x|\mu)p(\mu)}{\int_{\mu} p(x,\mu)d\mu} = \frac{p(x|\mu)p(\mu)}{p(x)} \]
By specifying a prior over \(\mu\) we can then update this distribution given some data to produce a posterior distribution. Three common methods for solving this problem follow.
Once we get to variational inference we will also look at two ways of solving latent variable models.
Conjugate Prior
By matching the likelihood distribution with a conjugate prior we can solve for \(p(x)\) analytically and recover a posterior distribution whose parameters can be estimated analytically as well.
A normal distribution is the conjugate prior for \(\mu\) when the data likelihood is also a normal distribution and the standard deviation is known.
\[ \mu \sim Normal(\mu_\mu, \sigma_\mu) \]
The parameter estimates for the posterior distribution of \(\mu\) are:
\[ \begin{aligned} & \mu_{\mu|x} = \frac{\frac{\mu_\mu}{\sigma_\mu^2} + \frac{\sum X}{\sigma^2}}{1/\sigma_\mu^2 + N/\sigma^2} \\ & \sigma_{\mu|x} = \sqrt{\frac{1}{\frac{1}{\sigma_\mu^2} + \frac{N}{\sigma^2}}} \end{aligned} \]
In this case let's set the hyperparameters for \(\mu\) as:
\[ \begin{aligned} & \mu_\mu = 0 \\ & \sigma_\mu = 10 \\ & \mu \sim N(\mu_\mu, \sigma_\mu) \\ & \sigma = 1 \\ & X \sim N(\mu, \sigma ) \end{aligned} \]
prior_mu <- 0
prior_sigma <- 10

mu_mu_conj <- (prior_mu / prior_sigma^2 + sum(x) / sigma^2) / (1 / prior_sigma^2 + N / sigma^2) # posterior mean for mu
mu_sigma_conj <- sqrt(1 / (1 / prior_sigma^2 + N / sigma^2)) # posterior sd for mu

estimates <- data.frame(method = "Conjugate", mu_mu = mu_mu_conj, mu_sigma = mu_sigma_conj)
knitr::kable(estimates)
method | mu_mu | mu_sigma |
---|---|---|
Conjugate | 1.900653 | 0.1290887 |
This gives us an estimate of the mean of \(\mu\) of about 1.90 with a standard deviation of about 0.13, so our posterior \(p(\mu|x)\) is distributed:
\[ \begin{aligned} & \mu_{\mu|x} = 1.90 \\ & \sigma_{\mu|x} = 0.13 \\ & \mu \sim Normal(\mu_{\mu|x}, \sigma_{\mu|x}) \end{aligned} \]
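As a sanity check on the algebra, the conjugate result can be compared against a brute-force numerical posterior obtained by evaluating \(p(x|\mu)p(\mu)\) on a fine grid of \(\mu\) values and normalizing. This is a minimal sketch for verification only; the grid bounds and spacing are arbitrary choices, not part of the original analysis.

# brute-force check: evaluate the unnormalized posterior over a grid of mu values
mu_grid <- seq(-5, 5, by = 0.001)
log_post <- sapply(mu_grid, function(m) dnorm(m, prior_mu, prior_sigma, log = TRUE) + sum(dnorm(x, m, sigma, log = TRUE)))
post <- exp(log_post - max(log_post)) # subtract the max before exponentiating for numerical stability
post <- post / sum(post) # normalize over the grid
grid_mean <- sum(mu_grid * post) # should be close to mu_mu_conj
grid_sd <- sqrt(sum((mu_grid - grid_mean)^2 * post)) # should be close to mu_sigma_conj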
MCMC
In some cases we may choose a prior for \(\mu\) for which we are unable or unwilling to solve for \(p(x)\) analytically. In that case we cannot analytically recover the posterior \(p(\mu|x)\); however, using Markov Chain Monte Carlo (MCMC) we can sample from the posterior.
If we use the same prior distribution and parameters as before (\(\mu_\mu=0,\sigma_\mu=10\)), this means we wish to draw samples that are normally distributed with mean 1.90 and standard deviation 0.13 without actually knowing those parameters.
There are different methods for performing this sampling (Gibbs, Hamiltonian, Metropolis-Hastings), but the basic idea behind all of them is that we generate a series of random samples (Monte Carlo), with each sample being dependent on the previous one (Markov chain). By being particular about which samples in this series we keep, we can shape an arbitrary proposal distribution into producing samples that match our posterior distribution.
Here we will use one of the simplest MCMC sampling methods, Metropolis-Hastings, which requires two things.
- A proposal distribution. This is an arbitrary distribution which we use to draw samples from.
For this we will use a normal distribution centered at the previous sample location (Markov Chain) with a standard deviation of 1.
\[ \mu_1 \sim Normal(\mu_0, 1) \]
# we need a proposal distribution
# very tiny jumps == high acceptance rate but slow mixing
jump_sigma <- 1
rproposal <- function(mu_) rnorm(1, mu_, jump_sigma)
The proposal distribution can be any arbitrary distribution we want. The choice of a symmetric distribution, where \(p(\mu_1|\mu_0)=p(\mu_0|\mu_1)\), simplifies the acceptance rule for us. The choice of a standard deviation of 1 is also fairly arbitrary; in theory it will not affect the results, but in practice it influences how quickly the Markov chain explores the posterior (a larger sd will jump to locations farther away in the posterior) as well as our overall acceptance probabilities (a larger sd will result in lower acceptance probabilities), both of which influence how long we need to run our chain.
- An acceptance rule. This is a rule dependent on the previous sample and current sample to determine whether or not to keep the current sample.
For Metropolis-Hastings with a symmetric proposal distribution, this rule is the ratio of the posterior likelihood given the proposed \(\mu\) to the posterior likelihood given the previous \(\mu\).
\[ p(Accept) = \frac{p(\mu_1|x)}{p(\mu_0|x)} = \frac{p(x|\mu_1)p(\mu_1|\mu_\mu,\sigma_\mu)}{p(x|\mu_0)p(\mu_0|\mu_\mu,\sigma_\mu)} \]
pacceptance <- function(mu1, mu0) {
  # calculations done in log for numerical stability (prevent underflows)
  log_likelihood1 <- dnorm(mu1, prior_mu, prior_sigma, log = TRUE) + sum(dnorm(x, mu1, sigma, log = TRUE))
  log_likelihood0 <- dnorm(mu0, prior_mu, prior_sigma, log = TRUE) + sum(dnorm(x, mu0, sigma, log = TRUE))
  exp(log_likelihood1 - log_likelihood0)
}
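Before running the full chain, it can help to evaluate the rule for a couple of hand-picked proposals. This is purely illustrative and not part of the algorithm: a move toward the bulk of the data gives a very large ratio and is always accepted, while a move away from it gives a tiny ratio and is almost always rejected.

pacceptance(mean(x), 0) # moving from 0 toward the data: very large ratio (may even print as Inf), always accepted
pacceptance(0, mean(x)) # moving from near the data back to 0: ratio near 0, almost always rejected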
Now that we have a proposal distribution and an acceptance rule we can run our MCMC algorithm.
mu0 <- rnorm(1, prior_mu, prior_sigma) # select initial value from the prior (can be done other ways)
mus <- c() # collect the markov chain values here

# run Metropolis-Hastings MCMC
iters <- 10000
for(i in 1:iters){
  mu1 <- rproposal(mu0) # propose new mu based on current mu
  acceptance_ratio <- pacceptance(mu1, mu0) # calculate acceptance probability

  # accept or reject with probability equal to acceptance_ratio
  u <- runif(1)
  mu0 <- ifelse(u <= acceptance_ratio, mu1, mu0) # if accepted, the proposal becomes the current mu for the next step
  mus <- c(mus, mu0) # add current sample value to chain
}

# drop warmup samples
mus <- mus[100:length(mus)]
By calculating summary statistics on the sampled values we can see if they match the analytical solution.
mu_mu_mcmc <- mean(mus)
mu_sigma_mcmc <- sd(mus)

estimates <- rbind(estimates, data.frame(method = "MCMC", mu_mu = mu_mu_mcmc, mu_sigma = mu_sigma_mcmc))
knitr::kable(estimates)
method | mu_mu | mu_sigma |
---|---|---|
Conjugate | 1.900653 | 0.1290887 |
MCMC | 1.890423 | 0.1298439 |
We see they are fairly close, showing that we were able to sample from the posterior distribution. Let's plot the sampled distribution against the true posterior.
x1 <- seq(1, 3, by = 0.01)
hist(mus, freq = FALSE, main = "Histogram of MCMC Draws")
lines(y = dnorm(x1, mu_mu_conj, mu_sigma_conj), x = x1)
legend("topleft", legend=c("Conjugate Posterior"), lty=c(1))
Variational Inference
Perhaps we are impatient with MCMC, so we would like to turn this into an optimization problem. We can do this using variational inference. In this scenario we specify an approximating posterior distribution \(q(\mu|x)\) along with the prior distribution. This approximate posterior can be any distribution we like, but our goal is to select a distribution that can match the true posterior as closely as possible. We then optimize the parameters of the approximate posterior to do this.
Because we are using conjugate priors we actually know that the true posterior is normally distributed, so first let’s pick the normal as our approximate distribution and see if we can properly recover the parameters. Next, we will try using a distribution that cannot so closely approximate the true normal posterior and see what happens.
With a normal distribution as our posterior we have two parameters to optimize for it, \(\mu_q\) and \(\sigma_q\).
\[ q(\mu|x) \sim Normal(\mu_q, \sigma_q) \]
q_z <- rnorm
In order for us to optimize these parameters we need to define our loss function. This is normally taken to be the KL divergence between the approximate posterior and the true posterior. With some work the KL divergence can be simplified to a form that we can actually calculate; in practice this is generally converted to the ELBO to be maximized. The details are left to the reader to explore further.
\[ \begin{aligned} & KL(q(\mu|x)||p(\mu|x)) = \int_\mu q(\mu|x) \log\frac{q(\mu|x)}{p(\mu|x)}d\mu = \\ & E[\log\frac{q(\mu|x)}{p(\mu|x)}] = \\ & E[\log q(\mu|x) - \log p(\mu, x) + \log p(x)] = \\ & E[\log q(\mu|x) - \log p(x|\mu) - \log p(\mu) + \log p(x)] \end{aligned} \]
The last term \(\log p(x)\) does not rely on \(q(\mu|x)\) and is therefore constant within the optimization problem, reducing the objective (up to that constant) to
\[ E[\log q(\mu|x) - \log p(x|\mu) - \log p(\mu)] \]
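For reference, this quantity is the negative of the ELBO mentioned above (up to the constant \(\log p(x)\)), so minimizing it is equivalent to maximizing the ELBO:
\[ \log p(x) = E_q[\log p(x|\mu) + \log p(\mu) - \log q(\mu|x)] + KL(q(\mu|x)||p(\mu|x)) \]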
Because this objective is an expectation, we can estimate it by Monte Carlo: sample from the approximate posterior and take the average of the resulting calculations.
n_draws <- 100 # number of draws for the expectation in KL divergence

kl_divergence <- function(posterior_mu, posterior_sigma){
  # take samples from approximate posterior distribution
  z <- q_z(n_draws, posterior_mu, posterior_sigma)
  # calculate elbo
  kl <- sapply(z, function(z_i){
    q_z <- dnorm(z_i, posterior_mu, posterior_sigma, log = TRUE)
    p_x_z <- sum(dnorm(x, z_i, sigma, log = TRUE))
    p_z <- dnorm(z_i, prior_mu, prior_sigma, log = TRUE)
    q_z - p_x_z - p_z
  })
  mean(kl) # take expectation
}
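As a quick sanity check before optimizing (illustrative only; the exact values are noisy because the expectation is estimated from random draws), the estimated divergence near the conjugate solution should be much smaller than at a poor guess.

kl_divergence(mu_mu_conj, mu_sigma_conj) # near the analytical posterior: small value
kl_divergence(-2, 5) # far from the posterior: much larger value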
All that is left is to optimize posterior_mu and posterior_sigma. Some form of gradient descent is generally used, but let's do a grid search over the parameter space so we can see how the KL divergence changes across it. We will calculate the KL divergence at each combination of proposed posterior_mu and posterior_sigma, select the combination that produces the lowest value, and plot it on the space of values.
# grid search over all combinations of mu_vals and sigma_vals
mu_vals <- seq(-2, 8, by = 0.1)
sigma_vals <- seq(0.1, 10, by = 0.1)
to_search <- expand.grid(mu_vals, sigma_vals)
to_search$kl <- 0
for(i in 1:nrow(to_search)){
  to_search[i, "kl"] <- kl_divergence(to_search[i, 1], to_search[i, 2]) # calculate value of kl divergence at this combination and save
}
to_search <- setNames(to_search, c("mu_mu", "mu_sigma", "kl"))

# find best value in grid and report
best_val <- to_search[which.min(to_search$kl), ]
estimates <- rbind(estimates, data.frame(method = "VI-Grid Search", mu_mu = best_val$mu_mu, mu_sigma = best_val$mu_sigma))
knitr::kable(estimates)
method | mu_mu | mu_sigma |
---|---|---|
Conjugate | 1.900653 | 0.1290887 |
MCMC | 1.890423 | 0.1298439 |
VI-Grid Search | 1.900000 | 0.1000000 |
From our grid search we see that we have once again come very close to the true posterior values.
to_search %>%
  ggplot(aes(x = mu_mu, y = mu_sigma, fill = kl)) +
  geom_tile() +
  geom_point(data = best_val, aes(x = mu_mu, y = mu_sigma), col = "red")
While a grid search is a useful way to visualize the shape of the parameter space, in practice it is generally too computationally intensive and requires very tight grid spacing to recover the optimal parameters.
Now let's perform a very simplified form of gradient descent. We will be performing coordinate descent, where we optimize each parameter independently. This makes the updates simple to perform, though in more complex parameter spaces it will not perform well.
Also, for simplicity, we will compute the gradient via finite differences. Again, this is not recommended in practice but is good enough for our purposes here.
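Concretely, each partial derivative is approximated with a forward difference and each parameter is updated on its own, where \(\eta\) is the step size (step_size in the code below):
\[ \frac{\partial KL}{\partial \mu_q} \approx \frac{KL(\mu_q + h, \sigma_q) - KL(\mu_q, \sigma_q)}{h}, \qquad \mu_q \leftarrow \mu_q - \eta\,\frac{\partial KL}{\partial \mu_q} \]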
# initialize the parameters
mu_mu_kl <- 0
mu_sigma_kl <- 7

# optimization settings
iters <- 1000
h <- 1e-1 # perturb the parameter by this much to estimate the gradient
step_size <- 1e-4

# track how the posterior parameters change over the iterations
posterior_steps <- as.data.frame(matrix(0, nrow = iters, ncol = 2)) %>% setNames(c("mu", "sigma"))
kls <- c() # collect the kl divergence values here

for(i in 1:iters){
  # step mu
  gradient1 <- (kl_divergence(mu_mu_kl + h, mu_sigma_kl) - kl_divergence(mu_mu_kl, mu_sigma_kl)) / h # finite difference for gradient
  mu_mu_kl <- mu_mu_kl - gradient1 * step_size

  # step sigma
  gradient2 <- (kl_divergence(mu_mu_kl, mu_sigma_kl + h) - kl_divergence(mu_mu_kl, mu_sigma_kl)) / h
  mu_sigma_kl <- mu_sigma_kl - gradient2 * step_size

  posterior_steps[i, ] <- c(mu_mu_kl, mu_sigma_kl)
  kls <- c(kls, kl_divergence(mu_mu_kl, mu_sigma_kl))
}
First we can inspect to make sure the KL divergence between the approximate and true posterior was reduced over iterations.
plot(kls, type = "l")
And let's see how our approximate distribution compares to the previous estimates.
<- rbind(estimates, data.frame(method = "VI-Gradient Descent", mu_mu = mu_mu_kl, mu_sigma = mu_sigma_kl))
estimates ::kable(estimates) knitr
method | mu_mu | mu_sigma |
---|---|---|
Conjugate | 1.900653 | 0.1290887 |
MCMC | 1.890423 | 0.1298439 |
VI-Grid Search | 1.900000 | 0.1000000 |
VI-Gradient Descent | 1.850746 | 0.0842574 |
Because we traced the posterior parameters as they evolved, we can plot the optimization trajectory over the parameter space.
to_search %>%
  ggplot() +
  geom_tile(aes(x = mu_mu, y = mu_sigma, fill = kl)) +
  geom_path(data = posterior_steps, aes(x = mu, y = sigma), col = "red")
Variational Inference Posterior Mismatch
In the previous section we had the fortune of working with a model whose true posterior distribution is known, so we could match the family of the approximate distribution to this known family, giving us accurate posterior parameter estimates. But we are not restricted to using the same family of distribution for the approximate distribution. In fact, when using variational methods we are often working with models for which we don't know the true posterior distribution. To see what happens when the approximate distribution family does not match the true posterior, we will use the same generative model as before but use an exponential distribution as the approximate distribution.
You will notice immediately that this approximate distribution is a poor fit, since it is bounded below at 0.
q_z <- rexp
n_draws <- 100 # number of draws for the expectation in KL divergence

kl_divergence <- function(posterior_mu){
  # take samples from approximate posterior distribution
  z <- q_z(n_draws, posterior_mu)
  # calculate elbo
  kl <- sapply(z, function(z_i){
    q_z <- dexp(z_i, posterior_mu, log = TRUE)
    p_x_z <- sum(dnorm(x, z_i, sigma, log = TRUE))
    p_z <- dnorm(z_i, prior_mu, prior_sigma, log = TRUE)
    q_z - p_x_z - p_z
  })
  mean(kl) # take expectation
}
The exponential distribution only has one parameter \(\beta\) that needs to be optimized. Because \(\beta>0\) we will optimize it in unconstrained space by taking the \(\log\) of it.
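Written out, the reparameterization is (matching the code below, where mu_beta_kl plays the role of \(\phi\), a label introduced here):
\[ \phi = \log\beta, \qquad \beta = e^{\phi} \]
Gradient steps are taken on \(\phi\), which can take any real value, so \(\beta = e^{\phi}\) can never become negative.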
# initialize the parameters
mu_beta_kl <- -1.5 # this is in log domain because beta is lower bounded at 0, need to exponentiate

# optimization settings
iters <- 2000
h <- 1e-2 # perturb the parameter by this much to estimate the gradient
step_size <- 1e-6

# track how the posterior parameters change over the iterations
posterior_steps <- as.data.frame(matrix(0, nrow = iters, ncol = 1)) %>% setNames(c("mu"))
kls <- c() # collect the kl divergence values here

for(i in 1:iters){
  # step mu
  gradient1 <- (kl_divergence(exp(mu_beta_kl + h)) - kl_divergence(exp(mu_beta_kl))) / h # finite difference for gradient
  mu_beta_kl <- mu_beta_kl - gradient1 * step_size

  posterior_steps[i, ] <- exp(mu_beta_kl)
  kls <- c(kls, kl_divergence(exp(mu_beta_kl)))
}
plot(kls, type = "l")
And let's see how our approximate distribution compares to the previous estimates.
<- rbind(estimates, data.frame(method = "VI-Mismatch", mu_mu = 1 / exp(mu_beta_kl), mu_sigma = NA))
estimates ::kable(estimates) knitr
method | mu_mu | mu_sigma |
---|---|---|
Conjugate | 1.900653 | 0.1290887 |
MCMC | 1.890423 | 0.1298439 |
VI-Grid Search | 1.900000 | 0.1000000 |
VI-Gradient Descent | 1.850746 | 0.0842574 |
VI-Mismatch | 1.580656 | NA |
x1 <- seq(-5, 10, by = 0.1)
plot(y = dexp(x1, 1 / exp(mu_beta_kl)), x = x1, type = "l", lty = 1)
lines(y = dnorm(x1, mu_mu_conj, mu_sigma_conj), x = x1, lty = 2)
legend("topright", legend=c("Approximate Posterior", "Conjugate Posterior"), lty=c(1, 2))
Latent Variable Variational Inference
In the previous section we used various methods to calculate or estimate the posterior distribution of a single, global parameter. In this section we continue using variational inference, now to learn latent parameters.
In this context a latent parameter is an unobserved variable associated with each observation, so rather than estimating one parameter we need to estimate at least as many parameters as there are data points.
For this we will generate data from a mixture of normal distributions.
\[ \begin{aligned} & Z \sim Categorical(0.5, 0.5) \\ & \mu_1 = -1 \\ & \mu_2 = 1 \\ & \sigma = 1 \\ & Y \sim Normal(\mu_z, \sigma | Z=z) \end{aligned} \]
N_ <- 60
N <- N_ + N_

mus <- c(-1, 1)
sigma <- 1

y_1 <- rnorm(N_, mus[1], sigma)
y_2 <- rnorm(N_, mus[2], sigma)
y <- c(y_1, y_2)

y_df <- data.frame(class = as.factor(c(rep(1, N_), rep(2, N_))), y = y)
y_df %>%
  ggplot(aes(x = y, fill = class, group = class)) +
  geom_histogram(position = "identity", alpha = 0.5)
For our model let us again assume that \(\sigma\) is known, so we must find the global \(\mu\) parameters for each class, as well as the latent class parameter for each data point.
Our joint model is then
\[ p(y,z) = \prod_N{p(y|\mu_z,\sigma)p(z)} \]
Now that we have our data let us specify our posterior distribution and our KL divergence loss metric.
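To make the approximating family explicit (this factorization is implied by the code below rather than stated earlier; \(\pi_n\) denotes the per-observation class probabilities and \(m_k, s_k\) the variational parameters for each component mean):
\[ q(z, \mu) = \prod_{n=1}^{N} Categorical(z_n|\pi_n) \prod_{k=1}^{2} Normal(\mu_k|m_k, s_k) \]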
prior_prob <- c(0.5, 0.5)
prior_mu <- 0
prior_sigma <- 10
ndraws <- 100

kl_divergence <- function(y, posterior_vals, mu_mu_params, mu_sigma_params, sigma_params){
  # calculate elbo
  kl <- sapply(1:ndraws, function(i){
    # take samples from approximate distributions
    z <- ifelse(runif(length(y)) <= posterior_vals[, 1], 1, 2)
    mu_params <- rnorm(2, mu_mu_params, mu_sigma_params)
    # calculate KL
    q_z <- sum(log(posterior_vals[cbind(1:length(y), z)])) # approximate posterior
    p_y_z <- sum(dnorm(y, mu_params[z], sigma_params[z], log = TRUE)) # likelihood
    p_z <- sum(log(prior_prob[z])) # class prior
    p_mu <- sum(dnorm(mu_params, prior_mu, prior_sigma, log = TRUE)) # mu prior
    q_z - p_y_z - p_z - p_mu
  })
  mean(kl)
}
# initialize parameters
posterior_vals <- matrix(0.5, nrow = length(y), ncol = 2)
mu_mu_params <- c(-3, 3)
mu_sigma_params <- log(c(1, 1)) # take log to put parameters in unconstrained space
sigma_params <- c(1, 1)

iters <- 300
# perturbation for posterior z
h <- 1
# perturbation for mu values
h_ <- 1e-1
h1 <- c(1, 0)
h2 <- c(0, 1)

step_size <- 1e-1
mu_step_size <- 1e-4
sigma_step_size <- 1e-4

posterior_steps <- as.data.frame(matrix(0, nrow = iters, ncol = length(y)))
mu_mu_steps <- as.data.frame(matrix(0, nrow = iters, ncol = 2)) %>% setNames(c("mu1", "mu2"))
mu_sigma_steps <- as.data.frame(matrix(0, nrow = iters, ncol = 2)) %>% setNames(c("sigma1", "sigma2"))
kls <- c()

for(i in 1:iters){
  # expectation
  # convert p to log odds (unconstrained space) before perturbing to stay in (0-1)
  offset_vals <- 1 / (1 + exp(-log(posterior_vals[, 1] / posterior_vals[, 2]) + h))
  offset_vals <- cbind(offset_vals, 1 - offset_vals)
  for(j in 1:N){
    # get gradient
    gradient <- (kl_divergence(y[j], offset_vals[j, , drop = FALSE], mu_mu_params, exp(mu_sigma_params), sigma_params) -
      kl_divergence(y[j], posterior_vals[j, , drop = FALSE], mu_mu_params, exp(mu_sigma_params), sigma_params)) / h
    # step
    posterior_vals[j, 1] <- 1 / (1 + exp(-log(posterior_vals[j, 1] / posterior_vals[j, 2]) - gradient * step_size))
    posterior_vals[j, 2] <- 1 - posterior_vals[j, 1]
  }

  # maximization
  # step mu1 params
  gradient11 <- (kl_divergence(y, posterior_vals, mu_mu_params + h_ * h1, exp(mu_sigma_params), sigma_params) -
    kl_divergence(y, posterior_vals, mu_mu_params, exp(mu_sigma_params), sigma_params)) / h_
  mu_mu_params <- mu_mu_params - h1 * gradient11 * mu_step_size
  gradient12 <- (kl_divergence(y, posterior_vals, mu_mu_params, exp(mu_sigma_params + h_ * h1), sigma_params) -
    kl_divergence(y, posterior_vals, mu_mu_params, exp(mu_sigma_params), sigma_params)) / h_
  mu_sigma_params <- mu_sigma_params - h1 * gradient12 * sigma_step_size

  # step mu2 params
  gradient21 <- (kl_divergence(y, posterior_vals, mu_mu_params + h_ * h2, exp(mu_sigma_params), sigma_params) -
    kl_divergence(y, posterior_vals, mu_mu_params, exp(mu_sigma_params), sigma_params)) / h_
  mu_mu_params <- mu_mu_params - h2 * gradient21 * mu_step_size
  gradient22 <- (kl_divergence(y, posterior_vals, mu_mu_params, exp(mu_sigma_params + h_ * h2), sigma_params) -
    kl_divergence(y, posterior_vals, mu_mu_params, exp(mu_sigma_params), sigma_params)) / h_
  mu_sigma_params <- mu_sigma_params - h2 * gradient22 * sigma_step_size

  # save intermediate values for later
  posterior_steps[i, ] <- posterior_vals[, 2]
  mu_mu_steps[i, ] <- mu_mu_params
  mu_sigma_steps[i, ] <- exp(mu_sigma_params)
  kl <- kl_divergence(y, posterior_vals, mu_mu_params, exp(mu_sigma_params), sigma_params)
  kls <- c(kls, kl)
}
Inspect KL divergence over training iterations.
plot(kls, type = "l")
steps_df <- posterior_steps %>%
  mutate(ix = row_number()) %>%
  pivot_longer(-ix, names_to = "dim", values_to = "value") %>%
  mutate(
    class = as.factor(rep(c(rep(1, N_), rep(2, N_)), iters)),
    y = rep(y, iters)
  )

# these steps convert the posterior parameters for mu into a density curve for the plots
mu_df <- mu_mu_steps %>%
  mutate(ix = row_number())
sigma_df <- mu_sigma_steps %>%
  mutate(ix = row_number())
mu_sigma_df <- mu_df %>%
  inner_join(sigma_df, by = "ix") %>%
  group_by(ix) %>%
  nest() %>%
  mutate(
    density = map(data, ~{
      x = seq(min(y), max(y), by = 0.01)
      data.frame(
        x = x,
        mu1 = dnorm(x, .x$mu1, .x$sigma1),
        mu2 = dnorm(x, .x$mu2, .x$sigma2)
      )
    })
  ) %>%
  unnest(density) %>%
  select(-data) %>%
  pivot_longer(starts_with("mu")) %>%
  mutate(value = value / max(value))

learning_plot <- ggplot(steps_df, aes(y = value, x = y, color = class)) +
  geom_point() +
  geom_line(data = mu_sigma_df, aes(x = x, color = name, y = value)) +
  ggtitle("Posterior Probability of Class 2") +
  ylab("p(c=2)") +
  transition_manual(ix)
anim_save("latent_learning.gif", learning_plot)
knitr::include_graphics("latent_learning.gif")
Amortized Variational Inference
Amortized variational inference is really where VI is set free and we can start utilizing deep learning methods within Bayesian models. This is the method that enables variational auto-encoders.
Amortized inference is motivated by the fact that in complex latent-variable models the number of parameters that needs to be estimated grows with the size of the dataset. For very large datasets even variational methods will be slow. The amortization idea is to replace the per-observation posterior parameters with a single function (here, a small "encoder" network) that maps each observation to its posterior parameters, so the number of learned parameters stays fixed regardless of the dataset size.
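In symbols (notation introduced here to describe the code that follows; \(f_\phi\) is the encoder with parameters \(\phi\)), each local posterior is computed from the data rather than stored:
\[ q(z_n|y_n) = Categorical\big(f_\phi(y_n)\big) \]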
# encoder network, this gives us our posterior distribution for z
categorical_network <- function(x, params){
  z <- x %*% params
  z <- exp(z) / rowSums(exp(z))
  z
}

# decoder network, this gives us our likelihood parameters given z
normal_network <- function(z, params){
  x <- z %*% params
  x[, 2] <- exp(x[, 2])
  x
}
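A quick shape check (illustrative only; the zero matrices here are placeholders, and the real parameters are randomly initialized below): the encoder maps the \(N\) observations to an \(N \times 2\) matrix of class probabilities, and the decoder maps a one-hot class to a (mean, sd) pair.

toy_enc <- matrix(0, nrow = 1, ncol = 2) # placeholder encoder weights
toy_dec <- matrix(0, nrow = 2, ncol = 2) # placeholder decoder weights
dim(categorical_network(y, toy_enc)) # N x 2, each row sums to 1
normal_network(diag(2), toy_dec) # one (mean, sd) row per class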
prior_prob <- c(0.5, 0.5)
ndraws <- 100

kl_divergence <- function(encoder_params, decoder_params){
  posterior_vals <- categorical_network(y, encoder_params)
  # calculate elbo
  kl <- sapply(1:ndraws, function(i){
    # take samples from approximate distribution
    z <- ifelse(runif(length(y)) <= posterior_vals[, 1], 1, 2)
    # convert to one hot encoding
    z_hot <- matrix(0, nrow = length(y), ncol = 2)
    z_hot[cbind(1:length(y), z)] <- 1

    # get likelihood parameters given z
    y_params <- normal_network(z_hot, decoder_params)

    # calculate KL
    q_z <- sum(log(posterior_vals[cbind(1:length(y), z)]))
    p_y_z <- sum(dnorm(y, y_params[, 1], y_params[, 2], log = TRUE))
    p_z <- sum(log(prior_prob[z]))
    q_z - p_y_z - p_z
  })

  mean(kl)
}
# randomly initialize parameters
encoder_params <- matrix(rnorm(2, 0, 0.01), nrow = 1)
decoder_params <- matrix(rnorm(4, 0, 0.01), nrow = 2)

iters <- 100
# perturbation parameters
h <- 1e-1
h11 <- c(1, 0)
h12 <- c(0, 1)
h21 <- matrix(c(1, 0, 0, 0), nrow = 2, byrow = TRUE)
h22 <- matrix(c(0, 1, 0, 0), nrow = 2, byrow = TRUE)
h23 <- matrix(c(0, 0, 1, 0), nrow = 2, byrow = TRUE)
h24 <- matrix(c(0, 0, 0, 1), nrow = 2, byrow = TRUE)
step_size <- 1e-2

# data structures for saving intermediate values
posterior_steps <- as.data.frame(matrix(0, nrow = iters, ncol = N))
mu_mu_steps <- as.data.frame(matrix(0, nrow = iters, ncol = 2)) %>% setNames(c("mu1", "mu2"))
mu_sigma_steps <- as.data.frame(matrix(0, nrow = iters, ncol = 2)) %>% setNames(c("sigma1", "sigma2"))
kls <- c()

for(i in 1:iters){
  # expectation
  # optimize encoder
  # get gradients and step
  gradient1 <- (kl_divergence(encoder_params + h * h11, decoder_params) -
    kl_divergence(encoder_params, decoder_params)) / h
  encoder_params <- encoder_params - h11 * gradient1 * step_size

  gradient2 <- (kl_divergence(encoder_params + h * h12, decoder_params) -
    kl_divergence(encoder_params, decoder_params)) / h
  encoder_params <- encoder_params - h12 * gradient2 * step_size

  # maximization
  # optimize decoder
  # get gradients and step
  gradient3 <- (kl_divergence(encoder_params, decoder_params + h * h21) -
    kl_divergence(encoder_params, decoder_params)) / h
  decoder_params <- decoder_params - h21 * gradient3 * step_size

  gradient4 <- (kl_divergence(encoder_params, decoder_params + h * h22) -
    kl_divergence(encoder_params, decoder_params)) / h
  decoder_params <- decoder_params - h22 * gradient4 * step_size

  gradient5 <- (kl_divergence(encoder_params, decoder_params + h * h23) -
    kl_divergence(encoder_params, decoder_params)) / h
  decoder_params <- decoder_params - h23 * gradient5 * step_size

  gradient6 <- (kl_divergence(encoder_params, decoder_params + h * h24) -
    kl_divergence(encoder_params, decoder_params)) / h
  decoder_params <- decoder_params - h24 * gradient6 * step_size

  # save intermediate values for later
  posterior_steps[i, ] <- categorical_network(y, encoder_params)[, 1]
  mu_mu_steps[i, ] <- normal_network(matrix(c(1, 0, 0, 1), nrow = 2), decoder_params)[, 1]
  mu_sigma_steps[i, ] <- normal_network(matrix(c(1, 0, 0, 1), nrow = 2), decoder_params)[, 2]
  kls <- c(kls, kl_divergence(encoder_params, decoder_params))
}
plot(kls, type = "l")
steps_df <- posterior_steps %>%
  mutate(ix = row_number()) %>%
  pivot_longer(-ix, names_to = "dim", values_to = "value") %>%
  mutate(
    class = as.factor(rep(c(rep(1, N_), rep(2, N_)), iters)),
    y = rep(y, iters)
  )

# these steps convert the posterior parameters for mu into a density curve for the plots
mu_df <- mu_mu_steps %>%
  mutate(ix = row_number())
sigma_df <- mu_sigma_steps %>%
  mutate(ix = row_number())
mu_sigma_df <- mu_df %>%
  inner_join(sigma_df, by = "ix") %>%
  group_by(ix) %>%
  nest() %>%
  mutate(
    density = map(data, ~{
      x = seq(min(y), max(y), by = 0.01)
      data.frame(
        x = x,
        mu1 = dnorm(x, .x$mu1, .x$sigma1),
        mu2 = dnorm(x, .x$mu2, .x$sigma2)
      )
    })
  ) %>%
  unnest(density) %>%
  select(-data) %>%
  pivot_longer(starts_with("mu")) %>%
  mutate(value = value / max(value))

learning_plot <- ggplot(steps_df, aes(y = value, x = y, color = class)) +
  geom_point() +
  geom_line(data = mu_sigma_df, aes(x = x, color = name, y = value)) +
  ggtitle("Posterior Probability of Class 2") +
  ylab("p(c=2)") +
  transition_manual(ix)
anim_save("amortized_latent_learning.gif", learning_plot)
knitr::include_graphics("amortized_latent_learning.gif")