## From monthly return rate to importance sampling to path sampling to the second law of thermodynamics to metastable sampling in Stan

(This post is by Yuling, not Andrew, except many ideas are originated from Andrew.)

This post is intended to advertise our new preprint Adaptive Path Sampling in Metastable Posterior Distributions  by Collin, Aki, Andrew and me, where we developed an automated implementation of path sampling and adaptive continuous tempering. But I have been recently reading a writing book and I am convinced that an advertisement post should always start by stories to hook the audience, so let’s pretend I am starting this post from the next paragraph with an example.

#### Some elementary math about geometric means

To start, one day I was computing my portfolio return, but I was doing that in Excel, in which the only function I knew is mean and sum. So I would simply calculate the average monthly return, and multiply it by 12.

Of course you do not really open your 401K account now. We can simulate data in R too, which amounts to:

month_return=rnorm(60,.01,0.04)
print(mean(month_return)*12)


Sure, it is not the “annualized return”. A 50% gain followed by a 50% loss in the next month results in 1.5* 0.5 = 0.75, or a 25% loss. Jensen’s inequality ensures that the geometric mean is always smaller than the arithmetic mean.

exp(mean(log(1+ month_return)))-1- mean(month_return)

That said, this difference is small since the monthly return itself is typically close enough to zero. But even to annualize these two estimates to a whole year still yields a nearly identical annualized return value:

exp(mean(log(1+month_return)))^12-1 - mean(month_return)*12

Intuitively, the arithmetic mean is larger per unit, but is also offset by lack of the compound interest.
It is not anything surprising beyond elementary math. The first order Taylor series expansion says
$latex \log (1+ x) \approx x,$
for any small x. And the geometric mean of x is really the arithmetic mean of log (1+ x). Indeed in the limit of per second return, by integrating both sides, these two approaches are just identical. When per unit change of x is not smooth enough (i.e. too big gap), the second order term -.5 sd(x) will kick in via Itô calculus, but that is irreverent to what I will discuss next. Also the implicit assumption here is that x>-1 otherwise log(1+x) becomes invalid, so this asymptotical equivalence will also fail if the account is liquidated, but that is more irreverent to what I will discuss.

#### The bridge between Taylor series expansion and importance sampling

Forget about finance, let’s focus on the normalizing constant in statistical computing. In many statistical problems, including marginal likelihood computation and marginal density/moment estimation in MCMC, we are given an unnormalized density $latex q(\theta, \lambda)$, where $latex \theta \in \Theta$ is a multidimensional sampling parameter and $latex \lambda \in \Lambda$ is a free parameter, and we need to evaluate the integrals $latex z(\lambda)$=$latex \int_{\Theta} q(\theta, \lambda) d \theta$, for any $latex \lambda \in \Lambda.$

Here z(.) is a function of $latex \lambda$. For convenience let’s still call z() the normalizing constant without mentioning function.

Two accessible but conceptually orthogonal approaches stand out for the normalizing constant. Viewing it as the expectation with respect to the conditional density $latex \theta|\lambda \propto q(\theta, \lambda)$, we can numerically integrate it using quadrature, where the simplest is linearly interpolation, and first order Taylor series expansion yields
$latex \log\frac{z(\lambda)}{z(\lambda_0)}$ $latex \approx (\lambda – \lambda_0) \frac{1}{ z(\lambda_0)}\int_\Theta (\frac {d}{d \lambda }q(\theta, \lambda) | {\lambda=\lambda_0} )d \theta.$
In contrast, we can sample from the conditional density $latex \theta| \lambda_0 \propto q(\theta, \lambda_0)$, and apply importance sampling,
$latex \frac{z(\lambda)}{z(\lambda_0)}\approx\frac{1}{S}\sum_{s=1}^S$ $latex \frac{ q (\theta_s, \lambda)}{q(\theta_s, \lambda_0)}$, $latex \theta_{s=1, \cdots, S} \sim q (\theta, \lambda_0).$
Due to the same reason that the annualized arithmetic return equivalents the geometric return, the first order Taylor series expansion and importance sampling reach the same first order limit when the proposal is infinitely close to the target. That is, for any fixed $latex \lambda_0$, as $latex \delta=|\lambda_1 – \lambda_0|\to 0$,
$latex \frac{1}{\delta}\log E_{\lambda_{0}}\left[\frac{q(\theta|\lambda_{1})}{q(\theta|\lambda_0)}\right]$ = $latex \int_{\Theta} \frac{\partial}{\partial \lambda} \log q(\theta|\lambda_{0}) p(\theta|\lambda_0)d\theta + o(1)= \frac{1}{\delta}E_{\lambda_{0}}\left[ \log \frac{q(\theta|\lambda_{1})} {q(\theta|\lambda_0)}\right].$

#### Path sampling

The path sampling estimate is just the integral of the dominate term in the middle of the equation above:
$latex \frac{z(\lambda_1)}{z(\lambda_0)}\approx\int_{\lambda_0}^{\lambda_1}\int_{\Theta} \frac{\partial}{\partial \lambda} \log q(\theta|\lambda_{l}) p(\theta|\lambda_l)d\theta d \lambda.$
In such sense, the path sampling is the continuous limit of both the importance sampling and the Taylor expansion approach.

More generally, if we draw $latex (a, \theta)$ sample from the joint density
$latex p(a,\theta)\propto\frac{1}{c(a)}q(\theta,\lambda=f(a)),$
where $latex c()$ is any pseudo prior, and $latex a$ is a transformed parameter of lambda through a link function f(), then the thermodynamic integration (Gelman and Meng, 1998, indeed dated back to Kirkwood, 1935) yields the identity
$latex \frac{d}{da}\log z(f(a))=E_{\theta|f(a)}\Bigl[\frac{\partial}{\partial a}\log\left(q(\theta, f(a)\right)\Bigr],$
where the expectation is over the invariant conditional distribution $latex \theta | a \propto q(\theta, f(a))$.

If a is continuous, this conditional $latex \theta|f(a)$ is hard to evaluate, as typically there is only one theta for each unique $latex a$. However, it is still a valid stochastic approximation to approximate
$latex E_{\theta|f(a_i)}\Bigl[\frac{\partial}{\partial a}\log\left(q(\theta,f(a)\right)\Bigr]\approx\frac{\partial}{\partial a}\log\left(q(\theta_i, f(a_i)\right)$
by one Monte Carlo draw.

#### Continuously simulated tempering on an isentropic circle

Simulated tempering and its variants provide an accessible approach to sampling from a multimodal distribution. We augment the state space $latex \Theta$ with an auxiliary inverse temperature parameter $latex \lambda$, and employ a sequence of interpolating densities, typically through a power transformation $latex p_j\propto p(\theta|y)^{\lambda_j}$ on the ladder 0=$latex \lambda_0 \leq \cdots\leq \lambda_K=1$,
such that $latex p_K$ is the distribution we want to sample from and $latex p_0$ is a (proper) base distribution. At a smaller $latex \lambda$, the between-mode energy barriers in $latex p(\theta|\lambda)$ collapse and the Markov chains are easier to mix. This dynamic makes the sampler more likely to fully explore the target distribution at $latex \lambda=1$. However, it is often required to scale the number of interpolating densities K at least O(parameter dimension), soon becoming unaffordable in any moderately high dimensional problems.

A few years ago, Andrew came up with this idea to use path sampling in continuous simulated tempering (I remember he did so during one BDA class when he was teaching simulated temperimg). In this new paper, we present an automated way to conduct continuously simulated tempering and adaptive path sampling.

The basic strategy is to augment the target density $latex q(\theta)$, potentially multimodal, by a tempered path
$latex 1/c(\lambda)\psi(\theta)^{1-\lambda}q(\theta)^{\lambda}$ where $latex \psi$ is some base measurement. Here the temperature lambda is continuous in [0,1], so as to adapt to regions where the conditional density changes rapidly with the temperature (which might be missed by discrete tempering).

Directly sampling from theta and lambda makes it hard to access the samples from the target density (as Pr(lambda=1)=0 in any continuous densities). Hence we further transform theta into a transformed $latex a$ using a piecewise polynomial link function $latex \lambda=f(a)$ like this,

An $latex a$-trajectory from 0 to 2 corresponds to a complete $latex \lambda$ tour from 0 to 1 (cooling) and back down to 0 (heating). This formulation allows the sampler to cycle back and forth through the space of lambda continuously, while ensuring that some of the simulation draws (those with a between 0.8 and 1.2) are drawn from the exact target distribution with $latex \lambda=1$. The actual sampling takes place in $latex a\in [0,2]\times\theta\in\Theta$. It has a continuous density and is readily implemented in Stan.

We then use path sampling compute the log normalizing constant of this system, and adaptive modify the a-margin to ensure a complete cooling-heating cycle.

Notable, in discrete simulated tempering and annealed importance sampling, we typically compute the marginal density and the normalization constant
$latex \frac{z_{\lambda_K}}{z_{\lambda_0}}=E\left[\exp\mathcal{W}(\theta_0,\dots,\theta_{K-1})\right],\mathcal{W}(\theta_0,\dots,\theta_{K-1})=\log\prod_{j=0}^{k-1}\frac{q(\theta_j,\lambda_{j+1})}{q(\theta_j,\lambda_{j})}.$
In statistical physics, the $latex \mathcal{W}$ quantity can be interpreted as virtual work induced on the system. The same Jensen’s inequality that has appeared twice above leads to $latex E \mathcal{W} \geq \log z(\lambda_K)- \log z(\lambda_0)$. This is a microscopic analogy to the second law of thermodynamics: the sum of work entered in the system is always larger than the free energy change, unless the switching is processed infinitely slow, which corresponds to the proposed path sampling scheme where the interpolating densities are infinitely dense. A physics minded might call this procedure a reversible-adiabatic/isentropic/quasistatic process—It is the conjunction of path sampling estimation (unbiased for free energy) and the continuous tempering (infinitely many interpolating states) that  preserves the entropy.

This asymptotic equivalence between importance sampling based tempering and continuous tempering is again the same math behind the annualize return example. The log (1+ monthly return) term now corresponds the free energy (log normalization constant) difference between two adjacent temperatures, which are essentially dense in our scheme.

#### Practical implementation in Stan

Well, if the blog post is an ideal format for writing math equations, there would not be journal articles, the same reason that MCMC will not exist if importance sampling scales well. Details of our methods are better summarized in the paper. We also provide an easy access to continuous tempering in R and Stan for a black box implementation of path sampling based continuous tempering.
Consider the following Stan model:

data {
real y;
}
parameters {
real theta;
}
model{
y ~ cauchy(theta, 0.1);
-y ~ cauchy(theta, 0.1);
}

With a moderately large input data y, the posterior distribution of theta will be only be asymptotically changed at two points close to y and -y. As a result, Stan cannot fully sample from this two-point-spike even with a large number of iterations.

To run continuous tempering, a user can specify any base model, say normal(0,5), and list it in an alternative model block as if it is a regular model.

model{ // keep the original model
y ~ cauchy(theta,0.1);
-y ~ cauchy(theta,0.1);
}
alternative model{ // add a new block of the base measure (e.g., the prior).
theta ~ normal(0,5);
}

Save this as cauchy.stan. To run path sampling, simply run

devtools::install_github("yao-yl/path-tempering/package/pathtemp")
library(pathtemp)
update_model <- stan_model("solve_tempering.stan")
#https://github.com/yao-yl/path-tempering/blob/master/solve_tempering.stan
file_new <- code_temperature_augment("cauchy.stan")
sampling_model <- stan_model(file_new) # compile
path_sample_fit <- path_sample(data=list(gap=10), # the data list in original model,
sampling_model=sampling_model,
N_loop=5, iter_final=6000)

The returned value provides access to the posterior draws from the target density and base density, the join path in the final adaptation, and the estimated log normalizing constant.

 sim_cauchy <- extract(path_sample_fit$fit_main) in_target <- sim_cauchy$lambda==1
in_prior <- sim_cauchy$lambda==0 # sample from the target hist(sim_cauchy$theta[in_target])
# sample from the base
hist(sim_cauchy$theta[in_prior]) # the joint "path" plot(sim_cauchy$a, sim_cauchy$theta) # the normalizing constant plot(g_lambda(path_sample_fit$path_post_a), path_sample_fit\$path_post_z)


Here is the output.

Have fun playing with code!

Let me close with two caveats. First, we don’t think tempering can solve all metastability issues in MCMC. Tempering imposes limitations on dimensions and problem scales. We provide a failure mode example where the proposed tempering scheme (or any other tempering methods) fails. Adaptive path sampling comes with an importance-sampling-theory-based diagnosis that makes this failure manifest. That said, in many cases we find this new path-sampling-adapted-continuous-tempering approach performs better than existing methods for metastable sampling. Second, posterior multimodality often betokens model misspecification (we discussed this in our previous paper on chain-stacking). The ultimate goal is not to stop at the tempered posterior simulation draws, but to use them to check and improve the model in a workflow.

1. Phil says:

I understand just a little bit of this, but not as much as I would like.

I’m hoping someone (perhaps Yuling) can write a brief summary that describes: (1) what problem is being solved, and (2) how.

I know that’s what the post itself is supposed to do, but I need a simpler version, if someone is willing to write that.

• Yuling Yao says:

The tl;nr version is we use adaptive path sampling to compute the log normalizing constant in simulated tempering. This enables a continuous sampler (Stan) to sample from multimodal density.

• Phil says:

Yuling,
Thanks, I did understand that part. It’s not a case of tl;dr, it’s ts;di (too stupid, didn’t understand). Maybe I simply need more background knowledge than I have, for example I am very familiar with simulated annealing but had never heard of simulated tempering.

And although I am extremely familiar with multi-modality I can’t think of a case in which I would simulate y and -y as being drawn from the same distribution. Probably this is mathematically equivalent to the way I’m used to thinking about this stuff, but that equivalence is not at all intuitive to me.

Also, I missed how (if at all) the bit about portfolio return is relevant to the rest of the post.

I do have multi-modal models I am interested in fitting but cannot currently fit, so I would like to understand this better, but I need a much more elementary introduction. Can you suggest a place to start? Maybe just Gelman and Meng 1998?

• Yuling Yao says:

Phil, I always wish I could be a better writer! I made this post too complicated by trying to draw some deep connection between importance sampling and Taylor series expansion, which used to bother me a lot. And that is why I start by a portfolio return example.
Yes, Gelman and Meng (1998) covers most background of relevant methods.
Annealing and Tempering can often refer to very similar ideas. Depends on whether there is extra metropolis hastings adjustment, they can be used in both optimization (to find the global mode) and sampling (to draw from all modes).
y and -y are just a lazy way to write y=(10, -10).

2. Christian says:

Great to see these adaptive sampling methodologies being discussed in the literature on statistics. My personal preference would evidently be to discuss them starting from:

Bartels, C. & Karplus, M. (1997). Multidimensional Adaptive Umbrella Sampling: Applications to Mainchain and Sidechain Peptide Conformations. J. Comp. Chem. 18, 1450–1462.

Bartels, C., Widmer, A. & Ehrhardt, C. (2005). Absolute Free Energies of Binding of Peptide Analogs to the HIV-1 Protease from Molecular Dynamics Simulations. J. Comput. Chem. 26, 1294-1305.

Interesting also to see how these approaches have evolved for a long time in parallel in biophysics and stats.

Cheers

Christian

• Yuling Yao says:

Hi Christian, yes there is always a great need of communications among researchers, especially from different fields. It is unfortunate that people also use different vocabulary and it is hard for google to automatically translate “free energy” into “Bayes factor”.

3. Possibly dumb question: what is this alternative model block? Doesn’t seem to exist anywhere in the current documentation. Is that a real thing or just a shorthand for something else?

• Yuling Yao says:

It is a new feature in this path sampling framework. Currently the underlying stan cannot read it, so we use an R interface to first convert this new syntax into a workable stan program. But it can be of general interest to fit multiple model blocks all at once even without computation concerns.

4. I’ve done a similar sort of thing in the past, where I sample from the unnormalized density:

p(q)^(1/t)

and t = 1 + (tmax-1) * inv_logit(l)

and l is given a prior normal(m,s)

Let’s unpack that a little… by adjusting m,s we can force l to stay in a certain region… if that region is entirely much less than 0 (like say less than -5) then t = 1 for all l of interest, and we’re sampling from p(q)^1 = p(q). This is the posterior we really want. However this posterior is assumedly multi-modal and has lots of difficult to sample regions, so if we shift and stretch the prior for l so that it spends more time near 0 or above… then t is between 1 and tmax… and we’re sampling from p(q)^(1/t) which is a tempered distribution where the energy gaps between different portions are relatively less severe… if t goes to infinity then p(q)^(1/t) goes to 1 and in that limit the q value can be anything.

In the end, we simply drop all the samples where t was greater than say 1.0001 and these are our sample from p(q)

Now I think Yuling’s idea was a slight variation on this, with instead of just tempering p(q) we do p(q)^(1-t) * b(q)^t and t is now between 0 and 1, and we’re continuously deforming between p(q) and b(q)… which is another good idea, but differs mainly in the details. we’re still ultimately just looking at the samples were t ~ 0 so we’re sampling from p(q)

5. Carlos Ungil says:

Nice paper. I have a question about that “flower shaped distribution” (Nemeth et al. (2019) doesn’t seem to discuss it, by the way, unless I didn’t check the right paper).

Maybe I’m missing something (what parameters are you using?) but the plots in Figure 9 do not seem to correspond to the density function you wrote just before.

6. Ahmed says:

Thanks for post. FYI – a small typo.

exp(mean(log(1+ month_return)))-1- mean(month_return)

From what I am seeing from the source you wanted to write it as:

exp(mean(log(1+ month_return)))-1- mean(month_return) = 0

7. fin says:

Hi Yuling,

Interesting post! Another small typo: you are using “irreverent” when you mean “irrelevant”. Irreverent means “disrespectful,contemptuous, scornful” (purposefully dropping the country’s flag would be an irreverent act). Irrelevant = ¬relevant.