I’ve been talking to Michael Betancourt and Charles Margossian about implementing analytic derivatives for HMMs in Stan to reduce memory overhead and increase speed. For now, one has to implement the forward algorithm in the Stan program and let Stan autodiff through it. I worked out the adjoint method (aka reverse-mode autodiff) derivatives of the HMM likelihood (basically, reverse-mode autodiffing the forward algorithm), but it was stepwise and the connection to forward-backward wasn’t immediately obvious. So I thought maybe someone had already put a bow on this in the literature.

It was a challenging Google search, but I was rewarded with one of the best papers I’ve read in ages and by far the best thing I’ve ever read on hidden Markov models (HMMs) and their application:

- Qin, F., Auerbach, A. and Sachs, F., 2000. A direct optimization approach to hidden Markov modeling for single channel kinetics.
*Biophysical Journal* 79(4):1915–1927.

The paper provides elegant one-liners for the forward algorithm, the backward algorithm, the likelihood, and the derivative of the likelihood with respect to model parameters. For example, here’s the formula for the likelihood:

$latex L = \pi^{\top} \cdot \textrm{diag}(B_1) \cdot A \cdot \textrm{diag}(B_2) \cdot A \cdots \textrm{diag}(B_{T-1}) \cdot A \cdot \textrm{diag}(B_T) \cdot 1.$

where $latex \pi$ is the initial state distribution, $latex B_t$ is the vector of emission densities for the states, $latex A$ is the stochastic transition matrix, and $latex 1$ is a vector of 1s. Qin et al.’s software uses an external package to differentiate the solution for $latex \pi$ as the stationary distribution for the transition matrix $latex A$, i.e., the solution of $latex \pi^{\top} \cdot A = \pi^{\top}.$
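The matrix-product likelihood can be sketched in a few lines of numpy. This is a toy illustration, not the paper’s software; the transition matrix and emission densities are made up, and $latex \pi$ is computed as the stationary distribution of $latex A$ the way the paper describes:

```python
import numpy as np

# Hypothetical 2-state HMM with T = 3 observations.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])      # stochastic transition matrix
B = np.array([[0.5, 0.1],      # B[t, k] = emission density of obs t in state k
              [0.3, 0.7],
              [0.4, 0.2]])

# Stationary distribution: solve pi' (A - I) = 0 subject to sum(pi) = 1.
K = A.shape[0]
M = np.vstack([A.T - np.eye(K), np.ones(K)])
b = np.zeros(K + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(M, b, rcond=None)

# Likelihood L = pi' diag(B_1) A diag(B_2) ... A diag(B_T) 1,
# accumulated left to right one factor at a time.
acc = pi * B[0]                 # pi' diag(B_1)
for t in range(1, len(B)):
    acc = (acc @ A) * B[t]      # ... A diag(B_t)
likelihood = acc.sum()          # ... times the vector of 1s
```

Accumulating left to right keeps everything a vector-matrix product, so the whole likelihood is $latex O(T K^2)$ rather than a chain of matrix-matrix products.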

The forward and backward algorithms are stated just as neatly, as are the derivatives of the likelihood w.r.t. parameters. The authors put the likelihood and derivatives together to construct a quasi-Newton optimizer to fit maximum likelihood estimates of HMMs. They even use second derivatives for estimating standard errors. For Stan, we just need the derivatives to plug into our existing quasi-Newton solvers and Hamiltonian Monte Carlo.

But that’s not all. The paper’s about an application of HMMs to single-channel kinetics in chemistry, a topic about which I know nothing. The paper starts with a very nice overview of HMMs and why they’re being chosen for this chemistry problem. The paper ends with a wonderfully in-depth discussion of the statistical and computational properties of the model. Among the highlights is the joint modeling of multiple experimental data sets with varying measurement error.

In conclusion, if you want to understand HMMs and are comfortable with matrix derivatives, read this paper. Somehow the applied math crowd gets these algorithms down correctly and cleanly where the stats and computer science literatures flail in comparison.

Of course, for stability in avoiding underflow of the densities, we’ll need to work on the log scale. Or if we want to keep the matrix formulation, we can use the incremental rescaling trick to rescale the columns of the forward algorithm and accumulate our own exponent to avoid underflow. We’ll also have to autodiff through the solution to the stationary distribution, but Stan’s internals make that particular piece of plumbing easy to fit in and also a potential point of derivative optimization. We also want to generalize to the case where the transition matrix $latex A$ depends on predictors at each time step through a multi-logit regression. With that, we’d be able to fit anything that can be fit with the nicely designed and documented R package moveHMM, which can already fit a range of maximum likelihood estimates for HMMs.
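The rescaling trick amounts to normalizing the forward vector after each step and accumulating the log of the scale factors, so the recursion returns the log likelihood directly and never underflows, even for long series. A minimal sketch (hypothetical inputs, same shapes as the matrix formula above):

```python
import numpy as np

def hmm_log_likelihood(pi, A, B):
    """Forward algorithm with incremental rescaling.

    pi: initial state distribution (K,)
    A:  stochastic transition matrix (K, K)
    B:  emission densities per time step and state (T, K)
    Returns the log likelihood of the observations.
    """
    alpha = pi * B[0]                    # forward vector at t = 1
    log_lik = 0.0
    for t in range(1, len(B)):
        c = alpha.sum()                  # scale factor for this column
        log_lik += np.log(c)             # accumulate the exponent on the log scale
        alpha = (alpha / c) @ A * B[t]   # rescaled forward recursion
    return log_lik + np.log(alpha.sum())
```

For short series this matches the log of the direct matrix product; the payoff is that `alpha` stays O(1) in magnitude no matter how long the series gets.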

For information, the derivative of the forward algorithm (optimal filter/predictor) and the corresponding likelihood for finite state-space HMM can also be found in the following conference paper:

Francois LeGland and Laurent Mevel, Recursive estimation in hidden Markov models, Proceedings of the 36th IEEE Conference on Decision and Control, vol. 4, pp. 3468–3473, 1997.

These authors use such recursions to propose an online parameter estimation scheme; see also the subsequent paper:

Francois LeGland and Laurent Mevel, Exponential forgetting and geometric ergodicity in hidden Markov models, Mathematics of Control, Signals and Systems, vol. 13, no. 1, pp. 63–93, 2000.

Thanks for the link. The LeGland and Mevel paper is much more mathematical and much harder for me to read. It cites an earlier paper with an ergodicity result, but I’m not sure for which sampler. In our practical experience, Hamiltonian Monte Carlo (HMC) has mixed well for HMMs, but autodiffing the forward algorithm in Stan is a lot slower than a custom expectation-maximization (EM) implementation. We can also use Stan’s quasi-Newton optimizer (L-BFGS), which takes advantage of derivatives (and approximate Hessians), to find maximum likelihood estimates (MLE) efficiently. That’ll also get faster and more scalable when we move from autodiffing the forward algorithm to analytical gradients.

If I’m understanding the gist of their recursive algorithm, it’s very similar to how I implemented stochastic gradient descent (SGD) for HMMs and conditional random fields (CRFs) in LingPipe (the NLP toolkit I worked on before Stan). I built recursive accumulators following the forward-backward algorithm and fed them into a standard batched SGD with an interface to control the learning rate (the schedules that satisfy the Robbins-Monro conditions work in theory, but not so well in practice). The main reason I chose to go back into academia to work with Andrew is so that I could understand discussions like these :-).

This reply has so many acronyms I feel like I owe readers a glossary.

I remember Jason Eisner telling me about the connection between forward-backward and autodiff in the early 2000s; I don’t know if he published anything on this, though. This came up in his work on the Dyna framework, I believe. I can chase this up if you are interested.

Mark,

Was this what you had in mind? https://www.cs.jhu.edu/~jason/papers/eisner.spnlp16.pdf

Thanks Adam,

I haven’t seen this paper, but yes, it seems to have all the things that Jason was saying (although I recall him saying these things a decade before the paper).

Anyway, thanks for the reference, and Bob can add this to his bibliography!

Mark

@Adam: Thanks for the reference.

@Mark: Jason cites some earlier work on the relation in his paper. The new part for me is the connection of derivatives to expectations in log-linear models.

What I like about the paper I linked above is the clear presentation of the forward-backward algorithm and derivatives in matrix form. That particular form makes the connection between the algorithm for expectations and the general derivative propagation we need for Stan (reverse-mode autodiff) very clear. For example, we’ll have varying transition matrices defined as multi-logit regressions based on time-specific predictors (aka features in NLP). I think there’s a name for that model in NLP where the transitions are modeled with logistic regressions.
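For the varying-transition-matrix case, each row of $latex A_t$ becomes a softmax over destination states of a linear function of the time-$latex t$ predictors. A hypothetical sketch (the coefficient array `beta` and predictor vector `x_t` are illustrative names, not from the paper or from Stan):

```python
import numpy as np

def transition_matrix(x_t, beta):
    """Multi-logit transition matrix from time-specific predictors.

    x_t:  predictor vector at time t, shape (D,)
    beta: coefficients, shape (K, K, D) indexed by
          (source state, destination state, predictor)
    Returns a (K, K) stochastic matrix: each row is a softmax.
    """
    logits = beta @ x_t                           # (K, K) matrix of scores
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True)
```

With this in place, the forward recursion is unchanged except that it multiplies by `transition_matrix(x_t, beta)` at each step instead of a fixed $latex A$.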

Forward-mode autodiff is less burdensome than reverse mode, but it could probably also benefit from the analytical treatment. The trick would be to find an efficient way to have reverse-mode nested in forward mode so we could get second derivatives (Hessians) more efficiently and accurately than approximate gradient methods.