Bob, have you considered that perhaps the cat in the photo (Rocky) is actually programming upright in Stan?

Bob:

You’re overestimating me. People send me cat pictures and I put them in the blog as is!

When one or two chains are stuck, we’ve also been talking about using stacking to average over chains. Lowering the step size etc. can work, but if you’ve already run Stan and you find that one chain has not mixed, you can use stacking to mix it in with the others and get an approximate inference.
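A rough sketch of what that stacking looks like (a toy setup of my own, not Stan’s interface): treat each chain as a separate approximation and choose simplex weights that maximize the log pointwise predictive density of the mixture, so a badly stuck chain ends up with weight near zero.

```python
# Toy sketch of stacking over chains (illustrative names, not Stan's API):
# each chain is treated as a separate approximation, and we pick simplex
# weights maximizing the log pointwise predictive density of the mixture.
import numpy as np
from scipy.optimize import minimize

def stack_chains(lpd):
    """lpd: (n_chains, n_points) array of pointwise log predictive
    densities, e.g. per-chain LOO values. Returns one weight per chain."""
    n_chains = lpd.shape[0]

    def neg_score(z):
        w = np.exp(z - z.max())            # softmax keeps weights on the simplex
        w /= w.sum()
        mix = np.log(np.exp(lpd).T @ w)    # log mixture density at each point
        return -mix.sum()

    z = minimize(neg_score, np.zeros(n_chains), method="BFGS").x
    w = np.exp(z - z.max())
    return w / w.sum()

# A stuck chain whose draws fit the data much worse gets ~zero weight:
rng = np.random.default_rng(0)
mixed = rng.normal(-1.0, 0.1, size=(3, 50))   # three well-mixed chains
stuck = rng.normal(-5.0, 0.1, size=(1, 50))   # one stuck chain
weights = stack_chains(np.vstack([mixed, stuck]))
```

The point of the softmax parameterization is just to keep the optimization unconstrained while the weights stay non-negative and sum to one.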

We’ve also been talking about being able to run Stan for an (approximately) fixed amount of time per chain rather than a fixed number of iterations. That would help with this sort of problem, but it requires some programming effort so it has to wait on line behind various other Stan improvements. Stacking you can do right away with existing technology.

Given it seems so obvious, I assume Andrew put it that way on purpose as a joke.

If one chain gets stuck, it may be because it’s the only one that found the high-curvature part of the posterior. In those cases, we need to increase the target acceptance rate to lower the step size or, more usually, reparameterize.

Getting stuck can also come from adaptation going astray. We’re introducing a parallel adaptation mode that’ll combine information across chains to try to get around this kind of bad adaptation in one chain.

1. The data analysis pipeline is wider in scope than what is implied here. A nice platform for documenting and sharing a data analysis pipeline is openml.org; see: Vanschoren J, van Rijn JN, Bischl B, and Torgo L, OpenML: networked science in machine learning. SIGKDD Explorations 15(2), pp 49-60, 2013.

2. The data analysis pipeline is also feeding the current replicability crisis. See: Botvinik-Nezer R, Holzmeister F, et al., Variability in the analysis of a single neuroimaging dataset by many teams, bioRxiv, 2019, https://www.biorxiv.org/content/10.1101/843193v1

3. Starting in 2010, I worked for 8 years on a “Note on the theory of Applied Statistics” (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2171179). The “note” reached version 18 with inputs from many statisticians. At some point I let it rest. I am glad to see that Andrew adopted a similar terminology in https://gelmanstatdev.wpengine.com/2017/05/26/theoretical-statistics-theory-applied-statistics-think/. It is indeed time for a strong theory of applied statistics. MRP is surely part of it. Yesterday’s and today’s conference, triggered by a suggestion from Andrea, was excellent. Thank you, Lauren and the org team!

In my problem, when I did optimization and/or ADVI to get initial points, the sampling usually worked well in about 3 out of 4 chains, but would get stuck in the 4th… I’m sure it was a tricky model to sample, but whichever chains did work tended to work mostly fine. The problem was consistently getting all chains to work fine, and convincing myself that it wasn’t just a fluke that they appeared to work fine…

I think a bunch of techniques have been created to improve the diagnostics, like your rank plot ideas and so forth. Probably I should return to that model and see if I can make it work.

Stan’s a bit better, but not a lot better, at adaptation than it was two years ago.

That’s a good point—we can run into problems where the tails are stiff and the body of the distribution is relatively easy to sample. The only thing we’ve been able to do in those cases is (a) find better initializations, (b) reparameterize so all parameters are on unit scale, and (c) reparameterize so parameters are less correlated.

In the specific kinds of models you’re talking about, where there are multilevel intercepts, the problem is usually only very weak identifiability. It’s what Andrew calls the Folk Theorem: the model just isn’t well determined by the data. In those cases, we’re often uncertain about the hierarchical parameters and wind up with high, funnel-like curvature when their variance is low. This is another case where zero-avoiding priors and the like can help.
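To make the funnel point concrete, here’s the usual non-centered trick in a toy sketch (the function and variable names are mine): instead of sampling a group-level intercept `alpha ~ normal(mu, tau)` directly, where the geometry the sampler sees collapses as `tau` shrinks, sample a unit-scale variable and shift-and-scale it deterministically.

```python
# Toy sketch of the non-centered reparameterization (names are mine).
# Centered:      alpha ~ normal(mu, tau)            -> geometry depends on tau
# Non-centered:  z ~ normal(0, 1); alpha = mu + tau*z -> unit-scale geometry
import numpy as np

def non_centered_alpha(mu, tau, n, rng):
    z = rng.normal(0.0, 1.0, size=n)   # the sampler only ever sees unit scale
    return mu + tau * z                # deterministic transform back to alpha

# The transformed draws have exactly the normal(mu, tau) distribution:
rng = np.random.default_rng(42)
alpha = non_centered_alpha(2.0, 0.5, 200_000, rng)
```

The distribution of `alpha` is unchanged; only the parameterization the sampler works in is different, which is why this helps with the funnel.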

I don’t know if that’s also a similar problem to sampling near the mode, which is in an energy well that can be pretty deep. I know PyMC does that. The real problem is that most of the multilevel models we use don’t have posterior modes.

Often if conditioning is bad for sampling, it’ll be bad for optimization too. On the other hand, it’s scalable to use quasi-Newton methods in optimization because we don’t need to maintain detailed balance there. So there can be a win here. The real win will be if we can figure out how to condition for local geometry without some horrendous cubic penalty in the number of parameters.

Daniel,

Bob has a really cool idea for warmup that combines optimization and NUTS to rapidly get into the typical set. I think he’s done some experiments on it but he (and others) keep getting distracted with other Stan things, so it’s still in the speculative stage. I have high hopes that it could help solve the problems you’re talking about here. More generally, we’ve been thinking about better integration of computation into workflow, for example the idea that if your model isn’t working, it’s better to learn this right away rather than running your algorithms for several hours.

In the problems I’ve had where Stan couldn’t really do a great job… it was often that I couldn’t even get past warmup… But that model probably had on the order of 10-20k parameters (a handful for each public use microdata area in the census, and there are several thousand of those, plus several hierarchical parameters that connected all the PUMAs together).

That was ~ 2 years ago, perhaps Stan is much better today.

One thing I think is a good idea is to use optimization to find a starting point. Sure, the optimum isn’t in the typical set, but it *is* in the high probability density set, and moving away from the optimum moves you toward the typical set pretty much no matter what direction you go… A big problem with my complex models was that Stan would start in la-la-land and decide that in la-la-land it needed to take MICROSCOPIC timesteps, and it would never get anywhere.
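A toy sketch of that optimize-then-initialize idea (the target density here is a stand-in standard normal, not any model from this thread): find the mode with an optimizer, then jitter away from it to start the chain.

```python
# Toy sketch of optimize-then-initialize. The target is a stand-in
# d-dimensional standard normal, so -log density is a simple quadratic.
import numpy as np
from scipy.optimize import minimize

def neg_log_density(theta):
    return 0.5 * np.dot(theta, theta)   # mode at the origin

d = 100
far_init = np.full(d, 10.0)             # a "la-la-land" starting point
mode = minimize(neg_log_density, far_init, method="L-BFGS-B").x

# The mode sits in the high-density set; jittering away from it moves
# toward the typical set (which for a standard normal concentrates at
# radius ~ sqrt(d) from the mode).
rng = np.random.default_rng(1)
start = mode + 0.1 * rng.normal(size=d)
```

The jitter scale is arbitrary here; the point is just that any direction away from the mode heads toward the typical set.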

Andrew originally hired me and Matt Hoffman not to work on Stan, but to work on workflow for hierarchical regression-like models. We were busy trying to come up with a language that was more expressive than lme4’s (which can’t express many of the models in Gelman and Hill’s regression book) and would allow this kind of model exploration. After struggling for a couple of months, we gave up. The problem you immediately run into is that (a) regression notation is very limiting [hence elaborate extensions like SEM], and (b) the combinatorics of automatic model exploration are absurd. If you have 10 predictors, as many of Andrew’s poli sci models do, that’s a whole lot of possible interaction terms: 45 binary, 120 ternary, 210 quaternary, and so on. But the real problem is that you’re selecting a subset. We can’t ever consider all 2^45 possible subsets of the binary interactions. We also didn’t have any special computation in mind for hierarchical models, though Matt was very interested in variational approaches, especially stochastic ones (he also wrote the stochastic gradient variational inference paper while a postdoc with Andrew). So we leaned into the inventor’s paradox and decided to tackle harder problems: a more general language and a more general sampler. We started with autodiff implementations of HMC in C++; then I quickly realized we could build a BUGS-like language on top of that, and Matt realized the problem wasn’t that Gibbs implementations were too slow, but that we needed something like HMC, if only we could automatically tune it.
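For the record, those interaction counts check out: with 10 predictors there are C(10, k) possible k-way interaction terms.

```python
# The interaction counts quoted above: C(10, k) k-way terms from 10 predictors.
from math import comb

counts = {k: comb(10, k) for k in (2, 3, 4)}
# counts == {2: 45, 3: 120, 4: 210}
```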

I do believe the neural network models offer advantages in that they learn both non-linearities and interactions. It’s not as automatic as some of the more gee-whiz sales pitches, but it’s a lot more automatic than typing formulas into lme4.

@James and Andrew: warm starting on the parameters doesn’t help much. Stan converges pretty quickly to the typical set (75 iterations is the default, but it often takes only 10 or 20 iterations). That’s counterintuitive for those coming from BUGS, but HMC is very good at directed search when it starts in the tails. It’s the exploration within the typical set for estimating adaptation parameters in Stan that takes a long time during warmup. So any warm start needs to also do something about adaptation. We were thinking about this at the point we were thinking about discrete parameters, because as discrete parameters change during updates (say with block Gibbs or Metropolis), the adaptation for HMC also has to update. I was really hoping the PyMC folks could evaluate this, since they have the toolkit.

This post had a lot of triggers! My mom’s dissertation was based on Linda Flower’s work and extended it to heterogeneous populations with different approaches to writing (mainly around whether they worked with or without a global plan and outline). A very Andrew idea! Flower was also down the hall from me when I was a professor at Carnegie Mellon. As I recall, the English department was so split between people who read and wrote and those who did continental philosophy that they couldn’t even have a joint department meeting!

The fact that model building is Lego and we all have the same components is why all of our models wind up looking the same. The first thing I did with Andrew (before moving to Columbia) was fit some noisy measurement models for crowdsourcing. Turns out Phil Dawid had already come up with that model in the 1970s because he was using the same Lego bricks.
