## Hierarchical stacking

(This post is by Yuling)
Gregor Pirš, Aki, Andrew, and I wrote:

Stacking is a widely used model averaging technique that yields asymptotically optimal predictions among linear averages. We show that stacking is most effective when the model predictive performance is heterogeneous in inputs, so that we can further improve the stacked mixture by a full-Bayesian hierarchical modeling. With the input-varying yet partially-pooled model weights, hierarchical stacking improves average and conditional predictions. Our Bayesian formulation includes constant-weight (complete-pooling) stacking as a special case. We generalize to incorporate discrete and continuous inputs, other structured priors, and time-series and longitudinal data. We demonstrate on several applied problems.

As per the spirit of hierarchical modeling, if some quantity could vary in the population, then it should vary. Hierarchical stacking is an approach that combines models with input-varying yet partially-pooled model weights. Besides all the points we made in the paper, I personally like our new method for two reasons:

1. This paper started from a practical motivation. Last year I was trying to use model averaging to help with epidemiological predictions. Complete-pooling stacking was a viable solution, until I realized that a model good at predicting new daily cases in New York state was not necessarily good at predicting them in Paris. In general, how well a model fits the data depends on which input location we condition on: all models are wrong, but some are somewhere useful. The present paper aims to learn where that somewhere is, and to construct a local model average in which the model weights vary over the input space. This extra flexibility comes with additional complexity, which is why we need hierarchical priors for regularization. The graph above is an example in which we average a sequence of election forecast models and allow the model weights to vary by state.
2. In addition to this dependence on inputs, our new approach is also distinguished by its full-Bayesian formulation, whereas stacking was originally viewed as an optimization problem, or an optimization approach to a Bayesian decision problem. For a simple model such as a linear regression with big n, it probably makes little difference whether we use the MAP, the MLE, or an average over the full posterior distribution. But for complex models, the full-Bayesian approach is appealing because it
• makes hierarchical shrinkage easier: instead of an exhaustive grid search over tuning parameters, we use gradient-informed MCMC;
• provides built-in uncertainty estimation of model weights;
• facilitates generative modeling and open-ended data inclusion: When running hierarchical stacking on election predictions, we encode prior correlations of states to further stabilize model weights in small states;
• enables post-processing, model checking, and approximate LOO-CV, as in any usual Bayesian model.
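
To make the idea of input-varying yet partially pooled weights concrete, here is a minimal sketch for a discrete input (for example, states). It is not the code from the paper. It assumes that each cell has its own unconstrained weight vector, that the simplex weights come from a softmax, and that the cell-level values are shrunk toward a shared mean per model by a normal prior. The function and variable names are made up for illustration, and the hyperpriors on `mu` and `sigma` and any identifiability constraint on the softmax are left out to keep it short.

```python
import math

def softmax(values):
    """Map unconstrained reals to simplex weights (numerically stable)."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def hierarchical_stacking_logpost(alpha, mu, sigma, loo_dens, cell):
    """Unnormalized log posterior of cell-specific stacking weights.

    alpha[j][k]    : unconstrained weight of model k in input cell j
    mu[k], sigma[k]: shared prior mean and scale for model k (partial pooling)
    loo_dens[i][k] : leave-one-out predictive density of model k at point i
    cell[i]        : index of the input cell that point i falls in
    """
    weights = [softmax(row) for row in alpha]
    # Log score of the locally weighted mixture at each held-out point.
    logp = sum(
        math.log(sum(w * p for w, p in zip(weights[cell[i]], loo_dens[i])))
        for i in range(len(cell))
    )
    # Hierarchical prior: cell-level values shrink toward a shared mean.
    for row in alpha:
        for k, a in enumerate(row):
            logp += normal_logpdf(a, mu[k], sigma[k])
    return logp

# Toy numbers, invented for illustration: model 0 predicts well in cell 0,
# model 1 predicts well in cell 1.
loo_dens = [[0.8, 0.2], [0.7, 0.3], [0.2, 0.9], [0.1, 0.6]]
cell = [0, 0, 1, 1]
mu, sigma = [0.0, 0.0], [2.0, 2.0]

local = hierarchical_stacking_logpost([[1.0, -1.0], [-1.0, 1.0]], mu, sigma, loo_dens, cell)
pooled = hierarchical_stacking_logpost([[0.0, 0.0], [0.0, 0.0]], mu, sigma, loo_dens, cell)
print(local > pooled)
```

In this toy setup the locally varying weights gain more in log score than they lose to the prior. Setting every row of `alpha` to the same vector recovers constant-weight (complete-pooling) stacking as a special case. In practice one would sample `alpha`, `mu`, and `sigma` jointly with MCMC rather than compare two fixed values.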

Regarding (2), we justify why this full-Bayesian inference makes sense in our formulation (via data augmentation). Looking forward, the general principle of converting a black-box learning algorithm into a generative model (or equivalently, a chunk of Stan code) could be fruitful for many other methods (existing examples include KNN and SVM). The generative model needs to be specified on a case-by-case basis. Take linear regression as an example: the least-squares estimate of the regression coefficients is equivalent to the MLE in a normal-error model. But we should not directly take the negative loss function as a log density and sample $latex \beta$ from $latex \log p(\beta|y) = -(x^T \beta - y)^2 + \mathrm{constant}$. The latter density differs from the “full Bayesian inference” of the posited normal model unless $latex \sigma$ is a known constant (1, if the squared loss carries the usual factor of 1/2). From the “marginal data augmentation” point of view, $latex \sigma$ is a working parameter that needs to be augmented in the density and averaged over in the inference. In general there could be a nuisance normalizing constant, and that is when some path sampling is useful. Because of this additional challenge, we leave other scoring rules (interval scores, CRPS) mostly untouched in this paper, but they could be interesting to investigate, too.
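
The linear regression point can be checked with a few lines of arithmetic. The sketch below uses a one-predictor regression without intercept and toy data invented for illustration. It assumes a flat prior on the coefficient, and for the normal-error model it plugs in a residual-based estimate of sigma, which is a shortcut rather than the full Bayesian answer that averages over sigma. The point estimate is the same either way; what changes is the spread.

```python
import math

# Toy data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 7.0, 3.0, 14.0, 9.0]

sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Least-squares estimate, which is also the normal-error MLE for any sigma.
beta_hat = sxy / sxx

# Treating exp(-sum of squared errors) as the posterior of beta gives a
# normal density centered at beta_hat with variance 1 / (2 * sxx),
# i.e. it silently fixes sigma^2 = 1/2 whatever the data look like.
sd_loss_as_density = math.sqrt(0.5 / sxx)

# Under the normal-error model with a flat prior on beta, the posterior sd
# given sigma is sigma / sqrt(sxx). Plugging in a residual-based estimate
# of sigma is a simplification of averaging over sigma.
rss = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(rss / (len(x) - 1))
sd_normal_model = sigma_hat / math.sqrt(sxx)

print(f"beta_hat = {beta_hat:.3f}")
print(f"sd, loss treated as log density: {sd_loss_as_density:.3f}")
print(f"sd, normal model with estimated sigma: {sd_normal_model:.3f}")
```

With these noisy toy data the loss-as-density version is far more confident than the data warrant, because its implied error scale never adapts to the residuals.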

This preprint is the third episode of our Bayesian model averaging series (previously we’ve had complete-pooling stacking and combining non-mixing posterior chains).