## Hierarchical stacking, part II: Voting and model averaging

(This post is by Yuling)
Yesterday I advertised our new preprint on hierarchical stacking. Apart from the methodological development, perhaps I could draw your attention to the analogy between model averaging/selection and voting systems.

• Model selection = we have multiple models to fit the data and we choose the best candidate model.
• Model averaging = we have multiple models to fit the data and we design a weighted average of candidate models for future predictions.
• Local model averaging = we have multiple models to fit the data and we design a weighted average of candidate models for future predictions, and this time the model weights vary with the input predictors.

For notational simplicity I will consider a two-candidate situation. The easiest voting scheme is a popular vote: every ballot is counted. In model averaging/selection, this approach corresponds to first fitting each model individually and then counting the utility of each model. Here the voters are the data points in the observations, and they are more sophisticated than casting a binary ballot: what they vote with is a continuous number that depends on a pre-chosen utility function (negative pointwise L2 loss, pointwise log predictive density, etc.). We could replace this training error with the leave-one-out cross-validated error too. But the bottom line is that we count each of the ballots and add them up to obtain either the negative mean squared error or the mean log predictive density. For Bayesian model averaging or pseudo Bayesian model averaging, we have a proportional representation system: model 1’s weight is proportional to its total voting share (the exponentiated sum of pointwise log predictive densities, that is, the product of pointwise predictive densities). For model selection using LOO-CV, we pick the candidate who wins the popular vote.
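To make the popular-vote picture concrete, here is a minimal sketch in Python. It assumes we already have a matrix of pointwise log predictive densities (one row per data point, one column per model); the numbers and the function names `popular_vote`, `select_model`, and `pseudo_bma_weights` are made up for illustration and are not taken from the paper or from any package.

```python
import numpy as np

def popular_vote(lpd):
    """Add up the ballots: total log predictive density per model.

    lpd has shape (n_points, n_models); each row is one voter.
    """
    return np.asarray(lpd, dtype=float).sum(axis=0)

def select_model(lpd):
    """Model selection: pick the candidate who wins the popular vote."""
    return int(np.argmax(popular_vote(lpd)))

def pseudo_bma_weights(lpd):
    """Pseudo-BMA-style weights: proportional to exp(total log density)."""
    totals = popular_vote(lpd)
    w = np.exp(totals - totals.max())  # subtract the max for numerical stability
    return w / w.sum()

# Hypothetical ballots from four data points for two candidate models.
lpd = np.array([
    [-1.0, -1.2],
    [-0.9, -1.1],
    [-1.1, -1.0],
    [-1.0, -1.3],
])

print(popular_vote(lpd))        # total votes per model
print(select_model(lpd))        # index of the winner
print(pseudo_bma_weights(lpd))  # proportional-representation weights
```

In practice the ballots would more likely be leave-one-out log predictive densities than the in-sample values used in this toy example.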

One implication of this proportional representation system is that as a candidate you would like to take care of every voter. It is related to the “median voter theorem” in economics, but even worse: data points vote on the log scale and can get wildly mad if you ignore them. To some extent this explains why BMA is typically polarized. Imagine a presidential election in which, instead of casting a binary ballot, you express your preferences by assigning two positive real numbers a and b to the two candidates, such that a+b=1. A candidate’s total vote count is the product of all the endorsements they receive (equivalently, the sum of the log a or log b they receive). The resulting vote shares will typically be far apart.
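A small numerical sketch of this multiplicative election may help. It assumes, purely for illustration, that every voter gives the same endorsement a to candidate A and 1-a to candidate B; the function name `multiplicative_share` is hypothetical.

```python
import numpy as np

def multiplicative_share(a, n_voters):
    """Normalized share of candidate A when vote counts are products.

    Every voter endorses A with a and B with 1 - a, so the totals are
    a**n_voters and (1 - a)**n_voters. We work on the log scale to
    avoid underflow for large n_voters.
    """
    log_a = n_voters * np.log(a)
    log_b = n_voters * np.log(1.0 - a)
    m = max(log_a, log_b)
    wa, wb = np.exp(log_a - m), np.exp(log_b - m)
    return wa / (wa + wb)

# The same tiny per-voter edge becomes more lopsided as the electorate grows.
for n in (1, 10, 100, 1000):
    print(n, multiplicative_share(0.51, n))
```

With a single voter the share is just a, but because the per-voter edge compounds multiplicatively, the normalized share drifts toward 1 as the number of voters increases.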

We explain in the paper (Section 3) that stacking behaves like a “winner takes all”/electoral college system. That is, we first group data points/voters according to their bipartisan preference: everyone in district A likes candidate A more (even if only slightly more), and likewise everyone in district B likes candidate B more (ties are ignored). Then the winner takes all the shares of that district, no matter how big or small the winning margin is. Some interesting phenomena can arise under this mechanism. In the paper we provide an example in which, as two models become closer and closer to each other (smaller KL divergence, higher correlation of predictions, etc.), their stacking weights become more distinct and eventually approach 0 and 1 when the two models are nearly identical. This dynamic would never happen in a proportional representation system: if two candidates are identical twins, why shouldn’t they get the same voting share? But from the standpoint of an oracle planner/stacking, if these two candidates do function identically, why should both be kept? To put it another way, observing a nationwide aggregated popular vote share of 51%/49% can correspond to two stories:

• In the first story, candidate A appeals to 51% of the population but is hated by the remaining 49%. Then both BMA and stacking will assign the models 51%/49% weights.
• In the second story, the two candidates are non-identical twins: Candidate A is superior to Candidate B by a tiny margin in all aspects, such that every voter slightly prefers Candidate A and would vote for A with probability 0.51. Then BMA assigns candidate A weight 0.51 and B weight 0.49. Fair but not ideal. What stacking does is realize that candidate A actually makes everyone better off, and so it assigns A weight 1 regardless of the small winning margin. I think economists may call this a “Pareto improvement”.
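The stacking side of these two stories can be checked with a toy calculation. The sketch below assumes two models and log-score stacking, that is, it picks the weight w in [0, 1] that maximizes the sum over data points of log(w p_A + (1-w) p_B). It uses a simple grid search rather than a proper optimizer, and the pointwise densities (0.9/0.1 in the first story, 0.51/0.49 in the second) are invented for illustration.

```python
import numpy as np

def stacking_weight(dens_a, dens_b, n_grid=1001):
    """Two-model stacking weight for model A by grid search.

    Maximizes sum_i log(w * dens_a[i] + (1 - w) * dens_b[i]) over w in [0, 1].
    The objective is concave in w, so a fine grid is adequate for a sketch.
    """
    a = np.asarray(dens_a, dtype=float)
    b = np.asarray(dens_b, dtype=float)
    grid = np.linspace(0.0, 1.0, n_grid)
    mix = grid[:, None] * a[None, :] + (1.0 - grid[:, None]) * b[None, :]
    score = np.log(mix).sum(axis=1)
    return float(grid[np.argmax(score)])

# Story 1: 51 voters strongly prefer A, 49 voters strongly prefer B.
a1 = np.array([0.9] * 51 + [0.1] * 49)
b1 = np.array([0.1] * 51 + [0.9] * 49)
w_story1 = stacking_weight(a1, b1)

# Story 2: every voter prefers A by a tiny margin.
a2 = np.full(100, 0.51)
b2 = np.full(100, 0.49)
w_story2 = stacking_weight(a2, b2)

print(w_story1)  # close to the 51%/49% split
print(w_story2)  # all weight on A
```

In the first story the maximizer sits near the population split, while in the second story the objective increases monotonically in w, so the winner takes everything.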

It is debatable whether the popular vote is more democratic than the electoral college. One advantage of the latter approach, or stacking in our context, is that a model can simply ignore some part of the data and focus only on where its expertise lies. Because of the weighting at the end, two distinct models can then collaborate. What matters is not how good a single candidate is overall, but rather how the combination can be optimized, which implicitly encourages some individual diversity.

That said, one remedy that can improve the electoral college system is to have finer-grained election districts. This is precisely what hierarchical stacking aims to do by assigning local weights conditional on the predictors X. That is, we first group voters according to their preferences, such that people living in the same tribe have similar views. Then we assign each tribe a local combination of candidates according to its preferences. Such an arrangement is clearly too complicated for any society to adopt, but at least you could apply it to your Stan models and make your data points happier.
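As a rough illustration of local weights, the sketch below runs the two-model stacking optimization separately within each tribe, where a tribe is a level of a discrete predictor. This is a no-pooling simplification that I am assuming for the example: hierarchical stacking as described in the paper places a hierarchical prior on the local weights and fits them jointly, which this toy code does not attempt. The densities, group labels, and function names are hypothetical.

```python
import numpy as np

def stacking_weight(dens_a, dens_b, n_grid=1001):
    """Two-model log-score stacking weight for model A, by grid search."""
    a = np.asarray(dens_a, dtype=float)
    b = np.asarray(dens_b, dtype=float)
    grid = np.linspace(0.0, 1.0, n_grid)
    mix = grid[:, None] * a[None, :] + (1.0 - grid[:, None]) * b[None, :]
    return float(grid[np.argmax(np.log(mix).sum(axis=1))])

def local_stacking_weights(dens_a, dens_b, groups):
    """One stacking weight per group (no pooling across groups)."""
    a, b, g = np.asarray(dens_a), np.asarray(dens_b), np.asarray(groups)
    return {int(k): stacking_weight(a[g == k], b[g == k]) for k in np.unique(g)}

# Two tribes of 50 voters each: model A is the expert in tribe 0,
# model B is the expert in tribe 1.
groups = np.array([0] * 50 + [1] * 50)
dens_a = np.array([0.8] * 50 + [0.2] * 50)
dens_b = np.array([0.3] * 50 + [0.5] * 50)

global_w = stacking_weight(dens_a, dens_b)                 # one weight for everyone
local_w = local_stacking_weights(dens_a, dens_b, groups)   # one weight per tribe

print(global_w)
print(local_w)
```

A single global weight has to compromise between the two tribes, whereas the local weights hand each tribe to the model that serves it best. With few data points per tribe the no-pooling estimates would be noisy, which is where partial pooling across tribes becomes useful.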