I (Aki) will talk about reference models in model selection in the Laplace's Demon seminar series on 24 June at 15 UTC (18 in Finland, 17 in Paris, 11 in New York). See the seminar series website for a registration link, the schedule of other talks, and the list of recorded talks.

The short summary: 1) why a bigger model helps inference for smaller models, 2) the Bayesian decision-theoretic justification, 3) examples. There will be time for questions and discussion. Yes, it will be recorded.

The abstract:

I discuss and demonstrate the benefits of using a reference model in variable selection. A reference model acts as a noise filter on the target variable by modeling its data-generating mechanism. As a result, using the reference model's predictions in the model selection procedure reduces variability and improves stability, leading to better model selection performance and better predictive performance of the selected model. Assuming that a Bayesian reference model describes the true distribution of future data well, the theoretically preferred way to use it is to project its predictive distribution onto a reduced model, which leads to the projection predictive variable selection approach. Alternatively, reference models may also be used in an ad hoc manner in combination with common variable selection methods.
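The ad hoc use of a reference model mentioned at the end of the abstract can be sketched in a few lines. The toy example below (made-up Gaussian linear data, not the projpred implementation) fits a ridge regression on all predictors as the reference model, then runs forward selection against the reference model's fitted values instead of the noisy observations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [4.0, -3.0, 2.0]  # only the first three predictors matter
y = X @ beta + rng.normal(size=n)

# Reference model: a ridge fit on all predictors acts as a noise filter.
lam = 1.0
w_ref = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
y_ref = X @ w_ref  # reference predictions stand in for the noisy y

# Forward search: grow the submodel by whichever predictor best
# reproduces the reference model's predictions.
selected, remaining = [], list(range(p))
for _ in range(5):
    def mse_with(j):
        cols = selected + [j]
        w, *_ = np.linalg.lstsq(X[:, cols], y_ref, rcond=None)
        return np.mean((y_ref - X[:, cols] @ w) ** 2)
    best = min(remaining, key=mse_with)
    selected.append(best)
    remaining.remove(best)

print(sorted(selected[:3]))  # the three relevant predictors come first
```

Because the search targets the denoised reference fit rather than y itself, the ranking of candidate predictors is less sensitive to any particular noise realization, which is the stability benefit the abstract describes.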

The talk is based on work with many co-authors. Here is the list of papers and software, with links:

Juho Piironen, Markus Paasiniemi, and Aki Vehtari (2020). Projective inference in high-dimensional problems: prediction and feature selection. Electronic Journal of Statistics, 14(1):2155-2197. https://doi.org/10.1214/20-EJS1711

Federico Pavone, Juho Piironen, Paul-Christian Bürkner, and Aki Vehtari (2020). Using reference models in variable selection. arXiv preprint arXiv:2004.13118. https://arxiv.org/abs/2004.13118

Juho Piironen and Aki Vehtari (2016). Projection predictive input variable selection for Gaussian process models. 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). https://doi.org/10.1109/MLSP.2016.7738829

Juho Piironen and Aki Vehtari (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711-735. https://doi.org/10.1007/s11222-016-9649-y

Homayun Afrabandpey, Tomi Peltola, Juho Piironen, Aki Vehtari, and Samuel Kaski (2019). Making Bayesian predictive models interpretable: A decision theoretic approach. arXiv preprint arXiv:1910.09358. https://arxiv.org/abs/1910.09358

Donald R. Williams, Juho Piironen, Aki Vehtari, and Philippe Rast (2018). Bayesian estimation of Gaussian graphical models with projection predictive selection. arXiv preprint arXiv:1801.05725. https://arxiv.org/abs/1801.05725

Software:

Juho Piironen, Markus Paasiniemi, Alejandro Catalina and Aki Vehtari (2020). projpred: Projection Predictive Feature Selection. https://mc-stan.org/projpred

Video:

Recorded video of the talk and discussion

Case studies:

Model selection case studies

EDIT: added one paper

EDIT2: video link + case studies

Aki,

“Why a bigger model helps inference for smaller models” reminds me of this wonderful Radford Neal quote:

Radford also had a comment in favor of using reference models and then fitting the smaller model to approximate the reference model, but I don't think he ever did that in his papers.

I have always wondered whether this is true of complex models in general, particularly in the physical sciences. Often the “complex” part is either very hard to model or would require much more data to estimate, and for the purposes of the model it often just adds noise, so that leaving out the more complex features essentially filters out the noise and produces better estimates. I have done some work with multi-scale spatial-temporal models where, say, we are interested in the time trends and seasonal shifts in the data. We have data at the 1 km and even 750 m scale, and we can (and have) run the model at this scale, which captures all sorts of complex features like eddies and frontal structures. But when we look at the trends and at the contribution from each spatial scale, the finest scale is basically just adding noise, often a lot of noise, which makes estimation harder. Perhaps just complex enough and no more – which I believe is essentially something Einstein said.

And oops just noticed this is likely going in the wrong place – meant to respond to Andrew’s comment.

Wow!

Your post appeared 2 minutes (!) after I asked about model comparison here:

https://gelmanstatdev.wpengine.com/2020/06/18/estimating-the-effects-of-non-pharmaceutical-interventions-on-covid-19-in-europe/#comment-1362950

… so this is really helpful for me. Thanks!

I was so disappointed that I wasn’t able to join the lecture until near the end! I keep checking back to see if they’ve posted it online yet so I can see the whole thing.

I’m wondering if the reference model approach is similar to Hinton’s Dark Knowledge (https://arxiv.org/abs/1503.02531 etc). To quote a bit from the paper — this is on a classification task:

“An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” for training the small model. For this transfer stage, we could use the same training set or a separate “transfer” set. When the cumbersome model is a large ensemble of simpler models, we can use an arithmetic or geometric mean of their individual predictive distributions as the soft targets. When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.

For tasks like MNIST in which the cumbersome model almost always produces the correct answer with very high confidence, much of the information about the learned function resides in the ratios of very small probabilities in the soft targets. …”

As I come to understand reference models better, I also want to see the trade-offs and how applicable the approach is to non-Bayesian techniques.
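The temperature trick in the quoted passage, and why soft targets carry more information than hard labels, can be sketched with a toy three-class example (the logits below are made up for illustration):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

# A confident "cumbersome" model: almost all mass on class 0.
logits = np.array([10.0, 3.0, 1.0])

hard = softmax(logits, T=1.0)  # near one-hot; little signal about the small classes
soft = softmax(logits, T=5.0)  # higher temperature exposes the small-probability ratios

print(np.round(hard, 4))  # ~[0.999, 0.0009, 0.0001]
print(np.round(soft, 4))  # ~[0.708, 0.175, 0.117]
print(entropy(soft) > entropy(hard))  # True: soft targets carry more information per case
```

Dividing the logits by a temperature T > 1 before the softmax preserves the ordering but spreads the mass, so the ratios among the small probabilities (which the paper says hold much of the learned function) become visible to the small model during training.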

I just added the link to the recorded video.