Using partial pooling when preparing data for machine learning applications

Geoffrey Simmons writes:

I reached out to John Mount/Nina Zumel over at Win Vector with a suggestion for their vtreat package, which automates many common challenges in preparing data for machine learning applications.
The default behavior for impact coding high-cardinality variables had been a naive Bayes approach, which I found to be problematic due to its multi-modal output (assigning probabilities close to 0 and 1 for levels with small sample sizes). This seemed like a natural fit for partial pooling, so I pointed them to your work/book and demonstrated its usefulness from my experience/applications. It’s now the basis of a custom-coding enhancement to their package.
You can find their write-up here.
Cool. I hope their next step will be to implement it in Stan.
It’s also interesting to think of Bayesian or multilevel modeling being used as a preprocessing tool for machine learning, which is sort of the flipped-around version of an idea we posted the other day, on using black-box machine learning predictions as inputs to a Bayesian analysis.  I like these ideas of combining different methods and getting the best of both worlds.
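
To make the contrast in Simmons's note concrete, here is a minimal Python sketch of why per-level estimates go to 0 or 1 for rare levels and how pooling toward the overall rate tames that. It is not vtreat's implementation and not a fitted multilevel model: the shrinkage formula is a simple empirical-Bayes-style stand-in for partial pooling, and the function names, the toy data, and the `prior_strength` parameter are all assumptions made up for illustration.

```python
from collections import defaultdict


def level_counts(levels, outcomes):
    """Return per-level observation counts and outcome sums."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for level, y in zip(levels, outcomes):
        counts[level] += 1
        sums[level] += y
    return counts, sums


def raw_level_rates(levels, outcomes):
    """No pooling: each level's rate is its own sample mean."""
    counts, sums = level_counts(levels, outcomes)
    return {level: sums[level] / counts[level] for level in counts}


def pooled_level_rates(levels, outcomes, prior_strength=10.0):
    """Shrink each level's rate toward the overall rate.

    prior_strength (hypothetical tuning parameter) acts like a number of
    pseudo-observations at the overall rate, so levels with few rows are
    pulled strongly toward it and levels with many rows barely move.
    """
    counts, sums = level_counts(levels, outcomes)
    overall = sum(outcomes) / len(outcomes)
    return {
        level: (sums[level] + prior_strength * overall)
        / (counts[level] + prior_strength)
        for level in counts
    }


# Toy data: level "a" has 100 rows with 50 positives; level "b" has one row.
levels = ["a"] * 100 + ["b"]
outcomes = [1] * 50 + [0] * 50 + [1]

raw = raw_level_rates(levels, outcomes)
pooled = pooled_level_rates(levels, outcomes)

for level in sorted(raw):
    print(level, round(raw[level], 3), round(pooled[level], 3))
```

With one positive row, the unpooled estimate for level "b" is exactly 1, while the pooled estimate stays near the overall rate; level "a", with plenty of data, is almost unchanged. A real multilevel model would estimate the amount of shrinkage from the data rather than fixing it in advance.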

One Comment

  1. John Mount says:

    Definitely have some Stan projects in the pipeline.

    Also, it is fun to try to sneak some well-founded Bayesian methods into machine learning (and evidently also vice versa). I think there is a lot to be gained.

    Finally, I really suggest that R users working on machine learning or predictive modeling try out vtreat; it can be game-changing (it makes messy real-world data behave almost as well as example data).
