*“Hey, remember me? I’ve been busy working like crazy”* – Fever Ray

I’m at the Banff International Research Station (BIRS) for the week, which is basically a Canadian version of Disneyland where during coffee breaks a Canadian woman with a rake politely walks around telling elk to “shoo”.

The topic of this week’s workshop isn’t elk-shooing, but (more interestingly) spatial statistics. But the final talk of the workshop spurred some non-spatial thoughts.

Lance Waller was tasked with drawing together the various threads from the last few days, but he actually did something better. He spent his 45 minutes talking about the structures and workflows that make up applied spatial statistics. Watch the talk. It’s interesting.

#### Rage of Travers

The one thing in Lance’s talk that I strongly disagreed with was the way he talked about priors in Bayesian analysis and penalties in frequentist analysis as coming from the same place.

Now Lance is a world-beating epidemiologist and biostatistician and, as such, works in a field where some people are vehemently against specifying priors. In this context, it makes sense to emphasize the similarities between the two approaches, especially given that a log-posterior density and a penalized log-likelihood look exactly the same. Let a thousand blossoms bloom, as far as I am concerned. But I ain’t spendin’ any time on it, because in the meantime, every three months, a person is torn to pieces by a crocodile in North Queensland.

My fear is always that if we aren’t careful when emphasizing that two things are very similar, people will think that they’re the same. And a penalty is not a prior.

Why not? Well, they focus on different things. A penalty is concerned only with what the maximum looks like. For example, the Lasso can be interpreted as saying

> bring me the value of $latex \beta$ that maximizes the likelihood, while ensuring that $latex \sum_j |\beta_j|$ is less than some fixed (small) number.
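As a concrete sketch of that penalized view (the data and penalty strength here are made up purely for illustration, using scikit-learn’s `Lasso` in the equivalent Lagrangian form):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.0, 0.5]  # only three nonzero coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Penalized maximum likelihood: the budget on sum_j |beta_j| only
# shapes where the maximizer sits; it says nothing about spread.
fit = Lasso(alpha=0.1).fit(X, y)
print(fit.coef_)  # several coefficients are exactly zero
```

The output is a single point, the constrained maximizer; nothing about it tells you how plausible nearby values of $latex \beta$ are.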

On the other hand, a prior worries about what the plausible spread of the parameters is. The (terrible) Bayesian Lasso prior can be interpreted as saying

> I think the plausible values of $latex \beta_j$ lie between $latex \pm 3.7\lambda^{-1}$, where $latex \lambda$ is the rate parameter of the double exponential distribution. I also think that 5% of the time, the value of $latex \beta_j$ lies between $latex \pm 0.03\lambda^{-1}$.
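Statements like these can be read straight off the double exponential distribution: with rate $latex \lambda$, $latex P(|\beta_j| \le x) = 1 - e^{-\lambda x}$. A minimal sketch (the choice of $latex \lambda$ is arbitrary; the mass inside $latex \pm 3.7\lambda^{-1}$ doesn’t depend on it):

```python
import math

def laplace_central_mass(x, lam):
    """P(|beta| <= x) under a double exponential prior with rate lam."""
    return 1.0 - math.exp(-lam * x)

lam = 2.0  # arbitrary illustrative rate
# +/- 3.7/lam captures roughly 97.5% of the prior mass, whatever lam is
print(laplace_central_mass(3.7 / lam, lam))
```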

To put it differently: A penalty is a statement about what a maximum would look like, while a prior is a statement about where the estimate is likely to be (or, conversely, where the estimate is likely not to be).

This is yet another way to see that the Bayesian Lasso will struggle to support both small and large $latex \beta_j$.

#### Firewood and candles

This pushes us again to one of the biggest differences between Bayesian and classical methods: when you can begin simulating new data. With classical methods, you can only predict new data after you’ve already seen data from the process; with Bayesian methods, you can predict before you see any data.

Obviously, you should be able to make better predictions after you’ve seen data, but the key idea is that you can use the prior predictive to do useful things!
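For instance, in a simple linear regression you can simulate entire datasets from the prior predictive before collecting anything. A minimal sketch, with hypothetical priors chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_obs = 1000, 50
x = np.linspace(0.0, 1.0, n_obs)

# Hypothetical priors: slope ~ N(0, 1), noise scale ~ half-normal(0, 1)
beta = rng.normal(0.0, 1.0, size=n_sims)
sigma = np.abs(rng.normal(0.0, 1.0, size=n_sims))

# Prior predictive: one fake dataset per prior draw, before any real data
y_sim = beta[:, None] * x[None, :] + sigma[:, None] * rng.normal(size=(n_sims, n_obs))

# Show a scientist the range of datasets the prior thinks are plausible
print(np.quantile(y_sim, [0.05, 0.5, 0.95]))
```

If the simulated datasets look absurd to the people who know the science, the prior (or the model) needs work, and you find that out before seeing a single observation.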

#### Antarctica starts here

At some point after his talk, I was talking to Lance and he mentioned a colleague of his who is strongly against adding prior information. This colleague was of the view that you should elicit more data rather than eliciting a prior.

But really, that’s a key use of the prior predictive distribution. Often you can elicit information from scientists about what the data will (or will not) look like, and the prior predictive distribution is useful for checking how well you have incorporated this information into your model. Or, to put it differently: when you build generative models, you can generate data, so you should ask people what the data you generate should look like.

+1 for the crocodile reference.

Without having watched Lance’s talk, I think the connection between penalties and priors can be big, useful news for some researchers. I work with people who use ML methods and think of Bayesian methods as “something that takes too long”. A group of them seemed surprised that penalties could be chosen intelligently using something other than cross-validation, which becomes intractable with even a few penalty parameters.

But of course, your point stands: while MAP estimates are exactly equivalent to a certain penalized MLE, MAP estimates are not exactly the gold standard for Bayesian estimation.

It’s also worth noting the point made in BDA: sometimes it makes a lot of sense to use a different prior if you plan on using a MAP estimate rather than sampling from the posterior. Your favorite Lasso example is clearly such a case, and BDA also points out that, especially for variance terms, zero-avoiding priors are important for MAP estimates, even if they don’t reflect prior beliefs about the parameters.

I agree with this sentiment, but I think it’s much cleaner to say “use penalised maximum likelihood with an appropriate penalty”. Not everything needs to be “Bayesian” (especially not things that aren’t* Bayesian).

* I try really hard to stay away from drawing a line in the sand around what is and isn’t Bayesian, but if you change your “prior” based on your estimation method (you’d never use a boundary avoiding prior with a full posterior calculation – it’s not what they’re made for), then it’s not a prior.

I think it’s worth noting that in penalized frequentist methods, you can incorporate prior information and you can make predictions before you’ve seen any data from the process. For example, when I don’t have the patience to do a full Bayesian regression and I have a prior estimate for the regression coefficients (just prior means and not prior variances), I will run an elastic net model and penalize the difference in the coefficients from the prior mean rather than from zero. Further, if I do have some idea as to the prior variances of the coefficients, I can penalize coefficients inversely proportional to my prior variance estimate. Obviously, this is not ideal when compared to using Stan, but it’s faster and easier to program (I can do this using glmnet for example).
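A sketch of that trick, with hypothetical data, using scikit-learn’s `ElasticNet` in place of glmnet: penalizing the deviation from a prior mean amounts to regressing the residual $latex y - X\beta_0$ on $latex X$, and per-coefficient penalty weights can be implemented by rescaling the columns:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_prior = np.array([1.2, -1.8, 0.0, 0.0, 2.5])  # hypothetical prior means
prior_sd = np.array([0.5, 0.5, 1.0, 1.0, 1.0])     # hypothetical prior scales

# Penalize delta = beta - beta_prior instead of beta: regress the
# residual y - X @ beta_prior on X. Scaling column j by prior_sd[j]
# makes the uniform penalty act like one proportional to 1/prior_sd[j].
fit = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False)
fit.fit(X * prior_sd, y - X @ beta_prior)

beta_hat = beta_prior + fit.coef_ * prior_sd
```

Before seeing any data, the (point) prediction for a new `x_new` is just `x_new @ beta_prior`; afterwards it is `x_new @ beta_hat`.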

You still need to get the parameters from somewhere to build a frequentist predictive distribution. Where do they come from?

I wouldn’t do this if I needed a full predictive distribution. I mean that I can make predictions (just the mean) for new observations before seeing any data using my prior mean and afterwards using the model I’ve fit.

Hi Dan, a bit off-topic, but can you point me on any literature how to use a prior to regularize an ill-posed problem (Hessian of the likelihood has very large eigenvalues)? I don’t want to / cannot incorporate substantial prior knowledge, because very little is known about the quantity to be estimated (the autocovariance of a process). In particular, does it make sense to tailor the prior to the specific ill-posedness of the problem? Thanks!

I’d need to know a bit more a about the problem to have a firm opinion, but my default method for priors on autocorrelation is this: https://arxiv.org/abs/1608.08941

Dan, thanks for the answer and pointer to the paper. But my question was meant more generally: Are there specific Bayesian ways to deal with ill-posed problems? Or is the idea that nothing special is necessary because having a decent prior allows one to deal with well-posed and ill-posed problems in the same way?

Not commenting on Dan’s points, just the relationship between penalized likelihoods and priors:

Kimeldorf and Wahba (1970), A Correspondence Between Bayesian Estimation on Stochastic Processes and Smoothing by Splines (https://projecteuclid.org/euclid.aoms/1177697089)

Akaike (1979), Likelihood and the Bayes procedure.

There is quite a large time series literature from that era about the relationship between smoothness priors and penalized likelihoods, in particular the work of Kitagawa and Gersch (summarized in 1996 in http://www.springer.com/us/book/9780387948195). While these days most of the literature points to the “British” take on time series decomposition, this sphere of work predates it, IMO, though they lead to the same place. Basically, the Kalman smoother/filter can be derived as a Bayesian solution of a kind.

There definitely is a lot of this about – you can always consider a penalty as a log-prior. But that prior may not work well (and for a lot of famous penalties, it doesn’t).

For example, this paper by Lassas and Siltanen shows that the Kimeldorf and Wahba stuff doesn’t hold if you replace the L2 norm with an L1 norm

http://iopscience.iop.org/article/10.1088/0266-5611/20/5/013

OH, but rabbit tastes so much better than duck!

http://gelmanstatdev.wpengine.com/2017/04/19/representists-versus-propertyists-rabbitducks-good/

But seriously, I think the biggest loss of opportunity comes from ignoring the prior when penalizing and ignoring repeated sampling performance when doing Bayes.

If penalizing corresponds to a silly prior, that should not be disregarded without very good reasons; and when Bayes leads to poor repeated sampling performance, that should not be disregarded without very good reasons for sticking with the prior.

Rod Little argues for the latter in this talk – https://ww2.amstat.org/meetings/ssi/2017/onlineprogram/AbstractDetails.cfm?AbstractID=304106

> If penalizing corresponds to a silly prior, that should not be disregarded without very good reasons; and when Bayes leads to poor repeated sampling performance, that should not be disregarded without very good reasons for sticking with the prior.

I agree with the last part. I don’t agree with the first part. Because a prior has to deal with things like containment and penalties don’t, you can get away with much simpler penalties.

It wasn’t clear to me whether Keith missed that point or if he got it but just wanted to discuss a different point, so I forbore to comment.

I just assumed he didn’t agree, which is fair enough.

If penalizing corresponds to a silly prior that should not be disregarded without very good reasons… such as it also corresponding to a sensible prior. Penalized estimates can correspond to multiple priors – an entire equivalence class of them – so just one of these priors being “silly” is a weak criticism of the penalized estimate.

I was being too vague (or wrong?).

It was in reference to case study 1 in the linked Representists versus Propertyists post, where the repeated sampling performance of a “silly” (flat) prior shows that a property usually taken as good (e.g. uniform coverage of credible intervals that match confidence intervals) is actually poor in some meaningful sense when a sensible prior is used (one informative for small effects in a given context).

So it’s not that you need a sensible prior to get various good frequency properties, but that to know whether the properties are really good (good for what?) they should be evaluated under a sensible prior. On the other hand, if there is reason for the property to be taken as all-important/beyond question, then the silliness can be disregarded.

Dan, I’m sure you’re not protected, for it’s plain to see the Diamond Dogs are poachers and they hide behind trees.