## How to figure out what went wrong with this model?

Tony Hu writes:

Could you please help look at an example of my model fitting? I used a very flexible model—Bayesian multivariate adaptive regression spline. The result is as follows:

I fitted the corn yield data with multiple predictors for counties of the US (the figure shows results of Belmont County in Ohio). My advisor told me that there was something wrong with the model based on two problems exposed in the figure:

1. As marked by ellipses in the figure, the model behaves opposite to the data (e.g., the data points are going down while the model is going up, or vice versa). Such behavior happens not only in this county but also in some other counties. I understand that a fitted model does not have to go through every data point, but should it follow the trend that the data points show?

2. As marked by arrows, the model sometimes can capture the very low yield, but sometimes it does not. Is that normal?

I can’t tell what was wrong with the model. Do you have any thoughts?

My reply:

I don’t know what’s wrong with the model! I’ve never fit a multivariate adaptive regression spline, myself. But I thought this would be a great excuse to talk more about workflow.

The above graph does look problematic. Something’s definitely wrong. The ellipse on the left side of the plot doesn’t look so bad—it just looks like the fitted curve is doing some smoothing—but the ellipse on the right looks wrong. And the cases where the fitted curve is more extreme than the data—that makes no sense at all. Sometimes I’ve seen time series models overshoot the data; for example, random walk models “want” to be quadratic curves, and this can sometimes cause an estimated curve to have extreme peaks—but nothing like that seems to be going on here. One possibility is that the hierarchical model is estimating these peaks from other counties and partially pooling.
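To make the partial-pooling possibility concrete, here is a minimal sketch of how a hierarchical model pulls one county's estimate toward the overall average. It assumes a simple normal-normal model with known variances, which is not the spline model from the question; the function name `partial_pool` and all the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def partial_pool(county_mean, n_obs, sigma_y, mu_all, tau):
    """Precision-weighted compromise between a county's own mean and the overall mean.

    Assumes a normal-normal hierarchical model with known variances:
    sigma_y is the within-county sd, tau is the county-level (between-county) sd.
    """
    w_county = n_obs / sigma_y**2   # precision of the county's own data
    w_all = 1.0 / tau**2            # precision of the county-level distribution
    return (w_county * county_mean + w_all * mu_all) / (w_county + w_all)

# Hypothetical numbers: one unusually low yield in a county, overall mean much higher.
low_yield, overall = 100.0, 150.0
for tau in [5.0, 20.0, 50.0]:
    est = partial_pool(low_yield, n_obs=1, sigma_y=20.0, mu_all=overall, tau=tau)
    print(f"tau = {tau:5.1f}  pooled estimate = {est:6.1f}")
```

With a small county-level standard deviation the estimate sits close to the overall mean, and with a large one it sits close to the county's own data. Under these assumptions, strong pooling is one way a fitted curve could fail to follow a county's extreme years, or could borrow a dip or peak from elsewhere.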

So, what to do? Take the model apart. Fit the model to subsets of the data. Fit the model to simulated data where you know the true parameter values. Fit the model in Stan so you can play around with it. If you’re concerned about oversmoothing across counties, increase the county-level variance parameter. Basically, you’ll have to build some scaffolding.
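As a rough illustration of the simulated-data and subset checks, here is a minimal sketch. It uses a single fixed-knot hinge basis (the building block of adaptive regression splines) fit by least squares, standing in for the full Bayesian model; the knot, coefficients, noise level, and predictor range are all made-up assumptions, and `hinge_basis` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def hinge_basis(x, knot):
    """Design matrix: intercept plus the pair of hinge functions at one knot."""
    return np.column_stack([
        np.ones_like(x),
        np.maximum(0.0, x - knot),
        np.maximum(0.0, knot - x),
    ])

# "True" values, chosen arbitrarily so we know what the fit should recover.
true_coef = np.array([150.0, -4.0, -2.0])
knot, sigma = 25.0, 5.0

# Simulate a predictor and an outcome from the known model.
x = rng.uniform(10.0, 40.0, size=500)
X = hinge_basis(x, knot)
y = X @ true_coef + rng.normal(0.0, sigma, size=x.size)

# Fit to the simulated data and compare with the truth.
est_coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ est_coef
est_sigma = np.sqrt(resid @ resid / (x.size - X.shape[1]))
print("true coefficients:     ", true_coef)
print("estimated coefficients:", np.round(est_coef, 2))
print("true sigma:", sigma, " estimated sigma:", round(float(est_sigma), 2))

# Refit on two halves of the data; large disagreement would be a warning sign.
half = x.size // 2
coef_a, *_ = np.linalg.lstsq(X[:half], y[:half], rcond=None)
coef_b, *_ = np.linalg.lstsq(X[half:], y[half:], rcond=None)
print("first half: ", np.round(coef_a, 2))
print("second half:", np.round(coef_b, 2))
```

If the fitting procedure cannot recover parameters from data it should fit perfectly well, the problem is in the model or the code rather than in the corn yields. The same pattern applies to the real model: swap in its fitting routine and simulate from its own assumptions.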