## We were measuring the speed of Stan incorrectly—it’s faster than we thought in some cases due to antithetical sampling

Aki points out that in cases of antithetical sampling, our effective sample size calculations were unduly truncated above at the number of iterations. It turns out the effective sample size can be greater than the number of iterations if the draws are anticorrelated. And all we really care about for speed is effective sample size per unit time.

NUTS can be antithetical

The desideratum for a sampler Andrew laid out to Matt was to maximze expected squared transition distance. Why? Because that’s going to maximize effective sample size. (I still hadn’t wrapped my head around this when Andrew was laying it out.) Matt figured out how to achieve this goal by building an algorithm that simulated the Hamiltonian forward and backward in time at random, doubling the time at each iteration, and then sampling from the path with a preference for the points visited in the final doubling. This tends to push iterations away from their previous values. In some cases, it can lead to anticorrelated chains.

Removing this preference for the second half of the chains drastically reduces NUTS’s effectiveness. Figuring out how to include it and satisfy detailed balance was one of the really nice contributions in the original NUTS paper (and implementation).

Have you ever seen 4000 as the estimated n_eff in a default Stan run? That’s probably because the true value is greater than 4000 and we truncated it.

The fix is in

What’s even cooler is that the fix is already in the pipeline and it just happens to be Aki’s first C++ contribution. Here it is on GitHub:

Aki’s also done simulations, so the new version is actually better calibrated as far as MCMC standard error goes (posterior standard deviation divided by the square root of the effective sample size).

A simple example

Consider three Markov processes for drawing a binary sequence y[1], y[2], y[3], …, where each y[n] is in { 0, 1 }. Our target is a uniform stationary distribution, for which each sequence element is marginally uniformly distributed,

```Pr[y[n] = 0] = 0.5     Pr[y[n] = 1] = 0.5
```

Process 1: Independent. This Markov process draws each y[n] independently. Whether the previous state is 0 or 1, the next state has a 50-50 chance of being either 0 or 1.

Here are the transition probabilities:

```Pr[0 | 1] = 0.5   Pr[1 | 1] = 0.5
Pr[0 | 0] = 0.5   Pr[1 | 0] = 0.5
```

More formally, these should be written in the form

```Pr[y[n + 1] = 0 | y[n] = 1] = 0.5
```

For this Markov chain, the stationary distribution is uniform. That is, some number of steps after initialization, there’s a probability of 0.5 of being in state 0 and a probability of 0.5 of being in state 1. More formally, there exists an m such that for all n > m,

```Pr[y[n] = 1] = 0.5
```

The process will have an effective sample size exactly equal to the number of iterations because each state in a chain is independent.

Process 2: Correlated. This one makes correlated draws and is more likely to emit sequences of the same symbol.

```Pr[0 | 1] = 0.01   Pr[1 | 1] = 0.99
Pr[0 | 0] = 0.99   Pr[1 | 0] = 0.01
```

Nevertheless, the stationary distribution remains uniform. Chains drawn according to this process will be slow to mix in the sense that they will have long sequences of zeroes and long sequences of ones.

The effective sample size will be much smaller than the number of iterations when drawing chains from this process.

Process 3: Anticorrelated. The final process makes anticorrelated draws. It’s more likely to switch back and forth after every output, so that there will be very few repeating sequences of digits.

```Pr[0 | 1] = 0.99   Pr[1 | 1] = 0.01
Pr[0 | 0] = 0.01   Pr[1 | 0] = 0.99
```

The stationary distribution is still uniform. Chains drawn according to this process will mix very quickly.

With an anticorrelated process, the effective sample size will be greater than the number of iterations.

Visualization

If I had more time, I’d simulate, draw some traceplots, and also show correlation plots at various lags and the rate at which the estimated mean converges. This example’s totally going in the Coursera course I’m doing on MCMC, so I’ll have to work out the visualizations soon.

1. Keith O'Rourke says:

Bit of history – antithetical sampling goes back to (at least) 1956 – chapter 5 http://www.cs.fsu.edu/~mascagni/Hammersley-Handscomb.pdf

2. Aki Vehtari says:

To be more specific about what was truncated: N_eff computation uses truncated sum of autocorrelations. Stan had wrong truncation of the autocorrelation series, so that it gave quite close the intended result if lag 1 autocorrelation was estimated to be positive, but if the lag 1 was estimated to be negative it truncated too early. Antithetic Markov chain has negative odd lag autocorrelations, and the correct truncation needs to use paired sum of even and odd lags, and truncate only when the estimated pair sum is negative (Geyer’s intial positive sequence). Corrected truncation allows N_eff>N. The pull requests adds also Geyer’s initial monotone sequence to reduce the variance of N_eff estimate.

3. You defined MCMC standard error as “posterior standard deviation divided by the square root of the effective sample size.” I think you meant standard deviation of the MCMC sample.

• Right—everything’s based on estimates. We estimate posterior standard deviation using the sample standard deviation. We also estimate effective sample size. Then we calculate an estimate of MCMC standard error based on these estimates.

• Oh, and I forgot to mention that Aki ran simulations for calibration where he knew the MCMC estimate and knew the true value, so he could calculate the actual estimation error. The MCMC standard error then gives you the expected distribution of the errors.

4. Ed says:

Is there a new upper bound on N_eff? That is, how much larger than N could N_eff become?