Radford:

Thanks for commenting!

Regarding your first two paragraphs: I’m starting from the position that HMC is better than simple Metropolis. When I talk about problems with HMC, I’m not saying that Metropolis is better, I’m saying that these are problems that even HMC gets stuck on.

But I don’t have a good intuition about this. I guess the way for me (and others) to understand your last two paragraphs above is to do some simulations with some simple examples.

]]>Andrew: You say, “Radford talked about the challenge of HMC that it stays in a narrow band of potential energy. HMC can move efficiently within the level sets of log p, but it has random walk behavior in the dimension of log p itself.”

I don’t think I said that, since I don’t think it’s true, at least when comparing HMC with simple alternatives such as random walk Metropolis updates.

Caracciolo, Pelissetto, and Sokal pointed out that a Metropolis update can change log p by only order 1 on average, and this constrains how fast a pure Metropolis method can be, since it has to explore the range of log p by a random walk with steps of order 1. HMC has a Metropolis step, but it also has a momentum replacement step, which can change log p by an amount that is typically comparable to its range (about half the range, in some simple cases, I think).
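
To put rough numbers on this (a toy numpy sketch of my own, not anything from the Caracciolo et al. paper): in a d-dimensional standard normal, log p has spread of order sqrt(d), a tuned random-walk Metropolis proposal changes it by order 1 (bounded as d grows), and a full momentum refresh changes the energy by order sqrt(d):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100

# Marginal spread of log p(theta) for a d-dim standard normal:
# log p = -(d/2)*log(2*pi) - ||theta||^2/2, so sd(log p) = sqrt(d/2).
theta = rng.standard_normal((10_000, d))
logp = -0.5 * (theta ** 2).sum(axis=1)   # up to the additive constant
print("sd of log p:", logp.std())        # ~ sqrt(d/2) = 7.07

# Change in log p from one well-scaled random-walk Metropolis proposal
# (step size ~ 2.4/sqrt(d) is the classic optimal scaling):
eps = 2.4 / np.sqrt(d)
prop = theta + eps * rng.standard_normal((10_000, d))
dlogp = -0.5 * (prop ** 2).sum(axis=1) - logp
print("mean |delta log p| per proposal:", np.abs(dlogp).mean())  # stays order 1 in d

# Change in kinetic energy from a full momentum refresh in HMC:
k_old = 0.5 * (rng.standard_normal((10_000, d)) ** 2).sum(axis=1)
k_new = 0.5 * (rng.standard_normal((10_000, d)) ** 2).sum(axis=1)
print("mean |delta K| per momentum refresh:", np.abs(k_new - k_old).mean())  # order sqrt(d)
```

The momentum refresh moves the energy by far more per iteration than the Metropolis proposal moves log p, which is the point of the comparison.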

Maybe you’re thinking of a comment I may have made somewhere or other in a different context about how trajectories accepted by HMC can’t typically change H by more than about order 1. This is of relevance when the variation in log p is not just due to high dimensionality, but also to extreme skewness or multimodality. For example, suppose a high-dimensional distribution has two modes, one narrow and high, the other broad and low, with equal probability in each mode. You can’t expect trajectories that move from one mode to the other to be accepted often, because the probability density of a point in the low, broad mode is much, much lower than for a point in the high, narrow mode. (If the difference in modes is extreme enough, variation in the kinetic energy won’t be enough to solve this problem.) This motivates the modifications to HMC discussed towards the end of my 2011 article.

]]>Carlos:

It’s hard for me to say. In the document quoted in my above post, Michael seems to be referring to the typical set as a subset of Theta. But I couldn’t find any precise definition of typical set there, which is why I went to Wikipedia.

]]>“Because of the law of large numbers, this problem does not arise with the classical definition of typical set as n gets large. But since we’re working with n=1, it’s a concern. And the definitions implicitly used by the Stan people are clearly working within Theta, not Theta^n. See for example the quotes by Bob and Michael above. Just to be clear, the n in the official definition of the typical set is not the same as the dimension of theta, which is 10 in my example. In my example, Theta is 10-dimensional, and Theta^n, if we were to work with it, would have dimension 10n.”

Andrew, I think you added a clarification in that paragraph. But it’s really far from clear that “the Stan people” are working with N=1. At least this discussion from Michael Betancourt suggests otherwise:

“In information theory compression is defined with respect to messages comprised of words. In isolation each component word is likely to be near the modal word, but as we construct longer and longer messages we provide more and more opportunity for some of the words to deviate. The probability that a long message contains no deviant words or all deviant words both become negligible, with most messages confined to the set of messages with “typical” fluctuations.”

n is the number of words in a message (the length of the sequence x_1, …, x_n); the elements of the typical set are messages (sequences) of length n.

“Probabilistically we can interpret a message as a point in our ambient space, with each word interpreted as the projection of that point along each dimension in the space. Any single component is likely to be found near the projection of the mode because of the pull of the target probability density function. The volume, however, induces fluctuations in some of the components which push the constituent point away from the global mode. Exact samples are “typical”, each influenced by fluctuations that push it somewhere within the typical set.”

Messages with n words => Points with n coordinates

]]>Thanks Andrew…I will check out the comments there.

Perhaps the thing I am missing is that the “typical set” is useful in the initial optimisation/warm-up phase used in Stan. After we have a “good” set of MCMC samples, the “typical set” is then not important (…the angle I have been coming from).

cheers

Al

]]>Al:

I followed up with this on the Stan forums:

]]>Talking about log p rather than the typical set has been very liberating, and now it is making me think of some other questions.

One way to think of the typical set is in terms of level sets of log p. The typical set of theta corresponds to the level sets of log p(theta) near its expected value. But then this makes me think of . . . slice sampling! In slice sampling, we’re implicitly reparameterizing the density in terms of level sets of p, which is mathematically equivalent to level sets of log p.

In particular, slice sampling has two steps: Draw theta from within the level set of constant p(theta), and draw a value of p(theta) given theta. That first step (draw theta given p(theta)) is a lot like HMC. This is no surprise–the original NUTS algorithm uses slice sampling–but at the very least it’s helping me understand what people have been talking about when they say “typical set.”

Hamiltonian paths have constant energy: that’s potential energy (i.e., – log p) and kinetic energy (from the speed of the particle moving through space). In his famous 2011 article, Radford talked about the challenge of HMC that it stays in a narrow band of potential energy. HMC can move efficiently within the level sets of log p, but it has random walk behavior in the dimension of log p itself.
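
For readers who haven’t seen it, those two slice-sampling steps can be sketched in a few lines. This is a minimal univariate version with Neal-style stepping out and shrinkage, my own toy code rather than anything from Stan:

```python
import numpy as np

rng = np.random.default_rng(1)

def logp(x):
    return -0.5 * x * x  # standard normal target, up to a constant

def slice_sample(x0, n_draws, w=1.0):
    """Univariate slice sampler with stepping out and shrinkage."""
    draws = np.empty(n_draws)
    x = x0
    for i in range(n_draws):
        # Step A: draw an auxiliary level under the density at x; the
        # "slice" is the level set {x' : log p(x') > log_u}.
        log_u = logp(x) - rng.exponential()
        # Step B: step out an interval around x that brackets the slice...
        left = x - w * rng.random()
        right = left + w
        while logp(left) > log_u:
            left -= w
        while logp(right) > log_u:
            right += w
        # ...then draw uniformly within it, shrinking on rejection.
        while True:
            x_new = rng.uniform(left, right)
            if logp(x_new) > log_u:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        draws[i] = x
    return draws

draws = slice_sample(0.0, 20_000)
print(draws.mean(), draws.var())  # roughly 0 and 1
```

Step A is the “draw a value of p(theta) given theta” step; the stepping-out/shrinkage loop is the “draw theta within the level set” step.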

But this makes me wonder: are there some HMC diagnostics that would help us understand this? We’re already monitoring log p and have a sense of how it varies. But we also know the distribution of kinetic energy of the Hamiltonian paths. As Bob put it (https://gelmanstatdev.wpengine.com/2020/08/02/the-typical-set-and-its-relevance-to-bayesian-computation/#comment-1399953), “In HMC, we use the gradient not to literally climb toward the mode, but to exert force toward the mode on our simulated particle that’s balanced with the particle’s momentum to keep the flow in the volume where we want to sample.”

So it seems that we should have internal evidence from our HMC runs (in warmup as well as in sampling) as to how the sampler is moving within and between level sets of log p. I’m wondering if this sort of thinking could move us beyond talking about the typical set?
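
One cheap diagnostic along these lines compares iteration-to-iteration energy changes to the marginal spread of the energy. The sketch below is my own construction (it is essentially the idea behind the E-BFMI statistic that Stan reports), using simulated stand-in traces rather than real sampler output:

```python
import numpy as np

def energy_mixing_ratio(energy):
    """Variance of iteration-to-iteration energy changes over the marginal
    energy variance. Near 2: nearly independent energy draws. Near 0:
    random-walk behavior in the energy (slow exploration of log p)."""
    de = np.diff(energy)
    return de.var() / energy.var()

rng = np.random.default_rng(2)
n = 50_000

# Stand-in traces of -log p: one mixing well, one doing a slow random walk.
fast = rng.standard_normal(n)          # nearly independent energy draws
slow = np.empty(n)                     # AR(1) with high autocorrelation
slow[0] = 0.0
rho = 0.99
for i in range(1, n):
    slow[i] = rho * slow[i - 1] + np.sqrt(1 - rho ** 2) * rng.standard_normal()

print(energy_mixing_ratio(fast))  # close to 2
print(energy_mixing_ratio(slow))  # close to 0
```

For an AR(1) trace the ratio is about 2(1 - rho), so strong autocorrelation in the energy shows up immediately, which is exactly the “random walk in the dimension of log p” worry.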

Hi Bob,

Thanks for finding the time to reply.

2) I understand something about the origin of “typical set”, but see no relevance of these ideas in our field (=using HMC for Bayesian modelling). For example, Y is not discrete in our field. I am not encoding anything!

3) Hopefully I understand HMC. The U and K nomenclature came from Radford Neal’s (brilliant) “MCMC Using Hamiltonian Dynamics” book chapter, so hopefully not completely alien to you :-) For what it’s worth (on the nomenclature generally used for HMC), it might be helpful to consider replacing “q” with theta, since I expect most Stan users “think” model parameters = theta (or at least never use “q”). The only advantage I see in using “q” is consistency with early published material (but that shouldn’t completely restrict us…e.g. the H in HMC changing from “Hybrid” to “Hamiltonian” for example).

You wrote “The reason the typical set is useful is that it’s the domain for computing expectations, which is all we ever use MCMC for. “.

I am not sure I understand what specific expectations you are taking, and what relevance it has to the “typical set”. For me, I give each point (= vector of parameters, thetas) of my, say, 10000 samples 0.01% of my belief. I then construct summaries based on functions of these parameters (e.g. outcome = theta1 * (theta2)^theta3). Nowhere am I using the “typical set” per se; rather I am just assuming my 10000 samples are indeed a (pseudo) random sample from the joint posterior distribution of my model parameters. Perhaps in your applications you are doing something else, hence my ignorance/question here.

You wrote “We don’t explicitly constrain ourselves to it (in fact, there are infinitely many typical sets depending on coverage), but the effect of preserving the Hamiltonian in the dynamics keeps our draws in the typical set.”

As you mention, there is no actual “typical set”, but rather it is just a construct of whatever coverage we would want it to be (= I see it has no relevance to anything). I see no special relationship between the “typical set” and HMC. You could get a “typical set” from any MCMC method…it is no more specific to HMC than standard Metropolis-Hastings (MH) (…in Andrew’s nice example, he isn’t using HMC, and we are just looking at the distribution of log density).

Thus far, I still haven’t found anything useful/relevant about the “typical set”. If we want to explain that our joint posterior distributions are very difficult to effectively explore in high dimensions via, for example, MH, we can, but we do not need to reference any “typical set” to do that. Of note, Radford Neal’s book chapter on HMC makes no reference to the “typical set”.

Hope this was clear. This has proved very useful to me…thus far I haven’t read anything to make me think I am indeed “missing something” here (as I originally feared).

cheers

Al

]]>Hi Carlos,

Thanks for adding to this discussion. I fully agree with your comments “I think that’s all” and your last sentence “..but I don’t know about its practical relevance”.

cheers

Al

]]>> we’re trying to get across the point that random draws don’t wind up near the mode

And still, they are more likely to wind up in a unit volume around the mode than in a unit volume anywhere else. It’s true that they are more likely to wind up in a much larger region that excludes the mode. But the points in that region are very different from each other (unlike the points in a small region around the mode).

In the multivariate standard normal example the points at some fixed distance from the origin look exchangeable given the spherical symmetry of the distribution but if those co-ordinates represent parameters in a model the symmetry disappears. {mu=1, theta=-0.3, rho=0.2} and {mu=0.1, theta=0.6, rho=-0.9} are completely different animals.

While I’m stating the obvious, I think this may be another source of confusion for those who try to understand the relevance of these ideas.

]]>> The typical draws are the ones you want to use to calculate expectations with MCMC. (…) If we just take random draws (or any draws with the right stationary distribution like those from MCMC), these and other expectations work out.

It’s a good thing that random draws are not different from typical draws! Of course that’s why we say they are typical in the first place.

If you want to stress that random draws from a multivariate distribution will be far from the mode, because everything is far in high dimensions, just say that.

It seems that people get confused when they understand that instead of drawing from the distribution they should be drawing from some mysterious typical set, as if it were a different thing.

A much clearer message, I think, would be to explain that one has to take draws according to the actual distribution instead of “choosing draws near the mode”, whatever that means.

]]>2) -E[log p_Y(Y)] is the entropy of Y. It’s a more general way of thinking about uncertainty than variance. When Y is discrete, -E[log_2 p_Y(Y)] is how many bits on average a compressive coder needs to encode a stream of draws from Y. That’s where the notion of typical set comes from—the stream’s average negative log density converges to the entropy.
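
That convergence is easy to see numerically. A quick toy example of my own, using a biased coin: the average of -log2 p per draw settles down to the entropy, which is why long messages concentrate on about 2^(n*H) typical sequences:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.2  # probability of a 1 for a biased coin

# Entropy in bits: H = -(p*log2 p + (1-p)*log2(1-p)) ~= 0.722
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Asymptotic equipartition: the average -log2 probability of a long
# i.i.d. stream converges to H, so a typical length-n message has
# probability close to 2^(-n*H).
x = rng.random(100_000) < p
avg_bits = -np.mean(np.where(x, np.log2(p), np.log2(1 - p)))
print(H, avg_bits)  # both about 0.722
```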

3) The standard notation for Hamiltonian mechanics is H(q, p) = V(q) + T(p) where q is the position in parameter space and V(q) = -log p(q) is the potential energy. The kinetic energy is usually taken to be T(p) = p’ * inv(M) * p / 2, where M is the mass matrix. The reason we care about H(q, p) is that’s where we do the equivalent of the Metropolis accept step and that’s what’s preserved when we solve the implied differential equation for the trajectory of a particle with momentum p and position q.

But back in the world we care about, the momentum variable p is just for convenience and it’s the q variables we care about. Now we can get draws for q from draws for (q, p) just by dropping p, which is what we do. Then we can look at properties of the distribution of q.
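
As a toy illustration of H(q, p) being (approximately) preserved along a trajectory, here is a bare-bones leapfrog integrator for a 1-D standard normal target with identity mass matrix. This is my own sketch of the textbook scheme, not Stan’s implementation:

```python
# Leapfrog sketch for a 1-D standard normal target: V(q) = q^2/2,
# T(p) = p^2/2 (identity mass matrix). The integrator approximately
# conserves H(q, p) = V(q) + T(p) along the trajectory.

def grad_V(q):
    return q  # derivative of V(q) = q^2/2

def leapfrog(q, p, eps, n_steps):
    p = p - 0.5 * eps * grad_V(q)      # initial half step for momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                # full step for position
        p = p - eps * grad_V(q)        # full step for momentum
    q = q + eps * p
    p = p - 0.5 * eps * grad_V(q)      # final half step for momentum
    return q, p

def H(q, p):
    return 0.5 * q * q + 0.5 * p * p

q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, eps=0.01, n_steps=1000)
print(H(q0, p0), H(q1, p1))  # nearly equal despite a long trajectory
```

The energy error stays bounded (order eps^2) no matter how long the trajectory runs, which is what makes the Metropolis accept step on H so well behaved.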

The reason the typical set is useful is that it’s the domain for computing expectations, which is all we ever use MCMC for. We don’t explicitly constrain ourselves to it (in fact, there are infinitely many typical sets depending on coverage), but the effect of preserving the Hamiltonian in the dynamics keeps our draws in the typical set.

Andrew’s post isn’t saying this isn’t true, just that the bounds of the typical set were a bit more generous than we thought and it’s not the only set for which this construction works, just one that’s convenient for information-theoretic proofs.

]]>@Carlos: Setting aside this notion of typical set or concentration of measure, we’re trying to get across the point that random draws don’t wind up near the mode (they also don’t wind up near each other). You’re right that we’re defining functions like “distance to the mode” or “log density” to show that they are atypical of random draws. The typical draws are the ones you want to use to calculate expectations with MCMC. If you choose draws near the mode, they have high density, but they throw off expectations like E[log p(Y)] (the negative entropy) or E[d(Y, 0)] (average distance of a point to the origin). If we just take random draws (or any draws with the right stationary distribution like those from MCMC), these and other expectations work out.

For instance, in a standard multivariate normal, the distribution of square distances to the origin is chi-squared with degrees of freedom equal to the number of dimensions. Thus we don’t want a bunch of draws near the mode because they’ll throw off expectations like E[log p(Y)] or E[d(Y, 0)].
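
That chi-squared claim is easy to check by simulation (a quick numpy sketch of my own):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 100, 50_000
x = rng.standard_normal((n, d))
r2 = (x ** 2).sum(axis=1)    # squared distance to the origin ~ chi-squared(d)
print(r2.mean(), r2.var())   # about d = 100 and 2d = 200
print((r2 < d / 2).mean())   # essentially no draws anywhere near the mode
```

With d = 100, the fraction of draws whose squared distance is even half the typical value is negligible: the draws live in a thin shell, not around the mode.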

In HMC, we use the gradient not to literally climb toward the mode, but to exert force toward the mode on our simulated particle that’s balanced with the particle’s momentum to keep the flow in the volume where we want to sample.

]]>(As Andrew pointed out, the elements of the typical set are points.)

> All except POINT C have a Log-density of -14.19, since the sum of each term squared adds up to 10. POINT C is our mode, and hence a Log-density of -9.19. I struggle to see how POINTS A, B, D and E are notable in any way. Sure, their average contribution to the log-density is “typical”, but what more beyond that?

I think that’s all. What makes the set of points at distance sqrt(10) from the origin kind of notable is that when you take points at random from an N-dimensional standard normal (with N very large) they will all surely be at distance sqrt(N) from the origin (using an appropriate definition involving epsilons and limits and that kind of thing).

In the same way that when you take a sequence of N coin flips (with N very large) you will surely get the same number of heads and tails. In that sense, a sequence of 10 flips with 5 heads and 5 tails is more typical than a sequence with 6 heads and 4 tails or a sequence with 1 head and 9 tails (even though the probability of getting 5 heads and 5 tails is less than 25%). [The textbook definition of typical set would actually include all the possible sequences for a fair coin, as all have the same probability. I’m using “typical” in a more general sense in this paragraph.]
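
The arithmetic here is easy to verify with exact binomial probabilities:

```python
from math import comb

# Probability of exactly 5 heads in 10 fair flips: C(10,5)/2^10
p5 = comb(10, 5) / 2 ** 10
print(p5)  # 0.24609375 -- the single most likely count, yet under 25%

# Any *particular* sequence has probability 2^-10, so "5 heads" is
# typical only as a count, not as a statement about one sequence.
print(comb(10, 6) / 2 ** 10)  # 0.205078125
print(comb(10, 1) / 2 ** 10)  # 0.009765625
```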

You’re right that we are not usually confronted with infinite-dimensional standard normal distributions, though. The discussion about typical sets is interesting but I don’t know about its practical relevance.

]]>1) Yes, better to label A-E as points A-E. My observation is that points A, B, D and E are unremarkable/uninteresting to me, but happy to hear why I should care (e.g. A and E were deliberately made to be odd/unrepresentative in the parameter space).

2) I agree that log p(theta) is a univariate summary, but am struggling to understand why it is interesting to anyone!

3) Indeed the Hamiltonian preserves the total energy, but if Hamiltonian = U (potential) + K (kinetic) energies, and K is chosen randomly at each HMC iteration, why do we care about any particular value of U (= -log p(theta))? For example, if H = -5 (e.g. U = -4 and K = -1 OR U = -2 and K = -3), the value of U (-4 or -2) is not particularly of interest. Being provocative, who cares?

There perhaps are ways the “typical set” is of actual use, so hopefully someone will chip in and defend why and how they use it, which would be very interesting to me!

Cheers

Al

]]>Al:

As noted above, I think the Stan community might be better off abandoning the term “typical set,” at least until we can agree on a definition. Just a few comments:

1. To be consistent with our notation, you should label A, B, C, D, and E as “points” or values of theta, not as “sets.” We’re using the term “set” to refer to a region within Theta, not a single point.

2. When we’re talking about typical sets, we’re talking about values of log p(theta), which can be thought of as a one-dimensional aspect of theta.

3. There’s a reason to focus on log p(theta) rather than some other one-dimensional summary. That reason is that Hamiltonian paths preserve the total energy, and -log p(theta) is the potential energy in the Hamiltonian path.

]]>Many thanks for writing this blog article!

As a newcomer to the Stan world (but not Bayesian statistics), perhaps the most confusing topic I have encountered has been the “typical set”. As an analyst, I understand why I want to use NUTS/HMC, but thus far haven’t managed to appreciate why I need to concern myself with the “typical set”. Here are some observations on your informative notes:

In your d=10 Multivariate Normal (MVN) example, you mentioned that the mode had log density:

-(d/2)*log(2*pi) = -9.19

The expectation of the log density you observed across your 10000 samples (at -14.18) should be (and is) about 5 lower, as we are summing the squares of 10 independent N(0,1) variates, which will be chi-squared on 10 degrees of freedom. Hence this chi-squared distribution will have a mean of 10, and variance of 20. Thus we could describe the distribution of the log-density as something like:

Log-density = -(d/2)*log(2*pi) - chi-squared(d)/2

Various metrics of this simple “typical set” (like expected value, percentiles etc.) could be determined using the above for any d.
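
For what it’s worth, this distribution is easy to verify by simulation (a quick sketch of my own): with the formula above, E[log-density] = -(d/2)*log(2*pi) - d/2 and Var[log-density] = d/2.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 10, 100_000
theta = rng.standard_normal((n, d))

# Log density of each draw under the d-dim standard normal
logp = -(d / 2) * np.log(2 * np.pi) - 0.5 * (theta ** 2).sum(axis=1)

print(logp.mean())                   # about -14.19 for d = 10
print(logp.var())                    # about d/2 = 5.0
print(-(d / 2) * np.log(2 * np.pi))  # the mode's log density, -9.19
```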

However I would ask, what value is this? Why should I care? Consider the following 5 sets of theta1-theta10:

SET A = (-sqrt(10), 0, 0, 0, 0, 0, 0, 0, 0, 0)

SET B = (-1, -1, -1, -1, -1, -1, -1, -1, -1, -1)

SET C = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

SET D = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

SET E = ( sqrt(10), 0, 0, 0, 0, 0, 0, 0, 0, 0)

All except SET C have a Log-density of -14.19, since the sum of each term squared adds up to 10. SET C is our mode, and hence has a Log-density of -9.19. I struggle to see how SETS A, B, D and E are notable in any way. Sure, their average contribution to the log-density is “typical”, but what more beyond that? In the joint parameter space, none are particularly interesting. If we are to provide Stan with initial estimates, I cannot see why SET C wouldn’t be reasonable/appropriate, although clearly no random MCMC sample will ever have all 10 thetas exactly equal to zero (and hence SET C is a bound on the log-density, and hence “atypical” from a log-density perspective, but not from a parameter perspective).
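
The five log-densities can be checked directly (a quick sketch; the values match those quoted above):

```python
import numpy as np

d = 10
const = -(d / 2) * np.log(2 * np.pi)

def logp(theta):
    # Log density of a d-dim standard normal at theta
    return const - 0.5 * np.dot(theta, theta)

A = np.array([-np.sqrt(10)] + [0.0] * 9)
B = -np.ones(10)
C = np.zeros(10)
D = np.ones(10)
E = np.array([np.sqrt(10)] + [0.0] * 9)

for name, pt in zip("ABCDE", (A, B, C, D, E)):
    print(name, round(float(logp(pt)), 2))
# A, B, D, E all give -14.19; C (the mode) gives -9.19
```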

For me, the goal of the d=10 example is simply to obtain 10000 MCMC samples which yield (approximately) the true means and var-covariance matrix for theta1-theta10. In the real world, these joint distributions of our model parameters will not be MVN, and any summaries (as means, modes, variances, covariances etc.) will always be crude and simplistic representations of potentially much more complex interrelationships.

I hope this note was useful and error free, and I look forward to reading what others think on this interesting topic, since it would be useful to clarify for new/current users.

Cheers

Al

]]>I have no issues with the idea of concentration of measure.

My point is that including the word *concentrated* in the statement “the region around the mode has low probability and most of the probability mass is *concentrated* somewhere else” creates more confusion than enlightenment, given that this “somewhere else” is a vastly larger region where the density is very low.

Saying “most of the population of England is outside of London” is fine, saying “most of the population of England is *concentrated* outside of London” would be bizarre.

]]>But even in the pure measure theory context concentration of measure is defined in terms of the Lévy mean, not the original multi-dimensional space. I take this to be the general concept and the coding/information theory version to be a specific instance of it where stationarity or independence is assumed, allowing the multi-dimensional measure to be defined in terms of just the transition kernel or marginal distribution.

]]>But as Andrew mentioned, that concentration happens in a different space. In the usual example of the multivariate normal (with d going to infinity) there will be concentration of measure in the one-dimensional distribution of radius or density (a new random variable which depends on those many independent component random variables) and not in the original multi-dimensional space where the probability distribution of the multivariate normal is defined.

]]>What I mean is that it seems that to use the concept of typical set here one has to make the correspondence x_1 => θ_1 , … , x_n => θ_d.

Which may not make much sense, but I don’t see what alternative interpretation makes more sense.

]]>“Concentration” wasn’t even from Michael Betancourt…

]]>This is what one of the Stan guys writes in the case study linked in the post:

“For example, Bayesian parameter estimates minimizing expected square error are expectations of the value of a parameter θ

in the posterior (…) where we integrate over the legal values of θ∈Θ.”

“Rather than integrating over the support Θ of θ in the posterior, it suffices to integrate over the typical set (…) That is, if Θ

is the support for a density p(θ), then the typical set T⊆Θ has the property that for “well-behaved” functions, (…)”

θ is a multi-dimensional parameter {θ_1, …, θ_d}. Θ is the set of all possible values of θ.

The typical set is a subset of Θ, each element of the typical set is a value of θ, i.e. a multi-dimensional parameter {θ_1, …, θ_d}.

]]>Carlos, this is a source of confusion. In the usual information theory construction, there’s some parameter, which is *usually thought of as a univariate thing*, and then we discuss the typical set in the cross product of that univariate thing across n dimensions (ie. replications of the experiment). So you could think of

theta^10 as sequences of 10 numbers each of which comes independently from the same univariate distribution, which need to be transmitted across a communications medium

However, when the Stan group discusses it, they’re thinking usually of a single *vector* parameter, and the vector is not just repetitions of independent parameters, in fact what’s going on is

Theta^1

where Theta is a 10-dimensional vector with a complex dependent posterior distribution.

To use the information theoretic definitions, we’d have to think about sequences like

Theta^10 which would be a sequence of 10 vectors each of which has say 10 numbers, so that there’s 100 numbers total.

But that’s not what the Stan guys are typically discussing; typically they’re discussing points in the space

Theta^1

And I think Andrew is right that there’s confusion in using the words “typical set”. I like much better Andrew’s idea of saying “the set of typical density” or some such thing.

]]>Carlos:

Regarding point 2, yes, there’s some confusion in the notation. I’ll try to clean that up in the above post. In the official definition of typical set, there are points theta_1, …, theta_n. That is, each theta_i is itself a vector, a vector of length 10 in my example. It’s my bad that I introduced confusion by referring to theta_1, …, theta_10 near the end of my post.

]]>1. I know it wasn’t from you. That discussion of neighborhoods is in the space of theta, if the “concentrates in a neighborhood” has to be understood in other space the formulation is a bit confusing (which was my point).

2. The typical set in your example is defined in terms of theta_1, …, theta_10 as far as I can see.

You say “Using the wikipedia definition, the typical set for any epsilon corresponds to the values of theta for which u(theta) is within epsilon of – H. For example, if epsilon = 1, it’s the values for which u(theta) is in the range [-15.18, -13.18].”

Isn’t theta in the previous paragraph a 10-dimensional vector? Isn’t the typical set composed of 10-dimensional sequences?

]]>Carlos:

1. The first quote is from Michael Betancourt, not from me. In his defense, there’s a limit as to how far we can convey mathematical concepts in regular language. His point, I think, is that the typical set is concentrated in the space of u, not in the space of theta.

2. In my example, d=10 is the number of dimensions of the distribution. I don’t have an “n” in my example. The “n” parameter comes from the definition of typical set, which is defined in terms of theta_1,…,theta_n, so that the typical set is defined in the space Theta^n. I’m following the discussion in the Stan community, in which the typical set is defined for a single theta in the space Theta. Thus n=1.

3. It’s clear to me. In all the Stan documents, whenever people refer to the typical set they’re talking about a subset of Theta. See the quotes by Bob and Michael above for examples. Or if they are referring to the space Theta^n, I think the documentation needs to be rewritten!

]]>I don’t think “concentrates” is the right word to use there (it evokes the high probability set, not the much more spread out typical set).

> Because of the law of large numbers, this problem does not arise with the classical definition of typical set as n gets large. But since we’re working with n=1, it’s a concern.

Do you mean n=10? Or don’t you think that the dimensionality of the problem is what gives the length of the sequences?

> And the definitions implicitly used by the Stan people are clearly working within Theta, not Theta^n.

It doesn’t seem so clear to me.

]]>The definition of differential entropy isn’t invariant to changes in parameterization. Jaynes wanted to measure entropy relative to a base measure defined in terms of a “limiting density of discrete points” and I feel certain that other folk have come up with equivalent notions; such a thing would be required in e.g., Wallace’s (fully Bayesian) Minimum Message Length paradigm. The fact that MCMC doesn’t sample actual continuous values needs to be taken into account, which I believe means that you are right: the typical set as the Stan community conceives of it isn’t invariant to parameterization because the IEEE 754 representation for theta isn’t being taken into a transformed space of discrete values; rather the IEEE 754 spec applies untransformed to phi as well, which means there is no volume Jacobian to balance the density Jacobian.
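
The non-invariance is easy to demonstrate numerically. A toy sketch of my own: rescaling a standard normal shifts its differential entropy by exactly log 2, so -E[log p(.)] depends on which parameterization you compute it in:

```python
import numpy as np

rng = np.random.default_rng(6)

# Differential entropy is not reparameterization-invariant: for
# X ~ N(0, 1), h(X) = 0.5*log(2*pi*e); rescaling to Y = 2X adds log(2).
x = rng.standard_normal(200_000)
h_x = -np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * x ** 2)    # -E[log p_X(X)]
y = 2 * x
h_y = -np.mean(-0.5 * np.log(8 * np.pi) - y ** 2 / 8)      # -E[log p_Y(Y)]
print(h_x, h_y, h_y - h_x)  # gap is log(2) ~= 0.693 (exact, same draws)
```

The gap is exactly log 2 here because the per-draw difference of the two log densities is the constant Jacobian term, which is the whole point: without a base measure, the “entropy” moves with the parameterization.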

]]>Sp:

I could well believe there are some subtleties here that I’m missing! I sent a draft of this post around to a few colleagues and they caught a few confusing points which I tried to clarify, but I’m sure there’s more to the story. I wrote the above post in large part because different things that people were saying about the typical set seemed to be contradictory, so this is an attempt to reveal my own confusion as well as that of others.

]]>