Evan Warfel asks a question:
Let’s say that a researcher is collecting data on people for an experiment. Furthermore, it just so happens that, due to the data collection procedure, the data are gathered and recorded in 100-person increments. (This means the researcher effectively has a time series, and at some point t they decide to stop collecting data.)
Now, let’s assume that the researcher is not following best practices and wants to compute summary statistics at each timestep. How might they tell whether the data they are collecting have, or are likely to have, finite vs. non-finite variance? For example, how might they tell that what they are studying isn’t best described by, say, a Cauchy distribution (where they can’t be sure that estimates of the standard deviation will stabilize with more data)? Is the only solution to run goodness-of-fit tests at each time t and just hope for finite variance?
I ask because my understanding is that many statistical calculations relating to null hypothesis significance testing (NHST) rely on the central limit theorem. Given cognitive biases like scope neglect, I wonder whether people might be systematically bad at reasoning about the true amount of variation in human beings, and whether this might cause them to accept the CLT as applicable to all situations.
He elaborates:
Clearly, if one’s data come from a source generating approximately normal data, one can have a decent indication that the variance is finite via standard-issue tests of normality. Maybe at large samples (like those implied by 100-person increments), this question is easy enough to resolve? But what if one only has the time or resources to gather fewer data points? Or, to flip it around: is it definitionally possible to have diabolical distributions where the first N points sampled are likely to look normal or have finite variance, and after N points things devolve? Or does this push the definition of finite variance too far?
I’ll answer the questions in reverse order:
1. Yes, you can definitely see long-tailed distributions where the tail behavior is unclear even from what might be considered a large sample. For an example, see section 7.6 of BDA3, which is an elaboration of an example from Rubin (1983).
2. You can also see this by drawing some samples from a Cauchy distribution. In the R console, type sort(round(rcauchy(100))) or sort(round(rcauchy(1000))). (The first code sketch after this list expands on this experiment.)
3. By the way, if you want insight into the Cauchy and related distributions, try thinking of them as ratios of estimated regression coefficients. If u and v have normal distributions, then u/v has a distribution with Cauchy tails. (If v is something like normal(4,1), then you won’t typically see those tails, but they’re there if you go out far enough; the second code sketch after this list shows this in simulation.)
4. One reason why point #3 is important is that it can make you ask why you’re interested in the distribution of whatever you’re looking at in the first place.
5. To return to point #1 above (the example in BDA3): one way to get stable inferences for a long-tailed distribution is to put a constraint on the tail. This could be a hard constraint (saying that there are no values in the distribution greater than 10^7 or whatever) or a soft constraint bounding the tail behavior. In a real-world problem you should be able to supply such information. (The third code sketch after this list gives a toy illustration of the hard constraint.)
6. To get to the original question: I don’t really care if the underlying distribution has finite or infinite variance. I don’t see this mapping to any ultimate question of interest. So my recommendation is to decide what you’re really trying to figure out, and then go from there.
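P.S. Here are the code sketches referred to above. First, a quick simulation expanding on point #2. It’s a minimal sketch, assuming nothing beyond base R; the seed, sample sizes, and checkpoints are arbitrary choices for illustration. The running sample standard deviation of normal draws settles down, while the Cauchy version keeps jumping whenever an extreme draw arrives:

# Running sample standard deviations of normal vs. Cauchy draws.
# The normal SD stabilizes as n grows; the Cauchy "SD" never does,
# because the Cauchy distribution has no finite variance.
set.seed(123)  # arbitrary seed, just for reproducibility
n <- 10000
x_norm   <- rnorm(n)
x_cauchy <- rcauchy(n)

checkpoints <- c(100, 500, 1000, 5000, 10000)
data.frame(
  n         = checkpoints,
  sd_normal = sapply(checkpoints, function(k) sd(x_norm[1:k])),
  sd_cauchy = sapply(checkpoints, function(k) sd(x_cauchy[1:k]))
)

Run it a few times without the seed and you’ll see the sd_cauchy column land just about anywhere, while sd_normal stays close to 1.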
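Second, a sketch for point #3. The normal(4,1) for v comes from that point; taking u as standard normal, along with the seed and sample size, is my own arbitrary choice. Most draws of u/v look tame, but every so often v lands near zero and the ratio explodes, and those rare draws are the Cauchy-like tail:

# Ratio of two normal random variables: u ~ normal(0,1), v ~ normal(4,1).
# Typical draws of u/v are small because v is usually near 4, but rare
# draws with v near zero blow up -- that's the Cauchy-like tail.
set.seed(456)  # arbitrary seed
n <- 1e6
u <- rnorm(n, 0, 1)
v <- rnorm(n, 4, 1)
ratio <- u / v

sd(ratio[abs(ratio) < 2])  # the typical spread, roughly 1/4
sum(abs(ratio) > 10)       # only a small number of wild draws out of a million
max(abs(ratio))            # typically in the hundreds, far beyond the typical spread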
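Third, a toy version of the hard-constraint idea in point #5. This is not the BDA3 example; the Cauchy data and the bound of 100 are invented purely for illustration. The running mean of the raw long-tailed draws typically keeps wandering as extreme values arrive, while the bounded version settles down near the location:

# Clamp a long-tailed sample at a hard bound and compare running means.
# Bounding the tail is what makes the estimate stabilize.
set.seed(789)  # arbitrary seed
x <- rcauchy(10000, location = 1)

bound <- 100                               # made-up hard constraint
x_bounded <- pmin(pmax(x, -bound), bound)  # clamp values to [-100, 100]

checkpoints <- c(100, 1000, 5000, 10000)
data.frame(
  n            = checkpoints,
  mean_raw     = sapply(checkpoints, function(k) mean(x[1:k])),
  mean_bounded = sapply(checkpoints, function(k) mean(x_bounded[1:k]))
)

In a real analysis the constraint would enter as part of the model (a bounded or soft-tailed prior), not as after-the-fact clamping of the data; the clamping here is just the quickest way to see the stabilizing effect of bounding the tail.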