Statistics of Coin-Toss Patterns
Yesterday, I watched Peter Donnelly’s TED presentation on statistical mind-benders. Among other things (statistician jokes!), Peter observes that humans don’t have good intuition for some kinds of statistical thinking. In the presentation, Donelly posses a coin toss problem to demonstrate his point. He chooses one that is easy to get wrong.
Consider a series of fair coin tosses. For example, one possible sequence of coin tosses is THTTHHTHTTH. How many tosses are required to get a particular pattern? How does this depend on the length of the pattern?
Peter poses a concrete question as follows. Consider the pattern HTH. If we do the experiment of tossing a coin repeatedly and counting the number of tosses, we find that the first occurrence of HTH arises in some average number of coin tosses . For a different pattern, say TTH, we can repeat the experiment and find that the first occurrence of this pattern arises in some average number of coin tosses .
One of the following statements must be true:
Which statement is correct?
My reflex reaction was (a). The heuristic leading to this conclusion is that the probability of getting TTH is the same as the probability of getting HTH in any 3 coin tosses, i.e.,
On the other hand, if the pattern was HHH the probability of getting this pattern on any three coin tosses is the same. But intuitively, I expect to have to make more coin tosses on average to get this pattern.
It is difficult to get to the correct answer with this kind of reasoning.
A little counter intuitively, (b) is the answer. It takes more tosses on average to get the pattern HTH than the pattern TTH. Peter spends some time arguing that this is plausible–watch his presentation to get those arguments. Below, I pursue calculating this for myself…
Remember that the average or expected value of n is calculated by,
The probability of any single coin toss sequence is where n is the length of the sequence.
An approach to calculating P(n) is to count the number of permutations of coin-toss sequences of length 0 to n that contain the target pattern only in the last position. Each of these sequences has the length-dependent probability above, so that
To take into account that some target patterns can overlap themselves, I force the last (l-1) places in the sequence to match the first (l-1) places of the pattern. The chance of getting the pattern on the next toss is1/2. So multiply by 1/2 to get the formula above.
The table below shows the function N(n-1) for the two patterns HTH and TTH. It is easy to see that for longer sequences, there are many more possible sequences not containing HTH than HHT. Since the mean is calculated as a sum of N(n-1) multiplied by the probability which only depends on the length of the sequence, HTH is going to have the greater expected n.
You can use a pad a paper to count sequences, but this gets out of hand quickly. I wrote a Python script to help keep things sorted out. This script makes all the calculations shown for the rest of this entry and can be downloaded here.
|Tosses||Coin Toss Sequence
|Coin Toss Sequence
It is easy to use these counts and the formula above to calculate the expected value of n for short patterns. Here is an example for a pattern of length 2. This calculation applies to either TH or HT and shows that the expected value of n is 4.00.
|Coin Toss Sequences
without Pattern (TH or HT)
|Average Coin Tosses
This series converges slowly and the numbre of permutations becomes very large for longer patterns. So, although this brute-force summing is straight forward, it runs up against practical limits surprisingly quickly.
Another approach to calculated the average n for a pattern is to use a Monte Carlo simulation of the coin toss events. The Python script above also performs this calculation. The table below shows the results for patterns up to length 4.
|Patterns of length 1|
|Patterns of length 2|
|Patterns of length 3|
|Patterns of length 4|
This table is useful to compare the different results for various patterns of the same length.
Finally, the Monte Carlo method gives an estimate of the distribution of n for the patterns. (The numbers above are based on 20,000 sequences.) The histogram below shows the Distribution of Sequence length for the pattern TTTH. The average sequence length is estimated to be 16 (from the table above). Because the distribution is asymmetric and has a very long tail to the right, the average doesn’t appear near any interesting features (the peak, for example) of the distribution.