The Tall Head of URL Exploration
In The Long Tail of URL Exploration, I looked at the distribution of URL visits by 102K people in a day. In What does the Nth Explorer of the Web Find?, I looked at how adding users grows the long tail and the number of unique URLs explored. In this third post (of four), I look at how the tall head changes as we add explorers.
The tall head is the short list of sites that get the most visits. We could call it the Top 10, Top 100 or whatever we think is relevant. Later I propose a couple of heuristics for determining tall head membership in real time.
The tall head is made up of URLs that much of the population visits. These are the "winner take all" URLs of user attention: URLs like cnn.com, google.com, or facebook.com. For this reason, we might expect that while the long tail is growing with more unique URLs and the number of URLs with 2-3 hits is growing rapidly, the tall head is relatively stable.
This is the case.
One way to think about how stable the tall head might be is to ask how well a subset of the population predicts membership in the tall head for the entire sample. The data from the last two posts is well-suited to look at this question.
Below is a plot of the accuracy of various subsets of the population (the same subsets we used previously) in predicting the Top 10, Top 20, Top 50 and Top 100 URLs of the entire population. Just over 40% of the population predicts the Top 100 URLs with 90% accuracy. The Top 10 are predicted to 90% accuracy by 10% of the population.
The composition of the tall head depends relatively weakly on the subset of the population doing the predicting.
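To make the measurement concrete, here is a minimal sketch of one way such an accuracy could be computed. It assumes visits arrive as (user, url) pairs and that "accuracy" means the fraction of the full population's Top N that also appears in the subset's Top N; the post does not spell out the exact definition, and the function names `top_n` and `top_n_accuracy` are hypothetical.

```python
from collections import Counter


def top_n(visits, n):
    """Return the set of the n most-visited URLs in a list of (user, url) pairs."""
    counts = Counter(url for _, url in visits)
    # Sort by count (descending), then by URL so that ties break deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {url for url, _ in ranked[:n]}


def top_n_accuracy(visits, sample_users, n):
    """Fraction of the full population's Top N that the sample's Top N recovers.

    Assumption: accuracy is measured as set overlap, ignoring rank order.
    """
    full = top_n(visits, n)
    sample_visits = [(user, url) for user, url in visits if user in sample_users]
    sample = top_n(sample_visits, n)
    return len(full & sample) / len(full) if full else 0.0


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    visits = [
        ("a", "x.com"), ("b", "x.com"), ("c", "x.com"),
        ("a", "y.com"), ("b", "y.com"),
        ("c", "z.com"),
    ]
    print(top_n_accuracy(visits, {"a"}, 2))
    print(top_n_accuracy(visits, {"c"}, 2))
```

Running this over random subsets of increasing size, for each N, would give accuracy curves of the kind described above.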
How can I predict the tall head for the day by 9 am? This is the real-time problem of long tail distributions. The dynamics of the system are that real-time Web exploration data appears as a time-ordered list of URLs from whatever users happen to be surfing. This means that a real-time heuristic for determining top URLs for the day has to rely on the properties of the time series, including a small surfer sample size and recent counts of visits.
Fortunately, by the results illustrated above, a small sample size is a pretty good bet for determining tall head URLs. What we are still missing are metrics or intuition for how the long tail distribution evolves over time.
We do know that for a URL to end up in the tall head, it must be visited by many Web explorers. This means that we can rule out all URLs that are visited by only one or two users. This assumption also leads to a heuristic based on the time between visits: URLs visited by many people should have the same visit/time distribution as the users/time distribution of the entire sample. More specifically, we might guess that if the time between a URL's visits has an average near (1 day)/(number of visits) and relatively low variance, that URL is likely to be in the tall head. A project for a little later…
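As a rough illustration of the inter-visit heuristic, here is a sketch under stated assumptions rather than a tested method. It takes one URL's visit timestamps in seconds since midnight plus a count of distinct users. The function name `looks_like_tall_head` and the thresholds `min_users`, `mean_tolerance` and `max_cv` are all hypothetical, and "relatively low variance" is interpreted here as a low coefficient of variation of the gaps.

```python
from statistics import mean, pstdev

DAY_SECONDS = 86_400


def looks_like_tall_head(timestamps, distinct_users,
                         min_users=3, mean_tolerance=0.5, max_cv=1.5):
    """Guess whether a URL's visit pattern resembles a tall head URL.

    All thresholds are illustrative assumptions, not values from the data.
    """
    # Rule out URLs visited by only one or two users.
    if distinct_users < min_users or len(timestamps) < 3:
        return False

    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    avg_gap = mean(gaps)
    if avg_gap == 0:
        return False  # every visit at the same instant; no usable signal

    # Expected gap if the visits were spread over the whole day.
    expected_gap = DAY_SECONDS / len(ts)
    near_expected = abs(avg_gap - expected_gap) <= mean_tolerance * expected_gap

    # Coefficient of variation as a scale-free stand-in for "low variance".
    cv = pstdev(gaps) / avg_gap
    return near_expected and cv <= max_cv


if __name__ == "__main__":
    steady = list(range(0, DAY_SECONDS, 3600))  # one visit per hour, all day
    burst = [0, 10, 20, 30]                     # a short burst, then nothing
    print(looks_like_tall_head(steady, distinct_users=20))
    print(looks_like_tall_head(burst, distinct_users=4))
```

Applied at 9 am rather than at the end of the day, the expected gap would presumably need to be scaled to the elapsed window and to the overall traffic curve, which this sketch does not attempt.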
Next post: How does the composition of the tall head change from day to day?