The Long Tail of URL Exploration
Unlike robot crawlers, people visit only the links they think will be interesting. People follow news stories, follow links from friends, follow links to videos or pictures. People much more rarely follow links to boring stuff like privacy policies, lists of data or company mission statements.
To understand the pattern of people discovering content on the Web, I looked at the click streams of real Web users. (For privacy hounds: I have real data, but the user names and URLs are hashed, so I can’t personally identify anyone or look up the actual URLs they visited.)
How much overlap is there between the URLs 10 people visit and those in the 11th person’s click stream? How about the 100th or 100,000th person? Does the millionth user explore any unique URLs at all? Can we build a model to answer the question, "How many people are required to crawl 10% of the Web?"
The first part of the answer is to look at the distribution of URLs visited by a group of users. The sample has 102,000 users’ click streams for 1 day. I use only 1 day because nearly all users have a set of URLs they visit daily, and that repetition complicates the estimates when multiple days are combined. In the sample, users make 18.6M visits to just under 8.4M unique URLs.
When the URLs are ranked by number of visits, the distribution of visits over the 8.4M URLs shows that a few URLs get many hits while the tail of the distribution (way out to the right) is a long flat curve with many URLs getting only a handful of hits.
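The ranking step described above can be sketched in a few lines of Python. This is only an illustration: the function name `rank_urls_by_visits` and the toy list of hashed URLs are made up for the example and are not the actual data or tooling used for this post.

```python
from collections import Counter

def rank_urls_by_visits(visits):
    """Return (url_hash, visit_count) pairs, most visited first.

    `visits` is assumed to be an iterable with one hashed URL per visit.
    """
    counts = Counter(visits)
    # most_common() with no argument returns every URL, sorted by count.
    return counts.most_common()

# Hypothetical toy click stream of hashed URLs (not real data).
toy_visits = ["a1", "b2", "a1", "c3", "a1", "b2", "d4"]
ranked = rank_urls_by_visits(toy_visits)

for rank, (url_hash, count) in enumerate(ranked, start=1):
    print(rank, url_hash, count)
```

Rank 1 is the most visited URL, and the long tail is everything far down this list.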
The distribution of visits can be fitted fairly accurately with a power law, p(x)=ax^k, where x is the rank of the URL and p(x) is the fraction of visits it receives. I don’t plot the curve because the head is so tall compared to the tail, and the distribution falls off so quickly, that the plot is a very sharp "L" shape and we don’t get much from looking at it. It is more useful to look at the cumulative distribution function (CDF) of the distribution. This is the running sum of the probabilities over the ranks, from the highest ranked URL to the lowest. Summing the probabilities all the way to the lowest ranked URL gives 100% of the visits recorded. The CDF perspective gives insight we can apply to practical situations.
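As a rough sketch of how the CDF over ranks can be computed from per-URL visit counts, consider the following. The function name `empirical_cdf` and the toy counts are assumptions made for illustration only.

```python
from itertools import accumulate

def empirical_cdf(visit_counts):
    """Cumulative share of visits by rank, from most to least visited URL.

    `visit_counts` is assumed to hold one visit count per unique URL.
    """
    ordered = sorted(visit_counts, reverse=True)  # rank 1 first
    total = sum(ordered)
    # Running sum of counts, normalized so the last value is 1.0 (100%).
    return [running / total for running in accumulate(ordered)]

# Hypothetical toy counts: one popular URL and a few rarely visited ones.
toy_counts = [1, 5, 2, 1, 1]
cdf = empirical_cdf(toy_counts)
print(cdf)
```

Each entry is the fraction of all visits captured by the URLs up to that rank, so the last entry is always 1.0.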
Below is a plot of the CDF. The red dots are the data points calculated from the sample while the blue line is the best fit to the CDF of the proposed power law distribution. (The fit parameters are a=0.00219 and k=-0.690.)
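To get a feel for the fitted parameters, here is a minimal sketch that evaluates the CDF of the power law p(x)=ax^k. It relies on an assumption that is mine and not part of the fit itself: the sum over ranks is replaced by the continuous approximation a·x^(k+1)/(k+1), which is only reasonable for large ranks. The function names are hypothetical.

```python
# Fit parameters quoted in the post.
A = 0.00219
K = -0.690

def model_cdf(rank, a=A, k=K):
    """Approximate the sum of a * r**k for r = 1..rank.

    Assumption: the discrete sum is replaced by the integral
    a * rank**(k + 1) / (k + 1), which ignores a small correction
    at the head of the distribution.
    """
    return a * rank ** (k + 1) / (k + 1)

def rank_for_share(share, a=A, k=K):
    """Invert model_cdf: the rank at which the model reaches `share`."""
    return (share * (k + 1) / a) ** (1 / (k + 1))

# Share of visits the model assigns to all ~8.4M URLs in the sample.
print(model_cdf(8.4e6))
# Rank at which the model reaches 80% of visits.
print(rank_for_share(0.8))
```

Under this approximation the model's CDF comes out close to 1 at the full 8.4M URLs, which is a useful sanity check on the quoted parameters.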
One conclusion that comes out of this view of the data concerns the so-called "80/20 Rule," which would predict that 80% of the visits for the day went to roughly 20% of the URLs. For this data, the proportion is actually closer to 80/50: 80% of the traffic went to the top 4.5M URLs, or the top 57% of URLs.
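The 80/20 check amounts to walking down the ranked list until the running total reaches 80% of all visits. A small sketch, with a hypothetical function name and toy counts:

```python
def urls_needed_for_share(visit_counts, share=0.8):
    """Return (number of top URLs, fraction of all URLs) covering `share` of visits.

    `visit_counts` is assumed to hold one visit count per unique URL.
    """
    if not visit_counts:
        raise ValueError("visit_counts must not be empty")
    ordered = sorted(visit_counts, reverse=True)
    target = share * sum(ordered)
    running = 0
    for n, count in enumerate(ordered, start=1):
        running += count
        if running >= target:
            return n, n / len(ordered)
    return len(ordered), 1.0

# Hypothetical toy counts (not the real sample).
toy_counts = [5, 2, 2, 1]
n_urls, fraction = urls_needed_for_share(toy_counts, share=0.8)
print(n_urls, fraction)
```

If the 80/20 Rule held exactly, the returned fraction would be about 0.2.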
What does the long tail look like? The tail is just as surprising. For this data set, 5.8M URLs or 69% of the URLs visited during the day were visited only once. The number of URLs visited twice is 1.4M.
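Counting how many URLs were visited exactly once, twice, and so on is a "count of counts." A minimal sketch, again with a hypothetical function name and toy data:

```python
from collections import Counter

def visit_count_histogram(visits):
    """Map each visit count to the number of URLs visited that many times.

    `visits` is assumed to be an iterable with one hashed URL per visit.
    """
    per_url = Counter(visits)          # visits per URL
    return Counter(per_url.values())   # URLs per visit count

# Hypothetical toy click stream of hashed URLs (not real data).
toy_visits = ["a1", "b2", "a1", "c3", "a1", "b2", "d4"]
histogram = visit_count_histogram(toy_visits)

unique_urls = sum(histogram.values())
singleton_share = histogram[1] / unique_urls
print(histogram[1], histogram[2], singleton_share)
```

Here `histogram[1]` is the number of URLs visited only once, and `singleton_share` is the quantity reported as 69% for the real sample.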
How do these numbers scale with the size of the group? That’s coming in the next post.