Unless you have been living under a rock, you are probably aware that AOL released 2.2 GB of their search logs to the general public in early August. Obviously there were some fairly serious privacy implications, which AOL apparently failed to consider before releasing the data. Now the AOL research site (research.aol.com) has been shut down, the download has been taken away, and it is likely we’ll be waiting a long time to see data like this again.
While the privacy problems should not be underplayed, it’s important to understand how useful this data is. Greg Linden says it well:
Some may cheer AOL getting a firm spanking over this privacy issue, but I think the long-term costs are grave. I suspect this pretty much eliminates any future access for the academic research community to large scale data sets. After this, the only work on big data will be at the search giants.
Hindering academic research will slow progress on building the next generation of search. It is hard to measure the cost of difficulty finding the information you need — the productivity loss of a few minutes a day over millions of people is difficult to measure — but it is a cost we will all be paying.
http://glinden.blogspot.com/2006/08/chance-to-play…
I was fortunate enough to get a copy of the data before most people became interested in it. That puts me in an interesting position: is it ethical to use data which so obviously (in retrospect) should not have been released? Some people have no problem digging through the data to try to find scandalous searches, but that’s not what I’m interested in doing. What I want to know is a few basic things like:
- What is the average query length, and how does this compare historically?
- How many people are not finding what they want?
- How many people are doing navigation using a search engine?
Thankfully, Ars Technica has done the ethical worrying for me (or at least interviewed people who are professionals at it). Needless to say, views are mixed.
Jon Kleinberg:
“Now it’s sitting there, in cold storage,” he said. “The number of things it reveals about individual people seems much too much. In general, you don’t want to do research on tainted data.”
Jeffrey Seglin:
the ethical obligation “here falls upon the companies that are releasing the data. If these companies have made a commitment to keep individual behavior private, then they have an obligation to make sure that the data they release can’t be manipulated to discover the identities of the users.”
When it comes to research, though, Seglin has no problem with people who want to use the data for their own projects—provided they do not cross one important boundary. “If researchers are using the data to identify individual users, I believe they’ve crossed an ethical line,” Seglin says. But if they don’t try to match up the data with actual people, he believes that they are (ethically) in the clear.
Anyway, I’m going with the “use your powers for good, not evil” theory.
The Data
The README that came with the data set states the following:
- 36,389,567 lines of data
- 21,011,340 instances of new queries (w/ or w/o click-through)
- 7,887,022 requests for “next page” of results
- 19,442,629 user click-through events
- 16,946,938 queries w/o user click-through
- 10,154,742 unique (normalized) queries
- 657,426 unique user ID’s
In my analysis I can’t replicate their results on the number of “next page” queries. I found:
- 36,389,567 queries
- 21,008,404 non-paging queries (which I think should be the same as the 21,011,340 instances of new queries cited in the README)
- 15,381,163 requests for “next page” of results
I can’t explain the discrepancy between my measurements and the documentation (and an unfortunate side effect of this whole debacle is that there is no one at AOL to ask). However, my measurements are internally consistent (21,008,404 + 15,381,163 = 36,389,567), which doesn’t appear to be the case with the documented numbers: the README’s new queries and “next page” requests add up to only 28,898,362.
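The consistency check is plain arithmetic, so it is easy to reproduce. The sketch below uses only the counts quoted in this post; the dictionary keys and the `shortfall` helper are made-up names for illustration. It also shows that the README's click-through and no-click-through counts do add up to the total line count, so the mismatch seems to be confined to the new/"next page" split.

```python
# Counts copied from the post; the dictionary keys are illustrative labels only.
TOTAL_LINES = 36_389_567

README = {
    "new_queries": 21_011_340,
    "next_page": 7_887_022,
    "click_through": 19_442_629,
    "no_click_through": 16_946_938,
}
MEASURED = {"non_paging": 21_008_404, "next_page": 15_381_163}


def shortfall(parts, total=TOTAL_LINES):
    """Return how many lines are left unaccounted for by the given parts."""
    return total - sum(parts)


measured_gap = shortfall(MEASURED.values())
readme_paging_gap = shortfall([README["new_queries"], README["next_page"]])
readme_click_gap = shortfall([README["click_through"], README["no_click_through"]])

print(f"measured new + next page:   {measured_gap:>10,} lines unaccounted for")
print(f"README new + next page:     {readme_paging_gap:>10,} lines unaccounted for")
print(f"README click + no click:    {readme_click_gap:>10,} lines unaccounted for")
```

A gap of zero means the parts partition the log exactly; a large positive gap means some lines fall into neither bucket under that definition.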
Average Query Length
The average query length (not including requests for the “next page” of results) is 2.34 words per query.
However, a significant proportion of all queries are navigational queries for domain names. Ignoring these queries, the average query length increases to 2.86 words per query.
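For anyone wanting to compute a similar figure on their own query list, here is a minimal sketch. It is not the code behind the numbers above: it assumes a "word" is a whitespace-separated token, and the `exclude` parameter, the `.com` test, and the toy queries are all hypothetical stand-ins.

```python
from typing import Callable, Iterable, Optional


def average_query_length(
    queries: Iterable[str],
    exclude: Optional[Callable[[str], bool]] = None,
) -> float:
    """Mean number of whitespace-separated words per query.

    Queries for which `exclude` returns True are skipped, as are empty ones.
    """
    total_words = 0
    counted = 0
    for query in queries:
        if exclude is not None and exclude(query):
            continue
        words = query.split()
        if not words:
            continue
        total_words += len(words)
        counted += 1
    return total_words / counted if counted else 0.0


# Toy queries invented for illustration; they are not taken from the AOL data.
sample = ["whitepages.com", "cheap flights to boston", "weather", "how to tie a tie"]

overall = average_query_length(sample)
# Crude stand-in for "is this a domain name?" (an assumption, see lead-in).
without_domains = average_query_length(sample, exclude=lambda q: q.endswith(".com"))

print(f"all queries:       {overall:.2f} words")
print(f"excluding domains: {without_domains:.2f} words")
```

Dropping one-word domain queries pushes the average up, which is the same direction as the move from 2.34 to 2.86 reported above.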
Navigational Queries
A surprising (to me) characteristic of this dataset was the large number of searches that were for domain names. For instance, a user would enter “whitepages.com” and then exit directly to http://www.whitepages.com. 28.26% of all non-paging queries involved the user entering a domain name as the search term. Numbers like these go a long way toward explaining why domain-name squatting has become so popular.
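One way to flag domain-name queries is a simple pattern match. The sketch below is an assumed heuristic, not the rule used to get the 28.26% figure: the regular expression, the short list of top-level domains, and the function names are all illustrative choices, and a different rule would give a different percentage.

```python
import re
from typing import Iterable

# Hypothetical heuristic: the whole query is a single token made of an optional
# "www." prefix, one or more dot-terminated labels, and a TLD from a short list.
DOMAIN_RE = re.compile(r"^(?:www\.)?(?:[a-z0-9-]+\.)+(?:com|net|org|edu|gov)$")


def looks_like_domain(query: str) -> bool:
    """True if the query, lowercased and trimmed, matches the domain pattern."""
    return DOMAIN_RE.match(query.strip().lower()) is not None


def navigational_share(queries: Iterable[str]) -> float:
    """Percentage of queries that look like domain names (0.0 for no input)."""
    queries = list(queries)
    if not queries:
        return 0.0
    hits = sum(looks_like_domain(q) for q in queries)
    return 100.0 * hits / len(queries)


# Toy queries for illustration; only "whitepages.com" comes from the post.
sample = ["whitepages.com", "weather", "www.example.org", "cheap flights"]
print(f"{navigational_share(sample):.2f}% of sample queries look like domains")
```

Queries such as "white pages" would not be caught by this pattern even though the intent is arguably navigational, so a rule like this gives a lower bound on navigational use.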
Paging query results
It is often assumed that most users will not page through result sets. Preliminary analysis of this data suggests that the assumption does not generally hold: 42.27% of all queries logged were “next page” requests. I was very surprised at this number, but some reading suggests it could well be correct. For instance, “Analysis of a Very Large Web Search Engine Query Log” looks at 1998 data from AltaVista. In that sample, 32% of queries were for the “next page” of the result set.
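The 42.27% figure follows directly from the counts given earlier (15,381,163 of 36,389,567 lines). How a line gets classified as a "next page" request is a separate question. The sketch below shows one possible rule, assumed purely for illustration: a line counts as a repeat when it has the same user and the same query text as the line immediately before it. The function names and toy rows are made up, and the count is sensitive to whichever definition is chosen.

```python
from typing import Iterable, List, Tuple


def flag_repeats(rows: Iterable[Tuple[int, str]]) -> List[bool]:
    """Flag each (user_id, query) row that repeats the row directly before it.

    Assumed rule (not confirmed by the post): same user and same query text
    as the previous line means the line is a repeat of an earlier query.
    """
    flags = []
    previous = None
    for user_id, query in rows:
        current = (user_id, query)
        flags.append(current == previous)
        previous = current
    return flags


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


# Toy rows in log order, invented for illustration.
rows = [(1, "weather"), (1, "weather"), (1, "news"),
        (2, "news"), (2, "news"), (2, "news")]
flags = flag_repeats(rows)
print(f"{sum(flags)} of {len(flags)} toy rows are repeats")

# Share of "next page" requests using the counts quoted in the post.
print(f"{percent(15_381_163, 36_389_567):.2f}% of logged lines")
```

Note that the same query text from a different user (the fourth toy row) is treated as a new query, not a repeat.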