AOL Data - The Analysis

Unless you have been living under a rock, you are probably aware that AOL released 2.2 GB of their search logs to the general public in early August. Obviously there were some fairly serious privacy implications which AOL apparently failed to consider before releasing the data. Now the AOL research site (research.aol.com) has been shut down, the download has been taken down, and it is likely we’ll be waiting a long time to see data like this again.

While the privacy problems should not be underplayed, it’s important to understand how useful this data is. Greg Linden says it well:

Some may cheer AOL getting a firm spanking over this privacy issue, but I think the long-term costs are grave. I suspect this pretty much eliminates any future access for the academic research community to large scale data sets. After this, the only work on big data will be at the search giants.

Hindering academic research will slow progress on building the next generation of search. It is hard to measure the cost of difficulty finding the information you need — the productivity loss of a few minutes a day over millions of people is difficult to measure — but it is a cost we will all be paying.

http://glinden.blogspot.com/2006/08/chance-to-play…

 

I was fortunate enough to get a copy of the data before most people became interested in it. That puts me in an interesting position - is it ethical to use data which so obviously (in retrospect) should not have been released? Some people have no problem digging through the data to try to find scandalous searches, but that’s not what I’m interested in doing. What I want to know is a few basic things like:

  • What is the average query length, and how does this compare historically?
  • How many people are not finding what they want?
  • How many people are doing navigation using a search engine?

Thankfully, Ars Technica has done the ethical worrying for me (or at least interviewed people who are professionals at it). Needless to say, views are mixed.

Jon Kleinberg:

“Now it’s sitting there, in cold storage,” he said. “The number of things it reveals about individual people seems much too much. In general, you don’t want to do research on tainted data.”

Jeffrey Seglin:

the ethical obligation “here falls upon the companies that are releasing the data. If these companies have made a commitment to keep individual behavior private, then they have an obligation to make sure that the data they release can’t be manipulated to discover the identities of the users.”

When it comes to research, though, Seglin has no problem with people who want to use the data for their own projects—provided they do not cross one important boundary. “If researchers are using the data to identify individual users, I believe they’ve crossed an ethical line,” Seglin says. But if they don’t try to match up the data with actual people, he believes that they are (ethically) in the clear.

Anyway, I’m going with the “use your powers for good, not evil” theory.

 

The Data 

The notes on the data set extracted from the README that came with it state the following:

  • 36,389,567 lines of data
  • 21,011,340 instances of new queries (w/ or w/o click-through)
  • 7,887,022 requests for “next page” of results
  • 19,442,629 user click-through events
  • 16,946,938 queries w/o user click-through
  • 10,154,742 unique (normalized) queries
  • 657,426 unique user ID’s

In my analysis I can’t replicate their results on the number of “next page” queries. I found:

  • 36,389,567 queries
  • 21,008,404 non-paging queries (which I think should be the same as the 21,011,340 instances of new queries cited in the README)
  • 15,381,163 requests for “next page” of results

I can’t explain the discrepancy between my measurements and the documentation (and an unfortunate side effect of this whole debacle is that there is no one at AOL to ask). However, my measurements are internally consistent (21,008,404 + 15,381,163 = 36,389,567), which doesn’t appear to be the case with the documented numbers (21,011,340 + 7,887,022 = 28,898,362).
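
The consistency check is simple arithmetic, and can be sketched in a few lines of Python. The counts are the ones listed above; the dictionary keys and the function name `gap_from_total` are only illustrative.

```python
# Counts as listed above: the README figures and my own measurements.
TOTAL_LINES = 36_389_567

readme = {"new_queries": 21_011_340, "next_page": 7_887_022}
measured = {"new_queries": 21_008_404, "next_page": 15_381_163}


def gap_from_total(counts, total=TOTAL_LINES):
    """Return total minus the sum of the counts (0 means they add up)."""
    return total - sum(counts.values())


print(gap_from_total(measured))  # 0
print(gap_from_total(readme))    # 7491205
```

A gap of zero means new queries and “next page” requests together account for every line; the README figures leave about 7.5 million lines unaccounted for.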

Average Query Length

The average query length (not including requests for “next page” of results) is 2.34 words per query.

However, a significant proportion of all queries are navigational queries for domain names. Ignoring these queries, the average query length increases to 2.86 words per query.
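
A minimal sketch of how an average like this could be computed, assuming a “word” is simply a whitespace-separated token. This is an illustration, not necessarily the exact method behind the figures above, and the sample queries are made up apart from “whitepages.com”.

```python
def average_query_length(queries):
    """Mean number of whitespace-separated words per non-empty query."""
    lengths = [len(q.split()) for q in queries if q.strip()]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical sample; the real input would be the non-paging queries.
sample = ["whitepages.com", "cheap flights to sydney", "java string format"]
print(average_query_length(sample))
```

Filtering domain-name queries out of the input before calling the function would give the second figure.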

Navigational Queries

A surprising (to me) characteristic of this dataset was the large number of searches that were for domain names. For instance, a user would enter “whitepages.com” and then exit directly to http://www.whitepages.com. 28.26% of all non-paging queries involved the user entering a domain name as the search term. Numbers like these go a long way towards explaining why domain-name squatting has become so popular.
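
One way to flag domain-name queries is a simple pattern match. The sketch below is a hypothetical heuristic, not the classifier used for the 28.26% figure: it assumes a domain query is a single token, optionally prefixed with a scheme or “www.”, that ends in one of a short, hand-picked list of top-level domains.

```python
import re

# Hypothetical heuristic: one token ending in a common TLD.
# The TLD list is an assumption and far from exhaustive.
DOMAIN_RE = re.compile(
    r"^(?:https?://)?(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)*"
    r"\.(?:com|net|org|edu|gov|info|biz|us)/?$",
    re.IGNORECASE,
)


def looks_like_domain(query):
    """True if the whole query looks like a bare domain name."""
    return bool(DOMAIN_RE.match(query.strip()))


def navigational_share(queries):
    """Percentage of queries that look like domain names."""
    queries = list(queries)
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if looks_like_domain(q))
    return 100.0 * hits / len(queries)


print(looks_like_domain("whitepages.com"))  # True
print(looks_like_domain("cheap flights"))   # False
```

A heuristic like this misses brand-name navigation such as “ebay”, so it would understate navigational intent in general.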

Paging query results

It is often assumed that most users will not page through result sets. Preliminary analysis of these results shows that this assumption does not generally hold. 42.27% of all queries logged were “next page” requests. I was very surprised at this number, but some reading suggests that it is plausible. For instance, “Analysis of a Very Large Web Search Engine Query Log” looks at 1998 data from AltaVista. In that sample 32% of queries were for the “next page” of the result set.
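
The percentage follows directly from the counts reported earlier. As a quick sketch (the helper name `share` is illustrative), here is the same calculation for both my “next page” count and the README’s:

```python
TOTAL_LINES = 36_389_567  # total lines of data in the release


def share(count, total=TOTAL_LINES):
    """Percentage of all logged lines, rounded to two decimals."""
    return round(100.0 * count / total, 2)


print(share(15_381_163))  # 42.27  (my count of "next page" requests)
print(share(7_887_022))   # 21.67  (the README's count)
```

Even with the README’s lower count, roughly one logged line in five is a paging request.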

6 Comments

  1. Posted August 31, 2006 at 12:55 am | Permalink

    Hmm, strangely the Guardian’s writer believes differently about paging queries. It seems like they used the same AOL data. What do you think? He says just 11% of people page, while you said 42%.

  2. Nick
    Posted August 31, 2006 at 2:55 am | Permalink

    Hi Pooya,

    I haven’t seen the Guardian’s analysis, and I can’t find it on their site. Do you have a link for it? Even using AOL’s number of “next page” queries (7,887,022/36,389,567) it still gives 21% of paging queries.

    The only possibility I can see is they counted the number of individual users who paged, rather than the number of paging queries (i.e. they are saying that 11% of the 657,426 users use the paging feature). I haven’t done any analysis to confirm that, though.

  3. Posted August 31, 2006 at 7:18 pm | Permalink

    Oops, forgot to include the link to the Guardian’s article: http://technology.guardian.co.uk/weekly/story/0,,1861112,00.html

  4. Nick
    Posted September 1, 2006 at 2:36 am | Permalink

    Ok - for those interested, the analysis is here: http://www.seo-portal.com/aol-data-analysis-i-clicks-on-search-engine-results/2006/08/09/

    He’s saying 11% of click-throughs are on the second search result link (not to the second page). This isn’t inconsistent with my analysis - we are looking at very different things.

  5. Posted September 4, 2006 at 7:37 am | Permalink

    Perhaps the “next page” events are the ones where the timestamp is identical yet shows an itemrank of 0. I noticed lots of those and thought they must be duplicates, but perhaps not.

    Not sure how to indicate those - I’m reluctant to go through tweaking the data having finally got it organised.

    (Yes, I wrote the article.)

    Charles

  6. Sean
    Posted September 19, 2006 at 8:41 pm | Permalink

    “A surprising (to me) characteristic of this dataset was the large number of searches that were for domain names.”

    Remember that these searches come from the AOL Client. The interface of the AOL client (in the latest version) uses a single text field for both search and URL navigation (and some AOL keywords). This may skew the results compared to users not using the AOL client. I would suspect that the number of URL searches by users not using the AOL client would be much lower. Searches for “ebay”, “yahoo”, and the like would probably not change much if at all.

3 Trackbacks/Pingbacks

  1. My New Blog…

    For a variety of reasons I’ve started a new blog: http://wwwscope.com/. I probably won’t write much Java stuff on there (not that I seem to here, either, at the moment), but my intention is that it will contain mostly longer technical content.

    I’ve st…

  2. Like a bad penny…

    Nick Lothian has shifted/diverged to another blog: WWWScope, kicking things off with a look at the AOL search data. His analysis is generally spot on, so I’m looking forward to him becoming a little more prolific (?) in the coming months. Subscribed….

  3. [...] There’s another twist to the story: to make the contest work, they have to release the database of rental histories. Unlike the AOL search data debacle, however, Netflix carefully considered the privacy implications and got the nod from privacy experts. The data are also just easier to anonymize; a person’s web portal search records are generally much more personal than a list of rented movies. [...]
