According To The Library Of Congress, Searching Twitter Is Really Hard

Nearly three years after Twitter granted the Library of Congress access to its archive of tweets, the Library has finally been able to preserve and stabilize an archive of roughly 170 billion tweets.  That figure combines the archive Twitter had at the time of the agreement (covering 2006 to early 2010) with about 150 billion tweets received since then (the Library now receives nearly half a billion new tweets each day, and that number continues to rise).  The archive includes the tweets and relevant metadata, for a total of nearly 67 terabytes of data.

You can read more about the archive-building project in a white paper issued by the Library.  Researchers do not yet have the ability to access the tweets, though over 400 requests have been submitted.  I’m a bit surprised that there have been so few requests, but perhaps the researchers most interested in accessing this archive are simply waiting for the doors to open.

Part of the delay in opening the archive can be attributed to a feature of this collection that is unique to social media: it is updated essentially continuously.  Archival infrastructure had to be adapted to handle a steady stream of information.  In addition, it’s not enough to simply have the data.  It must be organized in a consistent way that makes it usable.  Anyone who has tried to search Twitter may understand how unwieldy and frustrating it can be.  Even the company tacitly acknowledged this when it announced this year that it is working on making it possible for Twitter users to access their own tweets.  The CEO described preparing a search engine for all tweets as a completely separate problem from granting individual access.  It’s not exactly tracking water molecules in a stream, but it might as well be.

There’s also the issue of search time.  If a single search query of the 21 billion tweets in the 2006-2010 archive takes the Library 24 hours right now, searching 170 billion and counting is a daunting prospect.  Throwing more servers at the problem will help, but that costs resources not readily available to public institutions.
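To put a rough number on "daunting," here is a back-of-the-envelope sketch in Python.  It uses only the figures quoted above (21 billion tweets, 24 hours, 170 billion tweets) and rests on two assumptions of mine, not the Library's: that search time grows linearly with the number of tweets, and that extra servers split the work perfectly evenly.  Real systems rarely behave that neatly, so treat the result as an order-of-magnitude guess.  The function name is made up for illustration.

```python
# Figures quoted in the post.
BASELINE_TWEETS = 21e9        # tweets in the 2006-2010 archive
BASELINE_HOURS = 24.0         # time for one query over that archive
FULL_ARCHIVE_TWEETS = 170e9   # tweets in the full archive so far


def estimated_search_hours(tweets, servers=1):
    """Hypothetical estimate of single-query time in hours.

    Assumes (1) time scales linearly with tweet count and
    (2) work divides evenly across servers. Both are simplifications.
    """
    if servers < 1:
        raise ValueError("servers must be at least 1")
    return BASELINE_HOURS * (tweets / BASELINE_TWEETS) / servers


hours = estimated_search_hours(FULL_ARCHIVE_TWEETS)
print(f"About {hours:.0f} hours, or roughly {hours / 24:.1f} days, per query")
```

Under those assumptions a single query over the full archive would take on the order of eight days, and it would take about eight times the current hardware just to get back to the 24-hour mark.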

And here’s where what could be an enormous public resource could get shut behind a paywall.  You may notice that a Google employee reached out in the comments section to offer assistance in getting the search times down.  But I’d be curious to see what conditions would come with that assistance.  I’m reminded of how Ancestry.com managed to get exclusive access to Census records.  While the Bureau benefitted from getting the records digitized, having this taxpayer-paid information controlled by a private company is problematic.  Hopefully the search problem can be managed in a way that honors Twitter’s initial gesture of making the information available to the public.
