#WikiLeaks & Twitter Trending Topics: Manual Interference or Algorithms as Usual?

By: Devin Gaffney on 9 December 2010

The recent outcry over whether “Wikileaks”, “#cablegate”, “Assange”, or similar terms are being blacklisted on Twitter has brought up a now-familiar question: to what degree does Twitter supplement its trending topics algorithm with manual interference? As in June with #iranElection and previously with “Bieber”, proponents of these topics accuse Twitter of censorship when the topics do not reach trending status despite an apparent uptick in traffic.

Before assuming anything in either direction, it’s important to look at the data behind the trending topics in question. On December 6th, articles about this apparent manual interference started springing up, most notably Angus Johnston’s blog post. In that post, he cites the traffic patterns from Trendistic on “Wikileaks” versus “Sundays”, and posits that the major uptick in traffic should have made the term a trending topic.

By this logic, however, shouldn’t “lol” have shown up in trending topics on December 5th?

Still, our own data collection similarly shows that, among popular topics on Twitter, Wikileaks was significant, beyond any of the trending topics on December 5th.

Data collection began on the afternoon of December 6th and continued overnight through Tuesday afternoon. The method of collection was the Search API (http://search.twitter.com/), which has the nifty ability to go back up to 1,500 previous Tweets, thus ensuring that any reasonably efficient algorithm will be able to pick up the majority of Tweets actually posted under the term. Ultimately, 781,092 Tweets and 465,065 Users were collected, which is a relatively normal breakdown of data (a roughly 2:1 Tweets/Users breakdown is consistently seen in datasets that are based on any particular term in an organic, non-spamming environment). The following table shows the raw volume of Tweets for every term collected, a combination of trending topics, political terms, and general high-volume terms. It also includes the Tweets/Users ratio (T/U), which can be used to quickly gauge the legitimacy of any particular trending topic in terms of spam:

+-----------+-------------+-----------+--------------------+
|     Users |      Tweets |       T/U |               Term |
+-----------+-------------+-----------+--------------------+
|      1615 |        1719 |     1.064 | RIP+Mark           | 
|      1473 |        1881 |     1.277 | Adnet              | 
|       587 |        1891 |     3.221 | VnezuelaLovesBiebs | 
|       970 |        1968 |     2.029 | Cassiopeia         | 
|      1718 |        1983 |     1.154 | Dailey             | 
|      2700 |        2844 |     1.053 | Introducing+Nexus  | 
|      1993 |        3470 |     1.741 | Varanasi           | 
|      2778 |        3895 |     1.402 | #thingsimiss       | 
|      1737 |        4263 |     2.454 | #budget11          | 
|      4137 |        5443 |     1.316 | SDK                | 
|      2973 |        5891 |     1.981 | #cuandoyoerachico  | 
|      4636 |        6468 |     1.395 | Don+Meredith       | 
|      6371 |        7274 |     1.141 | Assange+Arrested   | 
|      2806 |        8612 |     3.069 | #wearejonasfans    | 
|      8949 |       12240 |     1.368 | congress           | 
|      8162 |       12252 |     1.501 | Rubin              | 
|     10725 |       18815 |     1.754 | #cablegate         | 
|     21240 |       26975 |     1.270 | Pearl+Harbor       | 
|     29555 |       37745 |     1.277 | Elizabeth+Edwards  | 
|     24232 |       63502 |     2.621 | #lemmeguess        | 
|     44269 |       69468 |     1.569 | assange            | 
|     38242 |       78504 |     2.053 | Obama              | 
|     67548 |       95729 |     1.417 | #alliwant          | 
|     64892 |      103515 |     1.595 | wikileaks          | 
|     62541 |      110045 |     1.759 | lol                | 
+-----------+-------------+-----------+--------------------+
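
The T/U column is simply Tweets divided by Users. As a minimal sketch, using a few rows copied from the table above and a cutoff of 3.0 that is purely an assumption for illustration (no threshold is stated anywhere in this analysis), the ratio can be computed and checked like this:

```python
# Rows copied from the table above: (term, users, tweets).
rows = [
    ("VnezuelaLovesBiebs", 587, 1891),
    ("#wearejonasfans", 2806, 8612),
    ("#cablegate", 10725, 18815),
    ("wikileaks", 64892, 103515),
]

# Hypothetical cutoff, chosen only for illustration; not a known Twitter value.
SPAM_RATIO_THRESHOLD = 3.0


def tweets_per_user(users, tweets):
    """Average number of Tweets posted per distinct User for a term."""
    return tweets / users


def summarize(rows, threshold=SPAM_RATIO_THRESHOLD):
    """Return (term, ratio rounded to 3 places, looks_concentrated) per row."""
    summary = []
    for term, users, tweets in rows:
        ratio = tweets_per_user(users, tweets)
        summary.append((term, round(ratio, 3), ratio >= threshold))
    return summary


for term, ratio, concentrated in summarize(rows):
    label = "few users, many tweets" if concentrated else "broad participation"
    print(f"{term:20s} T/U={ratio:.3f}  {label}")
```

A high ratio only suggests that a small group is doing most of the posting; it is a rough signal rather than proof of spam.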

Clearly, Wikileaks is a top contender, and its Tweets/Users ratio does not suggest spamming by a small number of accounts. In short, we could reasonably say from an admittedly cursory review of the data that the high-volume, largely organic conversation about Wikileaks dominated the Twittersphere on December 5th. In contrast to something like VnezuelaLovesBiebs (where relatively fewer Users were Tweeting more often), it may even seem that Wikileaks was a “truer” trending topic. However, something that the “Bieberites” have grasped, and that is coming more slowly to the Wikileaks crowd, is the power of new hashtags. Perhaps this is the reason behind #cablegate, which did become a trending topic. Judging from data on previous traffic, it would seem that one major difference stands between #cablegate and Wikileaks as a trending topic: the velocity of volume relative to previous activity.
Although #cablegate was mentioned as much as a week before matching velocity with Wikileaks traffic, it lacked the long-term historical context that Wikileaks had, which likely factors into the popularity of a topic.

The problem with claiming that Twitter is blacklisting any particular term ultimately comes down to a likely misunderstanding of the mechanics behind trending terms. Volume alone does not dictate what is trending - if that were the case, barring any stop-word list, we would see terms like “lol” and “Bieber” constantly trending, and, ultimately, the lack of churn in trending topics would make them a useless feature on the site. For this reason, the major component is likely the velocity of volume rather than volume itself (the algorithm itself is not publicly known). Beyond that, other major factors are likely the Tweets/Users ratio or some similar metric used to determine the rough “organic-ness” of the topic, the average account age of Users who are posting, and similar metrics that can establish what really counts in the Twittersphere. In some cases, such as the #iranElection accusations in June, there simply wasn’t enough volume to make any legitimate claim (beyond the significant historical context of the term, as well as the newness of the algorithmic shift in May 2010).
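
To make the volume-versus-velocity distinction concrete, here is a minimal sketch. It is not Twitter's algorithm, which is not public; the scoring function, the `floor` parameter, and the hourly counts are all invented for illustration. It scores a term by how far its current volume exceeds its own historical baseline:

```python
from statistics import mean


def velocity_score(history, current, floor=1.0):
    """Ratio of the current window's Tweet count to the term's own baseline.

    history: Tweet counts from earlier windows (hypothetical hourly buckets).
    current: Tweet count in the window being scored.
    floor:   assumed minimum baseline, so brand-new terms do not divide by zero.
    """
    baseline = max(mean(history), floor) if history else floor
    return current / baseline


# Invented counts, for illustration only (not measured data).
established_term = {"history": [9000, 9500, 10000, 9800], "current": 12000}
new_hashtag = {"history": [50, 40, 60, 50], "current": 3000}

for name, counts in [("established term", established_term),
                     ("new hashtag", new_hashtag)]:
    score = velocity_score(counts["history"], counts["current"])
    print(f"{name:17s} volume={counts['current']:6d}  velocity={score:.2f}")
```

Under this kind of scoring, the established term has four times the raw volume yet scores far lower, because it is measured against a much higher baseline of everyday conversation.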

In this case, however, there seems to be a legitimate claim to a conversation that perhaps should have been a trending topic. The question is not whether or not the term was blacklisted - it’s absolutely against Twitter’s interest to take part in that type of activity, and with such open access to their data, it would be a relatively tractable task to uncover any major manual interference. Rather, the question is whether or not this is the trending algorithm that we as users want. Churn is important for Twitter the company, but volume, holding against any trivial trends, is possibly a larger factor in some cases for Twitter the communications platform.

Twitter published their own post about this subject in response to the latest outcry over Wikileaks not trending, and neatly summed up their algorithm favoring velocity above all else by finishing with the following:

“Sometimes a topic doesn’t break into the Trends list because its popularity isn’t as widespread as people believe. And, sometimes, popular terms don’t make the Trends list because the velocity of conversation isn’t increasing quickly enough, relative to the baseline level of conversation happening on an average day; this is what happened with #wikileaks this week.”

With all due respect to the other trending topics, which on this day largely dealt with the death of Elizabeth Edwards and the attack in Varanasi, perhaps the Wikileaks story should have been trending, and perhaps the algorithm is due for some form of an overhaul to balance the needs of Twitter the company and Twitter the communications platform. The question, then, is not whether or not Wikileaks is being discriminated against, but whether or not we value that algorithmic discrimination as users.