OpenNet Initiative: Bulletin 005

Probing Chinese search engine filtering

August 19, 2004
Last Updated: August 19, 2004
http://www.opennetinitiative.net/bulletins/005/

Contents:
- Background
- Methodology & Results
- Conclusions
- About the OpenNet Initiative

Background

As Internet content filtering practices proliferate around the world, the companies that manufacture the technologies employed or implicated in such practices have been subject to increasing criticism. Nowhere is this tension more evident than in the case of foreign direct investment in China, where Western corporations have come under fire from human rights activists for their compliance with China's Internet censorship and surveillance policies (1).

A recent Reporters Without Borders (RSF) bulletin (2) draws attention to the practice of search engine filtering in China, with a focus on Baidu.com and Yisou.com, two popular Chinese search engines. Although initially home-grown, the two search engines have attracted interest from American companies anxious to penetrate China's massive IT market. Google has recently invested in Baidu, while Yisou is owned by Google's main search engine rival, Yahoo!. The RSF bulletin admonishes both companies for complying with China's policies regarding self-censorship and, in doing so, acting contrary to the spirit of both US legislation -- in particular the Global Internet Freedom Act -- and broader principles of human rights.

The RSF bulletin provides some preliminary examples of how content filters are configured in each of the search engines in China. Following up on the RSF bulletin, we conducted our own probe of both Yisou and Baidu; the methodology and results are set out below.

Methodology & Results

The first step in our probe was to determine the geographical location of the servers in question. An examination of the IP addresses to which Yisou and Baidu resolve, using APNIC records and TCP traceroute tools, shows that they are hosted on CNCNET and CNCGROUP respectively, two major backbone networks in China. Using the same methods, we confirmed that the same holds for Yahoo!'s Chinese portal, which is a separate entity from Yisou (even though Yahoo! owns the latter). (1)
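
The lookup step described above can be approximated with a short script. What follows is a minimal Python sketch, not the tooling used for this bulletin: it resolves a hostname to its IPv4 addresses, queries a whois server for each address, and pulls out a few registration fields. The function names, the choice of whois.apnic.net as the default server, and the list of fields are assumptions made for illustration; the traceroute step is not covered.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def whois_query(ip, server="whois.apnic.net", port=43, timeout=10.0):
    """Send one whois query (query text + CRLF) and read until the server closes."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(ip.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_whois_fields(text, wanted=("inetnum", "netname", "descr", "country")):
    """Extract the first occurrence of each wanted 'key: value' attribute."""
    fields = {}
    for line in text.splitlines():
        # Skip comment lines and anything that is not a key/value attribute.
        if line.startswith(("%", "#")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in wanted and key not in fields:
            fields[key] = value.strip()
    return fields
```

A caller would loop over resolve_ipv4(hostname) and pass each address through whois_query and parse_whois_fields. A hostname may resolve to several addresses, so each one needs to be checked separately.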

The geographical location of the servers is significant because it adds a layer of complexity to determining the nature of the filtering in question. As has been reported by ONI researchers and in numerous media accounts, China employs its own extensive Internet content filtering technologies at the Internet cafe, Internet Service Provider (ISP), backbone, and International Gateway/Exchange Point levels -- in other words, upstream from the search engine servers themselves. Simply entering keywords therefore cannot reveal whether filtering is taking place at the search engine level or further upstream. To determine whether the search engines themselves are filtering, we need to unravel and clearly differentiate any overlapping layers of filtering.

Filtering by Keyword
We conducted our own searches for various key words reported by RSF as not returning any results. We found that, as reported, a search for free tibet in Yisou does not return any results. (2) In fact, a search for anything along with the words free and tibet does not return any results. In addition, a search for free tibet in cn.websearch.yahoo.com does not return any results. (3) However, a search for freetibet, free tibetan or freedom tibet does return results. A search for freetibet at cn.websearch.yahoo.com returns the expected results, such as freetibet.org, whereas the same search in Yisou returns 5 results, none of which are the sites one would expect. Contrary to the results returned by other search engines, none of the results returned by Yisou appear to contain information relating to Tibet; they are instead sites with lists of email addresses and a Taiwanese heavy metal forum. (4)

De-Listing Results
A search for free tibet in Baidu does yield results, the first of which is tibet.com, the official website of the Tibetan government in exile. (5) However, a search in Baidu for falun does not yield any results. (6) But a search with the modifier inurl:falun does return results. (7) In addition, searches for inurl:falundafa (8) and site:falundafa.ca (9) return one result. Similar behaviour occurs with a search for hrichina. While hrichina returns no results (10), inurl:hrichina (11) and site:hrichina.org (12) do return some results.

These results are significant in several ways. First, they indicate that "banned" sites are indexed, just not shown when certain combinations of words are searched for. In other words, searches for key words should cause certain URLs to be listed in the results because they have been indexed by the search engine. The search engine has URLs that contain these key words in the domain name itself, but is not listing them in the results. It appears that the search engine itself is filtering the returned results when certain words are searched for.
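
The inference in the paragraph above can be written down as a small decision rule. The Python sketch below is illustrative only: the function names and result labels are hypothetical, and the observations fed to it are simply whether each query returned any results, as reported in this bulletin for hrichina on Baidu. It does not query any search engine.

```python
def query_variants(keyword, domain=None):
    """Build the plain query and the operator-based variants used in the probe."""
    variants = {"plain": keyword, "inurl": f"inurl:{keyword}"}
    if domain:
        variants["site"] = f"site:{domain}"
    return variants


def interpret(has_results):
    """Classify a keyword from a mapping of variant name -> 'returned any results'."""
    if has_results["plain"]:
        return "listed"
    # Any operator query that returns results shows the URLs are in the index.
    if any(hit for name, hit in has_results.items() if name != "plain"):
        return "indexed-but-suppressed"
    # No results anywhere: the site may be unindexed, or upstream filtering
    # may have interfered, so nothing can be concluded.
    return "inconclusive"


# Observations reported above for Baidu: hrichina returns nothing, while
# inurl:hrichina and site:hrichina.org both return results.
queries = query_variants("hrichina", domain="hrichina.org")
observed = {"plain": False, "inurl": True, "site": True}
verdict = interpret(observed)  # "indexed-but-suppressed"
```

The "inconclusive" branch matters because an empty result page can also be produced by upstream interference rather than by the search engine itself.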

Second, the results suggest that the search engine spiders (presumably behind the Great Firewall) are able to crawl and index sites that are (mostly) blocked in China. Search engines use automated "crawlers" to methodically browse and index the World Wide Web. If these "crawlers" are behind China's national filtering regime, they should not be able to index sites that are blocked, because they would not be able to access these sites. It is generally believed that China's national filtering regime is centralized and uniform, but the fact that the search engines' spiders are able to index some of these "blocked" sites indicates that there may be holes in the Great Firewall, that the crawlers have a special exemption from the filtering regime, or that the crawlers operate from a third country.

Third, the results show that the cached copies of blocked websites (tibet.com, falundafa.ca, hrichina.org) are sporadically accessible. So not only have the crawlers been able to access and index blocked sites, they have been able to make cached copies as well. However, it appears that the problems in accessing these cached copies are due to interference from upstream filtering (discussed below).

Interference from Upstream Filtering
There is no concrete publicly available information about the specific workings of the Great Firewall of China. So by "blocked in China" we mean a high probability that an average user in China will be unable to access the content they've requested due to filtering by the Chinese authorities. It is beyond the scope of this bulletin to address the complexities of Internet filtering in China as a whole, a subject we will address in future reports in more detail. Instead, we will limit ourselves in this bulletin to observed behavior of search engine filtering.

One anomaly we found while checking the RSF list of blocked sites concerned searches for "Huang Qi", which RSF reported as turning up empty when searched for through Baidu, but which returned results when we did the same. We have determined that this anomaly is likely due to a peculiar form of interference from upstream filters. An inbound request containing a "banned" key word, in our case the word "falun", in a standard HTTP GET request is often disrupted, whether or not the server being connected to has the requested content.

$ telnet 202.43.217.94 80
Trying 202.43.217.94...
Connected to 202.43.217.94.
Escape character is '^]'.
GET /falun HTTP/1.0
Connection closed by foreign host.

Rather than receiving a 404 (file not found) error, the connection is simply dropped. No HTTP headers are returned. Often, attempts to reconnect will also be dropped if the previous connection was terminated as a result of a request that contains banned key words. Thus access to the host IP, in our case Yisou (202.43.217.94), is effectively blocked for a varying period of time. (Similar behavior was observed on Baidu, but with more inconsistency. Note that using a domain name, e.g. baidu.com, will be even more inconsistent, as it will resolve to more than one IP address.) This behavior can be re-created with varying success on most servers located in China.
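
The telnet session above can be reproduced programmatically. The following Python sketch is an assumption-laden illustration rather than the method used for this bulletin: it sends the same bare HTTP/1.0 GET and reports whether the peer answered, closed the connection without sending data, reset it, or timed out. The function names and outcome labels are hypothetical, and errors raised while connecting are left to the caller.

```python
import socket


def classify_probe(sock, path, timeout=10.0):
    """Send a bare HTTP/1.0 GET on a connected socket and classify the outcome.

    Returns (outcome, status_line); status_line is empty unless data arrived.
    """
    sock.settimeout(timeout)
    request = f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")
    try:
        sock.sendall(request)
        data = sock.recv(4096)
    except (ConnectionResetError, BrokenPipeError):
        # The connection was torn down abruptly, e.g. by an RST.
        return "reset", ""
    except socket.timeout:
        return "timeout", ""
    if not data:
        # Orderly close with no HTTP headers at all.
        return "closed-without-data", ""
    status_line = data.split(b"\r\n", 1)[0].decode("latin-1")
    return "response", status_line


def probe(host, path, port=80, timeout=10.0):
    """Connect to host:port and run classify_probe; connection errors propagate."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return classify_probe(sock, path, timeout)
```

Probing a fixed IP address rather than a domain name keeps repeated runs pointed at the same server, which matters for the reason given above. An ordinary server with no such path would be expected to produce a "response" outcome with a 404 status line.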

The connection is disrupted because an RST packet is sent back to the user whose request contained the banned word. TCP provides a reset function for cases where a significant error occurs during a connection and TCP cannot initiate a graceful termination. In these instances an RST packet terminates the connection without acknowledgments. A packet sniffer shows that after the 3-way TCP handshake, an HTTP request for "GET /falun HTTP/1.0" is issued, followed by a retransmission of the same request. Instead of the expected response, an RST packet is sent. After sending the RST, the host (in this case Yisou, 202.43.217.94) then advertises a ZeroWindow size.
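
To make concrete what a packet sniffer inspects here, the Python sketch below decodes the fixed 20-byte TCP header and checks the two fields discussed above: the RST flag and the advertised window. It is a simplified illustration with hypothetical function names. It works on header bytes already captured by some other tool, ignores TCP options, and treats any segment advertising a window of 0 as a zero-window segment.

```python
import struct

# Flag bits in the low-order part of the 16-bit offset/flags field.
TCP_FLAGS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
}


def parse_tcp_header(header):
    """Decode ports, flags and window from the first 20 bytes of a TCP header."""
    if len(header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    (src_port, dst_port, _seq, _ack,
     offset_flags, window, _checksum, _urgent) = struct.unpack(
        "!HHIIHHHH", header[:20]
    )
    flags = {name for name, bit in TCP_FLAGS.items() if offset_flags & bit}
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "flags": flags,
        "window": window,
    }


def is_reset(header):
    """True if the segment carries the RST flag."""
    return "RST" in parse_tcp_header(header)["flags"]


def is_zero_window(header):
    """True if the segment advertises a receive window of 0."""
    return parse_tcp_header(header)["window"] == 0
```

Applied in order to the segments of a captured exchange, these two checks are enough to spot the pattern described above: a request, a reset in place of a response, and a zero window afterwards.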

Another request cannot be transmitted until the host advertises a non-zero window size. In effect, you cannot connect to the same host from the same IP address for a period of time. Using a packet sniffer, we can see what happens when attempting to reconnect to the same server: although a 3-way TCP handshake occurs, the host continues to advertise a ZeroWindow size and terminates the connection by sending an RST. This behaviour occurs for a varying period of time.
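
Because the block lasts for a varying period, one way to estimate its length is to retry at fixed intervals until a request succeeds. The Python sketch below shows that loop under stated assumptions: attempt is a caller-supplied, hypothetical function that returns True once a request free of banned key words receives a normal response, and the interval and attempt limit are arbitrary illustrative defaults, not values measured for this bulletin.

```python
import time


def time_until_reachable(attempt, interval=30.0, max_attempts=20,
                         sleep=time.sleep, clock=time.monotonic):
    """Call attempt() until it returns True.

    Returns the elapsed seconds at the first success, or None if every
    attempt failed. sleep and clock are injectable so the loop can be
    exercised without real waiting.
    """
    start = clock()
    for i in range(max_attempts):
        if attempt():
            return clock() - start
        # Wait before the next try, but not after the final one.
        if i < max_attempts - 1:
            sleep(interval)
    return None
```

The result is only accurate to within one interval, and a request containing a banned key word should not be used as the retry, since it could itself trigger a new block.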

This may explain why RSF reports no results in Baidu for "Huang Qi": an RST and ZeroWindow had been received for a previous request (possibly falundafa), and a new connection could not be made, resulting in the alert "The document contains no data." A search for "Huang Qi" does return results in Baidu, although not necessarily sites about the imprisoned cyber-dissident. (13)

Conclusions

Several conclusions follow from this limited probe. As RSF reports, the Yisou and Baidu search engines are indeed actively filtering keyword search requests, as evidenced by the differential results yielded by Yisou, Baidu and Yahoo!'s China portal search engine requests. Considering both Yahoo!'s and Google's investment relationships with these companies, it is certainly legitimate to raise questions of corporate responsibility and adherence to human rights, as RSF does in its report.

At the same time, this probe illustrates once again the inherently complex nature of content filtering practices, and in particular the overlapping filtering techniques now in place in China. Perhaps the furthest along in sophistication among censoring regimes, China has imposed a matrix of technical and non-technical methods to restrict and control information flows. These methods provide the context within which all other forms of Internet-based communications take place in China. Whatever filtering is imposed by search engines at lower levels, it is done prior to the upstream filtering imposed by the Chinese state itself. This multifaceted regime of Internet censorship needs to be taken into account when specific claims are made about the sources of content filtering in that country.

About the OpenNet Initiative

The OpenNet Initiative is a partnership of the Citizen Lab at the Munk Centre for International Studies, University of Toronto, the Berkman Center for Internet & Society at Harvard Law School, and the Advanced Network Research Group at the Cambridge Security Programme, University of Cambridge.

The OpenNet Initiative releases occasional bulletins based on our ongoing research. These bulletins are meant to be limited responses to current events, policy debates, and/or issues raised by our ongoing research that we feel justify immediate wider circulation. Our more detailed analyses can be found in our major reports.

www.opennetinitiative.net


Notes:

1. Walton, Greg. China's Golden Shield: Corporations and the Development of Surveillance Technology in the People's Republic of China. Available: http://go.openflows.org/CGS_ENG.PDF

2. Reporters Without Borders. Available: http://www.rsf.org/article.php3?id_article=11031