OpenNet Initiative: Bulletin 006

Google Search & Cache Filtering Behind China's Great Firewall

August 30, 2004
Last Updated: September 3, 2004
http://www.opennetinitiative.net/bulletins/006/

Contents:
- Background
- Methodology & Results
- Conclusions
- About the OpenNet Initiative

Background

The popular search engine Google was temporarily blocked by the government of China in September 2002. Chinese Internet users were unable to access the search engine, and requests for Google from within China were redirected to Chinese search engines. (1) At the time, Google issued a statement indicating that it was working with the Chinese authorities to restore access. (2) (3) China lifted the block on the Google search engine soon thereafter. (4)

In a recent interview, the co-founder of Google, Sergey Brin, stated that Google did not negotiate with the Chinese government to have Google unblocked. (5) Brin suggested that 'popular demand' forced the government to restore access to Google. Google is currently accessible to Internet users in China.

However, our tests show that while Google is accessible to Chinese users, not all of its functions are available; because of China's content filtering technologies, users of Google within China experience a much different Google than those outside. As our tests below indicate, China blocks access to the Google cache and to searches that contain certain keywords. (6) Neither China's keyword filtering nor the mechanism used to filter the Google cache is specific to Google. China's usual Internet controls apply when users search for specific keywords in Google: their connection to Google is terminated and they receive no results from their search. Thus, while China's filtering affects Google searches, the system is entirely independent of Google.

The same keyword filtering technique blocks access to the Google cache. Google takes a snapshot of each page it indexes; users can click the 'cached' link in their search results to access a copy of the web page. Accessing Google's cache in this way is a well-known method of ad hoc circumvention of Internet censorship. The user connects only to Google's servers to retrieve the cached copy of a blocked web page; since no connection is made to the blocked site itself, the user can view its content. China's filtering of Google's cache limits this method of circumvention, but not entirely. As we describe below, users can still access Google's cache, but only for web sites that do not contain filtered keywords in the URL.

Methodology & Results (7)

The ONI connected to 37 Google servers (each with a different IP address) from 11 remote computers located on 4 different backbone networks in China and from our remote testing facility at the University of Toronto. Each Google server was accessible from China. However, when a request to one of these same servers was a request for the Google cache, the connection failed. Because the servers remained reachable for ordinary requests, China is not filtering the Google cache by IP address or by domain name.

Our tests indicate that to disable the Google cache, China implements a filtering mechanism that disrupts access to any server if the text string 'search?q=cache' exists in the HTTP GET request. Moreover, this filtering occurs whether the server is Google's or not. Any request with 'search?q=cache' is disrupted. For example, a connection to http://ice.citizenlab.org/search?q=cache is also blocked. (This disruption occurs only if 'search?q=cache' appears in the HTTP GET request; the request is not blocked if the text appears in the body of the page.)
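The matching behaviour described above can be sketched as a plain substring test on the text of the request. This is a minimal model for illustration only, assuming the filter inspects the request line and nothing else; the names request_line and is_disrupted are hypothetical, the ordinary search query is an invented example, and the sketch makes no claim about how the actual filtering system is implemented.

```python
# Illustrative model only: the real filter's implementation is not known.
BLOCKED_STRING = "search?q=cache"  # the string reported in this bulletin


def request_line(target: str) -> str:
    """Build the first line of an HTTP GET request for a path and query."""
    return f"GET {target} HTTP/1.1"


def is_disrupted(request: str) -> bool:
    """Return True if the modelled filter would disrupt this request."""
    return BLOCKED_STRING in request


# The destination host plays no part in the decision, so a request sent to
# ice.citizenlab.org is treated exactly like one sent to a Google server.
cache_request = request_line("/search?q=cache")
ordinary_request = request_line("/search?q=news")  # hypothetical ordinary search

print(is_disrupted(cache_request))     # True
print(is_disrupted(ordinary_request))  # False
```

Because the test looks only at the request text, the model yields the same verdict for every server, which is consistent with the observation that the disruption is not tied to Google's IP addresses or domain names.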

As previously discussed in ONI Bulletin 005, China employs packet sniffing technology to scan HTTP GET requests for certain keywords. If keywords are found, an RST packet is sent to the requesting computer to terminate the connection. (8) This technique effectively blocks access to Google's cache.

[Figure: results showing that Google's servers are accessible.]

[Figure: results showing that requests to Google's servers for cached content fail.]

However, this filtering mechanism is simple to defeat. The ampersand symbol '&' is used to concatenate key/value pairs in HTTP GET and POST queries. Although the ampersand is usually not prepended to the first key/value pair, it can be added without disrupting the request. Thus, if the blocked text string 'search?q=cache' is modified to read 'search?&q=cache', Google will properly process the request and return the cache result to the user. Since the blocked text string has been modified, though, China's filtering system will not block the request to Google's cache. Thus, China's filtering of the Google cache can be circumvented by adding a single character to the request. (9)
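The effect of the extra ampersand can be illustrated with Python's standard URL parser. The sketch below is a hedged example: www.example.com is a placeholder for the cached page, query_parameters is a hypothetical helper, and the fact that Python's parser skips the empty key/value pair only illustrates common parsing behaviour. It says nothing about how Google's own servers parse the query.

```python
from urllib.parse import parse_qs, urlsplit

BLOCKED_STRING = "search?q=cache"  # the string reported in this bulletin

# www.example.com is a placeholder for the page whose cached copy is wanted.
original = "http://www.google.com/search?q=cache:www.example.com/"
modified = "http://www.google.com/search?&q=cache:www.example.com/"


def query_parameters(url: str) -> dict:
    """Parse the query string of a URL into a dict of parameter lists."""
    return parse_qs(urlsplit(url).query)


# The leading '&' creates an empty pair, which the parser skips, so both
# URLs carry the same single parameter q.
print(query_parameters(original) == query_parameters(modified))  # True

# Only the unmodified URL contains the blocked string.
print(BLOCKED_STRING in original)  # True
print(BLOCKED_STRING in modified)  # False
```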

[Figure: results showing that modified requests to Google's servers for cached content are accessible.]

There are, however, some limitations to this circumvention technique. For example, the technique does not work for cached copies of pages whose domain names or URL paths contain a banned keyword. Google appends the domain name and URL path of the cached content to the cache request; for example, '/search?q=cache:NyKCo1PdvUUJ:www.falundafa.org/+gong&hl=zh-CN'. Although the filter will no longer match 'search?q=cache' once the request string has been modified, it will still match the additional banned keyword, and access to the cached copy will be blocked. In our test case, the filter matched the banned keyword 'falun'.
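This limitation can be sketched by extending the substring model to more than one banned string. The list below contains only the two strings documented in this bulletin; the full list used by the filter is not known, and matched_strings is a hypothetical helper written for illustration.

```python
# Only these two strings are documented in this bulletin; the filter's
# complete keyword list is not known.
BANNED_STRINGS = ["search?q=cache", "falun"]


def matched_strings(request_target: str) -> list:
    """Return the banned strings found in a request target, in list order."""
    return [s for s in BANNED_STRINGS if s in request_target]


# Cache request taken from the example in the text.
original = "/search?q=cache:NyKCo1PdvUUJ:www.falundafa.org/+gong&hl=zh-CN"
# Apply the ampersand modification to the first key/value pair only.
modified = original.replace("search?q=", "search?&q=", 1)

print(matched_strings(original))  # ['search?q=cache', 'falun']
print(matched_strings(modified))  # ['falun']
```

The modified request avoids the first string but still carries the second in the appended domain name, so under this model it would still be disrupted.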

Other Search Engines' Cache Functions

Although the cache filtering mechanism that China employs would affect any webserver, our tests suggest that it was deliberately put in place to affect Google's cache. Other search engines with caching functionality that we checked are not subject to the same type of cache blocking. For example, both Yahoo! and Gigablast have caching functionality. The syntax for a cache request to Yahoo! is 'search/cache?p='. While similar to Google's cache string, Yahoo!'s is not filtered, even though it too can be used as an ad hoc form of circumvention. The Yahoo! cache also appends the requested domain to cache requests, so it is subject to the same limitations as the Google cache when used for circumvention.

The Gigablast search engine's syntax for a cache request is 'get?q=' and it too is not filtered. However, Gigablast does not append the requested domain to cache requests. As a result, if a blocked site is returned in the search results, the cached copy can be accessed even if the domain contains banned keywords, making its cache function an effective form of circumvention.
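The comparison between the three engines can be summarised in a short sketch. The cache syntaxes and the domain-appending behaviour are taken from the text above; the CacheSyntax class and the summary function are hypothetical names introduced for illustration, and the check is the same simple substring model, not the filter's actual logic.

```python
from dataclasses import dataclass

BLOCKED_STRING = "search?q=cache"  # the string reported in this bulletin


@dataclass
class CacheSyntax:
    engine: str
    request_syntax: str   # cache request syntax as given in the text
    appends_domain: bool  # does the cached page's domain appear in the request?


ENGINES = [
    CacheSyntax("Google", "search?q=cache", True),
    CacheSyntax("Yahoo!", "search/cache?p=", True),
    CacheSyntax("Gigablast", "get?q=", False),
]


def summary() -> list:
    """Return (engine, syntax_is_filtered, domain_visible_to_filter) tuples."""
    return [
        (e.engine, BLOCKED_STRING in e.request_syntax, e.appends_domain)
        for e in ENGINES
    ]


for engine, syntax_filtered, domain_visible in summary():
    print(f"{engine}: syntax filtered={syntax_filtered}, "
          f"domain visible to keyword filter={domain_visible}")
```

Under this model only Google's cache syntax matches the blocked string, and only Gigablast keeps the cached page's domain out of the request.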

[Figure: results showing that requests to other search engine servers and their cached content are accessible.]

Conclusions

Although China no longer blocks Google in its entirety, a Chinese user of Google can potentially have a much different Google experience than one from another country due to China's content filtering practices. Chinese Internet users' access to Google is filtered for specific keywords, and this filtering disrupts Google searches as well as access to the Google cache. The isolation and filtering of the text string 'search?q=cache' suggest that the Chinese state knows how access to Google's cache can be used as an ad hoc form of censorship circumvention and has taken steps to limit it accordingly. The fact that other search engines' cache functions are not filtered suggests that China has deliberately targeted Google for filtering.

About the OpenNet Initiative

The OpenNet Initiative is a partnership of the Citizen Lab at the Munk Centre for International Studies, University of Toronto, the Berkman Center for Internet & Society at Harvard Law School, and the Advanced Network Research Group at the Cambridge Security Programme, University of Cambridge.

The OpenNet Initiative releases occasional bulletins based on our ongoing research. These bulletins are meant to be limited responses to current events, policy debates, and/or issues raised by our ongoing research that we feel justify immediate wider circulation. Our more detailed analyses can be found in our major reports.

www.opennetinitiative.net


Notes:

1. http://cyber.law.harvard.edu/filtering/china/google-replacements/

2. http://news.bbc.co.uk/1/hi/technology/2231101.stm

3. http://www.newsfactor.com/perl/story/19279.html

4. http://www.theregister.co.uk/2002/09/12/google_china_crisis_over/

5. https://www.ipo.google.com/data/prospectus.html#toc59330_25b

6. Other browsers may respond to the filtering behavior differently.

7. Google cache filtering is not new, and knowledge of how to circumvent it has been available to Chinese users since 2003. A report in Chinese outlining how to overcome Google cache filtering can be found here: kmnet-net.pdf

8. http://www.opennetinitiative.net/bulletins/005/

9. Seth Finkelstein has described such circumvention by manipulating query strings in order to access Google's cache in the context of N2H2's BESS filtering software. http://sethf.com/anticensorware/bess/google.php