Google.cn Filtering: How It Works

By: nart on 25 January 2006
Posted in China, Asia

Google has opened a new Chinese-language search engine at www.google.cn that filters out results from sites that are considered “sensitive” by the Chinese government. In addition to news.bbc.co.uk, search results are also filtered for the human rights groups hrw.org and hrichina.org and for the entire geocities.com free hosting community. This filtering is quite similar to the filtering conducted by domestic Chinese search engines.

The filtering takes place in two ways:

1. De-listed domains: specific websites are removed entirely from search results; it is as if the website never existed.
2. De-listed URLs: individual URLs are removed from search results if they contain a de-listed domain anywhere in the URL.
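The two rules can be sketched in a few lines of Python. This is only a rough model inferred from the search behaviour described in this post, not Google's actual implementation: the DELISTED set, the function names, and the choice to match a domain followed by “/” are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical block list for illustration; the real list is not public.
DELISTED = {"news.bbc.co.uk", "epochtimes.com"}


def host_is_delisted(url: str) -> bool:
    """Rule 1: the result's own hostname is on the block list."""
    host = urlsplit(url).hostname or ""
    return host in DELISTED


def url_mentions_delisted(url: str) -> bool:
    """Rule 2: a de-listed domain followed by "/" appears anywhere in the URL."""
    return any(domain + "/" in url for domain in DELISTED)


def is_listed(url: str) -> bool:
    """A result is shown only if neither rule applies (assumed behaviour)."""
    return not (host_is_delisted(url) or url_mentions_delisted(url))
```

Under this model a page on www.bbc.co.uk stays listed, a page on news.bbc.co.uk is dropped by the first rule, and a mirror page hosted elsewhere is dropped by the second rule if its path contains a de-listed domain.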

For example, the domain news.bbc.co.uk has been removed from www.google.cn. Using Google’s “site:” modifier, a search for “site:news.bbc.co.uk” in google.cn returns no results, and it appears as if there is no such website at all. In addition to Google’s usual text that appears when searching for a non-existent website, additional text appears informing the user that results have been removed to comply with local law.

However, using Google’s “inurl:” modifier, a search for “inurl:news.bbc.co.uk” does appear to return results, although they are not listed and are instead replaced with text informing the user that results have been removed to comply with local law. Furthermore, a search for “site:bbc.co.uk inurl:news” shows that although bbc.co.uk is indexed and searchable, the specific domain news.bbc.co.uk is not listed in the search results.

Another illustrative example is a search in google.cn for “site:www9.beijing999.com inurl:dmirror” versus the same search in google.com. google.com returns three results and lists all three, whereas google.cn reports three results but lists only two of them. The missing URL is “https://www9.beijing999.com/dmirror/http/mirror.epochtimes.com/gb/nf3154...” which contains the text “epochtimes.com/” in the URL path.

The website epochtimes.com is treated as a de-listed domain (site:epochtimes.com). However, a search with the “inurl:” modifier (inurl:epochtimes.com) does return results, although none of them are actually from the requested website. A search for “inurl:epochtimes.com/” (with a trailing slash) also reports results but does not list them for the user.

This fine-grained control allows google.cn to keep websites such as “epochtimes.com.ua” in its index while eliminating epochtimes.com. There is similar fine-grained control targeting Chinese-language content: while there are results for “site:faluninfo.net”, there are no results for “site:chinese.faluninfo.net”.
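To see why the matching has to be this precise, compare a naive substring test with a test that respects hostname label boundaries. The sketch below is illustrative only: the function names are made up, and whether google.cn also treats subdomains of a de-listed domain as de-listed is an assumption, not something established by the searches above.

```python
def naive_match(host: str, blocked: str) -> bool:
    """Substring test: over-blocks, because "epochtimes.com" occurs
    inside "epochtimes.com.ua"."""
    return blocked in host


def label_match(host: str, blocked: str) -> bool:
    """Match the blocked domain itself or (by assumption) any subdomain
    of it, but never a longer domain that merely starts with the same text."""
    return host == blocked or host.endswith("." + blocked)
```

With label_match, blocking epochtimes.com leaves epochtimes.com.ua untouched, and blocking chinese.faluninfo.net leaves faluninfo.net itself in the index, which is consistent with the behaviour observed above.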

To be clear, this filtering only affects www.google.cn; users who choose to access Google’s Chinese-language search engine at http://www.google.com/ig?hl=zh-CN are not subjected to this filtering.

While this filtering can be easily circumvented, most users will simply use google.cn, since users from China are redirected there by default.

Here are just some of the sites that have been de-listed by google.cn:

site:hrw.org
site:hrichina.org
site:boxun.com
site:tsquare.tv
site:freechina.net
site:rfa.org
site:news.bbc.co.uk
site:geocities.com
site:peacehall.com
site:64memo.com
site:voa.gov
site:falundafa.org
site:epochtimes.com
site:xinsheng.net
site:savetibet.org
site:bignews.org
site:topforum.com
site:omnitalk.com
site:laogai.org
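
Readers who want to repeat the comparison can generate the pair of search URLs for each query. The snippet below is a small convenience sketch: it assumes both engines accept the usual “/search?q=” query format, and the ENGINES table and comparison_urls name are illustrative. It only builds the URLs; it does not fetch or parse any results.

```python
from urllib.parse import urlencode

# Assumed search endpoints for the two engines being compared.
ENGINES = {
    "google.cn": "http://www.google.cn/search",
    "google.com": "http://www.google.com/search",
}


def comparison_urls(query: str) -> dict:
    """Return one search URL per engine for the same query string."""
    params = urlencode({"q": query})  # percent-encodes ":" and turns spaces into "+"
    return {name: f"{base}?{params}" for name, base in ENGINES.items()}


if __name__ == "__main__":
    for query in ["site:hrw.org", "site:www9.beijing999.com inurl:dmirror"]:
        for engine, url in comparison_urls(query).items():
            print(engine, url)
```

Opening both URLs side by side makes it easy to spot results that are counted but not listed.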

(Crossposted on ICE)