Wednesday, December 21, 2011

Preventing your site from being indexed, the right way

The robots exclusion protocol (REP), or robots.txt, is a text file webmasters create to instruct robots (typically search engine robots) on how to crawl pages on their website.
Some people use robots.txt files to try to keep pages out of the search engines' results.
But that's not what robots.txt does: it prevents crawling, not listing. A page blocked by robots.txt can still show up in the results as a bare URL, because the search engines learn about it from links on other sites.
So, if you want to effectively hide pages from the search engines, and this might seem contradictory, you need to let them crawl those pages. Why? Because only when they crawl those pages can you tell them not to index or list them. There are two ways of doing that:

Using a meta tag on every page
By using a robots meta tag, like this:
<meta name="robots" content="noindex,nofollow"/>

The issue with a tag like that is that you have to add it to each and every page. That's why the search engines came up with the X-Robots-Tag HTTP header.
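The X-Robots-Tag carries the same directives as the meta tag, but as an HTTP response header, so it applies to any response the server sends (including non-HTML files like PDFs). A response using it would look roughly like this:

```http
HTTP/1.1 200 OK
Content-Type: text/html
X-Robots-Tag: noindex, nofollow
```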

Setting up for a whole site using IIS
If your site is running on IIS, you can set this for the whole site by adding a custom HTTP header:
IIS 6: right-click on the "Web Sites" folder > Properties > HTTP Headers
IIS 7: on the site's home screen, double-click "HTTP Response Headers", then choose "Add"
Name: X-Robots-Tag
Value: noindex, nofollow
This has the effect that the entire site can still be crawled, but will never be shown in the search results.
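If you prefer editing configuration files over the IIS manager UI, the IIS 7 steps above can also be expressed in web.config. A sketch of what that fragment would look like (element names per the IIS 7 configuration schema):

```xml
<configuration>
  <system.webServer>
    <httpProtocol>
      <customHeaders>
        <!-- Added to every response the site sends -->
        <add name="X-Robots-Tag" value="noindex, nofollow" />
      </customHeaders>
    </httpProtocol>
  </system.webServer>
</configuration>
```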
So, get rid of that robots.txt file with Disallow: / in it, and use the X-Robots-Tag instead!
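To check that the header is actually being sent, you can fetch a page and inspect the response. A minimal self-contained sketch, using a throwaway local HTTP server as a stand-in for your site:

```python
# Demo: a tiny local server that adds X-Robots-Tag to every response
# (standing in for an IIS site configured as above), plus a client
# request that verifies the header is present.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # The same header the IIS configuration would add site-wide:
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>Hidden from search results</body></html>")

    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), NoIndexHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://127.0.0.1:{port}/") as resp:
    tag = resp.headers.get("X-Robots-Tag")

server.shutdown()
print(tag)  # noindex, nofollow
```

Against a live site, the quick equivalent is looking at the response headers in your browser's developer tools, or any HTTP client that shows headers.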
