The Danger Of Robots.txt
2009-05-03 00:05:04
Sat, 08 Nov 08 14:50:39 -0700
Almost every source will tell you to use a robots.txt file, including the all-powerful Google.
The problem with that advice is security. Let's look at an example robots.txt file:
User-agent: *
Disallow: /rss
Disallow: /portal
Disallow: /search1
Disallow: /login
Disallow: /secret
See the problem here?
There are actually a few.
First of all, and most obviously, /secret is clearly a private directory.
It might be unprotected and contain any amount of sensitive information.
Even if it is protected, it gives an attacker another door to pound on.
/login is the same case: it hands an attacker an obvious login page, from which he can try anything from SQL injection to brute forcing. Neither is good for business.
A robots.txt file is public. Anyone can read it.
This is Google's
To view a robots.txt file, an attacker simply navigates to yourdomain.com/robots.txt.
This will usually give an attacker a whole host of information that can be used against your website, including pages that no link points to and that might otherwise stay unknown.
One of the first things an attacker will do when attempting to hack a website is view the robots.txt file and spider the site. This is called information gathering.
Every new piece of information an attacker gets, whether a directory, an error message, or a login page, brings him one step closer to a successful attack.
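To show how little effort this takes, here is a minimal Python sketch that pulls every Disallow path out of the contents of a robots.txt file. The function name disallowed_paths is made up for this illustration, and the input is the example file from above rather than anything fetched from a real site.

```python
def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path found in robots.txt content."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value:  # an empty Disallow blocks nothing, so skip it
                paths.append(value)
    return paths


# The example file from earlier in this post.
example = """User-agent: *
Disallow: /rss
Disallow: /portal
Disallow: /search1
Disallow: /login
Disallow: /secret
"""

print(disallowed_paths(example))
# → ['/rss', '/portal', '/search1', '/login', '/secret']
```

A dozen lines turn the file into a list of places to look first.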
The Better Option:
Robots meta tags.
For HTML sites:
<meta name="robots" content="noindex, nofollow">
And for XHTML sites:
<meta name="robots" content="noindex, nofollow" />
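If you want to confirm that a page really carries the tag, a rough check can be written with Python's standard html.parser module. This is only a sketch: RobotsMetaFinder and robots_directives are names invented for this example, and it assumes you already have the page's HTML as a string.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # Also called for XHTML-style self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


def robots_directives(html):
    """Return the robots meta directives found in an HTML string."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


html_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
xhtml_page = '<html><head><meta name="robots" content="noindex, nofollow" /></head></html>'

print(robots_directives(html_page))   # → ['noindex', 'nofollow']
print(robots_directives(xhtml_page))  # → ['noindex', 'nofollow']
```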
Often a robots.txt file will luck out and give away nothing harmful, for example when it only lists directories that are spiderable anyway, such as /images or /foobar. But it is never a good idea to use one unless you are highly knowledgeable in web security.
If you insist on having one for SEO purposes, try something like:
User-agent: *
Disallow:
or simply create an empty robots.txt file. No data exposure, no harm. The handling can be done with the meta tags.
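As a quick sanity check, Python's standard urllib.robotparser module can show how a parser reads these files. The helper name allows is made up for this sketch, and the rules are passed in as strings rather than downloaded from a site.

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt, path, user_agent="*"):
    """Report whether the given robots.txt content lets a crawler fetch path."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, path)


permissive = "User-agent: *\nDisallow:\n"
revealing = "User-agent: *\nDisallow: /secret\n"

print(allows(permissive, "/secret"))  # → True
print(allows("", "/secret"))          # → True
print(allows(revealing, "/secret"))   # → False
```

The permissive file and the empty file block nothing and name nothing, so a reader of the file learns nothing about your directory layout.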
Considering the security risks involved, the extra few bytes on some pages shouldn't be much of an issue.