Thursday, August 16, 2012

Blocking Crawlers With Robots.txt

The robots.txt standard is a plain text file
placed in the root of the server's HTML
directory. For example, if I did not
want an entire site to be
indexed, I would make a file that
would be found at /robots.txt, directly
off the site's root.
An engine respecting the standard
will ask for the file before trying to
index any page within the site. To
exclude the entire site, the file would contain:
User-agent: *
Disallow: /
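You can verify what these two lines do without waiting for a crawler. The article doesn't mention it, but Python's standard-library `urllib.robotparser` module implements the same convention, so it makes a handy checker (example.com here is just a placeholder domain):

```python
from urllib import robotparser

# Feed the "block everything" rules from above to the stdlib parser.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# A compliant crawler may not fetch any page on the site.
print(rp.can_fetch("*", "http://example.com/"))          # False
print(rp.can_fetch("*", "http://example.com/page.htm"))  # False
```

Any URL you test against these rules comes back disallowed, which is exactly the behavior a standard-respecting engine would show.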
The user-agent portion lets you
specify the engines or browsers that
should obey the lines that follow. Chances
are, you want them all to do so; the
* is a wildcard meaning everything.
The disallow portion is where you
specify directories or file names. In
the example above, the / is used to
protect everything within the site.
You can also be more specific and
block particular directories or pages:
User-agent: *
Disallow: /webmasters/
Disallow: /access/
Disallow: /classroom/stats.htm
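The same stdlib checker (again, my own sketch, not something from the article, with example.com standing in for your domain) shows how these three rules mix blocked and allowed pages:

```python
from urllib import robotparser

# The more specific rules from the example above.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /webmasters/",
    "Disallow: /access/",
    "Disallow: /classroom/stats.htm",
])

# Pages under the blocked directories, or named exactly, are off-limits...
print(rp.can_fetch("*", "http://example.com/webmasters/faq.htm"))   # False
print(rp.can_fetch("*", "http://example.com/classroom/stats.htm"))  # False
# ...but other pages in /classroom/ remain crawlable.
print(rp.can_fetch("*", "http://example.com/classroom/intro.htm"))  # True
```

Note that `Disallow: /classroom/stats.htm` blocks only that one file, while the directory rules block everything beneath them.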
Now engines respecting the
standard will not index anything
under the /webmasters/ or /access/
directories, and the /classroom/stats.htm
page is also blocked.
Because the robots.txt file must go in
the server's root directory, many of
those using free web space will not
be able to use it; you cannot simply
put the file within your own space.
With a free host such as AOL, for
example, a robots.txt file at the
server's root works, but one located
in a member's sub-directory does not.
Because of this problem, the meta
robots tag was created to help those
without access to the robots.txt file.
It is described in Blocking Search
Engines With The Meta Robots Tag.
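For a quick sense of what that article covers: the meta robots tag goes in an individual page's head section, so it works even when you cannot touch the server root. A typical "keep out" line looks like this:

```html
<!-- Placed inside a page's <head>; blocks indexing and link-following -->
<meta name="robots" content="noindex, nofollow">
```

Because it lives in the page itself, anyone who can edit their own HTML can use it.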
Security Issues
If you don't want something to be
accessed, don't put it on the web.
Period. Certainly don't expect the
robots.txt file to protect it. Not every
search engine respects the
convention, though all the majors do.
More importantly, humans may take
advantage of the file. All anyone has
to do is enter the address of your
robots.txt file, and they can read the
contents in their web browser. They
can see exactly what you consider
off-limits for spiders, which
sometimes also means off-limits for humans.
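The point is easy to demonstrate: a few lines of Python (my own illustration, using the example rules from earlier) can pull every Disallow path out of a robots.txt body, which is exactly the "roadmap" a curious visitor sees in their browser:

```python
# Anyone who fetches your robots.txt can list what you tried to hide.
robots_txt = """\
User-agent: *
Disallow: /webmasters/
Disallow: /access/
Disallow: /classroom/stats.htm
"""

disallowed = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("disallow:")
]
print(disallowed)  # ['/webmasters/', '/access/', '/classroom/stats.htm']
```

Every path you list is handed to the reader, which is why sensitive areas need password protection rather than a Disallow line.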
Consider this as you create your
robots.txt file. You don't want it to be
a roadmap to sensitive areas on your
server. If you do list them, password
protect the areas. Keeping them off
the web, of course, is the safest route
of all.
Other Notes
Occasionally, reports come in about
problems with having either a blank
robots.txt file or no robots.txt file at
all. In either case, the issue seemed to be
that because there was no valid
robots.txt file explicitly allowing
indexing of some or all pages within
the site, no pages were indexed at all.
This really shouldn't happen, but if
you are encountering problems with
getting indexed, try installing a
robots.txt file that allows some or all
of your pages to be indexed.
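A minimal "allow everything" file is just a wildcard user-agent with an empty Disallow line. As a sketch (again using the stdlib parser, which the article itself doesn't reference), you can confirm that an empty Disallow permits all paths:

```python
from urllib import robotparser

# An empty Disallow value means "nothing is disallowed" under the standard.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
])

print(rp.can_fetch("*", "http://example.com/any/page.htm"))  # True
```

Installing a file like this gives crawlers an explicit, valid robots.txt to fetch while blocking nothing.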
More Resources
The Web Robots Pages: The Robots
Exclusion Protocol
The official word on using a
robots.txt file.
The motto for the Robotcop project is
"robots.txt: it's the Law." The
robots.txt file is the mechanism that
web site owners can use to block
spiders from crawling all or portions
of their web sites. It's widely
recognized and honored by the major
crawlers, but it remains an unofficial
law. Even worse, it's a law with no
law enforcement agency. Enter
Robotcop. This is an open source
project designed to produce plug-in
modules for popular web servers. Did
a crawler just fly past your robots.txt
file? Robotcop can spot this and give
you a variety of options with real
teeth to them, such as blocking or
trapping the spiders. It is currently
available as a beta for Apache 1.3,
with plans to support Apache 2.0
and ISAPI web servers such as Zeus
and IIS in the future.
