Who's Knocking on the Door?
or, what you should know about
the Robots.txt file.
Hi! You say you're Scooter from AltaVista. Let me check my list. You're on the list - the door is open.
Knock, knock! Who's there? Arach. Arach who? Sorry, my list says you're from SearchButton.Com. You can't come in.
Robots.txt file
This little story parallels a robots.txt file's basic use on a website. However, there are other reasons to use the file. You may want one if you:
1. "optimize" the same page (same content) for different engines
2. use "optimizing techniques" not accepted by all engines
3. use "many" doorways
4. have been banned in the past
Now you say, "What is a robots.txt file?" Quoting from the help file of the RoboGen software:
ROBOTS.TXT, a file that spiders look in for information on how the site is to be
cataloged. It is an ASCII text file that sits in the document root of the server. It defines
which documents and/or directories conforming spiders are forbidden to index.
This text file helps in the indexing of the Internet because:
Search engines such as Excite and AltaVista use web spiders, also known as robots, to create the indexes for their search databases. These robots traverse HTML trees by
loading pages and following hyperlinks, and they report the text and/or meta-tag
information to create search indexes.
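Because the file always sits in the document root, a robot can work out where to look for it from the address of any page on the site. Here is a minimal sketch of that step in Python; the function name robots_txt_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap in /robots.txt as the path,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/products/list.html?page=2"))
# prints http://www.example.com/robots.txt
```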
To understand better, we need to know who, or what, these robots are and the other names they might be known by.
A robot is a program that automatically traverses the Web's hypertext structure by
retrieving a document, and recursively retrieving all documents that are referenced.
Note that "recursive" here doesn't limit the definition to any specific traversal algorithm;
even if a robot applies some heuristic to the selection and order of documents to visit
and spaces out requests over a long space of time, it is still a robot.
Normal Web browsers are not robots, because they are operated by a human, and don't
automatically retrieve referenced documents (other than inline images).
Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders.
These names are a bit misleading as they give the impression the software itself moves
between sites like a virus; this is not the case: a robot simply visits sites by requesting
documents from them.
From: http://info.webcrawler.com/mak/projects/robots/faq.html#what
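The "retrieve a document, then retrieve everything it references" idea can be sketched in a few lines of Python. This is only an illustration: the LINKS table is a made-up site held in memory, no real requests are sent, and breadth-first order is just one of the traversal orders the definition allows.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/index.html": ["/about.html", "/products.html"],
    "/about.html": ["/index.html"],
    "/products.html": ["/widget.html", "/about.html"],
    "/widget.html": [],
}


def crawl(start, links):
    """Visit every page reachable from start, each exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # a real robot would request the document here
        for target in links.get(page, []):
            if target not in seen:  # never request the same page twice
                seen.add(target)
                queue.append(target)
    return order


print(crawl("/index.html", LINKS))
# prints ['/index.html', '/about.html', '/products.html', '/widget.html']
```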
Though most of the time robots do no harm to a site on the web, there are times when they become a problem. These are the times the robots.txt file is good to have on your site. Here are some of the problems.
-Certain robot implementations can (and have in the past) overloaded networks and
servers. This happens especially with people who are just starting to write a robot;
these days there is sufficient information on robots to prevent some of these
mistakes.
-Robots are operated by humans, who make mistakes in configuration, or simply
don't consider the implications of their actions. This means people need to be
careful, and robot authors need to make it difficult for people to make mistakes
with bad effects.
-Web-wide indexing robots build a central database of documents, which doesn't
scale too well to millions of documents on millions of sites.
From: http://info.webcrawler.com/mak/projects/robots/faq.html#bad
It was inevitable that regulation of some kind was needed. Here is what the RoboGen help file had to say on the subject:
The robot exclusion protocol was introduced by Martijn Koster in 1994 to deal with
problems that had been arising due to the increasing popularity of the internet and the
toll web spiders were taking on system resources. Some of the problems were caused
by robots rapid-firing requests, that is, loading pages in rapid succession. Other
problems included robots indexing information deep in directory trees, temporary
information, and even accessing cgi-scripts. The robot exclusion protocol was quickly
adopted by webmasters and web robot makers as a way to organize and control the
indexing process.
Since then, the size of the Internet has increased dramatically and millions of people are
using it. The number of web robots crawling the web is greater than before and it is
more important than ever for all web sites to have a properly created and maintained
ROBOTS.TXT file.
For more information on the Robot Exclusion Protocol follow this link:
http://info.webcrawler.com/mak/projects/robots/exclusion.html#robotstxt
Now that you understand a little more about what the file is, we'll cover how to make one. It's not hard. Open Notepad and type in the lines as described in the Protocol above. Then save it as an ASCII text file named robots.txt and put it in the document root of your server. I see you're wondering what this file should look like. Here is a simple file that lets all robots into the site but tells them to stay out of the three directories mentioned:
User-agent: *    # the * means all robots
Disallow: /cgi-bin/
Disallow: /_Private/
Disallow: /logs/
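If you'd like to see how a well-behaved robot would read the sample file above, Python's standard urllib.robotparser module can parse the rules and answer "may I fetch this?" questions. This is a rough sketch, not how any particular search engine does it; the robot name ExampleBot and the page paths are invented for the test.

```python
from urllib.robotparser import RobotFileParser

# The sample file from the article, held in a string instead of on a server.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /_Private/
Disallow: /logs/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, robot="ExampleBot"):
    """Return True if the (hypothetical) robot may fetch the given path."""
    return parser.can_fetch(robot, path)


print(allowed("/index.html"))         # True
print(allowed("/cgi-bin/search.pl"))  # False
print(allowed("/logs/today.txt"))     # False
```

Every path that starts with one of the Disallow lines is off limits; everything else is open.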
Here is another term we need to know the meaning of: user-agent.
The word "agent" is used for lots of meanings in computing these days. Specifically:
Autonomous agents are programs that do travel between sites, deciding themselves
when to move and what to do. These can only travel between special servers and are
currently not widespread in the Internet.
Intelligent agents are programs that help users with things, such as choosing a product,
or guiding a user through form filling, or even helping users find things. These have
generally little to do with networking.
User-agent is a technical name for programs that perform networking tasks for a user,
such as Web User-agents like Netscape Navigator and Microsoft Internet Explorer,
and Email User-agents like Qualcomm Eudora, etc.
From: http://info.webcrawler.com/mak/projects/robots/faq.html
To write a more complete version of one of these robots.txt files, you need more information, such as the name of the robot/spider you want to exclude. This name replaces the * in the file above. There are several sites on the web that offer lists of these robots/spiders. Some of these sites are:
http://info.webcrawler.com/mak/projects/robots/active.html
http://searchenginewatch.internet.com/webmasters/spiderchart.html
http://www.searchengineworld.com/spiders/
http://www.searchengineworld.com/spiders/spider_ips.htm
http://www.tardis.ed.ac.uk/~sxw/robots/check/ (you can use this site to check your file
if you have written the text file by hand in Notepad)
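To tie this back to the story at the top, here is a sketch of a file that shuts the door on one named robot while leaving it open to the rest. It is written in Python so the result can be checked with the standard urllib.robotparser module. The helper build_robots_txt is made up for this example, and using "Arach" and "Scooter" as the exact names those robots announce is an assumption; look up the real names in the lists above.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(rules):
    """rules maps a robot name (or "*") to a list of disallowed path prefixes."""
    records = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in paths]
        records.append("\n".join(lines))
    # Records for different robots are separated by a blank line.
    return "\n\n".join(records) + "\n"


# Assumed robot names, taken from the story at the top of the article.
RULES = {
    "Arach": ["/"],  # banned from the whole site
    "*": ["/cgi-bin/", "/_Private/", "/logs/"],  # everyone else
}

robots_txt = build_robots_txt(RULES)
print(robots_txt)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("Arach", "/index.html"))        # False
print(parser.can_fetch("Scooter", "/index.html"))      # True
print(parser.can_fetch("Scooter", "/logs/today.txt"))  # False
```

The named record only applies to that one robot; any robot without a record of its own falls through to the * record.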