Who's Knocking on the Door?
or, what you should know about
the Robots.txt file.
Hi! You say you're Scooter from AltaVista. Let me check my list. You're on the list - the door is open.
Knock, knock! Who's there? Arach. Arach who? Sorry, my list says you're from SearchButton.Com. You can't come in.

Robots.txt file
This little story parallels a robots.txt file's basic use on a website. However, there are other reasons to use the file:

1. "optimize" the same page (same content) for different engines
2. use "optimizing techniques" not accepted by all engines
3. use "many" doorways
4. if you've been banned in the past

Now you say, "What is a robots.txt file?" Quoting from the help file of the RoboGen software:
ROBOTS.TXT, a file that spiders look in for information on how the site is to be cataloged. It is an ASCII text file that sits in the document root of the server. It defines which documents and/or directories conforming spiders are forbidden to index.
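
Because the file always sits in the document root, a spider can work out where to look from any page address on the site. Here is a minimal sketch of that step in Python; the function name robots_txt_url and the example.com address are made up for illustration and are not part of RoboGen or any particular spider.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, point the path at the document root,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/products/widgets/index.html"))
# prints http://www.example.com/robots.txt
```

A robots.txt file placed in a subdirectory would never be found this way, which is why the location matters.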

This text file helps in the indexing of the Internet because:

Search engines such as Excite and AltaVista use web spiders, also known as robots, to create the indexes for their search databases. These robots traverse HTML trees by loading pages and following hyperlinks, and they report the text and/or meta-tag information to create search indexes.

To understand better, we need to know who, or what, these robots are and the other names they might be known by.

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.

Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this is not the case. A robot simply visits sites by requesting documents from them. From: http://info.webcrawler.com/mak/projects/robots/faq.html#what
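
To make the definition above concrete, here is a small sketch in Python of a robot "retrieving a document, and recursively retrieving all documents that are referenced." It makes no network requests: the LINKS table is a made-up four-page site standing in for real pages and their hyperlinks. It also uses a queue instead of literal recursion, which fits the FAQ's point that the traversal algorithm doesn't matter.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/index.html": ["/about.html", "/products.html"],
    "/about.html": ["/index.html"],
    "/products.html": ["/widget.html", "/about.html"],
    "/widget.html": [],
}


def crawl(start, links):
    """Visit every page reachable from start, each exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # a real robot would request the document here
        for target in links.get(page, []):
            if target not in seen:  # never fetch the same page twice
                seen.add(target)
                queue.append(target)
    return order


print(crawl("/index.html", LINKS))
# prints ['/index.html', '/about.html', '/products.html', '/widget.html']
```

Notice that the robot never "moves" anywhere; it just keeps asking for the next document on its list.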

Though most of the time robots do no harm to a site on the web, there are times when they become a problem. These are the times the robots.txt file is good to have on your site. Here are some of the problems.

-Certain robot implementations can (and have in the past) overloaded networks and servers. This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes.

-Robots are operated by humans, who make mistakes in configuration, or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.

-Web-wide indexing robots build a central database of documents, which doesn't scale too well to millions of documents on millions of sites. From: http://info.webcrawler.com/mak/projects/robots/faq.html#bad

It was inevitable that regulation of some kind was needed. Here is what the RoboGen help file had to say on the subject:

The robot exclusion protocol was introduced by Martijn Koster in 1994 to deal with problems that had been arising due to the increasing popularity of the internet and the toll web spiders were having on system resources. Some of the problems were caused by robots rapid-firing requests, that is, loading pages in rapid succession. Other problems included robots indexing information deep in directory trees, indexing temporary information, and even accessing cgi-scripts. The robot exclusion protocol was quickly adopted by webmasters and web robot makers as a way to organize and control the indexing process.

Since then, the size of the Internet has increased dramatically and millions of people are using it. The number of web robots crawling the web is greater than before and it is more important than ever for all web sites to have a properly created and maintained ROBOTS.TXT file.


For more information on the Robot Exclusion Protocol follow this link:
http://info.webcrawler.com/mak/projects/robots/exclusion.html#robotstxt

Now that you understand a little more about what the file is, we'll cover how to make one. It's not hard. Open Notepad, type in the lines as described in the Protocol above. Then save it as an ASCII text file. I see you're wondering what this file should look like. Here is a simple file that lets all robots into the site but tells them to stay out of the three directories mentioned:

User-agent: *    # the * indicates all robots
Disallow: /cgi-bin/
Disallow: /_Private/
Disallow: /logs/
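
If you'd like to see how a well-behaved robot would read the file above, Python's standard urllib.robotparser module can parse the same lines and answer "may I fetch this?" questions. This is only a sketch: the robot name ExampleBot, the example.com host, and the page paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The sample file from above, held in a string instead of on a server.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /_Private/
Disallow: /logs/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, robot="ExampleBot"):
    """Ask whether the named robot may fetch the given path."""
    return parser.can_fetch(robot, "http://www.example.com" + path)


print(allowed("/index.html"))         # True
print(allowed("/cgi-bin/search.pl"))  # False
print(allowed("/logs/access.log"))    # False
```

Everything outside the three listed directories stays open to the robot, which is exactly what the file was written to do.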

Here is another term whose meaning we need to know: user-agent

The word "agent" is used for lots of meanings in computing these days. Specifically:

Autonomous agents are programs that do travel between sites, deciding themselves when to move and what to do. These can only travel between special servers and are currently not widespread in the Internet.

Intelligent agents are programs that help users with things, such as choosing a product, or guiding a user through form filling, or even helping users find things. These have generally little to do with networking.

User-agent is a technical name for programs that perform networking tasks for a user, such as Web User-agents like Netscape Navigator and Microsoft Internet Explorer, and Email User-agents like Qualcomm Eudora.
From: http://info.webcrawler.com/mak/projects/robots/faq.html

To write a more complete robots.txt file you need more information, such as the name of the robot/spider you want to exclude. This name would replace the * in the file above. There are several sites on the web that offer lists of these robots/spiders. Some of these sites are:

http://info.webcrawler.com/mak/projects/robots/active.html

http://searchenginewatch.internet.com/webmasters/spiderchart.html

http://www.searchengineworld.com/spiders/

http://www.searchengineworld.com/spiders/spider_ips.htm

http://www.tardis.ed.ac.uk/~sxw/robots/check/ (you can use this site to check your file if you have written the text file by hand in Notepad)
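
Once you have a robot's name, putting it in place of the * gives that robot its own set of rules. Here is a sketch in Python that replays the story from the top of this page: Arach is turned away at the door while Scooter and everyone else get in. The names come from that story; the exact user-agent string a given spider announces is an assumption here, so check it against one of the lists above. The example.com host and page paths are also made up.

```python
from urllib.robotparser import RobotFileParser

# One record for the robot we want to keep out, one for everybody else.
RULES = """\
User-agent: Arach
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def may_enter(robot, path):
    """Ask whether the named robot may fetch the given path."""
    return parser.can_fetch(robot, "http://www.example.com" + path)


print(may_enter("Scooter", "/index.html"))  # True
print(may_enter("Arach", "/index.html"))    # False
print(may_enter("Scooter", "/cgi-bin/x"))   # False
```

A robot that finds a record with its own name follows that record; any other robot falls back to the * record.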