|
|
|
|
|
|
|
Broken internal and external links are the burden of an active webmaster. You have to check for them periodically and then hunt each one down. That task, however, is only tangentially related to a robots.txt file: robots.txt serves to guide the search engine bots and other crawlers into and out of your file system and to perform other housecleaning services.
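The link-hunting part is easy to script. Here's a minimal sketch of a broken-link checker in Python, assuming you already have a list of URLs to test; the example URLs are placeholders, not anything from this thread:

```python
# Minimal broken-link checker sketch. The URLs below are placeholders;
# substitute the pages you actually want to verify.
import urllib.request
import urllib.error

urls = [
    "http://www.example.com/",
    "http://www.example.com/old-page.html",
]

for url in urls:
    # HEAD keeps the check cheap; switch method to "GET" if a
    # particular server refuses HEAD requests.
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(resp.status, "OK ", url)
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses land here; the 404s are your broken links.
        print(e.code, "BAD", url)
    except urllib.error.URLError as e:
        # DNS failures, refused connections, timeouts, etc.
        print("ERR", e.reason, url)
```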
Format example for a robots.txt file:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
Disallow: /dbase/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /
```

Notes:

- The `*` asterisk is a wildcard meaning "all": every crawler/spider/bot should stay out of the files and directories listed in that group.
- `Disallow: /` means the named bot should stay out entirely.
- `Crawl-delay:` slows a bot down so it doesn't overtax server resources. Crawl-delay should only be needed on very large sites with hundreds or thousands of pages.

Most larger search engines (good bots) will crawl your site whether or not you use a robots.txt file. However, some such as MSN seem to require it before they will begin crawling at all. All of the search engine bots will generally re-request the file on a regular basis to see if any changes have occurred.

I'm sure there's a good thread in here: http://www.greenguysboard.com/board/...earchid=344302

For general use: http://www.robotstxt.org/
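If you want to sanity-check how a compliant bot will interpret your rules, Python's standard urllib.robotparser module can fetch and parse a live robots.txt. A minimal sketch, assuming a placeholder domain with rules like the example above:

```python
# Sketch: reading robots.txt the way a compliant bot would.
# The domain is a placeholder; point set_url() at your own site.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # fetches and parses the file

# Under the "*" group above, /private/ is disallowed, so this prints False.
print(rp.can_fetch("*", "http://www.example.com/private/secret.html"))

# Per-bot Crawl-delay, or None if the file sets none for that agent.
# (crawl_delay() requires Python 3.6+.)
print(rp.crawl_delay("Slurp"))
```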
|
|