
How to Control or Stop Search Engines from Crawling Your Website Using Robots.txt



Website owners can instruct search engines on which pages to crawl and index by using a robots.txt file.

Before a search engine robot visits a URL on your website, say http://www.domainname.com/index.html (as defined by the directory index), it first checks http://www.domainname.com/robots.txt and looks for specific directives to follow. Suppose it finds the following code in the robots.txt file:

User-agent: *
Disallow: /

 

The “User-agent: *” line means the directive applies to all robots (the * is a wildcard meaning all).
The “Disallow: /” line tells the robot not to visit any page on the site.

 

Important considerations when using a robots.txt file:

1) Robots that choose to follow the instructions look for this file and read it before visiting the website. If the file doesn’t exist, web robots assume that the site owner wishes to provide no specific instructions.

2) A robots.txt file functions as a request that the specified robots ignore the specified files or directories during a crawl.

3) For websites with multiple subdomains, each subdomain must have its own robots.txt file. If domainname.com had a robots.txt file but sub.domainname.com did not, the rules for domainname.com would not apply to sub.domainname.com (see the sketch after this list).

4) The robots.txt file is publicly viewable, so anyone can see which sections of your server you don’t want robots to visit.

5) Robots can ignore your /robots.txt; following it is entirely voluntary.

6) Your robots.txt file should be in the root of your domain. In our server configuration this is the public_html folder in your account. If your domain is “domainname.com”, bots will look for the file at http://domainname.com/robots.txt. If you have add-on domains and want to use a robots.txt file for those as well, you will need to place a robots.txt file in the folder you specified as the document root for the add-on domain.
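
To illustrate point 3, here is a sketch of two separate robots.txt files; the subdomain name and the rules are only an illustration:

# http://domainname.com/robots.txt
User-agent: *
Disallow: /private/

# http://sub.domainname.com/robots.txt (a separate file, served from the subdomain’s own root)
User-agent: *
Disallow: /tmp/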

 

Some examples:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

 

In this example, the site owner told ALL robots (remember, the * means all) not to crawl four directories on the site (cgi-bin, images, tmp, private). If you do not specify files or folders to be excluded, the bot is understood to have permission to crawl those items.

 

To exclude ALL bots from crawling the whole server:
User-agent: *
Disallow: /

 

To allow ALL bots to crawl the whole server:
User-agent: *
Disallow:

 

To exclude A SINGLE bot from crawling the whole server:
User-agent: BadBot
Disallow: /

 

To allow A SINGLE bot to crawl the whole server (and exclude all others):
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

 

To exclude ALL bots from crawling the ENTIRE server except for one file:
🙂 Tricky, since the original robots.txt standard has no ‘Allow’ directive (though some major crawlers do support one; see the sketch after these examples). The workaround is to place all the files you do not want crawled into one folder and leave the file to be crawled above it. So if we placed all the files we didn’t want crawled in a folder called SCT, we’d write the robots.txt rule like this:

 

User-agent: *
Disallow: /SCT

 

Or you can exclude each individual page, like this:
User-agent: *
Disallow: /SCT/home.html
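
As an aside, some major crawlers (Googlebot and Bingbot, for example, and anything following RFC 9309) do support an Allow directive even though it was not part of the original standard. Assuming such a crawler, and assuming /index.html is the one file you want crawled, the same goal can be sketched directly:

User-agent: *
Allow: /index.html
Disallow: /

For crawlers that support it, the more specific Allow rule wins over the blanket Disallow, so only /index.html remains crawlable.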

 

To set a Crawl-delay for the whole server:
User-agent: *
Crawl-delay: 10
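
Crawl-delay asks a bot to wait the given number of seconds between requests. It is not part of the original standard and support varies: Bingbot honors it, for example, while Googlebot ignores it. You can also set it for a single bot; a sketch, with the bot name and the 10-second value only as an illustration:

User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: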

 

If you wish to block just a single page from being indexed, you can add a robots <meta> tag to that page:
<meta name="robots" content="noindex, nofollow" />
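
A minimal sketch of where that tag goes, using a hypothetical page named private.html (the tag belongs in the page’s <head>):

<!-- private.html: ask compliant robots not to index this page or follow its links -->
<html>
<head>
<meta name="robots" content="noindex, nofollow" />
<title>Private page</title>
</head>
<body>
...
</body>
</html>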

You can learn more about the robots.txt file at http://www.robotstxt.org/.

 
