Web Robots on the Internet are often called Web Wanderers, Crawlers, Spiders, or Bots. They are programs that traverse the Web automatically. Search engines such as Google use them to index web content, spammers use them to scan for email addresses, and they have many other uses.
What is Robots.txt?
Robots.txt is a file used to exclude content from the crawling process of search engine spiders/bots; the convention is also called the Robots Exclusion Protocol. The robots.txt file is a plain text file that is always located in your Web server’s root directory. It contains restrictions for Web spiders, telling them where they have permission to search. In effect, robots.txt defines rules for search engine spiders (robots) about what to follow and what not to. Note that Web robots are not required to respect robots.txt files, but most well-written spiders follow the rules you define.
Robots.txt supports a set of predefined directives that tell search engines how to treat your pages:
- User-agent: defines which bots the following rules apply to. * is a wildcard meaning all bots; use Googlebot for Google.
- Disallow: defines which folders or files are excluded. An empty value means nothing is excluded; / means everything is excluded. A path with a trailing slash, such as /foldername/, excludes everything inside that folder; a path without a trailing slash, such as /foldername, excludes every URL that begins with that string.
- Allow: works as the opposite of Disallow; it lists content that is allowed to be crawled. * is a wildcard.
- Request-rate: defines a pages/seconds crawl ratio. 1/20 means 1 page every 20 seconds.
- Crawl-delay: defines how many seconds to wait after each successful crawl.
- Visit-time: defines between which hours you want your pages to be crawled. For example, 0100-0330 means pages may be crawled between 01:00 and 03:30 GMT.
- Sitemap: points to the location of your sitemap file. You must use the complete URL of the file.
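Putting these directives together, a robots.txt file using them might look like the following sketch (the paths and sitemap URL are illustrative; note that nonstandard directives such as Request-rate and Visit-time are ignored by many crawlers, including Googlebot):

```
User-agent: *
Disallow: /temp/
Allow: /public/
Crawl-delay: 10
Request-rate: 1/20
Visit-time: 0100-0330
Sitemap: http://www.example.com/sitemap.xml
```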
The “/robots.txt” file is a text file with one or more records; it usually contains a single record, looking like this:
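A minimal record of this form (here blocking all robots from the entire site; the values are illustrative) is:

```
User-agent: *
Disallow: /
```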
When you start making complicated files (for example, when you decide to allow different user-agents access to different directories), problems can arise if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradicting directives. Typos are misspelled user-agents or directories, missing colons after User-agent and Disallow, and so on. Typos can be tricky to find, but in some cases validation tools help.
The more serious problem is with logical errors. For instance:
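A reconstruction of the kind of file being discussed (the directory names follow the description below) might be:

```
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
```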
The above example is from a robots.txt file that allows all agents to access everything on the site except the /temp directory. Up to there it is fine, but later on there is another record that specifies more restrictive terms for Googlebot. When Googlebot starts reading robots.txt, it will see that all user-agents (including Googlebot itself) are allowed to access all folders except /temp/. This is enough for Googlebot to know, so it will not read the file to the end and will index everything except /temp/ – including /images/ and /cgi-bin/, which you think you have told it not to touch. As you can see, the structure of a robots.txt file is simple, yet serious mistakes can still be made easily.
The simplest robots.txt file uses two rules:
- User-agent: the robot the following rule applies to
- Disallow: the URL you want to block
These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
Each section in the robots.txt file is separate and does not build upon previous sections. For example:
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/
In this example only the URLs matching /folder2/ would be disallowed for Googlebot.
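You can check how a parser resolves such per-agent records with Python’s standard urllib.robotparser module. This sketch assumes the two-record file described above (a wildcard record disallowing /folder1/ and a Googlebot record disallowing /folder2/); the page URLs are illustrative:

```python
from urllib.robotparser import RobotFileParser

# A file with a wildcard record and a Googlebot-specific record.
robots_txt = """\
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its own record, so only /folder2/ is off limits to it.
print(rp.can_fetch("Googlebot", "/folder1/page.html"))     # True
print(rp.can_fetch("Googlebot", "/folder2/page.html"))     # False
# Other bots fall back to the wildcard record and lose /folder1/.
print(rp.can_fetch("SomeOtherBot", "/folder1/page.html"))  # False
```

Note that urllib.robotparser implements only the basic exclusion standard (prefix matching, no * or $ wildcards inside paths), so it is suitable for checking simple records like these but not Googlebot’s pattern-matching extensions.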
User-agents and bots
A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:
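For example (the disallowed path is illustrative):

```
User-agent: *
Disallow: /cgi-bin/
```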
Google uses several different bots (user-agents). The bot used for web search is Googlebot. Google’s other bots, such as Googlebot-Mobile and Googlebot-Image, follow the rules you set up for Googlebot, but you can set up specific rules for these bots as well.
The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).
- To block the entire site, use a forward slash.
- To block a directory and everything in it, follow the directory name with a forward slash.
- To block a page, list the page.
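Sketches of each of these cases (the directory and file names are illustrative):

```
# Block the entire site
User-agent: *
Disallow: /

# Block the /junk-directory/ directory and everything in it
User-agent: *
Disallow: /junk-directory/

# Block a single page
User-agent: *
Disallow: /private_file.html
```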
- To remove a specific image from Google Images, add the following:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
- To remove all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /
- To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
- To prevent pages on your site from being crawled, while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google. This keeps the pages from appearing in search results, but allows the Mediapartners-Google robot to analyze the pages to determine the ads to show. The Mediapartners-Google robot doesn’t share pages with the other Google user-agents. For example:
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
Note that directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore white-space (in particular, empty lines) and unknown directives in robots.txt.
Googlebot (but not all search engines) respects some pattern matching.
- To match a sequence of characters, use an asterisk (*). For instance, to block access to all sub-directories that begin with private:
User-agent: Googlebot
Disallow: /private*/
- To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
- To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn’t crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
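Under that combination, the effect on a few illustrative URLs (the domain and paths are examples) would be:

```
http://www.example.com/results?       -> allowed  (ends with ?)
http://www.example.com/results?sid=1  -> blocked  (characters follow the ?)
http://www.example.com/results        -> allowed  (no ? at all)
```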
The Test robots.txt tool will show you if your robots.txt file is accidentally blocking Googlebot from a file or directory on your site, or if it’s permitting Googlebot to crawl files that should not appear on the web. When you enter the text of a proposed robots.txt file, the tool reads it in the same way Googlebot does, and lists the effects of the file and any problems found.
To test a site’s robots.txt file:
- On the Webmaster Tools Home page, click the site you want.
- Under Site configuration, click Crawler access.
- If it’s not already selected, click the Test robots.txt tab.
- Copy the content of your robots.txt file, and paste it into the first box.
- In the URLs box, list the URLs you want to test against.
- In the User-agents list, select the user-agents you want.
Any changes you make in this tool will not be saved. To save any changes, you’ll need to copy the contents and paste them into your robots.txt file.
This tool provides results only for Google user-agents (such as Googlebot). Other bots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard robots.txt protocol: it understands Allow: directives, as well as some pattern matching. So while the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to other bots that may crawl your site.
Server-Side Robots Generator
Server-side robots are controlled by a plain text file located in your web root at http://www.yoursite.com/robots.txt. The robots.txt file is a set of standardized rules that tells search engines and their user-agents (spiders) which webpages and directories they may index and which they may not.
You may impose restrictions on which webpages to exclude from indexing. By default, most users will want to allow all directories except their /cgi-bin directory, which commonly holds scripts. To allow all webpages, check the “Enable All Webpages” check-box. Otherwise, enter each webpage or directory path in the exclusion box, one per line (all directory paths must end with a “/”). If you checked “Enable All Webpages”, the exclusion box is ignored.
Example: “http://www.sample.com/cgi-bin/” (Excludes /cgi-bin/ directory)
Example: “http://www.sample.com/images/” (Excludes /images/ directory)
Example: “http://www.sample.com/hello.html” (Excludes /hello.html webpage)