Killer Robots From Outer SEO Space: How to Dominate the Robots.txt File

If you haven’t heard of Mr. Robots, don’t blame yourself. It wasn’t even on the SEO map until a couple of years ago. Most of you, however, know what it is but don’t know exactly how to dominate the robots.

Robots.txt files are no secret. You can spy on literally anyone’s robots file by simply typing “www.domain.com/robots.txt.” The robots.txt should always and only be in the root of the domain, and EVERY website should have one, even if it’s generic. I’ll tell you why.

There’s mixed communication about the robots. Use it. Don’t use it. Use meta-robots. You may have even heard advice to abandon the robots.txt altogether. Who is right?

Here’s the secret sauce. Check it out.

First things first, understand that the robots.txt file was not designed for human usage. It was designed to tell search ‘bots’ exactly how they may behave on your site. It sets parameters that well-behaved bots obey, mandating what information they can and cannot access. (Keep in mind that compliance is voluntary; a badly behaved bot can simply ignore the file, so it is not a security mechanism.)

This is critical for your site’s SEO success. You don’t want the bots looking through your dirty closets, so to speak.

What is a Robots.txt File?

The robots.txt is nothing more than a simple text file that should always sit in the root directory of your site. Once you understand the proper format, it’s a piece of cake to create. This system is called the Robots Exclusion Standard.

Always be sure to create the file in a basic text editor like Notepad or TextEdit and NOT in an HTML editor like Dreamweaver or FrontPage. That’s critically important. The robots.txt is NOT an HTML file and isn’t even remotely close to any web markup language. It has its own format that is completely different from any other language out there. Lucky for us, it’s extremely simple once you know how to use it.

Robots.txt Breakdown

The robots file is simple. It consists of two main directives: User-agent and Disallow.

User Agent
Every item in the robots.txt file is specified by what is called a ‘user agent.’ The user agent line specifies the robot that the command refers to.

Example:

User-agent: googlebot

On the user agent line you can also use what is called a ‘wildcard character’ that specifies ALL robots at once.

Example:

User-agent: *

If you don’t know what the user agent names are, you can easily find these in your own site logs by checking for requests to the robots.txt file. The cool thing is that most major search engines have names for their spiders. Like pet names. I’m not kidding. Slurp.

Here are some major bots:

Googlebot
Yahoo! Slurp
MSNbot
Teoma
Mediapartners-Google (Google AdSense Robot)
Xenu Link Sleuth

Disallow
The second most important part of your robots.txt file is the ‘disallow’ directive line, which is usually written right below the user agent. Remember, the mere presence of a disallow directive does not mean that the specified bots are completely banned from your site; you can pick and choose what they can and can’t index or download.

The disallow directives can specify files and directories.

For example, if you want to instruct ALL spiders not to download your privacy policy, you would enter:

User-agent: *
Disallow: /privacy.html

(Note the leading forward slash: disallow paths are always written relative to the root of your site.)

You can also specify entire directories with a directive like this:

User-agent: *
Disallow: /cgi-bin/

Again, if you only want a certain bot to be disallowed from a file or directory, put its name in place of the *.

This will block spiders from your cgi-bin directory.

Super Ninja Robots.txt Trick

Security is a huge issue online. Naturally, some webmasters are nervous about listing the directories they want to keep private, thinking that they’ll be handing the hackers and black-hat-ness-doers a roadmap to their most secret stuff.

But we’re smarter than that aren’t we?

Here’s what you do: If the directory you want to exclude or block is “secret,” all you need to do is abbreviate it to a unique prefix. Because disallow rules match by prefix, you don’t even need a wildcard. Name the directory you want protected ‘/secretsizzlesauce/’ and just add this to your robots.txt:

User-agent: *
Disallow: /sec

(You’ll sometimes see a trailing asterisk, as in /sec*, but that wildcard is an extension supported by the major engines; a plain prefix works for every bot.)

Problem solved.

This directive will disallow spiders from indexing directories that begin with “sec.” You’ll want to double check your directory structure to make sure you won’t be disallowing any other directories that you wouldn’t want disallowed. For example, this directive would disallow the directory “secondary” if you had that directory on your server.

To make things easier, disallow directives match by prefix, much like the wildcard on the user agent line. If you were to disallow /tos, it would by default block any URL whose path begins with /tos, such as tos.html, as well as any file inside the /tos directory, such as /tos/terms.html.
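To illustrate that prefix rule with a hypothetical path layout:

```
User-agent: *
Disallow: /tos

# Blocked:     /tos, /tos.html, /tos/terms.html
# Not blocked: /photos.html (the path does not begin with /tos)
```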

Important Tactics For Robot Domination

  • Always place your robots in the root directory of your site so that it can be accessed like this: www.yourdomain.com/robots.txt
  • If you leave the disallow line blank, it indicates that ALL files may be retrieved.
    User-agent: *
    Disallow:
  • You can add as many disallow directives to a single user agent as you need to but all user agents must have a disallow directive whether the directive disallows or not.
  • To be SEO kosher, at least one disallow line must be present for every user agent directive. You don’t want the bots to misread your stuff, so be sure and get it right. If you don’t get the format right they may just ignore the entire file and that is not cool. Most people who have their stuff indexed when they don’t want it to be indexed have syntax errors in their robots.
  • Use the robots.txt analysis tool in your Google Webmaster Tools account to make sure you set up your robots file correctly.
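If you’d rather sanity-check your rules before the file ever goes live, Python’s standard library ships a parser for the Robots Exclusion Standard. A minimal sketch, using a made-up sample of rules:

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt as a string, checked before it touches the server.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask the parser what a generic bot is allowed to fetch.
print(parser.can_fetch("*", "/cgi-bin/script.cgi"))  # False -- disallowed
print(parser.can_fetch("*", "/about.html"))          # True  -- no rule blocks it
```

Feed it the exact paths you care about and you’ll know instantly whether your syntax does what you think it does.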
  • An empty robots.txt is exactly the same as not having one at all. So, if nothing else, use at least the basic directive to allow the entire site.
  • To add comments to your robots.txt, all you need to do is throw a # at the front of a line and that entire line will be ignored. DO NOT put comments at the end of a directive line. That is bad form and some bots may not read it correctly.
  • What stuff do you want to disallow in your robots?
    • Any folder that you don’t want the public eye to find or those that aren’t password protected that should be.
    • Printer friendly versions of pages (mostly to avoid the duplicate content filter).
    • Image directory to protect them from leeches and to make your content more spiderable.
    • CGI-BIN which houses some of the programming code on your site.
    • Find bots in your site logs that are sucking up bandwidth and not returning any value.
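Put together, the checklist above might look something like this (the directory names here are hypothetical; substitute your own):

```
# Hypothetical example -- adjust the paths to your own structure
User-agent: *
Disallow: /private/
Disallow: /print/
Disallow: /images/
Disallow: /cgi-bin/
```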

Killer Robot Tactics

• This setup allows the bots to visit everything on your site and sometimes on your server, so use it carefully. The * specifies ALL robots and the open disallow directive applies no restrictions to ANY bot.

User-agent: *
Disallow:

• This setup prevents your entire site from being indexed or downloaded. In theory, this will keep ALL bots out.

User-agent: *
Disallow: /

• This setup keeps out just one bot. In this case, we’re denying the heck out of Ask’s bot, Teoma.

User-agent: Teoma
Disallow: /

• This setup keeps ALL bots out of your cgi-bin and your image directory:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

• If you want to disallow Google from indexing your images in their image search engine but allow all other bots, do this:

User-agent: Googlebot-Image
Disallow: /images/

• If you create a page that is perfect for Yahoo!, but you don’t want Google to see it:

User-Agent: Googlebot
Disallow: /yahoo-page.html
# Don’t use user agents or robots.txt for cloaking. That’s SEO suicide.


If You Don’t Use a Robots.txt File…

A well-written robots.txt file can help your site get indexed up to 15% deeper. It also allows you to control your content so that your site’s SEO footprint is clean, indexable, and literal fodder for search engines. That is worth the effort.

Everyone should have and employ a solid robots.txt file. It is critical to the long term success of your site.

Get it done.

Bling.

Get Internet Marketing Insight For Your Company - SEO.com

4 Comments

  1. Russ says

    I’ve read a few places that this is a robots.txt file that people like to use for wordpress. Anybody have any thoughts on this?

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /category
    Disallow: /tag
    Disallow: /author
    Disallow: /trackback
    Disallow: /*trackback
    Disallow: /*trackback*
    Disallow: /*/trackback
    Disallow: /*?*
    Disallow: /*.html/$
    Disallow: /page
    Disallow: /*feed*

    # Google Image
    User-agent: Googlebot-Image
    Disallow:
    Allow: /*

    # Google AdSense
    User-agent: Mediapartners-Google*
    Disallow:
    Allow: /*

    Sitemap:

    #

  2. Ivan says

    Thanks for the informative article.

    Blocking Baiduspider using the Disallow method does not seem to work. I have seen numerous posts where people have basically given up since baiduspider seems to ignore the robots.txt file.

    Is there an effective method to achieve this that you are aware of?

  3. Ashley Nixon says

    This is all interesting, but if I mess with this file, will it upset the search engine gods? I’m fresh out of stuff to sacrifice to them. :)
