Feb 27, 2012

You’ve installed WordPress, added some decent plug-ins, and your site is up and running. You’ve done some basic SEO (see my guide – SEO for WordPress). The next step is to make sure your robots.txt file is optimised.

You’re probably wondering whether this is really necessary. If I just leave my WordPress install alone, will it be alright? Do I need to worry about this?

The short answer: yes.

Setting up the best robots.txt file for WordPress can make a huge difference to your search rankings, as well as to the performance of your site.

Reasons to implement robots.txt

  • It will speed up your site by stopping irrelevant pages from being crawled
  • It will boost your position in the SERPs (search engine results pages) by blocking duplicate content
  • It will increase security by stopping crawlers from going through WordPress files they shouldn’t see
  • It will prevent your server from being overwhelmed by heavy crawler traffic (especially important on shared hosting)

When I was trying to set up my robots.txt file I read numerous articles online, but nothing quite worked. There was too much poor-quality information and guesswork. Instead I went back to basics, learned from scratch how the crawlers work, and then experimented until I got the perfect file.

Below I will break it down. If you want to cut straight to the chase, scroll down to “The best robots.txt file for WordPress” to find the file I use. Simply copy and paste it for your own site.

How does robots.txt work?

The robots.txt file sits in the root of your site and tells crawlers (also known as bots) what they should index and what they should stay away from.

By default you won’t have a robots.txt file at all.  That is something we are going to fix right now.

Before I cover what to put in your robots.txt, have a look at a few other sites and you will see they all have a robots.txt file.  Even Google itself has one, to tell other search engines what to index and what to ignore.   Have a look here:  www.google.com/robots.txt

To see the robots.txt file for any site, if it exists, just type the URL and add /robots.txt at the end.
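
For example, for a blog at www.example.com (a placeholder domain) you would visit:

www.example.com/robots.txt

If you get a 404 error the file doesn’t exist yet, and crawlers will assume they are allowed to index everything.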

Robots.txt increases site security

We want to block all the WordPress system directories, as well as a few other places you don’t want anyone to be able to see in the search results. Do you really want the scripts in your cgi-bin to be public and appear in Google?

By using robots.txt you get to block these places.

Here is the block I use:

User-agent: Googlebot
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /wp-admin/
Disallow: /cgi-bin/

If you want your images to get crawled and they are in the default place for WordPress, then explicitly allow the folder your images are in. For example, I put mine in /wp-content/images/:

Allow: /wp-content/images/
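
Googlebot resolves conflicts between rules by using the most specific (longest) match, so the Allow line wins over the broader Disallow. Putting the two together – a minimal sketch, assuming your images really are in /wp-content/images/ – looks like this:

User-agent: Googlebot
Disallow: /wp-content/
Allow: /wp-content/images/

Everything else under /wp-content/ stays blocked, while the images remain crawlable.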

Avoid duplicate content like the plague

You know all about this one already – don’t you? Google hates duplicate content, and WordPress is full of the stuff. If you just set up WordPress and leave the robots file alone, you will have duplicate content all over the place. For each article or blog post there is the front page, the article page itself, and then the exact same content under the tags, under the categories, under your author page, even under the date archives.

You shouldn’t worry too much that the content is on the front page and on the blog post page itself.  That’s normal.  But all those other dupes?  Not so good.

By using the best robots.txt file you can stop Google (and of course the other bots) from seeing lots of copies of the exact same content on your WordPress blog.  Not only will that help you in rankings, it also makes sure that the main post is the one that appears in the search results.  And that is what you want.

The following settings will weed out duplicate content, as well as a few other files you don’t want crawled, like style sheets.

Disallow: /trackback/
Disallow: /feed/
Disallow: /index.php
Disallow: /*?
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: */feed/
Disallow: */trackback/
Disallow: /tag/
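
A quick note on the syntax: Googlebot treats * as a wildcard matching any sequence of characters, and $ as an end-of-URL anchor. The example URLs below are hypothetical, just to show what each pattern catches:

Disallow: /*?       # any URL with a query string, e.g. /my-post/?replytocom=42
Disallow: /*.css$   # any URL ending in .css, e.g. /wp-content/themes/twentyten/style.css
Disallow: */feed/   # a feed at any depth, e.g. /category/news/feed/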

Setting a crawl delay will speed up your site

Setting a crawl delay won’t harm your site. It simply stops crawlers from requesting too many pages too quickly. Use a crawl delay to lower the load on your server and speed up the site.

This stuff matters. If the load on your server (CPU usage or bandwidth) goes too high, shared hosts will block your site for the rest of that day, or sometimes longer. This could already be happening without you knowing.

Reducing the server load of course also means your site is faster for your real visitors.

Setting a crawl delay is easy.  I use 5 seconds for Google, and more for the other engines.

Crawl-delay: 5
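
To put that number in perspective: for a crawler that honours the directive, Crawl-delay: 5 means at most one request every 5 seconds, or roughly 17,280 requests per day – plenty for most blogs, while keeping the load on your server gentle.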

Blocking bad bots

Personally I don’t bother. If you use the setup above you will slow them down enough that it won’t cause any problems.

Some people block specific bots, but I don’t, for two reasons: 1) new bots appear all the time, and 2) bad bots will often ignore robots.txt anyway.

Instead I rely on the crawl delay, which keeps the requests low.

Treat other crawlers differently from Google

The sections above are targeted at the main Google crawler. You can, if you wish, put separate instructions for every crawler. Personally I don’t bother. Instead I do one for Google, and a slightly different one for everything else.

One difference, for example, is that I don’t want Google to crawl my feeds, as I don’t like the duplicate content. However, I don’t want to block all bots from the feeds, as I want my site to appear in RSS directories.

The syntax for addressing all bots is:

User-agent: *

Another important thing to know is that Googlebot recognises the wildcard (*) but most other robots don’t. So in the catch-all section I take all the wildcards out. That’s important: you don’t want a parser to ignore the whole block because of one command it doesn’t understand.
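
In practice that means expressing the same intent in two different ways. A small illustration, using the same paths as elsewhere in this article:

# In the Googlebot block, wildcards are fine
Disallow: /*/feed/

# In the catch-all block, stick to plain path prefixes
Disallow: /feed/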

The best robots.txt file for WordPress

Putting all that together, here is my recommendation for the best robots.txt file for a WordPress site. Feel free to copy this into a new text file and use it yourself.

User-agent: Googlebot
# Slow the bot down to speed up my site
Crawl-delay: 5
# Stop access to these directories
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /wp-admin/
Disallow: /cgi-bin/
# Avoid duplicate content and files with certain extensions
Disallow: /index.php
Disallow: /*?
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.cgi$
Disallow: /*.wmv$
Disallow: /feed/
Disallow: /*/feed/
Disallow: /trackback/
Disallow: /*/trackback/
Disallow: /tag/
Disallow: /*/tag/

# For other bots I allow the feeds and slow the crawl down a bit
User-agent: *
Crawl-delay: 10
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /wp-admin/
Disallow: /cgi-bin/
Disallow: /trackback/
Disallow: /index.php
Disallow: /tag/

# Point crawlers at the sitemap (change this URL to your own)
Sitemap: http://www.linkbuildingtools.co.uk/sitemap.xml
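
Upload the finished file to the root of your site (the same directory as wp-config.php on a default install) so that it is served from www.yoursite.com/robots.txt.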

Validate your robots.txt file

  • You can look at the robots.txt of your competitor sites to see what they are doing. Remember, what they have won’t necessarily work for you.
  • There is a WordPress plug-in that will generate the robots.txt for you. However, if you use the file above you won’t need it.
  • Validate your robots file using this free online validator – http://tool.motoricerca.info/robots-checker.phtml

Spread the word

If this article helped you please spread the word using the links below.

Comments?

Do you use a similar file?  What else do you include or exclude?  Leave questions or comments below and I will try and help you out.

 
