WordPress Robots.txt File: How to Create, Configure, and Optimize It


The WordPress robots.txt file follows the standard introduced at robotstxt.org to instruct search engines how to crawl your website. It is a very powerful file (you could even call it a tool) if you are working on a site's SEO.

You can control which parts of your website you want to share with search engines.

Robots.txt file example (screenshot)

Simply placing a robots.txt file in the root of your domain lets you stop search engines from indexing sensitive information on your site.

For example,

  • The plugins you use to secure your WordPress installation,
  • The wp-admin area, to protect your login, and
  • Other confidential information.

Over the years, search engines, and especially Google, have changed a lot in how they crawl the web. So the old robots.txt best practices are no longer valid.

In this article, we first start with the basic, old-style robots.txt file to understand its features and functionality, and then move on to the advanced robots.txt file.

I will also cover the importance of the robots file, how to create a WordPress robots.txt file, and how to test it with the Google webmaster tool to ensure everything is working fine.

What Is a WordPress Robots.txt File?

The robots exclusion protocol (REP), or robots.txt, is a text file webmasters create to instruct robots (typically search engine robots) how to crawl and index pages on their website.

Allow and Disallow commands: robots.txt syntax (screenshot)

Let me explain it briefly. Suppose our website is like a house.

A house consists of many rooms. Some rooms are important to us and some, like the storage room or a private room, are not. When someone visits our home, we make sure they never go into the private areas.

Similarly, we have to stop search engines from crawling sensitive areas of our website, like wp-admin, cgi-bin, etc.

In this way, we use the robots file to guide search engine bots: what they should crawl on the website, which new content is available for indexing, and which parts they do not need to crawl.

What Happens If I Don't Use a Robots.txt File?

In the absence of a robots.txt file, search engine bots will crawl and index every part of your site. So it is highly recommended that you create one.

Without a robots.txt file, your website is not optimized for crawlability. If you have dealt with SEO even once, you know very well how a robots.txt file helps you clean up your own or your client's website.

Major search engines will follow the rules that you set; malicious bots and poorly behaved crawlers, however, will not. They index whatever they want.

Thankfully, major search engines follow the standard, including Google, Bing, Yandex, Yahoo, and Baidu.

Importance of the Robots.txt File

Most search engines do not provide a webmaster tool (Ask, for example). So how do they crawl your website? To guide those search engines, we add the XML sitemap to the robots.txt file.

Another important reason to add the sitemap to the WordPress robots.txt file is that all bots check the robots file first before moving on.

So by placing a sitemap in the robots file, you help bots crawl your site and get your new or old posts indexed faster.
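For example, a single extra line pointing at your sitemap is enough (the URL below is just a placeholder):

User-agent: *
Disallow: /wp-admin/
Sitemap: https://www.example.com/sitemap.xml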

How to Create a WordPress Robots.txt File

Robots.txt is a plain text file. So, if you don't have this file in the root of your site, then:

  • Open any text editor you like (Notepad, TextEdit) and
  • Create a robots.txt file.
Robots commands written in a text file (screenshot)
  • Upload it to the root of your site. It's done.

By default, WordPress serves the following robots.txt file at the root of the domain:

User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/

You can check your WordPress robots.txt file by simply typing yourwebsitename.com/robots.txt into a new tab of your browser.

Viewing robots.txt in the browser (screenshot)

So this is what a robots.txt file looks like.

Basic Robots.txt File Syntax

Robots.txt syntax is very simple; you don't need to learn a new programming language to make a robots.txt file.

There are only a few directives available. In fact, knowing just two of them is enough for most purposes.

Here they are:

  • User-agent – Defines the search engine crawler, like Google, Yandex, Bing, etc.
  • Disallow – Tells the crawler to stay away from the specified directories, pages, files, or images.

An asterisk (*) can be used as the user-agent to apply a directive to all search engines.

For example,

To block everyone from your entire website, you would configure the robots.txt file in the following way:

User-agent: *
Disallow: /

Here, the slash (/) says "don't crawl anything under the root", i.e., the entire site.

Let's first walk through this robots.txt example and then move on.

Say I want to tell search engines not to index my website. I simply write a command in a .txt file and upload it to the root of my site:

Disallow: /

But this command is incomplete. I also have to mention the search engine, using User-agent:

User-agent: *
Disallow: /

Here the asterisk (*) stands for all search engines. So according to this command, no search engine will index my website.

But if you only want to block Google, then you have to configure the robots.txt file in the following way:

User-agent: Googlebot
Disallow: /

Note: – This command only blocks Google's bots from crawling your website.

But if you want to allow only Googlebot and block all the other search engines, then write the following commands in your WordPress robots.txt file:

User-agent: Googlebot
Disallow: 
User-agent: *
Disallow: /

This code inside your robots.txt would give only Google full access to your website while keeping everyone else out.

Note: – Under the original standard, commands are read in sequence, so it is important to allow a search engine first and then disallow the rest.

Some Other Robots File Directives

  • Allow – Allows crawling of a path that would otherwise be blocked.
  • Sitemap – Tells crawlers where your sitemap file is located.
  • Host – Tells crawlers your primary domain.

Allow Directive:

A common misconception about the Allow directive is that it is used to invite search engines to check out your site.

Basically, Allow is used to grant access back to a subfolder or file inside a disallowed directory.

For example;

User-agent: *
Allow: /content/my-file.php
Disallow: /content/

Search engines would stay away from the /content/ folder in general but could still access my-file.php.

Note: – It's important to place the Allow directive first in order for this to work.

Sitemap Directive:

This directive tells search engines and other robots where your sitemap is located. For example, the sitemap lines in a complete robots.txt could look like this:

Sitemap: https://www.hitechwork.com/post-sitemap.xml
Sitemap: https://www.hitechwork.com/page-sitemap.xml
Sitemap: https://www.hitechwork.com/category-sitemap.xml

The robots.txt file is used to block particular directories, whereas the sitemap gives robots a list of pages that are available for indexing.

As I already told you, by giving the search engine a sitemap you can increase the number of pages that it indexes. The sitemap can also tell robots when a page was last modified, the priority of the page, and how often the page is likely to be updated.
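As a rough illustration (the URL and values below are placeholders), a single entry in a sitemaps.org-style XML sitemap carries exactly that information:

<url>
  <loc>https://www.example.com/sample-post/</loc>
  <lastmod>2017-04-01</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>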

Host Directive:

The Host directive is only supported by Yandex. It lets you decide whether you want www.example.com or example.com to appear in the search results.

Host syntax, used by Yandex (screenshot)

For example;

Host: www.hitechwork.com

I don't recommend relying on this, because only Yandex supports it. But if you want to, you can learn more about the Host directive here.

Only use settings that all search engines honor. Google, for example, uses 301 redirects to handle this situation.

For example;

If your domain starts with www, people who reach your website without www (hitechwork.com) will automatically be redirected to www.hitechwork.com.
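On an Apache server, a minimal .htaccess sketch of such a redirect could look like this (assuming mod_rewrite is enabled; swap in your own domain):

# Redirect the non-www host to the www host with a 301
RewriteEngine On
RewriteCond %{HTTP_HOST} ^hitechwork\.com$ [NC]
RewriteRule ^(.*)$ https://www.hitechwork.com/$1 [R=301,L]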

Advanced Robots.txt Syntax

The robots.txt file is not only used to prevent search engines from crawling your site.

It is also used to provide useful information to search engines and to block unnecessary files to keep your website clean.

For example:

On your website, you might have a folder for test content, affiliate links, unnecessary images, and many other things.

If you want to keep this folder out of the search engine index, write the following command in the robots file:

Disallow: /testfolder/

All the content in the testfolder is now blocked.

But if you wanted to block access to all folders whose names begin with wp (or anything else), you could do it like this:

User-agent: *
Disallow: /wp-*/

If you want to exclude all PDF files in the media folder from showing up in search results, you again write a command in the robots.txt file:

User-agent: *
Disallow: /wp-content/uploads/*/*/*.pdf

Note: – When you upload any file, it goes into the uploads folder.

See the screenshot below.

URL of an uploaded image (screenshot)

I replaced the year and month directories that WordPress automatically sets up with wildcard asterisks (*).

According to this command, no matter when they were uploaded, all files in the uploads folder ending with .pdf are blocked, for example:

www.hitechwork.com/wp-content/uploads/2017/04/SEO.pdf

"Officially," the robots.txt standard doesn't support regular expressions (wildcards).

But all major search engines understand them, which means you can use lines like these to block whole groups of files.

Note: – The robots.txt file is case-sensitive: pdf and PDF are two different things to robots.

List of Search Engine Bots

Search Engine    Field      User-agent
Google           General    Googlebot
Google           Images     Googlebot-Image
Google           Mobile     Googlebot-Mobile
Google           News       Googlebot-News
Google           Video      Googlebot-Video
Google           AdSense    Mediapartners-Google
Google           AdWords    AdsBot-Google
Yahoo!           General    Slurp
Yandex           General    Yandex

More can be found on User-Agents.org

What to Place in Your Robots.txt File

I change my robots.txt file from time to time, according to changes in current trends.

You will find my current robots.txt file later in this section.

The best thing about robots.txt files is that you can check the robots file of any website.

Just type a simple address into your browser:

www.example.com/robots.txt
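You can do the same from the command line. Here is a minimal Python sketch (the URL is a placeholder) that downloads and prints any site's robots.txt:

from urllib.request import urlopen

# Fetch and print a site's robots.txt (replace the placeholder URL).
url = "https://www.example.com/robots.txt"
with urlopen(url) as response:
    print(response.read().decode("utf-8"))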

If you check out the robots.txt file of some WordPress websites, you will see that website owners define different rules for search engines.

For example;

  • WordPress.com
  • perishablepress.com
  • askapache.com
  • moz.com
  • backlinko.com
  • mattcutts.com

After studying a lot of robots files and reading the available research, I arrived at this file:

User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /cgi-bin/

Note: – This is only to the best of my knowledge; you can make your own decision.

The Robots Exclusion Standard can be used to stop search engines from crawling files and directories that you do not want indexed. However, if you enter the wrong code, you may end up blocking important pages from being crawled.

Hint: – Placing your XML sitemap in the robots file helps search engines index your site faster.

Just add your XML sitemap URLs to the robots.txt file and upload it to the root of your site.

Sitemap: https://www.hitechwork.com/post-sitemap.xml
Sitemap: https://www.hitechwork.com/page-sitemap.xml
Sitemap: https://www.hitechwork.com/category-sitemap.xml

Basic Rules for the Robots.txt File

  1. Don't put a space at the beginning of any line, and don't add stray spaces inside the file.
    Allow: /foldername1/filename.html
  2. Don't change the sequence of commands (first specify the user-agent, then the allow or disallow rules).
  3. If you want to block more than one directory or page, don't write them all on one line (Disallow: /support /cgi-bin /images/); give each path its own line, as shown in the example after this list.
  4. The robots.txt file is case-sensitive. For example, if you want to block the "Download" directory but write "download" in robots.txt, search bots will misunderstand it.
  5. Don't place the sitemap at the top of the robots.txt file.
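For instance (the directory names are just placeholders), rule 3 means each path gets its own Disallow line:

# Wrong – several paths on one line
Disallow: /support /cgi-bin /images/

# Right – one path per line
User-agent: *
Disallow: /support/
Disallow: /cgi-bin/
Disallow: /images/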

Here are the major search engines' own recommendations for creating a robots.txt file.

Add Comments to Your Robots File

If you are working on a client's website or you have a large website, you can also add comments to the WordPress robots.txt file.

This will help you quickly understand the rules you have added when you refer to it later.

The # symbol is used to add a comment to the file:

# Block the Googlebot from crawling the site
 
User-agent: Googlebot
Disallow: /

A comment can be placed at the start of a line or at the end of a line.

User-agent: Googlebot-Image # The Google Images crawler
Disallow: /images/ # Hide the images folder

I recommend adding comments to the robots file. It not only helps you in the future but also helps you understand the rules you created.

Test Your Robots.txt File Before Uploading

There are a number of ways to test your robots.txt file, using online or offline tools. But I recommend creating a Google webmaster account and using the robots.txt tester in the webmaster tool.

Because if Google flags any problem in the robots file, the tool will show it here and also tell you how to fix it.
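For a quick offline check before you upload, Python's standard library also ships a robots.txt parser. Here is a minimal sketch (the file name, user agent, and URLs are placeholders):

from urllib.robotparser import RobotFileParser

# Parse a local robots.txt draft before uploading it.
parser = RobotFileParser()
with open("robots.txt") as f:
    parser.parse(f.read().splitlines())

# Check whether a given crawler may fetch a given URL.
print(parser.can_fetch("Googlebot", "https://www.example.com/wp-admin/"))  # False if blocked
print(parser.can_fetch("Googlebot", "https://www.example.com/my-post/"))   # True if allowed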

Robots file in the Google webmaster tool (screenshot)

Note: – The robots.txt file that is displayed comes from the last copy of robots.txt that Google retrieved from your website.

  • Replace this with your new robots.txt content and click Test.
Paste the robots.txt file into the tester (screenshot)
  • Check whether there are any errors. If there is an error, fix it and click Submit.
  • A new window will open with a download option.
Submitting the file in the Google webmaster tool (screenshot)
  • Click on Download to download the robots.txt file.

Use cPanel to upload the robots.txt file to the root of your site.

After uploading the robots.txt file to the root of your site, come back to the webmaster tool and click on View uploaded version.

  • When you click on it, a new window will open with your new robots.txt file.
Upload the file via cPanel (screenshot)

Note: – Check whether your file was actually uploaded. If not, reload the page or check in cPanel.

  • Now click on Submit.
Submit the file in the webmaster tool (screenshot)

It's done. Your new WordPress robots.txt file is live.

Maximum Size of the Robots.txt File

Google's John Mueller clarified the issue:

“If you have a giant robots.txt file, remember that Googlebot will only read the first 500kB. If your robots.txt is longer, it can result in a line being truncated in an unwanted way. The simple solution is to limit your robots.txt files to a reasonable size.”

Try to keep your robots.txt file under 500 KB. You can check your robots.txt file size in the Google webmaster tool.

Robots.txt file size: less than 500 KB (screenshot)
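You can also check the size yourself with a couple of lines of Python (the URL is a placeholder):

from urllib.request import urlopen

# Report the size of a live robots.txt file in kilobytes.
with urlopen("https://www.example.com/robots.txt") as response:
    size_kb = len(response.read()) / 1024
print(f"robots.txt is {size_kb:.1f} KB (Googlebot reads only the first 500 KB)")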

Commands You Can Use to Block Files

Allow indexing of everything

User-agent: *
Disallow:

Disallow indexing of everything

User-agent: *
Disallow: /

Disallow indexing of a specific folder

User-agent: *
Disallow: /foldername/

Disallow Googlebot from indexing

User-agent: Googlebot
Disallow: /

Allow only Googlebot and disallow the others

User-agent: Googlebot
Disallow: 
User-agent: *
Disallow: /

Disallow a folder but allow a specific file inside it

User-agent: *
Allow: /content/my-file.php
Disallow: /content/

Disallow folders whose names start with wp-

User-agent: *
Disallow: /wp-*/

Disallow files that end with .pdf

User-agent: *
Disallow: /wp-content/uploads/*/*/*.pdf

What Experts Are Doing

According to Yoast

No longer is Google the dumb little kid that just fetches your sites HTML and ignores your styling and JavaScript. It fetches everything and renders your pages completely. This means that when you deny Google access to your CSS or JavaScript files, it doesn’t like that at all.

Yoast's robots.txt file is:

User-Agent: *
Disallow: /out/

Yoast only blocks the /out/ directory.

Why doesn't Yoast block anything else on their site? The reason is very simple: when Google fetches your website, it checks CSS, JS, and styling components to render your content properly. According to Yoast, if you block these, Google thinks your site looks like crap and may penalize you for it, with devastating effects.

Yoast also recommends not blocking directories with a robots.txt file at all. Instead, they recommend using the noindex, nofollow robots meta tag to stop indexing of a page.
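For reference, that standard meta tag goes in the page's <head> and looks like this:

<meta name="robots" content="noindex, nofollow">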
WordPress founder Matt Mullenweg follows a similar approach:
User-agent: *
Disallow:

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /dropbox
Disallow: /contact
Disallow: /blog/wp-login.php
Disallow: /blog/wp-admin

Matt Mullenweg also blocks wp-admin and wp-login.php. Dropbox and contact are other folders that are blocked; maybe they contain private or unhelpful content that he does not want to share with Google.

Another example comes from WPBeginner:

User-Agent: *
Allow: /?display=wide
Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /readme.html
Disallow: /refer/

You can see that WPBeginner blocks the plugins folder for security, and its refer folder, which contains affiliate links and other information.


Conclusion

Robots.txt is not just a .txt file. If you deal with SEO on a client's site or on your own personal blog, the robots.txt file is one of the first tools that helps you fix damage and improve your site's reputation in the search results.

Finally, be aware of command syntax: a little mistake in the robots.txt file can block your entire site. Always test your file before uploading it to the root of your site. After uploading, monitor your traffic; if you notice any sharp change (a decrease), recheck your file and find the reason behind it.

Remember to share this post with anyone who might benefit from this information, including your Facebook friends, Twitter followers, and members of your Google+ group! And also support us by liking our Facebook, Twitter, and Google+ pages.

If you have any suggestions or problems regarding the WordPress robots.txt file, please feel free to comment below.