
Robots.txt: What, When, and Why

Did you know that preventing search engines from accessing and indexing some pages on your site can be beneficial both for your SEO and your privacy? Read this post and you will learn how to achieve this by changing the contents of a small but extremely important file: robots.txt.


Can robots read? As Artificial Intelligence is developing rapidly, nobody can say for sure how much robots understand. Yet, if we speak about search engine bots, crawlers, or robots, we know that they can read and follow instructions written in the robots.txt file. In this article, you’ll learn what robots.txt is, where you can find it, how to create it, and when to use it. Keep on reading!

What Is robots.txt?

As you can understand from its name, robots.txt is a simple text file.

A robots.txt file is a useful tool that tells search engine bots which pages they may crawl on your website. Using it, you can control crawler traffic and thus avoid overloading your website with requests. It can also help keep files out of the search index, although it does not guarantee this. The standard behind the file is called the ‘Robots Exclusion Protocol’ because its main purpose is to exclude website pages from bots’ crawling.

How to Find robots.txt?

You can find the robots.txt file in the root folder of your domain. Usually, this is the same place where you can see the site’s main ‘index.html’ page. The exact location and the ways to add the file may depend on your web server software. Remember to add the file to the correct place on your server. 

If you want to check whether a website has a robots.txt file, just type /robots.txt after the domain name, and you’ll get the result, as you can see in the screenshot below.

For WordPress Websites

If you have a WordPress website, you can find your robots.txt file with the help of an FTP (File Transfer Protocol) client. FTP is a standard way of transferring files between your computer and the website’s hosting. In other words, an FTP client helps you manage website files on the Internet. An example of an FTP client that you might have on your computer is FileZilla. Alternatively, you can log in to your website’s hosting admin panel and open the root folder. Either way, you’ll see the following standard WordPress folders:

  • wp-admin
  • wp-content
  • wp-includes

Afterward, in the same root folder, you can find the robots.txt file.

For Shopify Websites

Shopify generates a robots.txt file by default. If you need to edit the robots.txt on your Shopify website, you can modify it with the robots.txt.liquid theme template. This template is located in the templates directory of the theme.

How to Use robots.txt: Directives

The robots.txt file includes one or more sets of directives. Using these, you can:

  • set instructions for all search engines simultaneously
  • regulate the behavior of different search engines separately

Each set of directives includes a line that refers to search crawlers (either to all of them or a specific one). Afterward, there is an instruction allowing or blocking access to URLs.

The user-agent Directive

The first directive in robots.txt is a user-agent directive. It specifies what search engine the following rule is applied to.

If you want to address all search crawlers, you can use a wildcard (*):

User-agent: *

As you can see, mailchimp.com has a very brief robots.txt that blocks access for all search crawlers to certain pages:

If necessary, you can provide instructions not only for the most popular user agents (such as Baidu, Bing, Google, Yahoo!, and Yandex) but also for the less common ones. This is the case with apple.com. Besides general directives for all search bots, they specify rules for the Baidu spider:

Moreover, apple.com restricts access for several other crawlers, such as HaoSouSpider:

Often, search engines have specific spiders for various aims. For instance, there are different spiders for normal indexes, ads, news, images, etc. So, if you have a news website, you can provide instructions for a bot with the user-agent ‘Googlebot-News’, as in the screenshot below.
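To see how separate groups of rules apply to different user agents, here is a minimal sketch that uses Python’s standard urllib.robotparser module. The rules and paths are made up for illustration and are not taken from any real website. Note that a crawler with its own group follows only that group, not the general one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for all bots, one for Googlebot-News.
RULES = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot-News
Disallow: /archive/
"""

def build_parser(robots_txt):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(RULES)

# Googlebot-News follows only its own group.
print(parser.can_fetch("Googlebot-News", "/archive/story.html"))  # False
print(parser.can_fetch("Googlebot-News", "/private/page.html"))   # True

# Any other bot falls back to the wildcard group.
print(parser.can_fetch("SomeOtherBot", "/private/page.html"))     # False
print(parser.can_fetch("SomeOtherBot", "/archive/story.html"))    # True
```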

The disallow Directive

As you remember, robots.txt follows the ‘Robots Exclusion Protocol’ and primarily serves for excluding URLs, i.e., disallowing search bots from crawling them. So, the most common directive in the robots.txt file is ‘Disallow’. It blocks bots’ access to specific pages.

If you leave the ‘Disallow’ field empty, bots will be able to reach any URLs. The set of rules below doesn’t restrict any search engines from crawling the website.

User-agent: *
Allow: /
Disallow:

Adding just one character, ‘/’, to the Disallow directive creates an opposite effect. It restricts all search bots from crawling the entire website.

User-agent: *
Disallow: /
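The difference between an empty ‘Disallow’ and ‘Disallow: /’ can be checked with a short sketch based on Python’s standard urllib.robotparser module. The helper name is_allowed, the bot name, and the sample path are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, path):
    """Return True if the given rules let user_agent crawl path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is blocked
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # a single slash: everything is blocked

print(is_allowed(ALLOW_ALL, "AnyBot", "/any/page.html"))  # True
print(is_allowed(BLOCK_ALL, "AnyBot", "/any/page.html"))  # False
```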

Regular Expressions and Special Characters in robots.txt

The original robots.txt standard doesn’t support regular expressions or wildcards. However, the most popular search bots, such as Google and Bing, understand two special characters: ‘*’ and ‘$’. This lets you group and block multiple URLs and files.

In a rule such as ‘Disallow: /*.jsx$’, the ‘*’ matches any sequence of characters, so the rule covers all .jsx files, while ‘$’ means that this is the end of the URL. Additionally, you can add comments to lines with the ‘#’.
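To make the meaning of ‘*’ and ‘$’ more tangible, here is a rough sketch of how a crawler might turn a robots.txt path pattern into a regular expression. The functions pattern_to_regex and rule_matches are hypothetical helpers written for this post, not part of any search engine’s code, and real crawlers may handle edge cases differently.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the
    pattern to the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def rule_matches(pattern, path):
    """Return True if the path is covered by the pattern (matched from the start)."""
    return pattern_to_regex(pattern).match(path) is not None

print(rule_matches("/*.jsx$", "/static/app.jsx"))      # True
print(rule_matches("/*.jsx$", "/static/app.jsx?v=2"))  # False: '$' requires the URL to end here
print(rule_matches("/*.jsx", "/static/app.jsx?v=2"))   # True: without '$' it is a prefix match
```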

The sitemap Directive

This line indicates the location of the website’s sitemap. The directive should include a fully qualified URL, and the sitemap URL does not have to be on the same host as your robots.txt file. Additionally, you can add multiple sitemap fields. As you can see in the example below, bbc.com is a huge website with numerous sitemaps, and the robots.txt file lists them.
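As a small illustration, Python’s standard urllib.robotparser module (version 3.8 or later) can list the sitemaps declared in a robots.txt file. The sitemap URLs below are placeholders invented for this sketch, not the ones bbc.com uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with two sitemap lines, one on another host.
SITEMAP_RULES = """\
User-agent: *
Disallow: /search/

Sitemap: https://yoursite.com/sitemap.xml
Sitemap: https://cdn.yoursite.com/news-sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(SITEMAP_RULES.splitlines())

# site_maps() returns a list of URLs, or None if no sitemap is declared.
sitemaps = sitemap_parser.site_maps()
for url in sitemaps or []:
    print(url)
```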

Non-Standard robots.txt Directives

Besides the ‘Disallow’ directive, you can use some other instructions for crawlers. Mainly, the ‘Allow’ directive is helpful if your folder contains numerous subfolders or files, and you need to provide access to just one of them. A typical command for WordPress websites is:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Using this rule, you restrict access to all files in the /wp-admin/ folder except admin-ajax.php. The other way to get the same result would be to block (by applying ‘Disallow’) every other file within the /wp-admin/ folder one by one.
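The effect of combining ‘Allow’ and ‘Disallow’ for the /wp-admin/ folder can be tried out with Python’s standard urllib.robotparser module. One assumption to keep in mind: this parser applies the first rule that matches, so the ‘Allow’ line is placed first in the sketch, whereas Google picks the most specific (longest) matching rule regardless of order. The bot name and the options.php path are illustrative.

```python
from urllib.robotparser import RobotFileParser

# 'Allow' is listed first because urllib.robotparser uses the first matching rule.
WP_RULES = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

wp_parser = RobotFileParser()
wp_parser.parse(WP_RULES.splitlines())

print(wp_parser.can_fetch("AnyBot", "/wp-admin/admin-ajax.php"))  # True: explicitly allowed
print(wp_parser.can_fetch("AnyBot", "/wp-admin/options.php"))     # False: rest of the folder is blocked
print(wp_parser.can_fetch("AnyBot", "/blog/hello-world/"))        # True: no rule applies
```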

The host Directive

This directive is only supported by Yandex. It specifies whether you want the search engine to display yoursite.com or www.yoursite.com. You can apply the host directive to let search engines know your preferences:

host: yoursite.com

Take into consideration that this directive doesn’t regulate whether you wish to use http or https for your website. Instead of relying on the ‘Host’ directive, you should consider setting 301 redirects that will work for all search engines.

The crawl-delay Directive

With this line, you tell search bots how often they may request pages on your website. If you set a delay of 10 seconds, a crawler either accesses your website once every 10 seconds or waits 10 seconds after each crawling action. The difference is slight, and the interpretation depends on the bot. Keep in mind that not every crawler honors this directive; Googlebot, for example, ignores it. The gucci.com website’s robots.txt clearly explains what their instructions mean:
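A polite crawler might read the delay roughly as in the sketch below, which relies on the crawl_delay() method of Python’s standard urllib.robotparser module. The rules and the bot name are invented for this example, and what a bot does with the value is up to that bot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules asking all bots to wait 10 seconds between requests.
DELAY_RULES = """\
User-agent: *
Crawl-delay: 10
Disallow: /checkout/
"""

delay_parser = RobotFileParser()
delay_parser.parse(DELAY_RULES.splitlines())

# crawl_delay() returns the number of seconds, or None if no delay is set.
delay = delay_parser.crawl_delay("AnyBot")
print(delay)  # 10

# A crawler would then pause for `delay` seconds between requests,
# for example with time.sleep(delay).
```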

robots.txt for Messages and Humor

Besides providing instructions to search robots, some webmasters use the robots.txt files for secret messages or nerd humor. A popular theme among developers is adding Bender, a robot from the Futurama cartoon series (you can check robots.txt files of tindeck.com).

It’s also possible to come across other humorous variations in robots.txt. For instance, you can see a cute image consisting of text symbols that ask a robot ‘to be nice’:

Indeed, if you check the entire robots.txt file on the cloudflare.com site, you’ll see the continuation and precise directions for crawlers.

Furthermore, you can find hidden messages in robots.txt. For instance, Shopify’s robots.txt file encourages visitors ‘to board the rocketship’ and check SEO careers at Shopify.

Similarly, you can see that Pinterest is hiring, and its robots.txt invites visitors to check available positions:

Here, the robots.txt file also explains why some agents are blocked. Some of the comments, such as ‘Block MJ12bot as it is just noise’, are indeed impressive.

Coming across just a ‘Blank robots.txt’, as on the honda.com website, is also possible.

How to Edit robots.txt and Check It

There are different ways and tools for creating and editing robots.txt. Firstly, you can do this manually. Since it is an ordinary text file, you can change it with any text editor, such as Notepad.

Secondly, if you have a WordPress website, it’s possible to facilitate the process of finding, editing, and uploading the robots.txt file by using plugins. For instance, the All in One SEO and Yoast SEO plugins are at your disposal.

After finishing, you may check your robots.txt file. Google’s Webmaster Tools has a robots.txt Tester that you can find in the old version of the Google Search Console. Select the property whose robots.txt you want to test, remove the old file version (if there is one), and add your new file. Then, click ‘Test’ to run a check. If you see ‘Allowed’, the URL you tested is not blocked by your rules, and you may use the file.
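Before uploading the file, you can also run a quick local check. The sketch below is a simplified stand-in for an online tester, built on Python’s standard urllib.robotparser module; the function check_paths, the draft rules, and the sample paths are all assumptions made for illustration. It does not understand wildcards, so treat the result as a first sanity check only.

```python
from urllib.robotparser import RobotFileParser

def check_paths(robots_txt, user_agent, paths):
    """Map each path to 'Allowed' or 'Blocked' for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        path: "Allowed" if parser.can_fetch(user_agent, path) else "Blocked"
        for path in paths
    }

# Hypothetical draft of a robots.txt file.
DRAFT = """\
User-agent: *
Disallow: /cart/
Disallow: /tmp/
"""

report = check_paths(DRAFT, "AnyBot", ["/", "/cart/item-1", "/blog/post"])
for path, verdict in report.items():
    print(f"{path}: {verdict}")
```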

Why Use the robots.txt File?

Robots.txt is a tool to control search engines’ access to your website. It means you can restrict crawlers from reaching specific URLs, such as PDF files. Disallowing robots from crawling parts of your website helps you prevent overload and control your crawl budget.

Additionally, the robots.txt sitemap directive tells crawlers where they can find your sitemap. So, you improve interaction with bots and list the pages that should be a part of search indexes.

Disadvantages of robots.txt

Although robots.txt restricts crawlers, it is not a tool for excluding a page from the search index. Firstly, it’s up to crawlers to follow the instructions. Secondly, if Google finds the page through external links, the URL can still get into the index. So, robots.txt doesn’t guarantee web page noindexing.

Additionally, if other websites link to a page that is blocked, Google cannot crawl that page and will not pass the value of those backlinks on to the pages it links to. As a result, the link flow is broken.

Robots.txt Helpful Tips

There are some tricky details in working with robots.txt. If you know them, you can avoid making mistakes.

1. Pay attention to the difference between domains and subdomains. A robots.txt on a specific subdomain is valid for only that subdomain. That means, if you want to create a blog for your website on a separate subdomain (such as blog.yoursite.com), it should have its own robots.txt file. In the screenshots below, you can see various robots.txt files. The first refers to the main domain, hubspot.com.

The second one is valid for its blog – blog.hubspot.com.

You can check other examples of robots.txt for sites and their subdomains, such as British Council and LearnEnglish by British Council.

2. You should place the robots.txt file in the root folder, not in subdirectories. Consequently, http://yoursite.com/folder/robots.txt is not a valid robots.txt.

3. Remember that URLs and the robots.txt file name are case-sensitive. So, avoid using capital letters in your file’s name. Likewise, the values in directives are case-sensitive, i.e., /books/ and /Books/ are not the same. It’s essential to check capitalization to avoid blocking necessary folders by mistake.
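The three tips above can be illustrated with a short sketch. The helper robots_url_for is a hypothetical function that works out which robots.txt file governs a given page, and the second part shows case sensitivity using Python’s standard urllib.robotparser module. All URLs and paths are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def robots_url_for(page_url):
    """Return the robots.txt URL that applies to page_url.

    The file always sits in the root of the exact host, so each
    subdomain gets its own robots.txt and subfolders never do.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url_for("https://yoursite.com/folder/page.html"))
# https://yoursite.com/robots.txt
print(robots_url_for("https://blog.yoursite.com/post/1"))
# https://blog.yoursite.com/robots.txt

# Paths in directives are case-sensitive.
case_parser = RobotFileParser()
case_parser.parse(["User-agent: *", "Disallow: /books/"])
print(case_parser.can_fetch("AnyBot", "/books/novel.html"))  # False
print(case_parser.can_fetch("AnyBot", "/Books/novel.html"))  # True
```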

Robots.txt: FAQ

To summarize, we’ve listed short answers to the most common questions about the robots.txt file.

What is robots.txt?

Robots.txt is a text file that instructs search crawlers on which pages they can crawl. It’s also called ‘Robots Exclusion Protocol’ because its primary function is to exclude pages from bots’ crawling.

Where can I find robots.txt?

This file is usually situated in the root directory of your website. To check if a website has a robots.txt file, just add /robots.txt to the domain, such as yoursite.com/robots.txt.

What should be in my robots.txt file?

Robots.txt should tell search bots what pages they should not access. Thus, firstly, the robots.txt file contains specifications for bots (that you should indicate in the ‘User-agent’ line). You can state if you address all bots or a specific one. Secondly, it’s necessary to show which pages crawlers should not (or, rarely, should) access.

What are the basic directives?

Besides the User-agent directive, the ‘Disallow’ directive is the most common one. It restricts bots from crawling. Also, you can use the ‘Allow’ directive to let search engines reach specific files or folders within a restricted one.

Furthermore, with the ‘Crawl-delay’ directive, it’s possible to set up the time bots should wait before making another crawling attempt.

How can I edit the robots.txt file?

You can use a text editor to open and change the robots.txt file. Then, you need to put it in your root directory. If you have a WordPress website, it’s possible to edit robots.txt with plugins. In Shopify, you can modify the file with the robots.txt.liquid theme template.



Dmitriy Maschenko

Dmitriy is the Head of a division at PSD2HTML, an experienced web developer, and a prolific author of in-depth technology and business-related posts. He is always eager to share his years-long expertise with everyone who wants to succeed in the web development field.