
How to Fix Indexed, though Blocked by robots.txt

Last updated: Jan 05, 2023

Cover image: Illustration of robots.txt, which is a file containing instructions for crawling a website. Learn how to use it in this guide.


The indexing of website pages can be monitored from Google Search Console. During this process, page issues may appear, including "indexed, though blocked by robots.txt". In this guide, we will help you understand this issue.

To avoid this problem, you can learn how to fix "indexed, though blocked by robots.txt". Check out the full discussion below.

What is robots.txt?

Figure 1: Illustration of a search robot or web crawler. Robots.txt is used by web crawlers as a guide when crawling a website.

The robots.txt file contains a set of instructions that web crawlers use as a guide when crawling a website. You can use robots.txt to tell web crawlers which pages they may or may not visit.

This file is useful when your website has pages that are intended only for users. In other words, you don't want web crawlers to find them and display them on the SERP. Examples include checkout pages or pages that can only be accessed after logging in.
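As an illustration, a minimal robots.txt file for the situation above might look like the sketch below. The /checkout/ and /account/ paths are hypothetical examples, not paths from any real site:

```txt
# Hypothetical robots.txt: ask all crawlers to skip user-only pages.
User-agent: *
Disallow: /checkout/
Disallow: /account/

# Optional: point crawlers to the sitemap.
Sitemap: https://example.com/sitemap.xml
```

The file must live at the root of the domain (e.g. example.com/robots.txt) for crawlers to find it.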

About the “indexed, though blocked by robots.txt” Issue

As you already know, robots.txt can block web crawlers from accessing pages that you have specified.

Even so, blocking crawling does not guarantee that a page stays out of the index: if Google discovers a blocked URL through links from other pages, it may still index that URL without crawling it.

You can find out about the issue through the Google Search Console. An error warning will appear in the GSC with the words "indexed, though blocked by robots.txt" or "indexed, even though blocked by robots.txt".

Figure 2: Display of the "indexed, though blocked by robots.txt" warning in Google Search Console.

If this warning appears, it means that Google has indexed a URL that is blocked in robots.txt. Google displays the warning when it is not sure whether you want the page to be indexed or not.

This can be a problem, especially if the page displays private information or data. Therefore, it is important to know how to fix "indexed, though blocked by robots.txt".

The Benefits of robots.txt for Websites

Search engines work by using search robots, or web crawlers. These robots browse every website on the internet, save the data to the search engine's database, or index, and display it to searchers.

robots.txt plays an important role in managing how web crawlers browse a website. It serves as a guide that a web crawler visits first, before crawling the other pages on the website.

With robots.txt, you can provide various instructions, such as the 'Disallow' directive to block robot access to a URL and the 'Allow' directive to permit robots to crawl a URL.
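For example, the two directives can be combined to block a directory while still keeping one page inside it crawlable. The paths below are hypothetical:

```txt
User-agent: *
Disallow: /private/
# The more specific Allow rule keeps this one page crawlable.
Allow: /private/overview.html
```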

If you want to create a robots.txt file, you can simply use the robots.txt generator tools from cmlabs. With this tool, you can create a robots.txt file easily and quickly.

Pay Attention to Page Indexing

To better understand how to fix "indexed, though blocked by robots.txt", let's first look at how a page is indexed. The index is a search engine's database of all the websites found by its search robots.

Websites in a search engine's index have gone through the indexing process, in which the contents of their pages are stored. The index does not only contain website URLs but also all of the text, images, videos, tags, and attributes in the HTML code of each page.

The indexing process also analyzes the contents of the stored data, such as the language used, country of origin, page role, and so on. The web crawler will also analyze whether a page is a duplicate or not.

Search engines can decide whether a page will be indexed or not. There are several reasons why search engines decide not to index a page, namely:

  • There is a meta tag that blocks web crawler access, such as the 'noindex' directive.
  • Content is of low quality or is a duplicate.
  • The website has a complicated navigation structure that makes it difficult for robots to index.

When URLs Don't Need to be Indexed

Now that you know what robots.txt is for and how the indexing process works, it's time for you to learn how to fix "indexed, though blocked by robots.txt" in GSC.

Before applying a fix, the first thing you have to do is determine whether the page showing the error really needs to be indexed or not.

This is because the actions you will take in these two conditions will be different. If you don't want the URL to be indexed by Google, then here are some things you can do:

Check the robots.txt File

Figure 3: Screenshot of the robots.txt file displayed on the cmlabs website. One way to fix "indexed, though blocked by robots.txt" is to check the robots.txt file.

The easiest way to prevent a page from being crawled is to check your website's robots.txt file. Make sure that the page you want to block has a Disallow statement.

Although this seems trivial, website managers often forget to set a Disallow statement for the pages they want to block.
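If you want to verify the effect of a Disallow rule without waiting for a crawler, Python's standard urllib.robotparser module can evaluate a rule set against a URL. This is a small sketch; the rules and URLs are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: block /checkout/ for every user agent.
rules = [
    "User-agent: *",
    "Disallow: /checkout/",
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch() returns False when the URL is disallowed for that agent.
print(parser.can_fetch("Googlebot", "https://example.com/checkout/cart"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))      # True
```

Running this locally is a quick way to confirm that a rule actually matches the path you meant to block.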

Use Noindex Directive

You need to know that crawling and indexing are two different processes. Crawling refers to the discovery process carried out by web crawlers to find every page on a website, while indexing refers to analyzing and storing a page.

If a page keeps getting an "indexed, though blocked by robots.txt" warning, you can use the "noindex" meta tag. With this meta tag, search engines will not index the page even though it has been crawled. Note that the crawler must be able to fetch the page to read the tag, so the page must not also be blocked in robots.txt.
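The tag goes in the head of the page you want to keep out of the index. A minimal sketch:

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Tell all crawlers not to index this page. -->
  <meta name="robots" content="noindex">
  <title>Checkout</title>
</head>
<body>...</body>
</html>
```

For non-HTML resources such as PDFs, the same effect can be achieved with an X-Robots-Tag: noindex HTTP response header.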

Pages Linked to Other Websites

Figure 4: Illustration of a chain or link connecting another website to your page. If there are other websites that provide backlinks to your page, then the web crawler can still crawl that page.

A page may be indexed even if it has been blocked by robots.txt. The reason is that other websites link to it, so Google discovers the URL through those links. The fix is to first check the backlinks that point to pages you don't want indexed and request their removal.

When URLs Need to be Indexed

If a page gets an "indexed, though blocked by robots.txt" warning and it is a page you intended to index, then you should check the crawling settings for that page.

This is because a page you intend to appear on the SERP is not actually being crawled by Google, even though it is already indexed. You can check your crawl settings in the following ways:

Check Crawl Block in robots.txt

First, check the settings in the robots.txt file to see whether the page you want to index is actually blocked. You can access the robots.txt file by typing domainname.com/robots.txt into your browser.

After that, check whether the page you want to index has a disallow statement. A disallow statement that applies to all user agents looks like this:

User-agent: *
Disallow: /

If the disallow is specified for a specific user agent, such as Googlebot, it looks like this:

User-agent: Googlebot
Disallow: /

Pages matched by a Disallow statement will not be crawled by robots, so you have to remove the statement, or change it to 'Allow', for the pages you want crawled.

Check User Agent Block

Websites can block access from a user agent such as Googlebot or AhrefsBot so that it cannot crawl the site. When this happens, you may still be able to find the site on other search engines.

However, your website will not be found when using Google or Ahrefs because those user agents are blocked. A block like this can be put in place at several layers of a website, such as .htaccess rules, the CDN, a firewall, or the server configuration.
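As an illustration, a user-agent block in an Apache .htaccess file can look like the hypothetical rule below; if you find rules like this and want the page crawled, they need to be removed:

```apache
# Hypothetical .htaccess rule: returns 403 Forbidden to Googlebot.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [F,L]
```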

The best way you can solve this problem is to contact your hosting provider or CDN to find out where the blocking is coming from and how you can fix it.

Check Intermittent Block

An intermittent block is a condition where the cause of a page not being crawled is difficult to determine because the block comes and goes. To diagnose it, you need to check the history of your website's robots.txt file.

Use a tool like the GSC robots.txt Tester to look at previous versions of the file and check for any incorrect instructions in those versions. Solutions to this issue vary and depend on the cause.

One cause that often occurs is caching: the cached version of robots.txt seen in test mode blocks page access, while the live version allows the page to be crawled. To solve this, you can remove the outdated robots.txt file from the cache.

Check IP Block

If you have checked the three things above and found no problems, then the cause could be a block at the IP address level.

The solution to this problem is to contact your hosting provider or CDN. IP blocks are very difficult to track down yourself, so you need their help to find the source of the block and how to resolve it.

This concludes the explanation of how to fix "indexed, though blocked by robots.txt". Hopefully, with this guide, you can solve the crawling problems on your web pages.

If you need further assistance, you can use SEO services that provide a professional team to help with the SEO optimization process, including website crawlability.
