How robots.txt Files Work | moakh

14. How robots.txt Files Work

Learn how to control bots with the robots.txt file.

A robots.txt file allows a website to tell bots which parts of the site shouldn’t be crawled.

If a website has a robots.txt file, it’s always in the same place on every website — at the root of the domain.

So for www.google.com, it’s located here:

https://www.google.com/robots.txt
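Because the location is fixed, the robots.txt URL can be derived from any page URL on the same site. The snippet below is a minimal sketch using Python's standard library; the function name `robots_txt_url` is made up for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap in the fixed path,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.google.com/search?q=robots"))
# https://www.google.com/robots.txt
```

Note that the scheme and host are copied unchanged, so a page on a different form of the site produces a different robots.txt URL.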

What robots.txt Does

Bots are supposed to avoid fetching URLs that are specified by the robots.txt file, but they don’t always behave nicely.

Blocking a page or section of a website with robots.txt won’t necessarily stop search engines from displaying it in the search results, because a search engine can still list a URL it has never fetched if other pages link to it. The file is more like a meek suggestion to bots: “please don’t fetch these URLs, but there is nothing I can do to stop you.”
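To make the "polite request" idea concrete, here is a rough sketch of the check a well-behaved bot could run before fetching a URL, using Python's built-in `urllib.robotparser`. The rules, the `ExampleBot` name, the `is_allowed` helper, and the example.com URLs are all invented for this illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask every bot to stay out of /private/.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(EXAMPLE_RULES, "ExampleBot",
                 "https://www.example.com/private/page.html"))  # False
print(is_allowed(EXAMPLE_RULES, "ExampleBot",
                 "https://www.example.com/public/page.html"))   # True
```

Nothing enforces the result. A bot that skips this check, or ignores the answer, can still request the blocked URL.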

The Four Forms of Sites

As mentioned above, a robots.txt file will always be found at the root of a website. So to find the file on google.com, you would visit google.com/robots.txt.

The full URL is:

https://www.google.com/robots.txt

Google will redirect you to the HTTPS and WWW versions of their site.

Google considers the following four forms of a domain to be four different websites, so a robots.txt file on one of them won’t affect URLs on the other forms:

  • http://google.com/ — no-HTTPS, no-www
  • http://www.google.com/ — no-HTTPS, yes-www
  • https://google.com/ — yes-HTTPS, no-www
  • https://www.google.com/ — yes-HTTPS, yes-www

So a robots.txt file at http://google.com/robots.txt would have different effects from one at https://www.google.com/robots.txt, because Google considers them to be different sites.
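The four forms can be enumerated mechanically, which is handy for checking what each one serves. This is a small sketch; `robots_txt_variants` is a name invented for this example, and it assumes you pass the bare domain without a scheme or "www." prefix.

```python
from itertools import product


def robots_txt_variants(bare_domain: str) -> list[str]:
    """List the robots.txt URL for each scheme/www combination.

    Hypothetical helper; expects a bare domain such as "google.com".
    """
    urls = []
    # Two schemes times two host prefixes gives the four forms.
    for scheme, prefix in product(("http", "https"), ("", "www.")):
        urls.append(f"{scheme}://{prefix}{bare_domain}/robots.txt")
    return urls


for url in robots_txt_variants("google.com"):
    print(url)
# http://google.com/robots.txt
# http://www.google.com/robots.txt
# https://google.com/robots.txt
# https://www.google.com/robots.txt
```

Each URL belongs to a separate site from a crawler's point of view, so each one would need to be checked on its own unless the site redirects them to a single version.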

(The best practice is to redirect three of those to a single version, which we will practice later. Google redirects them, so there is only one file.)

A full robots.txt tutorial is coming soon. In the meantime, check out Google’s documentation to learn about the syntax of the file.

Takeaways

Things you should remember from this section:

  • The robots.txt file can be used to politely ask bots not to crawl parts of a site.
  • Bots don’t have to pay attention to the robots.txt file.
  • Even if a site blocks Google with robots.txt, Google still might list the blocked pages in the search results. (Other ways to block search engines are coming next.)
  • Every website has four possible forms, and they are all considered different sites by Google. The forms are based on HTTPS vs. no-HTTPS and WWW vs. no-WWW.