14. How robots.txt Files Work

Learn how to control bots with the robots.txt file.

A robots.txt file lets a website tell bots which parts of the site shouldn’t be crawled.

If a website has a robots.txt file, it’s always in the same place: at the root of the domain.

So for www.google.com, it’s located here:

https://www.google.com/robots.txt
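Because the location is fixed, you can work out the robots.txt URL from any page URL on a site. Here is a minimal Python sketch of that idea. The function name `robots_txt_url` is made up for this example, and it assumes the input is a full URL that includes the scheme.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.google.com/search?q=robots"))
# prints https://www.google.com/robots.txt
```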

What `robots.txt` Does

Bots are supposed to avoid fetching the URLs that the robots.txt file disallows, but they don’t always behave nicely.

Blocking a page or section of a website with robots.txt won’t necessarily stop search engines from displaying it in the search results, because a search engine can still list a URL it found through links without ever fetching it. The file is more like a meek suggestion to bots: “please don’t fetch these URLs, but there is nothing I can do to stop you.”
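To see what a well-behaved bot does with the file, here is a rough Python sketch using the standard library’s `urllib.robotparser` module. The rules, the example.com URLs, and the `ExampleBot` name are all invented for illustration, and the rules are parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("https://www.example.com/private/report.html"))  # False
print(may_fetch("https://www.example.com/index.html"))           # True
```

Notice that the check happens inside the bot’s own code. A bot that never runs a check like this can fetch the disallowed URL anyway, which is why the file is only a request.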

The Four Forms of Sites

As mentioned above, a robots.txt file is always found at the root of a website. So to find the file on google.com, you would visit google.com/robots.txt.

Google will redirect you to the HTTPS and WWW version of its site, so the full URL is:

https://www.google.com/robots.txt

Every site has four possible forms, based on HTTPS vs. no-HTTPS and WWW vs. no-WWW. Google considers them to be four different websites, so a robots.txt file on one of them won’t affect URLs on the other forms:

http://google.com/ — no-HTTPS, no-www
http://www.google.com/ — no-HTTPS, yes-www
https://google.com/ — yes-HTTPS, no-www
https://www.google.com/ — yes-HTTPS, yes-www

So a robots.txt file at http://google.com/robots.txt would have different effects from one at https://www.google.com/robots.txt, because Google considers them to be different sites.
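The four forms can be listed with a few lines of Python. This is only a sketch to make the point concrete: the function name `robots_txt_forms` is made up, and it assumes you pass in the bare domain without a scheme or a www prefix.

```python
from itertools import product


def robots_txt_forms(bare_domain: str) -> list[str]:
    """List the robots.txt URL for each of the four forms of a site."""
    urls = []
    # Two schemes times two host prefixes gives four separate sites.
    for scheme, prefix in product(("http", "https"), ("", "www.")):
        urls.append(f"{scheme}://{prefix}{bare_domain}/robots.txt")
    return urls


for url in robots_txt_forms("google.com"):
    print(url)
# http://google.com/robots.txt
# http://www.google.com/robots.txt
# https://google.com/robots.txt
# https://www.google.com/robots.txt
```

Each of those four URLs could, in principle, serve a different file with different rules.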

(The best practice is to redirect three of those forms to the fourth, which we will practice later. Google redirects them, so in effect there is only one robots.txt file.)

A full robots.txt tutorial is coming soon. In the meantime, check out Google’s documentation to learn about the syntax of the file.

Takeaways

Things you should remember from this section:

The robots.txt file can be used to politely ask bots not to crawl parts of a site.
Bots don’t have to pay attention to the robots.txt file.
Even if a site blocks Google with robots.txt, Google still might list the blocked pages in the search results. (Other ways to block search engines are coming next.)
Every website has four possible forms, and they are all considered different sites by Google. The forms are based on HTTPS vs. no-HTTPS and WWW vs. no-WWW.
