
Update robots.txt to ensure review app isn’t indexed #2234

Merged 1 commit into main on May 21, 2021

Conversation

@36degrees (Contributor) commented May 21, 2021

The review app does two things to prevent pages from being indexed: it adds an `X-Robots-Tag` header to responses, and it also serves a `/robots.txt` file which disallows all robots from crawling the site.

However, the `robots.txt` disallow statement actually prevents robots from ever seeing the `X-Robots-Tag` header, which means that although pages from the review app can't be crawled, they can still appear in search indexes.
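The broken interaction can be sketched with a minimal Node `http`-style handler (a hypothetical illustration; the review app's actual Express code will differ):

```javascript
// Hypothetical sketch of the ORIGINAL setup (not the review app's real code).
// Every response carries X-Robots-Tag: noindex, but robots.txt blocks all
// crawling, so a compliant crawler never fetches a page and never sees the header.
const DISALLOW_ALL = 'User-agent: *\nDisallow: /\n'

function handler (req, res) {
  // This header asks search engines not to index the page...
  res.setHeader('X-Robots-Tag', 'noindex')

  if (req.url === '/robots.txt') {
    // ...but this response tells them never to fetch any page, including
    // the pages carrying the header above.
    res.setHeader('Content-Type', 'text/plain')
    res.end(DISALLOW_ALL)
    return
  }

  res.end('example page')
}

// Usage: require('http').createServer(handler).listen(3000)
```

Because the crawler obeys `Disallow: /`, the `noindex` directive is unreachable, and the URL can still be indexed if other sites link to it.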

From Google's own documentation:

> Important: For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

https://developers.google.com/search/docs/advanced/crawling/block-indexing

> While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).

https://developers.google.com/search/docs/advanced/robots/intro#understand-the-limitations-of-robots.txt

Update the `robots.txt` to allow crawling by all user agents, move the route so that it's closer to the code that sets the `X-Robots-Tag` header, and add a test to check that the `robots.txt` file matches the expected contents.
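A minimal sketch of the fixed behaviour, assuming the new `robots.txt` simply allows everything (the actual contents, route, and test in the repo may differ):

```javascript
// Hypothetical sketch of the FIXED setup: robots.txt now allows crawling,
// so robots fetch pages and see the X-Robots-Tag: noindex header on each one.
const ROBOTS_TXT = 'User-agent: *\nAllow: /\n' // assumed contents

function handler (req, res) {
  // The noindex directive now does the work: crawlers are allowed to fetch
  // pages and therefore actually see this header.
  res.setHeader('X-Robots-Tag', 'noindex')

  if (req.url === '/robots.txt') {
    // Keeping the route close to the header logic makes it easier to see
    // that the two mechanisms no longer contradict each other.
    res.setHeader('Content-Type', 'text/plain')
    res.end(ROBOTS_TXT)
    return
  }

  res.end('example page')
}
```

A test in the spirit of the one described above would request `/robots.txt` and assert that the response body matches `ROBOTS_TXT` exactly.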

@lfdebrux (Member)

Interesting!

Is there any point in having a robots file now, if we're not excluding any robots?

@36degrees (Contributor, Author)

> Is there any point in having a robots file now, if we're not excluding any robots?

Good question! I initially planned to just remove the route, but changed my mind. I think there are 2 (small) benefits:

  1. It's explicit, so it's less likely that someone in the future assumes it's missing and adds it back, disallowing robots access again.
  2. Robots will make requests for it, and if we didn't have a route mapped it'd be a 404. I haven't checked whether this applies to the review app, but there may be situations where 404s will be logged, so it potentially reduces noise.

@lfdebrux (Member) left a comment


I think this makes sense, good spot.

@36degrees 36degrees merged commit 4f71da6 into main May 21, 2021
@36degrees 36degrees deleted the robots-txt branch May 21, 2021 12:14