
Update robots.txt to ensure review app isn’t indexed #2234

Merged 1 commit into main on May 21, 2021

Conversation

@36degrees (Contributor) commented May 21, 2021

The review app does two things to prevent pages from being indexed: it adds an `X-Robots-Tag` header to responses, and it also serves a `/robots.txt` file which disallows all robots from crawling the site.

However, the `robots.txt` disallow statement actually prevents robots from ever seeing the `X-Robots-Tag` header, which means that although pages from the review app can't be crawled, they can still appear in search indexes.
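The broken interaction can be sketched with a minimal Node `http`-style handler (a hypothetical illustration; the review app's actual Express code will differ):

```javascript
// Hypothetical sketch of the ORIGINAL setup (not the review app's real code).
// Every response carries X-Robots-Tag: noindex, but robots.txt blocks all
// crawling, so a compliant crawler never fetches a page and never sees the header.
const DISALLOW_ALL = 'User-agent: *\nDisallow: /\n'

function handler (req, res) {
  // This header asks search engines not to index the page...
  res.setHeader('X-Robots-Tag', 'noindex')

  if (req.url === '/robots.txt') {
    // ...but this response tells them never to fetch any page, including
    // the pages carrying the header above.
    res.setHeader('Content-Type', 'text/plain')
    res.end(DISALLOW_ALL)
    return
  }

  res.end('example page')
}

// Usage: require('http').createServer(handler).listen(3000)
```

Because the crawler obeys `Disallow: /`, the `noindex` directive is unreachable, and the URL can still be indexed if other sites link to it.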

From Google's own documentation:

> Important: For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

https://developers.google.com/search/docs/advanced/crawling/block-indexing

> While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).

https://developers.google.com/search/docs/advanced/robots/intro#understand-the-limitations-of-robots.txt

Update the `robots.txt` to allow crawling by all user agents, move the route so that it's closer to the code that sets the `X-Robots-Tag` header, and add a test to check that the `robots.txt` file matches the expected contents.
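A minimal sketch of the fixed behaviour, assuming the new `robots.txt` simply allows everything (the actual contents, route, and test in the repo may differ):

```javascript
// Hypothetical sketch of the FIXED setup: robots.txt now allows crawling,
// so robots fetch pages and see the X-Robots-Tag: noindex header on each one.
const ROBOTS_TXT = 'User-agent: *\nAllow: /\n' // assumed contents

function handler (req, res) {
  // The noindex directive now does the work: crawlers are allowed to fetch
  // pages and therefore actually see this header.
  res.setHeader('X-Robots-Tag', 'noindex')

  if (req.url === '/robots.txt') {
    // Keeping the route close to the header logic makes it easier to see
    // that the two mechanisms no longer contradict each other.
    res.setHeader('Content-Type', 'text/plain')
    res.end(ROBOTS_TXT)
    return
  }

  res.end('example page')
}
```

A test in the spirit of the one described above would request `/robots.txt` and assert that the response body matches `ROBOTS_TXT` exactly.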

@lfdebrux (Member)

Interesting!

Is there any point in having a robots file now, if we're not excluding any robots?

@36degrees (Contributor, Author)

> Is there any point in having a robots file now, if we're not excluding any robots?

Good question! I initially planned to just remove the route, but changed my mind. I think there are 2 (small) benefits:

  1. It's explicit, so it's less likely that someone in the future assumes it's missing and adds it back, disallowing robots access again.
  2. Robots will make requests for it, and if we didn't have a route mapped it'd be a 404. I haven't checked whether this applies to the review app, but there may be situations where 404s will be logged, so it potentially reduces noise.

@lfdebrux (Member) left a comment


I think this makes sense, good spot.

@36degrees 36degrees merged commit 4f71da6 into main May 21, 2021
@36degrees 36degrees deleted the robots-txt branch May 21, 2021 12:14