
robots: block semver-based URLs #2695

Merged (3 commits, Dec 22, 2024)

Conversation

jsha (Contributor) commented Dec 19, 2024

Part of #1438

jsha requested a review from a team as a code owner on December 19, 2024 at 20:36
github-actions bot added the S-waiting-on-review label on Dec 19, 2024
jaskij commented Dec 20, 2024

Google has open-sourced their robots.txt parsing and matching C++ library. They also include a CLI program to check whether a specific URL is allowed or not.

I tested the change with it, and it does not seem to work:

$ ./robots robots.txt - https://docs.rs/tokio-postgres/0.7.7/tokio_postgres/index.html
user-agent '-' with URI 'https://docs.rs/tokio-postgres/0.7.7/tokio_postgres/index.html': ALLOWED
$ ./robots robots.txt - https://docs.rs/tokio-postgres/\^0.7.7/tokio_postgres/index.html
user-agent '-' with URI 'https://docs.rs/tokio-postgres/^0.7.7/tokio_postgres/index.html': ALLOWED
$ ./robots robots.txt - https://docs.rs/tokio-postgres/\~0.7.7/tokio_postgres/index.html
user-agent '-' with URI 'https://docs.rs/tokio-postgres/~0.7.7/tokio_postgres/index.html': ALLOWED

If there's a need, I could probably whip up a tool that goes over a file and checks if a URL is allowed or not, or maybe something that makes sure robots.txt works as intended.

syphar (Member) commented Dec 20, 2024

Google has open-sourced their robots.txt parsing and matching C++ library. They also include a CLI program to check whether a specific URL is allowed or not.

There seems to be a Rust port of that library, which means we could write unit tests for it.

And remove / from the Disallow rule
jsha (Contributor, Author) commented Dec 20, 2024

Ooh thanks for finding that tool and checking. I tried a few different revisions against https://technicalseo.com/tools/robots-txt/. There seem to be two problems:

  • I hadn't specified User-Agent: *
  • For some reason Disallow: */^ didn't result in disallowing the URL we wanted to block, but Disallow: *^ did. I'm not sure why they are treated differently, but for now I've uploaded a new version with Disallow: *^ (roughly as in the snippet below).
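
For reference, the new version should look roughly like this (the ~ rule is the analogous one for tilde requirements; the exact contents are in the PR diff):

Sitemap: https://docs.rs/sitemap.xml
User-Agent: *
Disallow: *^
Disallow: *~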

jaskij commented Dec 20, 2024

There seems to be a Rust port of that library, which means we could write unit tests for it.

Thing is, that port is three years old and doesn't seem to be maintained. Whether it's up to date with what Google is doing is anyone's guess.

While it's not ideal, I'd much rather pull Google's code and write a Rust program that executes their CLI (roughly as sketched at the end of this comment), since at least that way it stays up to date. Assuming the CI is even set up for building C++ code. From a quick look, there are only two dependencies, both downloaded from source at build time.

Both options require some work and it's very much a pick your poison situation.
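
For the CLI route, a test that shells out to the C++ binary could look something like this; the binary path, the robots.txt path, and the output check are assumptions based on the output shown above:

use std::process::Command;

/// Runs Google's `robots` CLI (built separately from their robotstxt repo)
/// and reports whether the given URL is allowed for the given user agent.
fn url_allowed(robots_bin: &str, robots_txt: &str, user_agent: &str, url: &str) -> bool {
    let output = Command::new(robots_bin)
        .args([robots_txt, user_agent, url])
        .output()
        .expect("failed to run robots CLI");
    // The CLI prints "...: ALLOWED" or "...: DISALLOWED"; matching on
    // ": ALLOWED" avoids matching the tail of "DISALLOWED".
    String::from_utf8_lossy(&output.stdout).contains(": ALLOWED")
}

#[test]
fn semver_urls_are_blocked() {
    // Paths are placeholders; point them at the CI-built binary and the
    // repository's static/robots.txt.
    assert!(!url_allowed(
        "./robots",
        "static/robots.txt",
        "Googlebot",
        "https://docs.rs/tokio-postgres/^0.7.7/tokio_postgres/index.html",
    ));
}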

syphar (Member) commented Dec 20, 2024

There seems to be a Rust port of that library, which means we could write unit tests for it.

Thing is, that port is three years old and doesn't seem to be maintained. Whether it's up to date with what Google is doing is anyone's guess.

I just re-ran their tests (which run the test suite from the original Google library against the Rust fork), and they still pass.
So I think we might be fine, assuming of course that everything is also covered by tests in the Google library.

(I might have missed an implementation / compilation detail; my C++ (and specifically cmake/make) days are long behind me.)
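
If we go with the Rust port, a unit test could look roughly like this; the crate name, matcher type, and method are assumptions based on the port mirroring Google's C++ matcher API, and the robots.txt path is a placeholder:

// Sketch only: `robotstxt` is assumed to be the Rust port of Google's C++ library.
use robotstxt::DefaultMatcher;

#[test]
fn semver_urls_are_disallowed() {
    // Placeholder path for wherever the repository keeps its robots.txt.
    let robots_txt = include_str!("../static/robots.txt");
    let mut matcher = DefaultMatcher::default();

    // Semver-range URLs (containing ^ or ~) should be blocked...
    assert!(!matcher.one_agent_allowed_by_robots(
        robots_txt,
        "Googlebot",
        "https://docs.rs/tokio-postgres/^0.7.7/tokio_postgres/index.html",
    ));
    // ...while exact-version URLs stay crawlable.
    assert!(matcher.one_agent_allowed_by_robots(
        robots_txt,
        "Googlebot",
        "https://docs.rs/tokio-postgres/0.7.7/tokio_postgres/index.html",
    ));
}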

jaskij commented Dec 20, 2024

There seems to be a Rust port of that library, which means we could write unit tests for it.

Thing is, that port is three years old and doesn't seem to be maintained. Whether it's up to date with what Google is doing is anyone's guess.

I just re-ran their tests (which run the test suite from the original Google library against the Rust fork), and they still pass. So I think we might be fine, assuming of course that everything is also covered by tests in the Google library.

If that's the case, it would make sense to incorporate it, true. Maybe make sure robots_test.cc is up to date, but otherwise I don't think there are any objections.

(I might have missed an implementation / compilation detail; my C++ (and specifically cmake/make) days are long behind me.)

I don't think so, but I also haven't looked closely at it. Do note that robots_test.cc is slightly different between the two projects, on account of linking against the Rust library to actually run the tests. But otherwise it all seems good to me. Feel free to ping me (here or on the Rust Discord) if there are any issues with the CMake/C++ part; my skills in both are quite current.

jsha (Contributor, Author) commented Dec 20, 2024

I downloaded and built the official Google robots checker linked above, and tested it with Disallow: */^.

Turns out that does work correctly, so long as I specify User-Agent: *. I think the third-party site I checked with last night was misleading.

$ cat ~/docs.rs/static/robots.txt
Sitemap: https://docs.rs/sitemap.xml
# Semver-based URLs are always redirects, and sometimes
# confuse Google's duplicate detection, so we block crawling them.
# https://docs.rs/about/redirections
User-Agent: *
Disallow: */^
Disallow: */~
$ ./robots ~/docs.rs/static/robots.txt Googlebot 'https://docs.rs/tokio-postgres/^0.7.7/tokio_postgres/index.html'
user-agent 'Googlebot' with URI 'https://docs.rs/tokio-postgres/^0.7.7/tokio_postgres/index.html': DISALLOWED
$ ./robots ~/docs.rs/static/robots.txt Googlebot 'https://docs.rs/tokio-postgres/~0.7.7/tokio_postgres/index.html'
user-agent 'Googlebot' with URI 'https://docs.rs/tokio-postgres/~0.7.7/tokio_postgres/index.html': DISALLOWED

syphar (Member) commented Dec 21, 2024

I downloaded and built the official Google robots checker linked above, and tested it with Disallow: */^.

Turns out that does work correctly, so long as I specify User-Agent: *. I think the third-party site I checked with last night was misleading.

Should we also test that our other URLs still pass? I don't know too much about robots.txt syntax, and whether (for example) ^ or ~ need to be escaped? (I did some random tests and it seems to be fine.)
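
For example, ordinary release URLs should still come back ALLOWED with the same CLI (illustrative invocations, not taken from the PR):

$ ./robots ~/docs.rs/static/robots.txt Googlebot 'https://docs.rs/tokio-postgres/0.7.7/tokio_postgres/index.html'
$ ./robots ~/docs.rs/static/robots.txt Googlebot 'https://docs.rs/tokio-postgres/latest/tokio_postgres/index.html'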

Apart from that I'm fine with the change.

General question for the deploy: do you feel this should be watched after deploy? Can you watch the Search Console for this over the next (= Christmas) week? Otherwise I would postpone the deploy until there is time to watch the console.

jsha (Contributor, Author) commented Dec 21, 2024

do you feel this should be watched after deploy? Can you watch the Search Console for this over the next (= Christmas) week? Otherwise I would postpone the deploy until there is time to watch the console.

Yeah, mainly I'd want to check whether the URLs actually got blocked. The effect should be pretty quick, as soon as robots.txt is next fetched (about 24 hours). I'm happy to check in periodically over the next week.

syphar merged commit a9496dd into rust-lang:master on Dec 22, 2024
9 checks passed
github-actions bot added the S-waiting-on-deploy label and removed the S-waiting-on-review label on Dec 22, 2024
syphar removed the S-waiting-on-deploy label on Dec 22, 2024