robots: block semver-based URLs #2695
Conversation
Google has open-sourced their robots.txt parsing and matching C++ library, which also ships a CLI program that checks whether a specific URL is allowed or not. Having tested with it, the change does not seem to work:
If there's a need, I could probably whip up a tool that goes over a file and checks whether each URL is allowed or not, or maybe something that makes sure robots.txt works as intended.
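Something along these lines, perhaps. This is only a sketch: the expectations-file format and the `CheckFn` hook are made up for illustration, and the hook could be backed by either Google's CLI or a Rust port of the matcher.

```rust
use std::fs;

/// Hypothetical hook: returns true if `url` is allowed for `user_agent`
/// under `robots_txt`. Could be backed by Google's CLI or by a Rust port.
type CheckFn = fn(robots_txt: &str, user_agent: &str, url: &str) -> bool;

/// Reads expectations, one per line, in the form
/// `allow https://docs.rs/...` or `disallow https://docs.rs/...`,
/// and reports every URL whose actual verdict differs from the expected one.
fn verify_robots(robots_path: &str, cases_path: &str, user_agent: &str, check: CheckFn) -> bool {
    let robots_txt = fs::read_to_string(robots_path).expect("failed to read robots.txt");
    let cases = fs::read_to_string(cases_path).expect("failed to read cases file");
    let mut all_ok = true;
    for (lineno, line) in cases.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue; // skip blank lines and comments
        }
        let (expected, url) = match line.split_once(' ') {
            Some(("allow", url)) => (true, url),
            Some(("disallow", url)) => (false, url),
            _ => {
                eprintln!("line {}: cannot parse {:?}", lineno + 1, line);
                all_ok = false;
                continue;
            }
        };
        let actual = check(&robots_txt, user_agent, url);
        if actual != expected {
            println!(
                "MISMATCH for {url}: expected {}, got {}",
                if expected { "allow" } else { "disallow" },
                if actual { "allow" } else { "disallow" }
            );
            all_ok = false;
        }
    }
    all_ok
}
```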
There seems to be a Rust port of that library, which means we could write unit tests for it.
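For example, a unit test against that port could look roughly like the following. The crate name `robotstxt` and the `DefaultMatcher::one_agent_allowed_by_robots` call are taken from memory of the port's README and may not match exactly, and the rules and URLs are illustrative only, not the actual docs.rs ones.

```rust
// Assumed in Cargo.toml: robotstxt = "0.3" (the Rust port of Google's matcher)
use robotstxt::DefaultMatcher;

// Illustrative rules only -- not the actual docs.rs robots.txt.
const ROBOTS_TXT: &str = "User-agent: *\nDisallow: /some-crate/1.2.3/\n";

#[test]
fn semver_url_is_blocked_but_latest_is_not() {
    let mut matcher = DefaultMatcher::default();
    // A version-specific URL should be disallowed...
    assert!(!matcher.one_agent_allowed_by_robots(
        ROBOTS_TXT,
        "Googlebot",
        "https://docs.rs/some-crate/1.2.3/some_crate/"
    ));
    // ...while the /latest/ URL stays crawlable.
    assert!(matcher.one_agent_allowed_by_robots(
        ROBOTS_TXT,
        "Googlebot",
        "https://docs.rs/some-crate/latest/some_crate/"
    ));
}
```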
And remove / from the Disallow rule
Ooh thanks for finding that tool and checking. I tried a few different revisions against https://technicalseo.com/tools/robots-txt/. There seem to be two problems:
Thing is, that port is three years old and doesn't seem to be maintained, so whether it's up to date with what Google is doing is anyone's guess. While it's not ideal, I'd much rather pull Google's code and write a Rust program that executes their CLI - at least that way it stays up to date. That assumes the CI is even set up for building C++ code; from a quick look, there are only two dependencies, both downloaded from source at build time. Both options require some work, so it's very much a pick-your-poison situation.
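Roughly what I have in mind for the wrapper is below. The binary path, the argument order, and the ALLOWED/DISALLOWED output format are assumptions I have not checked against the upstream sources.

```rust
use std::process::Command;

/// Asks the compiled Google robots.txt checker whether `url` is allowed.
///
/// Assumptions (not verified): the checker is available as `./robots`,
/// takes `<robots.txt path> <user-agent> <url>` in that order, and prints
/// a line containing ALLOWED or DISALLOWED.
fn allowed_by_google_cli(robots_path: &str, user_agent: &str, url: &str) -> Option<bool> {
    let output = Command::new("./robots")
        .args([robots_path, user_agent, url])
        .output()
        .ok()?;
    let stdout = String::from_utf8_lossy(&output.stdout);
    // Check DISALLOWED first, since "ALLOWED" is a substring of it.
    if stdout.contains("DISALLOWED") {
        Some(false)
    } else if stdout.contains("ALLOWED") {
        Some(true)
    } else {
        None // unexpected output; the assumptions above may be wrong
    }
}
```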
I just re-ran their tests (which run the test suite from the original Google library against the Rust fork), and they still pass. (I might have missed an implementation or compilation detail; my C++ days, and specifically cmake/make, are very long ago.)
If that's the case, it would make sense to incorporate it, true. Maybe make sure
I don't think so, but I also haven't looked closely at stuff. Do note that
I downloaded and built the official Google robots checker linked above, and tested it with Turns out that does work correctly, so long as I specify
Apart from that, I'm fine with the change. General question for the deploy: do you feel this should be watched after deploy? Can you watch the Search Console for this during the next (= Christmas) week? Otherwise I would postpone the deploy until there is time to watch the console.
Yeah, mainly I'd want to check whether the URLs actually got blocked. The effect should be pretty quick, as soon as robots.txt is fetched next (about 24 hours). I'm happy to check in periodically over the next week.
Part of #1438