Summary
Fixes #699 as well as a number of other bugs along the same lines. Unfortunately, as of about a year ago, it seems RobotFileParser is not up to date with the latest specifications on robots.txt formatting and parsing (see: python/cpython#114310). This means that users' links were not properly respecting disallow rules that contain a wildcard character. This PR uses Protego, a faster, lightweight, and up-to-date robots.txt parser that follows the specs outlined by Google. https://github.com/scrapy/protego
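To illustrate the wildcard behavior at issue, here is a minimal sketch of how a robots.txt path pattern with `*` and a trailing `$` is matched under the Google-style rules. It is not Protego's implementation and not code from this PR; `rule_matches` is a hypothetical helper written only for illustration, and it ignores details such as percent-encoding and rule precedence.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Hypothetical helper for illustration only.
    '*' matches any run of characters; a trailing '$' anchors the end.
    Without '$', the pattern only needs to match a prefix of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # fullmatch enforces the '$' anchor; match gives plain prefix semantics.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


if __name__ == "__main__":
    for pattern, path in [
        ("/private*/", "/private-area/page.html"),
        ("/*.pdf$", "/docs/file.pdf"),
        ("/*.pdf$", "/docs/file.pdf?download=1"),
    ]:
        print(pattern, path, rule_matches(pattern, path))
```

A parser that treats `*` as a literal character and only does prefix comparison would never match the first two paths, which is the kind of miss described above.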
List of files changed and why
crawl4ai/utils.py
- Imports Protego and replaces instances of RobotFileParser with the equivalent Protego syntax.
crawl4ai/tests/20241401/test_robot.py
- Adds the test case with wildcard characters from #699. Also renamed from tets_robot.py in main, which seemed to be a typo.
crawl4ai/tests/20241401/test_robot_parser.py
- Fixes several minor bugs in the test cases, particularly for Windows systems. Previously, the test cases would all pass and then the file would throw a PermissionError when cleaning up. The tests now check for this error and fail gracefully instead of crashing the file.
crawl4ai/pyproject.toml
- Adds the latest stable version of Protego for users downloading the library.
crawl4ai/requirements.txt
- Adds the latest stable version of Protego for development installs.
How Has This Been Tested?
Both test_robot.py and test_robot_parser.py were run with 100% test coverage. A new test case was added to test_robot.py reflecting a robots.txt with a wildcard and using the latest spec. Note that several quality-of-life fixes were made to the tests to ensure repeatability across operating systems, but no existing test cases were changed or modified.
Checklist: