Fix/robots.txt parsing #708

Open: wants to merge 5 commits into base: main

Conversation

flancast90

Summary

Fixes #699, along with several related bugs along the same lines. Unfortunately, as of about a year ago, it seems RobotFileParser is not up to date with the latest specifications on robots.txt formatting and parsing (see: python/cpython#114310). As a result, users' crawls did not properly respect any disallowed URL with a wildcard character in it. This PR uses Protego (https://github.com/scrapy/protego), a faster, lightweight robots.txt parser that is up to date with the specs outlined by Google.
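To make the failure mode concrete, here is a minimal sketch of the difference between literal prefix matching and wildcard-aware matching of a Disallow path. It is not Protego's implementation and not the code in this PR: the helper names `literal_prefix_match`, `rule_to_regex`, and `wildcard_match` are hypothetical, and the semantics assumed are the commonly documented ones (`*` matches any run of characters, a trailing `$` anchors the end of the path).

```python
import re


def literal_prefix_match(rule: str, path: str) -> bool:
    """Treat the rule as a plain prefix, so '*' is just another character.

    Assumption: this approximates a parser without wildcard support.
    """
    return path.startswith(rule)


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule into a compiled regex (illustrative)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal chunk, then join the chunks with '.*' for every '*'.
    body = ".*".join(re.escape(chunk) for chunk in rule.split("*"))
    # Without '$' the rule is a prefix match; with it, the path must end there.
    return re.compile(body + (r"\Z" if anchored else ""))


def wildcard_match(rule: str, path: str) -> bool:
    """Return True if the path is covered by the rule, honoring '*' and '$'."""
    # re.match anchors at the start of the path, as a path rule requires.
    return rule_to_regex(rule).match(path) is not None


if __name__ == "__main__":
    rule = "/private/*"
    path = "/private/secret"
    print("literal :", literal_prefix_match(rule, path))
    print("wildcard:", wildcard_match(rule, path))
```

With literal matching, a rule such as `/private/*` never covers `/private/secret`, so the URL is treated as allowed; that is the kind of miss reported in #699.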

List of files changed and why

crawl4ai/utils.py - imports Protego and replaces instances of RobotFileParser with the proper new syntax for the lib.
crawl4ai/tests/20241401/test_robot.py - Adds the test case with wildcard characters from #699. Also renamed from tets_robot.py in main, which seemed to be a typo.
crawl4ai/tests/20241401/test_robot_parser.py - Fixes several minor bugs in the test cases, particularly on Windows systems. Previously, the test cases would all pass and then the file would throw a PermissionError when cleaning up. Now, I've added checks to fail more gracefully when encountering this error, as opposed to crashing the file.
crawl4ai/pyproject.toml - Add Protego latest stable version for users downloading the library
crawl4ai/requirements.txt - Add Protego latest stable version for development installs
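For reviewers unfamiliar with the Windows cleanup issue mentioned for test_robot_parser.py, the general idea can be sketched as below. This is an illustration only, not the code in this PR; `remove_quietly` is a hypothetical helper name, and the assumption is that the error arises when a temporary file is deleted while something still holds it open.

```python
import os
import tempfile


def remove_quietly(path: str) -> bool:
    """Try to delete a file; report failure instead of raising.

    Returns True if the file is gone afterwards, False if the OS refused.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Already deleted, which is fine for cleanup purposes.
        return True
    except PermissionError:
        # On Windows this usually means another handle still has the file open.
        return False


if __name__ == "__main__":
    # Create a throwaway file, close its handle, then clean it up.
    fd, tmp_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    if not remove_quietly(tmp_path):
        print(f"warning: could not remove {tmp_path}; leaving it behind")
```

Returning a flag rather than raising lets the test report its real result first and only warn about leftover files.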

How Has This Been Tested?

Both test_robot.py and test_robot_parser.py were run with 100% test coverage. A new test case was added to test_robot.py reflecting a robots.txt with a wildcard and using the latest spec. Several quality-of-life fixes were also made to the tests to ensure repeatability across operating systems; no existing test cases were changed or modified.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@mllife

mllife commented Feb 18, 2025

Thanks brother.

Development

Successfully merging this pull request may close these issues.

[Bug]: check_robots_txt not working
2 participants