[Bug]: check_robots_txt not working #699
Comments
@mllife I have encountered the same issue. I plan to write a fix and open a pull request tonight, and will update you when it is ready.
Oddly enough, this comes from the urllib.robotparser side of things. My research led me here: https://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly. As far as I can see, nothing in Crawl4AI itself is handling things wrong; a urllib function is returning the wrong output. @mllife For now, I will most likely provide a replacement method within Crawl4AI itself that corrects this.
Further update: this appears to be an open issue in CPython itself; see python/cpython#114310. Since it has been open for over a year without resolution, I believe it is best to find a different library for parsing the robots.txt file, and I am looking for a suitable one now to work around this ongoing issue.
This issue is now fixed in my development environment, @mllife. I will open a PR shortly that you can work from in the meantime.
@flancast90 Thanks for raising the PR. We'll review it soon.
crawl4ai version
0.4.248
Expected Behavior
The library should raise an error for https://www.nsf.gov/awardsearch/advancedSearch.jsp, because the site's robots.txt (https://www.nsf.gov/robots.txt) disallows scraping this path. Please fix.
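For reference, here is a minimal sketch of the check I would expect, using only the standard library. The rule text in ASSUMED_ROBOTS_TXT and the helper name is_allowed are my own assumptions for illustration; the live robots.txt may contain different rules, and this is not how Crawl4AI implements check_robots_txt. It does not reproduce the reported failure; it only shows the outcome I expect for a disallowed path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; the live
# https://www.nsf.gov/robots.txt may differ.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /awardsearch/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.nsf.gov/awardsearch/advancedSearch.jsp"
    # Under the assumed rules this path is disallowed, so a crawler that
    # honours robots.txt should refuse to fetch it.
    print(is_allowed(ASSUMED_ROBOTS_TXT, url))
```

With check_robots_txt=True, I would expect the crawl to be refused whenever a check like this returns False.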
Current Behavior
It scrapes this page even though I have set check_robots_txt=True in CrawlerRunConfig.
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
OS
macOS
Python version
3.11.9
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response