
[Bug]: check_robots_txt not working #699

Open
mllife opened this issue Feb 17, 2025 · 6 comments · May be fixed by #708
Labels
💪 - Intermediate (difficulty) · 🥸 - Medium (priority) · 🐞 Bug

Comments

@mllife

mllife commented Feb 17, 2025

crawl4ai version

0.4.248

Expected Behavior

The library should raise an error (or refuse to crawl) https://www.nsf.gov/awardsearch/advancedSearch.jsp, since that path is disallowed by the site's robots.txt (https://www.nsf.gov/robots.txt). Please fix.

Current Behavior

It is able to scrape this page even though I have set check_robots_txt=True in CrawlerRunConfig.
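
For reference, a minimal reproduction sketch along these lines (assuming crawl4ai's async API, i.e. AsyncWebCrawler.arun with a config argument; exact result fields may differ by version):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(check_robots_txt=True)
    async with AsyncWebCrawler() as crawler:
        # This URL is disallowed by https://www.nsf.gov/robots.txt,
        # so the crawl should fail or be skipped.
        result = await crawler.arun(
            url="https://www.nsf.gov/awardsearch/advancedSearch.jsp",
            config=config,
        )
        # Observed: the page is crawled successfully anyway.
        print(result.success)

asyncio.run(main())
```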

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Code snippets

OS

macOS

Python version

3.11.9

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

mllife added the 🐞 Bug and 🩺 Needs Triage labels on Feb 17, 2025
unclecode added the 💪 - Intermediate and 🥸 - Medium labels, and removed 🩺 Needs Triage, on Feb 17, 2025
@flancast90

@mllife I have encountered the same issue. I'm planning to make a fix and open a pull request tonight, and I'll update you when it's done so you have a quick and timely solution!

@flancast90

Oddly enough, this comes from urllib's RobotFileParser side of things. My research led me here: https://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly. But as far as I can see, nothing in Crawl4AI itself is handling things wrong; a urllib function is producing the wrong output. @mllife

For now, the most likely route is to provide a replacement method within Crawl4AI itself that corrects this.
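
The urllib behaviour can be checked in isolation with the standard-library parser; a minimal sketch (the disallowed path is taken from the report above):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.nsf.gov/robots.txt")
rp.read()

# Expected: False, since the path is disallowed in robots.txt.
# If the parser mishandles this file (as the linked reports suggest),
# it returns True, which would explain why the crawl is not blocked.
print(rp.can_fetch("*", "https://www.nsf.gov/awardsearch/advancedSearch.jsp"))
```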

@flancast90

Further update: it seems this is an open issue in CPython itself, see python/cpython#114310. Since it has been open for over a year without any resolution, I believe the best option is to switch to a different library for parsing the robots.txt file, and I'm looking for a suitable one now to work around this (ongoing) issue.
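
As an illustration of what such a swap could look like, here is a sketch using the third-party protego parser (an example alternative only, not necessarily the library chosen for the eventual fix):

```python
import urllib.request

from protego import Protego  # pip install protego

robots_txt = (
    urllib.request.urlopen("https://www.nsf.gov/robots.txt").read().decode("utf-8")
)
rp = Protego.parse(robots_txt)

# Note: protego's can_fetch takes (url, user_agent), the reverse of
# urllib.robotparser's (user_agent, url).
print(rp.can_fetch("https://www.nsf.gov/awardsearch/advancedSearch.jsp", "*"))
```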

@flancast90

This issue is now fixed in my development environment, @mllife. I will be putting out a PR shortly that you can work off of in the meantime.

@flancast90 flancast90 linked a pull request Feb 17, 2025 that will close this issue
@flancast90

@mllife Please see #708 for the fixes; let me know if you have any issues or questions. I'm not sure about the timeline for merging the PR into main, but for now working off the PR branch should be stable enough.

@aravindkarnam
Collaborator

@flancast90 Thanks for raising the PR. We'll review it soon.
