
[Bug]: check_robots_txt not working #699

Open
mllife opened this issue Feb 17, 2025 · 6 comments · May be fixed by #708
Labels
💪 - Intermediate (difficulty) · 🥸 - Medium (priority) · 🐞 Bug

Comments

@mllife

mllife commented Feb 17, 2025

crawl4ai version

0.4.248

Expected Behavior

The library should raise an error (or refuse to crawl) https://www.nsf.gov/awardsearch/advancedSearch.jsp, since that path is disallowed by the site's robots.txt (https://www.nsf.gov/robots.txt). Please fix.

Current Behavior

It is able to scrape this page even though I have set check_robots_txt=True in CrawlerRunConfig.
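
For reference, a minimal reproduction sketch along these lines (assuming crawl4ai's async API, i.e. AsyncWebCrawler.arun with a config argument; exact result fields may differ by version):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(check_robots_txt=True)
    async with AsyncWebCrawler() as crawler:
        # This URL is disallowed by https://www.nsf.gov/robots.txt,
        # so the crawl should fail or be skipped.
        result = await crawler.arun(
            url="https://www.nsf.gov/awardsearch/advancedSearch.jsp",
            config=config,
        )
        # Observed: the page is crawled successfully anyway.
        print(result.success)

asyncio.run(main())
```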

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Code snippets

OS

macOS

Python version

3.11.9

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

mllife added the 🐞 Bug and 🩺 Needs Triage labels on Feb 17, 2025
unclecode added the 💪 - Intermediate and 🥸 - Medium labels, and removed 🩺 Needs Triage, on Feb 17, 2025
@flancast90

@mllife I have encountered the same issue. I'm planning to make a fix and open a pull request tonight, and I'll update you when it's done so you have a quick and timely solution!

@flancast90

Oddly enough, this comes from urllib's RobotFileParser side of things. My research led me here: https://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly. But as far as I can see, nothing in Crawl4AI itself is handling things wrong; a urllib function is producing the wrong output. @mllife

For now, the most likely route is to provide a replacement method within Crawl4AI itself that corrects this.
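
The urllib behaviour can be checked in isolation with the standard-library parser; a minimal sketch (the disallowed path is taken from the report above):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.nsf.gov/robots.txt")
rp.read()

# Expected: False, since the path is disallowed in robots.txt.
# If the parser mishandles this file (as the linked reports suggest),
# it returns True, which would explain why the crawl is not blocked.
print(rp.can_fetch("*", "https://www.nsf.gov/awardsearch/advancedSearch.jsp"))
```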

@flancast90

Further update: it seems this is an open issue in CPython itself, see python/cpython#114310. Since it has been open for over a year without any resolution, I believe the best option is to switch to a different library for parsing the robots.txt file, and I'm looking for a suitable one now to work around this (ongoing) issue.
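
As an illustration of what such a swap could look like, here is a sketch using the third-party protego parser (an example alternative only, not necessarily the library chosen for the eventual fix):

```python
import urllib.request

from protego import Protego  # pip install protego

robots_txt = (
    urllib.request.urlopen("https://www.nsf.gov/robots.txt").read().decode("utf-8")
)
rp = Protego.parse(robots_txt)

# Note: protego's can_fetch takes (url, user_agent), the reverse of
# urllib.robotparser's (user_agent, url).
print(rp.can_fetch("https://www.nsf.gov/awardsearch/advancedSearch.jsp", "*"))
```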

@flancast90

This issue is now fixed in my development environment, @mllife. I will be putting out a PR shortly that you can work off of in the meantime.

@flancast90 flancast90 linked a pull request Feb 17, 2025 that will close this issue
@flancast90

@mllife Please see #708 for the fixes; let me know if you have any issues or questions. I'm not sure about the timeline for merging the PR into main, but for now working off the PR branch should be stable enough.

@aravindkarnam
Collaborator

@flancast90 Thanks for raising the PR. We'll review it soon.
