[Bug]: extraction strategies are not async #704

Open
ederuiter opened this issue Feb 17, 2025 · 1 comment
Labels
🐞 Bug Something isn't working 🩺 Needs Triage Needs attention of maintainers

Comments

@ederuiter

crawl4ai version

2025-feb-alpha-1

Expected Behavior

Since most of the codebase has async variants, I was surprised to see that extraction is not async. I expected it to be.

Current Behavior

Extraction can take longer than the scraping part (especially with LLMExtraction). Because extraction is currently not async (just parallelized with a thread executor), it blocks any other async functions until it completes.

So I would expect something like this to happen:

  • scrape url1
  • start extracting data from url1
  • scrape url2
  • start extracting data from url2
  • completed extracting data from url1
  • completed extracting data from url2

Instead, this happens:

  • scrape url1
  • start extracting data from url1
  • completed extracting data from url1
  • scrape url2
  • start extracting data from url2
  • completed extracting data from url2

It is almost as if it is not async at all.
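To make the two orderings above concrete, here is a minimal, stdlib-only sketch. It is a toy model, not crawl4ai code: `scrape`, `extract_blocking`, `extract_async`, the `delay` parameter, and all timings are hypothetical stand-ins. A synchronous call inside a coroutine (here `time.sleep`) holds the event loop, so the second URL cannot even start scraping until the first extraction returns; an awaitable extraction lets the two interleave.

```python
import asyncio
import time

SCRAPE_SECONDS = 0.05    # hypothetical duration of the scraping step
EXTRACT_SECONDS = 0.3    # hypothetical duration of the (slow) extraction step


async def scrape(url, log):
    log.append(f"scrape {url}")
    await asyncio.sleep(SCRAPE_SECONDS)  # stand-in for network I/O


def extract_blocking(url, log):
    # Synchronous work called directly from a coroutine: nothing else
    # on the event loop can run until this function returns.
    log.append(f"start extracting {url}")
    time.sleep(EXTRACT_SECONDS)
    log.append(f"completed extracting {url}")


async def extract_async(url, log):
    # Awaitable work: the event loop is free while this is waiting.
    log.append(f"start extracting {url}")
    await asyncio.sleep(EXTRACT_SECONDS)
    log.append(f"completed extracting {url}")


async def crawl_blocking(url, delay, log):
    await asyncio.sleep(delay)  # stagger the start of each URL
    await scrape(url, log)
    extract_blocking(url, log)


async def crawl_async(url, delay, log):
    await asyncio.sleep(delay)
    await scrape(url, log)
    await extract_async(url, log)


async def run(crawl):
    log = []
    # url2 is dispatched 0.1 s after url1 in both variants.
    await asyncio.gather(crawl("url1", 0.0, log), crawl("url2", 0.1, log))
    return log


if __name__ == "__main__":
    print("blocking:", asyncio.run(run(crawl_blocking)))
    print("async:   ", asyncio.run(run(crawl_async)))
```

In the blocking run, `scrape url2` is logged only after `completed extracting url1`, which matches the behavior reported here. In the async run, url2 is scraped while url1 is still being extracted.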

Is this reproducible?

Yes

Inputs Causing the Bug

any

Steps to Reproduce

1. Use LLMExtraction (because it can take a long time, it shows the problem faster).
2. Start scraping multiple URLs with `arun_many`.
3. Observe that scraping of the second URL only starts after the first URL has finished extracting.

Code snippets

OS

any

Python version

any

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

@ederuiter ederuiter added 🐞 Bug Something isn't working 🩺 Needs Triage Needs attention of maintainers labels Feb 17, 2025
@ederuiter
Author

The draft PR is ready. I don't think this is the ultimate solution, as it could create a large number of concurrent requests. I think it needs something like a semaphore that limits the overall number of outstanding requests.

But before investing more time in this, I wanted to check whether something like this would be considered.
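For illustration, the semaphore idea could look roughly like the sketch below. It is not taken from the draft PR: `extract`, `extract_all`, `max_concurrent`, and the `stats` counter are hypothetical names, and `asyncio.sleep` stands in for the real extraction request. An `asyncio.Semaphore` caps how many extractions are in flight at once, while the remaining ones wait their turn.

```python
import asyncio


async def extract(item, semaphore, stats):
    # Only `max_concurrent` coroutines can be inside this block at a time.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for an LLM extraction request
        stats["active"] -= 1
        return f"extracted:{item}"


async def extract_all(items, max_concurrent=3):
    semaphore = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}  # bookkeeping for the demo only
    results = await asyncio.gather(
        *(extract(item, semaphore, stats) for item in items)
    )
    return results, stats["peak"]


if __name__ == "__main__":
    urls = [f"url{i}" for i in range(10)]
    results, peak = asyncio.run(extract_all(urls, max_concurrent=3))
    print(f"{len(results)} results, peak concurrency {peak}")
```

All ten extractions are scheduled up front, but at most three run at any moment. A single semaphore shared across the whole crawler would bound the overall number of outstanding requests rather than the number per call.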
