[Bug]: extraction strategies are not async #704

ederuiter · 2025-02-17T14:34:22Z

crawl4ai version

2025-feb-alpha-1

Expected Behavior

With most of the codebase having async variants, I was surprised to see that extraction is not async. I expected it to be.

Current Behavior

Extraction can take longer than the scraping part, especially with LLMExtraction. Because extraction is currently not async (it is only parallelized with a thread executor), it blocks all other async functions until it completes.

So I would expect something like this to happen:

scrape url1
start extracting data from url1
scrape url2
start extracting data from url2
completed extracting data from url1
completed extracting data from url2

Instead, this happens:

scrape url1
start extracting data from url1
completed extracting data from url1
scrape url2
start extracting data from url2
completed extracting data from url2

It is almost as if it is not async at all.
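To illustrate the difference, here is a minimal, library-free sketch. It is not crawl4ai code: `scrape`, `extract`, `crawl_blocking`, `crawl_offloaded`, and `run` are hypothetical names, and `time.sleep` stands in for a slow synchronous extraction. It assumes Python 3.9+ for `asyncio.to_thread`. Calling the synchronous extraction directly inside a coroutine stalls the event loop, while awaiting it in a worker thread lets the other URL make progress.

```python
import asyncio
import time


def extract(url, log):
    # Hypothetical stand-in for a slow, synchronous extraction (e.g. an LLM call).
    log.append(f"start extracting data from {url}")
    time.sleep(0.2)
    log.append(f"completed extracting data from {url}")


async def scrape(url, log, delay):
    # Hypothetical stand-in for the async scraping step.
    log.append(f"scrape {url}")
    await asyncio.sleep(delay)


async def crawl_blocking(url, log, delay):
    await scrape(url, log, delay)
    extract(url, log)  # runs on the event loop thread, so nothing else can progress


async def crawl_offloaded(url, log, delay):
    await scrape(url, log, delay)
    await asyncio.to_thread(extract, url, log)  # the event loop stays free while this runs


async def run(crawl_one):
    log = []
    await asyncio.gather(
        crawl_one("url1", log, 0.01),
        crawl_one("url2", log, 0.05),
    )
    return log


if __name__ == "__main__":
    for crawl_one in (crawl_blocking, crawl_offloaded):
        print(crawl_one.__name__)
        for line in asyncio.run(run(crawl_one)):
            print("  " + line)
```

In the blocking variant, extraction of url2 can only start after url1 has completed, because the event loop is stuck inside `extract`. In the offloaded variant the two extractions overlap.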

Is this reproducible?

Yes

Inputs Causing the Bug

any

Steps to Reproduce

1. Use LLMExtraction (it can take a long time, so it shows the problem faster).
2. Start scraping multiple URLs with `arun_many`.
3. Observe that scraping of the second URL only starts after the first has finished extracting.

Code snippets

OS

any

Python version

any

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

ederuiter · 2025-02-17T15:15:19Z

The draft PR is ready. I don't think this is the ultimate solution, as it could create a large number of concurrent requests. I think it needs something like a semaphore that limits the overall number of outstanding requests.

But before investing more time into this I wanted to check if something like this would be considered.
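As a rough illustration of the semaphore idea, assuming extraction requests were async: the sketch below caps the number of in-flight requests with `asyncio.Semaphore`. The names `extract_with_limit`, `extract_all`, and `max_concurrent` are hypothetical and not taken from the PR, and `asyncio.sleep` stands in for a real async LLM request.

```python
import asyncio


async def extract_with_limit(chunk, semaphore, stats):
    # Only `max_concurrent` coroutines can be inside this block at once.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            await asyncio.sleep(0.01)  # hypothetical stand-in for an async LLM request
            return chunk.upper()
        finally:
            stats["active"] -= 1


async def extract_all(chunks, max_concurrent=3):
    semaphore = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}  # tracks how many requests overlap
    results = await asyncio.gather(
        *(extract_with_limit(chunk, semaphore, stats) for chunk in chunks)
    )
    return results, stats["peak"]


if __name__ == "__main__":
    chunks = [f"chunk{i}" for i in range(10)]
    results, peak = asyncio.run(extract_all(chunks, max_concurrent=3))
    print(f"{len(results)} results, peak concurrency {peak}")
```

All ten tasks are scheduled immediately, but the semaphore ensures no more than `max_concurrent` requests are outstanding at any moment. A single semaphore shared across all URLs would limit the overall load rather than the load per page.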

ederuiter added the 🐞 Bug (Something isn't working) and 🩺 Needs Triage (Needs attention of maintainers) labels Feb 17, 2025

ederuiter mentioned this issue Feb 17, 2025

[WIP] mvp for async extraction strategies #706

Draft

6 tasks
