Expected Behavior
With most of the codebase having async variants, I was surprised to see that extraction is not async. I expected it to be.
Current Behavior
Extraction can take longer than the scraping part (especially with LLM extraction). Since it is currently not async (only parallelized with a thread executor), it blocks any other async function until it completes.
So I would expect something like this to happen:
scrape url1
start extracting data from url1
scrape url2
start extracting data from url2
completed extracting data from url1
completed extracting data from url2
Instead, this happens:
scrape url1
start extracting data from url1
completed extracting data from url1
scrape url2
start extracting data from url2
completed extracting data from url2
Almost as if it were not async at all.
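The two sequences above can be sketched with plain asyncio (hypothetical stand-ins for the scrape/extract pipeline, not actual crawl4ai code): when the extraction step runs a blocking call directly inside the coroutine, the URLs serialize exactly as reported; when the blocking work is offloaded and awaited (here via `asyncio.to_thread`), the phases interleave as expected.

```python
import asyncio
import time

async def process(url: str, blocking: bool, log: list[str]) -> None:
    # Hypothetical stand-in for crawl4ai's scrape-then-extract pipeline.
    log.append(f"scrape {url}")
    log.append(f"start extracting data from {url}")
    if blocking:
        # Blocking call inside the coroutine: the event loop cannot switch tasks.
        time.sleep(0.05)
    else:
        # Offloaded and awaited: the loop can start the next URL meanwhile.
        await asyncio.to_thread(time.sleep, 0.05)
    log.append(f"completed extracting data from {url}")

async def run_many(blocking: bool) -> list[str]:
    log: list[str] = []
    await asyncio.gather(process("url1", blocking, log),
                         process("url2", blocking, log))
    return log

blocked = asyncio.run(run_many(blocking=True))      # fully sequential log
interleaved = asyncio.run(run_many(blocking=False))  # interleaved log
```

In the blocking variant the coroutine never yields between "start" and "completed", so `gather` cannot switch to the second URL; that matches the observed behavior.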
Is this reproducible?
Yes
Inputs Causing the Bug
any
Steps to Reproduce
1. Use LLM extraction (since it can take long, it shows the problem faster).
2. Start scraping multiple URLs with `arun_many`.
3. Observe that the second URL only starts its scraping steps after the first has finished extracting.
Code snippets
OS
any
Python version
any
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
The draft PR is ready. I don't think this is the ultimate solution, as it could create a large number of concurrent requests; it probably needs something like a semaphore that limits the overall number of outstanding requests.
But before investing more time into this, I wanted to check whether something like this would be considered.
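The limiting idea mentioned above could look roughly like this (a minimal sketch with a hypothetical `extract` coroutine, not the actual draft PR): an `asyncio.Semaphore` caps how many extraction requests are in flight at once, while everything still runs concurrently up to that cap.

```python
import asyncio

async def extract(url: str, sem: asyncio.Semaphore, stats: dict) -> str:
    # Hypothetical extraction coroutine; the semaphore bounds in-flight requests.
    async with sem:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the LLM request
        stats["active"] -= 1
    return f"extracted {url}"

async def run_all(limit: int, n_urls: int) -> tuple[list[str], int]:
    # Create the semaphore inside the running loop; cap concurrent extractions.
    sem = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(extract(f"url{i}", sem, stats) for i in range(n_urls)))
    return results, stats["peak"]

results, peak = asyncio.run(run_all(limit=2, n_urls=6))
```

With `limit=2`, at most two extractions run at any moment regardless of how many URLs `gather` schedules, which avoids the unbounded burst of concurrent requests.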
crawl4ai version
2025-feb-alpha-1