[Bug]: When extracting data with scan_full_page, only the final elements get parsed #731

Open
Popeyef5 opened this issue Feb 20, 2025 · 0 comments
Labels
🐞 Bug Something isn't working 🩺 Needs Triage Needs attention of maintainers

crawl4ai version

0.4.248

Expected Behavior

I'm crawling Twitter, specifically the "following" section of a profile. I have a CSS selector for the relevant data (user names and bios) and set up a JsonCssExtractionStrategy. If I don't use scan_full_page, I understandably expect to get only the first N user profiles. But if I do enable scan_full_page, I expect the returned data to contain the full list, to the same extent as is visible when browsing manually.

Current Behavior

When not using scan_full_page, I do get the first 16 profiles in this case. However, when setting scan_full_page=True, I only get the LAST 12. Note that there are over 40 profiles listed, so none of those 12 intersect with the first 16. I checked the result's html property, and it indeed contains only the last 12. Strangely, though, the saved screenshot contains all the profiles.
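A quick way to confirm that the truncation is already present in the captured HTML (and not introduced by the extraction strategy) is to count UserCell occurrences in result.html directly. This is a minimal diagnostic sketch, assuming result is the CrawlResult produced by the snippet below:

    import json

    def count_user_cells(result):
        # If these two numbers match, the data loss happens before extraction,
        # i.e. the scrolled page's DOM no longer contains the earlier cells.
        in_html = result.html.count('data-testid="UserCell"')
        extracted = len(json.loads(result.extracted_content or "[]"))
        print(f"UserCells in result.html: {in_html}, extracted items: {extracted}")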

Is this reproducible?

Yes

Inputs Causing the Bug

https://x.com/SomeTwitterProfile/following

Steps to Reproduce

Execute the following snippet with scan_full_page both on and off.

Code snippets

import asyncio
import json
from base64 import b64decode

from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

site_url = "https://x.com/elonmusk/following"


async def main():
    # One record per user cell: display name, @handle, and bio.
    schema = {
        "name": "Followers",
        "baseSelector": "button[data-testid='UserCell']",
        "fields": [
            {
                "name": "name",
                "selector": "span",
                "type": "text",
            },
            {
                "name": "handle",
                "selector": 'a[role="link"] > div > div[dir="ltr"]:only-child > span',
                "type": "text",
            },
            {
                "name": "bio",
                "selector": "div[dir='auto'] > span",
                "type": "html",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    browser_conf = BrowserConfig(
        extra_args=["--disable-web-security"],
        cookies=[
            # An auth_token cookie is required to view the "following" page.
            {"name": "auth_token", "value": "YOURAUTHTOKEN", "domain": ".x.com", "path": "/"},
        ],
    )

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        wait_for="css:button[data-testid='UserCell']:nth-child(1)",
        page_timeout=5000,
        extraction_strategy=extraction_strategy,
        scan_full_page=True,  # toggle this line to reproduce the difference
        scroll_delay=1.5,
    )

    try:
        async with AsyncWebCrawler(config=browser_conf, verbose=True) as crawler:
            result = await crawler.arun(url=site_url, config=crawler_config)

            # Save the full-page screenshot and the final HTML for inspection.
            with open("screenshot.png", "wb") as f:
                f.write(b64decode(result.screenshot))

            with open("index.html", "w") as f:
                f.write(result.html)

            data = json.loads(result.extracted_content)
            print(f"Extracted {len(data)} users")
            print(json.dumps(data, indent=2) if data else "No data found")

    except Exception as e:
        print(f"Something happened: {e}")
        raise

if __name__ == "__main__":
    asyncio.run(main())
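For what it's worth, the symptom (the screenshot shows everything while the final HTML only contains the last batch) is consistent with X virtualizing the list, i.e. removing off-screen UserCell nodes from the DOM as you scroll. Until this is resolved, one possible workaround is to drive the scrolling in steps yourself over a single browser session and merge the partial extractions. This is a minimal sketch, assuming crawl4ai's session-crawling parameters (session_id, js_code, js_only, delay_before_return_html) behave as documented; the dedup key ("handle") and the step count are assumptions for illustration:

    import json

    from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

    async def collect_following(crawler, url, strategy, steps=25):
        seen = {}
        for step in range(steps):
            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id="following-scroll",  # reuse the same page across calls
                js_only=step > 0,  # after the first call, run JS without re-navigating
                js_code="window.scrollBy(0, window.innerHeight);" if step > 0 else None,
                delay_before_return_html=1.5,  # give new cells time to render
                extraction_strategy=strategy,
            )
            result = await crawler.arun(url=url, config=config)
            for item in json.loads(result.extracted_content or "[]"):
                if item.get("handle"):
                    seen.setdefault(item["handle"], item)  # merge partial batches
        return list(seen.values())

Merging by handle makes the result insensitive to which cells happen to be in the DOM at any given step, so it should return the union of everything seen while scrolling rather than only the final window.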

OS

Linux

Python version

3.12.9

Browser

Default

Browser version

No response

Error logs & Screenshots (if applicable)

No response
