GitHub issues not scraped fully #537
Unanswered
hyperknot asked this question in Forums - Q&A
Replies: 1 comment · 1 reply
-
@hyperknot In order for the "Load more" button to be clicked before the scraping is done, you can write a JS script that clicks the button and pass it to the `crawler_config` via `js_code`, as below. You just have to update the selector string passed to `querySelectorAll` so it matches the "Load more" button on GitHub issue pages.

```python
js_click_tabs = """
(async () => {
    const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
    for (let tab of tabs) {
        tab.scrollIntoView();
        tab.click();
        await new Promise(r => setTimeout(r, 500));
    }
})();
"""

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=JsonCssExtractionStrategy(schema),
    js_code=[js_click_tabs],
)
```

Now just pass this `crawler_config` to the crawler when initialising it. The script will run before the page is scraped, thereby allowing you to get all comments.
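Since new "Load more" buttons can appear after each click on a long GitHub issue, a loop variant of the script may work better than a single pass. Below is a minimal end-to-end sketch, assuming crawl4ai's `AsyncWebCrawler` is set up with a working browser; the selector `button.ajax-pagination-btn` is a guess and will likely need adjusting to the page's actual markup:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

# Keep clicking "Load more" until no button is left; the 1 s pause lets the
# newly loaded comments render before looking for the next button.
# NOTE: the selector below is a placeholder -- inspect the issue page and
# replace it with the real "Load more" / "x remaining items" button selector.
js_click_load_more = """
(async () => {
    const selector = "button.ajax-pagination-btn";
    let btn;
    while ((btn = document.querySelector(selector))) {
        btn.scrollIntoView();
        btn.click();
        await new Promise(r => setTimeout(r, 1000));
    }
})();
"""

async def main():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=[js_click_load_more],
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://github.com/moby/moby/issues/4737",
            config=config,
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```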
-
When working with LLMs, it's important to extract the full output of a GitHub issue, including expanding all comments.
Here is an example with expandable comments:
moby/moby#4737
Currently not even a single reply is included.
Ideally all replies would be included, even the ones behind "x remaining items".
To get this, "Load more" buttons need to be clicked until none are visible.