GitHub issues not scraped fully #537
Unanswered
hyperknot asked this question in Forums - Q&A
Replies: 1 comment · 1 reply
-
@hyperknot In order for the "Load more" button to be clicked before the scraping is done, you can write a JS script that clicks the button and pass it to the `crawler_config` via `js_code`, as below. You just have to update the selector string passed to `querySelectorAll` so it matches the "Load more" button on GitHub issue pages.

```python
js_click_tabs = """
(async () => {
    const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
    for (let tab of tabs) {
        tab.scrollIntoView();
        tab.click();
        await new Promise(r => setTimeout(r, 500));
    }
})();
"""

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=JsonCssExtractionStrategy(schema),
    js_code=[js_click_tabs],
)
```

Now just pass this `crawler_config` to the crawler when initialising it. The script will run before the page is scraped, thereby allowing you to get all comments.
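Since new "Load more" buttons can appear after each click on a long GitHub issue, a loop variant of the script may work better than a single pass. Below is a minimal end-to-end sketch, assuming crawl4ai's `AsyncWebCrawler` is set up with a working browser; the selector `button.ajax-pagination-btn` is a guess and will likely need adjusting to the page's actual markup:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

# Keep clicking "Load more" until no button is left; the 1 s pause lets the
# newly loaded comments render before looking for the next button.
# NOTE: the selector below is a placeholder -- inspect the issue page and
# replace it with the real "Load more" / "x remaining items" button selector.
js_click_load_more = """
(async () => {
    const selector = "button.ajax-pagination-btn";
    let btn;
    while ((btn = document.querySelector(selector))) {
        btn.scrollIntoView();
        btn.click();
        await new Promise(r => setTimeout(r, 1000));
    }
})();
"""

async def main():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=[js_click_load_more],
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://github.com/moby/moby/issues/4737",
            config=config,
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```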
-
When working with LLMs, it's important to extract the full output of a GitHub issue, including expanding all comments.
Here is an example with expandable comments:
moby/moby#4737
Currently not even a single reply is included.
Ideally all replies would be included, even the ones behind "x remaining items".
To get this, "Load more" buttons need to be clicked until none are visible.