Commit
fix: cli crawler engine updates for crawlee version upgrade to 3.11 (#2665)

#### Details

Update the crawler engine for the accessibility insights scan package to fix an issue raised by the crawlee version upgrade (#2580).

##### Motivation

1. A new flag, `retireInactiveBrowserAfterSecs`, in @crawlee v3.9.1 checks for inactive browsers. The default value is 10 seconds. During AAD Auth, the browser remains inactive for 10 seconds while waiting for the FormsAuthenticator and password selectors, so we increased this timeout to prevent the browser from closing prematurely.
2. Removed the user data folder. The crawler will maintain a temp directory for each session. The details behind this change:
   - SiteCrawlerEngine starts once per crawler CLI instance. It manages sessions and rotates them based on max usage count (the number of times a session is used while processing URLs, default 50) or session age (default 3000 seconds). When a session changes, it destroys the browser and launches a new one. The new browser fails to launch because of a race condition: the old browser is still shutting down and is still connected to the folder. During parallel processing of URLs, the crawler runs multiple sessions in parallel, each with its own browser instance.
   - We started seeing issues after upgrading Crawlee from 3.5 to 3.11 ([Pull Request #2580](#2580)). In the earlier version, the browser launch would fail but was retried, so after a couple of retries the folder would be released and a single browser would launch again. In the latest version, there are no retries, and the crawler shuts down if the browser launch fails.
   - If we do not specify the directory, the crawler automatically maintains a temp directory for cache and cookies for each session. So if we remove our data folder, the crawler will maintain one itself. [PuppeteerLaunchContext | API | Crawlee · Build reliable crawlers. Fast.](https://crawlee.dev/api/3.7/puppeteer-crawler/interface/PuppeteerLaunchContext#userDataDir)
   - This will not impact authentication, because the authenticator is called in the post-launch hook, so whenever a new browser is launched it runs the authenticator code first. Tested by reducing the session usage count and confirming that authentication ran with each new browser.

![image displaying cli package logs with authentication called at each browser launch](https://github.com/user-attachments/assets/d6ab906f-3885-4072-b629-09ea288d5320)

##### Context

<!-- Are there any parts that you've intentionally left out-of-scope for a later PR to handle? -->

<!-- Were there any alternative approaches you considered? What tradeoffs did you consider? -->

#### Pull request checklist

<!-- If a checklist item is not applicable to this change, write "n/a" in the checkbox -->

- [ ] Addresses an existing issue: Fixes #0000
- [ ] Added relevant unit test for your changes. (`yarn test`)
- [x] Verified code coverage for the changes made. Check coverage report at: `<rootDir>/test-results/unit/coverage`
- [x] Ran precheckin (`yarn precheckin`)
- [ ] Validated in an Azure resource group
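
For reference, the two option changes described under Motivation could look roughly like the sketch below. It is illustrative only and not the code from this commit: the interfaces are local stand-ins rather than Crawlee's own types, `buildCrawlerOptions` is a hypothetical helper, placing `retireInactiveBrowserAfterSecs` under `browserPoolOptions` is an assumption, and the 120-second value is made up (the commit only says the timeout was increased from the 10-second default).

```typescript
// Local stand-in types: they mirror only the option names mentioned in the
// commit message and are NOT Crawlee's own type definitions.
interface BrowserPoolOptionsSketch {
    retireInactiveBrowserAfterSecs: number;
}

interface LaunchContextSketch {
    // Optional on purpose: when it is left unset, the crawler is expected to
    // maintain its own temp directory per session, as the commit describes.
    userDataDir?: string;
}

interface CrawlerOptionsSketch {
    browserPoolOptions: BrowserPoolOptionsSketch;
    launchContext: LaunchContextSketch;
}

// Hypothetical value: the commit does not state the new timeout.
const ASSUMED_INACTIVE_BROWSER_TIMEOUT_SECS = 120;

// Hypothetical helper that assembles the two changes in one place.
function buildCrawlerOptions(
    inactiveTimeoutSecs: number = ASSUMED_INACTIVE_BROWSER_TIMEOUT_SECS,
): CrawlerOptionsSketch {
    return {
        // Change 1: keep an idle browser alive longer than the 10-second
        // default so the authentication wait does not retire it.
        browserPoolOptions: {
            retireInactiveBrowserAfterSecs: inactiveTimeoutSecs,
        },
        // Change 2: no userDataDir, so a new browser never competes with a
        // closing one for the same folder.
        launchContext: {},
    };
}

const crawlerOptions = buildCrawlerOptions();
console.log(JSON.stringify(crawlerOptions));
```

Leaving `userDataDir` out entirely, instead of generating a unique folder per session, keeps cleanup with the crawler, which matches the approach taken in this change.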