fix: cli crawler engine updates for crawlee version upgrade to 3.11 (#2665)

#### Details

Update the crawler engine for the `accessibility-insights-scan` package
to fix issues introduced by the Crawlee version upgrade (#2580).

##### Motivation

1. Crawlee v3.9.1 added a `retireInactiveBrowserAfterSecs` flag that
checks for inactive browsers. The default value is 10 seconds. During
AAD Auth, the browser remains inactive for 10 seconds while waiting for
the FormsAuthenticator and password selectors. Therefore, we have
increased this time to 30 seconds to prevent the browser from closing
prematurely.

2. Removed the user data folder. The crawler now maintains a temp
directory for each session. The details behind this change:

- SiteCrawlerEngine starts once per crawler CLI instance. It manages
sessions and rotates them based on max usage count (the number of times
a session is used while processing URLs, default 50) or session age
(default 3000 seconds). When a session changes, it destroys the browser
and launches a new one. The new browser fails to launch because of a
race condition: the old browser is still shutting down and is still
connected to the folder. During parallel processing of URLs, the crawler
runs multiple sessions in parallel, each with its own browser instance.
- We started seeing issues after upgrading Crawlee from 3.5 to 3.11
([Pull Request #2580](#2580)).
In the earlier version, the browser launch would fail but was retried,
so after a couple of retries the folder would be released and a single
browser would launch again. In the latest version, there are no retries,
and the crawler shuts down if the browser launch fails.
- If we do not specify the directory, the crawler automatically
maintains a temp directory for cache and cookies for each session. So if
we remove our data folder, the crawler will maintain one itself.
[PuppeteerLaunchContext | API | Crawlee · Build reliable crawlers.
Fast.](https://crawlee.dev/api/3.7/puppeteer-crawler/interface/PuppeteerLaunchContext#userDataDir)
- This will not impact authentication because the authenticator is
called in the post-launch hook, so whenever a new browser is launched it
runs the authenticator code first. We tested this by reducing the
session usage count and confirming that authentication ran with each new
browser.
    
![image displaying cli package logs with authentication called at each
browser
launch](https://github.com/user-attachments/assets/d6ab906f-3885-4072-b629-09ea288d5320)

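The session rotation rule described in the list above can be sketched as a small standalone function. This is only an illustration, not Crawlee's implementation: the `SessionState` shape and the `shouldRotateSession` helper are hypothetical names, and treating the limits as inclusive (`>=`) is an assumption. Only the two default values (50 uses, 3000 seconds) come from the description.

```typescript
// Hypothetical model of a crawler session; this is NOT Crawlee's Session class.
interface SessionState {
    usageCount: number; // how many times the session has been used to process URLs
    createdAtMs: number; // creation time, in milliseconds since the epoch
}

// Default limits quoted in the description above.
const MAX_USAGE_COUNT = 50;
const MAX_AGE_SECS = 3000;

// Returns true when the session should be replaced. Per the description,
// replacing a session also destroys its browser and launches a new one.
// Assumption: both limits are inclusive.
function shouldRotateSession(session: SessionState, nowMs: number): boolean {
    const ageSecs = (nowMs - session.createdAtMs) / 1000;
    return session.usageCount >= MAX_USAGE_COUNT || ageSecs >= MAX_AGE_SECS;
}
```

Under these assumptions, a session with `usageCount: 50` is rotated regardless of its age, and a fresh session with low usage is kept. Each rotation is a point where the old and new browsers could briefly contend for a shared user data folder, which is the race described above.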

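The net effect of both changes on the launch configuration can be sketched without importing Crawlee. The `LaunchSettings` and `BrowserPoolSettings` types and the `buildLaunchSettings` helper are hypothetical simplifications, and the order in which default and user-supplied flags are merged is an assumption; the flag list and the `retireInactiveBrowserAfterSecs: 30` value are taken from this commit.

```typescript
// Hypothetical, simplified shapes; the real types live in @crawlee/puppeteer.
interface BrowserPoolSettings {
    retireInactiveBrowserAfterSecs: number;
}

interface LaunchSettings {
    args: string[];
    browserPoolOptions: BrowserPoolSettings;
}

// Default Chromium flags kept by this change. There is no --user-data-dir
// entry anymore, so each browser gets a crawler-managed temp directory.
const puppeteerDefaultOptions: string[] = [
    '--disable-dev-shm-usage',
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--js-flags=--max-old-space-size=8192',
];

// Builds the settings from optional user-supplied browser options.
// Assumption: defaults come first, user options are appended after them.
function buildLaunchSettings(browserOptions?: string[]): LaunchSettings {
    const puppeteerOptions = browserOptions ? browserOptions.map((o) => `--${o}`) : [];

    return {
        args: [...puppeteerDefaultOptions, ...puppeteerOptions],
        // Raised from the 10 second default so AAD Auth can finish.
        browserPoolOptions: { retireInactiveBrowserAfterSecs: 30 },
    };
}
```

Because no launch argument names a shared folder, two browsers that overlap during a session change no longer compete for the same directory.
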
##### Context

<!-- Are there any parts that you've intentionally left out-of-scope for
a later PR to handle? -->

<!-- Were there any alternative approaches you considered? What
tradeoffs did you consider? -->

#### Pull request checklist
<!-- If a checklist item is not applicable to this change, write "n/a"
in the checkbox -->

- [ ] Addresses an existing issue: Fixes #0000
- [ ] Added relevant unit test for your changes. (`yarn test`)
- [x] Verified code coverage for the changes made. Check coverage report
at: `<rootDir>/test-results/unit/coverage`
- [x] Ran precheckin (`yarn precheckin`)
- [ ] Validated in an Azure resource group
v-viyada authored Jan 30, 2025
1 parent 8eb205e commit 5a097f5
Showing 2 changed files with 5 additions and 6 deletions.
2 changes: 1 addition & 1 deletion packages/cli/package.json
@@ -1,6 +1,6 @@
{
"name": "accessibility-insights-scan",
-"version": "3.1.1",
+"version": "3.1.2",
"description": "This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.",
"scripts": {
"build": "webpack --config ./webpack.config.js \"$@\"",
9 changes: 4 additions & 5 deletions packages/crawler/src/crawler/site-crawler-engine.ts
@@ -1,7 +1,6 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

-import * as fs from 'fs';
import * as Crawlee from '@crawlee/puppeteer';
import { inject, injectable, optional } from 'inversify';
import { isEmpty } from 'lodash';
@@ -35,16 +34,12 @@ export class SiteCrawlerEngine implements CrawlerEngine {
this.crawlerConfiguration.setMemoryMBytes(crawlerRunOptions.memoryMBytes);
this.crawlerConfiguration.setSilentMode(crawlerRunOptions.silentMode);

-const userDataDirectory = `${__dirname}/ChromeData`;
-fs.rmSync(userDataDirectory, { recursive: true, force: true });

const puppeteerOptions = crawlerRunOptions.browserOptions ? crawlerRunOptions.browserOptions.map((o) => `--${o}`) : [];
const puppeteerDefaultOptions = [
'--disable-dev-shm-usage',
'--no-sandbox',
'--disable-setuid-sandbox',
'--js-flags=--max-old-space-size=8192',
-`--user-data-dir=${userDataDirectory}`,
];

const pageProcessor = this.pageProcessorFactory();
@@ -83,6 +78,10 @@ export class SiteCrawlerEngine implements CrawlerEngine {
}
},
],
+// A new flag in @crawlee v3.9.1 checks for inactive browsers. The default value is 10 seconds.
+// During AAD Auth, the browser remains inactive for 10 seconds while waiting for FormsAuthenticator and password selectors.
+// Therefore, we have increased this time to prevent the browser from closing prematurely.
+retireInactiveBrowserAfterSecs: 30,
},
};
