A fast alternative to JavaScript-based scraping tools, intended for both
frontend and backend use. fast-wasm-scraper is essentially a wrapper around
scraper, a Rust library for parsing HTML and querying it with CSS selectors,
compiled to WebAssembly.
$ yarn add fast-wasm-scraper
const { Document } = require('fast-wasm-scraper');
const doc = new Document('<html>Hello world!</html>');
doc.root.inner_html;
// => <html>Hello world!</html>
const { Document } = require('fast-wasm-scraper');
const html = `
<html>
  <body>
    <div>
      <ul>
        <li>One</li>
        <li>Two</li>
        <li>Three</li>
      </ul>
    </div>
  </body>
</html>
`;
const doc = new Document(html);
doc.root.query('li');
// => [
//   Element { name: 'li', inner_html: 'One', ... },
//   Element { name: 'li', inner_html: 'Two', ... },
//   Element { name: 'li', inner_html: 'Three', ... },
// ]

| property | type | Description |
| --- | --- | --- |
| constructor | (html: string) => Document | Takes the raw HTML as a string and returns a new Document object |
| root | Element | Returns the root element of the Document |

| property | type | Description |
| --- | --- | --- |
| name | string | Returns the name of the element as a string, e.g. 'div' |
| html | string | Returns a string representation of this Element and its descendants |
| inner_html | string | Returns the inner content of this Element as a string |
| attributes | Map<string, string> | Returns the attributes as a Map<string, string> |
| query | (query_str: string) => Array<Element> | Returns an array of Elements matching the query |
| text | () => Array<string> | Returns an array of strings from descendant text nodes |

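The Element properties listed above can be combined to pull structured data out of a page. The following is a minimal sketch, not part of the library: `extractLinks` is a hypothetical helper, and it assumes only what the table states, namely that `query` returns an array of Elements, `attributes` is a `Map`, and `text()` returns an array of strings.

```javascript
// Hypothetical helper (not part of fast-wasm-scraper).
// Collects the href and visible text of every <a> found under `root`,
// where `root` is an Element as described in the table above.
function extractLinks(root) {
  return root.query('a').map((el) => ({
    // attributes is a Map<string, string>; a missing href becomes null
    href: el.attributes.has('href') ? el.attributes.get('href') : null,
    // text() returns one string per text node, so join them into one label
    label: el.text().join('').trim(),
  }));
}

// Usage with the library (requires `yarn add fast-wasm-scraper`):
//   const { Document } = require('fast-wasm-scraper');
//   const doc = new Document('<a href="/docs">Docs</a>');
//   console.log(extractLinks(doc.root));
```

Because the helper relies only on `query`, `attributes` and `text`, it can be exercised with plain stand-in objects before wiring it to a real Document.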
| | fast-wasm-scraper | cheerio | JsDOM |
| --- | --- | --- | --- |
| Runtime | WebAssembly (from Rust) | JavaScript | JavaScript |
| Benchmark: parsing, and querying with `li`, a document with 100 list items | | | |
| Sample size (#) | 87 | 74 | 52 |
| Speed (ops/s) | 539 (+/- 1.37%) | 318 (+/- 4.75%) | 38.2 (+/- 11.25%) |
| Speedup | 1.69x compared to cheerio, and 14x compared to JsDOM | - | - |

This benchmark was run on a rather modest dual-core CPU with Node.js
v12.20.0. You can also run the benchmarks locally by cloning the GitHub
repository.
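For a quick sanity check without the repository's benchmark setup, a rough ops/s figure can be estimated with a plain timing loop. This is only a sketch under stated assumptions: `opsPerSecond` and `speedup` are hypothetical helpers, not the harness used for the numbers above, and a simple loop like this does no warm-up or statistical analysis, so results will be noisier than the table.

```javascript
// Hypothetical helper: call `fn` repeatedly for about `durationMs`
// milliseconds, then divide the number of calls by the elapsed seconds.
function opsPerSecond(fn, durationMs = 1000) {
  const budgetNs = BigInt(Math.round(durationMs)) * 1000000n;
  const start = process.hrtime.bigint();
  let calls = 0;
  let elapsedNs = 0n;
  do {
    fn();
    calls += 1;
    elapsedNs = process.hrtime.bigint() - start;
  } while (elapsedNs < budgetNs);
  return calls / (Number(elapsedNs) / 1e9);
}

// Ratio between two ops/s figures, e.g. speedup(539, 318) for the
// fast-wasm-scraper and cheerio columns in the table above.
function speedup(fasterOps, slowerOps) {
  return fasterOps / slowerOps;
}

// Usage with the library (requires `yarn add fast-wasm-scraper`):
//   const { Document } = require('fast-wasm-scraper');
//   const html = '<ul>' + '<li>x</li>'.repeat(100) + '</ul>';
//   console.log(opsPerSecond(() => new Document(html).root.query('li')));
```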