Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetch() performance parity #1203

Closed
ignoramous opened this issue Feb 4, 2022 · 64 comments
Closed

fetch() performance parity #1203

ignoramous opened this issue Feb 4, 2022 · 64 comments
Labels
enhancement New feature or request fetch
Milestone

Comments

@ignoramous
Copy link

This would solve...

Undici fetch isn't as fast as the ones in browsers like Firefox / Deno / Workers.

The implementation should look like...

Measure and improve (e2e latencies, op/sec, conns/sec, etc?) to match up with other impls.

I have also considered...

I hand-rolled a http2 nodejs stdlib client and it was a magnitude faster than undici fetch, even without connection pooling.

I must note though, the conns were to Cloudflare and Google endpoints, which responded with 1KB or less per fetch (basically, POST and uncached GET DNS over HTTPS connections).

Additional context

This is a follow up from: https://news.ycombinator.com/item?id=30164925

Sorry that I didn't really benchmark the difference between Deno / Firefox / Undici / Cloudflare Workers, but I knew instantly that it was the cause of higher latencies when I deployed the same code in Node v16 LTS.

Possibly related: #22, #208, #952, #902

Other projects: szmarczak/http2-wrapper/issues/71

@ignoramous ignoramous added the enhancement New feature or request label Feb 4, 2022
@ronag
Copy link
Member

ronag commented Feb 4, 2022

Do you have a reproducible benchmark we can use?

@mcollina
Copy link
Member

mcollina commented Feb 5, 2022

Undici fetch isn't as fast as the ones in browsers like Firefox / Deno / Workers.

I took a look at the comment in hackernews and the comment was not including benchmarks either.

I'm prone to close this unless somebody would like to prepare a cross-runtime benchmark suite.

@ignoramous
Copy link
Author

ignoramous commented Feb 5, 2022

Fair. I'll take a stab at extracting out the parts from my server to demonstrate this. I have to note that the slowness I observed was in a production env (running in a VM in a container behind a network load balancer) that also sees a tonne of other IO.

@ignoramous
Copy link
Author

ignoramous commented Feb 6, 2022

I wrote up a tiny DoH load-tester with python and golang (don't ask) which I intend to publish as a github action by tomorrow. In the meanwhile, here's a run from my laptop:

deno/runs:3748
percentile p10 p50 p75 p90 p95 p99 p99.9 p99.99 p100
milliseconds 280 296 309 336 411 694 889 898 929
undici/runs:3826
percentile p10 p50 p75 p90 p95 p99 p99.9 p99.99 p100
milliseconds 1140 1666 2171 2811 3302 3941 4501 4540 4995
node:naive-h2/runs:3791
percentile p10 p50 p75 p90 p95 p99 p99.9 p99.99 p100
milliseconds 1713 2067 2570 3276 3975 4269 4336 4530 4534

@ignoramous
Copy link
Author

ignoramous commented Feb 8, 2022

After fighting a bit with github-actions, a poor man's profiler on is finally up:

From the current run, here's Node:

node/runs:1200
percentile p10 p50 p75 p90 p95 p99 p99.9 p99.99 p100
milliseconds 532 1215 1665 3359 3469 20953 21094 21125 21182

here's Deno:

deno/runs:6300
percentile p10 p50 p75 p90 p95 p99 p99.9 p99.99 p100
milliseconds 87 97 104 111 116 127 165 184 189

These are measured from Date.now wrapping around the fetch call. Both local and remote caches are unlikely to interfere because every DoH query is of form <random-str>.dnsleaktest.com.

Anything in the serverless-dns code-base that may significantly interfere with timing fetch calls is disabled.


The profiler forwards as many DoH requests it can for 50s to 60s, after which it quits.

Deno's serveDoh doesn't slow down at all, and blazes through 6300+ requests. Node's serverHTTPS (+ undici), otoh, manages 1200+ requests. I must note that the final batch of 100 queries served after the ~1100th request, took 20s+ to complete.

Just a note: I haven't used Node's built-in perf-module to profile since that is un-importable on Deno (from what I can tell)... hand-rolled a minimal and inefficient perf-counter on my own, instead.

@mcollina
Copy link
Member

mcollina commented Feb 8, 2022

I might be a bit lost on your codebase, but I could not find undici being used anywhere.

I would note that there are plenty of things happening on that module. How have you identified the problem in undici fetch implementation?

@mcollina
Copy link
Member

mcollina commented Feb 8, 2022

If by any chance you are testing the http/2 implementation, consider that https://github.com/serverless-dns/serverless-dns/blob/b36a9cdb5786b78a7dd8e22a2f0a767e9e340ab1/src/plugins/dns-operation/dnsResolver.js#L398 is highly inefficient as you will be incurring in significant overhead of creating those http/2 connections. You have to use a connection pool.

@ignoramous
Copy link
Author

ignoramous commented Feb 8, 2022

I might be a bit lost on your codebase, but I could not find undici being used anywhere.

Dang, we moved to node-fetch (to see if it was any better) and didn't revert back to undici (which we now have).

Here's results from Node + undici run:

node/undici/runs:1300
percentile p10 p50 p75 p90 p95 p99 p99.9 p99.99 p100
milliseconds 200 602 965 1121 3083 12267 20400 20407 20467

How have you identified the problem in undici fetch implementation?

What's measured is just this method dnsResolver.js:resolveDnsUpstream, which doesn't do much other than construct a Request and call cache-api (does not exist on Node, disabled either way on these "profiler" runs), fetch (over undici on Node), or doh2 (unused).

We then rolled our own DNS over UDP / DNS over TCP connection-pooled transport atop Node (dgram and net), and that showed considerable perf improvement (around 15x to 100x):

node/udp/runs:6600
percentile p10 p50 p75 p90 p95 p99 p99.9 p99.99 p100
milliseconds 17 18 19 20 21 25 34 89 131

Of course, a standalone benchmark would be great to have, but I wanted to demonstrate the issue I spot with the kind of workloads we specifically run (lots and lots of DoH requests), and that, by bypassing undici / node-fetch polyfills (or running on cloudflare-workers or deno which bring their own fetch impl to the table) from the code-path meant apparent improvements in IO.

@mcollina
Copy link
Member

mcollina commented Feb 8, 2022

I don't think anyone had performed an optimization of either fetch or node WHATWG Streams implementation, so this is a good point and there is likely a lot of margin there.

Have you tried using undici.request() instead?

Are there some instructions on how to run the benchmarks? Being able to get some good diagnostics would definitely help.

@mcollina
Copy link
Member

mcollina commented Feb 9, 2022

Here is a bit of tests from our own benchmarks:

│ Tests               │ Samples │           Result │ Tolerance │ Difference with slowest │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - fetch      │      20 │  1028.31 req/sec │  ± 2.71 % │                       - │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ http - no keepalive │      10 │  3891.51 req/sec │  ± 2.00 % │              + 278.44 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - pipeline   │      95 │  6034.47 req/sec │  ± 2.95 % │              + 486.83 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ http - keepalive    │      50 │  6382.57 req/sec │  ± 2.98 % │              + 520.68 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - request    │      15 │  8528.35 req/sec │  ± 2.11 % │              + 729.35 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - stream     │      10 │ 10226.33 req/sec │  ± 2.66 % │              + 894.48 % │
|─────────────────────|─────────|──────────────────|───────────|─────────────────────────|
│ undici - dispatch   │      50 │ 11399.31 req/sec │  ± 2.79 % │             + 1008.54 % │

There is likely a lot of room for improvements.

@ignoramous
Copy link
Author

Have you tried using undici.request() instead?

I will do so over the coming days and report back.

Are there some instructions on how to run the benchmarks?

Apart from running the profiler.yml gh-action (of your own fork of serverless-dns), you could checkout the codebase, and then

# clone serverless-dns
git clone https://github.com/serverless-dns/serverless-dns.git

# download deps
npm i

# install q, a golang doh client
# https://github.com/natesales/q
echo "deb [trusted=yes] https://repo.natesales.net/apt /" | sudo tee /etc/apt/sources.list.d/natesales.list > /dev/null
sudo apt-get update
sudo apt-get install q

# export QDOH, where the 'run' script goes looking for `q`
export QDOH=q

# install node v15+ / deno v1.17+, if missing
...

# doh query, <random-str>.dnsleaktest.com, is sent to cloudflare-dns.com
# run doh, on node + undici
timeout 60s ./run node p1

# doh query, <random-str>.dnsleaktest.com, is sent to cloudflare-dns.com
# run doh, on deno
timeout 60s ./run deno p1

# dns query, <random-str>.dnsleaktest.com, is sent to 1.1.1.1
# run udp/tcp dns, on node
timeout 60s ./run node p3

Here is a bit of tests from our own benchmarks:

Thanks. Sorry, if this is a noob question, but: how do I generate such logs myself for my tests?

@mcollina
Copy link
Member

mcollina commented Feb 9, 2022

Thanks. Sorry, if this is a noob question, but: how do I generate such logs myself for my tests?

Check out #1214 for the fetch benchmarks.

You can run our benchmarks by running:

PORT=3001 node benchmarks/server.js
PORT=3001 node benchmarks/benchmarks.js

@RafaelGSS
Copy link
Member

RafaelGSS commented Apr 17, 2022

Hello folks! Recently, I spent some time working on the fetch performance investigation and I decided to create a report about my findings. I'll share an abstract (probably the full report will become a blog post at some moment)

TL;DR:

After some analysis, I came up with a few PRs improving fetch performance directly:

However, the major bottleneck is under WebStreams implementation in Node.js core, thus,
I don't think there is a clear path for undici.fetch to improve its performance significantly without touching Node.js internals.


Firstly, the benchmark was unfair with undici.fetch since it was allocating the server response in a variable causing an event-loop block due to GC runs. Then, I solved it in: #1283.

After several analyses, I have found that WebStreams is very slow (through microbenchmark and by fetch benchmark) - For reference, see https://github.com/RafaelGSS/nodejs-bench-operations/blob/main/RESULTS-v17.md#stream-readablejs. Thus, I did a comparison replacing WebStreams with native Streams and I got 60% ~ 90% of improvement.

Avoiding WebStreams in response
   diff --git a/lib/fetch/index.js b/lib/fetch/index.js
   index 9047976..778ad82 100644
   --- a/lib/fetch/index.js
   +++ b/lib/fetch/index.js
   @@ -50,6 +50,7 @@ const {
    const { kHeadersList } = require('../core/symbols')
    const EE = require('events')
    const { Readable, pipeline } = require('stream')
   +const Readable2 = require('../api/readable')
    const { isErrored, isReadable } = require('../core/util')
    const { dataURLProcessor } = require('./dataURL')
    const { kIsMockActive } = require('../mock/mock-symbols')
   @@ -964,7 +965,7 @@ async function fetchFinale (fetchParams, response) {
        })
    
        // 4. Set response’s body to the result of piping response’s body through transformStream.
   -    response.body = { stream: response.body.stream.pipeThrough(transformStream) }
   +    // response.body = { stream: response.body.stream.pipeThrough(transformStream) }
      }
    
      // 6. If fetchParams’s process response consume body is non-null, then:
   @@ -1731,9 +1732,8 @@ async function httpNetworkFetch (
        })()
      }
    
   +  const { body, status, statusText, headersList } = await dispatch({ body: requestBody })
      try {
   -    const { body, status, statusText, headersList } = await dispatch({ body: requestBody })
   -
        const iterator = body[Symbol.asyncIterator]()
        fetchParams.controller.next = () => iterator.next()
    
   @@ -1797,7 +1797,7 @@ async function httpNetworkFetch (
      // 17. Run these steps, but abort when the ongoing fetch is terminated:
    
      //    1. Set response’s body to a new body whose stream is stream.
   -  response.body = { stream }
   +  response.body = { stream: body }
    
      //    2. If response is not a network error and request’s cache mode is
      //    not "no-store", then update response in httpCache for request.
   @@ -1957,7 +1957,7 @@ async function httpNetworkFetch (
                headers.append(key, val)
              }
    
   -          this.body = new Readable({ read: resume })
   +          this.body = new Readable2(resume, this.abort, headers.get('content-type'))
    
              const decoders = []

If you apply the above git diff and profile the application, it will show createAbortError as one of the functions wasting more time on the CPU.

image

And, these AbortSignal are mostly created in WebStreams instance (even if the instance is not in use), e.g:

image

image

So, avoiding the AbortSignal creation through WebStreams bumps, even more, the performance (150% ~ 200% -- based on the initial benchmarks)

Avoiding AbortSignal
diff --git a/lib/fetch/index.js b/lib/fetch/index.js
index 1fbf29b..322e5ae 100644
--- a/lib/fetch/index.js
+++ b/lib/fetch/index.js
@@ -50,6 +50,7 @@ const {
const { kHeadersList } = require('../core/symbols')
const EE = require('events')
const { Readable, pipeline } = require('stream')
+const Readable2 = require('../api/readable')
const { isErrored, isReadable } = require('../core/util')
const { dataURLProcessor } = require('./dataURL')
const { kIsMockActive } = require('../mock/mock-symbols')
@@ -957,14 +958,14 @@ async function fetchFinale (fetchParams, response) {

    // 3. Set up transformStream with transformAlgorithm set to identityTransformAlgorithm
    // and flushAlgorithm set to processResponseEndOfBody.
-    const transformStream = new TransformStream({
-      start () {},
-      transform: identityTransformAlgorithm,
-      flush: processResponseEndOfBody
-    })
+    // const transformStream = new TransformStream({
+    //   start () {},
+    //   transform: identityTransformAlgorithm,
+    //   flush: processResponseEndOfBody
+    // })

    // 4. Set response’s body to the result of piping response’s body through transformStream.
-    response.body = { stream: response.body.stream.pipeThrough(transformStream) }
+    // response.body = { stream: response.body.stream.pipeThrough(transformStream) }
  }

  // 6. If fetchParams’s process response consume body is non-null, then:
@@ -1731,9 +1732,8 @@ async function httpNetworkFetch (
    })()
  }

+  const { body, status, statusText, headersList } = await dispatch({ body: requestBody })
  try {
-    const { body, status, statusText, headersList } = await dispatch({ body: requestBody })
-
    const iterator = body[Symbol.asyncIterator]()
    fetchParams.controller.next = () => iterator.next()

@@ -1775,29 +1775,29 @@ async function httpNetworkFetch (
  // 16. Set up stream with pullAlgorithm set to pullAlgorithm,
  // cancelAlgorithm set to cancelAlgorithm, highWaterMark set to
  // highWaterMark, and sizeAlgorithm set to sizeAlgorithm.
-  if (!ReadableStream) {
-    ReadableStream = require('stream/web').ReadableStream
-  }
-
-  const stream = new ReadableStream(
-    {
-      async start (controller) {
-        fetchParams.controller.controller = controller
-      },
-      async pull (controller) {
-        await pullAlgorithm(controller)
-      },
-      async cancel (reason) {
-        await cancelAlgorithm(reason)
-      }
-    },
-    { highWaterMark: 0 }
-  )
+  // if (!ReadableStream) {
+  //   ReadableStream = require('stream/web').ReadableStream
+  // }
+
+  // const stream = new ReadableStream(
+  //   {
+  //     async start (controller) {
+  //       fetchParams.controller.controller = controller
+  //     },
+  //     async pull (controller) {
+  //       await pullAlgorithm(controller)
+  //     },
+  //     async cancel (reason) {
+  //       await cancelAlgorithm(reason)
+  //     }
+  //   },
+  //   { highWaterMark: 0 }
+  // )

  // 17. Run these steps, but abort when the ongoing fetch is terminated:

  //    1. Set response’s body to a new body whose stream is stream.
-  response.body = { stream }
+  response.body = { stream: body }

  //    2. If response is not a network error and request’s cache mode is
  //    not "no-store", then update response in httpCache for request.
@@ -1870,10 +1870,10 @@ async function httpNetworkFetch (
      fetchParams.controller.controller.enqueue(new Uint8Array(bytes))

      // 8. If stream is errored, then terminate the ongoing fetch.
-      if (isErrored(stream)) {
-        fetchParams.controller.terminate()
-        return
-      }
+      // if (isErrored(stream)) {
+      //   fetchParams.controller.terminate()
+      //   return
+      // }

      // 9. If stream doesn’t need more data ask the user agent to suspend
      // the ongoing fetch.
@@ -1891,16 +1891,16 @@ async function httpNetworkFetch (
      response.aborted = true

      // 2. If stream is readable, error stream with an "AbortError" DOMException.
-      if (isReadable(stream)) {
-        fetchParams.controller.controller.error(new AbortError())
-      }
+      // if (isReadable(stream)) {
+      //   fetchParams.controller.controller.error(new AbortError())
+      // }
    } else {
      // 3. Otherwise, if stream is readable, error stream with a TypeError.
-      if (isReadable(stream)) {
-        fetchParams.controller.controller.error(new TypeError('terminated', {
-          cause: reason instanceof Error ? reason : undefined
-        }))
-      }
+      // if (isReadable(stream)) {
+      //   fetchParams.controller.controller.error(new TypeError('terminated', {
+      //     cause: reason instanceof Error ? reason : undefined
+      //   }))
+      // }
    }

    // 4. If connection uses HTTP/2, then transmit an RST_STREAM frame.
@@ -1957,7 +1957,7 @@ async function httpNetworkFetch (
            headers.append(key, val)
          }

-          this.body = new Readable({ read: resume })
+          this.body = new Readable2(resume, this.abort, headers.get('content-type'))

          const decoders = []

Then, I came up with avoiding AbortController (that creates the AbortSignal under the hood), and impressively, it bumped the performance by 20%!

Avoiding AbortController
diff --git a/lib/fetch/request.js b/lib/fetch/request.js
index 0f10e67..ae9c37c 100644
--- a/lib/fetch/request.js
+++ b/lib/fetch/request.js
@@ -29,9 +29,9 @@ let TransformStream

const kInit = Symbol('init')

-const requestFinalizer = new FinalizationRegistry(({ signal, abort }) => {
-  signal.removeEventListener('abort', abort)
-})
+// const requestFinalizer = new FinalizationRegistry(({ signal, abort }) => {
+//   signal.removeEventListener('abort', abort)
+// })

// https://fetch.spec.whatwg.org/#request-class
class Request {
@@ -355,7 +355,7 @@ class Request {

    // 28. Set this’s signal to a new AbortSignal object with this’s relevant
    // Realm.
-    const ac = new AbortController()
+    const ac = { signal: { addEventListener: () => {} } }
    this[kSignal] = ac.signal
    this[kSignal][kRealm] = this[kRealm]

@@ -376,7 +376,7 @@ class Request {
      } else {
        const abort = () => ac.abort()
        signal.addEventListener('abort', abort, { once: true })
-        requestFinalizer.register(this, { signal, abort })
+        // requestFinalizer.register(this, { signal, abort })
      }
    }

@@ -741,7 +741,8 @@ class Request {
    clonedRequestObject[kHeaders][kRealm] = this[kHeaders][kRealm]

    // 4. Make clonedRequestObject’s signal follow this’s signal.
-    const ac = new AbortController()
+    const ac = { signal: { addEventListener: () => {}, abort: () => {} } }
+    // const ac = new AbortController()
    if (this.signal.aborted) {
      ac.abort()
    } else {

Additional Notes

In the first iteration, Clinic Doctor report says something is wrong with Active Handles

image

The Active Handles are dropping often, which is far from expected. For comparison reasons, look at the undici.request graph:

image

There is probably a bad configuration out there and I have found that undici.fetch is dropping many packages during the benchmark causing TCP RETRANSMISSION.

Basically, when a package is dropped in an active connection a re-transmission protocol is sent and when it happens frequently, the performance is affected.

TCP Retransmission using undici.fetch

root@rafaelgss-desktop:/usr/share/bcc/tools# ./tcpretrans -c
Tracing retransmits ... Hit Ctrl-C to end
^C
LADDR:LPORT               RADDR:RPORT               RETRANSMITS
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52390         99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52400         99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52398         99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52402         99
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52414        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52424        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52412        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52392        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52420        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52396        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52422        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52404        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52410        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52428        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52418        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52408        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52426        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52416        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52406        100
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52394        101

For comparison reasons, looks the retransmissions using

  • http.request (keep alive)
root@rafaelgss-desktop:/usr/share/bcc/tools# ./tcpretrans -c
Tracing retransmits ... Hit Ctrl-C to end
^C
LADDR:LPORT               RADDR:RPORT               RETRANSMITS
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52526          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52532          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52506          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52514          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52524          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52512          1
[::ffff:127.0.0.1]#3001 <-> [::ffff:127.0.0.1]#52518          1
  • undici.request
root@rafaelgss-desktop:/usr/share/bcc/tools# ./tcpretrans -c
Tracing retransmits ... Hit Ctrl-C to end
^C
EMPTY
LADDR:LPORT               RADDR:RPORT               RETRANSMITS

Unfortunately, I haven't found out why it's happening. I tried, I swear :P.

@ronag
Copy link
Member

ronag commented Apr 18, 2022

@benjamingr why is creating abort signals so expensive?

@devsnek
Copy link
Member

devsnek commented Apr 18, 2022

I would guess it is because createAbortSignal+makeTransferable does a lot of prototype mutation stuff and a call into c++ which is basically the bane of js engine performance.

i think we could move the logic of createAbortSignal into the AbortSignal constructor, sorta like this:

const kDoThing = Symbol('kDoThing');
function createAbortSignal(aborted, reason) {
  return new AbortSignal(aborted, reason, kDoThing);
}

class AbortSignal extends EventTarget {
  constructor(aborted, reason, thing) {
    if (thing !== kDoThing) throw new ERR_ILLEGAL_CONSTRUCTOR();
    super();
    this[kAborted] = aborted;
    this[kReason] = reason;
  }
}

the makeTransferable part is harder cuz c++ but i feel like we could at least make it more static than what makeTransferable is doing, similar to what i did above. also seems weird to me that JSTransferable needs to exist at all, like we already have the api with kTransfer and kDeserialize and whatnot but I don't know all the specifics there.

@ronag
Copy link
Member

ronag commented Apr 18, 2022

@jasnell @addaleax

@RafaelGSS
Copy link
Member

@devsnek I didn't mention the analysis I did in Node.js Webstreams (planning to open an issue in Node.js any time soon), but indeed, makeTransferable has something on it for sure.

image

@jasnell
Copy link
Member

jasnell commented Apr 18, 2022

makeTransferable is an ugly hack, unfortunately. I'd love to find a way to handle it more elegantly but we are fairly strictly limited by the way the structured clone algorithm works and the v8 implementation of it. I have some ideas on how to make improvements there.

also seems weird to me that JSTransferable needs to exist at all, like we already have the api with kTransfer and kDeserialize and whatnot but I don't know all the specifics there.

The way it works is this:

The structured clone algorithm only works by default with ordinary JavaScript objects with enumerable properties. Things like JavaScript classes are not supported. In order to make objects like WritableStream and ReadableStream consistently transferable per the spec is to treat them as "Host Objects". The JSTransferable class is exactly how we do that. (A Host object is a JavaScript object that is backed by a C++ class.).

When a Host object is passed into the structured clone algorithm implementation, v8 will call out to the serialization/deserialization delegate implementation that Node.js provides. Node.js' delegate implementation will check to see what kind of Host object it is. If the Host object is just a native c++ class that extends from BaseObject, then the c++ class will have functions on it for handling the serialization/deserialization at the native layer. If the Host object is a JSTransferable, that means that the serialization/deserialization is handled by JavaScript functions on the JavaScript wrapper object -- specifically, the [kTransfer], [kClone], [kDeserialize] functions. The delegate will as the JSTransferable host object to serialize itself, and it will do so by calling out to the JavaScript functions..... tl;dr: JSTransferable is how we allow a JavaScript class to act as a Host object, and the kTransfer/kDeserialize do the actual work of that.

But, we have a problem here. Classes like WritableStream and ReadableStream are defined by the standards as not extending from any other class and there are Web Platform Tests that verify that the implementation matches the Web IDL definition. Specifically, we can't have class WritableStream extends JSTransferable because that would violate the standard definition of WritableStream.

The makeTransferable() utility is a hack to get around that. Instead of creating a WritableStream that extends from JSTransferable, makeTransferable() creates an instance of JSTransferable and uses the WritableStream instance as its prototype. It effectively inverts the inheritance (that is, creates a JSTransferable that extends from WritableStream). This is a bit slow for several reasons that should be fairly obvious. Performance-wise, it's not great. But it allows us to implement the streams spec correctly.

@devsnek
Copy link
Member

devsnek commented Apr 18, 2022

Looking at this more, it seems like we just need to teach v8 how to recognize non-c++-backed "host" objects, and then we wouldn't need to add the JSTransferable superclass.

oop thank you for the writeup james

@jasnell
Copy link
Member

jasnell commented Apr 18, 2022

Yes, it would be ideal if the structured clone algorithm provided a way to allow arbitrary classes to define how they would be serialized / deserialized. I've raised this conversation with the WHATWG (whatwg/html#7428) before and, unfortunately, it received a pretty chilly response.

An approach based on well-known Symbols (e.g. Symbol.transfer, Symbol.clone, Symbol.deserialize) would be useful, where v8 would recognize that objects that implement these should be passed to the serializer/deserializer delegate. There's some complexity on the deserialization end around how to make the deserializer aware of what classes it needs to handle but that's largely implementation detail.

Unfortunately, without these changes at either the standard (tc-39, whatwg) level or v8 level, JSTransferable will be a necessary evil.

@devsnek
Copy link
Member

devsnek commented Apr 18, 2022

I don't think we need to do anything at the standards level, if v8 just had an api to tag constructors with a private symbol or similar, it could forward them back to WriteHostObject when it sees them and then we could do whatever we want with them.

@jasnell
Copy link
Member

jasnell commented Apr 18, 2022

Yeah, doing it entirely within v8 is possible. My concern there would be interop. I would much prefer the mechanism to be standardized so that there's a better hope that it would work consistently across runtimes.

@jasnell
Copy link
Member

jasnell commented Apr 18, 2022

Within v8, the key limitation is that, inside v8's implementation, only C++-backed objects can be identified as host objects, and only host objects are passed into the delegate to be handled, and since a host object can be nested arbitrarily deep in object graphs, we have to depend on v8 identifying them as it walks the graph being serialized. If we can start there -- teaching v8 to recognize when a pure JavaScript object is to be treated as a host object -- then we've gone a long way towards solving the problem.

@ronag
Copy link
Member

ronag commented Jan 20, 2023

@RafaelGSS TCP Retransmission might be related to tcpNoDelay. Have you tried disabling that?

@BSN4
Copy link

BSN4 commented Jan 20, 2023

waiting for 19.5.0 to test fetch result just saw @RafaelGSS pr 👍

@RafaelGSS
Copy link
Member

@RafaelGSS TCP Retransmission might be related to tcpNoDelay. Have you tried disabling that?

That's a good point. I haven't.

@BSN4
Copy link

BSN4 commented Jan 20, 2023

using v20 nightly build I can see no diffrence between deno, and node fetch infact it looks better than deno .. but bun slightly faster ... good job guys

@KhafraDev
Copy link
Member

fetch seems to have gotten much slower in v19.7.0 compared to v19.6.1

v19.6.1:

[bench:run] │ Tests               │ Samples │          Result │ Tolerance │ Difference with slowest │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - fetch      │       1 │  766.77 req/sec │  ± 0.00 % │                       - │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - no keepalive │       1 │ 1345.28 req/sec │  ± 0.00 % │               + 75.45 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - keepalive    │       1 │ 2246.10 req/sec │  ± 0.00 % │              + 192.93 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - request    │       1 │ 2372.56 req/sec │  ± 0.00 % │              + 209.42 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - pipeline   │       1 │ 2522.22 req/sec │  ± 0.00 % │              + 228.94 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - stream     │       1 │ 3170.79 req/sec │  ± 0.00 % │              + 313.52 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - dispatch   │       1 │ 3273.41 req/sec │  ± 0.00 % │              + 326.91 % │

v19.7.0:

[bench:run] │ Tests               │ Samples │          Result │ Tolerance │ Difference with slowest │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - fetch      │       1 │  564.40 req/sec │  ± 0.00 % │                       - │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - no keepalive │       1 │ 1295.50 req/sec │  ± 0.00 % │              + 129.54 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - keepalive    │       1 │ 1959.11 req/sec │  ± 0.00 % │              + 247.12 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - request    │       1 │ 2297.02 req/sec │  ± 0.00 % │              + 306.99 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - pipeline   │       1 │ 2624.22 req/sec │  ± 0.00 % │              + 364.96 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - stream     │       1 │ 2977.49 req/sec │  ± 0.00 % │              + 427.55 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - dispatch   │       1 │ 3473.19 req/sec │  ± 0.00 % │              + 515.38 % │

@mcollina
Copy link
Member

cc @anonrig

@anonrig
Copy link
Member

anonrig commented Feb 21, 2023

I don't think this is related to URL, but related to stream performance.

ReadableStream/WritableStream abort-signal might cause this regression. (nodejs/node#46273) cc @debadree25 @ronag

@debadree25
Copy link
Member

debadree25 commented Feb 22, 2023

The abort signal connection feels unlikely in 19.7 none of the streams PRs seem to introduce any significant change inside WHATWYG streams all of it interop work, could be something else?

@ronag
Copy link
Member

ronag commented Feb 22, 2023

Can someone do a bisect?

@KhafraDev
Copy link
Member

The likely candidate is the PR that made byte readable streams transferable. However undici doesn't use byte readable streams so I'm unsure how it would affect performance this drastically on my machine. Has anyone else ran the benchmarks on 19.6.1 vs 19.7.0?

@debadree25
Copy link
Member

debadree25 commented Feb 22, 2023

Benchmarks look similar on my machine too, attempting to do the git bisect but will take some time 😅

@debadree25
Copy link
Member

The difference in results between v19.6.1 and v19.7.0 albeit is small

v19.6.1

[bench:run] │ Tests               │ Samples │          Result │ Tolerance │ Difference with slowest │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - fetch      │       1 │  809.35 req/sec │  ± 0.00 % │                       - │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - no keepalive │       1 │ 1244.84 req/sec │  ± 0.00 % │               + 53.81 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - keepalive    │       1 │ 1578.42 req/sec │  ± 0.00 % │               + 95.02 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - pipeline   │       1 │ 2047.72 req/sec │  ± 0.00 % │              + 153.01 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - request    │       1 │ 2312.49 req/sec │  ± 0.00 % │              + 185.72 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - stream     │       1 │ 3077.30 req/sec │  ± 0.00 % │              + 280.22 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - dispatch   │       1 │ 3571.34 req/sec │  ± 0.00 % │              + 341.26 % │

v19.7.0

[bench:run] │ Tests               │ Samples │          Result │ Tolerance │ Difference with slowest │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - fetch      │       1 │  781.15 req/sec │  ± 0.00 % │                       - │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - no keepalive │       1 │ 1284.11 req/sec │  ± 0.00 % │               + 64.39 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ http - keepalive    │       1 │ 1593.15 req/sec │  ± 0.00 % │              + 103.95 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - pipeline   │       1 │ 2129.34 req/sec │  ± 0.00 % │              + 172.59 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - request    │       1 │ 2427.34 req/sec │  ± 0.00 % │              + 210.74 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - stream     │       1 │ 3134.43 req/sec │  ± 0.00 % │              + 301.26 % │
[bench:run] |─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
[bench:run] │ undici - dispatch   │       1 │ 3457.92 req/sec │  ± 0.00 % │              + 342.67 % │
[bench:run] 

@legendecas
Copy link
Member

Refactoring JS transferable to move away from the prototype mutation: nodejs/node#47956. Reviews welcomed.

@H4ad
Copy link
Member

H4ad commented Jul 7, 2023

NodeJS 20.3.0:

│ Tests               │ Samples │          Result │ Tolerance │ Difference with slowest │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ undici - fetch      │       1 │  510.90 req/sec │  ± 0.00 % │                       - │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ http - no keepalive │       1 │ 1125.31 req/sec │  ± 0.00 % │              + 120.26 % │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ undici - request    │       1 │ 1475.09 req/sec │  ± 0.00 % │              + 188.72 % │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ undici - pipeline   │       1 │ 1521.68 req/sec │  ± 0.00 % │              + 197.84 % │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ http - keepalive    │       1 │ 1599.12 req/sec │  ± 0.00 % │              + 213.00 % │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ undici - stream     │       1 │ 2476.90 req/sec │  ± 0.00 % │              + 384.81 % │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ undici - dispatch   │       1 │ 2792.84 req/sec │  ± 0.00 % │              + 446.65 % │

After nodejs/node#47956:

│ Tests               │ Samples │          Result │ Tolerance │ Difference with slowest │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ undici - fetch      │       1 │  644.45 req/sec │  ± 0.00 % │                       - │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ http - no keepalive │       1 │ 1153.57 req/sec │  ± 0.00 % │               + 79.00 % │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ http - keepalive    │       1 │ 1354.32 req/sec │  ± 0.00 % │              + 110.15 % │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ undici - request    │       1 │ 1602.05 req/sec │  ± 0.00 % │              + 148.59 % │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ undici - pipeline   │       1 │ 1701.39 req/sec │  ± 0.00 % │              + 164.01 % │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ undici - stream     │       1 │ 2331.69 req/sec │  ± 0.00 % │              + 261.81 % │
|─────────────────────|─────────|─────────────────|───────────|─────────────────────────|
│ undici - dispatch   │       1 │ 2645.30 req/sec │  ± 0.00 % │              + 310.47 % │

Probably it will be better to have one of those that already shared the benchmarks but is just to showcase that PR helped to improve the fetch by some margin.

@mcollina
Copy link
Member

mcollina commented Jul 7, 2023

@RafaelGSS if you have time to refresh your analysis, folks could move to the next bottleneck.

@RafaelGSS
Copy link
Member

Planing to do it soon!

@mcollina
Copy link
Member

mcollina commented Sep 7, 2023

Here is some good news, fetch got even faster after Node v20.6.0.

In node 20.5.0, I got a nice 877.08 req/s out of fetch, while in v20.6.0 I got a 932 req/s in my benchmark, with a consistent and measurable 5% improvement.

Given all the latest improvements, we have been able to reduce the gap between fetch() and http.request() from 160% to 120% (data in #2250).

@silverwind
Copy link
Contributor

silverwind commented Sep 7, 2023

Updated my benchmarks, it does look this small improvement is visible. bun and axios are still significantly faster, though.

@mcollina
Copy link
Member

mcollina commented Feb 19, 2024

A few updates.

The latest benchmarks shows that, while axios is still faster, undici.fetch() now passes both node-fetch and got.

I also took the liberty of doing a flamegraph, and apparently the vast majority of the cost is due to the initialization of the whatwg streams.

Screenshot 2024-02-19 at 16 29 15

I can see two ways to speed this up:

  1. make them faster (maybe @RafaelGSS or @anonrig have ideas) - possibly we could have non-transferable WHATWG streams
  2. add a Proxy (or analog) in between, so that we initialize the streams only if streams methods are needed and not for the body mixins (.json(), .text(), etc..)

@KhafraDev wdyt?

@KhafraDev
Copy link
Member

KhafraDev commented Feb 19, 2024

  1. Node specific options to web specs will always cause issues (usually with maintaining them), I don't think it's a good idea.
  2. Streams are needed in the body mixins; the body mixin steps will call fully read body method which uses whatwg streams. Removing streams would make this harder to maintain. I did make an issue about short-circuiting the stream logic when doing something along the lines of new Response('string').text() which could also be applied.

There is another alternative, as @RafaelGSS experimented with earlier: move webstreams to node streams, when they're only used internally. As long as we can match the methods to look somewhat similar to webstreams (as in, _read = pull, etc., by extending the class) I wouldn't be against it. Honestly we could probably change response.body to a node stream, and convert to a webstream when accessed, which is very rarely.

I should probably do a better write up of how the body mixin methods work because there's a TON of optimizations we could do without infringing on maintainability.

@mcollina
Copy link
Member

mcollina commented Apr 23, 2024

After more than two years, I'm extremely happy to confirm that we have achieved performance parity with node-fetch on Node.js main branch (v22):

┌─────────┬───────────────────────┬─────────┬────────────────────┬─────────────┬─────────────────────────┐
│ (index) │ Tests                 │ Samples │ Result             │ Tolerance   │ Difference with slowest │
├─────────┼───────────────────────┼─────────┼────────────────────┼─────────────┼─────────────────────────┤
│ 0       │ 'http - no keepalive' │ 101     │ '7247.97 req/sec'  │ '± 3.90 %'  │ '-'                     │
│ 1       │ 'got'                 │ 40      │ '9716.84 req/sec'  │ '± 2.86 %'  │ '+ 34.06 %'             │
│ 2       │ 'axios'               │ 35      │ '10088.40 req/sec' │ '± 2.68 %'  │ '+ 39.19 %'             │
│ 3       │ 'node-fetch'          │ 10      │ '10639.12 req/sec' │ '± 2.34 %'  │ '+ 46.79 %'             │
│ 4       │ 'request'             │ 30      │ '11108.52 req/sec' │ '± 2.76 %'  │ '+ 53.26 %'             │
│ 5       │ 'superagent'          │ 10      │ '14468.50 req/sec' │ '± 1.77 %'  │ '+ 99.62 %'             │
│ 6       │ 'undici - fetch'      │ 10      │ '15149.84 req/sec' │ '± 1.99 %'  │ '+ 109.02 %'            │
│ 7       │ 'undici - pipeline'   │ 101     │ '16352.78 req/sec' │ '± 10.98 %' │ '+ 125.62 %'            │
│ 8       │ 'http - keepalive'    │ 25      │ '16710.83 req/sec' │ '± 2.87 %'  │ '+ 130.56 %'            │
│ 9       │ 'undici - request'    │ 10      │ '26142.70 req/sec' │ '± 2.58 %'  │ '+ 260.69 %'            │
│ 10      │ 'undici - dispatch'   │ 25      │ '26817.31 req/sec' │ '± 2.70 %'  │ '+ 270.00 %'            │
│ 11      │ 'undici - stream'     │ 10      │ '27448.13 req/sec' │ '± 2.00 %'  │ '+ 278.70 %'            │
└─────────┴───────────────────────┴─────────┴────────────────────┴─────────────┴─────────────────────────┘

There is still some way to go to reach undici.request(), but I have a feeling this can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request fetch
Projects
None yet
Development

No branches or pull requests