Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long-running application memory usage testing #2423

Closed
rvagg opened this issue Aug 18, 2015 · 39 comments
Closed

Long-running application memory usage testing #2423

rvagg opened this issue Aug 18, 2015 · 39 comments
Labels
memory Issues and PRs related to the memory management or memory footprint.

Comments

@rvagg
Copy link
Member

rvagg commented Aug 18, 2015

I don't want to be presumptuous and drop a new repo into this org so I'm starting it here: https://github.com/rvagg/node-memtest but will move it if it's considered helpful by others (perhaps there's a good critique that will demolish my approach).

I've been using it to test long-running memory usage across versions to generate graphs like this:

memtest

It can either put data into InfluxDB for plotting by Grafana or it can spit out memory usage to stdout so you can watch it stream by.

There are only two "tests" in there so far, just starters, I imagine the list of tests expanding over time as we come up with use-cases from the wild that demonstrate interesting behaviour we should monitor.

The list of Node & io.js versions is also dynamic, I've been including all of the "stable" ones and in the above screenshot I also have a nightly of master in the mix to test of the 3.0.0 leak is still there. It could be evolved to automatically pick up new nightlies and continually be testing and producing data for us to give us confidence in our releases.

I've set up anonymous access to an instance I have running here if you want to have a poke at it: http://128.199.216.28:3000/dashboard/db/memory-tests

Is anyone interested in helping out with this?

@ChALkeR ChALkeR added the memory Issues and PRs related to the memory management or memory footprint. label Aug 18, 2015
@kzc
Copy link

kzc commented Aug 18, 2015

Tracking process vsize (virtual size) would also be useful. rss can potentially miss some leaks if the leaked memory is not being actively referenced (swapped out and not resident in memory).

@EricMCornelius
Copy link

👍 much needed validation, thanks for tackling this!

@ChALkeR
Copy link
Member

ChALkeR commented Aug 24, 2015

I did not take a look on the details, but 2.5.0 seems more stable than 3.1/3.1.1, especially in the http-spew test.

@rvagg
Copy link
Member Author

rvagg commented Aug 24, 2015

@ChALkeR yeah, heavy Buffer use puts us at the mercy of how V8 treats ArrayBuffers unfortunately, at least it doesn't grow though. @trevnorris has talked about doing some deeper investigation into the nature of GC on this and whether it's totally on the ArrayBuffer side or if there is something we can tweak on our end.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 24, 2015

From the first look it could be related to #1671, but I'm not sure yet.

@trevnorris
Copy link
Contributor

Here's the flame graph for creating Buffers in v2.5.0: https://cldup.com/VDEJ25y_oL.svg

Here it is for v3.1.0: https://cldup.com/90ZGBV_cfP.svg

The two things that stand out to me are std::map and the time spent in some sort of malloc. @bnoordhuis also noticed that it generates a large number of page faults. Not sure what to make of this, but will attempt to investigate. Either way this requires a V8 fix.

@kzc
Copy link

kzc commented Aug 24, 2015

@trevnorris Memory and CPU are like apples and oranges. You can often get lower memory use in these synthetic IO benchmarks by using a smaller default Buffer size, but usually at the expense of CPU in increased garbage collection bookkeeping.

Which variable are you trying to minimize?

@trevnorris
Copy link
Contributor

@kzc Choosing 64KB buffers was not random. The probable most common case of I/O in node is done over TCP. Which lends itself to the 64KB packet limit test. Which is why I chose to create the flamegraph with that value.

Though I have much less concern at this point over memory usage, than the fact that Buffer instantiation time is over 2x slower then it was in v0.12. And not only in a small range of sizes, but at all tested sizes (which stopped at 256KB).

@kzc
Copy link

kzc commented Aug 25, 2015

A tight for loop around 1e7 unused new SlowBuffer()s doesn't measure Buffer instantiation time as much as it measures GC time as seen in the stats from flame graphs above:

v2.5.0:
    node::smalloc::Alloc                  - 78.32% of CPU
    v8::internal::Heap::CollectAllGarbage - 69.75% of CPU

v3.1.0:
    node::Buffer::Create                  - 58.24% of CPU
    v8::ArrayBuffer::New                  - 47.86% of CPU
    v8::internal::Heap::CollectAllGarbage - 36.00% of CPU

You have to take the results of synthetic benchmarks with a grain of salt as it doesn't approximate a real life scenario which allocates buffer memory on an as-needed basis putting much less pressure on GC.

One would also have to measure IO throughput in addition to measuring CPU to get a fair comparison of relative node version performance. Ultimately IO throughput is the most important metric, or perhaps throughput per unit of energy depending on the situation.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 25, 2015

You have to take the results of synthetic benchmarks with a grain of salt as it doesn't approximate a real life scenario which allocates buffer memory on an as-needed basis putting much less pressure on GC.

I'm not quite sure what you mean here. Buffers do not trigger scavenges, and it actually takes a mark-sweep to clean them. The problem that I know about is that Buffer usage triggers many mark-sweeps, which is slow and not memory-efficient.

One would also have to measure IO throughput in addition to measuring CPU to get a fair comparison of relative node version performance. Ultimately IO throughput is the most important metric, or perhaps throughput per unit of energy depending on the situation.

Do you have a benchmark to share?

@trevnorris
Copy link
Contributor

Doing testing between v2.5.0 and v3.2.0 I saw a jump in minor page faults from 27,300 to 5,955,571. So at a quick glance it would seem that something bad for performance is occurring.

@trevnorris
Copy link
Contributor

From the benchmarks/ dir. Here are a few results:

                                            v3.2.0   v0.12.7

net/net-s2c.js len=16777216 type=buf dur=5: 16.1     21.5
net/net-c2s.js len=16777216 type=buf dur=5: 16.7     19.6

These two show the issue at its worst because throughput is most dependent on being able to allocate new buffers, write them and clean them up once the write is complete. We have companies like F5 attempting to do millions of transactions, and when a single buffer allocation takes 0.5us more (whether from instantiation or necessary GC) it gravely affects their application.

@kzc
Copy link

kzc commented Aug 25, 2015

I'm not quite sure what you mean here. Buffers do not trigger scavenges, and it actually takes a mark-sweep to clean them.

v8::internal::Heap::CollectAllGarbage is indirectly called by node::Buffer::Create.

The act of creating a SlowBuffer invokes GC.

Do you have a benchmark to share?

The ones provided in this ticket could be used. They just have to close the server and exit after a fixed number of transactions.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 25, 2015

@kzc, I still fail to see why does it need a v8::internal::Heap::CollectAllGarbage call and how is that good for performance and/or memory.

@kzc
Copy link

kzc commented Aug 25, 2015

I still fail to see why does it need a v8::internal::Heap::CollectAllGarbage call and how is that good for performance and/or memory.

Because 1e7 new SlowBuffer()s are created in a tight loop in JS, it has no choice but to reclaim memory from a GC within the v8 allocator. It's not a matter of good nor bad - just necessary.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 25, 2015

@kzc Scavenges are enough in all the supplied testcases, I believe. My question is why is it running full mark-sweeps.

@trevnorris
Copy link
Contributor

They just have to close the server and exit after a fixed number of transactions.

Wait, what? I've got to be misunderstanding what you're saying. node servers are doing high throughput for days/weeks/months at a time. So, what did you mean?

Also, to make sure it's clear, Buffers are created w/ the kInternalized flag. So V8 has full control of the memory.

@trevnorris
Copy link
Contributor

The act of creating a SlowBuffer invokes GC.

Why?

@trevnorris
Copy link
Contributor

reclaim memory from a GC within the v8 allocator

We're not using ArrayBuffer::Allocator. If that's what you're referring to. We malloc() manually then hand it off using ArrayBuffer::New() using kInternalized. So that V8 can take full control of the memory's lifetime.

@kzc
Copy link

kzc commented Aug 25, 2015

@trevnorris - I was referring to constructing a benchmark that exits after a certain number of HTTP requests. If you time the thing you get an idea of how long node takes to process N http requests/responses as fast as a client can handle. Nothing fancy.

@trevnorris
Copy link
Contributor

Ah okay. Thanks for clarifying. :)

@kzc
Copy link

kzc commented Aug 25, 2015

Why does creating a SlowBuffer invoke GC? It's clearly visible in both your flame graphs. It has to free up memory somehow in the tight loop. There's no pauses in the event loop to collect garbage otherwise.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 25, 2015

@kzc Why does it invoke a mark-sweep?

@kzc
Copy link

kzc commented Aug 25, 2015

We're not using ArrayBuffer::Allocator. If that's what you're referring to. We malloc() manually then hand it off using ArrayBuffer::New() using kInternalized.

malloc, free, v8 memory, GCing - it's all quite recursively intertwined. It's hard measure one in isolation because each affect all of them.

Might it be possible to reduce GC by recycling the IO SlowBuffers with their already malloc()ed memory?

@kzc
Copy link

kzc commented Aug 25, 2015

Why does it invoke a mark-sweep?

Not exactly sure why, but would v8 have any other choice when allocating 64K * 1e7 in a tight JS loop?

@trevnorris
Copy link
Contributor

Here's the revised test script:

var SlowBuffer = require('buffer').SlowBuffer;
var ITER = 1e5, SIZE = 65536, cntr = 0;
var t = process.hrtime();

(function runner() {
  if (++cntr > ITER) return print();
  new SlowBuffer(SIZE);
  setImmediate(runner);
}());

function print() {
  t = process.hrtime(t);
  console.log(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
}

This allows a full turn of the event loop between each Buffer creation. Even so, allocating 64KB buffers runs in 3060 ns/op and 5300 ns/op respectively.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 25, 2015

@trevnorris What I'm talking about when I mention scavenges vs mark-sweeps is that if you insert an if (cntr % 177 === 0) gc(true); to your testcase, it starts running 15-30% faster.

If it doesn't, you might have to decrease 177 to something lower, the point is to run a scavenge before v8 decides to run a mark-sweep.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 25, 2015

@kzc Try inserting gc(true) as I mentioned above.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 25, 2015

20x «concurrency» and a warmup:

'use strict';

var limit = 1000000;
var concurrent = 20;
var scavenges = true;

var warmup = limit / 10;
var start, count = 0;
limit += warmup;

function call() {
    var buffer = new Buffer(64 * 1024);
    var bufferx;
    if (scavenges && (count % 170 === 0)) {
        // gc(true) is a scavenge. Not sure if this is documented.
        gc(true);
    }
    count++;
    if (count > limit) {
        console.log(process.hrtime(start));
        process.exit(0);
    } else if (count === warmup) {
        start = process.hrtime();
    }
    setImmediate(call);
}

for (var i = 0; i < concurrent; i++) {
    call();
}

This has up to 2x speedup with gc(true) enabled (scavenges set to true).

@trevnorris Yes, with SlowBuffer too. In fact, with SlowBuffer the difference is even more visible (2.5x).

@trevnorris
Copy link
Contributor

@ChALkeR The reason SlowBuffer is the more important metric is because that's how data packets are handled coming in from I/O. Those two use the exact same code path, except SlowBuffer must make a jump from JS to C++. Which only adds 30-50ns max.

@kzc
Copy link

kzc commented Aug 25, 2015

@trevnorris - Even so for allocating 64KB buffers run in 3060 ns/op and 5300 ns/op respectively.

I see. Fair enough. Might it be possible to reduce GC by recycling the old IO SlowBuffers in their entirety with their already malloc()ed memory?

@ChALkeR - 2x speedup with gc(true) - very cool.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 25, 2015

Btw, the memory usage also drops about 2.5x times with gc(true) (using SlowBuffers):

[chalker@pc arraybuffer]$ node --expose-gc test-quick-0.js 
[ 5, 825429614 ]
{ rss: 64040960, heapTotal: 10307328, heapUsed: 4564176 }

[chalker@pc arraybuffer]$ node --expose-gc test-quick-0.js 
[ 5, 724898065 ]
{ rss: 60608512, heapTotal: 9275392, heapUsed: 4564176 }

[chalker@pc arraybuffer]$ node --expose-gc test-quick-0.js 
[ 5, 711345311 ]
{ rss: 65585152, heapTotal: 9275392, heapUsed: 4564176 }

[chalker@pc arraybuffer]$ node --expose-gc test-quick-1.js 
[ 2, 547167883 ]
{ rss: 24649728, heapTotal: 21658624, heapUsed: 3784272 }

[chalker@pc arraybuffer]$ node --expose-gc test-quick-1.js 
[ 2, 562118909 ]
{ rss: 24330240, heapTotal: 21658624, heapUsed: 3784128 }

[chalker@pc arraybuffer]$ node --expose-gc test-quick-1.js 
[ 2, 649892998 ]
{ rss: 24715264, heapTotal: 21658624, heapUsed: 3775640 }

Measured at the end, but it should be fine, as RSS doesn't generally go down.

@trevnorris
Copy link
Contributor

@kzc

Might it be possible to reduce GC by recycling the old IO SlowBuffers in their entirety with their already malloc()ed memory?

No. The Object's are not persistent weak handles, so we are not notified when they are cleaned up. Instead we hand off the memory to the ArrayBuffer using kInternalized so V8 can clean up the object at its leisure.

@ChALkeR
Copy link
Member

ChALkeR commented Aug 25, 2015

My point is that the default GC strategy is highly inefficient at least in these testcases, and it is taking at least half of the total testcase time to do things that aren't really needed here. And it can't even keep the memory down as low as it could if it was using scavenges.

I am not yet sure how that affects real-world apps, it needs more testing.

The main issue for that is #1671.

@kzc
Copy link

kzc commented Aug 26, 2015

@ChALkeR - your modification of @trevnorris' benchmark is really fascinating. You're really on to something here.

1,100,001 new Buffer(64 * 1024) calls in all cases, timings with/without gc(true), and concurrent set to different values:

concurrent = 20, gc(true) called:
  2.25user 0.01system 0:02.27elapsed 99%CPU (0avgtext+0avgdata 29,892maxresident)k
  0inputs+0outputs (0major+231,176minor)pagefaults 0swaps

concurrent = 20, gc(true) NOT called:
  4.29user 0.29system 0:04.46elapsed 102%CPU (0avgtext+0avgdata 51,508maxresident)k
  0inputs+0outputs (0major+579,680minor)pagefaults 0swaps

concurrent = 1, gc(true) called:
  8.87user 0.02system 0:08.91elapsed 99%CPU (0avgtext+0avgdata 19,608maxresident)k
  0inputs+0outputs (0major+1,090,812minor)pagefaults 0swaps

concurrent = 1, gc(true) NOT called:
  9.83user 0.20system 0:09.91elapsed 101%CPU (0avgtext+0avgdata 83,232maxresident)k
  0inputs+0outputs (0major+646,562minor)pagefaults 0swaps

The gc(true) effect on lowering elapsed time, memory and CPU I can understand, but the concurrent effect I never expected - there is no concurrency in this single thread node process after all. But even so, it's dramatically faster with 20 async callbacks on the go at once. Are the new Buffer(64 * 1024)s going into different v8 memory zones or something? The concurrent = 20 setting better resembles a real world scenario with 20 socket connections.

@kzc
Copy link

kzc commented Aug 26, 2015

The concurrent = 20 setting better resembles a real world scenario with 20 socket connections.

Let me take that statement back - new Buffer(65336) is not used, after all. An HTTP benchmark with 20 concurrent connections would better resemble a real world scenario.

@cjihrig
Copy link
Contributor

cjihrig commented Nov 10, 2015

@rvagg what is the status here? Do you still actively use this? Still potentially interested in bringing it into the org?

@rvagg
Copy link
Member Author

rvagg commented Nov 14, 2015

@cjihrig probably not, I don't want to lump the org with activity that there is little interest in maintaining. I'll keep this as a private project for now unless others want to join the fun.

@rvagg rvagg closed this as completed Nov 14, 2015
@cjihrig
Copy link
Contributor

cjihrig commented Nov 14, 2015

OK, sounds good. For the record, if you change your mind, I'd be +1 to having it in the org. Seems like useful info to have.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
memory Issues and PRs related to the memory management or memory footprint.
Projects
None yet
Development

No branches or pull requests

6 participants