Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix race condition in unrefd interval test #3550

Closed
wants to merge 2 commits into from
Closed

test: fix race condition in unrefd interval test #3550

wants to merge 2 commits into from

Conversation

mcornac
Copy link
Contributor

@mcornac mcornac commented Oct 27, 2015

Before this commit, test-timers-unrefd-interval-still-fire required a
1ms interval to fire 5 times before a 100ms timeout to pass which could
cause intermittent failures in CI and in issue #1781.

This commit gives the test up to 5x as long to complete while allowing
the test to complete as quickly as before when possible.
cc @mhdawson

Before this commit, test-timers-unrefd-interval-still-fire required a
1ms interval to fire 5 times before a 100ms timeout to pass which could
cause intermittent failures in CI and in issue #1781.

This commit gives the test up to 5x as long to complete while allowing
the test to complete as quickly as before when possible.
@Fishrock123
Copy link
Contributor

I'm not too fond of this way of doing things.

I suggest re-writing the test to be more like https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js

Particularly, it is more reliable to depend on the order of timer execution rather than arbitrary timeouts.

A master timeout is, however, necessary to keep the process open. That being said, the last unrefed timeout should be able to tell when it has succeed and cancel the master timeout.

@Fishrock123
Copy link
Contributor

i.e.

setTimeout(b, 2)
setTimeout(a, 1)

b will always be called last. You must register the longer timers first.

@Fishrock123 Fishrock123 added timers Issues and PRs related to the timers subsystem / setImmediate, setInterval, setTimeout. test Issues and PRs related to the tests. labels Oct 27, 2015
@Fishrock123
Copy link
Contributor

Actually, that exact order may not be correct for "regular" timers, in my test it applied to _unrefActive, which is... a little different. I'll look more in detail soon.

@mcornac
Copy link
Contributor Author

mcornac commented Oct 28, 2015

I made some changes to look more like https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js. Mainly so that the master timeout is cleared as soon as the last unrefd interval fires.

@Fishrock123 Fishrock123 self-assigned this Oct 28, 2015

var N = 5;
const TEST_DURATION = common.platformTimeout(500);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

500 seems awfuly large with scaling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What number is reasonable here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100, although really you could probably get away with something smaller. If this is taking more than 50 or so ms you've got some big issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like 100ms was causing the intermittent failure in CI in the first place, unless that can be handled in platformTimeout.
cc @mhdawson

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, looks like we didn't do that for AIX yet? https://github.com/nodejs/node/blob/master/test/common.js#L232-L240

In this case I suggest setting the timeout to 500 only if common.isAix is true.

@Fishrock123
Copy link
Contributor

@mhdawson what are the sort of specs of an AIX machine? I am highly curious how 5 1ms timers could take more than 100ms.

@mhdawson
Copy link
Member

I think I may have gotten it wrong and its zLinux. as opposed to AIX. I think the main issue is that we have a good number of VMs sharing the same hardware. Its not that it fails all the time only that we see intermittent failures. @thinkingdust can you talk to Joran and provide more details if you have them.

@mhdawson
Copy link
Member

In terms of platform timeout I don't want to extend the length of all AIX tests because we see intermittent failures on a few tests.

@mcornac
Copy link
Contributor Author

mcornac commented Nov 2, 2015

The original failure was on zLinux so a platform specific change won't do.
With the updates, the test now clears the master timeout on success so its length will only matter when deciding that the test has failed.
I don't think the failure has repeated so I think we could keep the original 100ms.
/cc @joransiu

@Fishrock123
Copy link
Contributor

LGTM if common.platformTimeout(500); is 100 then.

I'd really like to know if @joransiu or anyone else has any insight as to how that could be happening though.

@mcornac
Copy link
Contributor Author

mcornac commented Nov 4, 2015

I think the failure was caused by other processes running on the CI machine. I can replicate the failure with the bash script below. The failure rate can be controlled by changing the number of background processes.

#!/bin/bash

NUM_PROCESSES=32
for j in `seq 1 $NUM_PROCESSES`;
do
  k=0
  while [ $k -lt 1000000 ]
  do
    ((k++))
  done &
done

pass=0
fail=0
output=""
for i in `seq 1 100`;
do
  output=$(./node test/parallel/test-timers-unrefd-interval-still-fires.js 2>&1)
  len=${#output}
  if [ $len -eq 0 ]; then
    ((pass++))
  else
    ((fail++))
  fi
  if [ $i -ne 1 ]; then
    echo -en "\e[1A"
  fi
  echo pass $pass : fail $fail
done
wait

@Fishrock123
Copy link
Contributor

@thinkingdust Any idea how many processor-level threads that machine has?

Like, are you literally just clogging everything up that much that the timer doesn't get run?

@joransiu
Copy link
Contributor

joransiu commented Nov 4, 2015

I had suggested a few experiments with @thinkingdust to get a better gauge on various timeout values for our Linux on Z configuration. The machine we are running on is a guest Linux system running with 8 CPs on zEC12 @ 5.5GHz => 8 processor level threads available.

With relative little load on the system, our very informal measurements:

Timeout Failure Rate
20ms 16/500 (~3%)
30ms 4 / 500 (~1%)
100ms 0 / 500 (0%)

Now, if we ran 8 instances in parallel... @ 100ms timeout, we observe a failure rate of about 1%. I'd say given these stats, 100ms is probably reasonable.

@mcornac
Copy link
Contributor Author

mcornac commented Nov 4, 2015

Keep open timeout has been changed back to 100ms.

var timer = setInterval(function() {

const keepOpen = setTimeout(() => {
assert.strictEqual(nbIntervalFired, N);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fishrock123
Copy link
Contributor

8 CPs on zEC12 @ 5.5GHz => 8 processor level threads available.

That should never fail this.

Is it possible libuv isn't scheduling these efficiently on AIX?

@Fishrock123
Copy link
Contributor

LGTM minus the comment.

@mcornac
Copy link
Contributor Author

mcornac commented Nov 9, 2015

I made the change to throw the same error as https://github.com/nodejs/node/blob/master/test/parallel/test-timers-unref-active-unenrolled-disposed.js#L14

I think it was slightly useful to print nbIntervalFired to see if the interval callback ran at all. Is it a good idea to include that?

@Fishrock123
Copy link
Contributor

@thinkingdust sure.

@mcornac
Copy link
Contributor Author

mcornac commented Nov 9, 2015

@Fishrock123 Is this format okay?

@Fishrock123
Copy link
Contributor

Was going to land squashed as

    test: fix race condition in unrefd interval test

    The test now relies more on timers implementation rather than arbitrary
    timeouts, and will cancel a master timeout once complete.

    Refs: https://github.com/nodejs/node/issues/1781
    PR-URL: https://github.com/nodejs/node/pull/3550
    Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
    Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>

@thinkingdust But I think I actually missed something, sorry, we should harden the test a bit more.

clearInterval(timer);
clearTimeout(keepOpen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In here, can you set timer._onTimeout to a function that throws an error if it called again?

(That will ensure it does not get called more than 5 times.)

@mcornac
Copy link
Contributor Author

mcornac commented Nov 16, 2015

@Fishrock123 Any more suggestions for this PR?

@Fishrock123
Copy link
Contributor

Fishrock123 pushed a commit to Fishrock123/node that referenced this pull request Nov 20, 2015
Rely more on timers implementation rather than arbitrary timeouts.

Refs: nodejs#1781
PR-URL: nodejs#3550

Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
@Fishrock123
Copy link
Contributor

Thanks, landed in 6de82c6

@Fishrock123
Copy link
Contributor

(Squashed and updated description)

@misterdjules
Copy link

I don't think I understand why 6de82c6 would fix the race condition and make the test fail less often in the situation described in the first comment of this PR. Am I missing something?

@Fishrock123
Copy link
Contributor

Ugh, guess I should have updated the description.

It doesn't fix it per se, but it does refactor the test to work in a more reliable fashion.

Ultimately if your 5 1ms timers aren't firing in under like 25ms you've probably got serious hardware problems, or armv6.

@misterdjules
Copy link

One way to fix the flakiness of this test would be to rely on the tests suite's per-test timeout (which is currently set to 60 seconds) and instead of adding a timer to hold the loop open (which would require us to hardcode this 60 seconds tests suite timeout somewhere else, or refactor the tests suite driver so that it can communicate that value to JS tests), we could add another type of handle that holds the loop open indefinitely, like a net.Server instance.

When the fifth timer's callback is called, we can close that handle and the test would exist asap in most cases, or the tests suite would make it time out in case it would fail. But we would get rid of almost all false positives.

@Fishrock123
Copy link
Contributor

When the fifth timer's callback is called, we can close that handle and the test would exist asap in most cases, or the tests suite would make it time out in case it would fail.

That's what happens now.

If 5 1ms timers fail within <100ms there are issues, this isn't just flakiness.

As it stands, this patch appears to have corrected the original issue.

@misterdjules
Copy link

When the fifth timer's callback is called, we can close that handle and the test would exist asap in most cases, or the tests suite would make it time out in case it would fail.

That's what happens now.

It seems that what happens now is that the test will fail if the 5 timers' callbacks haven't run within 100 ms. According to @mhdawson and @joransiu it is possible that on systems with high load, 5 timers' callbacks take longer than 100ms to run, which seems entirely plausible. That doesn't mean that unrefed timers are broken, but still it seems that this change would make the test fail in these situations. That would make the test flaky.

What I was describing is a way to give much more time for these timers to fire, not just 100ms, without making the best cases take longer, and without hardcoding the expiry delay for the handle that keeps the loop open. I believe that would make the test non flaky.

As it stands, this patch appears to have corrected the original issue.

Do you mean that it fixes #1781, or that it fixes the problems reported by @mhdawson, or both?

If I'm not missing anything, 6de82c6 has been pushed today, and #1781 was closed on July 28th, so I don't know how we can determine that this change fixed that issue.

rvagg pushed a commit that referenced this pull request Dec 5, 2015
Rely more on timers implementation rather than arbitrary timeouts.

Refs: #1781
PR-URL: #3550

Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
@rvagg rvagg mentioned this pull request Dec 17, 2015
MylesBorins pushed a commit that referenced this pull request Dec 29, 2015
Rely more on timers implementation rather than arbitrary timeouts.

Refs: #1781
PR-URL: #3550

Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
MylesBorins pushed a commit that referenced this pull request Jan 19, 2016
Rely more on timers implementation rather than arbitrary timeouts.

Refs: #1781
PR-URL: #3550

Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
@MylesBorins MylesBorins mentioned this pull request Jan 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Issues and PRs related to the tests. timers Issues and PRs related to the timers subsystem / setImmediate, setInterval, setTimeout.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants