Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix flaky watchFile() test on Windows #21694

Closed
wants to merge 2 commits into from
Closed

Conversation

Trott
Copy link
Member

@Trott Trott commented Jul 6, 2018

Detect race condition and retry test when it occurs.

Confirmed that a version of the test (modified to remove ES6-isms)
fails on 0.8.9 (which is expected as that was the last version with the
bug that the test was written to detect) and succeeds on 0.8.10.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines

@Trott Trott added fs Issues and PRs related to the fs subsystem / file system. windows Issues and PRs related to the Windows platform. flaky-test Issues and PRs related to the tests with unstable failures on the CI. labels Jul 6, 2018
@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Jul 6, 2018
@Trott
Copy link
Member Author

Trott commented Jul 6, 2018

Stress test on current master: https://ci.nodejs.org/job/node-stress-single-test/1930/

Hopefully it fails (thus demonstrating the issue and suggesting that a clean run with this PR would be a sign that it addresses it). But it might not. I'm not sure how frequent this race condition is. But there are other ways to trigger it if the above doesn't work.

@Trott
Copy link
Member Author

Trott commented Jul 6, 2018

Another solution would be to remove the assertions, as the original bug was that the callback didn't run at all. Simply confirming that the callback is invoked is enough to indicate that the bug has been fixed.

@Trott
Copy link
Member Author

Trott commented Jul 6, 2018

Stress test didn't work and I think it's because we don't have a support configuration. @nodejs/build Can we get win10 + vs2017 added and (if it's no longer supported) remove win10 + VS2015 from node-stress-single test?

https://ci.nodejs.org/job/node-stress-single-test/1930/nodes=win10-1p-vs2015/console

16:58:15 [win10-1p-vs2015] $ cmd /c call C:\Users\ADMINI~1\AppData\Local\Temp\jenkins8802136331077346752.bat
16:58:15 
16:58:15 c:\workspace\node-stress-single-test\nodes\win10-1p-vs2015>for /F "delims=. tokens=1" %F in ('call python tools\getnodeversion.py') do (set "NODE_MAJOR_VERSION=%F" ) 
16:58:15 
16:58:15 c:\workspace\node-stress-single-test\nodes\win10-1p-vs2015>(set "NODE_MAJOR_VERSION=11" ) 
16:58:15 
16:58:15 c:\workspace\node-stress-single-test\nodes\win10-1p-vs2015>if 11 GTR 5 if "vs2015" == "vs2013" (
16:58:15 echo Not building v11 with VS2013 as it is only supported up to v5.x  
16:58:15  exit /b 0 
16:58:15 ) 
16:58:15 
16:58:15 c:\workspace\node-stress-single-test\nodes\win10-1p-vs2015>if 11 LSS 8 if "vs2015" == "vs2017" (
16:58:15 echo Not building v11 with VS2017 as it is only supported starting on v8.x  
16:58:15  exit /b 0 
16:58:15 ) 
16:58:15 
16:58:15 c:\workspace\node-stress-single-test\nodes\win10-1p-vs2015>call vcbuild.bat release nosign x64 
16:58:15 Note: vcbuild no longer signs by default. "nosign" is redundant.
16:58:15 Looking for Python 2.x
16:58:16 Looking for NASM
16:58:16 Looking for Visual Studio 2017
16:58:16 Failed to find a suitable Visual Studio installation.
16:58:16 Try to run in a "Developer Command Prompt" or consult
16:58:16 https://github.com/nodejs/node/blob/master/BUILDING.md#windows-1
16:58:16 Build step 'Conditional steps (multiple)' marked build as failure
16:58:16 Notifying upstream projects of job completion
16:58:16 Finished: FAILURE

@Trott
Copy link
Member Author

Trott commented Jul 11, 2018

Failed once in 19200 runs at https://ci.nodejs.org/job/node-stress-single-test/1948/nodes=win10-vs2017/consoleText. So we'll need a heck of a lot of runs to show reliable failure and then clean success with a different approach.

@Trott
Copy link
Member Author

Trott commented Jul 12, 2018

OK, 13 failures in 51200 runs. That shows the master branch (with the test moved to parallel) fails from time to time: https://ci.nodejs.org/job/node-stress-single-test/1951/nodes=win10-vs2017/consoleText

Now to run a stress test on a fix...

@Trott
Copy link
Member Author

Trott commented Jul 12, 2018

Stress test against this pull request: https://ci.nodejs.org/job/node-stress-single-test/1956/nodes=win10-vs2017/

@Trott
Copy link
Member Author

Trott commented Jul 12, 2018

Stress test was successful! Remaining tasks before this can land:

  • Rewrite without ES6-isms to confirm it still fails on 0.8.9 and passes in 0.8.10.
  • Try stress test and 0.8.9/0.8.10 test with version without the timer (which I believe is only there due to the race condition).
  • Run full CI.
  • Get some reviews.

@jasnell jasnell added the wip Issues and PRs that are still a work in progress. label Jul 12, 2018
Copy link
Member

@BridgeAR BridgeAR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rubberstamp LGTM.

The regression that test-fs-watch-file-enoent-after-deletion was written
to test for involves whether or not the callback runs or not. Checking
what the file watcher reports unfortunately is subject to race
conditions on Windows (and possibly elsewhere) because the file watcher
returns control to the event loop before it may be receiving data from
the OS. So remove those assertions. The test still checks what it is
supposed to check, but is no longer subject to race conditions.
@Trott
Copy link
Member Author

Trott commented Jul 13, 2018

Confirmed that this version of the test (rewritten to remove ES6-isms) passes in 0.8.10 and never exits (so fails) in 0.8.9. This is what we want/expect.

'use strict';

// This test is broken in 0.8.9--it never exits.
// This test passes in 0.8.10 and more recent because the bug in 0.8.9 is fixed.

// Make sure the deletion event gets reported in the following scenario:
// 1. Watch a file.
// 2. The initial stat() goes okay.
// 3. Something deletes the watched file.
// 4. The second stat() fails with ENOENT.

// The second stat() translates into the first 'change' event but a logic error
// stopped it from getting emitted.
// https://github.com/nodejs/node-v0.x-archive/issues/4027

var path = require('path');
var fs = require('fs');

var filename = path.join('/tmp', 'watched');
try {
  fs.unlinkSync(filename);
} catch (e) {
  // swallow error, it's probably just that the file doesn't exist because
  // you haven't run this test before
}
fs.writeFileSync(filename, 'quis custodiet ipsos custodes');

fs.watchFile(filename, { interval: 50 }, function(curr, prev) {
  fs.unwatchFile(filename);
});

setTimeout(fs.unlinkSync, 300, filename);

@Trott
Copy link
Member Author

Trott commented Jul 13, 2018

Same results with 0.8.9 and 0.8.10 with the timer stripped out. Woo hoo!

'use strict';

// This test is broken in 0.8.9--it never exits.
// This test passes in 0.8.10 and more recent because the bug in 0.8.9 is fixed.

// Make sure the deletion event gets reported in the following scenario:
// 1. Watch a file.
// 2. The initial stat() goes okay.
// 3. Something deletes the watched file.
// 4. The second stat() fails with ENOENT.

// The second stat() translates into the first 'change' event but a logic error
// stopped it from getting emitted.
// https://github.com/nodejs/node-v0.x-archive/issues/4027

var path = require('path');
var fs = require('fs');

var filename = path.join('/tmp', 'watched');
try {
  fs.unlinkSync(filename);
} catch (e) {
  // swallow error, it's probably just that the file doesn't exist because
  // you haven't run this test before
}
fs.writeFileSync(filename, 'quis custodiet ipsos custodes');

fs.watchFile(filename, { interval: 50 }, function(curr, prev) {
  fs.unwatchFile(filename);
});

fs.unlinkSync(filename);

The timer was there to address a race condition that has been removed
from the test. Remove it.
@Trott
Copy link
Member Author

Trott commented Jul 13, 2018

Stress test on the new version that has no timer. https://ci.nodejs.org/job/node-stress-single-test/1959/nodes=win10-vs2017/

@Trott
Copy link
Member Author

Trott commented Jul 13, 2018

Stress test without the timer came back green! 🎉

@Trott
Copy link
Member Author

Trott commented Jul 13, 2018

@Trott Trott removed the wip Issues and PRs that are still a work in progress. label Jul 13, 2018
@Trott
Copy link
Member Author

Trott commented Jul 13, 2018

ping @apapirovski

@Trott Trott added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Jul 13, 2018
@Trott
Copy link
Member Author

Trott commented Jul 13, 2018

Note to self (or whoever lands this): add Fixes: https://github.com/nodejs/node/issues/21692 metadata to the first commit.

Trott added a commit to Trott/io.js that referenced this pull request Jul 14, 2018
The regression that test-fs-watch-file-enoent-after-deletion was written
to test for involves whether or not the callback runs or not. Checking
what the file watcher reports unfortunately is subject to race
conditions on Windows (and possibly elsewhere) because the file watcher
returns control to the event loop before it may be receiving data from
the OS. So remove those assertions. The test still checks what it is
supposed to check, but is no longer subject to race conditions.

Fixes: nodejs#21692

PR-URL: nodejs#21694
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Trott added a commit to Trott/io.js that referenced this pull request Jul 14, 2018
The timer was there to address a race condition that has been removed
from the test. Remove it.

PR-URL: nodejs#21694
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
@Trott
Copy link
Member Author

Trott commented Jul 14, 2018

Landed in 39977db...6d60c93

@Trott Trott closed this Jul 14, 2018
targos pushed a commit that referenced this pull request Jul 14, 2018
The regression that test-fs-watch-file-enoent-after-deletion was written
to test for involves whether or not the callback runs or not. Checking
what the file watcher reports unfortunately is subject to race
conditions on Windows (and possibly elsewhere) because the file watcher
returns control to the event loop before it may be receiving data from
the OS. So remove those assertions. The test still checks what it is
supposed to check, but is no longer subject to race conditions.

Fixes: #21692

PR-URL: #21694
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
targos pushed a commit that referenced this pull request Jul 14, 2018
The timer was there to address a race condition that has been removed
from the test. Remove it.

PR-URL: #21694
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
@targos targos mentioned this pull request Jul 17, 2018
@Trott Trott deleted the fix-21692 branch January 13, 2022 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. flaky-test Issues and PRs related to the tests with unstable failures on the CI. fs Issues and PRs related to the fs subsystem / file system. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants