Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite cucumber test caching (and support logic) #2795

Closed
wants to merge 45 commits into from
Closed

Conversation

TheMarex
Copy link
Member

@TheMarex TheMarex commented Aug 22, 2016

Issue

This PR implements #2745. As a result a lot of the support logic has to be reworked. Logs are now handled differently, they are saved by scenario in test/logs/{feature_path}/{scenario}.log.

/cc @MoKob @daniel-j-h @danpat @lbud

Tasklist

  • make all tests pass
  • rebase + squash
  • review
  • adjust for for comments

Requirements / Relations

None.

default: '--require features --tags ~@stress --tags ~@todo',
verify: '--require features --tags ~@todo --tags ~@bug --tags ~@stress -f progress',
jenkins: '--require features --tags ~@todo --tags ~@bug --tags ~@stress --tags ~@options -f progress',
bugs: '--require features --tags @bug',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we completely removing the bugs tag, or is this an accident to leave this out?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, looks like all bugs are removed below.

Copy link
Member Author

@TheMarex TheMarex Aug 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah this was just a small cleanup. @todo and @bug seem to have served the same function, so it is probably easier to have just one.

There are actually quite a few test cases marked as todo that seem weird, we should maybe clean up at some point. (@stress also seems to be dysfunctional)

Copy link

@MoKob MoKob Aug 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agree on the clean-up.
I tried to use todo as missing feature and bug as broken feature. I don't think that this separation serves much of a purpose, though. I am missing a complete overview if there can be important distinctions in the code-base but I agree that it looks like a good candidate to clean-up.

@TheMarex
Copy link
Member Author

TheMarex commented Aug 22, 2016

Some tests are currently failing because osrm-routed just exists with error code 0 without any errors, in the middle of a scenario. Each test passes individually, just running them like cucumber -p verify triggers it.

@lbud does this sound familiar to any bugs you hit back in the day?

EDIT: Fixed this. maxBuffer for exec was too small.

@TheMarex
Copy link
Member Author

Tests pass for me locally but there seem to be problems on travis.

/cc @springmeyer

@springmeyer
Copy link
Contributor

springmeyer commented Aug 30, 2016

I've been poking at this branch. All tests are passing locally, but on travis there are still 60 failures:

712 scenarios (60 failed, 652 passed)
3299 steps (60 failed, 1 skipped, 3238 passed)

And what looks like a bug in our JS code whereby we try to write to the log file after finishScenarioLogs has been called and the logStream has been nulled out.

/home/travis/build/Project-OSRM/osrm-backend/node_modules/cucumber/lib/cucumber/runtime/event_broadcaster.js:30
            throw error;
            ^
TypeError: node_modules/cucumber/lib/cucumber/support_code/library.js:15 node_modules/cucumber/lib/cucumber/support_code/library.js:15 Cannot read property 'write' of null
    at Object.writeToStream (/home/travis/build/Project-OSRM/osrm-backend/features/support/log.js:44:39)
    at start (/home/travis/build/Project-OSRM/osrm-backend/node_modules/d3-queue/build/d3-queue.js:68:13)
    at poke (/home/travis/build/Project-OSRM/osrm-backend/node_modules/d3-queue/build/d3-queue.js:56:26)
    at Queue.queue.defer (/home/travis/build/Project-OSRM/osrm-backend/node_modules/d3-queue/build/d3-queue.js:34:7)
    at Object.finishLogs.setupScenarioLogFile.callback.log.callback [as log] (/home/travis/build/Project-OSRM/osrm-backend/features/support/log.js:49:23)
    at emitOne (events.js:82:20)
    at Socket.emit (events.js:169:7)
    at readableAddChunk (_stream_readable.js:153:18)
    at Socket.Readable.push (_stream_readable.js:111:10)
    at Pipe.onread (net.js:536:20)

@springmeyer
Copy link
Contributor

I was able to replicate the above errors on linux in a docker image of Ubuntu Xenial. It looks like the logStream problem is independent of the cucumber test failures. The test failures can be replicated with ./node_modules/.bin/cucumber.js features/testbot/load.feature: https://gist.github.com/springmeyer/97f48cf7f64e9710e46b0b58655ec279. I notice that osrm-routed is left running.

I'm unsure the cause of these failures. But I wonder if the move from child_process.exec to child_process.spawn has caused the issue.

@oxidase
Copy link
Contributor

oxidase commented Aug 31, 2016

@springmeyer it is nodejs/node#2098

/bin/sh -c "/home/miha/foss/osrm-backend/build/osrm-routed /home/miha/foss/osrm-backend/test/cache/testbot/load.feature/dd1cffbaa9edee02f6605c9dd9f008c0/7dfbf4b6f2687a0dd069d80e893bb71b/23_load_data_directly_st.osrm -p 5000"

It worked before on linux because osrm-routed was spawn-ed directly without sh.
I don't see any easy solution, but changing back to use spawn for osrm-routed. If it is ok i can update the branch.

@TheMarex
Copy link
Member Author

I would totally be in favor of using spawn. Reasons why we use exec at the moment:

  • Passing environment variables over the shell (setting options.env does not seem to work on some platforms @springmeyer ?)
  • We need to capture a copy of the full stdout and stderr for some test cases (features/step_definitions/options.js)

Everything is now abstracted using runBin so it should be easy to switch, once the above problems are solved.

@springmeyer
Copy link
Contributor

Passing environment variables over the shell (setting options.env does not seem to work on some platforms @springmeyer ?)

tldr; I also think spawn should be fine for us now.

@TheMarex you may be remembering the problem I hit with DYLD_LIBRARY_PATH on osx: it was that DYLD_LIBRARY_PATH is stripped by OS X when passed in options.env. You can workaround this with exec because it supports prepended environment variables like FOO=bar ./command whereas with spawn the only way to pass environment variables is by passing them to options.env. Anyway, that limitation of spawn does not apply to us anymore because my original need for prepending DYLD_LIBRARY_PATH was avoided (using install_name_tool) (however this is still only in WIP branches).

@springmeyer
Copy link
Contributor

Thanks everyone for the help and comments. My next action: convert runBin to using child_process.spawn. This will require moving to a 'cmd',['arg','arg'] format. Will first ensure all tests are still passing on OS X, then will see if tests are passing on travis/linux. If travis/linux is still failing then I'll take a look at killing/tracking the child using the pid like we did before: however hopefully this will be unneeded.

@oxidase
Copy link
Contributor

oxidase commented Aug 31, 2016

still one issue with the log stream that must be deleted in a callback at https://github.com/Project-OSRM/osrm-backend/pull/2795/files#diff-122ec2288966c104a49c369d3146f016R38 but not with a race condition before child exit

@springmeyer
Copy link
Contributor

Thank you @oxidase for the great work on using execFile and for more flexible passing of options. I fixed one issue with the STXXLCFG value being passed through (that was leading to test failures because of slower running (or buggy?) stxxl disk fallback). And I found a place to shutdown osrm-routed to avoid the race condition. Looks like travis is now going green 🎉. Unless anyone has concerns I think this is ready to merge.

@springmeyer
Copy link
Contributor

Testing locally and noticing what the # of files inside test/cache stays constant now, as we want, between runs of npm test:

$ find test/cache/ -type f | wc -l
16316

But after a run there are 128 files that are newly modified. This a minor problem but still seems unexpected to me: https://gist.github.com/springmeyer/c9ae99dd3f6ead94c1070365893c178e. @TheMarex can you think of a reason why the files in the cache would be overridden even if nothing changed in the osrm-backend code?

@springmeyer
Copy link
Contributor

But after a run there are 128 files that are newly modified.

Looking like a race condition. I've found that when I run time ./node_modules/.bin/cucumber.js features/testbot/traffic_turn_penalties.feature sometimes osrm-contract and osrm-extract are run and sometimes not).

@springmeyer
Copy link
Contributor

Found and fixed the problem in c1b504d. I noticed that 1 times in 10 the hash returned from

hashOfFiles: (paths, cb) => {
let queue = d3.queue();
for (let i = 0; i < paths.length; ++i) {
queue.defer(fs.readFile, paths[i]);
}
queue.awaitAll((err, results) => {
if (err) return cb(err);
let checksum = crypto.createHash('md5');
for (let i = 0; i < results.length; ++i) {
checksum.update(results[i]);
}
cb(null, checksum.digest('hex'));
});
},
was different. This was happening because sometimes the order of the files passed in was different. Serializing the queue so that the files were read in a predictable order and therefore consistently sorted before hashing should solve this.

@MoKob MoKob changed the title [WIP] Rewrite cucumber test caching (and support logic) Rewrite cucumber test caching (and support logic) Sep 2, 2016
@oxidase
Copy link
Contributor

oxidase commented Sep 5, 2016

@springmeyer looks good to me for rebase and merge. The last timeout issue was due to slow md5 sum computation with a cold FS cache (locally for me ~10 seconds).

@oxidase
Copy link
Contributor

oxidase commented Sep 5, 2016

Just found one issue - https://github.com/Project-OSRM/osrm-backend/blob/tests/caching/features/support/hooks.js#L53 always makes osrm-routed shutdown. I will revert it and add osrm-routed logging output to a new stream.

@oxidase
Copy link
Contributor

oxidase commented Sep 14, 2016

@TheMarex there is still a problem with a data race if timeout occurs during osrm-extract or osrm-contract. 1b4f087 is a partial "solution", because if timeout occurs during osrm-extract, the scheduled osrm-contract will be started with parameters for the next scenario, but timestamp will be created for the original one. I see two possibilities:

  • make all task parameters local at defer calls. It will ignore timeouts during extraction and contraction (this is similar to the previous behavior), so it will be possible to run multiple extract/contract processes simultaneously.
  • setup scenario queue in Before, make extract/contract tasks abortable, and abort execution of queued tasks if a timeout occurs.

@springmeyer
Copy link
Contributor

@oxidase thanks for your further work on this. I propose we merge this and ticket the remaining issue you are seeing around behavior when a timeout occurs. Since, if I am understanding right, 1) timeouts should be rare, 2) you have ideas for solving handling them better, and 3) we have other PRs that would benefit from this being merged first (notably #2834). Does that sound right? Do you agree that merging this and handling timeout handling in a separate PR would be acceptable?

@karenzshea
Copy link
Contributor

karenzshea commented Sep 16, 2016

I rebased this onto master. Conflicts were just around retagging @ a few cucumber tests.

update: argh messed this up the first time.

@oxidase
Copy link
Contributor

oxidase commented Sep 16, 2016

@springmeyer i have made the first option, because making a global queue with abortable jobs would require too much time. At the moment if preprocessing starts it will finish even if a timeout occurs.Most important is that stamp files correspond to the original scenario. The last change makes consistent only extraction and contraction. If timeout occurs during writeOSM or linkOSM then possible error will be

    Step Definition: features/step_definitions/routing.js:4
    Message:
      Error: EEXIST: file already exists, link '/home/miha/foss/osrm-backend/test/cache/guidance/end-of-road.feature/bdbf19af33facfd685cb902e571cb78e/8_end_of_road_with_through_street.osm' -> '/home/miha/foss/osrm-backend/test/cache/guidance/end-of-road.feature/bdbf19af33facfd685cb902e571cb78e/7d282afd3cf004904a81c1bb22d409e6/8_end_of_road_with_through_street.osm'

To fix it the corrupted OSM files must be removed, but for 5 seconds timeout it is very unlikely.

TL;DR: LGTM for merge.

Copy link
Contributor

@oxidase oxidase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@MoKob MoKob added this to the 5.5.0 milestone Sep 19, 2016
@springmeyer springmeyer reopened this Sep 19, 2016
@springmeyer
Copy link
Contributor

Rebased and squashed and landed in 7d124ce. This landed a bit sooner than I intended - meant to consult with @MoKob beforehand. But unless there are concerns I'll keep it in master and it should have no harm on the code and just improves caching on travis and predictability and speed of test runs.

@MoKob
Copy link

MoKob commented Sep 20, 2016

@springmeyer seems fine. Since it is not touching any features, we are in a good state. I just would love to get 5.4 fixed before introducing major features into master, since porting patches could get painful otherwise. Hopefully we will have this done soon.

@TheMarex
Copy link
Member Author

TheMarex commented Oct 2, 2016

Great work here everyone. 🙇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants