
[rllib] Fix impala stress test #5101

Merged: 10 commits merged Jul 10, 2019
Conversation

@ericl (Contributor) commented Jul 3, 2019

What do these changes do?

This test was failing because it hit a configuration edge case that caused plasma memory to be shared with the learner thread without a copy. This is a known plasma instability issue.

Also, clean up the config files so that custom wheels are easier to specify: you now just edit the config.yaml file directly, and the configuration is straightforward.
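The config uses `<<<RAY_BRANCH>>>`, `<<<RAY_COMMIT>>>`, and `<<<RAY_VERSION>>>` placeholders in the wheel URL that the user fills in by hand when editing config.yaml. A minimal sketch of what that substitution amounts to (the helper below is hypothetical, not part of the PR):

```python
# Hypothetical helper illustrating the placeholder substitution the
# user performs by hand when editing config.yaml.
TEMPLATE = (
    "https://s3-us-west-2.amazonaws.com/ray-wheels/"
    "<<<RAY_BRANCH>>>/<<<RAY_COMMIT>>>/"
    "ray-<<<RAY_VERSION>>>-cp36-cp36m-manylinux1_x86_64.whl"
)

def fill_placeholders(template, branch, commit, version):
    # Replace each <<<...>>> marker with the user-supplied value.
    return (
        template
        .replace("<<<RAY_BRANCH>>>", branch)
        .replace("<<<RAY_COMMIT>>>", commit)
        .replace("<<<RAY_VERSION>>>", version)
    )

url = fill_placeholders(TEMPLATE, "master", "abc123", "0.8.0.dev1")
print(url)
```

Leaving any placeholder unsubstituted yields a URL that does not exist in S3, so `pip install` fails loudly by default, which is the intended "crash by default" behavior discussed below.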

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15054/

@@ -35,8 +35,10 @@ worker_nodes:
# List of shell commands to run to set up nodes.
setup_commands:
# Install nightly Ray wheels.
- source activate tensorflow_p36 && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/<<<RAY_BRANCH>>>/<<<RAY_COMMIT>>>/ray-<<<RAY_VERSION>>>-cp36-cp36m-manylinux1_x86_64.whl
Contributor:

will defer to @robertnishihara on this

Collaborator:

The issue with pinning it to ray==0.7.2 is that it can lead to people accidentally testing ray==0.7.2 instead of the commit they are trying to test.

I've made this mistake a number of times, but I think the current script + config will naturally prompt the user to enter the relevant version and so on, preventing that error.

@ericl are you changing this in order to be able to run the config file on its own and without having to run ./start_workloads.sh?

Contributor (Author):

Good point, I changed it to a placeholder that will crash by default.

The reason I'm making this change is that it takes too long to figure out how to get it to load the version I want via command-line args (and it's simply not possible for some version specs).

So it makes sense to require the user to edit the file manually rather than try to automate it.

Collaborator:

Can you elaborate on this a bit? For example, I imagine most people would start by running

./start_workloads.sh

that would fail and prompt them for the branch.

Then they would run something like

./start_workloads.sh master

or

./start_workloads.sh releases/0.7.2

that would fail and prompt them for the ray version. Then they would run something like

./start_workloads.sh master 0.8.0.dev1

that would fail and prompt them for the commit. Then they would run something like

./start_workloads.sh master 0.8.0.dev1 62e4b591e3d6443ce25b0f05cc32b43d5e2ebb3d

and then that would work.

That seems like a lot, but the feedback is very quick.

If we don't prompt people like that, then the autoscaler will start up but will fail when it tries to install the wheel RAY_WHEEL_TO_TEST_HERE, which doesn't exist. The user will then need to parse the error message, figure out what to change, and look up how the wheel filenames are formatted in S3, which is not straightforward. That's a much longer turnaround time.

Can you say what versions are impossible to test? Is that due to the change in the S3 filenames that we're using (since we changed it to include the branch)?
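The staged prompting described above, where each invocation fails fast with a specific message until branch, version, and commit are all supplied, could be sketched like this (hypothetical code, not the actual start_workloads.sh logic):

```python
import sys

USAGE = "Usage: ./start_workloads.sh RAY_BRANCH RAY_VERSION RAY_COMMIT"
PROMPTS = [
    "Please specify the Ray branch to test (e.g. master).",
    "Please specify the Ray version (e.g. 0.8.0.dev1).",
    "Please specify the Ray commit SHA.",
]

def parse_args(argv):
    # Fail fast with a prompt for the first missing argument,
    # mirroring the branch -> version -> commit order described above.
    if len(argv) < 3:
        raise SystemExit(USAGE + "\n" + PROMPTS[len(argv)])
    return argv[0], argv[1], argv[2]

if __name__ == "__main__":
    branch, version, commit = parse_args(sys.argv[1:])
    print(branch, version, commit)
```

Each failure names exactly the next missing argument, which is what keeps the feedback loop quick despite the multiple invocations.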

Contributor (Author):

That's great in the ideal case, but in practice what will happen is that you'll have some wheel you want to test and will have to spend about 15 minutes reading the code to figure out what arguments to pass to start_workloads...

Contributor (Author):

My point here is that the existing solution is too complicated. By deleting the code, you end up with a simpler, easier-to-understand workflow.

Collaborator:

Would the wheel be fully specified by the branch/version/commit combination? Or are you referring to a situation where the wheels aren't in S3?

The script will definitely need to be customized in some cases. E.g., if I want to compile Ray from one of my branches, then I'd need to modify the script.

Regarding needing to read the code, I tried to make the script print informative prompts so that the user would know exactly what arguments to pass in. Do you think they might be unclear in some settings?

@ericl (Contributor, Author) commented Jul 4, 2019

I'm going to push some more changes that fix a couple of other issues:

  • upgrading TF
  • removing buggy tmux

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15087/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15089/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1465/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15118/

@ericl (Contributor, Author) commented Jul 5, 2019

This is ready to merge. The test passes, modulo #5125.

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15123/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1470/

@ericl ericl requested review from pcmoritz and removed request for robertnishihara July 6, 2019 23:14
@ericl ericl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jul 7, 2019
self.batch_buffer)
if len(self.batch_buffer) == 1:
# make a defensive copy to avoid sharing plasma memory
# across multiple threads
Collaborator:

Do we understand what the issue is? The plasma client should be thread-safe now, after apache/arrow#4503.

Contributor (Author):

Hmm, let me try it again. It's possible I wasn't using the latest wheels.

Contributor (Author):

Oh, I know why: it's because I wasn't able to test with versions of Ray after #5125.

That almost certainly excludes Arrow versions with that patch.
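For context, the defensive copy in the diff above is the standard pattern of materializing arrays out of shared (plasma) object-store memory into process-local memory before handing them to another thread. A generic sketch of the pattern, not RLlib's actual code:

```python
import copy
import numpy as np

def defensive_copy(batch):
    # Copy each array into fresh process-local memory so no thread
    # keeps a view into the shared object-store buffer; non-array
    # values are deep-copied for the same reason.
    return {
        k: np.copy(v) if isinstance(v, np.ndarray) else copy.deepcopy(v)
        for k, v in batch.items()
    }

shared = {"obs": np.zeros((2, 3)), "meta": {"t": 1}}
local = defensive_copy(shared)
local["obs"][0, 0] = 1.0  # mutating the copy...
print(shared["obs"][0, 0])  # ...leaves the original buffer untouched: 0.0
```

The copy costs memory bandwidth but removes any dependence on the object store's buffer lifetime or thread-safety guarantees.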

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15169/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1516/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1557/

@robertnishihara (Collaborator) left a comment:

Looks good to me. Can you also update the instructions in https://github.com/ray-project/ray/blob/master/ci/long_running_tests/README.rst?

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15248/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1561/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15252/

@ericl ericl merged commit 5ab5017 into ray-project:master Jul 10, 2019
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

4 participants