
[rllib] Fix impala stress test #5101

Merged: 10 commits merged Jul 10, 2019
Conversation

@ericl (Contributor) commented Jul 3, 2019

What do these changes do?

This test was failing because it hit a configuration edge case that caused plasma memory to be shared with the learner thread without a copy. This is a known plasma instability issue.

Also, clean up the config files so that custom wheels are easier to specify: you now just edit the config.yaml file directly, and the configuration is straightforward.
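The config uses `<<<RAY_BRANCH>>>`, `<<<RAY_COMMIT>>>`, and `<<<RAY_VERSION>>>` placeholders in the wheel URL that the user fills in by hand when editing config.yaml. A minimal sketch of what that substitution amounts to (the helper below is hypothetical, not part of the PR):

```python
# Hypothetical helper illustrating the placeholder substitution the
# user performs by hand when editing config.yaml.
TEMPLATE = (
    "https://s3-us-west-2.amazonaws.com/ray-wheels/"
    "<<<RAY_BRANCH>>>/<<<RAY_COMMIT>>>/"
    "ray-<<<RAY_VERSION>>>-cp36-cp36m-manylinux1_x86_64.whl"
)

def fill_placeholders(template, branch, commit, version):
    # Replace each <<<...>>> marker with the user-supplied value.
    return (
        template
        .replace("<<<RAY_BRANCH>>>", branch)
        .replace("<<<RAY_COMMIT>>>", commit)
        .replace("<<<RAY_VERSION>>>", version)
    )

url = fill_placeholders(TEMPLATE, "master", "abc123", "0.8.0.dev1")
print(url)
```

Leaving any placeholder unsubstituted yields a URL that does not exist in S3, so `pip install` fails loudly by default, which is the intended "crash by default" behavior discussed below.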

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15054/

@@ -35,8 +35,10 @@ worker_nodes:
# List of shell commands to run to set up nodes.
setup_commands:
# Install nightly Ray wheels.
- source activate tensorflow_p36 && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/<<<RAY_BRANCH>>>/<<<RAY_COMMIT>>>/ray-<<<RAY_VERSION>>>-cp36-cp36m-manylinux1_x86_64.whl
Contributor:

will defer to @robertnishihara on this

Collaborator:

The issue with pinning it to ray==0.7.2 is that it can lead to people accidentally testing ray==0.7.2 instead of the commit they are trying to test.

I've made this mistake a number of times, but I think the current script + config will naturally prompt the user to enter the relevant version and so on, preventing that error.

@ericl are you changing this in order to be able to run the config file on its own and without having to run ./start_workloads.sh?

Contributor (Author):

Good point, I changed it to a placeholder that will crash by default.

The reason I'm making this change is that it takes too long to figure out how to get it to load the version I want via command-line args (and it's simply not possible for some version specs).

So it makes sense to require the user to edit the file manually rather than try to automate it.

Collaborator:

Can you elaborate on this a bit? For example, I imagine most people would start by running

./start_workloads.sh

that would fail and prompt them for the branch.

Then they would run something like

./start_workloads.sh master

or

./start_workloads.sh releases/0.7.2

that would fail and prompt them for the ray version. Then they would run something like

./start_workloads.sh master 0.8.0.dev1

that would fail and prompt them for the commit. Then they would run something like

./start_workloads.sh master 0.8.0.dev1 62e4b591e3d6443ce25b0f05cc32b43d5e2ebb3d

and then that would work.

That seems like a lot, but the feedback is very quick.

If we don't prompt people like that, then the autoscaler will start up but will fail when it tries to install the wheel RAY_WHEEL_TO_TEST_HERE, which doesn't exist. The user will then need to parse the error message, figure out what to change, and look up how the wheel filenames are formatted in S3, which is not straightforward. That's a much longer turnaround time.

Can you say what versions are impossible to test? Is that due to the change in the S3 filenames that we're using (since we changed it to include the branch)?
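The staged prompting described above, where each invocation fails fast with a specific message until branch, version, and commit are all supplied, could be sketched like this (hypothetical code, not the actual start_workloads.sh logic):

```python
import sys

USAGE = "Usage: ./start_workloads.sh RAY_BRANCH RAY_VERSION RAY_COMMIT"
PROMPTS = [
    "Please specify the Ray branch to test (e.g. master).",
    "Please specify the Ray version (e.g. 0.8.0.dev1).",
    "Please specify the Ray commit SHA.",
]

def parse_args(argv):
    # Fail fast with a prompt for the first missing argument,
    # mirroring the branch -> version -> commit order described above.
    if len(argv) < 3:
        raise SystemExit(USAGE + "\n" + PROMPTS[len(argv)])
    return argv[0], argv[1], argv[2]

if __name__ == "__main__":
    branch, version, commit = parse_args(sys.argv[1:])
    print(branch, version, commit)
```

Each failure names exactly the next missing argument, which is what keeps the feedback loop quick despite the multiple invocations.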

Contributor (Author):

That's great in the ideal case, but in practice what will happen is that you'll have some wheel you want to test and will have to spend about 15 minutes reading the code to figure out what arguments to pass to start_workloads...

Contributor (Author):

My point here is that the existing solution is too complicated. By deleting the code, you end up with a simpler, easier-to-understand workflow.

Collaborator:

Would the wheel be fully specified by the branch/version/commit combination? Or are you referring to a situation where the wheels aren't in S3?

The script will definitely need to be customized in some cases. E.g., if I want to compile Ray from one of my branches, then I'd need to modify the script.

Regarding needing to read the code, I tried to make the script print informative prompts so that the user would know exactly what arguments to pass in. Do you think they might be unclear in some settings?

@ericl (Contributor, Author) commented Jul 4, 2019

I'm going to push some more changes that fix a couple of other issues:

  • upgrading TF
  • removing buggy tmux

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15087/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15089/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1465/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15118/

@ericl (Contributor, Author) commented Jul 5, 2019

This is ready to merge. The test passes, modulo #5125.

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15123/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1470/

@ericl ericl requested review from pcmoritz and removed request for robertnishihara July 6, 2019 23:14
@ericl ericl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jul 7, 2019
self.batch_buffer)
if len(self.batch_buffer) == 1:
# make a defensive copy to avoid sharing plasma memory
# across multiple threads
Collaborator:

Do we understand what the issue is? The plasma client should be thread-safe now, after apache/arrow#4503.

Contributor (Author):

Hmm, let me try it again. It's possible I wasn't using the latest wheels.

Contributor (Author):

Oh, I know why: it's because I wasn't able to test with versions of Ray after #5125.

That almost certainly excludes Arrow versions with that patch.
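For context, the defensive copy in the diff above is the standard pattern of materializing arrays out of shared (plasma) object-store memory into process-local memory before handing them to another thread. A generic sketch of the pattern, not RLlib's actual code:

```python
import copy
import numpy as np

def defensive_copy(batch):
    # Copy each array into fresh process-local memory so no thread
    # keeps a view into the shared object-store buffer; non-array
    # values are deep-copied for the same reason.
    return {
        k: np.copy(v) if isinstance(v, np.ndarray) else copy.deepcopy(v)
        for k, v in batch.items()
    }

shared = {"obs": np.zeros((2, 3)), "meta": {"t": 1}}
local = defensive_copy(shared)
local["obs"][0, 0] = 1.0  # mutating the copy...
print(shared["obs"][0, 0])  # ...leaves the original buffer untouched: 0.0
```

The copy costs memory bandwidth but removes any dependence on the object store's buffer lifetime or thread-safety guarantees.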

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15169/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1516/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1557/

@robertnishihara (Collaborator) left a comment:

Looks good to me. Can you also update the instructions in https://github.com/ray-project/ray/blob/master/ci/long_running_tests/README.rst?

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15248/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1561/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15252/

@ericl ericl merged commit 5ab5017 into ray-project:master Jul 10, 2019
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

4 participants