[rllib] Fix impala stress test #5101
Conversation
@@ -35,8 +35,10 @@ worker_nodes:
# List of shell commands to run to set up nodes.
setup_commands:
    # Install nightly Ray wheels.
    - source activate tensorflow_p36 && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/<<<RAY_BRANCH>>>/<<<RAY_COMMIT>>>/ray-<<<RAY_VERSION>>>-cp36-cp36m-manylinux1_x86_64.whl
will defer to @robertnishihara on this
The issue with pinning it to `ray==0.7.2` is that it can lead to people accidentally testing `ray==0.7.2` instead of the commit they are trying to test.
I've made this mistake a number of times, but I think the current script + config will naturally prompt the user to enter the relevant version and so on, preventing that error.
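One way to guard against the "wrong version" mistake described above is a quick sanity check after installation. This is only an illustrative sketch, not part of the PR: it assumes the installed wheel exposes `ray.__version__` and `ray.__commit__`, and the expected commit below is just the example SHA used later in this thread.

```python
# Illustrative sanity check (not part of this PR): verify that the wheel
# actually installed on the cluster matches the commit under test, so a
# pinned ray==0.7.2 doesn't get exercised by accident.
import ray

EXPECTED_COMMIT = "62e4b591e3d6443ce25b0f05cc32b43d5e2ebb3d"  # commit under test

def check_installed_ray(expected_commit):
    # ray.__commit__ may not exist on every build, so fall back gracefully.
    installed = getattr(ray, "__commit__", "unknown")
    if not installed.startswith(expected_commit[:8]):
        raise RuntimeError(
            "Installed ray {} ({}) does not match the commit under test ({}).".format(
                ray.__version__, installed, expected_commit))

check_installed_ray(EXPECTED_COMMIT)
```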
@ericl are you changing this in order to be able to run the config file on its own, without having to run `./start_workloads.sh`?
Good point, I changed it to a placeholder that will crash by default.
The reason I'm making this change is that it takes too long to figure out how to get it to load the version I want via command-line args (and it's flat-out not possible for some version specs).
So it makes sense to just require the user to edit the file manually rather than try to automate it.
Can you elaborate on this a bit? For example, I imagine most people would start by running `./start_workloads.sh`, which would fail and prompt them for the branch. Then they would run something like `./start_workloads.sh master` or `./start_workloads.sh releases/0.7.2`, which would fail and prompt them for the Ray version. Then they would run something like `./start_workloads.sh master 0.8.0.dev1`, which would fail and prompt them for the commit. Then they would run something like `./start_workloads.sh master 0.8.0.dev1 62e4b591e3d6443ce25b0f05cc32b43d5e2ebb3d`, and that would work.
That seems like a lot of steps, but the feedback is very quick.
If we don't prompt people like that, then the autoscaler will start up but fail when it tries to install the wheel `RAY_WHEEL_TO_TEST_HERE`, which doesn't exist. The user will need to parse the error message, figure out what to change, and look up how the wheel filenames are formatted in S3, which is not straightforward. That's a longer turnaround time.
Can you say which versions are impossible to test? Is that due to the change in the S3 filenames we're using (since we changed them to include the branch)?
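For reference, here is a rough Python sketch of the prompt-for-missing-arguments flow and the S3 wheel URL pattern being discussed. The argument handling and script name are illustrative (this is not the actual `start_workloads.sh`); only the URL template is taken from the config diff above.

```python
# Sketch only: prompt for each missing argument, then build the wheel URL
# following the template from the config diff above.
import sys

USAGE = "usage: start_workloads.py <ray-branch> <ray-version> <ray-commit>"

def wheel_url(branch, version, commit):
    # Mirrors: ray-wheels/<branch>/<commit>/ray-<version>-cp36-cp36m-manylinux1_x86_64.whl
    return ("https://s3-us-west-2.amazonaws.com/ray-wheels/"
            "{}/{}/ray-{}-cp36-cp36m-manylinux1_x86_64.whl".format(branch, commit, version))

if __name__ == "__main__":
    prompts = ["Ray branch (e.g. master or releases/0.7.2)",
               "Ray version (e.g. 0.8.0.dev1)",
               "Ray commit (full SHA)"]
    args = sys.argv[1:4]
    if len(args) < 3:
        # Fail fast with a hint about the next missing argument.
        sys.exit("{}\nMissing argument: {}".format(USAGE, prompts[len(args)]))
    print("Will install:", wheel_url(*args))
```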
That's great in the ideal case, but in practice what will happen is that you'll have some wheel you want to test and will have to spend about 15 minutes reading the code to figure out which arguments to pass to `start_workloads`...
My point here is that the existing solution is too complicated. By deleting the code, you end up with a simpler, easier-to-understand workflow.
Would the wheel be fully specified by the branch/version/commit combination? Or are you referring to a situation where the wheels aren't in S3?
The script will definitely need to be customized in some cases. E.g., if I want to compile Ray from one of my branches, then I'd need to modify the script.
Regarding needing to read the code, I tried to make the script print informative prompts so that the user would know exactly what arguments to pass in. Do you think they might be unclear in some settings?
I'm going to push some more changes that fix a couple of other issues.
This is ready to merge. The test passes modulo #5125.
self.batch_buffer)
if len(self.batch_buffer) == 1:
    # make a defensive copy to avoid sharing plasma memory
    # across multiple threads
Do we understand what the issue is? The plasma client should be thread safe now after apache/arrow#4503
Hmm, let me try it again. It's possible I wasn't using the latest wheels.
Oh, I know why: it's because I was not able to test with versions of Ray after #5125. That almost certainly excludes Arrow versions with that patch.
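To make the fix concrete, here is a minimal sketch of what the defensive copy amounts to. The buffer and batch names are illustrative, not the exact RLlib code; the point is simply that arrays fetched from the object store can be backed by shared plasma memory, so they are copied before being handed to the learner thread.

```python
# Minimal illustration (not the exact RLlib code) of the defensive copy.
import numpy as np

def defensive_copy(batch):
    """Copy every array in a dict-like batch so that no plasma-backed
    buffer is shared across threads."""
    return {k: np.copy(v) if isinstance(v, np.ndarray) else v
            for k, v in batch.items()}

# e.g. in the loading code, before handing batches to the learner thread:
# self.batch_buffer = [defensive_copy(b) for b in self.batch_buffer]
```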
Looks good to me. Can you also update the instructions in https://github.com/ray-project/ray/blob/master/ci/long_running_tests/README.rst
What do these changes do?
This test was failing because it hit an edge case in the configuration that caused plasma memory to be shared with the learner thread without a copy. This is a known source of instability with plasma.
Also, fix up the config files to make it easier to specify custom wheels. Now you just need to edit the config.yaml file, and the configuration is straightforward.
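As a complement, here is a sketch of the "crash by default" behavior mentioned earlier in the thread: before launching, check that the `<<<RAY_BRANCH>>>`, `<<<RAY_COMMIT>>>`, and `<<<RAY_VERSION>>>` placeholders in config.yaml have been replaced. The placeholder syntax and file name come from the diff in this PR; the check itself is illustrative, not the actual implementation.

```python
# Sketch: fail loudly if config.yaml still contains unfilled placeholders.
import re
import sys

def check_placeholders(path="config.yaml"):
    with open(path) as f:
        contents = f.read()
    leftover = sorted(set(re.findall(r"<<<[A-Z_]+>>>", contents)))
    if leftover:
        sys.exit("{} still contains unfilled placeholders: {}. "
                 "Edit the file to point at the wheel you want to test."
                 .format(path, ", ".join(leftover)))

if __name__ == "__main__":
    check_placeholders()
```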