[WIP] Use tensor random generator #1094
Conversation
Thanks a lot! This is long-needed. I have a few comments though.
dynet/nodes-dropout.cc
Outdated
std::uniform_int_distribution<> seed_dist(1, 2147483647);
Eigen::internal::UniformRandomGenerator<float> uni_rg(seed_dist(*rndeng));
m.tvec().device(*dev.edevice) = m.tvec().random(uni_rg);
m.tvec().device(*dev.edevice) = (m.tvec() < m.tvec().constant((1.f-p))).cast<float>() * 1.f / (1.f-p);
Do we need the *1.f here?
This is fixed now.
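For reference, a sketch of what the corrected expression presumably looks like once the redundant factor is dropped (reconstructed from the diff above, not copied verbatim from the updated PR code):

m.tvec().device(*dev.edevice) = (m.tvec() < m.tvec().constant(1.f - p)).cast<float>() / (1.f - p);  // the cast already yields a float mask, so * 1.f added nothing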
dynet/nodes-dropout.cc
Outdated
@@ -30,7 +30,10 @@ size_t Dropout::aux_storage_size() const {
 template<class MyDevice>
 void Dropout::forward_dev_impl(const MyDevice & dev, const vector<const Tensor*>& xs, Tensor& fx) const {
   Tensor m(dim, (float*)aux_mem, fx.device, DeviceMempool::FXS);
-  TensorTools::randomize_bernoulli(m, (1.f-p), 1.f / (1.f-p));
+  std::uniform_int_distribution<> seed_dist(1, 2147483647);
I'm not a huge fan of this code being copied all the time. If this is something that we'll want to use frequently, perhaps it can be made a global variable in the DyNet namespace or something?
Also, tangentially, there are some random number generation functions in tensor.h/tensor.cc that it might be reasonable to move to their own rand.h/rand.cc files, along with this one. What do you think?
That sounds good, I moved the random helpers to their own files and added a draw_random_seed() helper to avoid code duplication.
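A rough sketch of what such a helper could look like, assuming DyNet's existing global std::mt19937* rndeng engine (the exact signature and placement in the PR may differ):

// rand.h (hypothetical sketch)
#pragma once
#include <random>

namespace dynet {
extern std::mt19937* rndeng;  // DyNet's global RNG engine

// Draw a 31-bit seed for device-side generators, replacing the
// uniform_int_distribution boilerplate copied at each call site.
inline int draw_random_seed() {
  std::uniform_int_distribution<> seed_dist(1, 2147483647);
  return seed_dist(*rndeng);
}
}  // namespace dynet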
dynet/nodes-dropout.cc
Outdated
@@ -30,7 +30,10 @@ size_t Dropout::aux_storage_size() const {
 template<class MyDevice>
 void Dropout::forward_dev_impl(const MyDevice & dev, const vector<const Tensor*>& xs, Tensor& fx) const {
   Tensor m(dim, (float*)aux_mem, fx.device, DeviceMempool::FXS);
-  TensorTools::randomize_bernoulli(m, (1.f-p), 1.f / (1.f-p));
+  std::uniform_int_distribution<> seed_dist(1, 2147483647);
Also, is there a reason why you didn't just implement all of this in TensorTools? It seems like that would work as well. If there's a reason that's no problem though.
I didn't do that partly because TensorTools is also used for parameter initialization, which I didn't want to touch, and partly because I didn't know how to properly move over the device-related code.
OK, I tested this on CPU and GPU with the current master branch and this PR using the attached script. It looks like master CPU/GPU and PR CPU are consistent, but the PR on GPU gives weird results. Could you try running the script and seeing what's up?
Thanks for taking a look, that is strange indeed. random_normal() and random_uniform() look fine to me but bernoulli and dropout do not, which would hint at this line being broken on GPU:
The result of operator< is a
OK, I'm getting very strange behavior even on RandomNormal. It looks like almost exactly half the time it's sampling the exact same value, while half the time it's sampling from the appropriate distribution. This looks like a relatively major bug in the upstream Eigen implementation. I'll try to reproduce it only in Eigen and see if they're interested in fixing it.
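For reference, a minimal Eigen-only skeleton along the lines of such a reproduction (hypothetical; shown CPU-side for brevity, whereas the misbehavior described here was observed when the same .random() expression is evaluated through Eigen's GPU device):

#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>

int main() {
  Eigen::Tensor<float, 1> t(16);
  // Seeded generator from Eigen's tensor module, analogous to what a RandomNormal node would use.
  Eigen::internal::NormalRandomGenerator<float> gen(42);
  t = t.random(gen);  // fill the tensor with (supposedly independent) normal samples
  for (int i = 0; i < t.size(); ++i)
    std::cout << t(i) << " ";  // the reported symptom: roughly half the samples come out identical on GPU
  std::cout << std::endl;
  return 0;
}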
FYI, we're having some discussions offline about how to implement this directly in cuRAND, which will presumably be more stable than relying on Eigen and potentially easier.
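A hypothetical host-side sketch of the cuRAND direction being discussed (assumed usage, not DyNet code; error checking omitted):

#include <curand.h>

// Fill n floats in device memory d_out with uniform samples generated directly on the GPU.
void fill_uniform_on_gpu(float* d_out, size_t n, unsigned long long seed) {
  curandGenerator_t gen;
  curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);  // host handle to a device-side generator
  curandSetPseudoRandomGeneratorSeed(gen, seed);           // seed once; in practice the generator would be reused
  curandGenerateUniform(gen, d_out, n);                    // samples land in GPU memory, no host round-trip
  curandDestroyGenerator(gen);
}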
Closed in favor of #1154.
Use the tensor random generator instead of the standard one. This should fix speed issues on GPU caused by generating random numbers on the CPU and then transferring them to the GPU, as was done previously.
I implemented this carefully, but it would be good if someone could review the code, as this involves core functionality and is hard to unit-test.
Some numbers on a CPU machine and a GPU machine:
I suspect that most of the 3.48s was DyNet overhead, because in my particular use case dropout went from consuming 90% of computation time to <1%.
fixes #542, #438
related to #1059