Perform LMUFFT with raw convolution #42
Conversation
When testing this on my RTX 3060 for my specific model, I'm finding that the FFT implementation is faster than the raw conv implementation. So which one is best does seem to depend on the specific hardware/CUDA/TensorFlow. I'm hoping to test across more hardware soon, but I think for the foreseeable future, we're looking at keeping both implementations around. The best would be to autotune it, but that's probably a good chunk more work.
Force-pushed from 24be26c to 723ae6b
I think this is ready to go. In the end, I had to add two ways of doing the raw convolution, since the one that's faster on GPUs (using NCHW format) doesn't work on CPU.
Should add a changelog entry too.
Force-pushed from 0a9656a to 266677d
LGTM. I left the one comment about the benchmarks still open for when the next reviewer takes a look. It'll also need a changelog entry, and the codecov failure resolved, but it looks good to me.
Force-pushed from c778edd to e75363a
keras_lmu/layers.py
Outdated
@@ -523,6 +523,16 @@ class LMUFFT(tf.keras.layers.Layer):
    return_sequences : bool, optional
        If True, return the full output sequence. Otherwise, return just the last
        output in the output sequence.
    conv_mode : "fft" or "raw" or "raw_nchw"
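For reference, a hedged usage sketch of the new option. The hyperparameter values here are invented, and the constructor signature is assumed to match the released `keras_lmu` API; only `conv_mode` and its values come from the diff above.

```python
import tensorflow as tf
import keras_lmu

# Sketch only: hyperparameter values are hypothetical. conv_mode is the new
# option documented in the diff above ("raw_nchw" ends up being removed
# later in this discussion).
lmu_layer = keras_lmu.LMUFFT(
    memory_d=1,
    order=256,
    theta=784,
    hidden_cell=tf.keras.layers.SimpleRNNCell(212),
    conv_mode="raw",
    return_sequences=False,
)
```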
What kind of performance difference were you seeing between `raw` and `raw_nchw`? On my machine I don't see any. TensorFlow should automatically be swapping things around to use NCHW on the GPU (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/rewriter_config.proto#L60).
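As an aside, one way to check whether Grappler's automatic layout rewrite (the rewriter linked above) explains the difference is to toggle the layout optimizer. A minimal sketch of that check, not something done in this PR:

```python
import tensorflow as tf

# Grappler's layout optimizer converts NHWC ops to NCHW on GPU by default;
# disabling it isolates the effect of the explicit raw_nchw path.
tf.config.optimizer.set_experimental_options({"layout_optimizer": False})
print(tf.config.optimizer.get_experimental_options())
```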
I was able to see a performance difference by making the benchmark a bit more compute intensive. But I think the difference was more to do with the other reshapes/transposes going on (which differ between the two conditions), rather than the actual convolution. I pushed a fixup that simplifies the `raw` implementation a bit, and on my machine `raw` is now faster than `raw_nchw`. It'd be good to check what results you see on your machine though.
The changes definitely make "raw" faster for me. I still see "raw_nchw" as slightly faster, but it's pretty minor. Interestingly, I'm now seeing the base "rnn" implementation as faster than both the "raw" implementations (and "fft" is even faster). Note: I had to drop the batch size to 16 since my GPU didn't have enough memory to run with 32.
In light of these changes, I decided to re-test having the signal along the "H" dimension, and it's way faster for me. Double-check that it's the same for you, but since it removes the transposes, it makes sense that it's faster.
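To illustrate what "signal along the H dimension" means, here is a standalone sketch (not the layer's actual code): a 1-D convolution of a sequence can be phrased as a `conv2d` with the time axis on either spatial dimension. Shapes are made up for illustration.

```python
import tensorflow as tf

batch, n_steps, channels = 4, 128, 8
u = tf.random.normal((batch, n_steps, channels))
kernel = tf.random.normal((n_steps, channels, channels))  # (kernel_len, in, out)

# Time axis on "W": height-1 images, kernel shaped (1, kernel_len, in, out).
out_w = tf.nn.conv2d(
    u[:, None, :, :], kernel[None, :, :, :], strides=1, padding="VALID"
)

# Time axis on "H": width-1 images, kernel shaped (kernel_len, 1, in, out).
out_h = tf.nn.conv2d(
    u[:, :, None, :], kernel[:, None, :, :], strides=1, padding="VALID"
)

# The two layouts are mathematically identical; only performance differs.
tf.debugging.assert_near(out_w[:, 0, 0], out_h[:, 0, 0], rtol=1e-4, atol=1e-4)
```

Which layout runs faster evidently depends on the GPU and the CUDA/cuDNN versions, which is what the thread below is wrestling with.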
The H convolution is still dramatically slower for me (>10x).
Wow, that's super weird that it's such a big difference. I guess it's either to do with the GPU, or with CUDA/cuDNN/TensorFlow versions. That's annoying...
Fixed the slowdown I was seeing by upgrading CUDA/cuDNN; now I see `H` as faster, same as you, so that's good news.
I'm kind of inclined to just get rid of the `raw_nchw` option. Dropping it would simplify the code and the user interface (and remove any potential GPU/CPU compatibility issues). And I increased the number of steps in the benchmark locally to reduce any noise, and I very consistently see `raw` being as fast or faster than `raw_nchw` now, so I don't think there's a lot of value added in supporting that extra option.
Yeah, I've maybe seen one or two cases where `nchw` is marginally faster, but the difference is pretty small, and plain `raw` is often as fast or faster for me as well. I'd be fine to remove it.
Out of curiosity, about what percentage speedup did you see between W and H? One of my GPUs showed about a 1.5x speedup (old time / new time), but the other showed no speedup.
It was relatively minor but consistent, on the order of a 10% speedup.
keras_lmu/tests/test_benchmarks.py
Outdated
steptimes = SteptimeLogger()
model.fit(x_train, y_train, epochs=2, batch_size=batch_size, callbacks=[steptimes])

step_time = np.mean(steptimes.batch_times[1:])
Typically you use `min` for timing benchmarks (since all the error is positive).
Not all the batches are doing exactly the same thing; for example, the first batch takes much longer because there's a lot of one-time overhead, which is why I drop it. I'm not sure whether the other batches have much overhead, but we'd lose that information by taking the `min`. Maybe that's not a problem, or maybe it's even a good thing, since it would better capture the time of the underlying computation, which is what we really care about (assuming there are no differences in overhead between the implementations).
My only other question is whether each batch is fully synchronous (i.e. whether the whole computation has to be finished before `on_batch_end` is called). I think this should be the case, but it's the only other way I can think of that `min` could be a problem.
Anyway, I'm fine to make the change, and it should become apparent pretty quickly if there are any problems (i.e. if the qualitative results from `min` don't agree with `mean`).
Force-pushed from a796b88 to 424f197
Force-pushed from ee58c88 to ff887f3
keras_lmu/tests/test_benchmarks.py
Outdated
def test_performance(mode, capsys):
@pytest.mark.parametrize(
    "mode, min_time, max_time",
    [("rnn", 0.3, 0.4), ("fft", 0.02, 0.03), ("raw", 0.2, 0.3)],
Maybe just note somewhere what kind of GPU these come from (is it a K80 that we use on Azure?)
Fixups look good to me. When you've got all the tests passing, feel free to merge.
For some reason the latest release of matplotlib causes a version of numpy to be installed during the build process that conflicts with the version installed at run time.
Force-pushed from 4d0de58 to ac00184
New commits lgtm 👍
Add the ability to run the impulse response convolution as a raw convolution, rather than using the FFT. In practice, I've found that this can speed things up, though it also appears to require more CPU memory (which is surprising).
I also added a profiling test.
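To make the idea concrete, here is a minimal standalone sketch of the two ways of computing the same causal impulse-response convolution (simplified to a single memory dimension; shapes and padding are illustrative, not the layer's actual implementation):

```python
import numpy as np
import tensorflow as tf

batch, n_steps = 2, 64
u = tf.random.normal((batch, n_steps))  # input signal
h = tf.random.normal((n_steps,))        # impulse response

# FFT path: zero-pad to 2*n_steps so the circular convolution is linear.
fft_len = 2 * n_steps
out_fft = tf.signal.irfft(
    tf.signal.rfft(u, fft_length=[fft_len])
    * tf.signal.rfft(h, fft_length=[fft_len]),
    fft_length=[fft_len],
)[:, :n_steps]

# Raw path: causal conv1d. Pad the input on the left by kernel_len - 1 and
# flip the kernel, since conv1d actually computes cross-correlation.
u_padded = tf.pad(u, [[0, 0], [n_steps - 1, 0]])[:, :, None]  # (batch, 2T-1, 1)
kernel = h[::-1, None, None]                                  # (T, 1, 1)
out_raw = tf.nn.conv1d(u_padded, kernel, stride=1, padding="VALID")[:, :, 0]

# The two paths compute the same thing, up to float32 accumulation error.
np.testing.assert_allclose(out_fft.numpy(), out_raw.numpy(), rtol=1e-3, atol=1e-3)
```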
Based on #40.
TODO: