Perform LMUFFT with raw convolution #42
Conversation
When testing this on my RTX 3060 for my specific model, I'm finding that the FFT implementation is faster than the raw conv implementation. So which one is best does seem to depend on the specific hardware/CUDA/TensorFlow. I'm hoping to test across more hardware soon, but I think for the foreseeable future, we're looking at keeping both implementations around. The best would be to autotune it, but that's probably a good chunk more work.
Force-pushed from 24be26c to 723ae6b
I think this is ready to go. In the end, I had to add two ways of doing the raw convolution, since the one that's faster on GPUs (using NCHW format) doesn't work on CPU.
Should add a changelog entry too.
Force-pushed from 0a9656a to 266677d
LGTM. I left the one comment about the benchmarks still open for when the next reviewer takes a look. It'll also need a changelog entry, and the codecov failure resolved, but it looks good to me.
Force-pushed from c778edd to e75363a
keras_lmu/layers.py
Outdated
@@ -523,6 +523,16 @@ class LMUFFT(tf.keras.layers.Layer):
    return_sequences : bool, optional
        If True, return the full output sequence. Otherwise, return just the last
        output in the output sequence.
    conv_mode : "fft" or "raw" or "raw_nchw"
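For reference, a hedged usage sketch of the new option. The hyperparameter values here are invented, and the constructor signature is assumed to match the released `keras_lmu` API; only `conv_mode` and its values come from the diff above.

```python
import tensorflow as tf
import keras_lmu

# Sketch only: hyperparameter values are hypothetical. conv_mode is the new
# option documented in the diff above ("raw_nchw" ends up being removed
# later in this discussion).
lmu_layer = keras_lmu.LMUFFT(
    memory_d=1,
    order=256,
    theta=784,
    hidden_cell=tf.keras.layers.SimpleRNNCell(212),
    conv_mode="raw",
    return_sequences=False,
)
```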
What kind of performance difference were you seeing between `raw` and `raw_nchw`? On my machine I don't see any. TensorFlow should automatically be swapping things around to use NCHW on the GPU (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/rewriter_config.proto#L60).
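As an aside, one way to check whether Grappler's automatic layout rewrite (the rewriter linked above) explains the difference is to toggle the layout optimizer. A minimal sketch of that check, not something done in this PR:

```python
import tensorflow as tf

# Grappler's layout optimizer converts NHWC ops to NCHW on GPU by default;
# disabling it isolates the effect of the explicit raw_nchw path.
tf.config.optimizer.set_experimental_options({"layout_optimizer": False})
print(tf.config.optimizer.get_experimental_options())
```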
I was able to see a performance difference by making the benchmark a bit more compute intensive. But I think the difference was more to do with the other reshapes/transposes going on (which differ between the two conditions), rather than the actual convolution. I pushed a fixup that simplifies the `raw` implementation a bit, and on my machine `raw` is now faster than `raw_nchw`. It'd be good to check what results you see on your machine though.
The changes definitely make "raw" faster for me. I still see "raw_nchw" as slightly faster, but it's pretty minor. Interestingly, I'm now seeing the base "rnn" implementation as faster than both the "raw" implementations (and "fft" is even faster). Note: I had to drop the batch size to 16 since my GPU didn't have enough memory to run with 32.
In light of these changes, I decided to re-test having the signal along the "H" dimension, and it's way faster for me. Double-check that it's the same for you, but since it removes the transposes, it makes sense that it's faster.
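To illustrate what "signal along the H dimension" means, here is a standalone sketch (not the layer's actual code): a 1-D convolution of a sequence can be phrased as a `conv2d` with the time axis on either spatial dimension. Shapes are made up for illustration.

```python
import tensorflow as tf

batch, n_steps, channels = 4, 128, 8
u = tf.random.normal((batch, n_steps, channels))
kernel = tf.random.normal((n_steps, channels, channels))  # (kernel_len, in, out)

# Time axis on "W": height-1 images, kernel shaped (1, kernel_len, in, out).
out_w = tf.nn.conv2d(
    u[:, None, :, :], kernel[None, :, :, :], strides=1, padding="VALID"
)

# Time axis on "H": width-1 images, kernel shaped (kernel_len, 1, in, out).
out_h = tf.nn.conv2d(
    u[:, :, None, :], kernel[:, None, :, :], strides=1, padding="VALID"
)

# The two layouts are mathematically identical; only performance differs.
tf.debugging.assert_near(out_w[:, 0, 0], out_h[:, 0, 0], rtol=1e-4, atol=1e-4)
```

Which layout runs faster evidently depends on the GPU and the CUDA/cuDNN versions, which is what the thread below is wrestling with.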
The H convolution is still dramatically slower for me (>10x).
Wow, that's super weird that it's such a big difference. I guess it's either to do with the GPU, or with CUDA/cuDNN/TensorFlow versions. That's annoying...
Fixed the slowdown I was seeing by upgrading CUDA/cuDNN; now I see `H` as faster, same as you, so that's good news.
I'm kind of inclined to just get rid of the `raw_nchw` option. Dropping it would simplify the code and the user interface (and remove any potential GPU/CPU compatibility issues). And I increased the number of steps in the benchmark locally to reduce any noise, and I very consistently see `raw` being as fast or faster than `raw_nchw` now, so I don't think there's a lot of value added in supporting that extra option.
Yeah, I've maybe seen one or two cases where `nchw` is marginally faster, but the difference is pretty small, and plain `raw` is often as fast or faster for me as well. I'd be fine to remove it.
Out of curiosity, about what percentage speedup did you see between W and H? One of my GPUs showed about a 1.5x speedup (old time / new time), but the other showed no speedup.
It was relatively minor but consistent, on the order of a 10% speedup.
keras_lmu/tests/test_benchmarks.py
Outdated
steptimes = SteptimeLogger()
model.fit(x_train, y_train, epochs=2, batch_size=batch_size, callbacks=[steptimes])

step_time = np.mean(steptimes.batch_times[1:])
Typically you use `min` for timing benchmarks (since all the error is positive).
Not all the batches are doing exactly the same thing; for example, the first batch takes much longer because there's a lot of one-time overhead, which is why I drop it. I'm not sure whether the other batches have much overhead, but we'd lose that information by taking the `min`. Maybe that's not a problem, or maybe it's even a good thing, since it would better capture the time of the underlying computation, which is what we really care about (assuming there are no differences in overhead between the implementations).
My only other question is whether each batch is fully synchronous (i.e. whether the whole computation has to be finished before `on_batch_end` is called). I think this should be the case, but it's the only other way I can think of that `min` could be a problem.
Anyway, I'm fine to make the change, and it should become apparent pretty quickly if there are any problems (i.e. if the qualitative results from `min` don't agree with `mean`).
Force-pushed from a796b88 to 424f197
Force-pushed from ee58c88 to ff887f3
keras_lmu/tests/test_benchmarks.py
Outdated
def test_performance(mode, capsys):
@pytest.mark.parametrize(
    "mode, min_time, max_time",
    [("rnn", 0.3, 0.4), ("fft", 0.02, 0.03), ("raw", 0.2, 0.3)],
Maybe just note somewhere what kind of GPU these come from (is it a K80 that we use on Azure?)
Fixups look good to me. When you've got all the tests passing, feel free to merge.
For some reason the latest release of matplotlib causes a version of numpy to be installed during the build process that conflicts with the version installed at run time.
Force-pushed from 4d0de58 to ac00184
New commits lgtm 👍
Add the ability to run the impulse response convolution as a raw convolution, rather than using the FFT. In practice, I've found that this can speed things up, though it also appears to require more CPU memory (which is surprising).
I also added a profiling test.
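To make the idea concrete, here is a minimal standalone sketch of the two ways of computing the same causal impulse-response convolution (simplified to a single memory dimension; shapes and padding are illustrative, not the layer's actual implementation):

```python
import numpy as np
import tensorflow as tf

batch, n_steps = 2, 64
u = tf.random.normal((batch, n_steps))  # input signal
h = tf.random.normal((n_steps,))        # impulse response

# FFT path: zero-pad to 2*n_steps so the circular convolution is linear.
fft_len = 2 * n_steps
out_fft = tf.signal.irfft(
    tf.signal.rfft(u, fft_length=[fft_len])
    * tf.signal.rfft(h, fft_length=[fft_len]),
    fft_length=[fft_len],
)[:, :n_steps]

# Raw path: causal conv1d. Pad the input on the left by kernel_len - 1 and
# flip the kernel, since conv1d actually computes cross-correlation.
u_padded = tf.pad(u, [[0, 0], [n_steps - 1, 0]])[:, :, None]  # (batch, 2T-1, 1)
kernel = h[::-1, None, None]                                  # (T, 1, 1)
out_raw = tf.nn.conv1d(u_padded, kernel, stride=1, padding="VALID")[:, :, 0]

# The two paths compute the same thing, up to float32 accumulation error.
np.testing.assert_allclose(out_fft.numpy(), out_raw.numpy(), rtol=1e-3, atol=1e-3)
```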
Based on #40.
TODO: