Alternate tp.c and working benchmarking #10

Open
wants to merge 2 commits into main
Conversation

@asglover commented Jul 19, 2024

I don't think you should actually merge this branch, because I had to add a bunch of things to the repo to make the build process work. It's probably best just to use this as a tool to study performance, work the changes into master, and then reject this PR.

@asglover changed the title from "Alternate tp.c and working enchmarking" to "Alternate tp.c and working benchmarking" on Jul 19, 2024
@teddykoker (Owner)

[benchmark plots: benchmark_versions, benchmark, benchmark_versions-1]

Thanks for sharing! Can confirm that I am getting a similar, if not slightly more significant, performance improvement on my i5 desktop.

Do you have any idea which modifications are leading to the most performance gains? Definitely would like to incorporate these improvements, but it would be good to have an understanding of why they are faster. It seems to me like the main improvements would be due to:

  1. caching in_* values to prevent repeat memory accesses. Maybe adding restrict would make this change redundant?
  2. caching same +/- cg coefficients into one variable.
  3. caching outer products to avoid repetition.

But it's not clear to me which one.

Additionally, would it be faster to perform the contractions and output writing in one step? E.g. for tp_v2_1_1_1 doing:

// This is an automatically generated kernel for performing a tensor product and contraction
void tp_v2_1_1_1 (
    const float* in_1, 
    const float* in_2, 
    float* out)
{   
    // Define CG Coefficients 
    const float C_0 = 0.70710677;

    // Define Input Values
    float in_1_0 = in_1[0];
    float in_1_1 = in_1[1];
    float in_1_2 = in_1[2];
    float in_2_0 = in_2[0];
    float in_2_1 = in_2[1];
    float in_2_2 = in_2[2];

-    // Define Output Accumulators 
-    float out_0 = 0;
-    float out_1 = 0;
-    float out_2 = 0;

    // Perform Outer Products  
    const float op_0_1 = in_1_0 * in_2_1;
    const float op_0_2 = in_1_0 * in_2_2;
    const float op_1_0 = in_1_1 * in_2_0;
    const float op_1_2 = in_1_1 * in_2_2;
    const float op_2_0 = in_1_2 * in_2_0;
    const float op_2_1 = in_1_2 * in_2_1;

-    // Perform Contractions
-    out_2 += op_0_1 * C_0;
-    out_1 -= op_0_2 * C_0;
-    out_2 -= op_1_0 * C_0;
-    out_0 += op_1_2 * C_0;
-    out_1 += op_2_0 * C_0;
-    out_0 -= op_2_1 * C_0;

-     // Write Output
-    out[0] = out_0; 
-    out[1] = out_1; 
-    out[2] = out_2; 

+    out[0] = op_1_2 * C_0 - op_2_1 * C_0;
+    out[1] = -op_0_2 * C_0 + op_2_0 * C_0;
+    out[2] = op_0_1 * C_0 - op_1_0 * C_0;

}

This would eliminate the need to use the accumulators, but maybe the compiler is already essentially doing this...
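One way to check (just a suggestion, not something done in this PR) is to compare the generated assembly for the two variants, e.g. with gcc -O2 -S or on Compiler Explorer: if the accumulator version already keeps out_0..out_2 in registers and only stores them at the end, the fused form should compile to essentially the same code.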

@asglover (Author)

Ah, I'm really glad that the improvement replicates. I ran an experiment on my Mac as well. Do you want to discuss on this PR or on the discussion thread?

@teddykoker (Owner)

Let's discuss here for now; that way we can more easily reference changes to the code.

@asglover (Author) commented Jul 19, 2024

Okay, got it. The Mac experiment is here for anyone's reference:
#9 (comment)

Do you have any idea which modifications are leading to the most performance gains? Definitely would like to incorporate these improvements, but it would be good to have an understanding of why they are faster. It seems to me like the main improvements would be due to:

I can't answer for sure, because I implemented them all in one pass. They all operate on the same principle: fully describe the computation graph so the compiler and runtime system can do what they want with it.

caching in_* values to prevent repeat memory accesses. Maybe adding restrict would make this change redundant?

I can't quantify the size of the performance effect, but I think using restrict would make this redundant / unnecessary. They aim to accomplish the same thing, and I think the definition of restrict is even more powerful in terms of the optimizations it should allow.
I think with this method it still needs to read in all the data before it starts writing anything back (or at least it should), but with restrict you should be able to fully interleave reads and writes. You made another comment about restrict earlier.

If I'm understanding correctly, this would only be an issue if you try to take a tensor product of a tensor with itself (i.e., a tensor square).

Technically you would be "violating the contract you make with the compiler", but because you're not aliasing the input and output it can't cause issues for you. Also, in this codegen world it probably makes sense to provide higher-performance self-interacting functions for the self-interacting case.
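For reference, here is a minimal sketch of what a restrict-qualified version of the generated 1x1->1 kernel could look like (just an illustration of the signature change, not code from this PR; the function name is arbitrary):

// restrict promises the compiler that in_1, in_2, and out never alias,
// so it is free to keep loads in registers and interleave reads and writes.
void tp_restrict_1_1_1(
    const float* restrict in_1,
    const float* restrict in_2,
    float* restrict out)
{
    const float C_0 = 0.70710677f;
    out[0] = (in_1[1] * in_2[2] - in_1[2] * in_2[1]) * C_0;
    out[1] = (in_1[2] * in_2[0] - in_1[0] * in_2[2]) * C_0;
    out[2] = (in_1[0] * in_2[1] - in_1[1] * in_2[0]) * C_0;
}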

caching same +/- cg coefficients into one variable.

And taking coefficients that are very close to each other (different only because of numerical instability) and smushing them together:

import numpy as np

def remove_similar_values(arr: np.ndarray, tolerance=1e-6):
    """Collapse values that differ only by numerical noise into a single representative."""
    unique_values = np.array([], dtype=arr.dtype)
    for value in arr:
        if np.all(np.abs(value - unique_values) > tolerance):
            unique_values = np.append(unique_values, value)
    return unique_values

def get_coefficients(t: np.ndarray):
    """Returns a list of {index, value} dicts for the distinct coefficient magnitudes."""
    distinct = remove_similar_values(np.unique(np.abs(t[t != 0])))
    return [{"index": index, "value": repr(value)} for index, value in enumerate(distinct)]

Then when I enumerate the contractions, I look for the closest coefficient to use (because I no longer have exact matches)

I make comments here about how that numerical instability could add "unnecessary FMAs" to the function
#9 (reply in thread)

I don't know how to quantify this effect, but I wanted to do this explicitly so that the compiler can handle the common case, like in the cross product, of factoring (a*b*c - d*e*c) into (a*b - d*e)*c.

One thing I should also change in my codegen is to force numbers close to 1 to be exactly 1. That should remove more unnecessary FMAs.
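As a small illustration of what that buys (hypothetical names, not actual generated output):

// Hypothetical generated contraction where the coefficient is 1 up to rounding noise.
static float contract_before(float op_3_4, float out_0) {
    const float C_2 = 0.99999994f;   // effectively 1, but still costs a multiply/FMA
    return out_0 + op_3_4 * C_2;
}

// After forcing the coefficient to exactly 1, the codegen can emit a plain add.
static float contract_after(float op_3_4, float out_0) {
    return out_0 + op_3_4;
}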

caching outer products to avoid repetition.

If you are using restrict, a compiler should be able to identify this opportunity even if you don't make it explicitly visible, but I wanted to make it super explicit, so that for larger auto-generated functions, or if the compiler "gets lazy" when the computation graph starts getting big, it would still be a very visible opportunity.

would it be faster to perform the contractions and output writing in one step?

I think the compiler is probably already doing this. You could write it in a different format, but I don't think it would make a huge difference.

@mitkotak (Contributor)

Any way to separate out these optimizations and narrow it down more?

Also let's try playing with 32,64,256 channels.

And for the outputs let's try chopping off the outputs at lmax instead of letting it grow to 2*lmax.

A big table with everything would be useful. Wanna get some more intuition.

@asglover (Author) commented Jul 19, 2024

Any way to separate out these optimizations and narrow it down more?

I think restrict is the lowest lift, and I believe it should unlock some of the performance.

I think caching the products might not be necessary if the compiler recognizes the opportunity, which I think it should. Same with explicitly separating accumulation and writing the output.

The coefficient optimizations are a heavier lift.

I would suggest getting restrict to work first, and if that doesn't do what you want, looking at the coefficient smushing and forcing-to-1.

Also let's try playing with 32,64,256 channels.

And for the outputs let's try chopping off the outputs at lmax instead of letting it grow to 2*lmax.

A big table with everything would be useful. Wanna get some more intuition.

Feel free, though there is a bit of noise, so if you want to really benchmark this, you probably need to do some kind of averaging over multiple runs. I am not going to take the lead on this benchmarking project; I need to focus on the fused output benchmarking we discussed earlier today.
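For what it's worth, here is a generic sketch of the kind of averaging I mean (not code from this repo; time_once, average_runtime, and N_REPS are placeholder names):

#include <time.h>

#define N_REPS 100

// Time one invocation of a benchmark case with a monotonic clock.
static double time_once(void (*bench_case)(void)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    bench_case();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

// Average over many repetitions to smooth out run-to-run noise.
static double average_runtime(void (*bench_case)(void)) {
    double total = 0.0;
    for (int i = 0; i < N_REPS; i++)
        total += time_once(bench_case);
    return total / N_REPS;
}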

Also, sorry for closing the PR temporarily, I slipped on my keyboard.

@asglover asglover closed this Jul 19, 2024
@asglover asglover reopened this Jul 19, 2024
@asglover (Author)

Also let's try playing with 32,64,256 channels.

I will say, in this single-threaded approach there should be no difference between batch and channels, other than maybe with really low channels, like 1, you would stress the memory more by never getting to reuse data.

I think putting this in a profiler would be insightful, and I'll give that a shot to satisfy my own curiosity. I wonder how much overhead exists outside of the tensor product functions themselves

@asglover (Author) commented Jul 19, 2024

Also let's try playing with 32,64,256 channels.

I will say, this is single-threaded code, so the only interaction I would expect is that if you turn channels way down, to say 1, you would stress the memory system more and get to reuse your cache less.

@asglover (Author)

There definitely is noise in this benchmark. I updated my codegen to force constants very close to 1 to exactly 1, but because of the run-to-run variation I can't really assess its effect, which sort of suggests it is in the noise.

If you really want to be able to determine the effect of these function-level changes, you probably need to create even more synthetic benchmarks where you just run l x l -> l to see where, in the sea of different kernels, the speedup is actually coming from. The fact that it's an ~80% speedup is a little strange to me, and I don't understand where that comes from.

@asglover (Author) commented Jul 19, 2024

Also, I can update this PR to include the actual codegen machinery I use; I'm not hiding it, I just assumed you would rather tweak your setup to produce different code. The coefficients stuff is a little involved because you can't just step through the tensor and print something for every non-zero value in one shot.

@teddykoker (Owner)

Also, I can update this PR to include the actual codegen machinery I use; I'm not hiding it, I just assumed you would rather tweak your setup to produce different code. The coefficients stuff is a little involved because you can't just step through the tensor and print something for every non-zero value in one shot.

That would be helpful! We can always worry about merging the setups later.

@teddykoker (Owner)

One more thing to consider with respect to benchmarks: all of the current benchmarking code uses 0s as input irreps. I added a simple randn() function and updated all the benchmarks to use random normal inputs in #11. This seems to affect some implementations more than others, but I figure it's a better benchmark than 0s only.
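(For anyone reproducing this: a standard-normal generator can be as simple as a Box-Muller transform, e.g. the sketch below; the actual randn() added in #11 may well be implemented differently.)

#include <math.h>
#include <stdlib.h>

// Box-Muller: turn two uniform samples into one standard normal sample.
static float randn_sketch(void) {
    float u1 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f); // in (0, 1), avoids log(0)
    float u2 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
    return sqrtf(-2.0f * logf(u1)) * cosf(2.0f * 3.14159265f * u2);
}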

@asglover (Author)

Here are the profile results:

[profiler screenshot]

I bumped up the batch to get a longer profile. I'll have to look up / test the memory bandwidth, but it looks like we are memory-limited on these operations. I'm not surprised by this, because they don't leverage channel, and these tensor product kernels (especially ones that don't explicitly do flops for the zeros in the CG coefficients, aka sparse kernels) are not high arithmetic intensity.
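As a rough back-of-the-envelope check (my own estimate, not a measurement): the 1x1->1 kernel above does roughly 6 outer-product multiplies plus 6 coefficient multiply-accumulates, call it ~15-18 flops, while touching 9 floats (6 read, 3 written), i.e. 36 bytes, so its arithmetic intensity is on the order of 0.5 flops/byte. A desktop CPU typically needs on the order of 10 flops per byte of DRAM traffic to be compute-bound, so without channel reuse these kernels should sit well on the memory side of the roofline.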

This suggests that there are larger speedups on the table if you were to write little micro kernels that do "n1_channel1 x n2_channel2" and then update the code which calls the tensor_product kernels to reflect this blocked approach...

But honestly, I don't think you have to chase this down unless you really want to. Single-threaded CPU is not the target architecture for this application, and there are probably more interesting things to do with implementing the other parts of the system.

Also, and not unrelatedly, my sole focus is doing performance work on the tensor products and I would be happy to take a look at that and share any improvements I can generate.

My plan was to go:

  1. single threaded CPU code
  2. single operation per thread CUDA code
  3. single operation per thread CUDA code considering channel
  4. single operation per thread CUDA code considering channel and linear combination

But maybe it would be worth stopping and implementing a channel-aware single-threaded CPU kernel to see if it's possible to make these calculations flops-limited when channel is large?
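To be concrete about what I mean by channel-aware, here is a rough sketch (not code from this PR; CHANNEL_BLOCK and the [channel][irrep_dim] layout are assumptions): each load of the shared l=1 vector is reused across a whole block of channels, so arithmetic intensity grows with the block size.

#define CHANNEL_BLOCK 32

void tp_1_1_1_channel_block(
    const float* restrict in_1,   // [CHANNEL_BLOCK][3], per-channel l=1 features
    const float* restrict in_2,   // [3], e.g. a spherical harmonic shared by all channels
    float* restrict out)          // [CHANNEL_BLOCK][3]
{
    const float C_0 = 0.70710677f;
    const float x = in_2[0], y = in_2[1], z = in_2[2];  // loaded once, reused per channel
    for (int c = 0; c < CHANNEL_BLOCK; c++) {
        const float a = in_1[3 * c + 0];
        const float b = in_1[3 * c + 1];
        const float d = in_1[3 * c + 2];
        out[3 * c + 0] = (b * z - d * y) * C_0;
        out[3 * c + 1] = (d * x - a * z) * C_0;
        out[3 * c + 2] = (a * y - b * x) * C_0;
    }
}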

Let me know your thoughts

@asglover (Author)

You should be able to check out the codegen tools too. I'm about to go offline. I guess the other benchmarking I mentioned earlier isn't going to happen today, but it still felt productive. It was nice getting introduced and chatting. I'll comment here when I have time, but I won't be online this weekend.

@teddykoker (Owner) commented Jul 20, 2024

Also, and not unrelatedly, my sole focus is doing performance work on the tensor products and I would be happy to take a look at that and share any improvements I can generate.

My plan was to go:

  1. single threaded CPU code
  2. single operation per thread CUDA code
  3. single operation per thread CUDA code considering channel
  4. single operation per thread CUDA code considering channel and linear combination

But maybe it would be worth stopping and implementing a channel-aware single-threaded CPU kernel to see if it's possible to make these calculations flops-limited when channel is large?

This sounds like a reasonable plan to me! I didn't have a long-term plan with respect to tensor product optimization, but I definitely wanted to implement the CUDA versions eventually. Certainly anything you're willing to share in that regard would be greatly appreciated. I agree that optimizing single-threaded CPU code is probably not a huge priority in the short term.

My own immediate next steps were probably going to be implementing Nequip or similar, but I'm happy to help out with the other implementations to the extent I can.

I'll be traveling next week so I might be slower to respond. Nice meeting you!

@asglover (Author) commented Jul 24, 2024

Hi @teddykoker @mitkotak, I have a question: what was the reasoning for testing channel=128 x channel=1 and channel=64 x channel=1? Aren't both of those situations somewhat like having a large batch?

In this situation, there is an opportunity to reuse half the input data (the data with multiplicity 1), but it's a limited opportunity, because at best I could halve the input memory / double the arithmetic intensity. Would you be open to benchmarking cases where both inputs have channel=64, 128, 256 instead of just one input having high channel? The case where both have large channel is a situation where doing little blocked micro kernels should give a bigger performance improvement.

@mitkotak (Contributor) commented Jul 24, 2024

Hi @teddykoker @mitkotak, I have a question: what was the reasoning for testing channel=128 x channel=1 and channel=64 x channel=1? Aren't both of those situations somewhat like having a large batch?

Yup

In this situation, there is an opportunity to reuse half the input data (the data with multiplicity 1), but it's a limited opportunity, because at best I could halve the input memory / double the arithmetic intensity. Would you be open to benchmarking cases where both inputs have channel=64, 128, 256 instead of just one input having high channel? The case where both have large channel is a situation where doing little blocked micro kernels should give a bigger performance improvement.

That would be a MACE-kind setup. In practice, the TP is done between a spherical harmonic and a node feature, which has the channels, but it would be useful to see the benchmarks.

@asglover (Author)

In this example, the node is the one with all the channels, right?

@asglover (Author)

In practice aren't there usually multiple filters? So couldn't you interact a "batch" or "channel>1" of spherical harmonics with a node feature? I feel like depending on the specific type of message passing opportunities would appear. Also, during the node update step of the message passing, isn't there an opportunity for "large" channels to interact?

@mitkotak (Contributor)

In practice aren't there usually multiple filters? So couldn't you interact a "batch" or "channel>1" of spherical harmonics with a node feature?

Sure, that's why we have the batch dim.

I feel like depending on the specific type of message passing opportunities would appear. Also, during the node update step of the message passing, isn't there an opportunity for "large" channels to interact?

Umm, not sure what you mean here. Can you give an example of inputs and outputs? You can print out the irreps in train_tetris to get a sense of the kind of irreps that we typically play with.

@asglover (Author) commented Jul 24, 2024

In practice aren't there usually multiple filters? So couldn't you interact a "batch" or "channel>1" of spherical harmonics with a node feature?

Sure, that's why we have the batch dim.

Let's use the notation [batch, multiplicity, irrep_dim]

I think that:

[x, 1, in_1_dim] x [1, y, in_2_dim] -> output

Will have the same interactions as:

[1, x, in_1_dim] x [1, y, in_2_dim] -> output

Is this fair?

I know that the underlying data layout might be different, but let's ignore that for now / acknowledge that it is controllable even if it were relevant.

And both offer an opportunity to make a x b (in_1 x in_2) -> output micro kernels where a <= x, b <= y.
How many filters exist in a typical application? Is it something like 5 or 10, or is it sometimes more like 64 or 128?

I feel like depending on the specific type of message passing opportunities would appear. Also, during the node update step of the message passing, isn't there an opportunity for "large" channels to interact?

Umm, not sure what you mean here. Can you give an example of inputs and outputs? You can print out the irreps in train_tetris to get a sense of the kind of irreps that we typically play with.

I was just saying that in the classic message passing framework there is a message passing step and a node update step, where the node self-interacts. If node features usually have large channel, then a node self-interaction should have large channel x large channel.

^ This was wrong; the interaction is between the node and the accumulated messages (which I still think should have a channel dimension the size of the number of filters).

@asglover (Author)

It looks like in the tetris.py SimpleNetwork the multiplicity (filters) is 50 by default, which seems large enough to exploit? Please correct me if I'm wrong. I'm going to take a stab at creating kernels that leverage multiplicity to improve arithmetic intensity and performance.

@mitkotak (Contributor) commented Jul 24, 2024

It looks like in the tetris.py SimpleNetwork the multiplicity (filters) is 50 by default, which seems large enough to exploit? Please correct me if I'm wrong. I'm going to take a stab at creating kernels that leverage multiplicity to improve arithmetic intensity and performance.

I am guessing it's something like (50, (lmax+1)**2). Can you print out the inputs and the corresponding sh? Just wanna confirm where it picked up that 50.

@asglover (Author) commented Jul 24, 2024

It does something similar, I think?

https://github.com/e3nn/e3nn/blob/main/examples/tetris.py

It calls "simple network" which defines mul=50 as default.

https://github.com/e3nn/e3nn/blob/da5f97f502c8b2fe85e1157f9c0305b8404c48fc/e3nn/nn/models/v2106/gate_points_networks.py#L50

When I run the example too, I see ~50 channels per layer.

Built a model:
SimpleNetwork(
  (mp): MessagePassing(
    (layers): ModuleList(
      (0): Compose(
        (first): Convolution(
          (sc): FullyConnectedTensorProduct(1x0e x 1x0e -> 150x0e+50x1o+50x2e | 150 paths | 150 weights)
          (lin1): FullyConnectedTensorProduct(1x0e x 1x0e -> 1x0e | 1 paths | 1 weights)
          (fc): FullyConnectedNet[10, 100, 3]
          (tp): TensorProduct(1x0e x 1x0e+1x1o+1x2e -> 1x0e+1x1o+1x2e | 3 paths | 3 weights)
          (lin2): FullyConnectedTensorProduct(1x0e+1x1o+1x2e x 1x0e -> 150x0e+50x1o+50x2e | 250 paths | 250 weights)
          (alpha): FullyConnectedTensorProduct(1x0e+1x1o+1x2e x 1x0e -> 1x0e | 1 paths | 1 weights)
        )
        (second): Gate (150x0e+50x1o+50x2e -> 50x0e+50x1o+50x2e)
      )
      (1): Compose(
        (first): Convolution(
          (sc): FullyConnectedTensorProduct(50x0e+50x1o+50x2e x 1x0e -> 250x0e+50x1o+50x1e+50x2o+50x2e | 17500 paths | 17500 weights)
          (lin1): FullyConnectedTensorProduct(50x0e+50x1o+50x2e x 1x0e -> 50x0e+50x1o+50x2e | 7500 paths | 7500 weights)
          (fc): FullyConnectedNet[10, 100, 750]
          (tp): TensorProduct(50x0e+50x1o+50x2e x 1x0e+1x1o+1x2e -> 150x0e+200x1o+100x1e+100x2o+200x2e | 750 paths | 750 weights)
          (lin2): FullyConnectedTensorProduct(150x0e+200x1o+100x1e+100x2o+200x2e x 1x0e -> 250x0e+50x1o+50x1e+50x2o+50x2e | 67500 paths | 67500 weights)
          (alpha): FullyConnectedTensorProduct(150x0e+200x1o+100x1e+100x2o+200x2e x 1x0e -> 1x0e | 150 paths | 150 weights)
        )
        (second): Gate (250x0e+50x1o+50x1e+50x2o+50x2e -> 50x0e+50x1o+50x1e+50x2o+50x2e)
      )
      (2): Compose(
        (first): Convolution(
          (sc): FullyConnectedTensorProduct(50x0e+50x1o+50x1e+50x2o+50x2e x 1x0e -> 50x0o+250x0e+50x1o+50x1e+50x2o+50x2e | 22500 paths | 22500 weights)
          (lin1): FullyConnectedTensorProduct(50x0e+50x1o+50x1e+50x2o+50x2e x 1x0e -> 50x0e+50x1o+50x1e+50x2o+50x2e | 12500 paths | 12500 weights)
          (fc): FullyConnectedNet[10, 100, 1350]
          (tp): TensorProduct(50x0e+50x1o+50x1e+50x2o+50x2e x 1x0e+1x1o+1x2e -> 100x0o+150x0e+300x1o+250x1e+250x2o+300x2e | 1350 paths | 1350 weights)
          (lin2): FullyConnectedTensorProduct(100x0o+150x0e+300x1o+250x1e+250x2o+300x2e x 1x0e -> 50x0o+250x0e+50x1o+50x1e+50x2o+50x2e | 97500 paths | 97500 weights)
          (alpha): FullyConnectedTensorProduct(100x0o+150x0e+300x1o+250x1e+250x2o+300x2e x 1x0e -> 1x0e | 150 paths | 150 weights)
        )
        (second): Gate (50x0o+250x0e+50x1o+50x1e+50x2o+50x2e -> 50x0o+50x0e+50x1o+50x1e+50x2o+50x2e)
      )
      (3): Convolution(
        (sc): FullyConnectedTensorProduct(50x0o+50x0e+50x1o+50x1e+50x2o+50x2e x 1x0e -> 1x0o+6x0e | 350 paths | 350 weights)
        (lin1): FullyConnectedTensorProduct(50x0o+50x0e+50x1o+50x1e+50x2o+50x2e x 1x0e -> 50x0o+50x0e+50x1o+50x1e+50x2o+50x2e | 15000 paths | 15000 weights)
        (fc): FullyConnectedNet[10, 100, 300]
        (tp): TensorProduct(50x0o+50x0e+50x1o+50x1e+50x2o+50x2e x 1x0e+1x1o+1x2e -> 150x0o+150x0e | 300 paths | 300 weights)
        (lin2): FullyConnectedTensorProduct(150x0o+150x0e x 1x0e -> 1x0o+6x0e | 1050 paths | 1050 weights)
        (alpha): FullyConnectedTensorProduct(150x0o+150x0e x 1x0e -> 1x0e | 150 paths | 150 weights)
      )
    )
  )
)

@mitkotak (Contributor)

Wait wait, I meant the tetris in this repo

@asglover (Author)

ah, let me check

@teddykoker (Owner)

That's correct. In this repo we are using 32x0e + 32x0o + 8x1o + 8x1e + 8x2e + 8x2o as the biggest representation; see tetris.c and run_tetris.py. I just pulled these from the example in the e3nn-jax repo.

@teddykoker (Owner)

This is somewhat arbitrary though. In the ChargE3Net paper we use up to 100x0e + 100x0o + 33x1o + 33x1e + 20x2e + 20x2o + 14x3o + 14x3e + 11x4e + 11x4o. In Nequip they use 64x0e + 64x0o + 64x1o + 64x1e + 64x2e + 64x2o.

@asglover (Author) commented Jul 24, 2024

@teddykoker thanks for those numbers, they are helpful to have as a reference!

@mitkotak

That would be a MACE-kind setup.

Ah, I see what you mean, it would definitely appear in a MACE-like setting.

I thought, from reading the tensor field networks paper, that often each edge would be combined with multiple filters, i.e. that there would be some multiplicity associated with the edge. It seems like in tetris there is only 1 filter associated with each edge...

Although, thinking about this and looking at the implementation of message passing in the e3nn repository: e3nn has a "receiver"-based implementation of message passing. Each receiver collects the source node features and the edge to get the inputs for a tensor product; the receiver then calculates the messages, accumulates them, and so on.

But this is an implementation choice. Each source node could just as easily create the "outgoing message" for all of its outgoing edges, and then the destination nodes would collect and accumulate the messages calculated by the sources. In this scheme, you could create a "batch" of edges for a given source node. Then you would have something like edge_batch = degree-of-connectivity of the source.

The source node hidden representation would likely have some multiplicity or channel (all of the examples shared by teddy have this). This would create the necessary multiplicity on both inputs to the tensor product to enable a "2D" kernel.

A thing to consider with either implementation is the amount of data communicated. In a "source-calculates" implementation you communicate the size of the messages + edge; in a "destination-calculates" implementation, you communicate the size of the hidden representation + edge. If these are the same size, there's no difference.

Does my reasoning make sense? I guess I could start with a 1D micro kernel and see if it provides a performance improvement.
