Alternate tp.c and working benchmarking #10
base: main
Conversation
Thanks for sharing! Can confirm that I am getting similar if not a slightly more significant performance improvement on my i5 Desktop. Do you have any idea which modifications are leading to the most performance gains? Definitely would like to incorporate these improvements, but it would be good to have an understanding of why they are faster. It seems to me like the main improvements would be due to:
But it's not clear to me which one. Additionally, would it be faster to perform the contractions and output writing in one step? E.g.:
// This is an automatically generated kernel for performing a tensor product and contraction
void tp_v2_1_1_1 (
const float* in_1,
const float* in_2,
float* out)
{
// Define CG Coefficients
const float C_0 = 0.70710677;
// Define Input Values
float in_1_0 = in_1[0];
float in_1_1 = in_1[1];
float in_1_2 = in_1[2];
float in_2_0 = in_2[0];
float in_2_1 = in_2[1];
float in_2_2 = in_2[2];
- // Define Output Accumulators
- float out_0 = 0;
- float out_1 = 0;
- float out_2 = 0;
// Perform Outer Products
const float op_0_1 = in_1_0 * in_2_1;
const float op_0_2 = in_1_0 * in_2_2;
const float op_1_0 = in_1_1 * in_2_0;
const float op_1_2 = in_1_1 * in_2_2;
const float op_2_0 = in_1_2 * in_2_0;
const float op_2_1 = in_1_2 * in_2_1;
- // Perform Contractions
- out_2 += op_0_1 * C_0;
- out_1 -= op_0_2 * C_0;
- out_2 -= op_1_0 * C_0;
- out_0 += op_1_2 * C_0;
- out_1 += op_2_0 * C_0;
- out_0 -= op_2_1 * C_0;
- // Write Output
- out[0] = out_0;
- out[1] = out_1;
- out[2] = out_2;
+ out[0] = op_1_2 * C_0 - op_2_1 * C_0;
+ out[1] = -op_0_2 * C_0 + op_2_0 * C_0;
+ out[2] = op_0_1 * C_0 - op_1_0 * C_0;
}
This would eliminate the need to use the accumulators, but maybe the compiler is already essentially doing this... |
Ah, I'm really glad that the improvement replicates. I ran an experiment on my Mac as well. Do you want to discuss on this PR or on the discussion thread? |
Let's discuss here for now; that way we can more easily reference changes to the code. |
Okay, got it. The Mac experiment is here for anyone's reference:
I can't answer for sure, because I implemented them all in one pass. They all are operating on the same principle, which is to fully describe the computation graph for the compiler and runtime system to do what they want with.
I can't quantify the size of the performance effect, but I think using restrict helps.
Technically you would be "violating the contract you make with the compiler", but because you're not aliasing the input and output it can't cause issues for you. Also, in this codegen world, it probably makes sense to provide higher-performance self-interacting functions for the self-interacting case.
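For concreteness, here is a minimal sketch of what a restrict-qualified version of the generated kernel could look like; the tp_v2_1_1_1_restrict name and the fused body are illustrative, not actual codegen output:
// Sketch: the kernel from the snippet above, but with restrict-qualified
// pointers so the compiler may assume in_1, in_2, and out never alias.
// (Assumption: callers never pass the same buffer as input and output.)
void tp_v2_1_1_1_restrict(
    const float* restrict in_1,
    const float* restrict in_2,
    float* restrict out)
{
    const float C_0 = 0.70710677f;
    // With no aliasing possible, the compiler is free to reorder the loads
    // of in_1/in_2 relative to the stores into out.
    out[0] = (in_1[1] * in_2[2] - in_1[2] * in_2[1]) * C_0;
    out[1] = (in_1[2] * in_2[0] - in_1[0] * in_2[2]) * C_0;
    out[2] = (in_1[0] * in_2[1] - in_1[1] * in_2[0]) * C_0;
}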
And taking coefficients that are very close to each other (only different because of numerical instability) and smushing them together.
Then when I enumerate the contractions, I look for the closest coefficient to use (because I no longer have exact matches). I make comments here about how that numerical instability could add "unnecessary FMAs" to the function. I don't know how to quantify this effect, but I wanted to do this explicitly so that I let the compiler handle the common case, like in the cross product when two terms are (a*b*c - d*e*c) -> (a*b - d*e)*c. One thing I should also change in my codegen is to force numbers close to 1 to be 1. That should remove more unnecessary FMAs.
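To illustrate what is being described here, a small sketch; the numbers and names are made up, not actual codegen output:
// Illustrative sketch of the coefficient handling described above.
// Before smushing, two coefficients that differ only by round-off force
// two multiplies per output term:
//   out[0] = op_1_2 * 0.70710677f - op_2_1 * 0.70710683f;
// After smushing them into one coefficient, the compiler can factor it out:
//   out[0] = (op_1_2 - op_2_1) * C_0;
// And a coefficient that is 0.99999994f only because of round-off gets
// forced to 1.0f, so the multiply disappears entirely:
//   out[3] = op_0_0 + op_1_1;   // instead of (op_0_0 + op_1_1) * 0.99999994f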
If you are using restrict, a compiler should be able to identify this opportunity even if you don't make it explicitly visible, but I wanted to make it super explicit so that, for larger auto-generated functions, or if the compiler "gets lazy" when the computation graph starts getting big, it would still be a very visible opportunity.
I think the compiler is probably doing this. You could write it in a different format, but I don't think it would make a huge difference. |
Any way to separate out these optimizations and narrow it down more? Also, let's try playing with 32, 64, and 256 channels. And for the outputs, let's try chopping them off at lmax instead of letting them grow to 2*lmax. A big table with everything would be useful. Wanna get some more intuition. |
I think caching the products might not be necessary if the compiler recognizes the opportunity, which I think it should. The coefficient optimizations are a heavier lift. I would suggest getting restrict to work first, and if that doesn't do what you want, looking at the coefficient smushing and forcing to 1.
Feel free, and there is a bit of noise, so if you wanted to really benchmark, you probably need to do some kind of averaging over multiple runs. I am not going to take the lead on this benchmarking project; I need to focus on the fused output benchmarking we previously discussed today. Also, sorry for closing the PR temporarily, I slipped on my keyboard. |
I think putting this in a profiler would be insightful, and I'll give that a shot to satisfy my own curiosity. I wonder how much overhead exists outside of the tensor product functions themselves. |
I will say, this is single-threaded code, so the only interaction I would expect would be that if you turn channels way down, to say 1, you would stress the memory system more and get to reuse your cache less. |
There definitely is noise in this benchmark. I updated my codegen to force constants very-close-to-1 -> 1, and because of the run-to-run variation I can't really assess its effect, which sort of suggests it is in the noise. If you really want to be able to determine the effect of these function-level changes, you probably need to create even more synthetic benchmarks where you just run the kernels on their own. |
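A minimal sketch of that kind of synthetic benchmark, assuming the tp_v2_1_1_1 kernel from the snippet above is linked in and a POSIX clock_gettime is available; the run and iteration counts are arbitrary:
#include <stdio.h>
#include <time.h>

// Kernel from the snippet above, assumed to be compiled in another file.
void tp_v2_1_1_1(const float* in_1, const float* in_2, float* out);

int main(void)
{
    float a[3] = {1.0f, 2.0f, 3.0f};
    float b[3] = {4.0f, 5.0f, 6.0f};
    float c[3];
    const int runs = 10;
    const long iters = 10000000L;
    double best = 1e30, sink = 0.0;

    for (int r = 0; r < runs; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) {
            tp_v2_1_1_1(a, b, c);
            sink += c[0];   // keep the result live so the loop is not removed
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (s < best) best = s;   // report the best (least noisy) run
    }
    printf("best of %d runs: %.2f ns/call (sink=%f)\n",
           runs, best / (double)iters * 1e9, sink);
    return 0;
}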
Also, I can update this PR to include the actual codegen machinery I use; I'm not hiding it, I just assumed you would rather tweak your setup to produce different code. The coefficients stuff is a little involved, just because you can't step through the tensor and print something for every non-zero value in one shot. |
That would be helpful! We can always worry about merging the setups later. |
One more thing to consider with respect to benchmarks: all of the current benchmarking code uses 0s as input irreps. I added a simple randn() function and updated all the benchmarks to use random normal inputs in #11. This seems to affect some implementations more than others, but I figure it's a better benchmark than 0s only. |
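For reference, one common way to write such a randn() is the Box-Muller transform; the sketch below is illustrative only and not necessarily the implementation added in #11:
#include <math.h>
#include <stdlib.h>

// Box-Muller sketch of a standard-normal randn(); illustrative only.
static float randn(void)
{
    // Two uniforms in (0, 1]; the +1 keeps u1 away from 0 so logf() is defined.
    float u1 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 1.0f);
    float u2 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 1.0f);
    const float two_pi = 6.28318530f;
    return sqrtf(-2.0f * logf(u1)) * cosf(two_pi * u2);
}

// Example use: fill an input buffer with random normal values before benchmarking.
static void fill_randn(float* buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = randn();
}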
Here are the profile results: I bumped up the batch to get a longer profile. I'll have to look up / test the memory bandwidth, but it looks like we are memory-limited on these operations. I'm not surprised by this, because they don't leverage channel, and these tensor product kernels (especially ones that don't explicitly do flops for zeros in the CG coefficients, aka sparse kernels) are not high arithmetic intensity. This suggests that there are larger speedups on the table if you were to write little micro kernels that do "n1_channel1 x n2_channel2" and then update the code which calls the tensor_product kernels to reflect this blocked approach (a sketch of the idea follows below)... But honestly, I don't think you have to chase this down unless you really wanted to? Single-threaded CPU is not the target architecture for this application, and there are probably more interesting things to do with implementing the other parts of the system. Also, and not unrelatedly, my sole focus is doing performance work on the tensor products, and I would be happy to take a look at that and share any improvements I can generate. My plan was to go: single-threaded CPU code first, but maybe it would be worth stopping and implementing a channel-aware single-threaded CPU kernel to see if it's possible to make these calculations flops-limited when channel is large? Let me know your thoughts |
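To make the blocked idea concrete, here is a rough sketch for the l=1 x l=1 -> l=1 case; the name, the [channel][irrep_dim] layout, and the output ordering are assumptions for illustration, not part of the existing codegen:
// Sketch of a channel-blocked micro kernel: one call handles a c1 x c2 block
// of channel pairs, so each loaded input vector is reused against every
// channel of the other input (higher arithmetic intensity than one pair per call).
void tp_block_1_1_1(const float* restrict in_1,  // c1 x 3, [channel][irrep_dim]
                    const float* restrict in_2,  // c2 x 3
                    float* restrict out,         // (c1*c2) x 3, in_1-channel major
                    int c1, int c2)
{
    const float C_0 = 0.70710677f;
    for (int i = 0; i < c1; i++) {
        // Load one channel of in_1 once and reuse it against all c2 channels of in_2.
        const float a0 = in_1[3 * i + 0];
        const float a1 = in_1[3 * i + 1];
        const float a2 = in_1[3 * i + 2];
        for (int j = 0; j < c2; j++) {
            const float b0 = in_2[3 * j + 0];
            const float b1 = in_2[3 * j + 1];
            const float b2 = in_2[3 * j + 2];
            float* o = out + 3 * (i * c2 + j);
            o[0] = (a1 * b2 - a2 * b1) * C_0;
            o[1] = (a2 * b0 - a0 * b2) * C_0;
            o[2] = (a0 * b1 - a1 * b0) * C_0;
        }
    }
}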
You should be able to check out the codegen tools too. I'm about to go offline. I guess the other benchmarking I mentioned earlier isn't going to happen today, but it still felt productive. It was nice getting introduced and chatting. I'll comment here when I have the time, but I won't be online this weekend. |
This sounds like a reasonable plan to me! I didn't have a long-term plan with respect to tensor product optimization, but I definitely wanted to implement the CUDA versions eventually. Certainly anything you're willing to share in that regard would be greatly appreciated. I agree that optimizing single-threaded CPU code is probably not a huge priority in the short term. My own immediate next steps were probably going to be implementing Nequip or similar, but I'm happy to help out with the other implementations to the extent I can. I'll be traveling next week, so I might be slower to respond. Nice meeting you! |
Hi @teddykoker @mitkotak, I have a question: what was the reasoning for testing channel=128 x channel=1 and channel=64 x channel=1? Aren't both of those situations somewhat like having a large batch? In this situation, there is an opportunity to reuse half the input data (the data with multiplicity 1), but it's a limited opportunity, because at best I could halve the input memory / double the arithmetic intensity. Would you be open to benchmarking cases where both inputs have channel=64, 128, 256 instead of just one input having high channel? The case where both have large channel is a situation where doing little blocked micro kernels should give more performance improvement. |
Yup
That would be a MACE-kind setup. In practice, the TP is done between a spherical harmonic and a node feature, which has the channels, but it would be useful to see the benchmarks. |
In this example, the node is the one with all the channels, right? |
In practice, aren't there usually multiple filters? So couldn't you interact a "batch" or "channel>1" of spherical harmonics with a node feature? I feel like, depending on the specific type of message passing, opportunities would appear. Also, during the node update step of the message passing, isn't there an opportunity for "large" channels to interact? |
Sure, that's why we have the batch dim |
Umm, not sure what you mean here. Can you give an example of inputs and outputs? You can print out the irreps in train_tetris to get a sense of the kind of irreps that we typically play with. |
Let's use the notation [batch, multiplicity, irrep_dim]. I think that:
Will have the same interactions as:
Is this fair? I know that the underlying data layout might be different, but I'm ignoring that for now / acknowledging that it is controllable even if it were relevant. And both offer an opportunity to make blocked micro kernels. |
I was just saying that in the classic message passing framework, there is a message passing step and a node update step, where the node self-interacts. If node features usually have large channel, then a node self-interaction should have large channel x large channel. ^ This was wrong; the interaction is between the node and the accumulated messages (which I still think should have a channel dimension the size of the number of filters) |
It looks like in the tetris.py |
I am guessing it's something like (50, (lmax+1)**2). Can you print out the inputs and the corresponding sh? Just wanna confirm where it picked up that 50. |
It does something similar, I think? https://github.com/e3nn/e3nn/blob/main/examples/tetris.py It calls "simple network", which defines mul=50 as default. When I run the example, I also see ~50 channels per layer. |
Wait wait, I meant the tetris in this repo |
Ah, let me check |
That's correct. In this repo we are using
This is somewhat arbitrary though. In the ChargE3Net paper we use up to |
@teddykoker thanks for those numbers, they are helpful to have as a reference!
Ah, I see what you mean; it would definitely appear in a MACE-like setting. I thought, in reading the tensor field networks paper, that often each edge would be combined with multiple filters, so there would be some multiplicity associated with the edge. It seems like in tetris there is only 1 filter associated with each edge... Although, thinking about this and looking at the implementation of message passing in the e3nn repository: e3nn has a "receiver"-based implementation of message passing. Each receiver collects the source node features and the edge to get the inputs for a tensor product; the receiver then calculates the messages, then accumulates them, and so on. But this is an implementation choice. Each source node could just as easily create the "outgoing message" for all of its outgoing edges, and then the destination nodes would collect and accumulate the messages calculated by the sources. In this scheme, you could create a "batch" of edges for a given source node. Then you would have something like edge_batch = degree-of-connectivity of the source. The source node hidden representation would likely have some multiplicity or channel (all of the examples shared by teddy have this). This would create the necessary multiplicity on both inputs to the tensor product to enable a "2D" kernel (a rough sketch follows below). A thing to consider with either implementation is the size of the data communicated. In a "source-calculates" implementation you communicate the size of the messages + edge. In a "destination-calculates" implementation you communicate the size of the hidden representation + edge. If these are the same, there's no difference. Does my reasoning make sense? I guess I could start with a 1D micro kernel and see if it provides a performance improvement. |
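A rough sketch of the "source-calculates" loop structure described above, reusing the hypothetical tp_block_1_1_1 from the earlier sketch; the CSR-style edge arrays, layouts, and names are all assumptions for illustration:
#include <stddef.h>

// Hypothetical blocked kernel from the earlier sketch.
void tp_block_1_1_1(const float* in_1, const float* in_2, float* out,
                    int c1, int c2);

// "Source-calculates" message passing sketch: each source node pushes its
// channels through one blocked tensor product against all of its outgoing
// edges' spherical harmonics, so both inputs have multiplicity.
void compute_outgoing_messages(
    const float* node_feat,   // n_nodes x channels x 3
    const float* edge_sh,     // n_edges x 3 (l=1 SH), edges sorted by source node
    const int* edge_start,    // CSR offsets: node s owns edges [edge_start[s], edge_start[s+1])
    float* messages,          // one block per source node, laid out channels x degree x 3
    int n_nodes, int channels)
{
    for (int s = 0; s < n_nodes; s++) {
        int begin = edge_start[s];
        int degree = edge_start[s + 1] - begin;
        if (degree == 0)
            continue;
        // edge_batch = degree-of-connectivity of the source, as described above.
        tp_block_1_1_1(node_feat + (size_t)s * channels * 3,
                       edge_sh + (size_t)begin * 3,
                       messages + (size_t)begin * channels * 3,
                       channels, degree);
    }
}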
I don't think you should actually merge this branch, because I had to add a bunch of things to the repo to make the build process work. It's probably best just to use this as a tool to study performance, then work the changes into master, and then reject this PR.