This is great, how can I learn to do it myself? #9

Closed
seyeeet opened this issue Mar 23, 2021 · 4 comments
Labels
question Further information is requested

Comments

@seyeeet

seyeeet commented Mar 23, 2021

Thanks for sharing this code. Is it a rewritten version of https://github.com/google-research/fast-soft-sort? What is the benefit of this version compared to Google's fast-soft-sort?
What really interests me about this work is the implementation. I find it very hard to incorporate new operations into PyTorch with C++ and CUDA. I know it is a bit much to ask, but it would be greatly appreciated if you could write a tutorial (or make a video) on how to do this for other functions.
That would be a huge help!

@teddykoker added the question label Mar 23, 2021
@teddykoker
Owner

Great question! A good starting point is Custom C++ and CUDA Extensions by Peter Goldsborough. To get started converting your own function, I would recommend the following:

  1. Write your function in pure PyTorch/Python/NumPy, using for-loops wherever there is no existing vectorized function you can leverage. You can use torch.autograd.gradcheck to verify that your forward/backward implementations are correct (a minimal sketch of this step follows the list).
  2. Convert the code to a C++ extension. The code will likely look very similar; you will just need to use a torch::TensorAccessor anywhere you loop over a tensor. Feel free to use my code as a reference. You can then test your C++ extension against the Python implementation, and run the grad check again (see my tests/test_ops.py for an example, and the loading-and-testing sketch after this list).
  3. The CUDA extension will likely be nearly identical to the C++ extension, except you must use torch::PackedTensorAccessor instead. I would recommend starting with threads=1; blocks=1 and performing any outer loops manually. Once this works, you can increase the threads and blocks to parallelize the outer loops. Again, I would recommend constantly testing that the numerical outputs are consistent across all of the implementations.
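
To make step 1 concrete, here is a minimal sketch of what the pure-Python reference might look like. The operator (a toy element-wise tanh, named ToyOp here) is made up purely for illustration and is not the operator from this repository:

```python
import torch

class ToyOp(torch.autograd.Function):
    """Toy element-wise op (not the operator from this repo), written the way
    you might prototype before porting to C++/CUDA: explicit loops and a
    hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        y = torch.empty_like(x)
        # Explicit loop; this is the part that later becomes the C++/CUDA kernel.
        for i in range(x.numel()):
            y.view(-1)[i] = torch.tanh(x.view(-1)[i])
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        # d/dx tanh(x) = 1 - tanh(x)^2
        return grad_output * (1 - y * y)

# gradcheck compares the analytic backward against finite differences;
# it needs double precision and requires_grad=True inputs.
x = torch.randn(5, dtype=torch.double, requires_grad=True)
assert torch.autograd.gradcheck(ToyOp.apply, (x,))
```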
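
For steps 2 and 3, the C++/CUDA source itself is too long to sketch here, but the Python side of JIT-compiling the extension and checking it against the reference could look roughly like this (`my_op.cpp` and the exported `forward` function are hypothetical placeholders for your own code, not files in this repo):

```python
import torch
from torch.utils.cpp_extension import load

# JIT-compile the extension; add a .cu file to `sources` for the CUDA step.
my_op = load(name="my_op", sources=["my_op.cpp"], verbose=True)

x = torch.randn(100, dtype=torch.double)

# The extension should agree with the pure-Python reference (ToyOp above)
# to within numerical tolerance; repeat the same check after the CUDA port.
assert torch.allclose(my_op.forward(x), ToyOp.apply(x))
```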

A (perhaps easier) alternative would be leveraging numba. It is a similar approach, but the kernel can be written in Python and then just-in-time compiled for CPU/GPU. Maghoumi/pytorch-softdtw-cuda is a great example of this method (thanks to Mathieu for pointing this out). If done properly, you can likely achieve performance on par with the C++/CUDA implementation.
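
As a rough illustration of what the numba route looks like (again just a toy element-wise kernel, not the operator from this repo), the GPU kernel itself stays in Python:

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def toy_kernel(x, out):
    i = cuda.grid(1)          # global thread index
    if i < x.size:            # guard against threads past the end of the array
        out[i] = math.tanh(x[i])

x = np.random.randn(1024).astype(np.float32)
out = np.empty_like(x)

threads = 128
blocks = (x.size + threads - 1) // threads
toy_kernel[blocks, threads](x, out)  # numba copies the NumPy arrays to/from the GPU
```

In Maghoumi/pytorch-softdtw-cuda, kernels like this are wrapped in a torch.autograd.Function, with PyTorch tensors handed to the kernel through numba's CUDA array interface, so the result plugs into autograd much like a C++/CUDA extension would.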

I hope this helps. If enough people ask I would be happy to write a blog post about the process! - Teddy

@seyeeet
Author

seyeeet commented Mar 23, 2021

Thanks, Teddy, for the explanation. I hope more people show interest so we get to see your blog post on this topic.
I will definitely take a look at the numba option, since my C++ knowledge is very limited.
I am not sure whether I should close this issue, since it is not really an issue, and if it gets closed no one will see it anymore. I will leave it up to you; please feel free to close it if you think it does not need to stay open.
Thanks again for your response.

@teddykoker
Owner

I'll leave it open for visibility :)

@teddykoker pinned this issue Mar 24, 2021
@teddykoker
Owner

Closing this, but keeping it pinned

@teddykoker changed the title from "this is great, how can I learn to do it myself?" to "This is great, how can I learn to do it myself?" May 5, 2021