
Replace stack/mask/reduce by indexing in _hsv2rgb #7754

Merged
merged 5 commits into pytorch:main from fix_7753 on Aug 15, 2023

Conversation

nlgranger
Contributor

@nlgranger nlgranger commented Jul 24, 2023

Fixes #7753

cc @vfdev-5
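
For context, a toy illustration of the kind of change this PR makes: instead of materializing all candidate outputs and reducing them with a one-hot mask, the right candidate is picked directly by index. This is a sketch of the general pattern only, not the actual torchvision code; the shapes and names are made up.

```python
import torch

# Toy selection problem: for each pixel, pick one of 6 candidate values
# according to an integer case index in [0, 6).
candidates = torch.rand(6, 4, 5)       # (num_cases, H, W)
case = torch.randint(0, 6, (4, 5))     # per-pixel case index

# mask/reduce pattern: build a one-hot mask over the 6 cases,
# multiply, and sum away the case dimension (allocates (6, H, W) temporaries).
mask = torch.nn.functional.one_hot(case, num_classes=6).permute(2, 0, 1)
out_mask = (candidates * mask).sum(dim=0)

# indexing/gather pattern: look up the selected candidate directly.
out_gather = candidates.gather(0, case.unsqueeze(0)).squeeze(0)

assert torch.equal(out_mask, out_gather)
```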

@pytorch-bot

pytorch-bot bot commented Jul 24, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/7754

Note: Links to docs will display an error until the docs builds have been completed.

❌ 33 New Failures

As of commit baa0081:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@nlgranger
Contributor Author

nlgranger commented Jul 24, 2023

Not sure what is going on with the out-of-memory errors in CI. The PR actually reduces peak memory usage according to:

import torch
from torchvision.transforms._functional_tensor import _rgb2hsv, _hsv2rgb

rgb = torch.rand((16, 3, 704, 1024), device='cuda')
hsv = _rgb2hsv(rgb)
torch.cuda.reset_peak_memory_stats()
_hsv2rgb(hsv)
print(torch.cuda.max_memory_allocated() // (1024 ** 2))

I cannot reproduce locally.

@NicolasHug
Member

Thanks for the PR @nlgranger

This incurs a lot of useless memory and CPU work and could be faster

Can you please provide some simple benchmarks illustrating the gains in memory and perf?

@nlgranger
Contributor Author

nlgranger commented Jul 24, 2023

Using the following code:

import time

import torch
from torchvision.transforms._functional_tensor import _rgb2hsv, _hsv2rgb

device = "cuda"
shapes = [
    (3, 320, 320),
    (8, 3, 320, 320),
    (32, 3, 320, 320),
    (3, 640, 768),
    (8, 3, 640, 768),
    (32, 3, 640, 768),
]

for s in shapes:
    rgb = torch.rand(s, device=device)
    hsv = _rgb2hsv(rgb)

    durations = []
    peak_mem = []

    # warm-up iterations (timings discarded)
    for _ in range(10):
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        t0 = time.monotonic()
        _hsv2rgb(hsv)
        torch.cuda.synchronize()
        t1 = time.monotonic()

    # timed iterations
    for _ in range(100):
        torch.cuda.synchronize()
        t0 = time.monotonic()
        _hsv2rgb(hsv)
        torch.cuda.synchronize()
        t1 = time.monotonic()

        durations.append(t1 - t0)
        if device == "cuda":
            peak_mem.append(torch.cuda.max_memory_allocated())

    if device == "cuda":
        print(f"{str(s):20s} : {sum(durations) * 10:7.2f}ms  {sum(peak_mem) / 100 / 1024 ** 2:7.2f}MB")
    else:
        print(f"{str(s):20s} : {sum(durations) / len(durations) * 1000:7.2f}ms")

On GPU (Quadro T1000 mobile):

| shape | time (ms) | time before (ms) | peak mem (MB) | peak mem before (MB) |
|---|---|---|---|---|
| (3, 320, 320) | 0.59 | 8.59 | 1.84 | 40.12 |
| (8, 3, 320, 320) | 2.81 | 68.75 | 13.13 | 262.81 |
| (32, 3, 320, 320) | 10.85 | 277.50 | 52.22 | 1034.38 |
| (3, 640, 768) | 1.73 | 43.88 | 8.08 | 161.56 |
| (8, 3, 640, 768) | 12.99 | 334.00 | 62.94 | 1231.62 |
| (32, 3, 640, 768) | 51.62 | OOM | 1320.00 | OOM |

On CPU (i7-9850H):

| shape | time (ms) | time before (ms) |
|---|---|---|
| (3, 320, 320) | 2.49 | 4.83 |
| (8, 3, 320, 320) | 18.73 | 62.24 |
| (32, 3, 320, 320) | 85.25 | 322.25 |
| (3, 640, 768) | 10.78 | 21.25 |
| (8, 3, 640, 768) | 104.61 | 389.26 |
| (32, 3, 640, 768) | 553.01 | 1541.78 |

@vfdev-5
Collaborator

vfdev-5 commented Jul 26, 2023

I ran my benchmark script and your script with more iterations and see the following:

fn_new2 is the hsv2rgb function from the PR with an improvement
fn_new is the hsv2rgb function from the PR
fn_v2 is the reference function from v2 (not the v1 function in _functional_tensor.py).

$ python -u bench_hsv2rgb_pr_7753.py 

[------------ HSV -> RGB cpu torch.float32 ------------]
                         |  fn_new2  |  fn_new  |  fn_v2
1 threads: ---------------------------------------------
      (3, 400, 300)      |     3.3   |    4.2   |    6.3
      (3, 640, 768)      |    14.8   |   19.5   |   38.2
      (16, 3, 400, 300)  |    75.5   |   82.2   |  183.3

Times are in milliseconds (ms).

[------------ HSV -> RGB cuda torch.float32 ------------]
                         |  fn_new2  |  fn_new  |  fn_v2 
1 threads: ----------------------------------------------
      (3, 400, 300)      |   269.5   |  444.2   |   285.2
      (3, 640, 768)      |   268.8   |  443.1   |   284.1
      (16, 3, 400, 300)  |   330.0   |  466.6   |  1155.5

Times are in microseconds (us).
$ python -u bench_hsv2rgb_pr_7753_repro.py 
v2, cuda - (3, 320, 320)        :    0.33ms    20.12MB
new, cuda - (3, 320, 320)        :    0.48ms     8.59MB
new2, cuda - (3, 320, 320)        :    0.31ms     9.38MB
v2, cuda - (8, 3, 320, 320)     :    0.51ms   160.94MB
new, cuda - (8, 3, 320, 320)     :    0.50ms    68.75MB
new2, cuda - (8, 3, 320, 320)     :    0.33ms    75.00MB
v2, cuda - (3, 640, 768)        :    0.36ms    97.34MB
new, cuda - (3, 640, 768)        :    0.48ms    42.50MB
new2, cuda - (3, 640, 768)        :    0.31ms    46.25MB
v2, cuda - (8, 3, 640, 768)     :    2.50ms   779.50MB
new, cuda - (8, 3, 640, 768)     :    1.13ms   338.00MB
new2, cuda - (8, 3, 640, 768)     :    0.81ms   367.00MB

Source
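
The collapsed "Source" script above is not reproduced here. The table layout matches torch.utils.benchmark output, so a comparison of this kind can be generated with something along these lines (a sketch; the shapes and the set of functions to compare are assumptions):

```python
import torch
import torch.utils.benchmark as benchmark
from torchvision.transforms._functional_tensor import _hsv2rgb

# Candidate implementations to compare; fn_new / fn_new2 from this thread
# would be added alongside the existing function.
candidates = {"fn_v1": _hsv2rgb}

results = []
for device in ["cpu"] + (["cuda"] if torch.cuda.is_available() else []):
    for shape in [(3, 400, 300), (3, 640, 768), (16, 3, 400, 300)]:
        hsv = torch.rand(shape, device=device)
        for name, fn in candidates.items():
            results.append(
                benchmark.Timer(
                    stmt="fn(hsv)",
                    globals={"fn": fn, "hsv": hsv},
                    label=f"HSV -> RGB {device} {hsv.dtype}",
                    sub_label=str(shape),
                    description=name,
                ).blocked_autorange(min_run_time=1)
            )

benchmark.Compare(results).print()
```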

I propose to use this improved function (a mix of the v2 function plus indexing and gather):

import torch
from torch import Tensor


def fn_new2(img: Tensor) -> Tensor:
    h, s, v = img.unbind(dim=-3)
    h6 = h.mul(6)
    i = torch.floor(h6)
    f = h6.sub_(i)
    i = i.to(dtype=torch.int32)

    sxf = s * f
    one_minus_s = 1.0 - s
    q = (1.0 - sxf).mul_(v).clamp_(0.0, 1.0)
    t = sxf.add_(one_minus_s).mul_(v).clamp_(0.0, 1.0)
    p = one_minus_s.mul_(v).clamp_(0.0, 1.0)
    i.remainder_(6)

    vpqt = torch.stack((v, p, q, t), dim=-3)

    # vpqt -> rgb mapping based on i
    select = torch.tensor(
        [[0, 2, 1, 1, 3, 0], [3, 0, 0, 2, 1, 1], [1, 1, 3, 0, 0, 2]],
        dtype=torch.long, device=img.device
    )

    select = select[:, i]
    if select.ndim > 3:
        select = select.transpose(0, 1)
    return vpqt.gather(-3, select)

@nlgranger what do you think?
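
As a quick sanity check of the proposed function (a sketch, assuming fn_new2 as defined above is in scope), it can be compared against the existing implementation in _functional_tensor.py; the two should agree up to floating-point rounding:

```python
import torch
from torchvision.transforms._functional_tensor import _rgb2hsv, _hsv2rgb

for shape in [(3, 64, 64), (8, 3, 32, 48)]:
    # Round-trip random RGB images to get valid HSV inputs, then compare
    # the two HSV -> RGB implementations on the same tensor.
    hsv = _rgb2hsv(torch.rand(shape))
    torch.testing.assert_close(fn_new2(hsv), _hsv2rgb(hsv))
```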

@nlgranger nlgranger force-pushed the fix_7753 branch 2 times, most recently from 27bb519 to 1fd4730 on July 26, 2023 at 22:58
@nlgranger
Contributor Author

@vfdev-5 I have included your in-place optimizations as well, thank you.

@nlgranger nlgranger force-pushed the fix_7753 branch 2 times, most recently from 88bb7d7 to 2e07135 on July 26, 2023 at 23:03
@vfdev-5
Collaborator

vfdev-5 commented Jul 26, 2023

@vfdev-5 I have included your in-place optimizations as well, thank you.

@nlgranger by the way, we only need to update the v2 implementation; let's keep the v1 implementation as it is.

@nlgranger nlgranger force-pushed the fix_7753 branch 2 times, most recently from b0a89c3 to bc9142f on July 27, 2023 at 15:29
Collaborator

@vfdev-5 vfdev-5 left a comment

Thanks for working on this PR!

However, once again, please revert all changes in _functional_tensor.py (https://github.com/pytorch/vision/pull/7754/files#r1276480612) as transforms v2 will replace v1 soon.

I left a few other comments to address.

@nlgranger
Contributor Author

These out-of-memory issues keep showing up during the tests. I don't see how the modifications I made could cause them.

@vfdev-5
Collaborator

vfdev-5 commented Aug 10, 2023

These out-of-memory issues keep showing up during the tests. I don't see how the modifications I made could cause them.

Probably it was flaky CI. We can update the branch once again to see if we can reproduce the OOM.

Concerning whether gather needs contiguous indices: I'm not sure about that. Let's see how it performs; it's also possible that PyTorch itself calls .contiguous() internally (to be checked in the code).
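
Whether gather actually needs a contiguous index can be checked empirically with a small sketch like this (the transpose is just one way to obtain a non-contiguous index tensor; shapes are illustrative):

```python
import torch

vpqt = torch.rand(8, 4, 16, 16)  # (N, 4, H, W) stacked candidates, as in fn_new2
# Build an index and make it non-contiguous via transpose.
idx = torch.randint(0, 4, (3, 8, 16, 16)).transpose(0, 1)  # (8, 3, 16, 16)

print(idx.is_contiguous())  # False
out = vpqt.gather(-3, idx)  # runs without an explicit .contiguous()
torch.testing.assert_close(out, vpqt.gather(-3, idx.contiguous()))
```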

@vfdev-5
Collaborator

vfdev-5 commented Aug 10, 2023

@nlgranger I re-ran my benchmark on your latest commit vs. an implementation without the contiguous call and using a non-blocking copy (fn_v2_plus):

[------------- HSV -> RGB cpu torch.float32 -------------]          
                         |  fn_v1  |  fn_v2  |  fn_v2_plus
1 threads: -----------------------------------------------          
      (3, 400, 300)      |    7.8  |    3.2  |      3.2   
      (3, 640, 768)      |   49.9  |   14.8  |     14.6   
      (16, 3, 400, 300)  |  369.6  |   87.5  |     75.9   
                                                                    
Times are in milliseconds (ms).                                                                                                          
                                                                    
[------------- HSV -> RGB cuda torch.float32 -------------]                                                                              
                         |  fn_v1   |  fn_v2  |  fn_v2_plus
1 threads: ------------------------------------------------
      (3, 400, 300)      |   516.2  |  276.1  |    273.8
      (3, 640, 768)      |   502.0  |  276.1  |    272.7   
      (16, 3, 400, 300)  |  2360.4  |  416.4  |    298.8        
                                                                    
Times are in microseconds (us).

How about using the fn_v2_plus implementation instead?

@nlgranger
Contributor Author

Sure, but the tests won't pass anyway and I have no clue why.

@vfdev-5
Collaborator

vfdev-5 commented Aug 10, 2023

Sure, but the tests won't pass anyway and I have no clue why.

Which tests specifically are you talking about? Currently, we have a lot of flaky failing tests...

@nlgranger
Contributor Author

The slower speed is probably due to the .contiguous() call I added. It's not necessary, but I thought it could be related to the OOM problems during the tests.

btw, I found where the non_blocking argument comes into play (here), and according to the CUDA doc the driver will copy the data to pinned memory to make the transfer asynchronous (but it might sync with the stream first?). According to your benchmark, this seems faster than explicitly allocating in pinned memory via the PyTorch API.
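
For illustration, the two ways of moving the lookup table to the GPU being discussed are roughly the following (a sketch, not the exact code from the PR):

```python
import torch

device = torch.device("cuda")

# vpqt -> rgb lookup table, created on the host.
select = torch.tensor(
    [[0, 2, 1, 1, 3, 0], [3, 0, 0, 2, 1, 1], [1, 1, 3, 0, 0, 2]],
    dtype=torch.long,
)

# Variant 1: plain host tensor moved with non_blocking=True; how the copy is
# staged for pageable memory is left to PyTorch / the CUDA driver.
select_a = select.to(device, non_blocking=True)

# Variant 2: pin the host memory explicitly first, then copy asynchronously.
select_b = select.pin_memory().to(device, non_blocking=True)
```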

@vfdev-5
Collaborator

vfdev-5 commented Aug 11, 2023

@nlgranger can you please update the code to rerun the CI and see if there are any OOMs? If you are busy, would you mind if I push to the branch to move things forward?

Collaborator

@vfdev-5 vfdev-5 left a comment

LGTM, thanks @nlgranger!

Let's see if there are any related OOMs in the CI -> No OOMs seen on CI. Merging.

@vfdev-5 vfdev-5 merged commit 6c44ceb into pytorch:main Aug 15, 2023
@github-actions

Hey @vfdev-5!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@vfdev-5 vfdev-5 added module: transforms Perf For performance improvements labels Aug 15, 2023
facebook-github-bot pushed a commit that referenced this pull request Aug 25, 2023
Summary: Co-authored-by: vfdev <vfdev.5@gmail.com>

Reviewed By: matteobettini

Differential Revision: D48642248

fbshipit-source-id: 24f789cb0ddfb5810c423e4f3ef9e3d28cc2a8a6
Labels
cla signed module: transforms Perf For performance improvements
Development

Successfully merging this pull request may close these issues.

hsv2rgb is slow due to masking followed by einsum
4 participants