
Replace stack/mask/reduce by indexing in _hsv2rgb #7754

Merged
merged 5 commits into pytorch:main from fix_7753 on Aug 15, 2023

Conversation

nlgranger
Contributor

@nlgranger nlgranger commented Jul 24, 2023

Fixes #7753

cc @vfdev-5
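
For context, a toy illustration of the kind of change this PR makes: instead of materializing all candidate outputs and reducing them with a one-hot mask, the right candidate is picked directly by index. This is a sketch of the general pattern only, not the actual torchvision code; the shapes and names are made up.

```python
import torch

# Toy selection problem: for each pixel, pick one of 6 candidate values
# according to an integer case index in [0, 6).
candidates = torch.rand(6, 4, 5)       # (num_cases, H, W)
case = torch.randint(0, 6, (4, 5))     # per-pixel case index

# mask/reduce pattern: build a one-hot mask over the 6 cases,
# multiply, and sum away the case dimension (allocates (6, H, W) temporaries).
mask = torch.nn.functional.one_hot(case, num_classes=6).permute(2, 0, 1)
out_mask = (candidates * mask).sum(dim=0)

# indexing/gather pattern: look up the selected candidate directly.
out_gather = candidates.gather(0, case.unsqueeze(0)).squeeze(0)

assert torch.equal(out_mask, out_gather)
```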

@pytorch-bot

pytorch-bot bot commented Jul 24, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/7754

Note: Links to docs will display an error until the docs builds have been completed.

❌ 33 New Failures

As of commit baa0081:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@nlgranger
Contributor Author

nlgranger commented Jul 24, 2023

Not sure what is going on with the out-of-memory errors in CI. The PR actually reduces peak memory usage according to:

import torch
from torchvision.transforms._functional_tensor import _rgb2hsv, _hsv2rgb

rgb = torch.rand((16, 3, 704, 1024), device='cuda')
hsv = _rgb2hsv(rgb)
torch.cuda.reset_peak_memory_stats()
_hsv2rgb(hsv)
print(torch.cuda.max_memory_allocated() // (1024 ** 2))

I cannot reproduce locally.

@NicolasHug
Member

Thanks for the PR @nlgranger

This incurs a lot of useless memory and CPU work and could be faster

Can you please provide some simple benchmarks illustrating the gains in memory and perf?

@nlgranger
Contributor Author

nlgranger commented Jul 24, 2023

Using the following code:

import time

import torch
from torchvision.transforms._functional_tensor import _rgb2hsv, _hsv2rgb

device = "cuda"
shapes = [
    (3, 320, 320),
    (8, 3, 320, 320),
    (32, 3, 320, 320),
    (3, 640, 768),
    (8, 3, 640, 768),
    (32, 3, 640, 768),
]

for s in shapes:
    rgb = torch.rand(s, device=device)
    hsv = _rgb2hsv(rgb)

    durations = []
    peak_mem = []

    # warm-up iterations (timings discarded)
    for _ in range(10):
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        t0 = time.monotonic()
        _hsv2rgb(hsv)
        torch.cuda.synchronize()
        t1 = time.monotonic()

    # timed iterations
    for _ in range(100):
        torch.cuda.synchronize()
        t0 = time.monotonic()
        _hsv2rgb(hsv)
        torch.cuda.synchronize()
        t1 = time.monotonic()

        durations.append(t1 - t0)
        if device == "cuda":
            peak_mem.append(torch.cuda.max_memory_allocated())

    if device == "cuda":
        print(f"{str(s):20s} : {sum(durations) * 10:7.2f}ms  {sum(peak_mem) / 100 / 1024 ** 2:7.2f}MB")
    else:
        print(f"{str(s):20s} : {sum(durations) / len(durations) * 1000:7.2f}ms")

On GPU (Quadro T1000 mobile):

| shape | time (ms) | time before (ms) | peak mem (MB) | peak mem before (MB) |
|---|---|---|---|---|
| (3, 320, 320) | 0.59 | 8.59 | 1.84 | 40.12 |
| (8, 3, 320, 320) | 2.81 | 68.75 | 13.13 | 262.81 |
| (32, 3, 320, 320) | 10.85 | 277.50 | 52.22 | 1034.38 |
| (3, 640, 768) | 1.73 | 43.88 | 8.08 | 161.56 |
| (8, 3, 640, 768) | 12.99 | 334.00 | 62.94 | 1231.62 |
| (32, 3, 640, 768) | 51.62 | OOM | 1320.00 | OOM |

On CPU (i7-9850H):

| shape | time (ms) | time before (ms) |
|---|---|---|
| (3, 320, 320) | 2.49 | 4.83 |
| (8, 3, 320, 320) | 18.73 | 62.24 |
| (32, 3, 320, 320) | 85.25 | 322.25 |
| (3, 640, 768) | 10.78 | 21.25 |
| (8, 3, 640, 768) | 104.61 | 389.26 |
| (32, 3, 640, 768) | 553.01 | 1541.78 |

@vfdev-5
Collaborator

vfdev-5 commented Jul 26, 2023

I ran my benchmark script and your script with more iterations and see the following:

fn_new2 is the hsv2rgb function from the PR with an improvement
fn_new is the hsv2rgb function from the PR
fn_v2 is the reference function from v2 (not the v1 function in _functional_tensor.py).

$ python -u bench_hsv2rgb_pr_7753.py 

[------------ HSV -> RGB cpu torch.float32 ------------]
                         |  fn_new2  |  fn_new  |  fn_v2
1 threads: ---------------------------------------------
      (3, 400, 300)      |     3.3   |    4.2   |    6.3
      (3, 640, 768)      |    14.8   |   19.5   |   38.2
      (16, 3, 400, 300)  |    75.5   |   82.2   |  183.3

Times are in milliseconds (ms).

[------------ HSV -> RGB cuda torch.float32 ------------]
                         |  fn_new2  |  fn_new  |  fn_v2 
1 threads: ----------------------------------------------
      (3, 400, 300)      |   269.5   |  444.2   |   285.2
      (3, 640, 768)      |   268.8   |  443.1   |   284.1
      (16, 3, 400, 300)  |   330.0   |  466.6   |  1155.5

Times are in microseconds (us).
$ python -u bench_hsv2rgb_pr_7753_repro.py 
v2, cuda - (3, 320, 320)        :    0.33ms    20.12MB
new, cuda - (3, 320, 320)        :    0.48ms     8.59MB
new2, cuda - (3, 320, 320)        :    0.31ms     9.38MB
v2, cuda - (8, 3, 320, 320)     :    0.51ms   160.94MB
new, cuda - (8, 3, 320, 320)     :    0.50ms    68.75MB
new2, cuda - (8, 3, 320, 320)     :    0.33ms    75.00MB
v2, cuda - (3, 640, 768)        :    0.36ms    97.34MB
new, cuda - (3, 640, 768)        :    0.48ms    42.50MB
new2, cuda - (3, 640, 768)        :    0.31ms    46.25MB
v2, cuda - (8, 3, 640, 768)     :    2.50ms   779.50MB
new, cuda - (8, 3, 640, 768)     :    1.13ms   338.00MB
new2, cuda - (8, 3, 640, 768)     :    0.81ms   367.00MB

Source
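
The collapsed "Source" script above is not reproduced here. The table layout matches torch.utils.benchmark output, so a comparison of this kind can be generated with something along these lines (a sketch; the shapes and the set of functions to compare are assumptions):

```python
import torch
import torch.utils.benchmark as benchmark
from torchvision.transforms._functional_tensor import _hsv2rgb

# Candidate implementations to compare; fn_new / fn_new2 from this thread
# would be added alongside the existing function.
candidates = {"fn_v1": _hsv2rgb}

results = []
for device in ["cpu"] + (["cuda"] if torch.cuda.is_available() else []):
    for shape in [(3, 400, 300), (3, 640, 768), (16, 3, 400, 300)]:
        hsv = torch.rand(shape, device=device)
        for name, fn in candidates.items():
            results.append(
                benchmark.Timer(
                    stmt="fn(hsv)",
                    globals={"fn": fn, "hsv": hsv},
                    label=f"HSV -> RGB {device} {hsv.dtype}",
                    sub_label=str(shape),
                    description=name,
                ).blocked_autorange(min_run_time=1)
            )

benchmark.Compare(results).print()
```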

I propose to use this improved function (a mix of the v2 function plus indexing and gather):

import torch
from torch import Tensor


def fn_new2(img: Tensor) -> Tensor:
    h, s, v = img.unbind(dim=-3)
    h6 = h.mul(6)
    i = torch.floor(h6)
    f = h6.sub_(i)
    i = i.to(dtype=torch.int32)

    sxf = s * f
    one_minus_s = 1.0 - s
    q = (1.0 - sxf).mul_(v).clamp_(0.0, 1.0)
    t = sxf.add_(one_minus_s).mul_(v).clamp_(0.0, 1.0)
    p = one_minus_s.mul_(v).clamp_(0.0, 1.0)
    i.remainder_(6)

    vpqt = torch.stack((v, p, q, t), dim=-3)

    # vpqt -> rgb mapping based on i
    select = torch.tensor(
        [[0, 2, 1, 1, 3, 0], [3, 0, 0, 2, 1, 1], [1, 1, 3, 0, 0, 2]],
        dtype=torch.long, device=img.device
    )

    select = select[:, i]
    if select.ndim > 3:
        select = select.transpose(0, 1)
    return vpqt.gather(-3, select)

@nlgranger what do you think?
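
As a quick sanity check of the proposed function (a sketch, assuming fn_new2 as defined above is in scope), it can be compared against the existing implementation in _functional_tensor.py; the two should agree up to floating-point rounding:

```python
import torch
from torchvision.transforms._functional_tensor import _rgb2hsv, _hsv2rgb

for shape in [(3, 64, 64), (8, 3, 32, 48)]:
    # Round-trip random RGB images to get valid HSV inputs, then compare
    # the two HSV -> RGB implementations on the same tensor.
    hsv = _rgb2hsv(torch.rand(shape))
    torch.testing.assert_close(fn_new2(hsv), _hsv2rgb(hsv))
```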

@nlgranger nlgranger force-pushed the fix_7753 branch 2 times, most recently from 27bb519 to 1fd4730 on July 26, 2023 at 22:58
@nlgranger
Contributor Author

@vfdev-5 I have included your in-place optimizations as well, thank you.

@nlgranger nlgranger force-pushed the fix_7753 branch 2 times, most recently from 88bb7d7 to 2e07135 on July 26, 2023 at 23:03
@vfdev-5
Collaborator

vfdev-5 commented Jul 26, 2023

@vfdev-5 I have included your in-place optimizations as well, thank you.

@nlgranger by the way, we only need to update the v2 implementation; let's keep the v1 implementation as it is.

@nlgranger nlgranger force-pushed the fix_7753 branch 2 times, most recently from b0a89c3 to bc9142f on July 27, 2023 at 15:29
Collaborator

@vfdev-5 vfdev-5 left a comment

Thanks for working on this PR!

However, once again, please revert all changes in _functional_tensor.py (https://github.com/pytorch/vision/pull/7754/files#r1276480612) as transforms v2 will replace v1 soon.

I left a few other comments to address.

@nlgranger
Contributor Author

These out-of-memory issues keep showing up during the tests. I don't see how the modifications I made could cause them.

@vfdev-5
Collaborator

vfdev-5 commented Aug 10, 2023

These out-of-memory issues keep showing up during the tests. I don't see how the modifications I made could cause them.

Probably it was flaky CI. We can update the branch once again to see if we can reproduce the OOM.

Concerning whether gather needs contiguous indices: I'm not sure about that. Let's see how it performs; it's also possible that PyTorch itself calls .contiguous() internally (to be checked in the code).
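
Whether gather actually needs a contiguous index can be checked empirically with a small sketch like this (the transpose is just one way to obtain a non-contiguous index tensor; shapes are illustrative):

```python
import torch

vpqt = torch.rand(8, 4, 16, 16)  # (N, 4, H, W) stacked candidates, as in fn_new2
# Build an index and make it non-contiguous via transpose.
idx = torch.randint(0, 4, (3, 8, 16, 16)).transpose(0, 1)  # (8, 3, 16, 16)

print(idx.is_contiguous())  # False
out = vpqt.gather(-3, idx)  # runs without an explicit .contiguous()
torch.testing.assert_close(out, vpqt.gather(-3, idx.contiguous()))
```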

@vfdev-5
Collaborator

vfdev-5 commented Aug 10, 2023

@nlgranger I re-ran my benchmark on your latest commit vs. an implementation without the contiguous call and using a non-blocking copy (fn_v2_plus):

[------------- HSV -> RGB cpu torch.float32 -------------]          
                         |  fn_v1  |  fn_v2  |  fn_v2_plus
1 threads: -----------------------------------------------          
      (3, 400, 300)      |    7.8  |    3.2  |      3.2   
      (3, 640, 768)      |   49.9  |   14.8  |     14.6   
      (16, 3, 400, 300)  |  369.6  |   87.5  |     75.9   
                                                                    
Times are in milliseconds (ms).                                                                                                          
                                                                    
[------------- HSV -> RGB cuda torch.float32 -------------]                                                                              
                         |  fn_v1   |  fn_v2  |  fn_v2_plus
1 threads: ------------------------------------------------
      (3, 400, 300)      |   516.2  |  276.1  |    273.8
      (3, 640, 768)      |   502.0  |  276.1  |    272.7   
      (16, 3, 400, 300)  |  2360.4  |  416.4  |    298.8        
                                                                    
Times are in microseconds (us).

How about using the fn_v2_plus implementation instead?

@nlgranger
Contributor Author

Sure, but the tests won't pass anyway and I have no clue why.

@vfdev-5
Collaborator

vfdev-5 commented Aug 10, 2023

Sure, but the tests won't pass anyway and I have no clue why.

Which tests specifically are you talking about? Currently, we have a lot of flaky failing tests...

@nlgranger
Contributor Author

The slower speed is probably due to the .contiguous() call I added. It's not necessary, but I thought it could be related to the OOM problems during the tests.

btw, I found where the non_blocking argument comes into play (here), and according to the CUDA doc the driver will copy the data to pinned memory to make the transfer asynchronous (but it might sync with the stream first?). According to your benchmark, this seems faster than explicitly allocating in pinned memory via the PyTorch API.
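
For illustration, the two ways of moving the lookup table to the GPU being discussed are roughly the following (a sketch, not the exact code from the PR):

```python
import torch

device = torch.device("cuda")

# vpqt -> rgb lookup table, created on the host.
select = torch.tensor(
    [[0, 2, 1, 1, 3, 0], [3, 0, 0, 2, 1, 1], [1, 1, 3, 0, 0, 2]],
    dtype=torch.long,
)

# Variant 1: plain host tensor moved with non_blocking=True; how the copy is
# staged for pageable memory is left to PyTorch / the CUDA driver.
select_a = select.to(device, non_blocking=True)

# Variant 2: pin the host memory explicitly first, then copy asynchronously.
select_b = select.pin_memory().to(device, non_blocking=True)
```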

@vfdev-5
Collaborator

vfdev-5 commented Aug 11, 2023

@nlgranger can you please update the code to rerun the CI and see if there are any OOMs? If you are busy, would you mind if I push to the branch to move things forward?

Collaborator

@vfdev-5 vfdev-5 left a comment

LGTM, thanks @nlgranger!

Let's see if there are any related OOMs in the CI -> No OOMs seen on CI. Merging.

@vfdev-5 vfdev-5 merged commit 6c44ceb into pytorch:main Aug 15, 2023
@github-actions

Hey @vfdev-5!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@vfdev-5 vfdev-5 added module: transforms Perf For performance improvements labels Aug 15, 2023
facebook-github-bot pushed a commit that referenced this pull request Aug 25, 2023
Summary: Co-authored-by: vfdev <vfdev.5@gmail.com>

Reviewed By: matteobettini

Differential Revision: D48642248

fbshipit-source-id: 24f789cb0ddfb5810c423e4f3ef9e3d28cc2a8a6
Labels
cla signed module: transforms Perf For performance improvements
Development

Successfully merging this pull request may close these issues.

hsv2rgb is slow due to masking followed by einsum
4 participants