
Add XPU type for work-around -inf mask causing sdpa NaN issue in modeling files #35647

Merged: 4 commits into huggingface:main on Feb 5, 2025

Conversation

Liangliang-Ma (Contributor)

Recently when we run transformers + qlora doing fine-tuning, we found NaN produced by torch.nn.functional.scaled_dot_product_attention.
Given that xpu has similar implement of fused sdpa, we would like to follow pytorch/transformers tmp solution to modify the mask here, which could solve the issue on XPU device too.
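
For context, the modeling files already work around this on CUDA: when SDPA is used, query rows of the additive mask that sit entirely at the dtype minimum (and thus become -inf under bf16) are reset to 0 via AttentionMaskConverter._unmask_unattended, and this PR extends that device check to cover XPU as well. A rough, self-contained sketch of the idea (not the exact transformers code; the mask and the device handling below are illustrative only):

import torch

def unmask_fully_masked_rows(additive_mask: torch.Tensor) -> torch.Tensor:
    # Query rows whose every key position is masked (value == dtype min, which
    # becomes -inf after a bf16 cast) make softmax return NaN inside the fused
    # SDPA kernel. Resetting those rows to 0 keeps the kernel finite; they
    # correspond to padding queries, so their outputs are ignored downstream.
    min_dtype = torch.finfo(additive_mask.dtype).min
    fully_masked = (additive_mask == min_dtype).all(dim=-1, keepdim=True)
    return additive_mask.masked_fill(fully_masked, 0.0)

# Illustrative 1x1x3x3 additive mask with the first query row fully masked.
mask = torch.zeros(1, 1, 3, 3)
mask[..., 0, :] = torch.finfo(mask.dtype).min
# The modeling files apply the workaround only on devices whose fused SDPA
# kernel shows the issue; before this PR the check covered only "cuda".
if mask.device.type in ("cuda", "xpu", "cpu"):  # "cpu" included only for this demo
    mask = unmask_fully_masked_rows(mask)
print(mask)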

@Rocketknight1 (Member)

cc @muellerzr @SunMarc because I'm not sure who to ping for xpu! Feel free to ping someone else if needed

@SunMarc (Member) left a comment


Could you check if you get the same issue as here with XPU? Also, does this PR fix your issue? cc @ydshieh as you might know something

@Liangliang-Ma (Contributor, Author)

Could you check if you get the same issue as here with XPU? Also, does this PR fix your issue? cc @ydshieh as you might know something

Yes, we used XPU for fine-tuning and hit the same issue. With this PR, the issue is fixed.

@ydshieh (Collaborator)

ydshieh commented Jan 14, 2025

I can only say that this seems reasonable to me but not more than that.

@Liangliang-Ma It would be better to provide a tiny code snippet to demonstrate the issue, for example providing a mask with one row fully masked and passing it to F.scaled_dot_product_attention to show we do get NaN (on XPU).

@Liangliang-Ma (Contributor, Author)

Liangliang-Ma commented Jan 16, 2025

I can only say that this seems reasonable to me but not more than that.

@Liangliang-Ma It would be better to provide a tiny code snippet to demonstrate the issue, for example providing a mask with one row fully masked and passing it to F.scaled_dot_product_attention to show we do get NaN (on XPU).

import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the XPU backend)
from torch.nn import functional as F

torch.manual_seed(0)

a = 3  # sequence length
b = 4  # head dimension

q = torch.randn(size=(1, 1, a, b))
k = torch.randn(size=(1, 1, a, b))
v = torch.randn(size=(1, 1, a, b))

def check(q, k, v, device):

    q = q.to(device)
    k = k.to(device)
    v = v.to(device)

    # First query row is fully masked with the float32 minimum; under bf16
    # autocast on XPU this value overflows to -inf.
    neg_value = torch.finfo(q.dtype).min
    mask = [[neg_value, neg_value, neg_value], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    mask = torch.tensor([[mask]]).to(device)

    with torch.amp.autocast("xpu", dtype=torch.bfloat16):
        o = F.scaled_dot_product_attention(q, k, v, mask, 0.0, is_causal=False)
    print(o)

check(q, k, v, "cpu")
check(q, k, v, "xpu")

Thanks @ydshieh, I modified your test and got the NaN result like this:

tensor([[[[ 0.1210,  0.3627, -0.9969, -0.6149],
          [ 0.1295,  0.4572, -1.0491, -0.6166],
          [ 0.1095,  0.3819, -0.7369, -0.8267]]]])
tensor([[[[    nan,     nan,     nan,     nan],
          [ 0.1299,  0.4590, -1.0469, -0.6172],
          [ 0.1094,  0.3809, -0.7344, -0.8281]]]], device='xpu:0',
       dtype=torch.bfloat16)

I found that this issue is caused by casting torch.finfo(torch.float).min to bfloat16, which results in a row of -inf.
That row of -inf makes the SDPA kernel output NaN.
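
A minimal check of just that cast, assuming only stock PyTorch (no XPU needed for this part):

import torch

fp32_min = torch.finfo(torch.float32).min  # about -3.4028e+38
print(torch.tensor(fp32_min).to(torch.bfloat16))
# tensor(-inf, dtype=torch.bfloat16): the magnitude is above the largest finite
# bfloat16 value (~3.39e+38), so the conversion rounds to -inf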

@Liangliang-Ma (Contributor, Author)

@Rocketknight1 Hi, may I know whether the CI workflow failures are expected? It seems the generated modeling code differs from what I modified. Thanks!

@SunMarc (Member)

SunMarc commented Jan 16, 2025

Hey @Liangliang-Ma, you need to modify the modular files for mistral and bamba, since the modeling files are generated automatically from them. This should fix the CI issue.

@SunMarc (Member) left a comment

Thanks for the snippet, LGTM with the fix I proposed for the CI


@Liangliang-Ma (Contributor, Author)

@SunMarc Thanks. The CI passed with the fix.

@SunMarc (Member)

SunMarc commented Jan 17, 2025

Gentle ping @ArthurZucker, as this concerns the attention class.

@Liangliang-Ma (Contributor, Author)

@SunMarc @ArthurZucker Soft reminder of this PR.

@Liangliang-Ma (Contributor, Author)

gentle ping @SunMarc @ArthurZucker again.

@SunMarc (Member)

SunMarc commented Feb 5, 2025

Merging this as it only concerns the XPU workflow.

@SunMarc merged commit 315a9f4 into huggingface:main on Feb 5, 2025
17 checks passed
MekkCyber pushed a commit that referenced this pull request Feb 7, 2025
Add XPU type for work-around -inf mask causing sdpa NaN issue in modeling files (#35647)

* add xpu for unmask

* change modular for generated matching

* add lastest modeling for helium
@ArthurZucker (Collaborator)

Sorry @Liangliang-Ma! And thanks for the fix 🤗

@ArthurZucker removed the request for review from zucchini-nlp on February 13, 2025 08:51
elvircrn pushed a commit to elvircrn/transformers that referenced this pull request Feb 13, 2025
sbucaille pushed a commit to sbucaille/transformers that referenced this pull request Feb 16, 2025