
xpu: Support new PyTorch XPU backend (>=2.4) #31237

Closed
dvrogozh opened this issue Jun 4, 2024 · 4 comments · Fixed by huggingface/accelerate#2825 or #31238


dvrogozh commented Jun 4, 2024

The XPU backend is a new PyTorch backend that aims to enable hardware acceleration on Intel GPUs via SYCL. It is being actively worked on at the moment: the first set of patches has landed in PyTorch upstream, and support is described in the documentation [1]. An initial version should be available starting from PyTorch 2.4, with the 2.5 release targeted as the point of maturity. The current focus of the effort is on the functional side: identifying and closing API gaps, if any, and populating the set of offloadable aten operations. Some models and scenarios can already be tried out, with the caveat of low performance due to CPU fallbacks for some operations. Overall, [2] outlines the upstreaming process for the XPU backend. Note also some relevant XPU-related issues opened on the PyTorch side [3].

Previously, Intel GPU support in PyTorch was only available via the Intel Extension for PyTorch (IPEX). Effectively, it is this support that is now being upstreamed to stock PyTorch.

Here I would like to request that Hugging Face enable the stock PyTorch XPU backend. Considering that IPEX is already enabled in the Hugging Face repos, it should be fairly trivial to extend that support to cover the XPU backend, since the latter reuses the XPU device and operation naming from the IPEX era.
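Since both IPEX and the stock backend expose the same `xpu` device string, the dispatch order used in the PRs can be sketched as below. This is an illustrative, hypothetical helper (`select_xpu_flavor` is not actual accelerate code); it only models the logic of trying IPEX first, then falling back to the stock `torch.xpu` backend.

```python
def select_xpu_flavor(ipex_importable: bool, torch_has_xpu: bool) -> str:
    """Illustrative dispatch order (hypothetical helper, not actual
    accelerate code): try IPEX first; fall back to the stock PyTorch XPU
    backend (torch >= 2.4); otherwise report that XPU is unavailable.
    Both flavors expose the same 'xpu' device string, which is what makes
    extending the existing IPEX support fairly trivial."""
    if ipex_importable:
        # intel_extension_for_pytorch provides torch.xpu when imported
        return "ipex"
    if torch_has_xpu:
        # torch.xpu is built into stock PyTorch starting from 2.4
        return "stock"
    return "unavailable"
```

Either way, downstream code can keep addressing the device as plain `"xpu"`.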

I have prototyped XPU backend support in Hugging Face. Please check these PRs:

[1] https://github.com/pytorch/pytorch?tab=readme-ov-file#intel-gpu-support
[2] pytorch/pytorch#114842
[3] https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3Aopen+xpu+in%3Atitle

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 @sywangyi @kding1

dvrogozh added a commit to dvrogozh/accelerate that referenced this issue Jun 4, 2024
Fixes: huggingface/transformers#31237

XPU backend is available in the stock PyTorch starting from
version 2.4, see [1]. This commit extends huggingface accelerate
to support XPU from both IPEX and the stock pytorch. IPEX is being
tried first.

See: pytorch/pytorch#114842
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>

dvrogozh commented Jun 4, 2024

As of pytorch/pytorch@21144ce, huggingface/accelerate@b7fa2fa, and 485d913, with the above PRs applied:

Below are my results from trying out the Hugging Face examples (https://github.com/huggingface/transformers/tree/main/examples/pytorch) with the XPU backend on ATS-M (which currently requires export OverrideDefaultFP64Settings=1 && export IGC_EnableDPEmulation=1). I tried all the samples except two: contrastive-image-text and semantic-segmentation.
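For context, a run on ATS-M looks roughly like the following. This is a hedged sketch: the two exports are the ones quoted above, while the commented-out example invocation (script path, model, and dataset names) is an illustrative assumption following the examples README, and actually running it requires an XPU-enabled PyTorch build and an Intel GPU.

```shell
# FP64 emulation is currently required on ATS-M for these examples.
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1

# Illustrative invocation (model/dataset names are assumptions);
# requires PyTorch >= 2.4 with the XPU backend and an Intel GPU:
# python examples/pytorch/image-classification/run_image_classification.py \
#   --model_name_or_path google/vit-base-patch16-224-in21k \
#   --dataset_name beans --output_dir /tmp/vit-xpu --do_train
```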

Overall, the Hugging Face examples can run on the XPU backend, with low performance at the moment due to a range of operations falling back to CPU. Effectively, one of the goals was to identify these ops for future prioritization. The only example that failed due to a missing uAPI is speech-pretraining. See details below.

The examples covered were: image-classification, image-detection, translation, token-classification, text-classification, summarization, instance-segmentation, multiple-choice, and question-answering. The CPU fallbacks observed, per aten op, with the models that hit them (the leading/trailing underscores of the in-place `_foreach_*_` ops were eaten by markdown italics in the original and are restored here):

| aten op | fallback | models hitting it |
| --- | --- | --- |
| aten::_cdist_forward | explicit | DETR |
| aten::_foreach_addcdiv_.ScalarList | manual | ViT, DETR, OPUS_MT, BERT, MRPC |
| aten::_foreach_addcmul_.Scalar | manual | ViT, DETR, OPUS_MT, BERT |
| aten::_foreach_div_.ScalarList | manual | ViT, DETR, OPUS_MT, BERT, MRPC |
| aten::_foreach_lerp_.Scalar | manual | ViT, DETR, OPUS_MT, BERT |
| aten::_foreach_mul_.Scalar | manual | ViT, DETR, OPUS_MT, BERT, MRPC |
| aten::_foreach_mul_.Tensor | manual | ViT, DETR, OPUS_MT, BERT, MRPC |
| aten::_foreach_norm.Scalar | manual | ViT, DETR, OPUS_MT, BERT, MRPC |
| aten::_foreach_sqrt | manual | ViT, DETR, OPUS_MT, BERT, MRPC |
| aten::addcdiv.out | explicit | SWIN, ROBERTA |
| aten::addcmul.out | explicit | GOOGLE-T5, SWIN, ROBERTA |
| aten::all.all_out | explicit | DETR, BERT, MRPC |
| aten::floor.out | explicit | SWIN |
| aten::grid_sampler_2d_backward | explicit | SWIN |
| aten::lerp.Scalar_out | explicit | GOOGLE-T5, SWIN, ROBERTA |
| aten::linalg_vector_norm.out | explicit | ViT, OPUS_MT, MRPC, GOOGLE-T5, SWIN, ROBERTA |
| aten::linspace.out | explicit | SWIN |
| aten::native_batch_norm | explicit | SWIN |
| aten::native_group_norm_backward | explicit | SWIN |
| aten::nll_loss2d_backward | manual | DETR, SWIN |
| aten::nll_loss2d_forward | manual | DETR, SWIN |
| aten::max_pool2d_with_indices.out | explicit | DETR |
| aten::prod.int_out | explicit | SWIN |
| aten::roll | explicit | SWIN |
| aten::sgn.out | explicit | DETR |
| aten::sigmoid.out | explicit | DETR, OPUS_MT, SWIN |
| aten::sigmoid_backward.grad_input | explicit | DETR, SWIN |
| aten::silu.out | explicit | OPUS_MT |
| aten::topk.values | explicit | SWIN |
| aten::upsample_bilinear2d.out | explicit | SWIN |
| aten::upsample_bilinear2d_backward.grad_input | explicit | SWIN |
| aten::upsample_nearest2d.out | explicit | DETR |

dvrogozh added a commit to dvrogozh/accelerate that referenced this issue Jun 5, 2024

amyeroberts commented Jun 6, 2024

@dvrogozh Thank you for such an extensive write-up, diving into how it affects the library functionality, and opening draft PRs for enabling this ❤️

It's OK if there isn't full coverage of operations: we support the mps backend despite it not having full coverage yet either. It's great that you've investigated, so we have an idea of how much the fallbacks can slow things down.

Overall, I don't see any reason why this shouldn't be something we enable. Similar to mps, though, it's probably not something we'll test on our side at the moment.

cc @ydshieh @muellerzr

@muellerzr

Yep, agreed :) In terms of which PRs to merge when, we are working towards getting this into accelerate first, then the Trainer.

dvrogozh added a commit to dvrogozh/accelerate that referenced this issue Jun 7, 2024
dvrogozh added a commit to dvrogozh/accelerate that referenced this issue Jun 10, 2024

dvrogozh commented Jun 12, 2024

I filed one more issue affecting some (not all) examples and tests: the CUDA path is sometimes wrongly taken on loss.backward():

dvrogozh added a commit to dvrogozh/accelerate that referenced this issue Jun 13, 2024
ydshieh pushed a commit to dvrogozh/transformers that referenced this issue Jun 14, 2024
ydshieh pushed a commit that referenced this issue Jun 14, 2024
* xpu: support xpu backend from stock pytorch (>=2.4)

Fixes: #31237

XPU backend is available in the stock PyTorch starting from
version 2.4, see [1]. This commit extends huggingface transformers
to support XPU from both IPEX and the stock pytorch. IPEX is being
tried first.

See: pytorch/pytorch#114842
Requires: huggingface/accelerate#2825
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>

* xpu: enable gpt2 and decision_transformer tests for xpu pytorch backend

Note that running xpu tests requires TRANSFORMERS_TEST_DEVICE_SPEC=spec.py
passed to the test runner:

  import torch
  DEVICE_NAME = 'xpu'
  MANUAL_SEED_FN = torch.xpu.manual_seed
  EMPTY_CACHE_FN = torch.xpu.empty_cache
  DEVICE_COUNT_FN = torch.xpu.device_count

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>

---------

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
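The device-spec mechanism quoted in the commit message above can be exercised as sketched below. This is a hedged sketch: the spec contents are copied from the commit message, while the commented-out pytest target is an illustrative assumption, and actually running the tests requires an XPU-enabled PyTorch build.

```shell
# Write out the device spec quoted in the commit message above.
cat > spec.py <<'EOF'
import torch
DEVICE_NAME = 'xpu'
MANUAL_SEED_FN = torch.xpu.manual_seed
EMPTY_CACHE_FN = torch.xpu.empty_cache
DEVICE_COUNT_FN = torch.xpu.device_count
EOF

# Point the transformers test runner at the spec.
export TRANSFORMERS_TEST_DEVICE_SPEC=spec.py

# Illustrative test target (an assumption); needs XPU-enabled PyTorch:
# python -m pytest tests/models/gpt2
```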