
Conftest: get_device_name() without device occupation #1826

Merged (2 commits) on Mar 7, 2025

Conversation

@dsmertin (Contributor) commented Mar 6, 2025

What does this PR do?

This change fixes a problem with the DeepSpeed test_examples.
torch_hpu.get_device_name() has been moved to a separate process because it occupies a device and does not release it.
For tests that run in a separate process and need all devices, some devices would otherwise be unavailable because the current pytest process still occupies one.
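The fix can be sketched as follows. This is a minimal illustration of the subprocess pattern, not the exact conftest code; `run_isolated` is a hypothetical helper name:

```python
import subprocess
import sys

def run_isolated(script: str) -> str:
    """Run a Python snippet in a fresh process and return its stripped stdout."""
    # Any device handle the snippet acquires is released when the child
    # process exits, so the calling (pytest) process never occupies a device.
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()

# The conftest change runs roughly this snippet in the child process:
#   import habana_frameworks.torch.hpu as torch_hpu
#   print(torch_hpu.get_device_name())
# Demonstration with a stand-in snippet that needs no Gaudi hardware:
print(run_isolated("print('gaudi2')"))  # → gaudi2
```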

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@dsmertin dsmertin requested a review from regisss as a code owner March 6, 2025 13:33
@12010486 (Contributor) commented Mar 6, 2025

@uartie, would you like to have a look? I'm reviewing as well

@uartie (Contributor) commented Mar 6, 2025

@dsmertin, @12010486, @regisss what if we use get_device_name() in the optimum/habana/utils.py module, instead? Would that prevent occupying a device?

conftest.py Outdated

result = subprocess.run(f"python -c '{script}'", shell=True, capture_output=True, text=True)

return result.stdout
@uartie (Contributor) commented Mar 6, 2025

return result.stdout.strip() since we check if not name by caller (i.e. empty string).


That is, on non-gaudi I get:

result = CompletedProcess(args="python -c 'import habana_frameworks.torch.hpu as torch_hpu\nprint(torch_hpu.get_device_name())'", returncode=0, stdout='\n', stderr='/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead\n  return isinstance(object, types.FunctionType)\n/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:135: UserWarning: Device not available\n  warnings.warn("Device not available")\n')
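The reviewer's point can be shown in isolation: on a machine without a device, the child process prints an empty name, so stdout is just "\n", which is truthy; only after stripping does the caller's `if not name` check behave as intended. A minimal sketch:

```python
# What the subprocess returns on a machine without a Gaudi device: the
# child prints an empty device name, so stdout is just a newline.
raw_stdout = "\n"

# Without .strip(), the caller's `if not name:` guard never fires,
# because a bare newline is a truthy string.
assert bool(raw_stdout) is True

# After .strip(), the empty string is falsy and the guard works.
assert bool(raw_stdout.strip()) is False
```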

@regisss (Collaborator) commented Mar 6, 2025

> @dsmertin, @12010486, @regisss what if we use get_device_name() in the optimum/habana/utils.py module, instead? Would that prevent occupying a device?

I think that should work yes 👍

@uartie (Contributor) commented Mar 6, 2025

> @dsmertin, @12010486, @regisss what if we use get_device_name() in the optimum/habana/utils.py module, instead? Would that prevent occupying a device?
>
> I think that should work yes 👍

@dsmertin please try this solution.

@uartie (Contributor) commented Mar 7, 2025

Meanwhile, it would be good to fix the source torch_hpu.get_device_name() so that it does only what its name implies. Otherwise, rename it to torch_hpu.get_device_name_and_occupy().

@dsmertin (Contributor, Author) commented Mar 7, 2025

@uartie @regisss
I've added get_device_name() from utils, as you suggested.

@12010486 (Contributor) commented Mar 7, 2025

Ok, I do have a concern about using get_device_name(): on Gaudi3 it will output gaudi3, while currently all the tests are either enabled for gaudi2 (and 3 by proxy) or also covering gaudi, a subset of those. @dsmertin, could you check how a small test runs on G3?

@12010486 (Contributor) commented Mar 7, 2025

> Ok, I do have a concern about using get_device_name(): on Gaudi3 it will output gaudi3, while currently all the tests are either enabled for gaudi2 (and 3 by proxy) or also covering gaudi, a subset of those. @dsmertin, could you check how a small test runs on G3?

Hold on, it seems now we are taking care of it correctly

@regisss (Collaborator) left a comment

LGTM!
I think you'll have to run make style too (there should be a blank line after the import).

@dsmertin force-pushed the pytest-device-occupation-fix branch from 2228ddc to bb1a89c on March 7, 2025 13:50
@dsmertin changed the title from "Conftest: torch_hpu.get_device_name() moved to a separate process" to "Conftest: get_device_name() without device occupation" on Mar 7, 2025
@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss regisss merged commit 66f3673 into huggingface:main Mar 7, 2025
3 of 4 checks passed
@uartie (Contributor) commented Mar 7, 2025

> Ok, I do have a concern about using get_device_name(): on Gaudi3 it will output gaudi3, while currently all the tests are either enabled for gaudi2 (and 3 by proxy) or also covering gaudi, a subset of those. @dsmertin, could you check how a small test runs on G3?
>
> Hold on, it seems now we are taking care of it correctly

On the "main" branch, we are still waiting for #1807 to be merged; it is necessary to effectively use --device gaudi3 or device auto-detection on G3. Until then, you will still want to use "--device gaudi2" or "GAUDI2_CI=1" on G3.

@hsubramony has already pulled #1807 into transformers_4_49 branch via #1824

@12010486 (Contributor) commented Mar 7, 2025

> Ok, I do have a concern about using get_device_name(): on Gaudi3 it will output gaudi3, while currently all the tests are either enabled for gaudi2 (and 3 by proxy) or also covering gaudi, a subset of those. @dsmertin, could you check how a small test runs on G3?
>
> Hold on, it seems now we are taking care of it correctly
>
> On the "main" branch, we are still waiting for #1807 to be merged; it is necessary to effectively use --device gaudi3 or device auto-detection on G3. Until then, you will still want to use "--device gaudi2" or "GAUDI2_CI=1" on G3.
>
> @hsubramony has already pulled #1807 into transformers_4_49 branch via #1824

Thanks for the explanation! At this point, it seems the PR got merged into the wrong branch, @regisss, or it should at least also be merged into transformers_4_49.

@regisss (Collaborator) commented Mar 7, 2025

It is already in the transformers_4_49 branch.
