
Conftest: get_device_name() without device occupation #1826

Merged (2 commits) on Mar 7, 2025

Conversation

@dsmertin (Contributor) commented Mar 6, 2025

What does this PR do?

This change fixes a problem with the DeepSpeed test_examples.
torch_hpu.get_device_name() has been moved to a separate process because it occupies a device and does not release it.
For tests that run in a separate process and need all devices, some devices would otherwise be unavailable because the current pytest process still occupies one.
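The fix can be sketched as follows. This is a minimal illustration of the subprocess pattern, not the exact conftest code; `run_isolated` is a hypothetical helper name:

```python
import subprocess
import sys

def run_isolated(script: str) -> str:
    """Run a Python snippet in a fresh process and return its stripped stdout."""
    # Any device handle the snippet acquires is released when the child
    # process exits, so the calling (pytest) process never occupies a device.
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()

# The conftest change runs roughly this snippet in the child process:
#   import habana_frameworks.torch.hpu as torch_hpu
#   print(torch_hpu.get_device_name())
# Demonstration with a stand-in snippet that needs no Gaudi hardware:
print(run_isolated("print('gaudi2')"))  # → gaudi2
```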

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@dsmertin dsmertin requested a review from regisss as a code owner March 6, 2025 13:33
@12010486 (Contributor) commented Mar 6, 2025

@uartie, would you like to have a look? I'm reviewing as well

@uartie (Contributor) commented Mar 6, 2025

@dsmertin, @12010486, @regisss what if we use get_device_name() in the optimum/habana/utils.py module, instead? Would that prevent occupying a device?

conftest.py Outdated

result = subprocess.run(f"python -c '{script}'", shell=True, capture_output=True, text=True)

return result.stdout
@uartie (Contributor) commented Mar 6, 2025

return result.stdout.strip() since we check if not name by caller (i.e. empty string).


That is, on non-gaudi I get:

result = CompletedProcess(args="python -c 'import habana_frameworks.torch.hpu as torch_hpu\nprint(torch_hpu.get_device_name())'", returncode=0, stdout='\n', stderr='/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead\n  return isinstance(object, types.FunctionType)\n/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:135: UserWarning: Device not available\n  warnings.warn("Device not available")\n')
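The reviewer's point can be shown in isolation: on a machine without a device, the child process prints an empty name, so stdout is just "\n", which is truthy; only after stripping does the caller's `if not name` check behave as intended. A minimal sketch:

```python
# What the subprocess returns on a machine without a Gaudi device: the
# child prints an empty device name, so stdout is just a newline.
raw_stdout = "\n"

# Without .strip(), the caller's `if not name:` guard never fires,
# because a bare newline is a truthy string.
assert bool(raw_stdout) is True

# After .strip(), the empty string is falsy and the guard works.
assert bool(raw_stdout.strip()) is False
```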

@regisss (Collaborator) commented Mar 6, 2025

> @dsmertin, @12010486, @regisss what if we use get_device_name() in the optimum/habana/utils.py module, instead? Would that prevent occupying a device?

I think that should work yes 👍

@uartie (Contributor) commented Mar 6, 2025

> @dsmertin, @12010486, @regisss what if we use get_device_name() in the optimum/habana/utils.py module, instead? Would that prevent occupying a device?
>
> I think that should work yes 👍

@dsmertin please try this solution.

@uartie (Contributor) commented Mar 7, 2025

Meanwhile, it would be good to fix the source torch_hpu.get_device_name() so that it does only what its name implies. Otherwise, rename it to torch_hpu.get_device_name_and_occupy().

@dsmertin (Contributor, Author) commented Mar 7, 2025

@uartie @regisss
I've added get_device_name() from utils, as you suggested.

@12010486 (Contributor) commented Mar 7, 2025

Ok, I do have a concern about using get_device_name(): on Gaudi3 it will output gaudi3, while currently all the tests are either enabled for gaudi2 (and 3 by proxy) or also covering gaudi, a subset of those. @dsmertin, could you check how a small test runs on G3?

@12010486 (Contributor) commented Mar 7, 2025

> Ok, I do have a concern about using get_device_name(): on Gaudi3 it will output gaudi3, while currently all the tests are either enabled for gaudi2 (and 3 by proxy) or also covering gaudi, a subset of those. @dsmertin, could you check how a small test runs on G3?

Hold on, it seems now we are taking care of it correctly

@regisss (Collaborator) left a comment

LGTM!
I think you'll have to run make style too (there should be a blank line after the import).

@dsmertin force-pushed the pytest-device-occupation-fix branch from 2228ddc to bb1a89c on March 7, 2025 13:50
@dsmertin changed the title from "Conftest: torch_hpu.get_device_name() moved to a separate process" to "Conftest: get_device_name() without device occupation" on Mar 7, 2025
@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss regisss merged commit 66f3673 into huggingface:main Mar 7, 2025
3 of 4 checks passed
@uartie (Contributor) commented Mar 7, 2025

> Ok, I do have a concern about using get_device_name(): on Gaudi3 it will output gaudi3, while currently all the tests are either enabled for gaudi2 (and 3 by proxy) or also covering gaudi, a subset of those. @dsmertin, could you check how a small test runs on G3?
>
> Hold on, it seems now we are taking care of it correctly

On the "main" branch, we are still waiting for #1807 to be merged; it is necessary to effectively use --device gaudi3 or device auto-detection on G3. Until then, you will still want to use "--device gaudi2" or "GAUDI2_CI=1" on G3.

@hsubramony has already pulled #1807 into transformers_4_49 branch via #1824

@12010486 (Contributor) commented Mar 7, 2025

> Ok, I do have a concern about using get_device_name(): on Gaudi3 it will output gaudi3, while currently all the tests are either enabled for gaudi2 (and 3 by proxy) or also covering gaudi, a subset of those. @dsmertin, could you check how a small test runs on G3?
>
> Hold on, it seems now we are taking care of it correctly
>
> On the "main" branch, we are still waiting for #1807 to be merged; it is necessary to effectively use --device gaudi3 or device auto-detection on G3. Until then, you will still want to use "--device gaudi2" or "GAUDI2_CI=1" on G3.
>
> @hsubramony has already pulled #1807 into transformers_4_49 branch via #1824

Thanks for the explanation! At this point, it seems the PR got merged into the wrong branch, @regisss, or it should at least also be merged into transformers_4_49.

@regisss (Collaborator) commented Mar 7, 2025

It is already in the transformers_4_49 branch.
