Update Python device API for SPMD #5129

Merged 2 commits on Jul 25, 2023

JackCaoG
Collaborator

@JackCaoG JackCaoG commented Jun 7, 2023

The purpose of this PR is to clean up the behavior of the xla_model API calls under the SPMD context. IMO we should always return xla:0 to the user and use it to represent the global device. Other APIs, such as global_device_count and ordinal, will also behave as if there is only one device. The idea is that all of the existing APIs under xla_model are about the XLA device, and under the SPMD context there is exactly one XLA device, which is a virtual device.

We provide a separate set of APIs that have runtime in their name, so users can query the real runtime device information. For example, users should use

xr.global_runtime_device_count()

instead of

xr.global_device_count()  --> always returns 1 when `XLA_USE_SPMD=1`

or

len(xm.get_xla_supported_devices()) --> always returns 1 when `XLA_USE_SPMD=1`

This PR will require some model code changes, so I want to land it before the next release.
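
As a rough illustration of the kind of model-code change this implies, the helper below picks the count to use when sizing a mesh. It is a sketch, not code from this PR: `device_count_for_mesh` is a hypothetical helper, and its `xr` argument stands in for `torch_xla.runtime`, stubbed here with `types.SimpleNamespace` so the snippet runs without a TPU. Only the two function names `global_runtime_device_count` and `global_device_count` come from the description above.

```python
from types import SimpleNamespace


def device_count_for_mesh(xr) -> int:
    """Returns the number of devices a mesh should span.

    Prefers the runtime-level count, which stays meaningful under SPMD, and
    falls back to the logical count if the runtime API is not available.
    """
    runtime_count = getattr(xr, 'global_runtime_device_count', None)
    if callable(runtime_count):
        return runtime_count()
    return xr.global_device_count()


# Stub standing in for torch_xla.runtime on a v4-8 with SPMD enabled:
# one logical (virtual) device, four runtime devices.
fake_xr = SimpleNamespace(
    global_device_count=lambda: 1,
    global_runtime_device_count=lambda: 4,
)

print(device_count_for_mesh(fake_xr))
```

With the stub above, the helper reports the four runtime devices rather than the single virtual device.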

Next steps

  1. Add similar tests for the APIs under runtime.py
  2. Add a warning message when users call global_device_count (and other similar APIs) under the SPMD context
  3. Clean up the SPMD device in C++ land
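
Step 2 could look roughly like the sketch below. This is an assumption about one possible shape of the warning, not the code that was later added: `is_spmd_enabled` and the toy `global_device_count` are hypothetical stand-ins, and reading `XLA_USE_SPMD` directly from the environment is a simplification.

```python
import os
import warnings


def is_spmd_enabled() -> bool:
    """Hypothetical check; the real library may track SPMD state differently."""
    return os.environ.get('XLA_USE_SPMD', '0') == '1'


def global_device_count(real_count: int) -> int:
    """Toy logical device count that warns when called under SPMD.

    `real_count` stands in for whatever the runtime would report.
    """
    if is_spmd_enabled():
        warnings.warn(
            'global_device_count() always returns 1 under SPMD; use '
            'global_runtime_device_count() to query the runtime devices.',
            stacklevel=2)
        return 1
    return real_count
```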

@JackCaoG JackCaoG force-pushed the JackCao/spmd_python_device branch 2 times, most recently from cfe3ae7 to 616d7ce Compare June 27, 2023 20:56
@JackCaoG JackCaoG marked this pull request as ready for review June 27, 2023 21:04
Collaborator

@will-cromar will-cromar left a comment

LGTM in principle

I'm on the fence about the naming convention here. So far, I've been thinking about the distinction between a TPU chip (an actual PCI device) and a TPU core (a "device" within that chip) by calling the TPU core a "logical" device and the TPU chip a "physical" device.

We should settle on terms we can use by convention for each of these sets:

  1. The set of all runtime devices (e.g. TPU cores)
  2. The set of all PCI devices (e.g. TPU chips)
  3. The set of all PyTorch devices (either the SPMD virtual device or runtime devices)

To me, "physical" makes the most sense as (2), since these are the actual devices you see if you run lspci.

Resolved review thread on torch_xla/experimental/pjrt.py
@alanwaketan
Collaborator

> LGTM in principle
>
> I'm on the fence about the naming convention here. So far, I've been thinking about the distinction between a TPU chip (an actual PCI device) and a TPU core (a "device" within that chip) by calling the TPU core a "logical" device and the TPU chip a "physical" device.
>
> We should settle on terms we can use by convention for each of these sets:
>
>   1. The set of all runtime devices (e.g. TPU cores)
>   2. The set of all PCI devices (e.g. TPU chips)
>   3. The set of all PyTorch devices (either the SPMD virtual device or runtime devices)
>
> To me, "physical" makes the most sense as (2), since these are the actual devices you see if you run lspci.

I'm thinking this is very architecture-dependent?
For v4, each device actually corresponds to a chip.
For v3, it's the story you present.

Collaborator

@alanwaketan alanwaketan left a comment

LGTM. Pending Will's comment.

@will-cromar
Collaborator

Right, (1) and (2) are going to be the same for TPU v4 because the driver represents both cores as one large device. The same is true for GPU.

For TPU v2/v3 and CPU, (1) will be larger than (2) because the runtime will represent the cores as separate devices within the same process.

@JackCaoG
Collaborator Author

I am flexible with the naming convention here; I forgot about the chip vs. core distinction for v3... Did we ever expose chip vs. core to users through any API? For example, is there any API that returns the number of chips, which would be 4 for 'v3-8'?

I think it would be best if users don't need to know about cores vs. chips, since it is really confusing.

In my mind, under SPMD, we have 1 logical device. In normal data parallel, v3-8 has 8 logical devices and v4-8 has 4 logical devices.

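SPMD

| Accelerator | logical | physical |
|-------------|---------|----------|
| v3-8        | 1       | 8        |
| v4-8        | 1       | 4        |

non-SPMD

| Accelerator | logical | physical |
|-------------|---------|----------|
| v3-8        | 8       | 8        |
| v4-8        | 4       | 4        |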
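
Both tables follow a single rule, which the sketch below spells out. It is illustrative only: `logical_device_count` is a hypothetical name, and "physical" is used in the same sense as in the tables (the devices the runtime sees).

```python
def logical_device_count(physical_devices: int, spmd: bool) -> int:
    """Logical devices exposed to the user for a given physical count."""
    # Under SPMD there is a single virtual device regardless of hardware.
    return 1 if spmd else physical_devices


# Physical device counts taken from the tables above.
PHYSICAL = {'v3-8': 8, 'v4-8': 4}

for spmd in (True, False):
    for accelerator, physical in PHYSICAL.items():
        logical = logical_device_count(physical, spmd)
        print(f'spmd={spmd} {accelerator}: logical={logical} physical={physical}')
```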

@will-cromar
Collaborator

Yeah, there are APIs in tpu.py that differentiate between chips and cores:

def num_available_chips() -> int:
  """Returns the number of TPU chips attached through PCI."""
  num_chips = 0
  for vendor_path in glob.glob('/sys/bus/pci/devices/*/vendor'):
    vendor_id = pathlib.Path(vendor_path).read_text().strip()
    if vendor_id != _GOOGLE_PCI_VENDOR_ID:
      continue

    device_path = os.path.join(os.path.dirname(vendor_path), 'device')
    device_id = pathlib.Path(device_path).read_text().strip()
    if device_id in _TPU_PCI_DEVICE_IDS:
      num_chips += 1

  return num_chips


def num_logical_cores_per_chip() -> int:
  """Returns number of XLA TPU devices per physical chip on the current host."""
  return 2 if version() <= 3 else 1


def num_available_devices() -> int:
  """Returns number of XLA TPU devices on the current host.

  Note: this does not initialize the computation client and is safe to call
  before `xmp.spawn`.
  """
  return num_available_chips() * num_logical_cores_per_chip()
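
One way to exercise this counting logic without TPU hardware is to point it at a fake sysfs tree. The sketch below is a variation written for illustration, not the code in tpu.py: the `pci_root` parameter and the `make_fake_device` helper are additions, and the vendor and device ID values are placeholders, since the excerpt does not show what `_GOOGLE_PCI_VENDOR_ID` and `_TPU_PCI_DEVICE_IDS` contain.

```python
import glob
import os
import pathlib
import tempfile

# Placeholder values for illustration only; the real constants live in tpu.py.
FAKE_VENDOR_ID = '0xabcd'
FAKE_TPU_DEVICE_IDS = {'0x0001'}


def count_chips(pci_root: str) -> int:
    """Counts PCI devices under pci_root whose vendor and device IDs match."""
    num_chips = 0
    for vendor_path in glob.glob(os.path.join(pci_root, '*', 'vendor')):
        if pathlib.Path(vendor_path).read_text().strip() != FAKE_VENDOR_ID:
            continue
        device_path = os.path.join(os.path.dirname(vendor_path), 'device')
        if pathlib.Path(device_path).read_text().strip() in FAKE_TPU_DEVICE_IDS:
            num_chips += 1
    return num_chips


def make_fake_device(pci_root: str, name: str, vendor: str, device: str) -> None:
    """Writes the two sysfs-style files that count_chips reads."""
    dev_dir = pathlib.Path(pci_root) / name
    dev_dir.mkdir()
    (dev_dir / 'vendor').write_text(vendor + '\n')
    (dev_dir / 'device').write_text(device + '\n')


with tempfile.TemporaryDirectory() as root:
    make_fake_device(root, 'dev0', FAKE_VENDOR_ID, '0x0001')
    make_fake_device(root, 'dev1', FAKE_VENDOR_ID, '0x0001')
    make_fake_device(root, 'dev2', FAKE_VENDOR_ID, '0x9999')  # other device ID
    make_fake_device(root, 'dev3', '0x1234', '0x0001')  # other vendor
    chips = count_chips(root)

print(chips)
```

Only the two entries whose vendor and device IDs both match are counted.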

Concretely, the number of chips is the expected number of local processes when using multiprocessing. I agree that for normal use cases, we should bury this distinction as much as we can. A clear naming convention is important for tools building on us (e.g. Lightning and Accelerate) and for ourselves once we forget this conversation.

I'm fine with "logical" devices being the set of torch.devices we expose. What do you think of these?

  1. "physical" devices = PCI devices
  2. "logical" devices = torch.devices
  3. "runtime" devices = PjRtDevices
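
To make the three terms concrete, the sketch below computes each count for a single TPU host. It is an illustration under stated assumptions: `DeviceCounts` and `device_counts` are hypothetical names, the chip count is passed in by the caller, and the cores-per-chip rule mirrors `num_logical_cores_per_chip` from the tpu.py excerpt.

```python
from typing import NamedTuple


class DeviceCounts(NamedTuple):
    physical: int  # PCI devices (TPU chips)
    runtime: int   # PjRtDevices seen by the runtime
    logical: int   # torch.devices exposed to the user


def device_counts(tpu_version: int, num_chips: int, spmd: bool) -> DeviceCounts:
    """Derives the three counts from the TPU version and chip count."""
    # v2/v3 expose two runtime devices per chip; v4 exposes one.
    runtime = num_chips * (2 if tpu_version <= 3 else 1)
    # Under SPMD the user sees a single virtual device.
    logical = 1 if spmd else runtime
    return DeviceCounts(physical=num_chips, runtime=runtime, logical=logical)


print(device_counts(tpu_version=3, num_chips=4, spmd=True))   # a v3-8 host
print(device_counts(tpu_version=4, num_chips=4, spmd=False))  # a v4-8 host
```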

@JackCaoG
Collaborator Author

yea, I am ok with the name runtime devices, I can rename my api to global_runtime_device_count

@JackCaoG
Collaborator Author

JackCaoG commented Jun 28, 2023

I am halfway through editing this PR, but now I have a different thought. In https://github.com/pytorch/xla/blob/master/torch_xla/experimental/xla_sharding.py we use the term physical a lot. I also think physical is easier to understand, since most people probably don't know which runtime we are using or how a runtime device differs from a logical device. If we are only going to use the PCI device concept internally (users don't access it), can we just call it pci_device and let physical_device be the number of devices that users can create a mesh on? @will-cromar

I think runtime_device is the more accurate term, but it would confuse users who don't know how TPUs and the PyTorch/XLA runtime work. Physical is more natural IMO.

@JackCaoG
Collaborator Author

JackCaoG commented Jul 5, 2023

@will-cromar gentle ping on your opinion toward using physical to represent runtime device.

@JackCaoG
Collaborator Author

JackCaoG commented Jul 6, 2023

Chatted with Will offline. I am just going to use the terms defined in #5129 (comment) and update every physical in this PR to runtime, to be accurate.

@JackCaoG JackCaoG force-pushed the JackCao/spmd_python_device branch 2 times, most recently from c304e06 to 6c0e8c9 Compare July 12, 2023 01:19
@@ -188,22 +188,22 @@ def test_spawn_threads(self):
{i: torch.device(f'xla:{i}') for i in range(self.num_devices)})

@staticmethod
def _device_attributes():
return xr.device_attributes(str(xm.xla_device()))
def _physical_device_attributes():
Collaborator

Should these be runtime instead of physical in this file?

Collaborator Author

good catch

Collaborator Author

fixed.

@JackCaoG JackCaoG force-pushed the JackCao/spmd_python_device branch from 6c0e8c9 to 3b153b3 Compare July 24, 2023 18:34
@JackCaoG JackCaoG requested review from will-cromar and jonb377 July 24, 2023 20:37
@JackCaoG
Collaborator Author

@will-cromar @jonb377 Can I get a review for this one?

@JackCaoG JackCaoG changed the title Add Test for xla_model api + SPMD Update Python device API for SPMD Jul 24, 2023
Collaborator

@jonb377 jonb377 left a comment

Overall LGTM, just a couple of questions to help my understanding

Resolved review thread on test/spmd/test_spmd_xla_model_api.py
Resolved review thread on torch_xla/csrc/init_python_bindings.cpp
@JackCaoG
Collaborator Author

@amithrm @Liyang90 @jonb377 @alanwaketan @baoleai I am going to merge this PR now. The most noticeable change for users is that you should not use xr.global_device_count under SPMD, since it will return 1 (there is only one XLA virtual device). You should use xr.global_runtime_device_count, which tells you the actual number of devices that the runtime sees.

@JackCaoG JackCaoG merged commit f996701 into master Jul 25, 2023
will-cromar pushed a commit that referenced this pull request Sep 14, 2023
* Make python Api to respect the virtual device when SPMD is enabled

* fix typo
will-cromar added a commit that referenced this pull request Sep 15, 2023
* Sharding should be per output of IR Node, instead of per IR Node (#5330)

* sharding should be per output of IR Node, instead of per IR Node

* Update sharding_hash method

* Add test for sharding on IR with multiple output

* fix cpu test

* Fix a bug in getSharding

* Update Python device API for SPMD (#5129)

* Make python Api to respect the virtual device when SPMD is enabled

* fix typo

* Check out the release branch instead of origin/master in ansible (#5344)

* Also dump output sharding on HLO file (#5339)

* Also dump output sharding on HLO file

* only dump output sharding if dump format is HLO

* add test

* fix typo

* Make all-reduce a no-op when world size is 1 (#5342)

* Make all-reduce a no-op when world size is 1

* Fix torch.distributed test

* add fs linker flag (#5347)

* Add py3.10 whl path to doc, refactor whl table (#5354)

* fix amp dtype setting for GPU (#5337)

* fix amp dtype setting for GPU.

* fix ut

* fix lint.

* minor.

* Add python test for SPMD+Runtime Python API (#5349)

* Add python test for SPMD+Runtime Python API

* replace test name

* Update test_xla_spmd_python_api_interaction.py

* Check the actual device instead of query env var for virtual device (#5352)

* Check the actual device instead of query env var for virtual device

* revert unneeded change

* minor changes

* [BE] use self.assertEquals instead of str equality in test_zero1.py (#5364)

* Revert "[BE] use self.assertEquals instead of str equality in test_zero1.py (#5364)" (#5366)

This reverts commit 8ada333.

* [Dynamo|TPU] Tweak `atol` and `rtol` for `test_dynamo.py` (#5363)

* tweak `atol` and `rtol`

* [Dynamo|TPU] Skip`DynamoTrainingBasicTest.test_resnet18` on TPU (#5362)

*  Skip`DynamoTrainingBasicTest.test_resnet18` on TPU

* Add a script for running stablehlo tests. (#5360)

* Add kokoro presubmit for stablehlo tests

* Don't rewrite index hints in global save planning (#5348)

* [Dynamo|TPU] Skip `DynamoInferenceBasicTest.test_resnet18` on TPU (#5361)


* Skip `DynamoInferenceBasicTest.test_resnet18` on TPU

* [BE] use self.assertEquals instead of str equality in test_zero1.py (#5367)

* [BE] use self.assertEquals instead of str equality in test_zero1.py

* Use our own assertEqual

* Remove print statements

* Fix ReplicateShardedData for int type (#5374)

* Fix ReplicateShardedData for int type

* add test

* Update dynamo.md (#5378)

Update dynamo.md to remove note about fallback ops since they're supported now

* Revert "Fix ReplicateShardedData for int type (#5374)" (#5380)

This reverts commit 7fb7dfe.

* Remove the mention of XRT_TPU_CONFIG in the CONTRIBUTING.md (#5379)

* [Dynamo|TPU] Tweak `atol` and `rtol` for `test_simple_model_with_different_input_shape` on TPU (#5373)

* tweak `atol` and `rtol` for `test_simple_model_with_different_input_shape` on TPU

* Rectify test_zero1.py once optim.load_state_dict doesn't guarantee immutability (#5382)

* [TEST ONLY] print statements for test_zero1.py to debug

* Try fix

* Rectify test_zero1.py to account for state_dict modification

* Fix lint

* Add gpu doc for how to build PyTorch/XLA from source with GPU support. (#5384)

* Add gpu doc for how to build PyTorch/XLA from source with GPU support.

* fix typo

* fix comments

* fix comments

* clear pending ir should also clear the cc op tokens (#5385)

* Port resnet data loading optimizations to SPMD test script (#5386)

* Add support for in-place ops with self tensors in dynamo bridge (#5309)

* Add more support for in-place ops in dynamo bridge

Run linter

* Add check to explicitly sync self tensors

Remove debugging lines

Update unit tests to a model

* Clean up some code

Surround  in an if-statement

Update metrics for fallback related dynamo tests

Update cloned args logic

Revert "Update metrics for fallback related dynamo tests"

This reverts commit 3855f43.

* Update single_node flag back to False

* Add dynamo test in TPU CI (#5381)

Add dynamo test in TPU CI

* Add manual seed in multihost checkpoint (#5392)

* Fix change_id type in coverage uploading (#5394)

* Update dynamo cpu fallback op to aten::_foobar (#5393)

* Run single host multi GPU tests in the CI. (#5387)

* Add gpu doc for how to build PyTorch/XLA from source with GPU support.

* Run single host multi GPU tests.

* fix linter

* fix linter

* fix error

* fix test

* [PJRT] Separate collective ops test from TPU runtime test. (#5396)

* [PJRT] Separate collective ops test from TPU runtime test.

* formatting

* Fix ReplicateShardedData for int type (#5404)

* Update the dynamo backend name to `openxla` (#5402)

* Replace aot backend with openxla

* Update the inference backend except the fallback tests

* handle the fallback tests

* update remaining test

* update doc

* add torch pin

* Delete .torcch_pin

* linter

* [SPMD] Multi-host batch sharded data loading (#5331)

* Refactor to share code between export_torch_model and save_as_stablehlo (#5388)

* Refactor to share code between export_torch_model and save_as_stablehlo

* Fix TPU collective ops test for multi-host TPUs (#5408)

* Fix TPU collective ops test for multi-host TPUs

* formatting

* Partially replicate lower-rank tensors (#5409)

* Partially replicate lower-rank tensors

* Fix unit test

* Remove unnecessary device count check

* Fix unordered partition spec test

* yapf

* Revert "Partially replicate lower-rank tensors (#5409)" (#5412)

This reverts commit 56a6a02.

* SPMD cross slice-replication using partial_replication sharding (#5411)

* Revert "Support unordered sharding spec for partial replication (#5316)"
* Update test_2d_tensor_3d_mesh unit test to surface a bug
* Use partial replication for 2D tensor over 3D mesh sharding

* Fix the incorrect clone arg condition in dynamo bridge (#5414)

* [SPMD] named partition spec support (#5415)

[SPMD] named partition spec

* [PJRT|TPU] Update `test_xla_devices_single_process_all_chips` for expected device number (#5421)

Update `test_xla_devices_single_process_all_chips` for expected device number

* Add repo for libcudnn8=8.7.0.84 and CUDA 11.8 (#5425)

* Update fix_includes.sh (#5441)

Without this patch I cannot get torch_xla to build outside of the docker. This should fix it.

* [PJRT] Support `torchrun` with `pjrt://` `init_method` (#5438)

* Support torchrun with `pjrt://` `init_method`

* move import

* fix error

* Fix NameError

* Fix path

* Remove from TPU CI

* Bugfix + add more test for llama (#5439)

Bugfix details:
1. When the graph has mutations, the exported graph will have additional
   inputs. For now we are dropping them.
2. We should trace with args instead of final_args.

* Move the C++ test build to CI build job instead of test job (#5442)

* Update gcc to 10. (#5445)

* Update gcc to 10,

And use unversioned clang-format (so its installation will succeed)
in both Debian bullseye and buster

* gcc10 to ansible

* Update the random seed for every dynamo execution (#5444)

* Revert "Update gcc to 10. (#5445)" (#5449)

This reverts commit 454e916.

Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>

* Install gcc-10 (#5450)

* Revert "Install gcc-10 (#5450)" (#5452)

This reverts commit 65b7639.

* parallelize SPMD inputhandler and GetDataShards (#5447)

* parallelize SPMD inputhandler and GetDataShards

* add output handler trace

* Remove base image override from TPU CI build (#5453)

* Update to GCC 10 (#5451)

* Cache sharded placeholder for dynamo execution (#5446)

* Cache the output sharding spec for dynamo

* address review comments

* add test

* remove dead code

* add missing wait device ops

* Update xla_graph_executor.cpp

* linter

* Remove Docker image override from dev image (#5456)

* hack: implement (unimplement?) GetDataShard for XRT

* skip flaky test (#5459)

* Neuron import hook (#5429)

* Enable Neuron import hook for calling initialization functions if using AWS Neuron

* removing copy/paste error

* moving aws init call and removing comment

* Add missing includes (#5434)

* Add missing includes

Currently this is included indirectly through PyTorch includes, but when I remove
the include from PyTorch's headers, the xla build fails.

* [TESTING] Pin PyTorch PR

* Retrigger CI after timeout

* Remove .torch_pin

* [GPU]Update README.md with wheel/docker for CUDA12.0 and deprecate CUDA11.7 (#5443)

* [GPU]Update README.md with wheel and docker support CUDA12.0 and deprecate CUDA 11.7

* Update README.md with docker support CUDA 12.0 and python 3.8

* Update README.md

* Update README.md

* update remote cache key in ansible (#5463)

* Fix data type in Pow with Scalar base and Tensor exponent (#5467)

* fix dtype inference

* fix linter

* bump the timeout for CI (#5470)

* Fix the input sharding for dynamo (#5469)

* Enabling sharding device data IR (#5475)

* Allow shard device data IR

* Handle XLATensor that is DeviceData IR and does not have XLAData

* fix typo

* Introduce `torch_xla.runtime.use_spmd()` (#5474)

Introduce torch_xla.runtime.use_spmd() and torch_xla.runtime.is_spmd()

* Enable PJRT C API Client and other changes for Neuron (#5428)

* Enable PJRT C API Client and other changes for Neuron

* keeping quotes consistent

* fixing device type call

* refactoring neuron initialization with spawn

* updating replication setting only for torchrun

* removing set replication in xla backed was added to rendezvous handler

* removing workaround for world_size/master_port for neuron

* fixing linter issues

* Don't move full tensor to device in deferred_init (#4819)

* [SPMD] Fix HybridMesh ordering (#5478)

Summary:
xs.HybridMesh assumes that xr.global_runtime_device_attributes() returns the attributes in the order of PyTorch/XLA's logical global ordinals. However, that turns out not to be the case.

To fix this, we pass the logical global ordinal as one of the attributes, and xs.HybridMesh sorts the attributes by this new attribute before using the array.

Test Plan:
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py -v -k test_hybrid_mesh
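
The fix described in this summary amounts to sorting before use. A minimal sketch, assuming each attribute entry is a dict and that the added ordinal is stored under a key named `global_ordinal` (the key name, the `slice_index` field, and the helper name are assumptions for illustration, not taken from the source):

```python
def sort_by_global_ordinal(device_attributes):
    """Returns the attribute dicts ordered by their logical global ordinal."""
    return sorted(device_attributes, key=lambda attrs: attrs['global_ordinal'])


# Hypothetical attributes as a runtime might return them, out of order.
attributes = [
    {'global_ordinal': 2, 'slice_index': 1},
    {'global_ordinal': 0, 'slice_index': 0},
    {'global_ordinal': 3, 'slice_index': 1},
    {'global_ordinal': 1, 'slice_index': 0},
]

ordered = sort_by_global_ordinal(attributes)
print([attrs['slice_index'] for attrs in ordered])
```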

* [SPMD] Properly skip tests on TPU V2 (#5479)

Summary:
Some of the tests only fail on TPU V2 but were skipped for all TPUs.
Let's fix that.

Test Plan:
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py

* Add @yeounoh to .github CODEOWNERS (#5482)

* Add Python API to execute StableHLO bytecode (#5476)

* [SPMD] Fix TPU CI after #5478 (#5487)

* [SPMD] Fix TPU CI after #5478

Summary:
Let's fix all TPU CI failures after #5478.

Test Plan:
TPU CI

* Fix linters

* [SPMD] Fix XLA_DUMP_POST_OPTIMIZATIONS test (#5485)

Summary:
XLA_DUMP_POST_OPTIMIZATIONS was set as static, which means that its value is fixed for the whole test run of a particular test suite.

Therefore, let's move the test into a separate file.

Test Plan:
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding_hlo.py
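
The pitfall in this summary is that a value read once and cached ignores later changes to the environment. The Python sketch below is only an analogy for the static variable: `dump_post_optimizations` is a hypothetical function, `functools.lru_cache` stands in for the one-time initialization, and treating the value `1` as "enabled" is a simplifying assumption.

```python
import functools
import os


@functools.lru_cache(maxsize=None)
def dump_post_optimizations() -> bool:
    """Reads the flag once; later calls return the cached value."""
    return os.environ.get('XLA_DUMP_POST_OPTIMIZATIONS', '0') == '1'


os.environ['XLA_DUMP_POST_OPTIMIZATIONS'] = '0'
first = dump_post_optimizations()

# Changing the variable afterwards has no effect within this process.
os.environ['XLA_DUMP_POST_OPTIMIZATIONS'] = '1'
second = dump_post_optimizations()

print(first, second)
```

Because the cached value cannot change within a process, a test that needs a different setting has to run in its own process, which is what a separate test file provides.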

* [Dist] Refactor ZeRO-1 (#5145)

* refactor

* fix

* fix

* add padding

* more robust save/load

* Update artifacts.auto.tfvars for 2.1 release (#5483)

* Update artifacts.auto.tfvars for 2.1 release

Update artifacts.auto.tfvars for 2.1 release

* Remove cuda version 11.7 and add 12.0 for 2.1 triggers

* Add 3.10 tpu version

* Add ShardingSpec to XLATensor when it is created with a PJRTShardedData (#5489)

* Add ShardingSpec to XLATensor when it is created with a PJRTShardedData

* add test

* Add topological sorting to dynamo partitions (#5472)

* Add topological sorting to dynamo partitions

* Run linter

* Update unit tests to include more in-place ops

* [SPMD] Patch nn.Linear (#5491)

Summary:
This pull request introduces a patched version of torch.nn.functional.linear that uses einsum instead of torch.matmul, because torch.matmul flattens the tensors to 2D and collapses the sharded dimensions. That default torch.matmul behavior makes it very hard for the XLA compiler to propagate the sharding annotation.

Test Plan:
PJRT_DEVICE=CPU python test/test_operations.py -v -k test_patched_linear

* [original author: mrnikwaws] Neuron operator support (#5471)

* adding glu operator support

* adding glu operator

* fixing yaml

* fixing linter issues

* fixing linter issues

* fixing spacing

* fixing spacing

* fixing spacing

* fixing spacing

* fixing shape helper

* fixing spacing

* [SPMD] Make IR sharding custom sharding op (#5433)

Summary:
This pull request changes the syntax of IR sharding by making it a new node instead of just attaching the sharding spec to the tensor. At the same time, we still attach a sharding spec to the newly created XLATensor, which holds the new IR node.

This new IR node will be a CustomSharding node and in hlo:
%annotate = f32[6,3] custom-call(%copy), custom_call_target="Sharding", sharding={devices=[2,1]0,1}

Test Plan:
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py -v -k test_mark_sharding_ir
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py -v -k test_inplace_add_with_sharding

* Support input sharding changed after first dynamo tracing (#5477)

* Support input sharding changed after first dynamo tracing

* fix linter

* Handle the different input for dynamo sharding change

* update counter

* only get sharding specs when spmd is enabled

* add option to skip checking input sharding after x runs

* handle the cpu test

* make XLA_DYNAMO_INPUT_SHARDING_CHECK_THREASHOLD configurable

* fix review comments

* Always use ExecuteReplicated with SPMD (#5494)

* Always use ExecuteReplicated with SPMD

* Add unit test

* Skip a couple tests on TPU due to precision issue (#5496)

* Refactor stablehlo API and put them in official location. (#5493)

Changes include:

* Make an endpoint in torch_xla/__init__.py for the exposed APIs torch_xla.save_as_stablehlo and torch_xla.save_torch_model_as_stablehlo.
* Move all TF-related integration to its own file.
* Remove args as an argument (because it will appear inside of ExportedProgram) but allow the user to override it (which we use for now).

* Support tuples in partition spec (#5488)

* Support tuples in partition spec

* Add unit test for partial replication

* yapf

* Support higher-rank tensors over lower-rank mesh

* Fix test & yapf

* Don't use partition_spec when creating group assignment

* Update documentation

* More documentation

* Translate named specs in ShardingSpec

* Add an API to explicitly init runtime (#5500)

* Add explicit error message when tensor is on CPU for dynamo backend (#5499)

* remove torchvision in stablehlo.py (#5501)

* Fix tupled partition spec test on v3 (#5503)

* Update dynamo doc (#5506)

* Update dynamo.md (#5509)

fixing typo

* Get original_traced_args as example_inputs. (#5511)

Change due to changing name in pytorch/pytorch#107978

* mark_sharding over a replicated tensor is allowed. (#5513)

* [SPMD] Propagate replicated output (#5508)

Summary:
During the LLaMA2 experiments, I discovered that manually marking 1D tensors as replicated can save a lot of memory. Then I discovered that an explicitly replicated spec gets dropped after mark_step. That is caused by PrepareOutputShardingPropagation, which explicitly clears the sharding spec for replicated outputs. So I went ahead and fixed that.

Further, I did some experiments with propagating replicated output, and that removes the need to manually replicate 1D tensors. Hence, I made this change.

I'm still not quite sure why; will follow up later.

Test Plan:
PJRT_DEVICE=TPU python test/spmd/test_xla_sharding.py

* Disable cxx abi in ansible when building pt/xla for branch r2.0 (#5332)

* Update pytorch git tag for r2.1 (#5529)

Update more places

Add torch_pin

* Enable megacore_dense by default (#5520) (#5531)

Summary:
This change enables megacore_dense by default to allow asynchronous cc
ops, especially for GSPMD.

Test Plan:
CI

Co-authored-by: Jiewen Tan <jwtan@google.com>

* Add option to unbundle libtpu (#5534) (#5536)

* Add option to unbundle libtpu

* Add clarifying note

* Revert 2.1 terraform changes (#5537)

* Fix FSDP for Models with Frozen Weights (#5484) (#5539)

* Fix fsdp not freeing frozen full params

* add test

* formatting

* remove unnecessary env var in test

Co-authored-by: Liyang90 <liyanglu@google.com>

* Update r2.1 wheel to be compatible with PyPI (#5550)

* Update project metadata and remove useless files

* Update README

* Add manylinux platform tag

* formatting

* Add resnet50-weight-quant colab notebook (#5407) (#5556)

* Add resnet50-weight-only-quant colab notebook

* update notebook with llama blog link

Co-authored-by: Siyuan Liu <lsiyuan@google.com>

* hack: add placeholders for `HasSharding` and `GetSharding` to XRT

* formatting

* hack: always return false from `HasSharding`

* Update torch pin to current RC for CI testing

* Cherry pick `pjrt://` init method rename and doc updates (#5562)

* Change `pjrt://` init method to `xla://` (#5560)

* Update PJRT documentation for the 2.1 release (#5557)

* Update PJRT documentation for the 2.1 release

* clarify plugins

* clarify PJRT doc

* Update `pjrt://` to `xla://`

* Use new cache silo and skip test build

* hack: disable missing test

* hack: alter cache silo name

* formatting

---------

Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: iefgnoix <isaacwxf23@gmail.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Baole Ai <baoleai01@gmail.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: jonb377 <jonbolin@google.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>
Co-authored-by: Yeounoh Chung <yeounoh@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Alisson Azzolini <37222419+aazzolini@users.noreply.github.com>
Co-authored-by: aws-kingrj <78175353+aws-kingrj@users.noreply.github.com>
Co-authored-by: peterbell10 <peterbell10@live.co.uk>
Co-authored-by: Zach Zheng <zczheng@amazon.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Huang, Guangtai <guangtai@amazon.com>
Co-authored-by: Shauheen <shauheen@users.noreply.github.com>
Co-authored-by: Liyang90 <liyanglu@google.com>