
Fix the normalization scheduler to accept DID loop split. #3853

Merged: 4 commits merged from wjy/norm into main on Feb 10, 2025

Conversation

wujingyue (Collaborator) commented on Feb 8, 2025

I'm sure we'll need more tests to be confident, but this incremental PR feels good!

For #2563

Review comment on csrc/multidevice/utils.cpp (diff hunk @@ -32,16 +32,6 @@ NVF_API bool distributedEnabled() {):

```cpp
namespace {

std::unordered_set<IterDomain*> getShardedIterDomains(TensorView* tv) {
```

wujingyue (Collaborator, Author), on the removed getShardedIterDomains: Not used

@wujingyue wujingyue requested a review from naoyam February 8, 2025 01:03
github-actions bot commented on Feb 8, 2025

Review updated until commit 5723c20

Description

  • Added support for DID loop split in the normalization scheduler.

  • Introduced the getShardedLoopAxis function for loop-axis retrieval (a sketch follows this list).

  • Enhanced scheduleReductionTV to handle DID loop split.

  • Added a new test case, DivideBySum, for multidevice sharding.
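
For reference, here is a minimal sketch of what getShardedLoopAxis could look like, pieced together from the reviewer-guide excerpts below (the TensorView*/ParallelType arguments, the negative return value when nothing is sharded, and the isParallelTypeDeviceDim check). It assumes nvFuser's internal headers; the actual implementation lives in csrc/multidevice/utils.cpp and may differ in details such as the exact error message or iteration order.

```cpp
// Sketch only: inferred from the excerpts in this PR description, not copied
// from the PR. Returns the position of the loop axis parallelized on the
// given device dimension, or -1 if no loop axis is sharded on it.
int64_t getShardedLoopAxis(TensorView* tv, ParallelType parallel_type) {
  NVF_ERROR(
      isParallelTypeDeviceDim(parallel_type),
      "Expect a DID but found: ",
      parallel_type);
  const auto& loop = tv->getLoopDomain();
  for (int64_t i = 0; i < static_cast<int64_t>(loop.size()); i++) {
    if (loop[i]->getParallelType() == parallel_type) {
      return i;
    }
  }
  return -1;
}
```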


Changes walkthrough 📝

Relevant files:

Enhancement

  • csrc/multidevice/utils.cpp: Add getShardedLoopAxis and remove getShardedIterDomains (+15/-10)
      • Removed the getShardedIterDomains function.
      • Added the getShardedLoopAxis function.

  • csrc/scheduler/reduction_utils.cpp: Update scheduleReductionTV for DID loop split (+11/-4)
      • Updated scheduleReductionTV to use getShardedLoopAxis.
      • Added error checks for DID loop split.

  • csrc/multidevice/utils.h: Declare getShardedLoopAxis (+4/-0)
      • Added the declaration for getShardedLoopAxis.

Tests

  • tests/cpp/test_multidevice_sharding.cpp: Add DivideBySum test case (+42/-0)
      • Added a DivideBySum test case for multidevice sharding.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Assumption Check

The PR assumes that the DIDx domain is always the outermost domain in the loop. This assumption should be validated with more test cases to ensure correctness. For example, in the DivideBySum test below, the outer split, DIDx parallelization, and reorder place DIDx at loop position 0 on every sharded tensor.

```cpp
int64_t sharded_axis = getShardedLoopAxis(reduction_tv, ParallelType::DIDx);
if (sharded_axis >= 0) {
  NVF_ERROR(
      sharded_axis == 0,
      "Expect 1D mesh and DIDx only appear outermost in loop, but found: ",
      reduction_tv->getLoopDomain());
}
```
Error Handling

The error handling in getShardedLoopAxis could be improved by providing more context in the error message, such as the specific parallel type that caused the failure.

```cpp
// The enclosing assertion is assumed to be NVF_ERROR, matching the check in
// the Assumption Check excerpt above; the review excerpt showed only its
// arguments.
NVF_ERROR(
    isParallelTypeDeviceDim(parallel_type),
    "Expect a DID but found: ",
    parallel_type);
```
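
As one way to illustrate this suggestion (a sketch, not the PR's actual code), the check could name the helper and the tensor being queried:

```cpp
// Hypothetical, more descriptive variant of the check above. Assumes the
// surrounding function receives `tv` (TensorView*) and `parallel_type`.
NVF_ERROR(
    isParallelTypeDeviceDim(parallel_type),
    "getShardedLoopAxis expects a device-parallel type (e.g. DIDx), but got ",
    parallel_type,
    " for TensorView: ",
    tv->toString());
```
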
Test Coverage

While a new test DivideBySum is added, it would be beneficial to add more test cases to cover different scenarios and edge cases for the new functionality.

```cpp
TEST_F(MultiDeviceTest, DivideBySum) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  const int64_t d = communicator_->size();

  // [b, h, s, s]
  TensorView* x = makeContigTensor(4);
  TensorView* sum_x = sum(x, {-1});
  TensorView* sum_x_broadcasted = broadcast(sum_x, {false, false, false, true});
  TensorView* y = div(x, sum_x_broadcasted);
  fusion->addInput(x);
  fusion->addOutput(y);

  auto mesh = DeviceMesh::createForNumDevices(d);
  for (auto* tv : {x, sum_x, sum_x_broadcasted, y}) {
    tv->setDeviceMesh(mesh);
    // Outer-split h by d, shard the outer factor on DIDx, and move it to the
    // front so DIDx is the outermost loop axis.
    tv->split(1, d, /*inner_split=*/false);
    tv->axis(1)->parallelize(ParallelType::DIDx);
    tv->reorder({{1, 0}});
  }
  for (auto* tv : {x, y}) {
    tv->setAllocationDomain(tv->getLoopDomain(), true);
  }

  const int64_t b = 2;
  const int64_t h = d * 3;
  const int64_t s = 5;
  at::Tensor unsharded_x_tensor = at::randint(5, {b, h, s, s}, tensor_options);
  at::Tensor x_tensor = shardTensor(unsharded_x_tensor, x);

  FusionExecutorCache executor_cache(std::move(fusion));
  at::Tensor y_tensor = executor_cache.runFusionWithInputs({x_tensor})[0];
  testValidate(
      executor_cache.fusion(),
      {y_tensor},
      {x_tensor},
      {x_tensor / x_tensor.sum(-1, true)},
      __LINE__,
      __FILE__);
}
```
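
As a possible extension (not in the PR), the new helper could also assert the layout the test sets up, since getShardedLoopAxis should report DIDx at position 0 for every sharded TensorView:

```cpp
// Hypothetical extra assertions for the test above (gtest macros), checking
// that DIDx ends up as the outermost loop axis on each sharded TensorView.
for (auto* tv : {x, sum_x, sum_x_broadcasted, y}) {
  EXPECT_EQ(getShardedLoopAxis(tv, ParallelType::DIDx), 0);
}
```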

@wujingyue wujingyue requested a review from Priya2698 February 8, 2025 01:03

wujingyue (Collaborator, Author) commented:

!test

in the same way as ExpressionEvaluator::bindTensorDomain and several other places. Caveat: having to fix multiple places in the same way probably indicates a pre-existing duplication of logic.

@wujingyue wujingyue changed the base branch from wjy/gdb to bug3817 February 8, 2025 07:50
wujingyue (Collaborator, Author) commented:

!test

@naoyam naoyam mentioned this pull request Feb 10, 2025

naoyam (Collaborator) left a review comment:

LGTM

Base automatically changed from bug3817 to main February 10, 2025 18:45

wujingyue (Collaborator, Author) commented:

!test

Priya2698 (Collaborator) left a review comment:

LGTM.

wujingyue (Collaborator, Author) commented:

CI failures are due to http://nv/exg

@wujingyue wujingyue merged commit fd96f84 into main Feb 10, 2025
49 of 52 checks passed
@wujingyue wujingyue deleted the wjy/norm branch February 10, 2025 23:54