
pytorch v2.6.0 #326

Merged: 57 commits into conda-forge:main from the 2.6 branch, Feb 13, 2025

Conversation

h-vetinari
Member

Build the release candidates

Linux CI cancelled until builds for #322 are live

@conda-forge-admin
Contributor

conda-forge-admin commented Jan 18, 2025

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.
  • ℹ️ The recipe is not parsable by parser conda-recipe-manager. The recipe can only be automatically migrated to the new v1 format if it is parseable by conda-recipe-manager.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/13297069508. Examine the logs at this URL for more detail.

@h-vetinari
Member Author

This looks better than expected so far. Still have to double-check the dependency changes. Happy if someone could do that (even if it's just noting which bounds changed relative to the current recipe).

@h-vetinari
Member Author

Aarch builds fail with

```
$SRC_DIR/third_party/XNNPACK/src/reference/unary-elementwise.cc:125:14: error: invalid 'static_cast' from type 'xnn_bfloat16' to type '_Float16'
  125 |       return static_cast<TOut>(x);
      |              ^~~~~~~~~~~~~~~~~~~~
```

@h-vetinari
Member Author

Sigh, since when is conda-build applying patches through git rather than through patch? The former is stricter than the latter, and doesn't work in some situations (like applying a patch in one of the submodules):

```
Applying patch: /Users/runner/work/1/s/recipe/patches_submodules/0001-Fix-bazel-linux-aarch64-gcc13-workflow-and-resolve-a.patch
Applying: Fix `bazel-linux-aarch64-gcc13` workflow and resolve accompanying build errors.
error: sha1 information is lacking or useless (third_party/XNNPACK/src/reference/unary-elementwise.cc).
error: could not build fake ancestor
```
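
For context, a rough sketch (Python, purely illustrative; not conda-build's actual implementation) of the difference: git's 3-way apply wants the pre-image blob hashes recorded in the patch, which aren't available for files living inside a submodule checkout, while classic `patch` only matches on the hunk context:

```python
import subprocess
from pathlib import Path

def apply_patch(patch_file: Path, src_dir: Path) -> None:
    """Hypothetical helper: try git first, fall back to patch(1)."""
    try:
        # `git apply --3way` needs the pre-image blobs referenced in the patch;
        # for sources inside a submodule those blobs aren't in the parent repo,
        # which is what produces "could not build fake ancestor".
        subprocess.run(["git", "apply", "--3way", str(patch_file)],
                       cwd=src_dir, check=True)
    except subprocess.CalledProcessError:
        # patch(1) only needs the surrounding context lines to match.
        subprocess.run(["patch", "-p1", "-i", str(patch_file)],
                       cwd=src_dir, check=True)
```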

@h-vetinari force-pushed the 2.6 branch 2 times, most recently from bd0bec7 to 022f063 on January 20, 2025 08:03
otherwise conda breaks
```
conda_build.exceptions.RecipeError: Mismatching hashes in recipe. Exact pins in dependencies that contribute to the hash often cause this. Can you change one or more exact pins to version bound constraints?
Involved packages were:
Mismatching package: libtorch (id cpu_generic_habf3c96_0); dep: libtorch 2.6.0.rc7 *0; consumer package: pytorch
```
@danpetry
Contributor

on osx-64

```
FAILED [0.0518s] test/test_nn.py::TestNN::test_batchnorm_nhwc_cpu - AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 8 (12.5%)
Greatest absolute difference: 1.430511474609375e-05 at index (0,) (up to 1e-05 allowed)
Greatest relative difference: 4.616721980710281e-06 at index (0,) (up to 1.3e-06 allowed)
```

skip?
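
(For reference: the check behind "Tensor-likes are not close!" is roughly `|actual - expected| <= atol + rtol * |expected|`. A quick sketch with the numbers printed in the failure above, nothing recomputed, shows how marginal the osx-64 violation is:)

```python
# Worst-case differences reported by the failing test, together with the
# tolerances it allows ("up to ... allowed" in the output above).
abs_diff, abs_tol = 1.430511474609375e-05, 1e-05
rel_diff, rel_tol = 4.616721980710281e-06, 1.3e-06

# The mismatch is a single element out of 8, exceeding the thresholds by only
# a few 1e-6, i.e. a marginal accuracy violation rather than wrong results.
print(f"absolute excess: {abs_diff - abs_tol:.3e}")  # ~4.3e-06
print(f"relative excess: {rel_diff - rel_tol:.3e}")  # ~3.3e-06
```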

@h-vetinari
Member Author

h-vetinari commented Jan 23, 2025

Yeah, this minor accuracy violation indeed sounds skippable, but I've deprioritised this PR until we get the windows builds for 2.5 fixed (and ideally your #318 merged as well).

@danpetry
Contributor

ok, good to know

@danpetry
Contributor

worth pointing out that in 6 days' time, PyPI will have an up-to-date pytorch package whereas conda won't. Will have a look at that other PR.

@h-vetinari
Member Author

worth pointing out that in 6 days' time, PyPI will have an up-to-date pytorch package whereas conda won't.

Are you talking about rc's, or are we not looking at the same index? 2.6.0 GA hasn't been published AFAICT. Or are you saying that 2.6.0 will be released in 6 days?

In any case, this is no reason to rush. We didn't have windows packages for years, and I'm more concerned about fixing them than about lagging behind the PyPI release a bit (and we've often lagged for months in the past; this has gotten much better with the open-gpu server, but it still happens; 2.5.0 was released Oct 18th last year, and we had first builds on Nov 3rd).

@danpetry
Contributor

are you saying that 2.6.0 will be released in 6 days

yes

I'm more concerned about fixing them, than lagging behind the PyPI release a bit

100%

@danpetry
Contributor

danpetry commented Jan 28, 2025

I think https://github.com/conda-forge/triton-feedstock might need to be updated as part of this. Some info on which commit to use here: pytorch/pytorch#145120 (comment)

@danpetry (Contributor) left a comment

I had a look over it and can't see anything I disagree with.
Agree that the tests and also the patches are getting quite extensive. Suggest that we work to reduce that in future PRs perhaps.

@h-vetinari
Member Author

I was just saying in general, it feels like we're skipping too many tests.

I don't think it's that bad TBH. We're skipping 35 tests (out of 8000+), across all architectures, and the vast majority of those skips are specific to one or two variants in our matrix. That's roughly 0.1-0.2% of tests getting skipped on a given platform.
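
(Rough arithmetic behind that estimate; the assumption that roughly 8-16 of the 35 skips apply to any single variant is mine, only the totals come from the numbers above:)

\[ \frac{8}{8000} = 0.1\% \qquad \frac{16}{8000} = 0.2\% \]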

Agree that the tests and also the patches are getting quite extensive. Suggest that we work to reduce that in future PRs perhaps.

💯 I've started an effort to reduce the skips (see #353 & #353). Reducing the patches would need a dedicated effort on the part of the respective authors to upstream their fix. I'll try upstreaming the find_package(CUDA) patch, but it's a non-trivial effort, and even comparatively simple ones like pytorch/pytorch#137084 are still open months later. OTOH, I did structure my include-path manipulation patches in a way that I could at least upstream pytorch/pytorch#145480.

Overall, both the patches and the skips aim to be the minimal set necessary to successfully build and test pytorch. I'd love to see them reduced as fast as possible, but for now I don't have a better alternative.

@h-vetinari
Member Author

CI will finish in about 7h - could I get some approvals or reviews by then? 🙏

Or at least a written "ok to proceed"? 🙃

@danpetry
Contributor

For now, I haven't reviewed in enough depth to be able to give an approval, I think - could do that tomorrow if it's needed.
Otherwise: ok to proceed from me.

mgorny and others added 3 commits February 12, 2025 13:30
Co-Authored-By: H. Vetinari <h.vetinari@gmx.com>
Suggested-By: Michał Górny <mgorny@gentoo.org>
@h-vetinari
Member Author

Oh boy, nevermind. Something is very wrong on linux+CUDA:

```
= 191 failed, 14397 passed, 2763 skipped, 91 xfailed, 143695 warnings in 4444.09s (1:14:04) =
```

It seems that many (56) are due to RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR_HOST_ALLOCATION_FAILED, so basically: running out of memory. Even more (130) are due to RuntimeError: GET was unable to find an engine to execute this computation, which probably has the same underlying reason.
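
A quick way to check the memory-pressure theory on the CI agent would be something along these lines (an illustrative sketch only, not part of the test suite):

```python
import torch

# Print allocator statistics for the current GPU; if reserved/allocated memory
# is close to the device total when the cuDNN errors start appearing, the
# "host allocation failed" / "unable to find an engine" failures are most
# likely plain memory exhaustion rather than a packaging problem.
def report_cuda_memory() -> None:
    dev = torch.cuda.current_device()
    total = torch.cuda.get_device_properties(dev).total_memory
    allocated = torch.cuda.memory_allocated(dev)
    reserved = torch.cuda.memory_reserved(dev)
    print(f"allocated: {allocated / 2**30:.2f} GiB")
    print(f"reserved:  {reserved / 2**30:.2f} GiB")
    print(f"total:     {total / 2**30:.2f} GiB")

if torch.cuda.is_available():
    report_cuda_memory()
```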

@h-vetinari
Member Author

There's another failure that looks like it could have been due to the OOMs, but surprisingly, it fails in exactly the same way on MKL/openblas, and with exactly the same (catastrophic) accuracy violation:

```
____________________ TestNN.test_Conv3d_1x1x1_no_bias_cuda _____________________
[...]
E                   AssertionError: Tensor-likes are not close!
E                   
E                   Mismatched elements: 6 / 6 (100.0%)
E                   Greatest absolute difference: 18.49855402601494 at index (1, 1, 0, 0, 0) (up to 0.0002 allowed)
E                   Greatest relative difference: 211.43751630108665 at index (0, 0, 0, 0, 0) (up to 0 allowed)
```

Full stacktrace:

```
____________________ TestNN.test_Conv3d_1x1x1_no_bias_cuda _____________________
[gw1] linux -- Python 3.10.16 $PREFIX/bin/python

self = <test_nn.TestNN testMethod=test_Conv3d_1x1x1_no_bias_cuda>
test = <torch.testing._internal.common_nn.NewModuleTest object at 0x7fc5465f89a0>
kwargs = {}

    def with_tf32_off(self, test=test, kwargs=kwargs):
        with tf32_off():
>           test.test_cuda(self, **kwargs)

test/test_nn.py:7322: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <torch.testing._internal.common_nn.NewModuleTest object at 0x7fc5465f89a0>
test_case = <test_nn.TestNN testMethod=test_Conv3d_1x1x1_no_bias_cuda>

    def test_cuda(self, test_case):
        if not TEST_CUDA or not self.should_test_cuda:
            raise unittest.SkipTest('Excluded from CUDA tests')
    
        with set_default_dtype(self.default_dtype):
            cpu_input = self._get_input()
    
            type_map = {torch.double: torch.float}
            cpu_input_tuple = cpu_input if isinstance(cpu_input, tuple) else (cpu_input,)
    
            is_any_input_complex = any(isinstance(t, torch.Tensor) and t.dtype.is_complex for t in cpu_input_tuple)
    
            gpu_input_tuple = to_gpu(cpu_input_tuple, type_map=type_map)
    
            cpu_module = self.constructor(*self.constructor_args)
            gpu_module = self.constructor(*self.constructor_args).float().cuda()
            cpu_param = test_case._get_parameters(cpu_module)
            gpu_param = test_case._get_parameters(gpu_module)
            for cpu_p, gpu_p in zip(cpu_param[0], gpu_param[0]):
                gpu_p.data.copy_(cpu_p)
    
            test_case._zero_grad_input(cpu_input_tuple)
            test_case._zero_grad_input(gpu_input_tuple)
            test_case._zero_grad_parameters(cpu_module)
            test_case._zero_grad_parameters(gpu_module)
            cpu_output = test_case._forward(cpu_module, cpu_input_tuple)
            gpu_output = test_case._forward(gpu_module, gpu_input_tuple)
            if getattr(cpu_module, "return_indices", False):
                cpu_output = cpu_output[0]
                gpu_output = gpu_output[0]
            test_case.assertEqual(cpu_output, gpu_output, atol=self.precision, rtol=0, exact_dtype=False)
    
            # Run backwards on CPU and GPU and compare results
            for _ in range(5):
                cpu_gradOutput = cpu_output.clone().normal_()
                gpu_gradOutput = cpu_gradOutput.type_as(gpu_output)
                cpu_gradInput = test_case._backward(cpu_module, cpu_input_tuple, cpu_output, cpu_gradOutput)
                gpu_gradInput = test_case._backward(gpu_module, gpu_input_tuple, gpu_output, gpu_gradOutput)
                test_case.assertEqual(cpu_gradInput, gpu_gradInput, atol=self.precision, rtol=0, exact_dtype=False)
                for cpu_d_p, gpu_d_p in zip(cpu_param[1], gpu_param[1]):
                    test_case.assertEqual(cpu_d_p, gpu_d_p, atol=self.precision, rtol=0)
    
            # Run double-backwards on CPU and GPU and compare results
            if self.check_gradgrad and not self.FIXME_no_cuda_gradgrad_comparison:
                cpu_output = cpu_module(*cpu_input_tuple)
                gpu_output = gpu_module(*gpu_input_tuple)
                if getattr(cpu_module, "return_indices", False):
                    cpu_output = cpu_output[0]
                    gpu_output = gpu_output[0]
    
                cpu_gradOutput = torch.randn_like(cpu_output, requires_grad=True)
                gpu_gradOutput = cpu_gradOutput.type_as(gpu_output).detach()
                gpu_gradOutput.requires_grad = True
    
                cpu_gradInputs = torch.autograd.grad(
                    cpu_output,
                    cpu_input_tuple + tuple(cpu_module.parameters()),
                    cpu_gradOutput,
                    create_graph=True)
                gpu_gradInputs = torch.autograd.grad(
                    gpu_output,
                    gpu_input_tuple + tuple(gpu_module.parameters()),
                    gpu_gradOutput,
                    create_graph=True)
    
                for cpu_d_i, gpu_d_i in zip(cpu_gradInputs, gpu_gradInputs):
>                   test_case.assertEqual(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0, exact_dtype=False)
E                   AssertionError: Tensor-likes are not close!
E                   
E                   Mismatched elements: 6 / 6 (100.0%)
E                   Greatest absolute difference: 18.49855402601494 at index (1, 1, 0, 0, 0) (up to 0.0002 allowed)
E                   Greatest relative difference: 211.43751630108665 at index (0, 0, 0, 0, 0) (up to 0 allowed)
```
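
A standalone approximation of the failing comparison (a sketch under assumptions: single precision on both devices, a hand-picked input shape, and the atol/rtol from the failure message; it is not the NewModuleTest machinery itself and needs a CUDA device to run):

```python
import torch

# The test runs under tf32_off(), so disable TF32 for the comparison as well.
torch.backends.cudnn.allow_tf32 = False

torch.manual_seed(0)
conv_cpu = torch.nn.Conv3d(2, 2, kernel_size=1, bias=False)
conv_gpu = torch.nn.Conv3d(2, 2, kernel_size=1, bias=False).cuda()
conv_gpu.load_state_dict(conv_cpu.state_dict())  # identical weights on both devices

x_cpu = torch.randn(2, 2, 3, 4, 5, requires_grad=True)
x_gpu = x_cpu.detach().cuda().requires_grad_(True)

out_cpu, out_gpu = conv_cpu(x_cpu), conv_gpu(x_gpu)

# Mirror the double-backward check: first-order grads with create_graph=True.
g_cpu = torch.randn_like(out_cpu, requires_grad=True)
g_gpu = g_cpu.detach().cuda().requires_grad_(True)
grads_cpu = torch.autograd.grad(
    out_cpu, (x_cpu, *conv_cpu.parameters()), g_cpu, create_graph=True)
grads_gpu = torch.autograd.grad(
    out_gpu, (x_gpu, *conv_gpu.parameters()), g_gpu, create_graph=True)

for c, g in zip(grads_cpu, grads_gpu):
    # atol=2e-4 / rtol=0 are the tolerances reported in the failure above.
    torch.testing.assert_close(c, g.cpu(), atol=2e-4, rtol=0)
```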

@danpetry
Contributor

A lot of the skips look like they're there to cope with CI limitations, basically

@h-vetinari
Member Author

A lot of the skips look like they're there to cope with CI limitations, basically

Sure, some portion of those are unavoidable, but basically all skips point at an issue that should be fixed in our packaging, or ideally upstream (in the test framework and/or the actual implementation). I kinda prefer having that transparency (and a reminder that there's more work to be done) to simply removing the test modules in question and declaring it a done deal.

For example, the torchinductor tests you added in #318 were painful to get passing, but they did show that the torch.compile feature had essentially been broken in our packaging. I prefer a longer list of skips to a less functional package that ignores problematic test modules wholesale. Of course, the best course is to work through the issues surfaced by the skips, but that needs to happen on a longer timeline IMO, not while we're trying to get the release out.
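
For illustration, a smoke test of roughly the kind that surfaces such breakage (a minimal sketch, not the actual torchinductor test module added in #318):

```python
import torch

# torch.compile goes through torchinductor by default; if the packaging is
# missing pieces (e.g. triton, headers, or the C++ toolchain hook-up), this
# typically raises during compilation, which a plain eager-mode test run
# would never exercise.
@torch.compile
def fused(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x) + 2.0 * x

x = torch.randn(16)
torch.testing.assert_close(fused(x), torch.sin(x) + 2.0 * x)
```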

@h-vetinari
Member Author

Since the linux64+openblas+CUDA job had already passed for f14feb5 (with no essential changes since then), the only thing left to do is to remove the `blas_impl == "mkl"` part from the selector of this line:

```
{% set skips = skips ~ " or test_reentrant_parent_error_on_cpu_cuda)" %}  # [linux and blas_impl == "mkl" and cuda_compiler_version != "None"]
# non-MKL problems
```

As such, I'm not planning to do a new push here (will shift the skip when merging). I'm planning to merge this in ~9-12 hours unless there are other comments @conda-forge/pytorch-cpu; I don't doubt that we'll keep iterating, so even post-merge comments won't take long to get picked up.

h-vetinari added a commit that referenced this pull request Feb 13, 2025
@h-vetinari merged commit a75ba71 into conda-forge:main on Feb 13, 2025
26 of 27 checks passed
@h-vetinari deleted the 2.6 branch on February 13, 2025 19:25
@danpetry
Contributor

🎉 thanks a lot for the huge amount of work on this over the last weeks @h-vetinari . This is really great.
