Test if last passing run can be reproduced #345
Conversation
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml). I do have some suggestions for making it better though... For recipe/meta.yaml:
This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/13174999474. Examine the logs at this URL for more detail.
Force-pushed from 58c8b9a to aac481c
Incredibly, this really seems to be MKL-specific somehow, as the openblas builds in #326 passed, while the MKL builds ran into the pytest error (same situation as the CI after merging #340). Whatever "Channels" execnet is trying to use might somehow be getting occupied by MKL?
CC @conda-forge/pytorch-cpu @mgorny @danpetry @rgommers @isuruf: literally any ideas on what could be causing this interaction would be welcome.
have you tried to uninstall pytest-xdist?
try running without parallel testing?
basically same thing
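For reference, a minimal sketch of what the two suggestions above amount to in practice, assuming the suite is driven by a plain pytest invocation (the test path here is a placeholder, not the recipe's actual command):

```bash
# Option 1: remove pytest-xdist entirely so the plugin never loads
pip uninstall -y pytest-xdist

# Option 2: keep it installed but run serially for one invocation
pytest -p no:xdist tests/    # turn the plugin off for this run
pytest -n 0 tests/           # or: zero workers, i.e. run in-process
```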
or maybe get some more verbose logs from pytest-xdist to see why channel 3 is closing? OOM issue maybe...?
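On the verbose-logging idea, a sketch of the knobs one might try; pytest's `--debug` switch is standard, while the `EXECNET_DEBUG` environment variable is an assumption about execnet's gateway debugging that should be verified against the installed version:

```bash
# pytest's internal trace; writes pytestdebug.log into the working directory
pytest -n 4 --debug

# execnet gateway debugging (assumption: EXECNET_DEBUG is still honored;
# historically 1 logs to stderr, 2 logs to per-process files)
EXECNET_DEBUG=2 pytest -n 4
```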
Agree with others. Would start with cutting everything test-wise to 1 thread & 1 process. OOM may just be another way of saying oversubscription from parallelism. Also, other BLAS libraries have their own kinds of parallelism that may need to be disabled; usually this can be set with an environment variable.
On my wimpy machine, I have to set:
to avoid using the "energy efficient" cores on my laptop so that pytorch actually runs faster... might help here.
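The concrete settings were not captured in the comment above; purely as an illustration of the kind of thread-pinning meant here (the variable names are the standard ones, the values and core range are arbitrary assumptions):

```bash
# Cap the per-process thread pools of MKL / OpenBLAS / OpenMP
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1

# Optionally pin the run to a fixed set of (performance) cores on Linux
taskset -c 0-7 pytest -n 4
```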
Sure, I'm trying the reduced-parallelism route (#346), but that's no explanation for why the very same parallel invocation stopped working, much less why it does so only with MKL together with CUDA on linux (Win+CUDA+MKL is fine, linux+CUDA+openblas is fine, linux+CPU+MKL is fine).
I was actually thinking OOM (out of RAM) rather than out of threads, but it was just an idea. Apparently MKL uses more RAM than openBLAS. Wonder if the failure is deterministic, i.e. is it the same test each time?
It has been deterministic, but in the test collection phase (rather than for any identifiable individual test), which makes it implausible to me that it's due to an OOM. In any case, I opened #348 so we can centralize this discussion, which has become scattered over a bunch of PRs. If the removal of
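Since the crash happens during collection, one way to narrow it down (a sketch, assuming the suite can be collected standalone) is to compare collection-only runs with and without xdist workers:

```bash
# Collect without executing anything; a failure here rules out individual tests
pytest --collect-only -q

# Same collection, but performed inside the xdist workers
pytest -n 4 --collect-only -q
```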
#344 tried to reduce the diff to the last passing run (dfadf15), but still ran into the same issue with pytest.
As a final check, this PR takes no shortcuts and simply runs CI again for the last passing build; not a hair different (no tests, no skips, no comments, no nothing), just a hard reset.
More concretely, all the linux-64 + CUDA + MKL builds are failing with
I double-checked the pytest versions, and there's no difference there either between passing:
and failing:
The full diff between the test environments of the last passing run and the one in #344 is quite massive though.
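One way to make that comparison more tractable (a sketch; the file names are placeholders) is to dump the resolved package list from each CI run and diff the two exports:

```bash
# In each CI job, record the fully resolved test environment
conda list --export > env-passing.txt   # last passing run
conda list --export > env-failing.txt   # failing run from #344

# Locally, compare the two
diff -u env-passing.txt env-failing.txt
```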