ci: Re-enable CI on GH200 #1653

Merged: 23 commits merged on Jan 14, 2025

havogt commented Sep 20, 2024

No description provided.

edopao commented Nov 6, 2024

Tödi seems to be back, we could try to resume this PR without any specific partition/account or time limit.

edopao self-requested a review December 20, 2024 10:05

edopao commented Jan 8, 2025

@FlorianDeconinck We got a failure in a test case:
tests/cartesian_tests/integration_tests/multi_feature_tests/test_code_generation.py::test_K_offset_write[dace:gpu]

Test log here:
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4455690602105886/4525297225819146/-/pipelines/1616373175

I see that there is a check on CUDA version:

Maybe this check is not enough?

FlorianDeconinck commented

Hey Enrique,

I have a fix for that. The extent interval check is indeed broken for K, and the test itself is bad. I can open a PR with a quick fix to the test, or you can change it as part of your PR.

    def column_physics_conditional(A: Field[np.float64], B: Field[np.float64], scalar: np.float64):
-        with computation(BACKWARD), interval(1, None):
+        with computation(BACKWARD), interval(1, -1):
            if A > 0 and B > 0:
                A[0, 0, -1] = scalar
                B[0, 0, 1] = A
            lev = 1
            while A >= 0 and B >= 0:
                A[0, 0, lev] = -1
                B = -1
                lev = lev + 1

This should fix the test. The CUDA version check came from an earlier misunderstanding of where the race condition could come from.
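One way to read the diff above, sketched in plain Python rather than GT4Py: assuming `interval(start, end)` selects K levels the way a Python slice does, the stencil's writes at offsets `-1` and `+1` can be checked against the column bounds. The helper names `k_levels` and `out_of_bounds_writes` and the column size `nz = 5` are hypothetical and only serve the illustration; this is not how GT4Py performs its extent check.

```python
def k_levels(nz, start, end):
    """K indices covered by interval(start, end), assuming slice-like semantics."""
    return list(range(nz)[start:end])


def out_of_bounds_writes(nz, start, end, write_offsets):
    """Return (k, offset) pairs whose write target k + offset falls outside [0, nz)."""
    return [
        (k, off)
        for k in k_levels(nz, start, end)
        for off in write_offsets
        if not 0 <= k + off < nz
    ]


nz = 5  # hypothetical number of vertical levels
offsets = (-1, 1)  # A[0, 0, -1] = ... and B[0, 0, 1] = ...

print(out_of_bounds_writes(nz, 1, None, offsets))  # interval(1, None) -> [(4, 1)]
print(out_of_bounds_writes(nz, 1, -1, offsets))    # interval(1, -1)   -> []
```

Under these assumptions, `interval(1, None)` includes the top level, where the `+1` write has no valid target, while `interval(1, -1)` keeps every write inside the column.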


edopao commented Jan 8, 2025

There is no hurry from our side; you can open a fix PR when you have time. I suspect this test failure is flaky, since it only happens sometimes.

FlorianDeconinck commented

PR opened here: #1791


havogt commented Jan 10, 2025

Looks like the dace problem is still there, even on CUDA 12: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4455690602105886/4525297225819146/-/jobs/8815984483


havogt commented Jan 10, 2025

cscs-ci run

FlorianDeconinck commented

Alright, there have been enough failed attempts at fixing this. I'll open a PR today that completely deactivates the feature, and we will go back to the drawing board to figure out what we are clearly not understanding.

@@ -583,6 +583,11 @@ def test_K_offset_write(backend):
    if backend == "cuda":
        pytest.skip("cuda K-offset write generates bad code")

    if backend == "dace:gpu":
        pytest.skip(

@FlorianDeconinck I have deactivated the test case this way. I was not sure whether to refer to issue #1684 or #1754.
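For reference, a minimal sketch of this kind of per-backend deactivation, assuming the test receives the backend name as a string. The `cuda` reason is taken from the hunk above; the `dace:gpu` reason is a placeholder, since the actual message is cut off in the diff, and the helper names `skip_reason` and `maybe_skip` are hypothetical.

```python
from typing import Optional

# Backends on which the K-offset write test is deactivated, with the skip reason.
# The "dace:gpu" text is a placeholder: the real message is not shown in full.
KNOWN_BROKEN_BACKENDS = {
    "cuda": "cuda K-offset write generates bad code",
    "dace:gpu": "K-offset write fails intermittently on dace:gpu (placeholder)",
}


def skip_reason(backend: str) -> Optional[str]:
    """Return the skip reason for a backend, or None if the test should run."""
    return KNOWN_BROKEN_BACKENDS.get(backend)


def maybe_skip(backend: str) -> None:
    """Skip the calling pytest test if the backend is known to be broken."""
    import pytest  # imported lazily so skip_reason works without pytest installed

    reason = skip_reason(backend)
    if reason is not None:
        pytest.skip(reason)
```

Calling `maybe_skip(backend)` as the first statement of the test would have the same effect as the inline `if backend == ...` checks, while keeping the reasons in one place where an issue number can be added once it is decided which one applies.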

Thanks. January has been idiotically overbooked and I keep falling behind. I'll take it from here; sorry again for the delay.


havogt commented Jan 14, 2025

Thanks @edopao

havogt merged commit 4dc1531 into main Jan 14, 2025
25 checks passed
havogt deleted the enable_gh_ci branch January 14, 2025 08:53