baseservices/exceptions/stackoverflow/stackoverflowtester/stackoverflowtester fails on Linux #46175

Closed
sandreenko opened this issue Dec 17, 2020 · 20 comments · Fixed by #105578
Labels: area-VM-coreclr, disabled-test (the test is disabled in source code against the issue), in-pr (there is an active PR which will close this issue when it is merged), os-linux (Linux OS, any supported distro)

@sandreenko
Contributor

The test was disabled, so we had not seen this failure. The log is:

  baseservices/exceptions/stackoverflow/stackoverflowtester/stackoverflowtester.sh [FAIL]
      
      Return code:      1
      Raw output file:      /root/helix/work/workitem/baseservices/exceptions/Reports/baseservices.exceptions/stackoverflow/stackoverflowtester/stackoverflowtester.output.txt
      Raw output:
      BEGIN EXECUTION
      /root/helix/work/correlation/corerun stackoverflowtester.dll ''
      Running stackoverflow test(smallframe main)
      "Stack overflow."
      "Repeat 174461 times:"
      "--------------------------------"
      "   at TestStackOverflow.Program.InfiniteRecursionC()"
      "   at TestStackOverflow.Program.InfiniteRecursionB()"
      "   at TestStackOverflow.Program.InfiniteRecursionA()"
      "--------------------------------"
      "   at TestStackOverflow.Program.Test(Boolean)"
      "   at TestStackOverflow.Program.Main(System.String[])"
      "apply_reg_state: ip and cfa unchanged; stopping here (ip=0x7fb3fc0f2c)"
      Gathering state for process 522 corerun
      Writing minidump with heap to file /home/helixbot/dotnetbuild/dumps/coredump.522.dmp
      Written 61616128 bytes (15043 pages) to core file
      Dump successfully written
      ""
      Missing "Main" method frame at the last line
      Expected: 100
      Actual: 101
      END EXECUTION - FAILED

Note that on some architectures it fails with a timeout.

AzDo example.
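
In the log above, the `Main` frame is followed by the libunwind message, the dump-writer output and an empty line, which may be why a check against the very last line reports a missing `Main` frame. A minimal sketch in C of a more tolerant check, assuming frame lines are the ones containing `   at `. The function name `last_frame_is_main` is hypothetical, and this is not the test's actual code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Bounded substring search inside one line of length len. */
static bool contains(const char *s, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    for (size_t i = 0; i + n <= len; i++) {
        if (memcmp(s + i, needle, n) == 0)
            return true;
    }
    return false;
}

/* Hypothetical helper: returns true when the last stack frame line
 * (a line containing "   at ") names a Main method, ignoring any
 * non-frame lines that follow it. */
bool last_frame_is_main(const char *output)
{
    const char *last_frame = NULL;
    size_t last_len = 0;
    const char *line = output;

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);

        if (contains(line, len, "   at ")) {
            last_frame = line;
            last_len = len;
        }
        line = end ? end + 1 : line + len;
    }
    return last_frame != NULL && contains(last_frame, last_len, ".Main(");
}
```

With this shape, trailing diagnostics after the stack trace do not affect the result, while a trace whose final frame is not `Main` is still rejected.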

@sandreenko sandreenko added os-linux Linux OS (any supported distro) area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Dec 17, 2020
@sandreenko sandreenko added this to the 6.0.0 milestone Dec 17, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 17, 2020
@sandreenko
Contributor Author

PTAL @echesakovMSFT I believe you were working with this test.

@echesakov
Contributor

> PTAL @echesakovMSFT I believe you were working with this test.

No, @janvorli created this test

@sandreenko sandreenko added area-VM-coreclr and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Dec 17, 2020
@janvorli
Member

I had no idea the test was disabled. @sandreenko where have you seen it failing with timeout?

@sandreenko
Contributor Author

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Dec 17, 2020
@sandreenko sandreenko linked a pull request Dec 21, 2020 that will close this issue
@janvorli
Member

janvorli commented Jan 4, 2021

After I fixed the lookup for the Main frame in the stack trace, the ARM64 legs still fail because the stack probing helper change is not in yet. For large frames, the faulting address is too far from the SP, so the failure is not recognized as a stack overflow, and the SIGSEGV alternate stack is not large enough to run the full stack overflow reporting: the alternate stack is about two pages, while the reporting needs about 8 pages. We need to wait for the stack probing helper change before re-enabling the tests.
The OSX / Linux x64 legs are failing due to timeouts, most likely because our test infra generates core dumps for the processes that the test launches and that are expected to fail with the stack overflow. I'll look into a way to prevent dump generation for the secondary processes.
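
One possible way to suppress dumps for a child process that is expected to crash is sketched below in C. It assumes the test infra turns dumps on through the `DOTNET_DbgEnableMiniDump` / `COMPlus_DbgEnableMiniDump` environment variables and that kernel core files are governed by `RLIMIT_CORE`. The helper name `run_expected_crash` is hypothetical, and this is not necessarily how the test was eventually fixed:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv in a child with dump generation turned off
 * and returns the raw wait status, or -1 if the child could not be run. */
int run_expected_crash(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: drop the settings that enable runtime minidumps
         * (assumed to be inherited from the test infra). */
        unsetenv("DOTNET_DbgEnableMiniDump");
        unsetenv("COMPlus_DbgEnableMiniDump");

        /* Also ask the kernel not to write a core file for this process. */
        struct rlimit no_core = { 0, 0 };
        if (setrlimit(RLIMIT_CORE, &no_core) != 0)
            _exit(126);

        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

The parent keeps its own dump settings, so unexpected crashes of the test driver itself would still produce a dump.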

@mangod9
Member

mangod9 commented Jul 26, 2021

Is the stack probing helper change noted above merged, or is it still pending?

@echesakov
Contributor

> is the stack probing helper change noted above merged, or is it still pending?

The change was postponed to 7.0.0 - we need to fix #47810 first. Otherwise, enabling the stack probing helper introduces regressions.

@mangod9
Member

mangod9 commented Jul 26, 2021

Ok thanks for the update. Moving this to 7 as well.

@mangod9 mangod9 modified the milestones: 6.0.0, 7.0.0 Jul 26, 2021
@am11
Member

am11 commented Feb 9, 2022

Another set of tests has started to fail on CoreCLR Pri0 Runtime Tests Run Linux arm64 checked.

logs: https://helix.dot.net/api/2019-06-17/jobs/70a35f2c-194c-4f0d-97e6-a693efb480e4/workitems/profiler.eventpipe/console

  Starting:    profiler.eventpipe.XUnitWrapper (parallel test collections = on, max threads = 4)
    profiler/eventpipe/eventpipe/eventpipe.sh [FAIL]
      Unhandled exception. System.Exception: Profilee returned exit code 255 instead of expected exit code 100.
         at Profiler.Tests.ProfilerTestRunner.FailFastWithMessage(String error)
         at Profiler.Tests.ProfilerTestRunner.Run(String profileePath, String testName, Guid profilerClsid, String profileeArguments, ProfileeOptions profileeOptions, Dictionary`2 envVars, String reverseServerName, Boolean loadAsNotification, Int32 notificationCopies)
         at EventPipeTests.EventPipe.Main(String[] args)
      apply_reg_state: ip and cfa unchanged; stopping here (ip=0x7fb6cd6024)
      /root/helix/work/workitem/e/profiler/eventpipe/eventpipe/eventpipe.sh: line 384:    47 Aborted                 (core dumped) $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"
      
      Return code:      1

Should we disable all these tests until #47810 and this issue are resolved?

@echesakov
Contributor

@am11 I am not sure I understand the connection between the failing profiler test and the issue with stack probing. Can you please elaborate?

@am11
Member

am11 commented Feb 9, 2022

@echesakovMSFT, ah ok. The error from libunwind is "apply_reg_state: ip and cfa unchanged;", so I thought this issue was tracking it, based on the logs in the top post. Is that error unrelated, and do we need to track it?

@echesakov
Contributor

@am11 Yes, it looks unrelated.

@mangod9
Member

mangod9 commented Jul 19, 2022

@JulieLeeMSFT, Egor had pointed to #47810, which needs to be merged before rechecking whether this test would pass. Is it planned for 7 (it's currently marked as future)?

@mangod9
Member

mangod9 commented Aug 11, 2022

moving this to 8.

@mangod9 mangod9 removed this from the 7.0.0 milestone Aug 11, 2022
@mangod9
Member

mangod9 commented Jul 29, 2023

Looks like #47810 is still not merged. @JulieLeeMSFT @BruceForstall assume this is not planned for 8?

@BruceForstall BruceForstall added the disabled-test The test is disabled in source code against the issue label Jul 29, 2023
@BruceForstall
Member

@mangod9 Note that this test is disabled for all Linux, as well as for win-x86 (#84911). Issue #47810 is an optimization for arm64 only. The arm64 stack probing issue is #13519. There is no current plan to implement it. (cc @kunalspathak)

But, as mentioned, that should only affect arm64. All the other test failures of this test (non-arm64 Linux and win-x86) could be independently investigated.

@mangod9
Member

mangod9 commented Jul 31, 2023

@janvorli, would your recent exceptions work handle this case? If so, we can move this to 9.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 1, 2023
@janvorli janvorli modified the milestones: 8.0.0, 9.0.0 Aug 14, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 20, 2023
@mangod9
Member

mangod9 commented Jul 10, 2024

Looks like the disabled test was enabled as part of JanV's fix. Closing now.

@mangod9 mangod9 closed this as completed Jul 10, 2024
@janvorli
Member

@mangod9 my PR was closed, not merged, and the tests are still disabled. Based on @jkotas's feedback, I wanted to make the fix more bulletproof, but then it fell off my radar with all the EH work. I am reopening the issue. I'll try to get back to fixing it soon.

@jkotas jkotas reopened this Jul 11, 2024
@mangod9
Member

mangod9 commented Jul 11, 2024

Oh sorry, I missed that the PR was closed before merging. Assuming we can enable it again in 9.

janvorli added a commit to janvorli/runtime that referenced this issue Jul 26, 2024
When multiple threads crash with hardware unhandled exceptions at the
same time, the fact that we were uninstalling async signal handlers
at process exit caused crashes when some thread reached the signal
handler after .NET handler was removed.

This change fixes it by not restoring the signal handlers during
process exit. It actually stops restoring any signal handlers except
for SIGABRT that has to be restored to actually enable the process
exit with abort().

Close dotnet#46175
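
The idea in the commit message can be sketched in C as follows: the handlers stay installed during process exit, and only the SIGABRT disposition is put back so that `abort()` can terminate the process. All names here (`install_runtime_handlers`, `prepare_for_abort`, `runtime_handler`) are hypothetical, and this is a simplified illustration rather than the code from the commit:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* SIGABRT disposition that was in effect before the handlers were installed. */
static struct sigaction g_prev_sigabrt;

/* Placeholder for the runtime's handler; it must stay valid until the process ends. */
static void runtime_handler(int sig)
{
    _exit(128 + sig);
}

/* Installs the same handler for SIGSEGV and SIGABRT and remembers the
 * previous SIGABRT action. Returns 0 on success, -1 on failure. */
int install_runtime_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = runtime_handler;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGABRT, &sa, &g_prev_sigabrt);
}

/* Called on the exit path: restores only SIGABRT. The SIGSEGV handler is
 * deliberately left in place, so a thread that faults late still reaches it. */
int prepare_for_abort(void)
{
    return sigaction(SIGABRT, &g_prev_sigabrt, NULL);
}
```

On the exit path, a caller would invoke `prepare_for_abort()` and then `abort()`. Other threads that crash in the meantime still enter `runtime_handler` instead of a disposition that was removed underneath them.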
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Jul 26, 2024
@jkotas jkotas closed this as completed in 4c3edd5 Jul 27, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 27, 2024
8 participants