Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable crashtracking by default #3970

Merged
merged 2 commits into from
Oct 9, 2024

Conversation

lloeki
Copy link
Member

@lloeki lloeki commented Oct 3, 2024

What does this PR do?

Disable by default as a cautionary measure.

Motivation:

Crashtracking has been identified as causing issues with ECHILD and waitpid, and possibly anything involving SIGCHLD with obscure failure modes of hard to anticipate consequences.

Additional Notes:

How to test the change?

CI

@lloeki lloeki requested a review from a team as a code owner October 3, 2024 09:12
@lloeki lloeki marked this pull request as draft October 3, 2024 09:12
@lloeki lloeki force-pushed the lloeki/disable-crashtracking-by-default branch from e4bb9f0 to db0565c Compare October 3, 2024 09:28
@pr-commenter
Copy link

pr-commenter bot commented Oct 3, 2024

Benchmarks

Benchmark execution time: 2024-10-03 10:04:03

Comparing candidate commit db0565c in PR branch lloeki/disable-crashtracking-by-default with baseline commit 9a012ec in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 23 metrics, 2 unstable metrics.

Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Change seems reasonable, although the crashtracker specs are failing in CI, I guess they need a bit of fixing given the new default.

@lloeki
Copy link
Member Author

lloeki commented Oct 7, 2024

I guess they need a bit of fixing given the new default.

Yes that's what I gathered.

The errors seem weird though, and the pgrep check is a bit naive (on top of requiring pgrep): it would pass if a matching process already existed. Worse, with -f it would pass if any process had the matching string anywhere as argument:

# shell 1
$ ruby -e 'puts "libdatadog-crashtracking-receiver"; sleep 999999'

# shell 2
pgrep -f libdatadog-crashtracking-receiver
3739

So, failures, here they are:

Failed examples:

rspec ./spec/datadog/core/crashtracking/component_spec.rb:368 # Datadog::Core::Crashtracking::Component instance methods integration testing when forked ensures the latest configuration applied
rspec ./spec/datadog/core/crashtracking/component_spec.rb:157 # Datadog::Core::Crashtracking::Component instance methods #start when forked starts a second crash tracker for the fork
  1) Datadog::Core::Crashtracking::Component instance methods integration testing when forked ensures the latest configuration applied
     Failure/Error: expect(status && status.success?).to be(true), "STDOUT:`#{stdout}` STDERR:`#{stderr}"
     
       STDOUT:`` STDERR:`/usr/local/bundle/gems/rspec-support-3.13.1/lib/rspec/support.rb:110:in `block in <module:Support>': (Datadog::Core::Crashtracking::Component (class))._native_start_or_update_on_fork(hash_including(:action=>:update_on_fork, :agent_base_url=>"http://google.com:12345/")) (RSpec::Expectations::ExpectationNotMetError)
           expected: 1 time with arguments: (hash_including(:action=>:update_on_fork, :agent_base_url=>"http://google.com:12345/"))
           received: 0 times

        from /app/spec/datadog/core/crashtracking/component_spec.rb:384:in `block (6 levels) in <top (required)>'

received: 0 times, so OK I get that it's expected because it's disabled by default now, buuuut see below.

  2) Datadog::Core::Crashtracking::Component instance methods #start when forked starts a second crash tracker for the fork
     Got 2 failures:

OK, first one:

     2.1) Failure/Error: expect(status && status.success?).to be(true), "STDOUT:`#{stdout}` STDERR:`#{stderr}"
          
            STDOUT:`` STDERR:`/usr/local/bundle/gems/rspec-support-3.13.1/lib/rspec/support.rb:110:in `block in <module:Support>':  (RSpec::Expectations::ExpectationNotMetError)
            expected #<Integer:5> => 2
                 got #<Integer:3> => 1
          
            Compared using equal?, which compares object identity,
            but expected and actual are not the same object. Use
            `expect(actual).to eq(expected)` if you don't care about
            object identity in this example.

                from /app/spec/datadog/core/crashtracking/component_spec.rb:163:in `block (6 levels) in <top (required)>'

This one's odd because, uh, expect(status && status.success?).to be(true) fails because expected #<Integer:5> => 2 but got #<Integer:3> => 1? That doesn't click. I suppose it's another expectation that raises ExpectationNotMetError but it gets caught by this one.

Second one:

     2.2) Failure/Error: wait_for { `pgrep -f libdatadog-crashtracking-receiver` }.to be_empty
            expected `"1607\n".empty?` to be truthy, got false

          # ./spec/datadog/core/crashtracking/component_spec.rb:104:in `block (3 levels) in <top (required)>'
          # ------------------
          # --- Caused by: ---
          # Timeout::Error:
          #   execution expired
          #   /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait/handler.rb:14:in `sleep'

So pgrep has found one match even though it's disabled and should not start? This seems to contradict the very first (expected) failure. It could be matching itself? BSD and GNU behaving differently? Whatever, it's way too brittle.

If anything, the tests passed locally... (huh, they're actually PENDING...)

lloeki added 2 commits October 9, 2024 16:29
Crashtracking has been identified as causing issues with `ECHILD` and
`waitpid`, and possibly anything involving `SIGCHLD` with obscure
failure modes of hard to anticipate consequences.

Disable by default as a cautionary measure.

Mitigates #3954
These specs are problematic in two ways:

- `pgrep` results in false positive
- the above masked the fact that `at_fork` appears to be unreliable
- some of these leak threads
@lloeki lloeki force-pushed the lloeki/disable-crashtracking-by-default branch from 3ee52cb to 8bc0e5e Compare October 9, 2024 14:29
@lloeki
Copy link
Member Author

lloeki commented Oct 9, 2024

Rebased. Sadly some tests needed to be outright removed as they're not reliable.

They'll be fixed in a subsequent PR.

@lloeki lloeki marked this pull request as ready for review October 9, 2024 14:30
@lloeki lloeki mentioned this pull request Oct 9, 2024
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (master@e5a774a). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #3970   +/-   ##
=========================================
  Coverage          ?   97.84%           
=========================================
  Files             ?     1314           
  Lines             ?    78444           
  Branches          ?     3889           
=========================================
  Hits              ?    76750           
  Misses            ?     1694           
  Partials          ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lloeki lloeki merged commit e0978ac into master Oct 9, 2024
246 of 247 checks passed
@lloeki lloeki deleted the lloeki/disable-crashtracking-by-default branch October 9, 2024 15:08
@github-actions github-actions bot added this to the 2.4.0 milestone Oct 9, 2024
@y9v y9v mentioned this pull request Oct 11, 2024
ivoanjo added a commit that referenced this pull request Dec 9, 2024
**What does this PR do?**

This PR re-adds the spec that validates that the at_fork hook in
crashtracking correctly picks up the latest instance of the
component.

**Motivation:**

Since this is quite a weird corner case, I think it's important to
have a spec to cover it.

**Additional Notes:**

This spec was temporarily removed in #3970 with the intention of
re-adding it in #3983 but we kinda forgot about it.

**How to test the change?**

Check this test passes as-is, and fails if you change the
`update_on_fork` call to use `self`, for instance.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants