[exporter/splunkhec] enabled http2 healthcheck #29717

Merged 1 commit on Dec 11, 2023

@dloucasfx commented Dec 8, 2023

Description:
Same description as in open-telemetry/opentelemetry-collector#9022.

This PR enables the HTTP/2 health check to work around the issue described in open-telemetry/opentelemetry-collector#9022.

Why 10 seconds for HTTP2ReadIdleTimeout and 10 seconds for HTTP2PingTimeout:
These values have been tested in production. In an active environment (with the default HTTP timeout of 10 seconds and default retry settings), they result in a single export failure, or two at most, before the health check detects the corrupted TCP connection and closes it.
The only drawback is that if a connection goes unused for more than 10 seconds, we may send unnecessary ping frames. This should not be a problem; if it becomes one, we can tune these settings.
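
One way to read the "one failure, two at most" figure is as simple arithmetic on the two timeouts. The sketch below is an illustration, not the author's stated derivation: it assumes the dead connection is closed at most HTTP2ReadIdleTimeout + HTTP2PingTimeout seconds after the last frame was received, and it ignores retry backoff. The helper name `max_failed_exports` is hypothetical.

```shell
# Sketch: upper bound on back-to-back export attempts that can time out
# before the HTTP/2 health check closes a dead connection.
# Assumption: the connection is closed at most read_idle + ping_timeout
# seconds after the last received frame; retry backoff is ignored.
max_failed_exports() (
  read_idle=$1      # HTTP2ReadIdleTimeout, in seconds
  ping_timeout=$2   # HTTP2PingTimeout, in seconds
  http_timeout=$3   # per-export HTTP timeout, in seconds
  window=$((read_idle + ping_timeout))
  # Ceiling division: how many timed-out attempts fit in the window.
  echo $(( (window + http_timeout - 1) / http_timeout ))
)

max_failed_exports 10 10 10   # prints 2
```

With the defaults quoted above (10, 10, 10) the bound is 2, which lines up with the behavior reported from production.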

The SFX exporter has multiple HTTP clients:

  • Metric client, Trace client and Event client. These clients will have the HTTP/2 health check enabled by default, as they share the same default config.
  • Correlation client and Dimension client will NOT have the HTTP/2 health check enabled. We can revisit this if needed.

Link to tracking Issue:

Testing:

  • Run OTEL with one of the exporters that uses an HTTP/2 client, for example the signalfx exporter
  • For simplicity, use a single pipeline/exporter
  • In a different shell, run this to watch the TCP state of the established connection:
    while true; do date; sudo netstat -anp | grep -E '<endpoint_ip_address(es)>' | sort -k 5; sleep 2; done
  • From the netstat output, note the source port and the source IP address
  • Drop outgoing packets on that connection, replacing <source_IP> and <source_Port> with the values from the previous step:
    sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP
  • Note how the OTEL exporter's exports start timing out
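
The DROP rule stays in place after the test, so it is worth removing once you are done. The sketch below only prints the matching `iptables` commands rather than running them. `print_drop_rule` is a hypothetical helper, and the IP address and port are made-up example values, not taken from the PR.

```shell
# Sketch: print (not run) the iptables command that adds or removes the
# DROP rule used in the test. Review the output, then run it yourself.
print_drop_rule() (
  action=$1     # -A appends the rule, -D deletes the identical rule
  src_ip=$2     # source IP noted from netstat
  src_port=$3   # source port noted from netstat
  echo "sudo iptables $action OUTPUT -s $src_ip -p tcp --sport $src_port -j DROP"
)

print_drop_rule -A 10.0.0.5 43210   # example values only
print_drop_rule -D 10.0.0.5 43210
```

`iptables -D` with the same rule specification deletes the rule that `-A` appended.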

Expected Result:

  • A new connection should be established, as with HTTP/1, and exports should succeed

Actual Result:

  • The exports keep failing for ~15 minutes, or for however long the OS tcp_retries2 setting allows
  • After ~15 minutes, a new TCP connection is created and exports start working
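
The ~15 minute figure follows from how Linux backs off TCP retransmissions. As a rough sketch, assume the retransmission timeout starts at 200 ms, doubles on each retry, and is capped at 120 s, with tcp_retries2 at its default of 15. Real timeouts depend on the measured round-trip time, so treat the result as an approximate lower bound. The function name is hypothetical.

```shell
# Sketch: approximate time (in ms) Linux keeps retransmitting on a dead
# connection before giving up.
# Assumptions: RTO starts at 200 ms, doubles per retransmission, and is
# capped at 120000 ms; tcp_retries2 defaults to 15.
tcp_dead_connection_timeout_ms() (
  retries=${1:-15}
  rto=200
  total=0
  i=0
  # One initial timeout plus one per retransmission.
  while [ "$i" -le "$retries" ]; do
    total=$((total + rto))
    rto=$((rto * 2))
    if [ "$rto" -gt 120000 ]; then rto=120000; fi
    i=$((i + 1))
  done
  echo "$total"
)

tcp_dead_connection_timeout_ms 15   # prints 924600
```

924600 ms is about 15.4 minutes, consistent with the stall observed above. On a Linux host, `sysctl -n net.ipv4.tcp_retries2` shows the value currently in effect.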

Documentation:
Readme is updated

Disclaimer:
Not all HTTP/2 servers support H2 ping. This should not be a concern here, as our ingest servers do support it.
If your traffic is routed through another server, you can check whether H2 ping is supported using the script in golang/go#60818 (comment).

@dloucasfx force-pushed the splunkhec-exporter-http2 branch from a58e221 to 4e296e5 on December 11, 2023 15:58
Signed-off-by: Dani Louca <dlouca@splunk.com>
@dloucasfx force-pushed the splunkhec-exporter-http2 branch from 4e296e5 to a2217f4 on December 11, 2023 16:36
@codeboten merged commit 26b0610 into open-telemetry:main on Dec 11, 2023
83 checks passed
@github-actions bot added this to the next release milestone on Dec 11, 2023