[exporter/splunkhec] enabled http2 healthcheck (#29717)
Same description as in
open-telemetry/opentelemetry-collector#9022

This PR enables the HTTP/2 health check to work around the issue described
in open-telemetry/opentelemetry-collector#9022

I chose 10 seconds for both `HTTP2ReadIdleTimeout` and `HTTP2PingTimeout`
because those values have been tested in production. In an active
environment (with the default HTTP timeout of 10 seconds and default retry
settings), they result in a single export failure, or two at most, before
the health check detects the corrupted TCP connection and closes it.
The only drawback is that if the connection has not been used for over
10 seconds, we might send unnecessary ping frames. This should not be an
issue; if it becomes one, we can tune those settings.

The SFX exporter has multiple HTTP clients:
- Metric client, Trace client and Event client. Those clients will have
the HTTP/2 health check enabled by default as they share the same default
config
- Correlation client and Dimension client will NOT have the HTTP/2 health
check enabled. We can revisit this if needed.

**Link to tracking Issue:** <Issue number if applicable>

**Testing:** <Describe what testing was performed and which tests were
added.>
- Run OTEL with one of the exporters that uses an HTTP/2 client, for example
the `signalfx` exporter
- For simplicity, use a single pipeline/exporter
- In a different shell, run this to watch the TCP state of the
established connection
```
while true; do date; sudo netstat -anp | grep -E '<endpoint_ip_address(es)>' | sort -k 5; sleep 2; done
```
- From the netstat output, note the source port and the source IP address
- Replace the `<>` placeholders below with the values from the previous step, then run
`sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP`
- Note how the OTEL exporter's exports start timing out

Expected Result:
- A new connection should be established, as with HTTP/1, and exports should succeed

Actual Result:
- The exports keep failing for ~15 minutes, or for however long the OS `tcp_retries2` setting allows
- After 15 minutes, a new tcp connection is created and exports start working

**Documentation:** <Describe the documentation added.>
Readme is updated

Signed-off-by: Dani Louca <dlouca@splunk.com>
dloucasfx authored Dec 11, 2023
1 parent 40ddee9 commit 26b0610
Showing 4 changed files with 48 additions and 11 deletions.
27 changes: 27 additions & 0 deletions .chloggen/splunkhec-exporter-http2.yaml
@@ -0,0 +1,27 @@
+# Use this changelog template to create an entry for release notes.
+
+# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
+change_type: enhancement
+
+# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
+component: splunkhecexporter
+
+# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
+note: Enable HTTP/2 health check by default
+
+# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
+issues: [29717]
+
+# (Optional) One or more lines of additional information to render under the primary note.
+# These lines will be padded with 2 spaces and then inserted directly into the document.
+# Use pipe (|) for multiline entries.
+subtext:
+
+# If your change doesn't affect end users or the exported elements of any package,
+# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
+# Optional: The change log or logs in which this entry should be included.
+# e.g. '[user]' or '[user, api]'
+# Include 'user' if the change is relevant to end users.
+# Include 'api' if there is a change to a library API.
+# Default: '[user]'
+change_logs: []
4 changes: 4 additions & 0 deletions exporter/splunkhecexporter/README.md
@@ -29,6 +29,10 @@ The following configuration options can also be configured:
- `use_multi_metric_format` (default: false): Combines metrics with the same metadata to reduce ingest using the [multiple-metric JSON format](https://docs.splunk.com/Documentation/Splunk/9.0.0/Metrics/GetMetricsInOther#The_multiple-metric_JSON_format). Applicable in the `metrics` pipeline only.
- `disable_compression` (default: false): Whether to disable gzip compression over HTTP.
- `timeout` (default: 10s): HTTP timeout when sending data.
+- `http2_read_idle_timeout` (default = 10s): Send a ping frame for a health check if the connection has been idle for the configured value.
+0s means http/2 health check will be disabled.
+- `http2_ping_timeout` (default = 10s): Triggered by `http2_read_idle_timeout`; When there's no response to the ping within the configured value,
+the connection will be closed. If this value is set to 0, it will default to 15s.
- `insecure_skip_verify` (default: false): Whether to skip checking the certificate of the HEC endpoint when sending data over HTTPS.
- `ca_file` (no default) Path to the CA cert to verify the server being connected to.
- `cert_file` (no default) Path to the TLS cert to use for client connections when TLS client auth is required.
8 changes: 5 additions & 3 deletions exporter/splunkhecexporter/config_test.go
@@ -70,9 +70,11 @@ func TestLoadConfig(t *testing.T) {
},
InsecureSkipVerify: false,
},
-MaxIdleConns:        &hundred,
-MaxIdleConnsPerHost: &hundred,
-IdleConnTimeout:     &idleConnTimeout,
+MaxIdleConns:         &hundred,
+MaxIdleConnsPerHost:  &hundred,
+IdleConnTimeout:      &idleConnTimeout,
+HTTP2ReadIdleTimeout: 10 * time.Second,
+HTTP2PingTimeout:     10 * time.Second,
},
RetrySettings: exporterhelper.RetrySettings{
Enabled: true,
20 changes: 12 additions & 8 deletions exporter/splunkhecexporter/factory.go
@@ -20,10 +20,12 @@ import (
)

const (
-defaultMaxIdleCons     = 100
-defaultHTTPTimeout     = 10 * time.Second
-defaultIdleConnTimeout = 10 * time.Second
-defaultSplunkAppName   = "OpenTelemetry Collector Contrib"
+defaultMaxIdleCons          = 100
+defaultHTTPTimeout          = 10 * time.Second
+defaultHTTP2ReadIdleTimeout = time.Second * 10
+defaultHTTP2PingTimeout     = time.Second * 10
+defaultIdleConnTimeout      = 10 * time.Second
+defaultSplunkAppName        = "OpenTelemetry Collector Contrib"
)

// TODO: Find a place for this to be shared.
@@ -55,10 +57,12 @@ func createDefaultConfig() component.Config {
LogDataEnabled: true,
ProfilingDataEnabled: true,
HTTPClientSettings: confighttp.HTTPClientSettings{
-Timeout:             defaultHTTPTimeout,
-IdleConnTimeout:     &defaultIdleConnTimeout,
-MaxIdleConnsPerHost: &defaultMaxConns,
-MaxIdleConns:        &defaultMaxConns,
+Timeout:              defaultHTTPTimeout,
+IdleConnTimeout:      &defaultIdleConnTimeout,
+MaxIdleConnsPerHost:  &defaultMaxConns,
+MaxIdleConns:         &defaultMaxConns,
+HTTP2ReadIdleTimeout: defaultHTTP2ReadIdleTimeout,
+HTTP2PingTimeout:     defaultHTTP2PingTimeout,
},
SplunkAppName: defaultSplunkAppName,
RetrySettings: exporterhelper.NewDefaultRetrySettings(),
