Add Linux support bundle script #208

Merged
merged 7 commits into from
Apr 1, 2021
3 changes: 0 additions & 3 deletions README.md
@@ -149,9 +149,6 @@ For security information specific to this distribution, please review

## Troubleshooting

Start by reviewing the [Collector troubleshooting
documentation](https://github.com/open-telemetry/opentelemetry-collector/blob/master/docs/troubleshooting.md).

For troubleshooting information specific to this distribution, please review
[troubleshooting.md](docs/troubleshooting.md).

61 changes: 55 additions & 6 deletions docs/troubleshooting.md
@@ -1,19 +1,63 @@
# Troubleshooting

Start by reviewing the [OpenTelemetry Collector troubleshooting
documentation](https://github.com/open-telemetry/opentelemetry-collector/blob/master/docs/troubleshooting.md).

## Gathering Support Information

If you are unable to determine why something is not working, you can [email
support](mailto:signalfx-support@splunk.com). When opening a support request,
it is important to include as much information about the issue as possible,
including:

- What did you try to do?
- What happened?
- What did you expect to happen?
- Have you found any workaround?
- How impactful is the issue?
- How can we reproduce the issue?

End-to-end architecture information is also helpful, including:

- What is generating the data?
- Where was the data configured to go to?
- What format was the data sent in?
- How is the next hop configured?
- Where is the data configured to go from here?
- What format was the data sent in?
- Any DNS/firewall/networking/proxy information to be aware of?

In addition, it is important to gather support information including:

- Configuration file
  - Kubernetes: `kubectl get configmap my-configmap -o yaml >my-configmap.yaml`
  - Linux: `/etc/otel/collector`
- Logs and ideally debug logs
  - Docker: `docker logs my-container >my-container.log`
  - Journald: `journalctl -u my-service >my-service.log`
  - Kubernetes: `kubectl logs my-pod >my-pod.log`

Support bundle scripts are provided to make it easier to collect information:

- Linux (if installer script was used): `/etc/otel/collector/splunk-support-bundle.sh`

## Linux Installer

If either the splunk-otel-collector or td-agent services are not properly
installed and configured:

- Ensure the OS [is supported](getting-started/linux-installer.md#linux-installer-script)
- Ensure the OS has systemd installed
- Ensure not running in a containerized environment (for non-production
environments see [this
post](https://developers.redhat.com/blog/2014/05/05/running-systemd-within-docker-container/)
for a workaround)
- Check installation logs for more details

## HTTP Error Codes

- 401 (UNAUTHORIZED): Configured access token or realm is incorrect
- 404 (NOT FOUND): A configuration parameter such as the endpoint or path
(e.g. /v1/log) is likely wrong; possible network/firewall/port issue
- 429 (TOO MANY REQUESTS): Org is not provisioned for the amount of traffic
being sent; reduce traffic or request increase in capacity
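
For quick triage, the codes listed above can be turned into a small lookup. The following is a minimal sketch and not part of this distribution: `http_hint` is a hypothetical helper name, and the hint strings only paraphrase the list above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not shipped with the collector):
# map an HTTP status code to the troubleshooting hint from the list above.
http_hint() {
    case "$1" in
        401) echo "Configured access token or realm is incorrect" ;;
        404) echo "Endpoint or path is likely wrong; possible network/firewall/port issue" ;;
        429) echo "Org is not provisioned for this traffic; reduce traffic or request more capacity" ;;
        *)   echo "No hint for status $1; check the collector logs" ;;
    esac
}

# The status could be obtained with curl's --write-out option, for example:
#   status=$(curl -s -o /dev/null -w '%{http_code}' "$URL")
http_hint 401
```

Codes that are not covered fall through to the generic message, so the helper stays safe to call with any status.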
@@ -32,6 +76,7 @@ $ journalctl -u my-service.service -f
### Is Fluentd configured properly?

- Is td-agent running? (`systemctl status td-agent`)
- If you changed the configuration, did you restart Fluentd? (`systemctl restart td-agent`)
- Check `fluentd.conf` and `conf.d/*`; ensure `@label @SPLUNK` is added to
every source, otherwise logs are not collected!
- Manual configuration may be required to collect logs off the source. Add
@@ -41,15 +86,19 @@ $ journalctl -u my-service.service -f
- While every attempt is made to properly configure permissions, it is
possible td-agent does not have the permission required to collect logs.
Debug logging should indicate this issue.
- It is possible the `<parser>` section configuration does not match the log events.
- This means things are working (requires debug logging enabled): `2021-03-17
02:14:44 +0000 [debug]: #0 connect new socket`
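
The `@label @SPLUNK` check in the list above can be partly automated. The following is a rough sketch under stated assumptions: each `<source>` block opens and closes on its own line, commented-out lines are not ignored, and `check_labels` is a hypothetical helper name rather than anything shipped with td-agent.

```shell
#!/usr/bin/env bash
# Sketch: report <source> blocks in a Fluentd config file that do not
# contain "@label @SPLUNK". Exits non-zero if any such block is found.
check_labels() {
    awk '
        /^[[:space:]]*<source>/ { in_src = 1; has_label = 0; start = NR; next }
        in_src && /@label[[:space:]]+@SPLUNK/ { has_label = 1 }
        /^[[:space:]]*<\/source>/ {
            if (in_src && !has_label) {
                printf "%s:%d: source without @label @SPLUNK\n", FILENAME, start
                bad = 1
            }
            in_src = 0
        }
        END { exit bad }
    ' "$1"
}

# Demo on a throwaway file: the second block lacks the label.
demo=$(mktemp)
cat > "$demo" <<'EOF'
<source>
  @type tail
  @label @SPLUNK
</source>
<source>
  @type tail
</source>
EOF
check_labels "$demo" || echo "WARN: at least one source is missing @label @SPLUNK"
rm -f "$demo"
```

On a real host the function would be pointed at the Fluentd configuration files, one file per call.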

### Is OTelCol configured properly?

- Check
[zpages](https://github.com/open-telemetry/opentelemetry-collector/blob/main/extension/zpagesextension)
for samples (`http://localhost:55679/debug/tracez`); may require `endpoint`
configuration
- Enable [logging
exporter](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/loggingexporter)
and check logs (`journalctl -u splunk-otel-collector.service -f`)
- Review the [Collector troubleshooting
documentation](https://github.com/open-telemetry/opentelemetry-collector/blob/master/docs/troubleshooting.md).
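
Before opening zpages in a browser, it can help to confirm that the port is listening at all. The following is a minimal sketch, assuming bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command. `port_open` is a hypothetical helper name; 55679 and 8888 are the zpages and metrics ports used elsewhere in this change.

```shell
#!/usr/bin/env bash
# Sketch: succeed if host:port accepts a TCP connection within one second.
# Host and port are passed as positional arguments to the inner bash so
# they are not interpolated into the command string.
port_open() {
    timeout 1 bash -c 'cat < /dev/null > "/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

for port in 55679 8888; do
    if port_open localhost "$port"; then
        echo "INFO: localhost:$port is reachable"
    else
        echo "WARN: localhost:$port is not reachable"
    fi
done
```

A closed port usually fails immediately with a refused connection; the one-second timeout only matters when packets are silently dropped.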

@@ -0,0 +1,251 @@
#!/usr/bin/env bash

# Copyright Splunk Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#######################################
# Globals
#######################################
CONFDIR="/etc/otel/collector" # Default configuration directory
DIRECTORY= # Either passed as CLI parameter or later set to CONFDIR
TMPDIR="/tmp/splunk-support-bundle-$(date +%s)" # Unique temporary directory for support bundle contents
Review comment (Contributor):

I wonder if it's preferable to use mktemp for staging and then produce the support tarball in the directory where the command is invoked by default w/ configurable --output.

Reply (Contributor Author):

I thought about this but feared compatibility across systems (not POSIX to my knowledge). Given installer script relies on /tmp already this felt "safe", but open to change this. @jchengsfx thoughts here?
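
One way to reconcile the two positions is to prefer `mktemp` when it is available and fall back to the timestamp-based name otherwise. The following is only a sketch of that idea, not what the PR implements: `make_staging_dir` is a hypothetical function name, and the `/tmp` default mirrors the script's existing choice.

```shell
#!/usr/bin/env bash
# Sketch: create a staging directory under a base directory (default /tmp)
# and print its path. Uses mktemp -d when present, otherwise falls back to
# the timestamp naming scheme used by the script in this PR.
make_staging_dir() {
    local base="${1:-/tmp}"
    if command -v mktemp >/dev/null 2>&1; then
        mktemp -d "$base/splunk-support-bundle-XXXXXX"
    else
        local dir
        dir="$base/splunk-support-bundle-$(date +%s)"
        mkdir "$dir" && echo "$dir"
    fi
}

STAGING=$(make_staging_dir) || exit 1
echo "INFO: staging in $STAGING"
rmdir "$STAGING" # demo cleanup only; the real script deliberately keeps its directory
```

With `mktemp` the name is unique even when the script is run more than once per second, which the timestamp scheme cannot guarantee.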


usage() {
    echo "USAGE: $0 [-h] [-d directory] [-t tmpdir]"
    echo "  -d directory where Splunk OpenTelemetry Connector configuration is located"
    echo "     (if not specified, defaults to /etc/otel/collector)"
    echo "  -t temporary directory for support bundle contents (primarily for testing)"
    echo "  -h display help"
    exit 1
}

#######################################
# Parse command line arguments
#######################################
while [[ $# -gt 0 ]]
do
    key="$1"
    case $key in
        -d|--directory)
            DIRECTORY="$2"
            shift # past argument
            shift # past value
            ;;
        -t|--tmpdir)
            TMP="$2"
            shift # past argument
            shift # past value
            ;;
        -h|--help)
            usage
            ;;
        *) # unknown option
            POSITIONAL+=("$1") # save it in an array for later
            shift # past argument
            ;;
    esac
done

#######################################
# Creates a unique temporary directory to store the contents of the support
# bundle. Do not attempt to cleanup to prevent any accidental deletions.
# This command can only be run once per second or will error out.
# This script could result in a lot of temporary data if run multiple times.
# - GLOBALS: TMPDIR
# - ARGUMENTS: None
# - OUTPUTS: None
# - RETURN: 0 if successful, non-zero on error.
#######################################
createTempDir() {
    echo "INFO: Creating temporary directory..."
    # Override primarily for testing
    if [ -n "$TMP" ]; then TMPDIR="$TMP"; fi
    if [ -d "$TMPDIR" ]; then
        echo "ERROR: TMPDIR ($TMPDIR) exists. Exiting."
        exit 1
    else
        mkdir "$TMPDIR"
        for d in logs metrics zpages; do
            mkdir "$TMPDIR/$d"
        done
    fi
}

#######################################
# Check whether commands exist
# If it doesn't the command output will not be captured
#######################################
checkCommands() {
    echo "INFO: Checking for commands..."
    for EXE in systemctl journalctl curl wget pgrep; do
        if ! command -v "$EXE" &> /dev/null; then
            echo "WARN: $EXE could not be found."
            echo "      Please install to capture full support bundle."
        fi
    done
}

#######################################
# Gather configuration
# Without this it is very hard to troubleshoot issues so exit if no permissions.
# - GLOBALS: CONFDIR, DIRECTORY, TMPDIR
# - ARGUMENTS: None
# - OUTPUTS: None
# - RETURN: 0 if successful, non-zero on error.
#######################################
getConfig() {
    echo "INFO: Getting configuration..."
    # Directory can be passed via CLI parameters
    if [ -z "$DIRECTORY" ]; then
        DIRECTORY=$CONFDIR
    fi
    # If directory does not exist the support bundle is useless so exit
    if [ ! -d "$DIRECTORY" ]; then
        echo "ERROR: Could not find directory ($DIRECTORY)."
        usage
    fi
    # Need to ensure user has permission to access
    if test -r "$DIRECTORY"; then
        cp -r "$DIRECTORY" "$TMPDIR"/config 2>&1
    else
        echo "ERROR: Permission denied to directory ($DIRECTORY)."
        echo "       Run this script with a user who has permissions to this directory."
        exit 1
    fi
}

#######################################
# Gather status
# - GLOBALS: TMPDIR
# - ARGUMENTS: None
# - OUTPUTS: None
# - RETURN: 0
#######################################
getStatus() {
    echo "INFO: Getting status..."
    systemctl status splunk-otel-collector >"$TMPDIR"/logs/splunk-otel-collector.txt 2>&1
    systemctl status td-agent >"$TMPDIR"/logs/td-agent.txt 2>&1
}

#######################################
# Gather logs
# - GLOBALS: TMPDIR
# - ARGUMENTS: None
# - OUTPUTS: None
# - RETURN: 0
#######################################
getLogs() {
    echo "INFO: Getting logs..."
    journalctl -u splunk-otel-collector >"$TMPDIR"/logs/splunk-otel-collector.log 2>&1
    journalctl -u td-agent >"$TMPDIR"/logs/td-agent.log 2>&1
    LOGDIR="/var/log/td-agent"
    if test -r "$LOGDIR"; then
        # Copy from LOGDIR so the check and the copy always use the same path
        cp -r "$LOGDIR" "$TMPDIR"/logs/td-agent/ 2>&1
    else
        echo "WARN: Permission denied to directory ($LOGDIR)."
    fi
}

#######################################
# Gather metrics
# - GLOBALS: TMPDIR
# - ARGUMENTS: None
# - OUTPUTS: None
# - RETURN: 0
#######################################
getMetrics() {
    echo "INFO: Getting metric information..."
    # It's possible user has disabled prometheus receiver in metrics pipeline
    if timeout 1 bash -c 'cat < /dev/null > /dev/tcp/localhost/8888'; then
        curl -s http://localhost:8888/metrics >"$TMPDIR"/metrics/collector-metrics.txt 2>&1
    else
        echo "WARN: localhost:8888/metrics unavailable so metrics not collected"
    fi
}

#######################################
# Gather zpages
# - GLOBALS: TMPDIR
# - ARGUMENTS: None
# - OUTPUTS: None
# - RETURN: 0
#######################################
getZpages() {
    echo "INFO: Getting zpages information..."
    # It's possible user has disabled zpages extension
    if timeout 1 bash -c 'cat < /dev/null > /dev/tcp/localhost/55679'; then
        curl -s http://localhost:55679/debug/tracez >"$TMPDIR"/zpages/tracez.html 2>&1
        # Recursively get pages to see output of samples
        wget -q -r -np -l 1 -P "$TMPDIR/zpages" http://localhost:55679/debug/tracez
    else
        echo "WARN: localhost:55679 unavailable so zpages not collected"
    fi
}

#######################################
# Gather Linux information
# - GLOBALS: TMPDIR
# - ARGUMENTS: None
# - OUTPUTS: None
# - RETURN: 0
#######################################
getHostInfo() {
    echo "INFO: Getting host information..."
    # Filter top to only collect Splunk-specific processes
    PIDS=$(pgrep -d "," 'otelcol|fluentd' 2>&1)
    if [ -n "$PIDS" ]; then
        top -b -n 3 -p "$PIDS" >"$TMPDIR"/metrics/top.txt 2>&1
    else
        echo "WARN: Unable to find otelcol or fluentd PIDs"
        echo "      top will not be collected"
    fi
    df -h >"$TMPDIR"/metrics/df.txt 2>&1
    free >"$TMPDIR"/metrics/free.txt 2>&1
}

#######################################
# Tar support bundle
# - GLOBALS: TMPDIR
# - ARGUMENTS: None
# - OUTPUTS: None
# - RETURN: 0 if successful, non-zero on error
#######################################
tarResults() {
    echo "INFO: Creating tarball..."
    tar cfz "/tmp/$(basename "$TMPDIR").tar.gz" -P "$TMPDIR" 2>&1
    if [ -f "/tmp/$(basename "$TMPDIR").tar.gz" ]; then
        echo "INFO: Support bundle available at: /tmp/$(basename "$TMPDIR").tar.gz"
        echo "      Please attach this to your support case"
        exit 0
    else
        echo "ERROR: Support bundle was not properly created."
        echo "       See $TMPDIR/stdout.log for more information."
        exit 1
    fi
}

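main() {
    checkCommands
    getConfig
    getStatus
    getLogs
    getMetrics
    getZpages
    getHostInfo
    tarResults
}

# Create the temporary directory in the current shell first: this applies the
# -t override to TMPDIR and ensures the directory exists before tee opens
# its log file there.
createTempDir
# Attempt to generate a support bundle
# Capture all output
main 2>&1 | tee -a "$TMPDIR"/stdout.log
# Exit with the status of main rather than the status of tee
exit "${PIPESTATUS[0]}"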
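
A note on piping `main` through `tee`: by default a pipeline's exit status is that of its last command, so an `exit 1` inside `main` would be masked by a successful `tee`. The following is a minimal sketch of one way to preserve the status using bash's `PIPESTATUS` array. `run_logged` is a hypothetical helper that does not appear in the script.

```shell
#!/usr/bin/env bash
# Sketch: run a command, append its combined output to a log file via tee,
# and return the command's own exit status instead of tee's.
run_logged() {
    local log="$1"
    shift
    "$@" 2>&1 | tee -a "$log"
    return "${PIPESTATUS[0]}"
}

log=$(mktemp)
run_logged "$log" sh -c 'echo "ERROR: something failed"; exit 3'
echo "exit status: $?" # prints "exit status: 3"
rm -f "$log"
```

`set -o pipefail` is an alternative, but it reports the last failing command in the pipeline rather than specifically the first one.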
15 changes: 15 additions & 0 deletions internal/buildscripts/packaging/tests/installer_test.py
@@ -81,6 +81,21 @@ def test_installer(distro, version, memory_option):
else:
assert container.exec_run("systemctl status td-agent").exit_code != 0

# test support bundle script
assert container.exec_run("/etc/otel/collector/splunk-support-bundle.sh -t /tmp/splunk-support-bundle").exit_code == 0
assert container.exec_run("test -f /tmp/splunk-support-bundle/config/splunk_config_linux.yaml").exit_code == 0
assert container.exec_run("test -f /tmp/splunk-support-bundle/logs/splunk-otel-collector.log").exit_code == 0
assert container.exec_run("test -f /tmp/splunk-support-bundle/logs/splunk-otel-collector.txt").exit_code == 0
if container.exec_run("test -f /etc/otel/collector/fluentd/fluent.conf").exit_code == 0:
assert container.exec_run("test -f /tmp/splunk-support-bundle/logs/td-agent.log").exit_code == 0
assert container.exec_run("test -f /tmp/splunk-support-bundle/logs/td-agent.txt").exit_code == 0
assert container.exec_run("test -f /tmp/splunk-support-bundle/metrics/collector-metrics.txt").exit_code == 0
assert container.exec_run("test -f /tmp/splunk-support-bundle/metrics/df.txt").exit_code == 0
assert container.exec_run("test -f /tmp/splunk-support-bundle/metrics/free.txt").exit_code == 0
assert container.exec_run("test -f /tmp/splunk-support-bundle/metrics/top.txt").exit_code == 0
assert container.exec_run("test -f /tmp/splunk-support-bundle/zpages/tracez.html").exit_code == 0
assert container.exec_run("test -f /tmp/splunk-support-bundle.tar.gz").exit_code == 0

run_container_cmd(container, "sh -x /test/install.sh --uninstall")
finally:
run_container_cmd(container, "journalctl -u td-agent --no-pager")