OADP-774 must-gather: add timeout to velero logs/describe, var typos, remove duplicate logs, add make run #816

Merged 5 commits on Sep 13, 2022
7 changes: 7 additions & 0 deletions must-gather/Makefile
@@ -8,6 +8,13 @@ PROMETHEUS_DUMP_PATH ?= $(shell find ./must-gather.local* -name prom_data.tar.gz

build: docker-build docker-push

run: IMAGE_REGISTRY:=ttl.sh
run: IMAGE_NAME:=oadp/must-gather-$(shell git rev-parse --short HEAD)-$(shell echo $$RANDOM)
run: IMAGE_TAG:=1h
run:
	IMAGE_REGISTRY=$(IMAGE_REGISTRY) IMAGE_NAME=$(IMAGE_NAME) IMAGE_TAG=$(IMAGE_TAG) make build && \
	oc adm must-gather --image ${IMAGE_REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}

docker-build:
	docker build -t ${IMAGE_REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} .

6 changes: 3 additions & 3 deletions must-gather/collection-scripts/logs/gather_logs_backup
@@ -15,11 +15,11 @@ skip_tls=$8
mkdir -p "{object_collection_path}"
echo "[cluster=${cluster}][ns=${ns}] Gathering 'velero backup describe ${backup}'"
if [ "$timeout" = "0s" ]; then
oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "/velero describe backup ${backup} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/backup-describe-${backup}.txt" &
oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "timeout 30s /velero describe backup ${backup} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/backup-describe-${backup}.txt" &
else
oc -n ${ns} exec --request-timeout=${timeout} $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "/velero describe backup ${backup} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/backup-describe-${backup}.txt" &
oc -n ${ns} exec --request-timeout=${timeout} $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "timeout ${timeout} /velero describe backup ${backup} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/backup-describe-${backup}.txt" &
Member:

Do we need --request-timeout=${timeout} here if the timeout is already passed to the command inside the velero container?

Member Author:

In theory, no.

Member Author (@kaovilai, Sep 12, 2022):

After further testing and digging through the docs, --request-timeout turns out to have a very different effect from running timeout inside the velero container.

--request-timeout is the length of time to wait before giving up on a single api-server request.

If the api-server has already responded (the /velero CLI has been executed but has yet to print to stdout), --request-timeout does not kill a stuck velero CLI process, and must-gather still gets stuck.

So I propose we keep both for the case where $timeout is defined.
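
The difference described in this comment can be seen locally with a minimal sketch. It assumes the GNU coreutils `timeout` command is available, and it uses `sleep 5` as a stand-in for a velero CLI call that never returns; neither the stand-in nor the one-second limit comes from the PR.

```shell
#!/usr/bin/env bash
# Sketch only: `sleep 5` stands in for a stuck velero CLI process.
# `timeout` terminates the wrapped command once the limit is reached and
# then exits with status 124. A limit of 1 is one second (same as 1s).
status=0
timeout 1 sleep 5 || status=$?

if [ "${status}" -eq 124 ]; then
  echo "command was terminated at the time limit (status ${status})"
else
  echo "command finished on its own (status ${status})"
fi
```

A request-level limit on the `oc exec` call cannot produce this result, because by then the request has already been accepted and the process inside the container is the part that hangs.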

fi
echo "[cluster=${cluster}][ns=${ns}] Gathering 'velero backup logs ${backup}'"
oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "/velero backup logs ${backup} --insecure-skip-tls-verify=${skip_tls} --timeout=30s" &> "${object_collection_path}/backup-${backup}.log" &
oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "timeout 30s /velero backup logs ${backup} --insecure-skip-tls-verify=${skip_tls} --timeout=30s" &> "${object_collection_path}/backup-${backup}.log" &

wait
9 changes: 1 addition & 8 deletions must-gather/collection-scripts/logs/gather_logs_pvb
@@ -13,12 +13,5 @@ object_collection_path=$6
node=$(oc get podvolumebackup $pvb --namespace $ns -o jsonpath='{.spec.node}')
mkdir -p ${object_collection_path}
oc describe podvolumebackup ${pvb} --namespace ${ns} &> "${object_collection_path}/pvb-describe-${pvb}.txt" &
for pod in $(oc get pods -o wide --field-selector spec.nodeName=${node} --selector name=restic --no-headers --namespace $ns | awk '{print $1}'); do
echo "[cluster=${cluster}][ns=${ns}][pod=${pod}] Collecting Pod logs..."
oc logs --all-containers --namespace ${ns} ${pod} --since ${logs_since} &> "${object_collection_path}/current.log" &
echo "[cluster=${cluster}][ns=${ns}][pod=${pod}] Collecting previous Pod logs..."
oc logs --previous --all-containers --namespace ${ns} ${pod} --since ${logs_since} &> "${object_collection_path}/previous.log" &
pwait $max_parallelism
done

# logs covered by restic pod logs in gather_logs_pods
wait
8 changes: 2 additions & 6 deletions must-gather/collection-scripts/logs/gather_logs_pvr
@@ -9,12 +9,8 @@ max_parallelism=$4
pvr=$5
object_collection_path=$6

# Gather PVR describe and logs
# Gather PVR describe
mkdir -p ${object_collection_path}
oc describe podvolumerestores.velero.io ${pvr} --namespace ${ns} &> "${object_collection_path}/pvr-describe-${pvr}.txt"
echo "[cluster=${cluster}][ns=${ns}][pod=${pod}] Collecting Pod logs..."
oc logs --all-containers --namespace ${ns} ${pod} --since ${logs_since} &> "${object_collection_path}/current.log" &
echo "[cluster=${cluster}][ns=${ns}][pod=${pod}] Collecting previous Pod logs..."
Member Author:

This is where the empty [pod=] logs came from.
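
A minimal sketch of why the removed lines printed an empty `[pod=]`: the script never assigns `pod`, and an unset variable expands to an empty string. The namespace value below is a made-up example, and the `set -u` check is an illustration of one way to catch this, not something the PR adds.

```shell
#!/usr/bin/env bash
# Sketch only: `pod` is deliberately left unset, as in the removed lines.
# The namespace value is a hypothetical example.
unset pod
line="[ns=openshift-adp][pod=${pod}] Collecting Pod logs..."
echo "${line}"

# With `set -u` (nounset) the same expansion is an error rather than an
# empty string. It runs in a subshell so the failure stays contained.
if (set -u; : "${pod}") 2>/dev/null; then
  nounset_result="expanded"
else
  nounset_result="rejected"
fi
echo "with set -u: ${nounset_result}"
```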

oc logs --previous --all-containers --namespace ${ns} ${pod} --since ${logs_since} &> "${object_collection_path}/previous.log" &

# logs covered by restic pod logs in gather_logs_pods
wait
8 changes: 4 additions & 4 deletions must-gather/collection-scripts/logs/gather_logs_restore
@@ -12,14 +12,14 @@ timeout=$7
skip_tls=$8

# Gather restore describe and logs
mkdir -p "{object_collection_path}"
mkdir -p "${object_collection_path}"
echo "[cluster=${cluster}][ns=${ns}] Gathering 'velero restore describe ${restore}'"
if [ "$timeout" = "0s" ]; then
oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "/velero describe restore ${restore} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/restore-describe-${restore}.txt" &
oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "timeout 30s /velero describe restore ${restore} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/restore-describe-${restore}.txt" &
else
oc -n ${ns} exec --request-timeout=${timeout} $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "/velero describe restore ${restore} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/restore-describe-${restore}.txt" &
oc -n ${ns} exec --request-timeout=${timeout} $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "timeout ${timeout} /velero describe restore ${restore} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/restore-describe-${restore}.txt" &
fi
echo "[cluster=${cluster}][ns=${ns}] Gathering 'velero restore logs ${restore}'"
oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "/velero restore logs ${restore} --insecure-skip-tls-verify=${skip_tls} --timeout=30s" &> "${object_collection_path}/restore-${restore}.log" &
oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "timeout 30s /velero restore logs ${restore} --insecure-skip-tls-verify=${skip_tls} --timeout=30s" &> "${object_collection_path}/restore-${restore}.log" &
Member Author:

Here we force velero CLI commands that involve downloadrequest.Stream to time out, which resolves must-gather getting stuck when it queries a nonexistent backup storage location.
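
The describe branches in this file pick the in-container limit in two places: 30s when `$timeout` is `0s`, otherwise `$timeout` itself. As a rough sketch, that choice could be expressed once in a helper. `effective_timeout` and `example-restore` are hypothetical names, not part of the PR, and the reading of `0s` as "no limit requested" is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: choose the limit passed to the
# in-container `timeout` command from the script's timeout argument.
effective_timeout() {
  if [ "$1" = "0s" ]; then
    # Assumption: "0s" means no limit was requested, so fall back to the
    # 30s value that the diff hard-codes for this case.
    echo "30s"
  else
    echo "$1"
  fi
}

# Print the command prefix each case would produce.
echo "timeout $(effective_timeout 0s) /velero describe restore example-restore"
echo "timeout $(effective_timeout 2m) /velero describe restore example-restore"
```

The fallback matters because GNU `timeout` treats a duration of 0 as "no timeout", so passing `0s` straight through would leave a stuck process running.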

wait