OADP-774 must-gather: add timeout to velero logs/describe, var typos, remove duplicate logs, add `make run` #816

kaovilai · 2022-09-08T19:46:15Z

OADP-774

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

savitharaghunathan

/lgtm

kaovilai · 2022-09-08T20:35:56Z

must-gather/collection-scripts/logs/gather_logs_pvr

 mkdir -p ${object_collection_path}
 oc describe podvolumerestores.velero.io ${pvr} --namespace ${ns} &> "${object_collection_path}/pvr-describe-${pvr}.txt"
-echo "[cluster=${cluster}][ns=${ns}][pod=${pod}] Collecting Pod logs..."
-oc logs --all-containers --namespace ${ns} ${pod} --since ${logs_since} &> "${object_collection_path}/current.log" &
-echo "[cluster=${cluster}][ns=${ns}][pod=${pod}] Collecting previous Pod logs..."


This is where the empty [pod=] logs came from

savitharaghunathan · 2022-09-09T12:06:46Z

/retest-required

kaovilai · 2022-09-09T15:39:41Z

/retest-required

kaovilai · 2022-09-09T16:12:51Z

/hold This did not fix the issue we were trying to fix. must-gather still gets stuck at "Collecting volumesnapshotlocations".

openshift-ci · 2022-09-09T17:44:26Z

New changes are detected. LGTM label has been removed.

kaovilai · 2022-09-09T20:00:24Z

everything seems to work

kaovilai · 2022-09-09T20:02:16Z

must-gather/collection-scripts/logs/gather_logs_restore

 fi
 echo "[cluster=${cluster}][ns=${ns}] Gathering 'velero restore logs ${restore}'"
-oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "/velero restore logs ${restore} --insecure-skip-tls-verify=${skip_tls} --timeout=30s" &> "${object_collection_path}/restore-${restore}.log" &
+oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "timeout 30s /velero restore logs ${restore} --insecure-skip-tls-verify=${skip_tls} --timeout=30s" &> "${object_collection_path}/restore-${restore}.log" &


Here we force velero CLI commands that involve downloadrequest.Stream to time out, which resolves must-gather getting stuck when it queries a nonexistent backup storage location.

Avoids vmware-tanzu/velero#5324
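
To see what the `timeout` wrapper does without a cluster, here is a minimal sketch that uses `sleep` as a stand-in for a velero command that never returns. The function name `run_with_timeout` and the durations are assumptions for illustration only; they are not part of the must-gather scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run any command under coreutils `timeout`.
run_with_timeout() {
  local limit="$1"
  shift
  # `timeout` signals the command after ${limit} and exits with 124
  # if the command was still running at that point.
  timeout "${limit}" "$@"
}

# Assumption: `sleep 5` simulates a velero CLI call that hangs.
run_with_timeout 1s sleep 5
hung_rc=$?

# A command that finishes in time keeps its own exit code.
run_with_timeout 5s true
ok_rc=$?

echo "hung command rc=${hung_rc}, quick command rc=${ok_rc}"
```

The hung command is stopped after one second instead of blocking the rest of the collection, which is the behavior this change relies on inside the velero container.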

kaovilai · 2022-09-09T20:25:21Z

/unhold ready for review

kaovilai · 2022-09-09T20:26:03Z

/retest

savitharaghunathan · 2022-09-09T21:13:57Z

@kaovilai should there be a default timeout value?

savitharaghunathan · 2022-09-09T21:12:59Z

must-gather/collection-scripts/logs/gather_logs_backup

 else
-    oc -n ${ns} exec --request-timeout=${timeout} $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "/velero describe backup ${backup} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/backup-describe-${backup}.txt" &
+    oc -n ${ns} exec --request-timeout=${timeout} $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "timeout ${timeout} /velero describe backup ${backup} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/backup-describe-${backup}.txt" &


Do we need --request-timeout=${timeout} here if timeout is being passed to the command in the velero container?

In theory, no.

From further testing and doc digging: --request-timeout has a very different effect from timeout in the velero container.

--request-timeout is the length of time to wait before giving up on a single api-server request.

Whereas, if the api-server has already responded (the /velero CLI has been executed but has yet to print to stdout), --request-timeout does not kill a stuck velero CLI process, and must-gather still gets stuck.

So I propose we keep both for the case where $timeout is defined.
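
As a rough sketch of the "keep both" proposal, the helper below only prints the `oc` invocation instead of running it, so it can be inspected without a cluster. The function name `build_describe_cmd`, the namespace, pod, and backup names, and the behavior when no timeout is set are all assumptions; `--insecure-skip-tls-verify` is left out for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command line that would be executed.
build_describe_cmd() {
  local ns="$1" pod="$2" backup="$3" limit="$4"
  local inner="/velero describe backup ${backup} --details"
  if [ -n "${limit}" ]; then
    # --request-timeout bounds each api-server request made by oc;
    # `timeout` inside the container bounds the velero process itself.
    printf 'oc -n %s exec --request-timeout=%s %s -- /bin/bash -c "timeout %s %s"\n' \
      "${ns}" "${limit}" "${pod}" "${limit}" "${inner}"
  else
    # Assumption: with no timeout configured, neither limit is applied.
    printf 'oc -n %s exec %s -- /bin/bash -c "%s"\n' "${ns}" "${pod}" "${inner}"
  fi
}

# Placeholder names, not taken from a real cluster.
with_timeout="$(build_describe_cmd openshift-adp velero-pod my-backup 30s)"
without_timeout="$(build_describe_cmd openshift-adp velero-pod my-backup "")"
echo "${with_timeout}"
echo "${without_timeout}"
```

The two limits guard different stages: one covers the request to the api-server, the other covers a velero process that has started but never finishes.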

weshayutin · 2022-09-10T20:59:01Z

/retest

kaovilai · 2022-09-12T14:26:34Z

/retest

kaovilai · 2022-09-12T14:57:29Z

Checking for ways to write something other than an empty file when the timeout is triggered.

kaovilai · 2022-09-12T16:11:02Z

/retest

kaovilai · 2022-09-12T18:29:56Z

/retest

kaovilai · 2022-09-12T20:59:40Z

/retest

weshayutin · 2022-09-12T23:44:59Z

/retest

kaovilai · 2022-09-13T00:44:19Z

@shubham-pampattiwar when the timeout kills the program, the file (backup/restore-log/describe-) will contain output like the following.

Defaulted container "velero" out of: velero, openshift-velero-plugin (init), velero-plugin-for-aws (init), velero-plugin-for-csi (init)
I0913 00:38:32.700413  121602 request.go:665] Waited for 1.188090857s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/k8s.cni.cncf.io/v1?timeout=32s
command terminated with exit code 124

Specifically, 124 is the exit code that timeout returns when it has killed the program. I believe the extra output seen in the file comes from oc exec reporting this exit code.
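
If a more explicit marker than the bare exit code were wanted in the output file, one option is to check for 124 and append a note. This is only a sketch under stated assumptions: `gather_with_note` is a hypothetical wrapper, `sleep` stands in for the stuck velero call, and it assumes the exit status of the wrapped command is visible to the calling script.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command under `timeout`, capture stdout and
# stderr in a file, and append a readable note if the timeout fired.
gather_with_note() {
  local limit="$1" outfile="$2"
  shift 2
  timeout "${limit}" "$@" > "${outfile}" 2>&1
  local rc=$?
  if [ "${rc}" -eq 124 ]; then
    # 124 means `timeout` stopped the command before it finished.
    echo "timed out after ${limit}: $*" >> "${outfile}"
  fi
  return "${rc}"
}

outfile="$(mktemp)"
# Assumption: `sleep 5` simulates a velero command that never returns.
gather_with_note 1s "${outfile}" sleep 5
note_rc=$?
cat "${outfile}"
```

With this, a timed-out gather leaves a file that says what happened rather than an empty one.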

openshift-ci · 2022-09-13T01:34:17Z

@kaovilai: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

shubham-pampattiwar · 2022-09-13T14:40:48Z

@kaovilai LGTM!

kaovilai · 2022-09-13T14:46:20Z

/cherry-pick oadp-1.1

kaovilai · 2022-09-13T14:46:58Z

/cherry-pick oadp-1.0

openshift-cherrypick-robot · 2022-09-13T14:47:03Z

@kaovilai: new pull request created: #821

In response to this:

/cherry-pick oadp-1.1


openshift-cherrypick-robot · 2022-09-13T14:47:37Z

@kaovilai: #816 failed to apply on top of branch "oadp-1.0":

Applying: Typo shell variable
Using index info to reconstruct a base tree...
M	must-gather/collection-scripts/logs/gather_logs_restore
Falling back to patching base and 3-way merge...
Auto-merging must-gather/collection-scripts/logs/gather_logs_restore
Applying: mustgather `make run` target
Applying: do not gather restic logs twice
Applying: force a timeout of velero command execution
Using index info to reconstruct a base tree...
M	must-gather/collection-scripts/logs/gather_logs_backup
M	must-gather/collection-scripts/logs/gather_logs_restore
Falling back to patching base and 3-way merge...
Auto-merging must-gather/collection-scripts/logs/gather_logs_restore
CONFLICT (content): Merge conflict in must-gather/collection-scripts/logs/gather_logs_restore
Auto-merging must-gather/collection-scripts/logs/gather_logs_backup
CONFLICT (content): Merge conflict in must-gather/collection-scripts/logs/gather_logs_backup
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0004 force a timeout of velero command execution
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick oadp-1.0


kaovilai added 3 commits September 8, 2022 13:50

Typo shell variable

4271a6a

mustgather make run target

45956db

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

do not gather restic logs twice

228195c

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

kaovilai changed the title ~~mustgather logs improvements, add make run~~ mustgather logs typo, remove duplicate pod logs, add make run Sep 8, 2022

openshift-ci bot requested review from savitharaghunathan and shubham-pampattiwar September 8, 2022 19:47

savitharaghunathan approved these changes Sep 8, 2022

openshift-ci bot assigned savitharaghunathan Sep 8, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 8, 2022

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2022

force a timeout of velero command execution

cf13ed2

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2022

kaovilai changed the title ~~mustgather logs typo, remove duplicate pod logs, add make run~~ OADP-774 must-gather resolve velero downloadrequest stuck on wait.Until, logs typo, remove duplicate pod logs, add make run Sep 9, 2022

kaovilai changed the title ~~OADP-774 must-gather resolve velero downloadrequest stuck on wait.Until, logs typo, remove duplicate pod logs, add make run~~ OADP-774 must-gather unstuck from velero downloadrequest wait.Until, var typo, remove duplicate logs, add make run Sep 9, 2022

kaovilai removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2022

Use value for timeout from $timeout

fbfd433

savitharaghunathan reviewed Sep 9, 2022

shubham-pampattiwar approved these changes Sep 13, 2022

kaovilai changed the title ~~OADP-774 must-gather unstuck from velero downloadrequest wait.Until, var typo, remove duplicate logs, add make run~~ OADP-774 must-gather: add timeout to velero logs/describe, var typos, remove duplicate logs, add make run Sep 13, 2022

kaovilai merged commit 64558cd into openshift:master Sep 13, 2022

openshift-cherrypick-robot mentioned this pull request Sep 13, 2022

[oadp-1.1] OADP-774 must-gather: add timeout to velero logs/describe, var typos, remove duplicate logs, add make run #821

Merged
