Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OADP-774 must-gather: add timeout to velero logs/describe, var typos, remove duplicate logs, add make run #816

Merged
merged 5 commits into from
Sep 13, 2022

Conversation

kaovilai
Copy link
Member

@kaovilai kaovilai commented Sep 8, 2022

OADP-774

Signed-off-by: Tiger Kaovilai tkaovila@redhat.com

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai kaovilai changed the title mustgather logs improvements, add make run mustgather logs typo, remove duplicate pod logs, add make run Sep 8, 2022
Copy link
Member

@savitharaghunathan savitharaghunathan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 8, 2022
mkdir -p ${object_collection_path}
oc describe podvolumerestores.velero.io ${pvr} --namespace ${ns} &> "${object_collection_path}/pvr-describe-${pvr}.txt"
echo "[cluster=${cluster}][ns=${ns}][pod=${pod}] Collecting Pod logs..."
oc logs --all-containers --namespace ${ns} ${pod} --since ${logs_since} &> "${object_collection_path}/current.log" &
echo "[cluster=${cluster}][ns=${ns}][pod=${pod}] Collecting previous Pod logs..."
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the empty [pod=] logs came from

@savitharaghunathan
Copy link
Member

/retest-required

1 similar comment
@kaovilai
Copy link
Member Author

kaovilai commented Sep 9, 2022

/retest-required

@kaovilai
Copy link
Member Author

kaovilai commented Sep 9, 2022

/hold this did not fix issue we were trying to fix.. Getting stuck at "Collecting volumesnapshotlocations"

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2022
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2022
@openshift-ci
Copy link

openshift-ci bot commented Sep 9, 2022

New changes are detected. LGTM label has been removed.

@kaovilai
Copy link
Member Author

kaovilai commented Sep 9, 2022

everything seems to work

fi
echo "[cluster=${cluster}][ns=${ns}] Gathering 'velero restore logs ${restore}'"
oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "/velero restore logs ${restore} --insecure-skip-tls-verify=${skip_tls} --timeout=30s" &> "${object_collection_path}/restore-${restore}.log" &
oc -n ${ns} exec $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "timeout 30s /velero restore logs ${restore} --insecure-skip-tls-verify=${skip_tls} --timeout=30s" &> "${object_collection_path}/restore-${restore}.log" &
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are forcing velero CLI commands that involve downloadrequest.Stream to timeout which will resolve issues related to must-gather getting stuck when querying from nonexistent backup storage location.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaovilai kaovilai changed the title mustgather logs typo, remove duplicate pod logs, add make run OADP-774 must-gather resolve velero downloadrequest stuck on wait.Until, logs typo, remove duplicate pod logs, add make run Sep 9, 2022
@kaovilai
Copy link
Member Author

kaovilai commented Sep 9, 2022

/unhold ready for review

@kaovilai
Copy link
Member Author

kaovilai commented Sep 9, 2022

/retest

@kaovilai kaovilai changed the title OADP-774 must-gather resolve velero downloadrequest stuck on wait.Until, logs typo, remove duplicate pod logs, add make run OADP-774 must-gather unstuck from velero downloadrequest wait.Until, var typo, remove duplicate logs, add make run Sep 9, 2022
@kaovilai kaovilai removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2022
@savitharaghunathan
Copy link
Member

@kaovilai should there be a default timeout value?

else
oc -n ${ns} exec --request-timeout=${timeout} $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "/velero describe backup ${backup} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/backup-describe-${backup}.txt" &
oc -n ${ns} exec --request-timeout=${timeout} $(oc -n ${ns} get po -l component=velero -o custom-columns=name:.metadata.name --no-headers) -- /bin/bash -c "timeout ${timeout} /velero describe backup ${backup} --insecure-skip-tls-verify=${skip_tls} --details" &> "${object_collection_path}/backup-describe-${backup}.txt" &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need --request-timeout=${timeout} here if timeout is getting passed in the velero container?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory no.

Copy link
Member Author

@kaovilai kaovilai Sep 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From further testing and doc digging.. --request-timeout has very different effects to timeout in velero container.

--request-timeout is The length of time to wait before giving up on a single api-server request.

whereas, if the api-server has responded (/velero cli executed but yet to print to stdout), request-timeout do not work to kill a stuck velero CLI process and must-gather still get stuck.

So I propose we keep both for the $timeout defined case.

@weshayutin
Copy link
Contributor

/retest

1 similar comment
@kaovilai
Copy link
Member Author

/retest

@kaovilai
Copy link
Member Author

checking for ways to write other than empty file when timeout is triggered.

@kaovilai
Copy link
Member Author

/retest

3 similar comments
@kaovilai
Copy link
Member Author

/retest

@kaovilai
Copy link
Member Author

/retest

@weshayutin
Copy link
Contributor

/retest

@kaovilai
Copy link
Member Author

@shubham-pampattiwar when timeout occurs, the file (backup/restore-log/describe-) will contain output like the following if timeout killed the program.

Defaulted container "velero" out of: velero, openshift-velero-plugin (init), velero-plugin-for-aws (init), velero-plugin-for-csi (init)
I0913 00:38:32.700413  121602 request.go:665] Waited for 1.188090857s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/k8s.cni.cncf.io/v1?timeout=32s
command terminated with exit code 124

Specifically 124 is an exit code given by timeout if timeout killed the program. I believe the verbosity seen in the file came from oc exec reporting this exit code.

@openshift-ci
Copy link

openshift-ci bot commented Sep 13, 2022

@kaovilai: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@shubham-pampattiwar
Copy link
Member

@kaovilai LGTM!

@kaovilai kaovilai changed the title OADP-774 must-gather unstuck from velero downloadrequest wait.Until, var typo, remove duplicate logs, add make run OADP-774 must-gather: add timeout to velero logs/describe, var typos, remove duplicate logs, add make run Sep 13, 2022
@kaovilai kaovilai merged commit 64558cd into openshift:master Sep 13, 2022
@kaovilai
Copy link
Member Author

/cherry-pick oadp-1.1

@kaovilai
Copy link
Member Author

/cherry-pick oadp-1.0

@openshift-cherrypick-robot
Copy link
Contributor

@kaovilai: new pull request created: #821

In response to this:

/cherry-pick oadp-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot
Copy link
Contributor

@kaovilai: #816 failed to apply on top of branch "oadp-1.0":

Applying: Typo shell variable
Using index info to reconstruct a base tree...
M	must-gather/collection-scripts/logs/gather_logs_restore
Falling back to patching base and 3-way merge...
Auto-merging must-gather/collection-scripts/logs/gather_logs_restore
Applying: mustgather `make run` target
Applying: do not gather restic logs twice
Applying: force a timeout of velero command execution
Using index info to reconstruct a base tree...
M	must-gather/collection-scripts/logs/gather_logs_backup
M	must-gather/collection-scripts/logs/gather_logs_restore
Falling back to patching base and 3-way merge...
Auto-merging must-gather/collection-scripts/logs/gather_logs_restore
CONFLICT (content): Merge conflict in must-gather/collection-scripts/logs/gather_logs_restore
Auto-merging must-gather/collection-scripts/logs/gather_logs_backup
CONFLICT (content): Merge conflict in must-gather/collection-scripts/logs/gather_logs_backup
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0004 force a timeout of velero command execution
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick oadp-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants