Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] improved error reporting during env deployment/configuration #650

Merged
merged 1 commit into from
Jan 16, 2025

Conversation

knopers8
Copy link
Collaborator

@knopers8 knopers8 commented Jan 15, 2025

  • "task '/opt/o2/bin/o2-readout-exe' on alio2-cr1-mvs11 (id 2sBwA3Z8yWU, name alio2-cr1-hv-gw01.cern.ch:/opt/git/ControlWorkflows/tasks/readout@12b11ac4bb652e1835e3e94806a688c951691d5f#2sBwA3Z8yWU) failed with error..." displayed by COG becomes "task 'readout' on alio2-cr1-mvs11 (id 2sBwA3Z8yWU) failed with error...". I believe the removed information is not useful for typical operators. Details can be always found in logs.
  • "MesosCommand MesosCommand_Transition timed out for task 2sBwYFZ82wn" becomes "MesosCommand_Transition timed out for task 2sBwYFZ82wn" because by convention, all MesosCommands have "MesosCommand" in their names.
  • If a transition request times out, more accurate error is reported in COG. "nil response" becomes "CONFIGURE could not complete for critical tasks, errors: task 'readout' on alio2-cr1-mvs11 (id 2sBwA3Z8yWU) failed with error: MesosCommand_Transition timed out for task 2sBwA3Z8yWU".
  • In case a task transition times out, any other errors which happened at the same time (e.g. a task crashed during transition) are not omitted anymore in COG.
  • "nil response" in task/manager.go becomes "no response from Mesos to CONFIGURE transition request within 120s timeout", but it will not be typically printed for tasks timing out during a transition.
  • "nil response" in commandqueue.go becomes "did not receive neither response nor error for MesosCommand_Transition"

One could consider further simplification of some of these messages, but perhaps let's see how this goes.

OCTRL-975

…iguration

- "task '/opt/o2/bin/o2-readout-exe' on alio2-cr1-mvs11 (id 2sBwA3Z8yWU, name alio2-cr1-hv-gw01.cern.ch:/opt/git/ControlWorkflows/tasks/readout@12b11ac4bb652e1835e3e94806a688c951691d5f#2sBwA3Z8yWU) failed with error..." displayed by COG becomes "task 'readout' on alio2-cr1-mvs11 (id 2sBwA3Z8yWU) failed with error...". I believe the removed information is not useful for typical operators. Details can be always found in logs.
- "MesosCommand MesosCommand_Transition timed out for task 2sBwYFZ82wn" becomes "MesosCommand_Transition timed out for task 2sBwYFZ82wn" because by convention, all MesosCommands have "MesosCommand" in their names.
- If a transition request times out, more accurate error is reported in COG. "nil response" becomes "CONFIGURE could not complete for critical tasks, errors: task 'readout' on alio2-cr1-mvs11 (id 2sBwA3Z8yWU) failed with error: MesosCommand_Transition timed out for task 2sBwA3Z8yWU".
- In case a task transition times out, any other errors which happened at the same time (e.g. a task crashed during transition) are not omitted anymore in COG.
- "nil response" in task/manager.go becomes "no response from Mesos to CONFIGURE transition request within 120s timeout", but it will not be typically printed for tasks timing out during a transition.
- "nil response" in commandqueue.go becomes "did not receive neither response nor error for MesosCommand_Transition"

One could consider further simplification of some of these messages, but perhaps let's see how this goes.
@knopers8 knopers8 requested review from teo and justonedev1 January 15, 2025 09:48
@knopers8 knopers8 merged commit 0a82858 into AliceO2Group:master Jan 16, 2025
2 checks passed
@knopers8 knopers8 deleted the nil-response branch January 16, 2025 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants