improve async error reporting #4550

Merged: 2 commits, Sep 30, 2024

wdbaruni
Member

Improves asynchronous error reporting for executions, from compute nodes back to orchestrators and the job store. This covers errors from the docker executor, the s3 publisher, and input sources.

The PR does the following:

  1. Enriches S3 errors with the AWS error code and additional metadata
  2. Uses the new bacerrors.Error for errors returned by docker
  3. Adds a new ErrorCode to models.Event details, populated from bacerrors in the form {Component}:{ErrorCode}, such as S3Publisher:NoSuchBucket and Docker:ImageNotFound
  4. Introduces a new Details field on the execution's compute state, which holds additional metadata about the latest state of the execution, mainly the ErrorCode
  5. Publishes ErrorCode to otel analytics
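
As a rough illustration of the {Component}:{ErrorCode} convention in item 3, here is a minimal Go sketch. The `codedError` type and the helpers `qualifiedCode`, `eventDetails`, and `splitCode` are hypothetical names made up for this example; they are not the actual bacerrors.Error or models.Event definitions, and the real code in this PR may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// codedError is a hypothetical stand-in for a component-scoped error.
// It is NOT the real bacerrors.Error type.
type codedError struct {
	Component string            // e.g. "Docker" or "S3Publisher"
	Code      string            // e.g. "ImageNotFound"
	Message   string            // human-readable message
	Details   map[string]string // extra metadata, e.g. "Image"
}

func (e *codedError) Error() string { return e.Message }

// qualifiedCode joins a component and an error code as "Component:ErrorCode".
func qualifiedCode(component, code string) string {
	return component + ":" + code
}

// eventDetails copies the error's metadata into a new map and adds an
// "ErrorCode" entry holding the qualified code.
func eventDetails(e *codedError) map[string]string {
	details := make(map[string]string, len(e.Details)+1)
	for k, v := range e.Details {
		details[k] = v
	}
	details["ErrorCode"] = qualifiedCode(e.Component, e.Code)
	return details
}

// splitCode splits "Component:ErrorCode" at the first colon.
// ok is false when the input contains no colon.
func splitCode(qualified string) (component, code string, ok bool) {
	return strings.Cut(qualified, ":")
}

func main() {
	err := &codedError{
		Component: "Docker",
		Code:      "ImageNotFound",
		Message:   `image not available: "non_existent_image"`,
		Details:   map[string]string{"Image": "non_existent_image"},
	}
	details := eventDetails(err)
	fmt.Println(details["ErrorCode"]) // Docker:ImageNotFound
	fmt.Println(details["Image"])     // non_existent_image
}
```

Keeping the component as a prefix means a consumer such as an analytics pipeline can group by component or by exact code using a single string field.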

Examples:

Bad docker image

→ bacalhau docker run non_existent_image

Job successfully submitted. Job ID: j-29a81940-18a2-44b7-b0da-807d45946f45
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC            EVENT
 22:37:32.323              Submission       Job submitted
 22:37:32.340  e-640f0876  Scheduling       Requested execution on n-7c5b7d69
                                            * NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
 22:37:34.569  e-640f0876  Exec Scanning    Error: image not available: "non_existent_image"
                                            Hint: To resolve this, either:
                                            1. Check if the image exists in the registry and the name is correct
                                            2. If the image is private, supply the node with valid Docker login credentials using the
                                            DOCKER_USERNAME and DOCKER_PASSWORD environment variables
                                            * ErrorCode: Docker:ImageNotFound
                                            * Image: non_existent_image
 22:37:34.585  e-a3a3afe2  Scheduling       Requested execution on n-7c5b7d69
                                            * NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
 22:37:36.732  e-a3a3afe2  Exec Scanning    Error: image not available: "non_existent_image"
                                            Hint: To resolve this, either:
                                            1. Check if the image exists in the registry and the name is correct
                                            2. If the image is private, supply the node with valid Docker login credentials using the
                                            DOCKER_USERNAME and DOCKER_PASSWORD environment variables
                                            * ErrorCode: Docker:ImageNotFound
                                            * Image: non_existent_image

Error: job failed

To get more details about the run, execute:
	bacalhau job describe j-29a81940-18a2-44b7-b0da-807d45946f45

To get more details about the run executions, execute:
	bacalhau job executions j-29a81940-18a2-44b7-b0da-807d45946f45


→ bacalhau job executions j-29a81940-18a2-44b7-b0da-807d45946f45 --output yaml
- AllocatedResources:
    Tasks: {}
  ComputeState:
    Message: 'image not available: "non_existent_image"'
    StateType: 8
  CreateTime: 1727642252340926000
  DesiredState:
    Message: execution failed
    StateType: 2
  EvalID: ecad787d-e72a-4987-b353-cd6552d546bf
  FollowupEvalID: ""
  ID: e-640f0876-119b-40ac-883e-b2126b5a40f3
  JobID: j-29a81940-18a2-44b7-b0da-807d45946f45
  ModifyTime: 1727642254570170000
  Name: ""
  Namespace: default
  NextExecution: ""
  NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
  PreviousExecution: ""
  PublishedResult:
    Type: ""
  Revision: 3
  RunOutput: null
- AllocatedResources:
    Tasks: {}
  ComputeState:
    Message: 'image not available: "non_existent_image"'
    StateType: 8
  CreateTime: 1727642254585495000
  DesiredState:
    Message: execution failed
    StateType: 2
  EvalID: ef3bae6f-54fb-4f48-9b83-98364049e685
  FollowupEvalID: ""
  ID: e-a3a3afe2-5d12-498f-ad19-86ea00425d30
  JobID: j-29a81940-18a2-44b7-b0da-807d45946f45
  ModifyTime: 1727642256732971000
  Name: ""
  Namespace: default
  NextExecution: ""
  NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
  PreviousExecution: ""
  PublishedResult:
    Type: ""
  Revision: 3
  RunOutput: null

Bad S3 bucket

→ bacalhau job run docker-s3.yaml
Job successfully submitted. Job ID: j-036bc69b-7b81-489b-a714-d1349d6e6f5b
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

 TIME          EXEC. ID    TOPIC            EVENT
 22:36:57.853              Submission       Job submitted
 22:36:57.868  e-ad0ab10c  Scheduling       Requested execution on n-7c5b7d69
                                            * NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
 22:36:57.929  e-ad0ab10c  Execution        Running
 22:37:03.414  e-ad0ab10c  Publishing       Error: failed to publish s3 result: operation error S3: PutObject, https response error StatusCode:
                           Results          404, RequestID: 62FSTZ2400AA0782, api error NoSuchBucket: The specified bucket does not exist
                                            * AWSRequestID: 62FSTZ2400AA0782
                                            * ErrorCode: S3Publisher:NoSuchBucket
                                            * Operation: PutObject
                                            * Service: S3
 22:37:03.432  e-995b726b  Scheduling       Requested execution on n-7c5b7d69
                                            * NodeID: n-7c5b7d69-c42d-493e-ade0-7d6feeedc507
 22:37:03.482  e-995b726b  Execution        Running
 22:37:07.085  e-995b726b  Publishing       Error: failed to publish s3 result: operation error S3: PutObject, https response error StatusCode:
                           Results          404, RequestID: YNJQY666GB15CT3K, api error NoSuchBucket: The specified bucket does not exist
                                            * Operation: PutObject
                                            * Service: S3
                                            * AWSRequestID: YNJQY666GB15CT3K
                                            * ErrorCode: S3Publisher:NoSuchBucket

Error: job failed

To get more details about the run, execute:
	bacalhau job describe j-036bc69b-7b81-489b-a714-d1349d6e6f5b

To get more details about the run executions, execute:
	bacalhau job executions j-036bc69b-7b81-489b-a714-d1349d6e6f5b

To download the results, execute:
	bacalhau job get j-036bc69b-7b81-489b-a714-d1349d6e6f5b

@wdbaruni wdbaruni requested a review from udsamani September 29, 2024 20:49
@wdbaruni wdbaruni marked this pull request as ready for review September 29, 2024 20:49
@udsamani
Contributor

It seems there is a legit failing test. Can we investigate that? Other than that, the PR looks good.

@wdbaruni wdbaruni merged commit 3fd303d into main Sep 30, 2024
3 of 4 checks passed
@wdbaruni wdbaruni deleted the async-error-reporting branch September 30, 2024 07:08