Fix the task status and the container KnownStatus when ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT is enabled #2800

Merged
merged 1 commit into from
Feb 2, 2021

@chienhanlin chienhanlin commented Jan 28, 2021

Summary

This PR fixes tasks getting stuck in PENDING when ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT is enabled and an essential container encounters CannotPullContainerError while pulling its image from the remote registry.

When the Agent environment variable ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT=true is set, Agent starts pulling images for containers with dependencies before their dependsOn conditions are satisfied. However, if anything prevents Agent from pulling an image, the affected container neither transitions to KnownStatus: PULLED nor reaches KnownStatus: CREATED/STOPPED, so the task remains in the PENDING state.

An example task definition and a workflow are provided as follows.

Container A
essential: false
"dependsOn": null

Container B
essential: true
"dependsOn": [
    {
        "containerName": "A",
        "condition": "SUCCESS"
    }
]
Task: ----------------------------------------- PENDING--------------------------------------------------
Container A: PULLED -> CREATED -> RUNNING (Waiting for Container B to reach KnownStatus: PULLED)
Container B: ------------------------- CannotPullContainerError--------------------------------------------

With this PR, when Agent fails to pull the image from the remote registry and ECS_IMAGE_PULL_BEHAVIOR is not set to always, Agent searches the local cached images for it. If the image is found locally, Agent adds the container to the pulled container state; otherwise, it sets the desired status of the task to STOPPED in order to stop the task. The expected task stopped reason, container status reason, and last task status are shown below.

  • If container B is an essential container -> ECS_IMAGE_PULL_BEHAVIOR=default -> failed to pull the image from remote -> Task stopped reason on ECS control plane: Task failed to start -> container B's status reason on ECS control plane: CannotPullContainerError: Error response from daemon: pull access denied for customer, repository does not exist or may require 'docker login': denied: requested access to the resource is denied -> The task last status: STOPPED

Implementation details

Add logic to search local cached images in agent/engine/docker_task_engine.go when

  1. Agent failed to pull the image from remote
  2. ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT=true
  3. ECS_IMAGE_PULL_BEHAVIOR is not always

Set the desired status of the task to STOPPED when

  1. Agent cannot find the image in the local cached images
  2. The container is an essential container
  • If the image can be found locally -> set the findCachedImage to true -> Add the container to the pulled container state
  • If the image cannot be found locally -> if the container is an essential container -> set DesiredStatus of the task to TaskStopped -> emit task event with an error -> return container metadata with error
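The conditions above can be condensed into a small decision function. This is a minimal sketch rather than the agent's code: the Outcome type, its constants, and afterPullFailure are hypothetical names introduced for illustration, and the sketch assumes ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT=true and that the remote pull has already failed. The non-essential branch follows the diff quoted later in this conversation, where the error metadata is returned without touching the task status.

```go
package main

import "fmt"

// Outcome is a hypothetical label for what happens after a failed remote pull.
type Outcome string

const (
	UseCachedImage Outcome = "add container to pulled container state"
	StopTask       Outcome = "set task DesiredStatus to STOPPED and return the pull error"
	ReportError    Outcome = "return the pull error without changing the task status"
)

// afterPullFailure models the rules listed above. It assumes that
// ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT is enabled and the remote pull failed.
func afterPullFailure(pullBehavior string, cachedLocally, essential bool) Outcome {
	// The local cache is only consulted when the pull behavior is not "always".
	if pullBehavior != "always" && cachedLocally {
		return UseCachedImage
	}
	// No usable image: only an essential container forces the task to stop.
	if essential {
		return StopTask
	}
	return ReportError
}

func main() {
	fmt.Println(afterPullFailure("default", true, true))
	fmt.Println(afterPullFailure("always", true, true))
	fmt.Println(afterPullFailure("once", false, false))
}
```

The sketch only covers the two numbered lists above; it does not attempt to reproduce every combination exercised in manual testing.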

Testing

Manual testing

Case 1: When the image for container B is available neither remotely nor in the local cached images
Result:

  • Task metadata endpoint version 4 /task endpoint returns task metadata with the container A only.
  • The task stopped reason: Task failed to start
  • B's container status reason: CannotPullContainerError: Error response from daemon: pull access denied for customer, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  • The task last status: STOPPED

curl ${ECS_CONTAINER_METADATA_URI_V4}/task

{
   ...
   "DesiredStatus":"STOPPED",
   "KnownStatus":"STOPPED",
   ...
   "Containers":[
      {
         "DockerId":"xxx",
         "Name":"A",
         "DockerName":"xxxx",
         "Image":"xxx",
         "ImageID":"xxx",
         ...
      }
   ]
}

Case 2: When the image for container B is only available in local cached images
Result:

  • Task metadata endpoint version 4 /task endpoint returns task metadata with two containers, and the container B has KnownStatus: PULLED.

curl ${ECS_CONTAINER_METADATA_URI_V4}/task

{
   ...
   "DesiredStatus":"RUNNING",
   "KnownStatus":"NONE",
   ...
   "Containers":[
      {
         "DockerId":"xxx",
         "Name":"A",
         "DockerName":"xxxx",
         "Image":"xxx",
         "ImageID":"xxx",
         ...
      },
      {
         "DockerId":"",
         "Name":"B",
         "DockerName":"",
         "Image":"xxx",
         "ImageID":"xxx",
         "DesiredStatus":"RUNNING",
         "KnownStatus":"PULLED",
         ...
      }
   ]
}
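The Case 2 check can also be scripted instead of read by eye. The following is a rough sketch that parses a /task response body and reports the KnownStatus of a named container. The taskResponse type and containerKnownStatus helper are hypothetical names, only the fields visible in the output above are decoded, and fetching the body from ${ECS_CONTAINER_METADATA_URI_V4}/task is left out so the example stays self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskResponse holds only the fields of the /task response that this check needs.
type taskResponse struct {
	DesiredStatus string
	KnownStatus   string
	Containers    []struct {
		Name        string
		KnownStatus string
	}
}

// containerKnownStatus returns the KnownStatus reported for the named container.
// The boolean is false when the container is absent from the response.
func containerKnownStatus(body []byte, name string) (string, bool, error) {
	var task taskResponse
	if err := json.Unmarshal(body, &task); err != nil {
		return "", false, err
	}
	for _, c := range task.Containers {
		if c.Name == name {
			return c.KnownStatus, true, nil
		}
	}
	return "", false, nil
}

func main() {
	// Trimmed sample shaped like the Case 2 output; the values are placeholders.
	sample := []byte(`{"DesiredStatus":"RUNNING","KnownStatus":"NONE",
		"Containers":[{"Name":"A"},{"Name":"B","DesiredStatus":"RUNNING","KnownStatus":"PULLED"}]}`)

	status, found, err := containerKnownStatus(sample, "B")
	fmt.Println(status, found, err)
}
```

For Case 1 and Case 4 the same helper would report container B as absent, since the endpoint returns metadata for container A only.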

Case 3: When the image for container B is available in both remote and local cached images, and ECS_IMAGE_PULL_BEHAVIOR=once/prefer-cached
Result:

  • Task metadata endpoint version 4 /task endpoint returns task metadata with two containers, and the container B has KnownStatus: PULLED.

Case 4: When the image for container B is not available remotely, and ECS_IMAGE_PULL_BEHAVIOR=always
Result:

  • Task metadata endpoint version 4 /task endpoint returns task metadata with the container A only.
  • The task stopped reason: Task failed to start
  • B's container status reason: CannotPullContainerError: Error response from daemon: pull access denied for customer, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  • The task last status: STOPPED

More test cases and task status when ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT is enabled

Test # | ECS_IMAGE_PULL_BEHAVIOR | B’s image availability | B is essential | Task stopped
1 | default | remote & local | true | false
2 | always | remote & local | true | false
3 | once | remote & local | true | false
4 | prefer-cached | remote & local | true | false
5 | default | remote only | true | false
6 | always | remote only | true | false
7 | once | remote only | true | false
8 | prefer-cached | remote only | true | false
9 | default | local only | true | false
10 | always | local only | true | true
11 | once | local only & has pulled | true | false
12 | prefer-cached | local only | true | false
13 | default | not available | true | true
14 | always | not available | true | true
15 | once | not available | true | true
16 | prefer-cached | not available | true | true
17 | default | remote & local | false | false
18 | always | remote & local | false | false
19 | once | remote & local | false | false
20 | prefer-cached | remote & local | false | false
21 | default | remote only | false | false
22 | always | remote only | false | false
23 | once | remote only | false | false
24 | prefer-cached | remote only | false | false
25 | default | local only | false | false
26 | always | local only | false | true
27 | once | local only & has pulled | false | false
28 | prefer-cached | local only | false | false
29 | default | not available | false | false
30 | always | not available | false | true
31 | once | not available | false | true
32 | prefer-cached | not available | false | false

New tests cover the changes: yes
Update the unit test TestPullAndUpdateContainerReference with more scenarios.

--- PASS: TestPullAndUpdateContainerReference (0.00s)
    --- PASS: TestPullAndUpdateContainerReference/DependentContainersPullUpfrontEnabledWithRemoteImage (0.00s)
    --- PASS: TestPullAndUpdateContainerReference/DependentContainersPullUpfrontDisabledWithRemoteImage (0.00s)
    --- PASS: TestPullAndUpdateContainerReference/DependentContainersPullUpfrontEnabledWithCachedImage (0.00s)
    --- PASS: TestPullAndUpdateContainerReference/DependentContainersPullUpfrontEnabledAndImagePullOnceBehavior (0.00s)
    --- PASS: TestPullAndUpdateContainerReference/DependentContainersPullUpfrontEnabledAndImagePullPreferCachedBehavior (0.00s)
    --- PASS: TestPullAndUpdateContainerReference/DependentContainersPullUpfrontEnabledAndImagePullAlwaysBehavior (0.00s)

Note: Agent behavior remains the same for non-essential containers when ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT is enabled.

Description for the changelog

Bug - Fixed a task status deadlock and pulled container state for cached images when ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT is enabled

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@chienhanlin chienhanlin force-pushed the fixPulledFailedEssential branch from 1a4083e to 6f41413 Compare January 28, 2021 20:17
@chienhanlin chienhanlin force-pushed the fixPulledFailedEssential branch from 6f41413 to 41c293f Compare January 28, 2021 23:02
@chienhanlin chienhanlin changed the title Fix the task status when an essential container encountered CannotPullContainerError Fix the task status and the container KnownStatus when ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT is enabled Jan 28, 2021
@chienhanlin chienhanlin marked this pull request as ready for review January 28, 2021 23:41
@@ -840,6 +840,12 @@ func (engine *DockerTaskEngine) pullContainer(task *apitask.Task, container *api

}

// Add the container to pulled container state
Contributor

Could you update the comment to note this is for cached images?

Contributor

+1

Contributor Author

The comment is updated to

func (engine *DockerTaskEngine) pullContainer(task *apitask.Task, container *apicontainer.Container) dockerapi.DockerContainerMetadata {
	...
	if engine.imagePullRequired(engine.cfg.ImagePullBehavior, container, task.Arn) {...}

	// No pull image is required, the cached image will be used.
	// Add the container that uses the cached image to the pulled container state.
	dockerContainer := &apicontainer.DockerContainer{
		Container: container,
	}
	engine.state.AddPulledContainer(dockerContainer, task)
       ...
}

Thank you!

@@ -1722,6 +1722,43 @@ func TestPullAndUpdateContainerReference(t *testing.T) {
assert.Equal(t, dockerapi.DockerContainerMetadata{}, metadata, "expected empty metadata")
}

// TestPullAndUpdateContainerReferenceWithCachedImage checks whether a container is added to task engine state when
// the container is an essential container, the image is cached and DependentContainersPullUpfront is enabled.
func TestPullAndUpdateContainerReferenceWithCachedImage(t *testing.T) {
Contributor

could you add a test case for the negative scenario? - image pull behavior is always

Contributor Author

The negative test case is added, named TestPullAndUpdateContainerReferenceFailedToGetImage here
Thank you!

ubhattacharjya
ubhattacharjya previously approved these changes Jan 29, 2021
// Stop the task if the container is an essential container,
// and the image is not available in both remote and local caches
if container.IsEssential() {
task.SetKnownStatus(apitaskstatus.TaskStopped)
Contributor

I'd only change the desired status here as we don't know the task is stopped at this point.

Contributor Author

Removed task.SetKnownStatus(apitaskstatus.TaskStopped) and kept task.SetDesiredStatus(apitaskstatus.TaskStopped). Thank you!

}
return dockerapi.DockerContainerMetadata{Error: metadata.Error}
}
seelog.Infof("Task engine [%s]: found cached image %s, use it directly for container %s",
Contributor

so if engine.client.InspectImage(container.Image) doesn't return an error we can be sure the image exists?

Contributor Author

Tracing back through the function calls, I found that InspectImage, defined in agent/dockerclient/dockerapi/docker_client.go, returns the image data received from the client.ImageInspectWithRaw() call.

func (dg *dockerGoClient) InspectImage(image string) (*types.ImageInspect, error) {
	defer metrics.MetricsEngineGlobal.RecordDockerMetric("INSPECT_IMAGE")()
	client, err := dg.sdkDockerClient()
	if err != nil {
		return nil, err
	}
	imageData, _, err := client.ImageInspectWithRaw(dg.context, image)
	return &imageData, err
}

ImageInspectWithRaw, defined in agent/vendor/github.com/docker/docker/client/image_inspect.go, returns low-level information about an image based on the given image name or ID. If the image cannot be found through the Docker Engine API, a 404 error response is returned. Details can be found here.

// ImageInspectWithRaw returns the image information and its raw representation.
func (cli *Client) ImageInspectWithRaw(ctx context.Context, imageID string) (types.ImageInspect, []byte, error) {
	if imageID == "" {
		return types.ImageInspect{}, nil, objectNotFoundError{object: "image", id: imageID}
	}
	serverResp, err := cli.get(ctx, "/images/"+imageID+"/json", nil, nil)
	if err != nil {
		return types.ImageInspect{}, nil, wrapResponseError(err, serverResp, "image", imageID)
	}
	defer ensureReaderClosed(serverResp)

	body, err := ioutil.ReadAll(serverResp.body)
	if err != nil {
		return types.ImageInspect{}, nil, err
	}

	var response types.ImageInspect
	rdr := bytes.NewReader(body)
	err = json.NewDecoder(rdr).Decode(&response)
	return response, body, err
}

We already use this approach in imagePullRequired to check whether the image exists when ImagePullBehavior is prefer-cached.

func (engine *DockerTaskEngine) imagePullRequired(imagePullBehavior config.ImagePullBehaviorType,
	container *apicontainer.Container,
	taskArn string) bool {
	switch imagePullBehavior {
	case config.ImagePullOnceBehavior:
                  ...
	case config.ImagePullPreferCachedBehavior:
		// If the behavior is prefer cached, don't pull if we found cached image
		// by inspecting the image.
		_, err := engine.client.InspectImage(container.Image)
		if err != nil {
			return true
		}
		seelog.Infof("Task engine [%s]: found cached image %s, use it directly for container %s",
			taskArn, container.Image, container.Name)
		return false
	default:
		// Need to pull the image for always and default agent pull behavior
		return true
	}
}

Based on this information, I think it is OK to verify cached images through engine.client.InspectImage(container.Image). Please let me know if anything is still missing here. Thank you!


// TestPullAndUpdateContainerReferenceFailedToGetImage checks whether a container is added to task engine state when
// the container is an essential container, ImagePullBehavior is set to always and DependentContainersPullUpfront is enabled.
func TestPullAndUpdateContainerReferenceFailedToGetImage(t *testing.T) {
Contributor

I'd probably use a test table as the code seems largely duplicated.

Contributor Author

Updated to the table driven test here. Thank you!

@@ -1685,41 +1686,119 @@ func TestUpdateContainerReference(t *testing.T) {
}

// TestPullAndUpdateContainerReference checks whether a container is added to task engine state when
// pullSucceeded and DependentContainersPullUpfront is enabled.
// Test # | Image availability | DependentContainersPullUpfront | ImagePullBehavior
Contributor

Excellent!

Contributor Author

Thank you!

Contributor

Looks good! Although I think we removed the UTs for when DependentContainersPullUpfront is NOT enabled?

Contributor Author

DependentContainersPullUpfrontDisabledWithRemoteImage is added to TestPullAndUpdateContainerReference. Thank you!

mssrivas
mssrivas previously approved these changes Feb 1, 2021
sharanyad
sharanyad previously approved these changes Feb 1, 2021
…ontainers

and ECS_PULL_DEPENDENT_CONTAINERS_UPFRONT=true. The task will be stopped
if no cached image can be found locally.
@chienhanlin chienhanlin dismissed stale reviews from sharanyad and mssrivas via a7f975f February 1, 2021 21:20
@chienhanlin chienhanlin force-pushed the fixPulledFailedEssential branch from 3068aec to a7f975f Compare February 1, 2021 21:20
@chienhanlin chienhanlin merged commit 1f6960f into aws:dev Feb 2, 2021
@chienhanlin chienhanlin deleted the fixPulledFailedEssential branch February 2, 2021 08:11
@shubham2892 shubham2892 added this to the 1.50.1 milestone Feb 10, 2021