(aws-stepfunctions-tasks): Fargate task definition contains version which causes step function failure when new task definition is pushed. #12080

Closed
gupta-n opened this issue Dec 15, 2020 · 3 comments · Fixed by #12436
Assignees
Labels
@aws-cdk/aws-stepfunctions-tasks bug This issue is a bug. effort/small Small work item – less than a day of effort p1

Comments

gupta-n commented Dec 15, 2020

Hi Team,
Currently, the task definition reference generated by the Step Functions CDK construct includes the task definition revision, which causes Step Functions executions to fail.

Scenario: A Step Functions execution fails to launch a Fargate task because of a transient issue. Before the retry kicks in, another deployment modifies the task definition and bumps its revision. The running execution then fails, because the revision it references is now inactive.

Reproduction Steps

Deploy the state machine below and start an execution. Then, before a retry of the Fargate task runs, deploy a change to the task definition (for example, a different Fargate memory size).

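import * as sfn from '@aws-cdk/aws-stepfunctions';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';

// taskDefn (a Fargate-compatible task definition) and props.ecsCluster are defined elsewhere in the stack.
const workerTask = new sfn.Task(this, 'Worker', {
    task: new tasks.RunEcsFargateTask({
        cluster: props.ecsCluster,
        taskDefinition: taskDefn,
        // Wait for the ECS task to finish before moving to the next state.
        integrationPattern: sfn.ServiceIntegrationPattern.SYNC,
        containerOverrides: [
            {
                containerName: "TestContainer",
                environment: []
            }
        ]
    }),
    outputPath: "$"
});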

Actual task definition reference in the state machine definition

"parameters": {
    "Cluster": "arn:aws:ecs:us-east-1:XXXXXXXXXXX:cluster/EcsCluster-Cluster2e",
    "TaskDefinition": "arn:aws:ecs:us-east-1:XXXXXXXXXXX:task-definition/TaskDefinition75BC8FE8:16",
    "NetworkConfiguration": {
      "AwsvpcConfiguration": {
        "Subnets": [
          "subnet-XXXXXXXXXXX"
        ],
        "SecurityGroups": [
          "sg-XXXXXXXXXXX"
        ]
      }
    },

Expected output

"parameters": {
    "Cluster": "arn:aws:ecs:us-east-1:XXXXXXXXXXX:cluster/EcsCluster-Cluster2e",
    "TaskDefinition": "arn:aws:ecs:us-east-1:XXXXXXXXXXX:task-definition/TaskDefinition75BC8FE8",
    "NetworkConfiguration": {
      "AwsvpcConfiguration": {
        "Subnets": [
          "subnet-XXXXXXXXXXX"
        ],
        "SecurityGroups": [
          "sg-XXXXXXXXXXX"
        ]
      }
    },

What did you expect to happen?

When creating the Fargate task state, the construct could reference the task definition family instead of a specific revision. The state machine would then always pick up the latest revision and would not fail when a new deployment is pushed.

https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-taskDefinition
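
To make the difference between the actual and expected values concrete, here is a minimal sketch of dropping the revision from a task definition ARN. The helper name `toFamilyArn` is hypothetical and is not part of the CDK; it only illustrates the string shape being asked for.

```typescript
// Hypothetical helper (not a CDK or AWS SDK function): drop a trailing
// ":<revision>" from a task definition ARN so that it names the family only.
function toFamilyArn(taskDefinitionArn: string): string {
  // The resource part is "task-definition/<family>" with an optional ":<revision>".
  return taskDefinitionArn.replace(/(task-definition\/[^:/]+):\d+$/, "$1");
}

// The ARN from the "actual" output above, pinned to revision 16.
const pinned =
  "arn:aws:ecs:us-east-1:XXXXXXXXXXX:task-definition/TaskDefinition75BC8FE8:16";

const familyArn = toFamilyArn(pinned);
console.log(familyArn);
// arn:aws:ecs:us-east-1:XXXXXXXXXXX:task-definition/TaskDefinition75BC8FE8
```

An ARN that already has no revision is returned unchanged, so the helper can be applied more than once without harm.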

What actually happened?

The Step Functions execution failed when new changes to the task definition were pushed.

Environment

  • CDK CLI Version: 1.32.2
  • Framework Version: 1.32.2
  • Node.js Version: 12.x
  • OS: Mac
  • Language (Version): TypeScript ^3.6.4

This is 🐛 Bug Report

@gupta-n gupta-n added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Dec 15, 2020
shivlaks (Contributor) commented Dec 15, 2020

@gupta-n thanks for reporting the issue.

I think we'll need some more information before this can be actioned:

> Currently task definition created by step function CDK appends task version which cause step function failure.

what's the code required to produce it? what's the expectation?
how do i reproduce this error

> Step function failed to launch fargate task because of some transient issue,

what's the point of failure? when executing the state machine? when deploying the CloudFormation template?

> before retry kicks in there was another deployment went which modified task definition and caused version bump.

I'm a little lost, are multiple deployments needed to produce this issue? how/where is the task definition defined?
when you say version bump, what are you referring to? how is that version bump observed?
please provide a minimal set of reproduction steps to define a state machine with a problematic task definition.

> Step function failed when new changes to task definition were pushed.

what kind of changes? are they prerequisites to producing the error?

> Environment

can you fill in all of the requisite version information (NA is not possible for Node.js version and OS). Please mention the specific version that you are using as it's not asking for which versions the bug might be present in.

As i see it, a first step would be to reproduce this error. A minimal code sample and the steps required to produce the failure scenario would be a starting point.

@shivlaks shivlaks added needs-reproduction This issue needs reproduction. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-triage This issue or PR still needs to be triaged. labels Dec 15, 2020
gupta-n (Author) commented Dec 18, 2020

> before retry kicks in there was another deployment went which modified task definition and caused version bump.
>
> I'm a little lost, are multiple deployments needed to produce this issue?

Yes.

> how/where is the task definition defined?

Using the Step Functions construct new tasks.RunEcsFargateTask. I have added a code snippet.

> when you say version bump, what are you referring to? how is that version bump observed?

A version bump happens whenever you modify the task definition. One way to trigger it is to change the Fargate memory size.

> please provide a minimal set of reproduction steps to define a state machine with a problematic task definition.

Added. Let me know if anything else is needed.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 19, 2020
@shivlaks shivlaks added p1 effort/small Small work item – less than a day of effort and removed needs-reproduction This issue needs reproduction. labels Dec 21, 2020
@mergify mergify bot closed this as completed in #12436 Jan 22, 2021
mergify bot pushed a commit that referenced this issue Jan 22, 2021
…instead of ARN (#12436)

feat(stepfunctions-tasks): EcsRunTask now uses taskDefinition family instead of ARN

Currently the ECS run task implementation uses the full ARN of the task definition. This ARN ends with the ACTIVE revision number. The ACTIVE revision changes whenever the task definition changes, which can cause failures (refer to the issue).

This change lets the run task API use the task definition family (and the corresponding ARN, which does not contain the revision) to run the task. Using the family means that the latest ACTIVE revision of the task definition is always used. ECS supports this out of the box (see the refs below).
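
The reasoning above can be sketched with a toy in-memory model. This is not the ECS API or the CDK implementation: `ToyRegistry`, `deploy`, and `resolve` are hypothetical names, and the model assumes (as in the scenario from the issue) that each redeployment leaves only the newest revision ACTIVE.

```typescript
type Status = "ACTIVE" | "INACTIVE";

interface Revision {
  family: string;
  revision: number;
  status: Status;
}

// Toy stand-in for a task definition registry; assumption-laden, not ECS itself.
class ToyRegistry {
  private revisions: Revision[] = [];

  // Register a new revision and mark earlier revisions of the family INACTIVE,
  // mirroring the redeployment described in the issue.
  deploy(family: string): Revision {
    for (const r of this.revisions) {
      if (r.family === family) r.status = "INACTIVE";
    }
    const next = this.revisions.filter((r) => r.family === family).length + 1;
    const rev: Revision = { family, revision: next, status: "ACTIVE" };
    this.revisions.push(rev);
    return rev;
  }

  // Resolve "family" (latest ACTIVE revision) or "family:revision" (that exact one).
  resolve(ref: string): Revision {
    const [family, rev] = ref.split(":");
    const active = this.revisions.filter(
      (r) => r.family === family && r.status === "ACTIVE"
    );
    const match =
      rev === undefined
        ? active.sort((a, b) => b.revision - a.revision)[0]
        : active.find((r) => r.revision === Number(rev));
    if (!match) throw new Error(`no ACTIVE task definition for ${ref}`);
    return match;
  }
}

const registry = new ToyRegistry();
registry.deploy("Worker"); // revision 1
registry.deploy("Worker"); // revision 2; revision 1 becomes INACTIVE

console.log(registry.resolve("Worker").revision); // 2

try {
  registry.resolve("Worker:1"); // pinned to the old revision
} catch (e) {
  console.log((e as Error).message); // no ACTIVE task definition for Worker:1
}
```

A reference pinned to a revision breaks as soon as that revision is no longer ACTIVE, while a family reference keeps resolving across deployments, which is the behavior the change relies on.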

Parameter Ref: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-taskDefinition

Permissions Ref: https://docs.aws.amazon.com/step-functions/latest/dg/ecs-iam.html

closes #12080

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*

mohanrajendran pushed a commit to mohanrajendran/aws-cdk that referenced this issue Jan 24, 2021
…instead of ARN (aws#12436)
