Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/ec2 fleet integration #471

Merged
merged 26 commits into from
Oct 14, 2021
Merged

Feat/ec2 fleet integration #471

merged 26 commits into from
Oct 14, 2021

Conversation

cristim
Copy link
Member

@cristim cristim commented Sep 16, 2021

Issue Type

  • Feature Pull Request

Summary

This change switches to using the EC2 Fleet API in instant mode for running Spot instances.

This allows us to support choosing from multiple available allocation strategies instead of the lowest-cost strategy previously hardcoded by sorting instance types by price and attempting to run them one after another.

By default we now use the capacity-optimized-prioritized allocation strategy. The priority is still based on the lowest cost, we still pass a list ordered by price but with this strategy the ordering is not enforced strongly in case the lowest cost instance types are running low on capacity and have increased risk of interruption, therefore provisioning Spot instances that have lower risk of being interrupted. The cost may not be the very lowest but the interruption rates can sometimes be up to 10x lower than before.

The previous behavior is still available when using thelowest-cost allocation strategy.

Another benefit is the use of fewer API calls for launching instances, which will help on large accounts where the previous repeated RunInstances API calls may have contributed to API throttling.

Code contribution checklist

  1. The contribution fixes a single existing github issue, and it is linked
    to it.
  2. The code is as simple as possible, readable and follows the idiomatic Go
    guidelines.
  3. All new functionality is covered by automated test cases so the overall
    test coverage doesn't decrease.
  4. No issues are reported when running make full-test.
  5. Functionality not applicable to all users should be configurable.
  6. Configurations should be exposed through Lambda function environment
    variables which are also passed as parameters to the
    CloudFormation
    and
    Terraform
    stacks defined as infrastructure code.
  7. Global configurations set from the infrastructure stack level should also
    support per-group overrides using tags.
  8. Tags names and expected values should be similar to the other existing
    configurations.
  9. Both global and tag-based configuration mechanisms should be tested and
    proven to work using log output from various test runs.
  10. The logs should be kept as clean as possible (use log levels as
    appropriate) and formatted consistently to the existing log output.
  11. The documentation is updated to cover the new behavior, as well as the
    new configuration options for both stack parameters and tag overrides.
  12. A code reviewer reproduced the problem and can confirm the code
    contribution actually resolves it.

Copy link

@codeclimate codeclimate bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR diff size of 7963 lines exceeds the maximum allowed for the inline comments feature.

@cristim cristim force-pushed the feat/ec2-fleet-integration branch from 2b11239 to b576743 Compare September 16, 2021 18:07
Copy link

@codeclimate codeclimate bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR diff size of 6606 lines exceeds the maximum allowed for the inline comments feature.

Copy link

@codeclimate codeclimate bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR diff size of 6786 lines exceeds the maximum allowed for the inline comments feature.

Copy link

@codeclimate codeclimate bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR diff size of 6798 lines exceeds the maximum allowed for the inline comments feature.

Copy link

@codeclimate codeclimate bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR diff size of 6825 lines exceeds the maximum allowed for the inline comments feature.

Copy link

@codeclimate codeclimate bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR diff size of 6830 lines exceeds the maximum allowed for the inline comments feature.

Copy link

@codeclimate codeclimate bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR diff size of 6883 lines exceeds the maximum allowed for the inline comments feature.

@codeclimate
Copy link

codeclimate bot commented Sep 16, 2021

Code Climate has analyzed commit 6a0d2e5 and detected 0 issues on this pull request.

View more on Code Climate.

@cristim
Copy link
Member Author

cristim commented Sep 16, 2021

@mello7tre please try this out when you have some time, this will make Spot instances provisioned by AutoSpotting less prone to be interrupted by tapping into the largest capacity pools.

I did a basic test and it seems to work but this is a massive change so there are probably bugs lurking around.

@mello7tre
Copy link
Contributor

Hi @cristim, it's really a massive change!
At the moment i am quite busy, so i cannot tell you when, but i will try to make some tests.

But first i need to have the time to look at the code changes, to understand better the implementation.
Will update you, when i will begin looking at it.

@cristim
Copy link
Member Author

cristim commented Sep 20, 2021

Sure, thanks, I really appreciate it! There's no hurry, this isn't final yet.

I tested it a bit and seems to work in my initial tests but I need to test it further and add a few more unit tests before I can confidently merge it.

The change is smaller than it looks, as it includes a split of instance.go to make it more manageable, so I moved a lot of code around and the tests accompanied to it, which is even bigger.

@mello7tre
Copy link
Contributor

Hi @cristim, finally i had some time to look at the code and to study in detail how EC2 Fleet works (never used it).
I still not have tested, on the field, the fleet version but i will do it in the next days.

By the way you, have done a great work splitting instance code, i like it a lot, it's very logic; just looking at the names is easy to found the method one is looking for.

@cristim
Copy link
Member Author

cristim commented Sep 23, 2021

Thanks, glad you like it. To be honest I was reluctant to split it but when I started it was really too much for my brain so decided to break it down.

I still have to add a bunch more unit tests and test it further but over the last week I didn't have time and motivation to work on this.

- This allows us to implement support for allocation strategies, currently hardcoded
  to capacity-optimized-prioritized, but later it could be made configurable.
- It requires the use of a temporary LaunchTemplate, which is created based
  on the data previously passed to the RunInstances API call and deleted immediately
  after the EC2 fleet API call.
- Added global config with per-ASG tag overrides
- Extended unit test coverage for the new logic
- Improved tests for reading other configurations from tags
- Added unit tests for EBS block device conversion logic
- Converted PatchBeanstalkUserdata config flag to bool value
@cristim cristim force-pushed the feat/ec2-fleet-integration branch from 6a0d2e5 to 13fcb35 Compare September 29, 2021 14:01
@cristim cristim marked this pull request as ready for review October 3, 2021 22:41
@cristim cristim merged commit ff96e8f into master Oct 14, 2021
@cristim cristim deleted the feat/ec2-fleet-integration branch October 21, 2021 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants