feat: add spot termination watcher (beta) #3789

Merged
merged 37 commits into from
Mar 20, 2024
37 commits
2826df7
feat: add lambda to watch spot termination warnings
npalm Mar 2, 2024
ffb5500
feat: add lambda to watch spot termination warnings
npalm Mar 2, 2024
2c11973
feat: add lambda to watch spot termination warnings
npalm Mar 3, 2024
871caf5
end end comment for docs
npalm Mar 3, 2024
8e217aa
docs: auto update terraform docs
github-actions[bot] Mar 3, 2024
8fea5b8
prettier
npalm Mar 3, 2024
0b71feb
cleanup
npalm Mar 3, 2024
2f7a442
prettier
npalm Mar 3, 2024
dd5e8aa
fix middy
npalm Mar 3, 2024
175f15a
cleanup
npalm Mar 4, 2024
90989af
docs: auto update terraform docs
github-actions[bot] Mar 4, 2024
a0654f8
adjust coverage gates
npalm Mar 4, 2024
ea0a9a6
add multi-runner
npalm Mar 4, 2024
b768c0a
docs: auto update terraform docs
github-actions[bot] Mar 4, 2024
a64da0c
spot termination watcher
npalm Mar 13, 2024
87a0346
docs: auto update terraform docs
github-actions[bot] Mar 13, 2024
0212885
format
npalm Mar 13, 2024
ecf7aa4
run update docs wwith app
npalm Mar 13, 2024
9fa2006
docs: auto update terraform docs
forest-pr[bot] Mar 13, 2024
a84d9b5
own review
npalm Mar 13, 2024
0aec5e7
add termination watchter outputs
npalm Mar 13, 2024
8968a90
docs: auto update terraform docs
forest-pr[bot] Mar 13, 2024
53b5b7c
fix lint
npalm Mar 13, 2024
2beaf18
update example
npalm Mar 13, 2024
a90d4c1
update example
npalm Mar 13, 2024
388dbe1
update example
npalm Mar 13, 2024
e3684a8
update example
npalm Mar 13, 2024
2ab2b42
update ci
npalm Mar 13, 2024
9869dcc
docs: auto update terraform docs
forest-pr[bot] Mar 13, 2024
2a1c36a
create docs entry
npalm Mar 13, 2024
708dfe3
update docs
npalm Mar 18, 2024
464d35d
docs: auto update terraform docs
forest-pr[bot] Mar 18, 2024
102761c
update docs
npalm Mar 18, 2024
9f46ffc
docs: auto update terraform docs
forest-pr[bot] Mar 18, 2024
f8a134b
update lock file
npalm Mar 19, 2024
0a0a0f9
update lock file for new module
npalm Mar 20, 2024
16a84ec
docs: auto update terraform docs
forest-pr[bot] Mar 20, 2024
25 changes: 23 additions & 2 deletions .github/workflows/terraform.yml
@@ -30,6 +30,7 @@ jobs:
touch lambdas/functions/control-plane/runners.zip
touch lambdas/functions/gh-agent-syncer/runner-binaries-syncer.zip
touch lambdas/functions/ami-housekeeper/ami-housekeeper.zip
touch lambdas/functions/termination-watcher/termination-watcher.zip
- name: terraform init
run: terraform init -get -backend=false -input=false
- if: contains(matrix.terraform, '1.5.')
@@ -69,7 +70,18 @@ jobs:
matrix:
terraform: [1.5.6, "latest"]
module:
["ami-housekeeper", "download-lambda", "multi-runner", "runner-binaries-syncer", "runners", "setup-iam-permissions", "ssm", "webhook"]
[
"ami-housekeeper",
"download-lambda",
"lambda",
"multi-runner",
"runner-binaries-syncer",
"runners",
"setup-iam-permissions",
"ssm",
"termination-watcher",
"webhook",
]
defaults:
run:
working-directory: modules/${{ matrix.module }}
@@ -118,7 +130,16 @@ jobs:
matrix:
terraform: [1.5.6, "latest"]
example:
["default", "ubuntu", "prebuilt", "arm64", "ephemeral", "windows", "multi-runner"]
[
"default",
"ubuntu",
"prebuilt",
"arm64",
"ephemeral",
"termination-watcher",
"windows",
"multi-runner",
]
defaults:
run:
working-directory: examples/${{ matrix.example }}
22 changes: 22 additions & 0 deletions .github/workflows/update-docs.yml
@@ -16,10 +16,32 @@ jobs:
name: Auto update terraform docs
runs-on: ubuntu-latest
steps:
- uses: philips-software/app-token-action@9f5d57062c9f2beaffafaa9a34f66f824ead63a9 # v2.0.0
id: app
with:
app_id: ${{ vars.FOREST_PR_BOT_APP_ID }}
app_base64_private_key: ${{ secrets.FOREST_PR_BOT_APP_KEY_BASE64 }}
auth_type: installation
org: philips-labs

- name: Checkout with GITHUB Action token
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633 # ratchet:actions/checkout@v4
with:
token: ${{ steps.app.outputs.token }}

# use an app to ensure CI is triggered
- name: Generate TF docs
if: github.repository_owner == 'philips-labs'
uses: terraform-docs/gh-actions@f6d59f89a280fa0a3febf55ef68f146784b20ba0 # ratchet:terraform-docs/gh-actions@v1.0.0
with:
find-dir: .
git-commit-message: "docs: auto update terraform docs"
git-push: ${{ github.ref != 'refs/heads/main' || github.repository_owner != 'philips-labs' }}
git-push-user-name: forest-pr|bot
git-push-user-email: "forest-pr[bot]@users.noreply.github.com"

- name: Generate TF docs (forks)
if: github.repository_owner != 'philips-labs'
uses: terraform-docs/gh-actions@f6d59f89a280fa0a3febf55ef68f146784b20ba0 # ratchet:terraform-docs/gh-actions@v1.0.0
with:
find-dir: .
4 changes: 4 additions & 0 deletions README.md
@@ -98,6 +98,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| Name | Source | Version |
|------|--------|---------|
| <a name="module_ami_housekeeper"></a> [ami\_housekeeper](#module\_ami\_housekeeper) | ./modules/ami-housekeeper | n/a |
| <a name="module_instance_termination_watcher"></a> [instance\_termination\_watcher](#module\_instance\_termination\_watcher) | ./modules/termination-watcher | n/a |
| <a name="module_runner_binaries"></a> [runner\_binaries](#module\_runner\_binaries) | ./modules/runner-binaries-syncer | n/a |
| <a name="module_runners"></a> [runners](#module\_runners) | ./modules/runners | n/a |
| <a name="module_ssm"></a> [ssm](#module\_ssm) | ./modules/ssm | n/a |
@@ -163,6 +164,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| <a name="input_instance_max_spot_price"></a> [instance\_max\_spot\_price](#input\_instance\_max\_spot\_price) | Max price for spot instances per hour. This variable will be passed to the create fleet as max spot price for the fleet. | `string` | `null` | no |
| <a name="input_instance_profile_path"></a> [instance\_profile\_path](#input\_instance\_profile\_path) | The path that will be added to the instance\_profile, if not set the environment name will be used. | `string` | `null` | no |
| <a name="input_instance_target_capacity_type"></a> [instance\_target\_capacity\_type](#input\_instance\_target\_capacity\_type) | Default lifecycle used for runner instances, can be either `spot` or `on-demand`. | `string` | `"spot"` | no |
| <a name="input_instance_termination_watcher"></a> [instance\_termination\_watcher](#input\_instance\_termination\_watcher) | Configuration for the instance termination watcher. This feature is in beta; changes will not trigger a major release while it remains in beta.<br><br>`enable`: Enable or disable the spot termination watcher.<br>`enable_metric`: Enable or disable the metrics for the spot termination watcher.<br>`memory_size`: Memory size limit in MB of the lambda.<br>`s3_key`: S3 key for syncer lambda function. Required if using S3 bucket to specify lambdas.<br>`s3_object_version`: S3 object version for syncer lambda function. Useful if S3 versioning is enabled on source bucket.<br>`timeout`: Timeout of the lambda in seconds.<br>`zip`: File location of the lambda zip file. | <pre>object({<br> enable = optional(bool, false)<br> enable_metric = optional(object({<br> spot_warning = optional(bool, false)<br> }))<br> memory_size = optional(number, null)<br> s3_key = optional(string, null)<br> s3_object_version = optional(string, null)<br> timeout = optional(number, null)<br> zip = optional(string, null)<br> })</pre> | `{}` | no |
| <a name="input_instance_types"></a> [instance\_types](#input\_instance\_types) | List of instance types for the action runner. Defaults are based on runner\_os (al2023 for linux and Windows Server Core for win). | `list(string)` | <pre>[<br> "m5.large",<br> "c5.large"<br>]</pre> | no |
| <a name="input_job_queue_retention_in_seconds"></a> [job\_queue\_retention\_in\_seconds](#input\_job\_queue\_retention\_in\_seconds) | The number of seconds the job is held in the queue before it is purged. | `number` | `86400` | no |
| <a name="input_key_name"></a> [key\_name](#input\_key\_name) | Key pair name | `string` | `null` | no |
@@ -177,6 +179,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| <a name="input_log_level"></a> [log\_level](#input\_log\_level) | Logging level for lambda logging. Valid values are 'silly', 'trace', 'debug', 'info', 'warn', 'error', 'fatal'. | `string` | `"info"` | no |
| <a name="input_logging_kms_key_id"></a> [logging\_kms\_key\_id](#input\_logging\_kms\_key\_id) | Specifies the kms key id to encrypt the logs with. | `string` | `null` | no |
| <a name="input_logging_retention_in_days"></a> [logging\_retention\_in\_days](#input\_logging\_retention\_in\_days) | Specifies the number of days you want to retain log events for the lambda log group. Possible values are: 0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, and 3653. | `number` | `180` | no |
| <a name="input_metrics_namespace"></a> [metrics\_namespace](#input\_metrics\_namespace) | The namespace for the metrics created by the module. Metrics will only be created if explicitly enabled. | `string` | `"GitHub Runners"` | no |
| <a name="input_minimum_running_time_in_minutes"></a> [minimum\_running\_time\_in\_minutes](#input\_minimum\_running\_time\_in\_minutes) | The time an ec2 action runner should be running at minimum before terminated, if not busy. | `number` | `null` | no |
| <a name="input_pool_config"></a> [pool\_config](#input\_pool\_config) | The configuration for updating the pool. The `pool_size` to adjust to by the events triggered by the `schedule_expression`. For example you can configure a cron expression for weekdays to adjust the pool to 10 and another expression for the weekend to adjust the pool to 1. | <pre>list(object({<br> schedule_expression = string<br> size = number<br> }))</pre> | `[]` | no |
| <a name="input_pool_lambda_memory_size"></a> [pool\_lambda\_memory\_size](#input\_pool\_lambda\_memory\_size) | Memory size limit for scale-up lambda. | `number` | `512` | no |
@@ -248,6 +251,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| Name | Description |
|------|-------------|
| <a name="output_binaries_syncer"></a> [binaries\_syncer](#output\_binaries\_syncer) | n/a |
| <a name="output_instance_termination_watcher"></a> [instance\_termination\_watcher](#output\_instance\_termination\_watcher) | n/a |
| <a name="output_queues"></a> [queues](#output\_queues) | SQS queues. |
| <a name="output_runners"></a> [runners](#output\_runners) | n/a |
| <a name="output_ssm_parameters"></a> [ssm\_parameters](#output\_ssm\_parameters) | n/a |
37 changes: 37 additions & 0 deletions docs/configuration.md
@@ -175,6 +175,11 @@ This tracing config generates timelines for following events:

This feature has been disabled by default.

### Multiple runner modules in your AWS account

The watcher acts on all spot termination notifications and logs the ones relevant to the runner module. Therefore, we suggest deploying the watcher only once. You can either enable it in one of your deployments or deploy it as a stand-alone module.
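
As a minimal sketch of the first option, assume two deployments of the root module in the same AWS account, of which only one enables the watcher. The module names `runners_team_a` and `runners_team_b` and the `source` path are hypothetical, and all other required inputs are omitted for brevity.

```hcl
# Hypothetical deployment A: the only one that enables the watcher.
module "runners_team_a" {
  source = "../../" # assumption: path to the root module

  # ... other required inputs omitted ...

  instance_termination_watcher = {
    enable = true
  }
}

# Hypothetical deployment B: leaves the watcher at its default (disabled).
module "runners_team_b" {
  source = "../../" # assumption: path to the root module

  # ... other required inputs omitted ...
}
```

Which instances end up in the watcher's logs depends on the tag filter in use, so treat this layout as a starting point rather than a guarantee of account-wide coverage.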


## Debugging

In case the setup does not work as intended, trace the events through this sequence:
@@ -187,6 +192,38 @@ In case the setup does not work as intended, trace the events through this seque

## Experimental features

### Termination watcher

This feature is in an early stage and is therefore disabled by default.

The termination watcher currently watches for spot termination notifications. When the module is deployed as part of one of the main modules (root or multi-runner), it by default only takes events into account for instances tagged with `ghr:environment`. The module can also be deployed stand-alone; in that case the tag filter needs to be tuned.

- Logs: The module logs all termination notifications. For each warning it looks up the instance details and logs the environment, the instance type, and the time the instance has been running, as well as some other details.
- Metrics: Metrics are disabled by default to avoid costs. Once enabled, a metric is created for each warning with at least dimensions for the environment and instance type. The metric namespace can be configured via the variables. The metric name used is `SpotInterruptionWarning`.
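
The settings described above map onto the `instance_termination_watcher` and `metrics_namespace` inputs of the root module. A minimal sketch, assuming a module block named `runners` with a hypothetical `source` path and the other required inputs omitted:

```hcl
module "runners" {
  source = "../../" # assumption: path to the root module

  # ... other required inputs omitted ...

  # Namespace used for the metrics; "GitHub Runners" is the documented default.
  metrics_namespace = "GitHub Runners"

  instance_termination_watcher = {
    enable = true

    # Metrics are off by default to avoid costs; opt in per metric.
    enable_metric = {
      spot_warning = true
    }
  }
}
```

With `spot_warning` left at its default of `false`, the watcher should still log each warning, but no `SpotInterruptionWarning` metric is created.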

#### Log example

Below is an example of the log messages created.

```
{
"level": "INFO",
"message": "Received spot notification warning:",
"environment": "default",
"instanceId": "i-0039b8826b3dcea55",
"instanceType": "c5.large",
"instanceLaunchTime": "2024-03-15T08:10:34.000Z",
"instanceRunningTimeInSeconds": 68,
"tags": [
{
"Key": "ghr:environment",
"Value": "default"
}
... all tags ...
]
}
```

### Queue to publish workflow job events

This queue is an experimental feature to allow you to receive a copy of the workflow_job events sent by the GitHub App. This can be used to calculate a matrix or monitor the system.
1 change: 1 addition & 0 deletions docs/examples/termination-watcher.md
@@ -0,0 +1 @@
--8<-- "examples/termination-watcher/README.md"
5 changes: 5 additions & 0 deletions docs/index.md
@@ -64,6 +64,11 @@ The control plane (scale up lambda) will store the runner registration configura

The AMI cleaner is a lambda that cleans up AMIs that are older than a configurable number of days. This is useful when using the AMI builder to create AMIs. The cleaner also checks which AMIs are used in the latest version of the launch template, and you can provide SSM config paths pointing to AMI IDs. The cleaner will not delete these AMIs. The AMI cleaner is opt-in; it is not created by default.

### Instance Termination Watcher

> This feature is in beta; changes will not trigger a major release while it remains in beta.

The Instance Termination Watcher creates logs and optional metrics for the termination of instances. Currently only spot termination warnings are watched. See [configuration](configuration/) for more details.

### Security

1 change: 1 addition & 0 deletions examples/default/README.md
@@ -62,6 +62,7 @@ terraform output -raw webhook_secret

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_aws_region"></a> [aws\_region](#input\_aws\_region) | AWS region. | `string` | `"eu-west-1"` | no |
| <a name="input_environment"></a> [environment](#input\_environment) | Environment name, used as prefix. | `string` | `null` | no |
| <a name="input_github_app"></a> [github\_app](#input\_github\_app) | GitHub for API usages. | <pre>object({<br> id = string<br> key_base64 = string<br> })</pre> | n/a | yes |

11 changes: 9 additions & 2 deletions examples/default/main.tf
@@ -1,6 +1,6 @@
locals {
environment = var.environment != null ? var.environment : "default"
aws_region = "eu-west-1"
aws_region = var.aws_region
}

resource "random_id" "random" {
@@ -79,7 +79,7 @@ module "runners" {

# override delay of events in seconds
delay_webhook_event = 5
runners_maximum_count = 1
runners_maximum_count = 2

# set up a fifo queue to preserve order
enable_fifo_build_queue = true
@@ -109,6 +109,13 @@ module "runners" {
]
}

instance_termination_watcher = {
enable = true
enable_metric = {
spot_warning = true
}
}

}

module "webhook_github_app" {
7 changes: 7 additions & 0 deletions examples/default/variables.tf
@@ -13,3 +13,10 @@ variable "environment" {
type = string
default = null
}

variable "aws_region" {
description = "AWS region."

type = string
default = "eu-west-1"
}
8 changes: 8 additions & 0 deletions examples/lambdas-download/main.tf
@@ -12,6 +12,14 @@ module "lambdas" {
{
name = "runner-binaries-syncer"
tag = var.module_version
},
{
name = "ami-housekeeper"
tag = var.module_version
},
{
name = "termination-watcher"
tag = var.module_version
}
]
}
32 changes: 16 additions & 16 deletions examples/multi-runner/.terraform.lock.hcl

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions examples/multi-runner/README.md
@@ -80,6 +80,7 @@ terraform output -raw webhook_secret

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_aws_region"></a> [aws\_region](#input\_aws\_region) | AWS region to deploy to | `string` | `"eu-west-1"` | no |
| <a name="input_environment"></a> [environment](#input\_environment) | Environment name, used as prefix | `string` | `null` | no |
| <a name="input_github_app"></a> [github\_app](#input\_github\_app) | GitHub for API usages. | <pre>object({<br> id = string<br> key_base64 = string<br> })</pre> | n/a | yes |

15 changes: 14 additions & 1 deletion examples/multi-runner/main.tf
@@ -1,6 +1,6 @@
locals {
environment = var.environment != null ? var.environment : "multi-runner"
aws_region = "eu-west-1"
aws_region = var.aws_region

# Load runner configurations from Yaml files
multi_runner_config_files = {
@@ -94,6 +94,19 @@ module "runners" {

# Enable debug logging for the lambda functions
# log_level = "debug"

# Enable spot termination watcher
# spot_instance_termination_watcher = {
# enable = true
# }

# Enable to track the spot instance termination warning
# instance_termination_watcher = {
# enable = true
# enable_metric = {
# spot_warning = true
# }
# }
}

module "webhook_github_app" {
7 changes: 7 additions & 0 deletions examples/multi-runner/variables.tf
@@ -13,3 +13,10 @@ variable "environment" {
type = string
default = null
}

variable "aws_region" {
description = "AWS region to deploy to"

type = string
default = "eu-west-1"
}
25 changes: 25 additions & 0 deletions examples/termination-watcher/.terraform.lock.hcl

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
