[ENH] - Set minimum nodes to 0 for AWS deployment #2154

Closed
aktech opened this issue Dec 18, 2023 · 10 comments · Fixed by #2168
Labels
block-release ⛔️ Must be completed for release
impact: high 🟥 This issue affects most of the nebari users or is a critical issue
needs: review 👀 This PR is complete and ready for reviewing
project: JATIC Work item needed for the JATIC project

Comments
@aktech
Member

aktech commented Dec 18, 2023

Feature description

We need to set min_nodes to 0 for the AWS user and worker node groups. We already default to 0 on GCP.

Otherwise Nebari is quite expensive (~$625/month) for someone trying it out with the default configuration:

amazon_web_services:
  kubernetes_version: '1.26'
  region: us-east-1

Cost:

  • general node: 1 x m5.2xlarge at $0.384/hr: 0.384 * 720 = $276.48
  • user and worker nodes: 2 x m5.xlarge at $0.192/hr: 2 * 0.192 * 720 = $276.48
  • Node cost: $552.96
  • K8s base cost: ~$72

Nebari base cost on AWS: ~$625 per month with the default config

Scaling node groups to zero was not supported by AWS at one point, but it has been supported for a while now.
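
For context, this is roughly what the relevant nebari-config.yaml section looks like with the user and worker pools allowed to scale to zero (a sketch only: node group names and instance types follow the defaults, and the max_nodes values are illustrative):

amazon_web_services:
  kubernetes_version: '1.26'
  region: us-east-1
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 5
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 5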

Value and/or benefit

Reduction in base cost for Nebari on AWS with the default configuration.

Anything else?

No response

@pt247
Contributor

pt247 commented Dec 22, 2023

To replicate this, I created a Nebari cluster. The following nodes were created:

┌───────────────────────────────────────────────────────── Nodes(all)[1] ──────────────────────────────────────────────────────────┐
│ NAME↑                                           STATUS     ROLE        TAINTS     VERSION                       PODS AGE         │
│ ip-10-10-41-202.eu-west-1.compute.internal      Ready      <none>      0          v1.26.10-eks-e71965b             0 132m        │

└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Default settings create a single node. All the pods are deployed on the same node.
Do we still need this change? Am I looking in the right place?

@pt247
Contributor

pt247 commented Dec 22, 2023

Thanks @aktech for pointing out that k9s is not actually showing all the hosts. On the AWS EC2 console we see 3 instances.

[Screenshot 2023-12-22 at 13:31: AWS EC2 console showing three instances]

After setting the worker and user nodes to 0 and destroying and recreating, I can verify that there is only one general instance.
[Screenshot 2023-12-22 at 14:39: only the general instance remains]

@dharhas dharhas added this to the 2024.1.1 milestone Jan 4, 2024
@kcpevey kcpevey added impact: high 🟥 This issue affects most of the nebari users or is a critical issue needs: review 👀 This PR is complete and ready for reviewing labels Jan 4, 2024
@pavithraes pavithraes moved this from New 🚦 to TODO 📬 in 🪴 Nebari Project Management Jan 4, 2024
@kcpevey kcpevey moved this from TODO 📬 to In review/QA 👀 in 🪴 Nebari Project Management Jan 4, 2024
@kcpevey kcpevey linked a pull request Jan 4, 2024 that will close this issue
@pt247
Contributor

pt247 commented Jan 7, 2024

Status

Current Progress

To get nodes to scale from 0 in AWS EKS, we need to do the following:

  1. Change the hardcoded minimum number of nodes for the user and worker node groups to 0.
  2. Attach a label to each node group that it assigns to the nodes it creates, for example "dedicated" = "worker".
  3. Attach a tag to the ASG created for each node group, like: "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "worker"

The PR successfully does the first two steps (a rough sketch of what they amount to follows below).
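
To illustrate, here is roughly what steps 1 and 2 amount to on the EKS side in Terraform (not the PR's actual code; the cluster, IAM role, and subnet references are placeholders):

# Hypothetical sketch: an EKS managed node group that can scale down to zero
# and labels its nodes with "dedicated" = "worker".
resource "aws_eks_node_group" "worker" {
  cluster_name    = aws_eks_cluster.this.name # assumed cluster resource
  node_group_name = "worker"
  node_role_arn   = aws_iam_role.node.arn     # assumed node IAM role
  subnet_ids      = var.subnet_ids            # assumed subnet list
  instance_types  = ["m5.xlarge"]

  # Step 2: label that the node group assigns to the nodes it creates.
  labels = {
    dedicated = "worker"
  }

  # Step 1: allow the group to scale all the way down to zero.
  scaling_config {
    desired_size = 0
    min_size     = 0
    max_size     = 50
  }
}

The third step, the ASG tag, is handled separately, which is what the rest of this comment is about.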

Issue

We have the following stages that deal with AWS:

  1. 01-terraform-state: This initializes the Terraform state. It's not suited for creating ASG tags because the ASGs do not yet exist; even the cluster is still being created at this point.
  2. 02-infrastructure: This is where we create most of the platform-specific resources.
    In the case of AWS, we create the cluster and node groups, among other things. ASGs are created as part of creating the node groups.
  3. 03-kubernetes-initialize accomplishes a few things specific to the platform. For example, the nvidia-installer has different implementations for AWS and GCP.
  4. All the later stages do nothing platform-specific.

Now, Terraform refuses to create tags for the autoscaling groups at this stage, because the ASGs only become available once the node groups are formed. This means we need to do the tagging in the next stage, which is 03-kubernetes-initialize.

Possible solutions

  1. Add a module for ASG tagging at 03-kubernetes-initialize/aws-asg-tagging (a rough sketch follows below).
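
As a rough illustration of option 1 (hypothetical, not the PR's actual code; the asg_node_group_map variable name is borrowed from the module call shown later in this thread), the module could look something like this:

# Hypothetical sketch of a 03-kubernetes-initialize ASG-tagging module: tag each
# ASG so the cluster-autoscaler knows which labels scaled-from-zero nodes will carry.
variable "asg_node_group_map" {
  description = "Map of autoscaling group name to the node group name it backs"
  type        = map(string)
}

resource "aws_autoscaling_group_tag" "dedicated" {
  for_each = var.asg_node_group_map

  autoscaling_group_name = each.key

  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/label/dedicated"
    value               = each.value
    propagate_at_launch = true
  }
}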

@pt247
Contributor

pt247 commented Jan 7, 2024

Please note: I had to move the user scheduler to the general node, as it kept the user node alive even if there was no user activity.

If needed, I can raise a separate mini-PR to do just this.

@pt247
Contributor

pt247 commented Jan 9, 2024

ASG tagging has been moved to 03-kubernetes-initialize. I can see that the AWS deployment is working fine, but the integration test for local deployment is failing:
Local Integration Tests / test-local-integration

Error:

[terraform]: ╷
[terraform]: │ Error: Invalid provider configuration
[terraform]: │ 
[terraform]: │ Provider "registry.terraform.io/hashicorp/aws" requires explicit
[terraform]: │ configuration. Add a provider block to the root module and configure the
[terraform]: │ provider's required arguments as described in the provider documentation.
[terraform]: │ 
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: No valid credential sources found
[terraform]: │ 
[terraform]: │   with provider["registry.terraform.io/hashicorp/aws"],
[terraform]: │   on <empty> line 0:
[terraform]: │   (source code not available)
[terraform]: │ 
[terraform]: │ Please see https://registry.terraform.io/providers/hashicorp/aws
[terraform]: │ for more information about providing credentials.
[terraform]: │ 
[terraform]: │ Error: failed to refresh cached credentials, no EC2 IMDS role found,
[terraform]: │ operation error ec2imds: GetMetadata, http response error StatusCode: 404,
[terraform]: │ request to EC2 IMDS failed

At this point, I could use some pointers on how to resolve it. For some reason, it expects AWS credentials to be set even though this is a local deployment.

Link to logs with error: https://github.com/nebari-dev/nebari/actions/runs/7463265924/job/20307526950?pr=2168

After removing this block:

module "tagging" {
  count              = var.cloud_provider == "aws" ? 1 : 0
  source             = "./modules/tagging"
  asg_node_group_map = var.asg_node_group_map
}

the run passes the deploy Nebari phase. Logs: https://github.com/nebari-dev/nebari/actions/runs/7463764148/job/20309179650?pr=2168

@dcmcand dcmcand assigned costrouc and unassigned dcmcand Jan 9, 2024
@dcmcand dcmcand modified the milestones: 2024.1.1, Next Release Jan 11, 2024
@pt247
Contributor

pt247 commented Jan 13, 2024

To replicate the local deployment issue, I am trying to run the same config on my laptop, which is an Intel-based Mac.

Config

Nebari config

I got the following config from the CI logs:

$ cat nebari-config.yaml 
provider: local
namespace: dev
nebari_version: 2024.1.1rc2.dev82+g99d4445c
project_name: thisisatest
domain: github-actions.nebari.dev
ci_cd:
  type: none
terraform_state:
  type: remote
security:
  keycloak:
    initial_root_password: foad9omyohtfc7hfanwbem8zhahaup3s
  authentication:
    type: password
theme:
  jupyterhub:
    hub_title: Nebari - thisisatest
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    hub_subtitle: Your open source data science platform, hosted

Hosts file

$ cat /etc/hosts | grep 172.18.1.100
172.18.1.100 github-actions.nebari.dev

Error

[terraform]: Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
[terraform]: 
[terraform]: Outputs:
[terraform]: 
[terraform]: load_balancer_address = {
[terraform]:   "hostname" = ""
[terraform]:   "ip" = "172.20.1.100"
[terraform]: }
Attempt 1 failed to connect to tcp tcp://172.20.1.100:80
Attempt 2 failed to connect to tcp tcp://172.20.1.100:80
Attempt 3 failed to connect to tcp tcp://172.20.1.100:80
Attempt 4 failed to connect to tcp tcp://172.20.1.100:80
Attempt 5 failed to connect to tcp tcp://172.20.1.100:80
Attempt 6 failed to connect to tcp tcp://172.20.1.100:80
Attempt 7 failed to connect to tcp tcp://172.20.1.100:80
Attempt 8 failed to connect to tcp tcp://172.20.1.100:80
Attempt 9 failed to connect to tcp tcp://172.20.1.100:80
Attempt 10 failed to connect to tcp tcp://172.20.1.100:80
ERROR: After stage=04-kubernetes-ingress unable to connect to ingress host=172.20.1.100 port=80

Issue

Nothing is running on port 80 on my laptop.

$ sudo lsof -i -P | grep LISTEN | grep :80
Password:
$

Next step

The documentation clearly says local deployment doesn't work on a Mac, so I will try this on an EC2 machine instead.

@kcpevey kcpevey added the project: JATIC Work item needed for the JATIC project label Jan 30, 2024
@github-project-automation github-project-automation bot moved this from In review/QA 👀 to Done 💪🏾 in 🪴 Nebari Project Management Feb 10, 2024
@kcpevey
Contributor

kcpevey commented Feb 22, 2024

Reopening this as a release blocker as we've discovered an issue.

From @kenafoster:

I confirmed it works with the out-of-the-box config so it's the way we are using profile->node selectors that doesn't work with default 0.

He is currently on PTO until next week so we'll wait until then to discuss.

@kcpevey kcpevey reopened this Feb 22, 2024
@github-project-automation github-project-automation bot moved this from Done 💪🏾 to In progress 🏗 in 🪴 Nebari Project Management Feb 22, 2024
@kcpevey kcpevey added the block-release ⛔️ Must be completed for release label Feb 22, 2024
@rsignell

@kcpevey, glad you are on the case! Really hoping to see this land, as it would make Nebari a lot easier to sell!

@kenafoster
Contributor

@pt247 shared this with me. I tested a similar configuration in AWS and it works - you can scale from 0->1 using the "dedicated" selector to target a node pool by name.

So, the way to target a JupyterLab profile at a specific node group is by using the following key:

        node_selector:
          "dedicated": "fly-weight"

"dedicated" is a key used in lables that Nebari uses for hinting which links ASG to NodeGroups.
And "fly-weight" is the name of the node group.

The full config looks like this:
amazon_web_services:
  region: eu-west-1
  kubernetes_version: '1.26'
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 2
      max_nodes: 5
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    fly-weight:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    middle-weight:
      instance: m5.2xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    heavy-weight:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
profiles:
  jupyterlab:
    - display_name: Small Instance
      description: Stable environment with 1.5-2 cpu / 6-8 GB ram
      default: true
      kubespawner_override:
        cpu_limit: 2
        cpu_guarantee: 1.5
        mem_limit: 8G
        mem_guarantee: 6G
        node_selector:
          "dedicated": "fly-weight"
    - display_name: Medium Instance
      description: Stable environment with 1.5-2 cpu / 6-8 GB ram
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 2
        mem_limit: 12G
        mem_guarantee: 8G
        node_selector:
          "dedicated": "middle-weight"
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.1.1
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          "dedicated": "heavy-weight"

@pt247
Contributor

pt247 commented Mar 1, 2024

@kcpevey

I have created a ticket in nebari-docs to document this:
nebari-dev/nebari-docs#415

Since no changes are needed in the PR to support this, is it okay to:

  1. Remove the block-release ⛔️ label.
  2. Close this issue.

@pt247 pt247 closed this as completed Mar 1, 2024
@github-project-automation github-project-automation bot moved this from In progress 🏗 to Done 💪🏾 in 🪴 Nebari Project Management Mar 1, 2024