[ENH] - Set minimum nodes to 0 for AWS deployment #2154

Closed
aktech opened this issue Dec 18, 2023 · 10 comments · Fixed by #2168
Labels
block-release ⛔️ Must be completed for release
impact: high 🟥 This issue affects most of the nebari users or is a critical issue
needs: review 👀 This PR is complete and ready for reviewing
project: JATIC Work item needed for the JATIC project

Comments
@aktech
Member

aktech commented Dec 18, 2023

Feature description

We need to set min_nodes to 0 for the AWS user and worker node groups. We already default to 0 on GCP.

Otherwise Nebari is quite expensive (~$625/month) for someone trying it out with the default configuration:

amazon_web_services:
  kubernetes_version: '1.26'
  region: us-east-1

Cost:

  • general node: 1 x m5.2xlarge at $0.384/hr: 0.384 * 720 = $276.48
  • user and worker nodes: 2 x m5.xlarge at $0.192/hr: 2 * 0.192 * 720 = $276.48
  • Node cost: $552.96
  • K8s base cost: ~$72

Nebari base cost on AWS: ~$625 per month with the default config

Scaling node groups to zero was not supported by AWS at one point, but it has been supported for a while now.
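
For context, this is roughly what the relevant nebari-config.yaml section looks like with the user and worker pools allowed to scale to zero (a sketch only: node group names and instance types follow the defaults, and the max_nodes values are illustrative):

amazon_web_services:
  kubernetes_version: '1.26'
  region: us-east-1
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 5
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 5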

Value and/or benefit

Reduction in base cost for Nebari on AWS with the default configuration.

Anything else?

No response

@pt247
Contributor

pt247 commented Dec 22, 2023

To replicate this, I created a Nebari cluster. The following nodes were created:

┌───────────────────────────────────────────────────────── Nodes(all)[1] ──────────────────────────────────────────────────────────┐
│ NAME↑                                           STATUS     ROLE        TAINTS     VERSION                       PODS AGE         │
│ ip-10-10-41-202.eu-west-1.compute.internal      Ready      <none>      0          v1.26.10-eks-e71965b             0 132m        │

└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Default settings create a single node. All the pods are deployed on the same node.
Do we still need this change? Am I looking in the right place?

@pt247
Contributor

pt247 commented Dec 22, 2023

Thanks @aktech for pointing out that k9s is not actually showing all the hosts. On the AWS EC2 console we see 3 instances.

[Screenshot 2023-12-22 at 13:31: AWS EC2 console showing three instances]

After setting the worker and user nodes to 0 and destroying and recreating, I can verify that there is only one general instance.
[Screenshot 2023-12-22 at 14:39: only the general instance remains]

@dharhas dharhas added this to the 2024.1.1 milestone Jan 4, 2024
@kcpevey kcpevey added impact: high 🟥 This issue affects most of the nebari users or is a critical issue needs: review 👀 This PR is complete and ready for reviewing labels Jan 4, 2024
@pavithraes pavithraes moved this from New 🚦 to TODO 📬 in 🪴 Nebari Project Management Jan 4, 2024
@kcpevey kcpevey moved this from TODO 📬 to In review/QA 👀 in 🪴 Nebari Project Management Jan 4, 2024
@kcpevey kcpevey linked a pull request Jan 4, 2024 that will close this issue
@pt247
Contributor

pt247 commented Jan 7, 2024

Status

Current Progress

To get nodes to scale from 0 in AWS EKS, we need to do the following:

  1. Change the hardcoded minimum number of nodes for the user and worker node groups to 0.
  2. Attach a label to each node group that it assigns to the nodes it creates, for example "dedicated" = "worker".
  3. Attach a tag to the ASG created for each node group, like: "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "worker"

The PR successfully does the first two steps (a rough sketch of what they amount to follows below).
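
To illustrate, here is roughly what steps 1 and 2 amount to on the EKS side in Terraform (not the PR's actual code; the cluster, IAM role, and subnet references are placeholders):

# Hypothetical sketch: an EKS managed node group that can scale down to zero
# and labels its nodes with "dedicated" = "worker".
resource "aws_eks_node_group" "worker" {
  cluster_name    = aws_eks_cluster.this.name # assumed cluster resource
  node_group_name = "worker"
  node_role_arn   = aws_iam_role.node.arn     # assumed node IAM role
  subnet_ids      = var.subnet_ids            # assumed subnet list
  instance_types  = ["m5.xlarge"]

  # Step 2: label that the node group assigns to the nodes it creates.
  labels = {
    dedicated = "worker"
  }

  # Step 1: allow the group to scale all the way down to zero.
  scaling_config {
    desired_size = 0
    min_size     = 0
    max_size     = 50
  }
}

The third step, the ASG tag, is handled separately, which is what the rest of this comment is about.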

Issue

We have the following stages that deal with AWS:

  1. 01-terraform-state: This initializes the Terraform state. It's not suited for creating ASG tags because the ASGs do not yet exist; even the cluster is still being created at this point.
  2. 02-infrastructure: This is where we create most of the platform-specific resources.
    In the case of AWS, we create the cluster and node groups, among other things. ASGs are created as part of creating the node groups.
  3. 03-kubernetes-initialize accomplishes a few things specific to the platform. For example, the nvidia-installer has different implementations for AWS and GCP.
  4. All the later stages do nothing platform-specific.

Now, Terraform refuses to create tags for the autoscaling groups at this stage, because the ASGs only become available once the node groups are formed. This means we need to do the tagging in the next stage, which is 03-kubernetes-initialize.

Possible solutions

  1. Add a module for ASG tagging at 03-kubernetes-initialize/aws-asg-tagging (a rough sketch follows below).
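
As a rough illustration of option 1 (hypothetical, not the PR's actual code; the asg_node_group_map variable name is borrowed from the module call shown later in this thread), the module could look something like this:

# Hypothetical sketch of a 03-kubernetes-initialize ASG-tagging module: tag each
# ASG so the cluster-autoscaler knows which labels scaled-from-zero nodes will carry.
variable "asg_node_group_map" {
  description = "Map of autoscaling group name to the node group name it backs"
  type        = map(string)
}

resource "aws_autoscaling_group_tag" "dedicated" {
  for_each = var.asg_node_group_map

  autoscaling_group_name = each.key

  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/label/dedicated"
    value               = each.value
    propagate_at_launch = true
  }
}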

@pt247
Contributor

pt247 commented Jan 7, 2024

Please note: I had to move the user scheduler to the general node, as it kept the user node alive even if there was no user activity.

If needed, I can raise a separate mini-PR to do just this.

@pt247
Contributor

pt247 commented Jan 9, 2024

ASG tagging has been moved to 03-kubernetes-initialize. I can see that the AWS deployment is working fine, but the integration test for local deployment is failing:
Local Integration Tests / test-local-integration

Error:

[terraform]: ╷
[terraform]: │ Error: Invalid provider configuration
[terraform]: │ 
[terraform]: │ Provider "registry.terraform.io/hashicorp/aws" requires explicit
[terraform]: │ configuration. Add a provider block to the root module and configure the
[terraform]: │ provider's required arguments as described in the provider documentation.
[terraform]: │ 
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: No valid credential sources found
[terraform]: │ 
[terraform]: │   with provider["registry.terraform.io/hashicorp/aws"],
[terraform]: │   on <empty> line 0:
[terraform]: │   (source code not available)
[terraform]: │ 
[terraform]: │ Please see https://registry.terraform.io/providers/hashicorp/aws
[terraform]: │ for more information about providing credentials.
[terraform]: │ 
[terraform]: │ Error: failed to refresh cached credentials, no EC2 IMDS role found,
[terraform]: │ operation error ec2imds: GetMetadata, http response error StatusCode: 404,
[terraform]: │ request to EC2 IMDS failed

At this point, I could use some pointers on how to resolve it. For some reason, it expects AWS credentials to be set even though this is a local deployment.

Link to logs with error: https://github.com/nebari-dev/nebari/actions/runs/7463265924/job/20307526950?pr=2168

After removing this block:

module "tagging" {
  count              = var.cloud_provider == "aws" ? 1 : 0
  source             = "./modules/tagging"
  asg_node_group_map = var.asg_node_group_map
}

the run passes the deploy Nebari phase. Logs: https://github.com/nebari-dev/nebari/actions/runs/7463764148/job/20309179650?pr=2168

@dcmcand dcmcand assigned costrouc and unassigned dcmcand Jan 9, 2024
@dcmcand dcmcand modified the milestones: 2024.1.1, Next Release Jan 11, 2024
@pt247
Contributor

pt247 commented Jan 13, 2024

To replicate the local deployment issue, I am trying to run the same config on my laptop, which is an Intel-based Mac.

Config

Nebari config

I got the following config from the CI logs:

$ cat nebari-config.yaml 
provider: local
namespace: dev
nebari_version: 2024.1.1rc2.dev82+g99d4445c
project_name: thisisatest
domain: github-actions.nebari.dev
ci_cd:
  type: none
terraform_state:
  type: remote
security:
  keycloak:
    initial_root_password: foad9omyohtfc7hfanwbem8zhahaup3s
  authentication:
    type: password
theme:
  jupyterhub:
    hub_title: Nebari - thisisatest
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    hub_subtitle: Your open source data science platform, hosted

Hosts file

$ cat /etc/hosts | grep 172.18.1.100
172.18.1.100 github-actions.nebari.dev

Error

[terraform]: Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
[terraform]: 
[terraform]: Outputs:
[terraform]: 
[terraform]: load_balancer_address = {
[terraform]:   "hostname" = ""
[terraform]:   "ip" = "172.20.1.100"
[terraform]: }
Attempt 1 failed to connect to tcp tcp://172.20.1.100:80
Attempt 2 failed to connect to tcp tcp://172.20.1.100:80
Attempt 3 failed to connect to tcp tcp://172.20.1.100:80
Attempt 4 failed to connect to tcp tcp://172.20.1.100:80
Attempt 5 failed to connect to tcp tcp://172.20.1.100:80
Attempt 6 failed to connect to tcp tcp://172.20.1.100:80
Attempt 7 failed to connect to tcp tcp://172.20.1.100:80
Attempt 8 failed to connect to tcp tcp://172.20.1.100:80
Attempt 9 failed to connect to tcp tcp://172.20.1.100:80
Attempt 10 failed to connect to tcp tcp://172.20.1.100:80
ERROR: After stage=04-kubernetes-ingress unable to connect to ingress host=172.20.1.100 port=80

Issue

Nothing is running on port 80 on my laptop.

$ sudo lsof -i -P | grep LISTEN | grep :80
Password:
$

Next step

The documentation clearly says local deployment doesn't work on a Mac, so I will try this on an EC2 machine instead.

@kcpevey kcpevey added the project: JATIC Work item needed for the JATIC project label Jan 30, 2024
@github-project-automation github-project-automation bot moved this from In review/QA 👀 to Done 💪🏾 in 🪴 Nebari Project Management Feb 10, 2024
@kcpevey
Contributor

kcpevey commented Feb 22, 2024

Reopening this as a release blocker as we've discovered an issue.

From @kenafoster:

I confirmed it works with the out-of-the-box config so it's the way we are using profile->node selectors that doesn't work with default 0.

He is currently on PTO until next week so we'll wait until then to discuss.

@kcpevey kcpevey reopened this Feb 22, 2024
@github-project-automation github-project-automation bot moved this from Done 💪🏾 to In progress 🏗 in 🪴 Nebari Project Management Feb 22, 2024
@kcpevey kcpevey added the block-release ⛔️ Must be completed for release label Feb 22, 2024
@rsignell

@kcpevey, glad you are on the case! Really hoping to see this land, as it would make Nebari a lot easier to sell!

@kenafoster
Contributor

@pt247 shared this with me. I tested a similar configuration in AWS and it works - you can scale from 0->1 using the "dedicated" selector to target a node pool by name.

So, the way to target a JupyterLab profile at a specific node group is by using the following key:

        node_selector:
          "dedicated": "fly-weight"

"dedicated" is a key used in lables that Nebari uses for hinting which links ASG to NodeGroups.
And "fly-weight" is the name of the node group.

The full config looks like this:
amazon_web_services:
  region: eu-west-1
  kubernetes_version: '1.26'
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 2
      max_nodes: 5
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    fly-weight:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    middle-weight:
      instance: m5.2xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    heavy-weight:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
profiles:
  jupyterlab:
    - display_name: Small Instance
      description: Stable environment with 1.5-2 cpu / 6-8 GB ram
      default: true
      kubespawner_override:
        cpu_limit: 2
        cpu_guarantee: 1.5
        mem_limit: 8G
        mem_guarantee: 6G
        node_selector:
          "dedicated": "fly-weight"
    - display_name: Medium Instance
      description: Stable environment with 1.5-2 cpu / 6-8 GB ram
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 2
        mem_limit: 12G
        mem_guarantee: 8G
        node_selector:
          "dedicated": "middle-weight"
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.1.1
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          "dedicated": "heavy-weight"

@pt247
Contributor

pt247 commented Mar 1, 2024

@kcpevey

I have created a ticket in nebari-docs to document this:
nebari-dev/nebari-docs#415

Since no changes are needed in the PR to support this, is it okay to:

  1. Remove the block-release ⛔️ label.
  2. Close this issue.

@pt247 pt247 closed this as completed Mar 1, 2024
@github-project-automation github-project-automation bot moved this from In progress 🏗 to Done 💪🏾 in 🪴 Nebari Project Management Mar 1, 2024