
[documentation] Document deployment on existing AWS EKS cluster #942

Closed
iameskild opened this issue Nov 24, 2021 · 7 comments · Fixed by #944
Labels
area: documentation 📖 (Improvements or additions to documentation)
needs: discussion 💬 (Needs discussion with the rest of the team)

Comments

@iameskild
Member

iameskild commented Nov 24, 2021

Related to #935.

To test and document how to deploy to an existing ("local") EKS cluster, I ran through the following steps:

Use (create) base EKS cluster

To get a functioning EKS cluster up and running quickly, I created a cluster and web app based on this tutorial. This cluster runs in its own VPC with 3 subnets (each in its own AZ) and has no node groups. A scenario like this seemed like a good place to start from the perspective of an incoming user.

Once this EKS cluster is up, there are still a handful of steps that seem to be required before QHub can be deployed to it:

  • Ensure that the subnets are allowed to "automatically assign public IP addresses to instances launched into it", otherwise node groups can't be launched
  • Create general, user, and worker node groups (a rough CLI sketch follows this list)
    • Attach a Node IAM Role with the required permissions (I copied these from the role created by a previous QHub deployment)
    • Configure each node group, being mindful of instance size, attached block-storage size, and auto-scaling settings
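
For reference, here is a rough sketch of creating one of these node groups with the AWS CLI; the node role ARN, subnet IDs, instance type, and disk size are placeholders, and doing the same through the EKS console works just as well:

# Hypothetical sketch: create the "general" node group on the existing cluster.
# Repeat for "user" and "worker", adjusting the instance type and scaling bounds.
aws eks create-nodegroup \
  --cluster-name eaeeks \
  --nodegroup-name general \
  --node-role arn:aws:iam::<account-id>:role/<node-instance-role> \
  --subnets subnet-aaaa subnet-bbbb subnet-cccc \
  --instance-types m5.xlarge \
  --disk-size 50 \
  --scaling-config minSize=1,maxSize=1,desiredSize=1

Managed node groups created this way automatically carry the eks.amazonaws.com/nodegroup=<name> label, which is what the node_selectors in the qhub-config.yaml below rely on.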

I'm sure there are scenarios where node groups already exist and can be repurposed, but more broadly it would be nice to make this process a lot more streamlined. Did I overcomplicate this, or are there other ways of handling the QHub deployment without having to add these node groups explicitly?

Deploy QHub to Existing EKS Cluster

Ensure that you are using the existing cluster's kubectl context.
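
For example (the cluster name and region here are the ones used in this walkthrough; adjust them to your own):

# Add or refresh the kubeconfig entry for the existing cluster, then confirm the active context.
aws eks update-kubeconfig --name eaeeks --region us-east-2
kubectl config current-context

The context name reported here (the cluster ARN) is what goes into local.kube_context in the qhub-config.yaml shown below.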

Initialize in the usual manner:

python -m qhub init aws --project eaeexisting --domain eaeexisting.qhub.dev --ci-provider github-actions --auth-provider github --auth-auto-provision --repository github.com/iameskild/eaeaws

Then update the qhub-config.yaml file. The important keys to update are:

  • Replace provider: aws with provider: local
  • Replace amazon_web_services with local
    • And update the node_selectors and kube_context appropriately

Once updated, deploy in the usual manner:

python -m qhub deploy --config qhub-config.yaml --disable-prompt --dns-provider cloudflare --dns-auto-provision

The deployment completes successfully and all the pods appear to be running (alongside the existing pods from the web app). The issue is that I can't access the cluster from the browser:

404 page not found

Examining the deployment output more closely, you can see that the ingress doesn't have an IP address:

[terraform]: ingress_jupyter = {
[terraform]:   "hostname" = "aea1abf087211438cbf9e44ef5fb64c3-197330438.us-east-2.elb.amazonaws.com"
[terraform]:   "ip" = ""
[terraform]:
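
To see what the ingress LoadBalancer actually exposes, you can check the service directly (assuming the dev namespace from the config below); on AWS the ELB reports a DNS hostname in the EXTERNAL-IP column rather than an IP address:

# The LoadBalancer service shows the ELB hostname under EXTERNAL-IP, not an IP.
kubectl get svc --namespace dev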

qhub-config.yaml

project_name: eaeexisting
provider: local
domain: eaeexisting.qhub.dev
certificate:
  type: self-signed
security:
  authentication:
    type: GitHub
    config:
      client_id: 
      client_secret:
      oauth_callback_url: https://eaeexisting.qhub.dev/hub/oauth_callback
  users:
    iameskild:
      uid: 1000
      primary_group: admin
      secondary_groups:
      - users
  groups:
    users:
      gid: 100
    admin:
      gid: 101
default_images:
  jupyterhub: quansight/qhub-jupyterhub:v0.3.13
  jupyterlab: quansight/qhub-jupyterlab:v0.3.13
  dask_worker: quansight/qhub-dask-worker:v0.3.13
  dask_gateway: quansight/qhub-dask-gateway:v0.3.13
  conda_store: quansight/qhub-conda-store:v0.3.13
storage:
  conda_store: 60Gi
  shared_filesystem: 100Gi
theme:
  jupyterhub:
    hub_title: QHub - eaeexisting
    hub_subtitle: Autoscaling Compute Environment on Amazon Web Services
    welcome: Welcome to eaeexisting.qhub.dev. It is maintained by <a href="http://quansight.com">Quansight
      staff</a>. The hub's configuration is stored in a github repository based on
      <a href="https://github.com/Quansight/qhub/">https://github.com/Quansight/qhub/</a>.
      To provide feedback and report any technical problems, please use the <a href="https://github.com/Quansight/qhub/issues">github
      issue tracker</a>.
    logo: /hub/custom/images/jupyter_qhub_logo.svg
    primary_color: '#4f4173'
    secondary_color: '#957da6'
    accent_color: '#32C574'
    text_color: '#111111'
    h1_color: '#652e8e'
    h2_color: '#652e8e'
monitoring:
  enabled: true
cdsdashboards:
  enabled: true
  cds_hide_user_named_servers: true
  cds_hide_user_dashboard_servers: false
ci_cd:
  type: github-actions
  branch: main
terraform_state:
  type: remote
namespace: dev
local:
  kube_context: arn:aws:eks:us-east-2:892486800165:cluster/eaeeks
  node_selectors:
    general:
      key: eks.amazonaws.com/nodegroup
      value: general
    user:
      key: eks.amazonaws.com/nodegroup
      value: user
    worker:
      key: eks.amazonaws.com/nodegroup
      value: worker
profiles:
  jupyterlab:
  - display_name: Small Instance
    description: Stable environment with 1 cpu / 4 GB ram
    default: true
    kubespawner_override:
      cpu_limit: 1
      cpu_guarantee: 0.75
      mem_limit: 4G
      mem_guarantee: 2.5G
      image: quansight/qhub-jupyterlab:v0.3.13
  - display_name: Medium Instance
    description: Stable environment with 2 cpu / 8 GB ram
    kubespawner_override:
      cpu_limit: 2
      cpu_guarantee: 1.5
      mem_limit: 8G
      mem_guarantee: 5G
      image: quansight/qhub-jupyterlab:v0.3.13
  dask_worker:
    Small Worker:
      worker_cores_limit: 1
      worker_cores: 0.75
      worker_memory_limit: 4G
      worker_memory: 2.5G
      worker_threads: 1
      image: quansight/qhub-dask-worker:v0.3.13
    Medium Worker:
      worker_cores_limit: 2
      worker_cores: 1.5
      worker_memory_limit: 8G
      worker_memory: 5G
      worker_threads: 2
      image: quansight/qhub-dask-worker:v0.3.13
environments:
  environment-dask.yaml:
    name: dask
    channels:
    - conda-forge
    dependencies:
    - python
    - ipykernel
    - ipywidgets
    - qhub-dask ==0.3.13
    - python-graphviz
    - numpy
    - numba
    - pandas
  environment-dashboard.yaml:
    name: dashboard
    channels:
    - conda-forge
    dependencies:
    - python==3.9.7
    - ipykernel==6.4.1
    - ipywidgets==7.6.5
    - qhub-dask==0.3.13
    - param==1.11.1
    - python-graphviz==0.17
    - matplotlib==3.4.3
    - panel==0.12.4
    - voila==0.2.16
    - streamlit==1.0.0
    - dash==2.0.0
    - cdsdashboards-singleuser==0.5.7

@iameskild
Member Author

iameskild commented Nov 24, 2021

@viniciusdc would you mind taking a look at this to see if I missed anything? And could you share any qhub-config.yaml that successfully deployed on an existing cluster? Thanks a lot :)

@iameskild added the "area: documentation 📖" and "needs: discussion 💬" labels Nov 24, 2021
@iameskild
Member Author

Now that I think of it, this is most likely caused by the fact that this existing web app already has an EXTERNAL-IP set. I will attempt this again with an existing cluster that doesn't already have a public-facing IP/ingress.

@viniciusdc
Contributor

@viniciusdc would you mind taking a look at this to see if I missed anything? And could you share any qhub-config.yaml that successfully deployed on an existing cluster? Thanks a lot :)

Hi @iameskild, the only qhub-config that I have is for a GCP deployment. The only difference from yours (besides the provider) is that we needed to set the load-balancer configuration to an internal one, but that's because of some security policies.

@iameskild
Member Author

Hey @viniciusdc, how did you provision the DNS? From reading through the code base, it appears that when deploying to a local (existing) cluster, the update_record for Cloudflare is skipped altogether:
https://github.com/Quansight/qhub/blob/c0d08bbcc08816475bf26466e2d64f9daf03164e/qhub/deploy.py#L108-L119

And that's what I see when I deploy:

INFO:qhub.deploy:Couldn't update the DNS record for cloud provider: local

This explains why I can't access the cluster.

@iameskild
Member Author

I was able to get around this by updating the DNS record manually in the CloudFlare portal 👍
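
For anyone doing the same: the manual record is just a CNAME pointing the QHub domain at the ELB hostname from the ingress_jupyter output above. A rough sketch with the Cloudflare API (the zone ID and API token are placeholders; creating the record in the portal is equivalent):

# Create a CNAME record pointing the QHub domain at the ELB hostname.
# CF_ZONE_ID and CF_API_TOKEN are placeholders for your Cloudflare zone and credentials.
curl -X POST "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{"type": "CNAME", "name": "eaeexisting.qhub.dev", "content": "aea1abf087211438cbf9e44ef5fb64c3-197330438.us-east-2.elb.amazonaws.com", "proxied": false}'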

@viniciusdc
Contributor

Hey @viniciusdc, how did you provision the DNS? From reading through the code base, it appears that when deploying to a local (existing) cluster, the update_record for Cloudflare is skipped altogether:

https://github.com/Quansight/qhub/blob/c0d08bbcc08816475bf26466e2d64f9daf03164e/qhub/deploy.py#L108-L119

And that's what I see when I deploy:

INFO:qhub.deploy:Couldn't update the DNS record for cloud provider: local

This explains why I can't access the cluster.

You can work around that by providing the DNS records manually, right? And by providing the certificate's secrets in the namespace... (I am not sure)

@iameskild
Member Author

iameskild commented Nov 26, 2021

I noticed that a few minutes after posting this 😆 Thanks @viniciusdc

In the future, it might be nice if users with existing clusters could have their DNS records auto-provisioned as well. Some changes to this part of the code could include a check for which cloud provider they are using and then calling update_record appropriately.
