
30 Oct - merged and reverted pr - did the merge cause a disruption? #1687

Closed
consideRatio opened this issue Oct 31, 2020 · 4 comments

@consideRatio
Member

consideRatio commented Oct 31, 2020

Timeline

Disruption of CI linting?

When attempting to revert (2) in (3), I suspect (1) caused issues in the CI system's linting, but I'm not sure.

Disruption of service?

Everything looked good in the CI system etc., but @arnim observed some 504s, so we mostly have Grafana to go on. For me things seemed to work, but I got unreliable responses from binderhub in general when running curl -v https://gke.mybinder.org/health.

https://grafana.mybinder.org/d/3SpLQinmk/1-overview?orgId=1&from=1604072220723&to=1604102844222&var-cluster=prometheus
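For reference, a minimal sketch of how one could poll the health endpoint to quantify "unreliable responses". This is not part of our tooling; the URL is the one mentioned above, the poll count and timeouts are made-up values.

```python
import time
import requests

URL = "https://gke.mybinder.org/health"
failures = 0

# Poll the health endpoint a few times and count non-OK or failed responses.
for _ in range(20):
    try:
        r = requests.get(URL, timeout=10)
        if r.status_code != 200:
            failures += 1
            print(f"{r.status_code} after {r.elapsed.total_seconds():.1f}s")
    except requests.RequestException as exc:
        failures += 1
        print(f"request failed: {exc}")
    time.sleep(5)

print(f"{failures}/20 checks failed")
```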

@minrk
Member

minrk commented Nov 2, 2020

Collecting some details from today's investigations

The biggest source of the error appeared to be a failure of gke-prod to tear down the previous version, resulting in misreporting the 'prime' versions in the federation-redirect. The result was that up-to-date members of the federation were considered invalid, and only GKE received traffic.
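For context, a rough sketch of the kind of version comparison involved. The function and member list here are illustrative, not the actual federation-redirect code.

```python
import requests

PRIME = "https://gke.mybinder.org"
MEMBERS = ["https://gke.mybinder.org", "https://gesis.mybinder.org"]  # illustrative list

def get_versions(url):
    """Fetch the versions a BinderHub deployment reports about itself."""
    return requests.get(f"{url}/versions", timeout=10).json()

# If the prime reports stale versions (e.g. because the old deployment was
# never torn down), up-to-date members fail this comparison and get dropped
# from the redirect pool, leaving only the prime to receive traffic.
prime_versions = get_versions(PRIME)
valid = [m for m in MEMBERS if get_versions(m) == prime_versions]
print("members considered valid:", valid)
```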

On re-deploying with #1686, this issue did not recur; however, the health check endpoint continued to be unavailable. Some manual testing has revealed massive performance regressions in the kubernetes Python API between v9 and v12, and a positive feedback loop in how we handle slow checks in the binderhub health handler, ensuring that slow checks never finish and are never cached once they take longer than a certain amount of time.
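To illustrate the feedback loop, here is a simplified sketch, not the actual binderhub health handler; the cache TTL and timeout values are made up.

```python
import asyncio
import time

CACHE_TTL = 10      # seconds a successful result stays cached (illustrative)
CHECK_TIMEOUT = 5   # how long the handler waits before giving up (illustrative)

_cache = {}  # name -> (timestamp, result)

async def cached_check(name, check):
    """Run a health check, caching only results that arrive in time.

    The trap: once `check` takes longer than CHECK_TIMEOUT, its result is
    never stored, so every incoming /health request starts the slow check
    over again, adding more load and making it even slower.
    """
    now = time.monotonic()
    if name in _cache and now - _cache[name][0] < CACHE_TTL:
        return _cache[name][1]
    try:
        result = await asyncio.wait_for(check(), timeout=CHECK_TIMEOUT)
    except asyncio.TimeoutError:
        # Nothing is cached, so the slow check is retried on every request.
        return {"ok": False, "reason": "timeout"}
    _cache[name] = (now, result)
    return result
```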

So far, I think we have this plan to resolve the issue:

  1. pin kubernetes to v9 so we can get back to using the latest version of the chart (pin kubernetes, jupyterhub in requirements.in binderhub#1190)
  2. deploy with latest chart and pinned kubernetes, hopefully seeing that this has isolated the problem
  3. include health checks in our CI (add health checks to deployment tests #1694)
  4. fix some problems seen in the health check for binderhub (Limit outstanding health checks binderhub#1192)
  5. adopt the _preload_content=False optimization first deployed in kubespawner#424 (Breaking change / performance: don't make kubernetes-client deserialize k8s events into objects), apparently needed ~everywhere the kubernetes Python API is used with a nontrivial amount of resources (see the sketch below)
  6. bump kubernetes to v12 in the binderhub image

Only 1-2 are needed to get everything working from latest, but the rest are required to allow us to upgrade to a later version of the kubernetes Python client. There's no great pressure to do that, though, as there are no features we need in versions more recent than the one we are using.
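For illustration, a minimal sketch of what the _preload_content=False pattern from kubespawner#424 looks like with the kubernetes Python client. The namespace and extracted fields here are only examples.

```python
import json

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Default behaviour: the client deserializes every pod into full model
# objects, which gets expensive with a nontrivial number of resources.
pods = v1.list_namespaced_pod("example-namespace")

# With _preload_content=False we get the raw HTTP response back and parse
# only the fields we actually need.
resp = v1.list_namespaced_pod("example-namespace", _preload_content=False)
items = json.loads(resp.data)["items"]
names = [item["metadata"]["name"] for item in items]
```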

@consideRatio
Member Author

@minrk either the v9->v12 regression slows things down regardless of whether we use _preload_content=False, or it is only a regression when _preload_content=True (the default). Have you drawn a conclusion about this?

If we have a regression in v9->v12 no matter what, we must also update Z2JH to stop using v12. In z2jh 0.9.0 (currently on mybinder.org) we used v10.

@consideRatio
Member Author

I added checkboxes. I think this action plan is a very sound one, and I'll update my z2jh bump PR to pin the kubernetes client to v9 so we can merge the z2jh bump without considering the performance part.

> So far, I think we have this plan to resolve the issue:

@consideRatio
Member Author

Update

When trying again to deploy 0.10.6, I used the latest kubernetes client in the z2jh hub image. There was a notable performance improvement in the hub pod, associated with a KubeSpawner optimization, which cut its CPU usage roughly in half. CPU usage in the binder pods increased, though.

I believe the increased CPU load on the binder pods was related to gesis becoming unhealthy, but I'm not sure. Now that gesis is back online, the CPU load is stable at the previous levels.

@MridulS and I had a live debugging session and concluded that the gesis deployment issue was caused by a breaking change in z2jh 0.10.0, documented in the changelog, that we had forgotten about: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/CHANGELOG.md#breaking-changes

Anyone relying on configuration in the proxy.https section is now explicitly required to set proxy.https.enabled to true.
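In Helm-values terms, the fix for such a deployment amounts to something like the following. The hostname and contact email are placeholders, not the actual gesis configuration.

```yaml
# config.yaml for the z2jh chart (illustrative values)
proxy:
  https:
    enabled: true            # must be set explicitly since z2jh 0.10.0
    hosts:
      - binder.example.org   # placeholder hostname
    letsencrypt:
      contactEmail: admin@example.org   # placeholder contact
```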

The current status is that everything is up and running and seems to work as intended, and we now have a new version of z2jh running on mybinder.org-deploy.
