30 Oct - merged and reverted pr - did the merge cause a disruption? #1687
Collecting some details from today's investigations.

The biggest source of the error appeared to be a failure of gke-prod to tear down the previous version, which caused the 'prime' versions to be misreported in the federation-redirect. As a result, up-to-date members of the federation were considered invalid and only GKE received traffic. After re-deploying with #1686 this issue did not recur, but the health check endpoint continued to be unavailable.

Some manual testing has revealed massive performance regressions in the kubernetes Python client between v9 and v12, and a positive feedback loop in how we handle slow checks in the binderhub health handler: once a check takes longer than a certain amount of time it never finishes and is never cached, so every request re-runs it (see the sketch below). So far, I think we have this plan to resolve the issue:
Only items 1-2 are needed to get everything working from latest; the rest are required to allow us to upgrade to a later version of the kubernetes Python client. There's no great pressure to do that, though, since there are no features we need in the more recent versions.
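To make the feedback loop mentioned above concrete, here is a minimal sketch (not BinderHub's actual handler code) of how a cache-plus-timeout health check can lock itself into re-running slow checks; the function names, TTL, and timeout values are made up for illustration.

```python
# Minimal sketch of a cache-plus-timeout health check (illustrative only,
# not BinderHub's implementation). Results are cached only when a check
# finishes within the timeout, so once a check becomes slower than the
# timeout its result is never cached, every request re-runs it from
# scratch, and the extra load makes it slower still: a positive feedback loop.
import asyncio
import time

CACHE_TTL = 10      # seconds a successful result stays cached (assumed value)
CHECK_TIMEOUT = 5   # seconds a request waits before giving up (assumed value)

_cache = {}         # check name -> (timestamp, result)

async def cached_check(name, check):
    now = time.monotonic()
    cached = _cache.get(name)
    if cached and now - cached[0] < CACHE_TTL:
        return cached[1]  # fresh cached result, no work needed
    try:
        # Each request waits at most CHECK_TIMEOUT for the check coroutine.
        result = await asyncio.wait_for(check(), timeout=CHECK_TIMEOUT)
    except asyncio.TimeoutError:
        # The slow check is cancelled and nothing is cached, so the next
        # request starts it again from scratch -- the feedback loop.
        return {"ok": False, "error": "timeout"}
    _cache[name] = (time.monotonic(), result)
    return result
```

One way out of such a loop (independent of what was ultimately done here) is to let the check keep running in the background and cache its result whenever it completes, so the per-request timeout only controls how long a caller waits, not whether the result ever gets cached.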
@minrk if the v9->v12 regression will slow things down no matter what we use, then we must also update Z2JH to stop using v12. In z2jh 0.9.0 (currently on mybinder.org) we used v10.
I added checkboxes. I think this action plan is a very sound one, and I'll update my z2jh bump PR to pin the kubernetes client to v9 so we can merge the z2jh bump without having to consider the performance issue.
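For reference, pinning the client back to the v9 series in the hub image could look something like the fragment below; the file name and exact version specifier are assumptions, not the actual contents of the PR.

```
# Hypothetical requirements.txt fragment for the z2jh hub image:
# stay on the Kubernetes Python client v9 series until the
# v9->v12 performance regression is understood.
kubernetes==9.*
```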
Update

When trying again to deploy 0.10.6 I used the latest kubernetes client in the z2jh hub image. There was a notable performance improvement in the hub pod, associated with a KubeSpawner optimization, which made its CPU usage drop to about half. CPU usage in the binder pods went up, though. I believe the increased CPU load on the binder pods was related to gesis becoming unhealthy, but I'm not sure; now that gesis is back online the CPU load is stable at the previous levels.

@MridulS and I had a live debugging session and concluded that the gesis deployment failure came from a breaking change in z2jh 0.10.0 that is documented in the changelog but that we had forgotten about: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/CHANGELOG.md#breaking-changes
The current status is that everything is up and running and seems to work as intended, and we now have a new version of z2jh running on mybinder.org-deploy.
Timeline
Disruption of CI linting?
When attempting to revert (2) in (3), I suspect (1) caused issues in the CI system's linting, but I'm not sure.
Disruption of service?
Everything looked good in the CI system etc., but @arnim observed some 504s, so we mostly have Grafana to go on. For me things seemed to work, but I got unreliable responses from binderhub in general when running

curl -v https://gke.mybinder.org/health

https://grafana.mybinder.org/d/3SpLQinmk/1-overview?orgId=1&from=1604072220723&to=1604102844222&var-cluster=prometheus
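To reproduce that kind of spot check, a small script like the following (an illustrative sketch; the retry count and request timeout are arbitrary) times repeated requests against the same health endpoint and makes slow or failing responses easy to spot:

```python
# Time a handful of requests against the BinderHub health endpoint to see
# whether responses are slow or unreliable. The URL comes from the comment
# above; the loop count and timeout are arbitrary choices.
import time
import requests

URL = "https://gke.mybinder.org/health"

for i in range(5):
    start = time.monotonic()
    try:
        resp = requests.get(URL, timeout=30)
        print(f"attempt {i}: HTTP {resp.status_code} in {time.monotonic() - start:.1f}s")
    except requests.RequestException as exc:
        print(f"attempt {i}: failed after {time.monotonic() - start:.1f}s ({exc})")
```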