
30 Oct - merged and reverted pr - did the merge cause a disruption? #1687

Closed
consideRatio opened this issue Oct 31, 2020 · 4 comments

@consideRatio
Member

consideRatio commented Oct 31, 2020

Timeline

Disruption of CI linting?

When attempting to revert (2) in (3), I suspect (1) caused issues in the CI system's linting, but I'm not sure.

Disruption of service?

Everything looked good in the CI system etc., but @arnim observed some 504s, so we mostly have Grafana to go on. For me things seemed to work, but I got unreliable responses from binderhub in general when running curl -v https://gke.mybinder.org/health.

https://grafana.mybinder.org/d/3SpLQinmk/1-overview?orgId=1&from=1604072220723&to=1604102844222&var-cluster=prometheus
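For reference, a minimal sketch of how one could poll the health endpoint to quantify "unreliable responses". This is not part of our tooling; the URL is the one mentioned above, the poll count and timeouts are made-up values.

```python
import time
import requests

URL = "https://gke.mybinder.org/health"
failures = 0

# Poll the health endpoint a few times and count non-OK or failed responses.
for _ in range(20):
    try:
        r = requests.get(URL, timeout=10)
        if r.status_code != 200:
            failures += 1
            print(f"{r.status_code} after {r.elapsed.total_seconds():.1f}s")
    except requests.RequestException as exc:
        failures += 1
        print(f"request failed: {exc}")
    time.sleep(5)

print(f"{failures}/20 checks failed")
```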

@minrk
Member

minrk commented Nov 2, 2020

Collecting some details from today's investigations

The biggest source of the error appeared to be a failure of gke-prod to tear down the previous version, resulting in misreporting the 'prime' versions in the federation-redirect. The result was that up-to-date members of the federation were considered invalid, and only GKE received traffic.
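For context, a rough sketch of the kind of version comparison involved. The function and member list here are illustrative, not the actual federation-redirect code.

```python
import requests

PRIME = "https://gke.mybinder.org"
MEMBERS = ["https://gke.mybinder.org", "https://gesis.mybinder.org"]  # illustrative list

def get_versions(url):
    """Fetch the versions a BinderHub deployment reports about itself."""
    return requests.get(f"{url}/versions", timeout=10).json()

# If the prime reports stale versions (e.g. because the old deployment was
# never torn down), up-to-date members fail this comparison and get dropped
# from the redirect pool, leaving only the prime to receive traffic.
prime_versions = get_versions(PRIME)
valid = [m for m in MEMBERS if get_versions(m) == prime_versions]
print("members considered valid:", valid)
```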

On re-deploying with #1686, this issue did not recur; however, the health check endpoint continued to be unavailable. Some manual testing has revealed massive performance regressions in the kubernetes Python API between v9 and v12, and a positive feedback loop in how we handle slow checks in the binderhub health handler, ensuring that slow checks never finish and are never cached once they take longer than a certain amount of time.
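To illustrate the feedback loop, here is a simplified sketch, not the actual binderhub health handler; the cache TTL and timeout values are made up.

```python
import asyncio
import time

CACHE_TTL = 10      # seconds a successful result stays cached (illustrative)
CHECK_TIMEOUT = 5   # how long the handler waits before giving up (illustrative)

_cache = {}  # name -> (timestamp, result)

async def cached_check(name, check):
    """Run a health check, caching only results that arrive in time.

    The trap: once `check` takes longer than CHECK_TIMEOUT, its result is
    never stored, so every incoming /health request starts the slow check
    over again, adding more load and making it even slower.
    """
    now = time.monotonic()
    if name in _cache and now - _cache[name][0] < CACHE_TTL:
        return _cache[name][1]
    try:
        result = await asyncio.wait_for(check(), timeout=CHECK_TIMEOUT)
    except asyncio.TimeoutError:
        # Nothing is cached, so the slow check is retried on every request.
        return {"ok": False, "reason": "timeout"}
    _cache[name] = (now, result)
    return result
```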

So far, I think we have this plan to resolve the issue:

  1. pin kubernetes to v9 so we can get back to using the latest version of the chart (pin kubernetes, jupyterhub in requirements.in binderhub#1190)
  2. deploy with latest chart and pinned kubernetes, hopefully seeing that this has isolated the problem
  3. include health checks in our CI (add health checks to deployment tests #1694)
  4. fix some problems seen in the health check for binderhub (Limit outstanding health checks binderhub#1192)
  5. adopt the _preload_content=False optimization first deployed in kubespawner#424 (Breaking change / performance: don't make kubernetes-client deserialize k8s events into objects), apparently needed ~everywhere the kubernetes Python API is used with a nontrivial amount of resources (see the sketch below)
  6. bump kubernetes to v12 in the binderhub image

Only 1-2 are needed to get everything working from latest, but the rest are required to allow us to upgrade to a later version of the kubernetes Python client. There's no great pressure to do that, though, as there are no features we need in versions more recent than the one we are using.
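For illustration, a minimal sketch of what the _preload_content=False pattern from kubespawner#424 looks like with the kubernetes Python client. The namespace and extracted fields here are only examples.

```python
import json

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Default behaviour: the client deserializes every pod into full model
# objects, which gets expensive with a nontrivial number of resources.
pods = v1.list_namespaced_pod("example-namespace")

# With _preload_content=False we get the raw HTTP response back and parse
# only the fields we actually need.
resp = v1.list_namespaced_pod("example-namespace", _preload_content=False)
items = json.loads(resp.data)["items"]
names = [item["metadata"]["name"] for item in items]
```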

@consideRatio
Member Author

@minrk either the v9->v12 regression slows things down regardless of whether we use _preload_content=False, or it is only a regression when _preload_content=True (the default). Have you drawn a conclusion about this?

If we have a regression in v9->v12 no matter what, we must also update Z2JH to stop using v12. In z2jh 0.9.0 (currently on mybinder.org) we used v10.

@consideRatio
Member Author

I added checkboxes. I think this action plan is a very sound one, and I'll update my z2jh bump PR to pin the kubernetes client to v9 so we can merge the z2jh bump without considering the performance part.

> So far, I think we have this plan to resolve the issue:

@consideRatio
Member Author

Update

When trying again to deploy 0.10.6, I used the latest kubernetes client in the z2jh hub image. There was a notable performance improvement in the hub pod, associated with a KubeSpawner optimization, which cut its CPU usage roughly in half. CPU usage in the binder pods increased, though.

I believe the increased CPU load on the binder pods was related to gesis becoming unhealthy, but I'm not sure. Now that gesis is back online, the CPU load is stable at the previous levels.

@MridulS and I had a live debugging session and concluded that the gesis deployment issue was caused by a breaking change in z2jh 0.10.0, documented in the changelog, that we had forgotten about: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/CHANGELOG.md#breaking-changes

Anyone relying on configuration in the proxy.https section is now explicitly required to set proxy.https.enabled to true.
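In Helm-values terms, the fix for such a deployment amounts to something like the following. The hostname and contact email are placeholders, not the actual gesis configuration.

```yaml
# config.yaml for the z2jh chart (illustrative values)
proxy:
  https:
    enabled: true            # must be set explicitly since z2jh 0.10.0
    hosts:
      - binder.example.org   # placeholder hostname
    letsencrypt:
      contactEmail: admin@example.org   # placeholder contact
```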

The current status is that everything is up and running and seems to work as intended, and we now have a new version of z2jh running on mybinder.org-deploy.
