Resource starvation after data8 lecture due to students being encouraged to take a mini quiz #1746

Closed
felder opened this issue Aug 28, 2020 · 9 comments

felder (Contributor) commented Aug 28, 2020

David Wagner in the data8 slack writes:

Looks like again today students experienced problems accessing the datahub servers at noon right after lecture. We have lecture 11am-noon, and I encourage them to take the vitamin (mini-quiz) right after lecture, so we probably had hundreds of students descending on datahub all at the same time at noon. I noticed that it was slow to load and then gave a "bad gateway" error. It looks like it cleared up after 5-10 minutes. I'm anticipating the same pattern might continue in future weeks. I'm wondering if it makes sense to do something to give the servers a hint to spool up preemptively a bit before noon in anticipation of a horde about to descend on them, or if it makes more sense to let students' browsers hang for a while and have them keep retrying until they eventually get in?

felder (Contributor, Author) commented Aug 28, 2020

@yuvipanda I know we're trying to avoid the complexity of managing differing numbers of placeholder pods throughout the day, so it may make sense for students to simply expect a bit of a delay in the above case. Nonetheless, I told David I'd file this as an issue to discuss.
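
For context, a minimal sketch of the kind of configuration this refers to, assuming a zero-to-jupyterhub style values.yaml; the numbers are illustrative, not the deployment's actual settings:

```yaml
# Illustrative zero-to-jupyterhub values.yaml snippet (assumed, not the actual
# deployment config). Placeholder pods reserve capacity so the cluster
# autoscaler adds nodes before real users need them.
scheduling:
  podPriority:
    enabled: true        # lets real user pods evict placeholders immediately
  userPlaceholder:
    enabled: true
    replicas: 100        # hypothetical number of users' worth of spare capacity
```

Varying `replicas` on a schedule (e.g. raising it shortly before lecture) is the per-time-of-day management being avoided here.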

yuvipanda referenced this issue in yuvipanda/datahub-old-fork Aug 30, 2020
I think #1746 is due to this, since grafana says
the datahub hub pod was constantly around 1G RAM,
and gave them a 'bad gateway' error
yuvipanda (Contributor) commented:

Sorry for the issues, @davidwagner! I think it's because we had a 1G memory limit on the JupyterHub hub pods, and the datahub hub pod was hovering around that limit for much of that time.

[Screenshot (2020-08-30): Grafana graph showing the datahub hub pod's memory usage hovering around the 1G limit]

With #1759 I've increased that to 2G, and that should hopefully help fix this.
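
For reference, a minimal sketch of what raising the hub's memory limit looks like in a zero-to-jupyterhub style values.yaml; this is an assumed shape, the actual change is in #1759:

```yaml
# Illustrative values.yaml snippet (assumed shape; see #1759 for the real
# change). Gives the JupyterHub hub pod more memory headroom so it isn't
# throttled or killed when many users hit it at once.
hub:
  resources:
    limits:
      memory: 2G    # raised from the previous 1G limit
```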

davidwagner commented:

Thanks, @yuvipanda! I have to say that anecdotally it didn't seem to go so great today either: during lecture, students reported they couldn't launch a notebook on datahub (and a few TAs told me they saw the same thing), probably because we had a horde of students all descend at the same time when I encouraged them to open the demo notebook during lecture. I'm not sure what to do about that. I can't tell whether this was a partial improvement that wasn't enough, or no improvement at all.

davidwagner commented:

There have been continuing issues with Datahub not responding, on and off, for the past several hours, both that I have observed and that a number of students have reported.

yuvipanda referenced this issue in yuvipanda/datahub-old-fork Sep 1, 2020
More resource starvation reported in #1746,
this should help deal with spikes a little
better.
yuvipanda (Contributor) commented:

@davidwagner Sorry to hear that :( I've increased excess capacity from 100 extra users to 300 (in #1771), which should hopefully handle spikes a little better. I've also set the cluster's minimum number of nodes to 7, to make sure we have capacity for at least 700 users at all times.

Again, really sorry this happened.
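
A rough sketch of the two knobs described above, assuming a zero-to-jupyterhub style values.yaml; the actual change is in #1771, and the node-pool minimum is set on the cloud side rather than in the chart:

```yaml
# Illustrative values.yaml snippet (assumed shape; see #1771 for the real
# change). Keeps roughly 300 users' worth of spare capacity warm at all times.
scheduling:
  userPlaceholder:
    enabled: true
    replicas: 300
# The 7-node minimum (~700 users of capacity) is a node-pool autoscaler
# setting on the cloud provider, not part of this values file.
```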

davidwagner commented:

Thank you so much for all your support! I'll let you know how it goes as the week continues.

davidwagner commented:

@yuvipanda datahub is again not available. Is the auto-scaling working at all? Something seems to be going wrong with it. Lecture today was 11am-noon, and I was unable to get into datahub the entire time. I tried every 10 minutes or so, and it repeatedly either hung or gave me 'Bad gateway'. Students are reporting that they can't get in either. We encourage them to open a notebook with our demos and follow along during lecture, and that hasn't been working for them. I would have expected datahub to scale up more capacity so that I'd eventually be able to get in if I was patient, but that doesn't seem to be happening. We have labs right after lecture and rely on students being able to get into datahub; we'll see how that goes.

yuvipanda (Contributor) commented:

I'm actively looking at it right now...

yuvipanda referenced this issue in yuvipanda/datahub-old-fork Sep 3, 2020
Our http response times were through the roof,
mostly because of CPU saturation. While we work
on fixing that, this at least makes sure we are guaranteed
one full CPU

Ref #1746
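
A minimal sketch of what guaranteeing the hub a full CPU could look like in a zero-to-jupyterhub style values.yaml; this is an assumed shape, not the exact commit:

```yaml
# Illustrative values.yaml snippet (assumed shape). A CPU *request* of 1
# guarantees the hub pod a full core even when the node is otherwise
# saturated; without it, the hub only gets whatever CPU is left over.
hub:
  resources:
    requests:
      cpu: 1
```
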
yuvipanda referenced this issue in yuvipanda/datahub-old-fork Sep 3, 2020
Helps prometheus scrape metrics from hubs, which wasn't
possible until now because of our strict networkpolicy

Ref #1746
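
For context, a hypothetical sketch of the kind of network-policy exception this describes, using zero-to-jupyterhub's hub.networkPolicy settings; the namespace label and port are assumptions, not the real change:

```yaml
# Hypothetical values.yaml snippet (labels, namespace, and port are
# assumptions). Adds an ingress rule so the prometheus server can scrape the
# hub's metrics endpoint despite an otherwise strict NetworkPolicy.
hub:
  networkPolicy:
    enabled: true
    ingress:
      - from:
          - namespaceSelector:
              matchLabels:
                name: support   # namespace where prometheus runs (assumed)
        ports:
          - protocol: TCP
            port: 8081          # hub API port that serves /hub/metrics
```
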
yuvipanda referenced this issue in yuvipanda/datahub-old-fork Sep 3, 2020
Pinning to that version of JupyterHub was giving us
JupyterHub 1.0.1, which was causing a lot of performance
issues!

Ref #1746
yuvipanda referenced this issue in yuvipanda/datahub-old-fork Sep 4, 2020
Otherwise, prometheus ends up on random other nodes,
hogs a lot of CPU and affects performance negatively.

We should set appropriate requests for *all* core pods

Ref #1746
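
A rough sketch of setting explicit requests for prometheus itself, assuming the upstream prometheus Helm chart's values layout wrapped in a support chart; the structure and numbers are illustrative:

```yaml
# Illustrative snippet (structure and numbers are assumptions). Explicit
# requests let the scheduler reserve room for prometheus instead of letting it
# land on a random node and starve other pods of CPU.
prometheus:
  server:
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 1
        memory: 4Gi
```
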
yuvipanda referenced this issue in yuvipanda/datahub-old-fork Sep 4, 2020
Primarily to bring in
jupyterhub/zero-to-jupyterhub-k8s#1768,
which brings in jupyterhub/kubespawner#424
for performance

Ref #1746
yuvipanda (Contributor) commented:

I think this was fixed
