Resource starvation after data8 lecture due to students being encouraged to take a mini quiz #1746
Comments
@yuvipanda I know we're trying to avoid the complexity of managing differing numbers of placeholder pods throughout the day, so it may be that it makes sense for students to expect a bit of a delay in the above case. Nonetheless, I told David I'd file this as an issue to discuss.
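For context, "placeholder pods" refers to the user-placeholder mechanism in zero-to-jupyterhub: low-priority dummy pods keep spare nodes warm and get evicted as real users arrive. A minimal sketch of the relevant values, assuming a standard zero-to-jupyterhub style config (the layout in the datahub repo may differ):

```yaml
# Sketch only: standard zero-to-jupyterhub values for placeholder pods.
scheduling:
  podPriority:
    enabled: true        # placeholders run at low priority so real users evict them
  userPlaceholder:
    enabled: true
    replicas: 100        # how much spare user capacity to keep warm
```

"Managing differing numbers throughout the day" would mean changing `replicas` on a schedule, which is the complexity being avoided here.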
I think #1746 is due to this, since Grafana says the datahub hub pod was constantly around 1G RAM, which gave users a 'bad gateway' error.
Sorry for the issues, @davidwagner! I think it's because we had a 1G memory limit on the JupyterHub hub pods, and the datahub hub pod was hovering around that limit for a lot of that time. With #1759 I've increased it to 2G, which should hopefully help fix this.
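The 1G to 2G change would look roughly like this in a zero-to-jupyterhub style values file; this is just an illustrative sketch, the actual change is in #1759:

```yaml
# Illustrative sketch of raising the hub pod's memory limit; see #1759 for the real change.
hub:
  resources:
    limits:
      memory: 2Gi   # was 1Gi; the hub was hovering near the old limit
```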
Thanks, @yuvipanda! I have to say that anecdotally it didn't seem to go so great today either: during lecture students reported they couldn't launch a notebook on datahub (and a few TAs told me they saw the same thing), probably because we had a horde of students all descend at the same time when I encouraged them to open the demo notebook during lecture. I'm not sure what to do about that. I can't tell whether it was a partial improvement but not enough, or no improvement, or what.
There have been continuing issues with Datahub not responding, on and off, for the past several hours, both that I have observed and that a number of students have reported.
More resource starvation reported in #1746; this should help deal with spikes a little better.
@davidwagner Sorry to hear that :( I've increased excess capacity from 100 extra users to 300 (in #1771), which should hopefully deal with spikes a little better. I've also set the minimum number of nodes to 7, to make sure we have capacity for at least 700 users at all times. Again, really sorry this happened again.
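In values terms, the capacity bump described above is roughly the following; this is a sketch, the real change is in #1771, and the node-pool minimum is set on the cloud provider's autoscaler rather than in these values:

```yaml
# Sketch of the capacity bump; see #1771 for the actual change.
scheduling:
  userPlaceholder:
    replicas: 300   # was 100; roughly 300 users of headroom kept warm
# The minimum node count (7 nodes, about 700 users of capacity) is configured
# on the cluster's node-pool autoscaler, outside this values file.
```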
Thank you so much for all your support! I'll let you know how it goes as the week continues.
@yuvipanda datahub is again not available. Is the auto-scaling working at all? Something seems to be going wrong with it. Lecture today was 11am-noon, and I have been unable to get into datahub all lecture long. I've been trying every 10 minutes or so, and repeatedly it just hangs or gives me 'Bad gateway'. Students are reporting that they're unable to get into it either. We encourage them to open up a notebook with our demos and follow along during lecture, and that hasn't been working for students; they're experiencing the same issues. I would have expected datahub to scale up more capacity so that I'd be able to get in if I was patient, but that doesn't seem to be happening. We have labs right after lecture and we rely on students being able to get into datahub; we'll see how that goes.
I'm actively looking at it right now...
Our HTTP response times were through the roof, mostly because of CPU saturation. While we work on fixing that, this at least makes sure we are guaranteed one full CPU. Ref #1746
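In zero-to-jupyterhub terms, guaranteeing the hub a full CPU means setting a CPU request on the hub pod; a sketch, not the literal commit:

```yaml
# Sketch: a CPU request is a scheduling guarantee in Kubernetes terms.
hub:
  resources:
    requests:
      cpu: 1        # reserve one full core for the hub
```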
Helps Prometheus scrape metrics from hubs, which wasn't possible until now thanks to our strict NetworkPolicy. Ref #1746
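Zero-to-jupyterhub's hub NetworkPolicy denies ingress by default, so scraping needs an explicit allow rule. A hedged sketch of what such a rule could look like; the selector label and port here are assumptions for illustration, not the actual datahub config:

```yaml
# Sketch: extra ingress rule letting Prometheus reach the hub's /hub/metrics endpoint.
hub:
  networkPolicy:
    enabled: true
    ingress:
      - ports:
          - port: 8081            # hub API port, where /hub/metrics is served
        from:
          - podSelector:
              matchLabels:
                app: prometheus   # hypothetical label; match your Prometheus pods
```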
Pinning to that version of JupyterHub was giving us JupyterHub 1.0.1, which was causing a lot of performance issues! Ref #1746
Otherwise, Prometheus ends up on random other nodes, hogs a lot of CPU, and affects performance negatively. We should set appropriate requests for *all* core pods. Ref #1746
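With the standard Prometheus Helm chart, pinning the server to the core node pool and giving it explicit requests would look something like the following; the node label and numbers are illustrative assumptions, not the actual datahub values:

```yaml
# Sketch: keep the Prometheus server on the core node pool with explicit requests.
prometheus:
  server:
    nodeSelector:
      hub.jupyter.org/node-purpose: core
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
```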
Primarily to bring in jupyterhub/zero-to-jupyterhub-k8s#1768, which brings in jupyterhub/kubespawner#424 for performance. Ref #1746
I think this was fixed.
David Wagner in the data8 Slack writes:
Looks like again today students experienced problems accessing the datahub servers at noon, right after lecture. We have lecture 11am-noon, and I encourage them to take the vitamin (mini-quiz) right after lecture, so we probably had hundreds of students descending on datahub all at the same time at noon. I noticed that it was slow to load and then gave a "bad gateway" error. It looks like it cleared up after 5-10 minutes. I'm anticipating the same pattern might continue in future weeks. I'm wondering whether it makes sense to do something to give the servers a hint to spool up preemptively a bit before noon, in anticipation of a horde about to descend on them, or whether it makes more sense to let students' browsers hang for a while and let them keep retrying until they eventually get in?
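One way to "spool up preemptively" without touching the autoscaler itself would be to scale the user-placeholder pods up shortly before lecture and back down afterwards. A rough sketch, assuming a Kubernetes CronJob with a service account that is allowed to scale the `user-placeholder` StatefulSet; the name `placeholder-scaler`, the schedule, the replica count, and the image are assumptions for illustration:

```yaml
# Rough sketch: bump placeholder replicas before the 11am-noon lecture so nodes are
# already warm when students descend at noon. A mirror CronJob would scale back down.
# The service account, schedule, and image are assumptions; the RBAC needed to scale
# the user-placeholder StatefulSet is required but not shown here.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prewarm-placeholders
spec:
  schedule: "30 18 * * 1-5"        # ~10:30am Pacific expressed in UTC; adjust for DST
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: placeholder-scaler   # hypothetical; needs scale permissions
          restartPolicy: Never
          containers:
            - name: scale-up
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - scale
                - statefulset/user-placeholder
                - --replicas=400
```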