Resource starvation after data8 lecture due to students being encouraged to take a mini quiz #1746
Comments
@yuvipanda I know we're trying to avoid the complexity of managing differing numbers of placeholder pods throughout the day, so it may be that it makes sense for students to expect a bit of a delay in the above case. Nonetheless, I told David I'd file this as an issue to discuss.
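For context, "placeholder pods" refers to the user-placeholder mechanism in zero-to-jupyterhub: low-priority dummy pods keep spare nodes warm and get evicted as real users arrive. A minimal sketch of the relevant values, assuming a standard zero-to-jupyterhub style config (the layout in the datahub repo may differ):

```yaml
# Sketch only: standard zero-to-jupyterhub values for placeholder pods.
scheduling:
  podPriority:
    enabled: true        # placeholders run at low priority so real users evict them
  userPlaceholder:
    enabled: true
    replicas: 100        # how much spare user capacity to keep warm
```

"Managing differing numbers throughout the day" would mean changing `replicas` on a schedule, which is the complexity being avoided here.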
I think #1746 is due to this, since Grafana says the datahub hub pod was constantly around 1G RAM, which gave users a 'bad gateway' error.
Sorry for the issues, @davidwagner! I think it's because we had a 1G memory limit on the JupyterHub hub pods, and the datahub hub pod was hovering around that limit for a lot of that time. With #1759 I've increased it to 2G, which should hopefully help fix this.
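The 1G to 2G change would look roughly like this in a zero-to-jupyterhub style values file; this is just an illustrative sketch, the actual change is in #1759:

```yaml
# Illustrative sketch of raising the hub pod's memory limit; see #1759 for the real change.
hub:
  resources:
    limits:
      memory: 2Gi   # was 1Gi; the hub was hovering near the old limit
```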
Thanks, @yuvipanda! I have to say that anecdotally it didn't seem to go so great today either: during lecture students reported they couldn't launch a notebook on datahub (and a few TAs told me they saw the same thing), probably because we had a horde of students all descend at the same time when I encouraged them to open the demo notebook during lecture. I'm not sure what to do about that. I can't tell whether it was a partial improvement but not enough, or no improvement, or what.
There have been continuing issues with Datahub not responding, on and off, for the past several hours, both that I have observed and that a number of students have reported.
More resource starvation reported in #1746; this should help deal with spikes a little better.
@davidwagner Sorry to hear that :( I've increased excess capacity from 100 extra users to 300 (in #1771), which should hopefully deal with spikes a little better. I've also set the minimum number of nodes to 7, to make sure we have capacity for at least 700 users at all times. Again, really sorry this happened again.
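In values terms, the capacity bump described above is roughly the following; this is a sketch, the real change is in #1771, and the node-pool minimum is set on the cloud provider's autoscaler rather than in these values:

```yaml
# Sketch of the capacity bump; see #1771 for the actual change.
scheduling:
  userPlaceholder:
    replicas: 300   # was 100; roughly 300 users of headroom kept warm
# The minimum node count (7 nodes, about 700 users of capacity) is configured
# on the cluster's node-pool autoscaler, outside this values file.
```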
Thank you so much for all your support! I'll let you know how it goes as the week continues.
@yuvipanda datahub is again not available. Is the auto-scaling working at all? Something seems to be going wrong with it. Lecture today was 11am-noon, and I have been unable to get into datahub all lecture long. I've been trying every 10 minutes or so, and repeatedly it just hangs or gives me 'Bad gateway'. Students are reporting that they're unable to get into it either. We encourage them to open up a notebook with our demos and follow along during lecture, and that hasn't been working for students; they're experiencing the same issues. I would have expected datahub to scale up more capacity so that I'd be able to get in if I was patient, but that doesn't seem to be happening. We have labs right after lecture and we rely on students being able to get into datahub; we'll see how that goes.
I'm actively looking at it right now...
Our HTTP response times were through the roof, mostly because of CPU saturation. While we work on fixing that, this at least makes sure we are guaranteed one full CPU. Ref #1746
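In zero-to-jupyterhub terms, guaranteeing the hub a full CPU means setting a CPU request on the hub pod; a sketch, not the literal commit:

```yaml
# Sketch: a CPU request is a scheduling guarantee in Kubernetes terms.
hub:
  resources:
    requests:
      cpu: 1        # reserve one full core for the hub
```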
Helps Prometheus scrape metrics from hubs, which wasn't possible until now thanks to our strict NetworkPolicy. Ref #1746
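Zero-to-jupyterhub's hub NetworkPolicy denies ingress by default, so scraping needs an explicit allow rule. A hedged sketch of what such a rule could look like; the selector label and port here are assumptions for illustration, not the actual datahub config:

```yaml
# Sketch: extra ingress rule letting Prometheus reach the hub's /hub/metrics endpoint.
hub:
  networkPolicy:
    enabled: true
    ingress:
      - ports:
          - port: 8081            # hub API port, where /hub/metrics is served
        from:
          - podSelector:
              matchLabels:
                app: prometheus   # hypothetical label; match your Prometheus pods
```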
Pinning to that version of JupyterHub was giving us JupyterHub 1.0.1, which was causing a lot of performance issues! Ref #1746
Otherwise, Prometheus ends up on random other nodes, hogs a lot of CPU, and affects performance negatively. We should set appropriate requests for *all* core pods. Ref #1746
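With the standard Prometheus Helm chart, pinning the server to the core node pool and giving it explicit requests would look something like the following; the node label and numbers are illustrative assumptions, not the actual datahub values:

```yaml
# Sketch: keep the Prometheus server on the core node pool with explicit requests.
prometheus:
  server:
    nodeSelector:
      hub.jupyter.org/node-purpose: core
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
```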
Primarily to bring in jupyterhub/zero-to-jupyterhub-k8s#1768, which brings in jupyterhub/kubespawner#424 for performance. Ref #1746
I think this was fixed.
David Wagner in the data8 Slack writes:
Looks like again today students experienced problems accessing the datahub servers at noon, right after lecture. We have lecture 11am-noon, and I encourage them to take the vitamin (mini-quiz) right after lecture, so we probably had hundreds of students descending on datahub all at the same time at noon. I noticed that it was slow to load and then gave a "bad gateway" error. It looks like it cleared up after 5-10 minutes. I'm anticipating the same pattern might continue in future weeks. I'm wondering whether it makes sense to do something to give the servers a hint to spool up preemptively a bit before noon, in anticipation of a horde about to descend on them, or whether it makes more sense to let students' browsers hang for a while and let them keep retrying until they eventually get in?
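One way to "spool up preemptively" without touching the autoscaler itself would be to scale the user-placeholder pods up shortly before lecture and back down afterwards. A rough sketch, assuming a Kubernetes CronJob with a service account that is allowed to scale the `user-placeholder` StatefulSet; the name `placeholder-scaler`, the schedule, the replica count, and the image are assumptions for illustration:

```yaml
# Rough sketch: bump placeholder replicas before the 11am-noon lecture so nodes are
# already warm when students descend at noon. A mirror CronJob would scale back down.
# The service account, schedule, and image are assumptions; the RBAC needed to scale
# the user-placeholder StatefulSet is required but not shown here.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prewarm-placeholders
spec:
  schedule: "30 18 * * 1-5"        # ~10:30am Pacific expressed in UTC; adjust for DST
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: placeholder-scaler   # hypothetical; needs scale permissions
          restartPolicy: Never
          containers:
            - name: scale-up
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - scale
                - statefulset/user-placeholder
                - --replicas=400
```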