Investigate and decrease image proxy upstream request timeout #4570
Actually, I'm realising that I don't think this kind of analysis is really necessary. The dead link requests are a subset of an otherwise complex route; the thumbnails, however, are basically only the thumbnail endpoint. If we look at the response times from the last week for successful thumbnail requests, we can see that the 15-second timeout is only relevant in very extreme cases. The graph below is for status 200 only. Even a 2-second timeout would suffice for the p95 outside of an exceptional case (like malicious traffic or upstream provider issues). Compare that to the graph for requests that fail due to upstream issues (424).

I think we could really easily and safely lower the timeout to 3 seconds and lose less than 1% of requests that might have otherwise succeeded. It would be really great to have a per-provider analysis on this, though. And @obulat mentioned that it would be good to check which file extensions are affected, not just which providers. Both would require individual logging, or scraping our logs for the media identifier from the request URL and then querying the database for that information.

To that end, I've attached a spreadsheet of the 8137 thumbnail requests from the last 12 hours that were status 200 but took over 3 seconds: thumbnails 200 over 3 seconds 2024-06-28 (12 hour period).csv

@obulat in case you're curious enough to do some data digging, feel free to see what you can query based on that! Should be easy enough to query the catalog for this kind of thing, I think 🤔 The only thing is we won't have the additional file type information that the thumbnail route gets from the initial HEAD request. We cache that in Redis, though, so if it ends up being a problem we can pull the information from Redis pretty reliably just using the identifiers.
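A rough sketch of what that catalog query could look like, assuming the CSV exposes an `identifier` column parsed from the request URL and that the catalog's `image` table has `identifier` and `provider` columns (both are assumptions; adjust to the real schema and file name):

```python
import csv

import psycopg2  # any Postgres client would do

# Hypothetical CSV file and column names; take them from the attached spreadsheet.
with open("thumbnails_200_over_3_seconds.csv") as f:
    identifiers = [row["identifier"] for row in csv.DictReader(f)]

# Hypothetical connection string for the catalog database.
conn = psycopg2.connect("dbname=openledger")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT provider, count(*)
        FROM image
        WHERE identifier = ANY(%s::uuid[])
        GROUP BY provider
        ORDER BY count(*) DESC
        """,
        (identifiers,),
    )
    for provider, count in cur.fetchall():
        print(provider, count)
```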
So here's what I've pulled for the images from that CSV:
I don't think this is very helpful, really, primarily because it's harder to cross-reference the timing of each one and see if there is a trend, and because I didn't pull a sample from the set of images that did not time out, so it's impossible to know whether this summary indicates a specific trend.

I think the thing to do is lower the timeout to 4 seconds, which is really conservative but much better than what it is today, then add more logging if we want to investigate further. What's clear to me is that the current 15-second timeout is unnecessary and presents more of an SRE problem than it solves any issue of slow-to-respond providers.

In lowering the timeout, however, we should make sure that timed-out responses are not cached in Cloudflare, and that they are treated perhaps a bit more gently than outright upstream failures when it comes to the flaky thumbnail cache. Similar to the dead link check, it would be good if timed-out thumbnails get retried fairly soon afterwards if they keep getting requested, potentially even with progressively higher timeouts if necessary.
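On the Cloudflare point, here is a minimal sketch of the idea (not the actual Openverse thumbnail view; the names are placeholders): when the upstream request times out, return the failure with a `Cache-Control` header that keeps it out of the edge cache, so the next request gets a fresh attempt.

```python
from django.http import HttpResponse


def upstream_timeout_response() -> HttpResponse:
    """Build the response returned when the upstream image request times out."""
    response = HttpResponse(status=424)
    # `no-store` is a standard directive that Cloudflare honours, so a
    # timed-out result is never served back from the edge cache.
    response["Cache-Control"] = "no-store"
    return response
```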
Thanks for this awesome analysis!
I think those three courses of action sound perfect here, given the data we have.
Problem
Image proxy requests time out at 15 seconds. Similar to #4507, thumbnail requests are highly susceptible to upstream provider problems. If Flickr has a bad day, we'll have a bad day because we can't generate thumbnails (or validate dead links, to tie it back to that example).
Description
We should perform a similar analysis to the one I did for #4507. Add logging to the thumbnail requests to time how long they take so we can analyse them. Log the time, status, URL, and maybe go ahead and log the `provider` slug specifically so we can more easily segment the analysis along those lines (I wish we had done so for the dead link validation). Deploy the logs for a week and follow the same approach to analysing the times as I took for the link validation. Find out how long (average, p95, p99) it takes for successful thumbnails to come back. Find out if there are meaningful differences between providers that we could modulate the timeout based on. Etc.
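A hedged sketch of what that logging could look like; the helper name, logger configuration, and how the `provider` slug is resolved are illustrative rather than the real Openverse implementation:

```python
import logging
import time

import requests

logger = logging.getLogger(__name__)


def fetch_upstream_image(url: str, provider: str, timeout: float = 15.0) -> requests.Response:
    """Fetch an upstream image, logging how long the request took."""
    start = time.monotonic()
    status = None
    try:
        response = requests.get(url, timeout=timeout)
        status = response.status_code
        return response
    finally:
        # One structured log line per upstream request: time, status, url, provider.
        logger.info(
            "thumbnail_upstream_request",
            extra={
                "time": time.monotonic() - start,
                "status": status,
                "url": url,
                "provider": provider,
            },
        )
```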
The goal is to figure out if we can lower the 15-second timeout without creating significant negative consequences. Some failed thumbnails must be accepted, but something to consider is whether, if a request times out (specifically times out), we could try increasing the timeout the next time that thumbnail gets requested, and do so progressively until we hit some maximum (whether that's 15 seconds or otherwise).
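A sketch of that progressive escalation, assuming a Redis client is available; the key prefix, step values, and expiry are made-up placeholders, not part of the existing thumbnail code:

```python
import redis

# Escalating timeouts, in seconds, capped at the current 15-second maximum.
TIMEOUT_STEPS = [3, 6, 10, 15]

client = redis.Redis()


def timeout_for(identifier: str) -> float:
    """Pick a longer timeout each time this thumbnail has previously timed out."""
    failures = int(client.get(f"thumb_timeout:{identifier}") or 0)
    return TIMEOUT_STEPS[min(failures, len(TIMEOUT_STEPS) - 1)]


def record_timeout(identifier: str) -> None:
    """Count a timeout, but let it expire after an hour so retries happen soon."""
    key = f"thumb_timeout:{identifier}"
    client.incr(key)
    client.expire(key, 60 * 60)
```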
Additional context
This is mostly a way to investigate whether thumbnail timeouts really need to be so high. If they do, that's fine, but it'd be good to know.
It is not, however, likely to be an alternative to segmenting our API request timing monitors. Thumbnail requests will probably always take a little longer than our other requests, and, along with search, they will always be the most important to keep visibility into without letting the speedy outliers or the volume of one obscure problems with the other.