
Investigate and decrease image proxy upstream request timeout #4570

Closed
sarayourfriend opened this issue Jun 28, 2024 · 3 comments · Fixed by #4695
Assignees
sarayourfriend
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API

Comments

@sarayourfriend
Collaborator

Problem

Image proxy requests time out at 15 seconds. Similar to #4507, thumbnail requests are highly susceptible to upstream provider problems. If Flickr has a bad day, we'll have a bad day because we can't generate thumbnails (or validate dead links, to tie it back to that example).

Description

We should perform an analysis similar to the one I did for #4507. Add logging to the thumbnail requests to time how long they take so we can analyse them. Log the time, status, and URL, and perhaps also the provider slug specifically, so we can more easily segment the analysis along those lines (I wish we had done so for the dead link validation).

Deploy the logging for a week and follow the same approach to analysing the times as I took for the link validation. Find out how long (average, p95, p99) it takes for successful thumbnails to come back, whether there are meaningful differences between providers that we could modulate the timeout based on, and so on.
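
A minimal sketch of what that timing log could look like, assuming an aiohttp-style session is passed in for the upstream fetch; the function and logger names here are hypothetical, not the actual API code:

```python
import logging
import time

logger = logging.getLogger("thumbnail_timing")


async def timed_upstream_fetch(session, upstream_url, provider):
    """Fetch the upstream image and log how long the request took."""
    start = time.monotonic()
    async with session.get(upstream_url) as response:
        body = await response.read()
        elapsed = time.monotonic() - start
        # Log the time, status, URL, and provider slug so the analysis can
        # later be segmented per provider.
        logger.info(
            "thumbnail_upstream_response",
            extra={
                "time": elapsed,
                "status": response.status,
                "url": upstream_url,
                "provider": provider,
            },
        )
        return body
```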

The goal is to figure out whether we can lower the 15-second timeout without creating significant negative consequences. Some failed thumbnails must be accepted, but one thing to consider: if a request specifically times out, we could increase the timeout the next time that thumbnail is requested, and do so progressively until we hit some maximum (whether that's 15 seconds or otherwise).
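
A rough sketch of that progressive escalation, assuming a Redis-backed record of the last timeout tried per thumbnail; the key pattern, step size, and expiry are assumptions for illustration only:

```python
import redis

BASE_TIMEOUT = 3.0
MAX_TIMEOUT = 15.0
TIMEOUT_STEP = 3.0

redis_client = redis.Redis()


def next_timeout(identifier: str) -> float:
    """Return the upstream timeout to use for this thumbnail request."""
    previous = redis_client.get(f"thumb_timeout:{identifier}")
    return min(float(previous), MAX_TIMEOUT) if previous else BASE_TIMEOUT


def record_timeout_failure(identifier: str) -> None:
    """After a timeout, bump the timeout to try on the next request."""
    bumped = min(next_timeout(identifier) + TIMEOUT_STEP, MAX_TIMEOUT)
    # Expire the escalation after a day so the timeout eventually resets.
    redis_client.set(f"thumb_timeout:{identifier}", bumped, ex=60 * 60 * 24)
```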

Additional context

This is mostly a way to investigate whether thumbnail timeouts really need to be so high. If they do, that's fine, but it'd be good to know.

It is not, however, likely to be an alternative to segmenting our API request timing monitors. Thumbnail requests will probably always take a little longer than our other requests, and, along with search, they will always be the requests we most need visibility into, without the speedy outliers or the volume of one obscuring problems with the other.

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: api Related to the Django API labels Jun 28, 2024
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Jun 28, 2024
@sarayourfriend
Collaborator Author

Actually, I'm realising that I don't think this kind of analysis is really necessary. The dead link checks required it because they are a subset of an otherwise complex route. The thumbnails, however, are served by basically only the thumbnail endpoint.

If we look at the response times from the last week for successful thumbnail requests, we can see that the 15-second timeout is only relevant in very extreme cases. The graph below is for only status 200.

[Graph: response times over the last week for thumbnail requests with status 200]

Even a 2-second timeout would suffice for the p95 outside of exceptional cases (like malicious traffic or upstream provider issues).

Compare that to the graph for requests that fail due to upstream issues (424):

[Graph: response times for thumbnail requests that failed with status 424 due to upstream issues]

I think we could really easily and safely lower the timeout to 3 seconds and lose less than 1% of requests that might have otherwise succeeded.

It would be really great to have a per-provider analysis of this, though. And @obulat mentioned that it would be good to check which file extensions are involved, not just which providers. Both would require individual logging, or scraping our logs for the media identifier from the request URL and then querying for that information from the database. To that end, I've attached a spreadsheet of the 8137 thumbnail requests from the last 12 hours that were status 200 but took over 3 seconds.

thumbnails 200 over 3 seconds 2024-06-28 (12 hour period).csv

@obulat in case you're curious enough to do some data digging, feel free to see what you can query based on that! It should be easy enough to query the catalog for this kind of thing, I think 🤔 The only thing is we won't have the additional file type information that the thumbnail route gets from the initial HEAD request. We cache that in Redis though, so if it ends up being a problem we can pull the information from Redis pretty reliably just using the identifiers.
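
For example, a hypothetical sketch of pulling that cached file type information in bulk by identifier; the "filetype:{identifier}" key pattern is an assumption for illustration, not the actual cache layout:

```python
import redis

redis_client = redis.Redis()


def cached_filetypes(identifiers: list[str]) -> dict[str, str | None]:
    """Look up the cached file type for each identifier, if present."""
    keys = [f"filetype:{identifier}" for identifier in identifiers]
    values = redis_client.mget(keys)
    return {
        identifier: value.decode() if value else None
        for identifier, value in zip(identifiers, values)
    }
```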

@sarayourfriend
Collaborator Author

So here's what I've pulled for the images from that CSV:

+-------+------+--------+-----------+-----------+
| jpegs | pngs | flickr | wikimedia | all_count |
|-------+------+--------+-----------+-----------|
| 7924  | 128  | 7422   | 375       | 8133      |
+-------+------+--------+-----------+-----------+
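
A rough sketch of the kind of catalog query that could produce the summary above, given the identifiers from the CSV. The table and column names ("image", "identifier", "provider", "filetype"), the CSV column name, and the connection string are assumptions about the catalog schema, for illustration only:

```python
import csv

import psycopg2

# Assumes the CSV exposes an "identifier" column; in practice the identifier
# may need to be parsed out of the logged request URL instead.
with open("thumbnails_200_over_3_seconds.csv") as f:
    identifiers = [row["identifier"] for row in csv.DictReader(f)]

query = """
    SELECT
        count(*) FILTER (WHERE filetype = 'jpg') AS jpegs,
        count(*) FILTER (WHERE filetype = 'png') AS pngs,
        count(*) FILTER (WHERE provider = 'flickr') AS flickr,
        count(*) FILTER (WHERE provider = 'wikimedia') AS wikimedia,
        count(*) AS all_count
    FROM image
    WHERE identifier = ANY(%s::uuid[])
"""

with psycopg2.connect("dbname=openledger") as conn, conn.cursor() as cur:
    cur.execute(query, (identifiers,))
    print(cur.fetchone())
```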

I don't think this is very helpful, really: it's hard to cross-reference the timing of each one to see if there is a trend, and because I didn't pull a comparison sample from the set of images that were not slow, it's impossible to know whether this summary indicates a specific trend.

I think the thing to do is lower the timeout to 4 seconds, which is really conservative, but much better than what it is today. Then add more logging if we want to investigate further. What's clear to me is that the current 15-second timeout is unnecessary, and presents more of an SRE problem than it solves any issue of slow-to-respond providers.

In lowering the timeout, however, we should make sure that timed-out responses are not cached in Cloudflare, and that they are treated perhaps a bit more gently than outright upstream failures when it comes to the flaky-thumbnail cache. Similar to the dead link check, it would be good if timed-out thumbnails got to retry fairly soon afterwards if they keep being requested, potentially with progressively higher timeouts if necessary.
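
One way to keep those timeouts out of the CDN cache, sketched under the assumption of a Django view wrapping the upstream fetch; the helper names are hypothetical and the real handling may differ:

```python
import asyncio

from django.http import HttpResponse


async def thumbnail_response(upstream_fetch):
    """Return the proxied thumbnail, marking timeouts as uncacheable."""
    try:
        body, content_type = await upstream_fetch()
    except asyncio.TimeoutError:
        response = HttpResponse(status=424)
        # Keep transient upstream slowdowns from being pinned in Cloudflare.
        response["Cache-Control"] = "no-store"
        return response
    return HttpResponse(body, content_type=content_type)
```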

@AetherUnbound
Collaborator

Thanks for this awesome analysis!

I think the thing to do is lower the timeout to 4 seconds, which is really conservative, but much better than what it is today. Then add more logging if we want to investigate further.
...it would be good if timed-out thumbnails got to retry fairly soon afterwards if they keep being requested, potentially with progressively higher timeouts if necessary.

I think those three courses of action sound perfect here, given the data we have:

  • Decrease the timeout to 4 seconds for thumbnail requests (a configuration sketch follows this list)
  • Add logging to investigate timeouts further
  • Add retries with backoff and progressively higher timeouts to capture the exceptional cases
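
For the first item, a minimal sketch of what the decreased timeout could look like, assuming the thumbnail proxy fetches upstream images with aiohttp and that the value is configurable via an environment variable; both are assumptions for illustration, not the actual configuration:

```python
import os

import aiohttp

# Hypothetical setting; defaults to the proposed 4 seconds.
THUMBNAIL_UPSTREAM_TIMEOUT = float(os.getenv("THUMBNAIL_UPSTREAM_TIMEOUT", "4.0"))


async def fetch_upstream(url: str) -> bytes:
    """Fetch the upstream image, failing fast after the configured timeout."""
    timeout = aiohttp.ClientTimeout(total=THUMBNAIL_UPSTREAM_TIMEOUT)
    # A session per request keeps the sketch self-contained; a real service
    # would reuse a shared session.
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            response.raise_for_status()
            return await response.read()
```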

@sarayourfriend sarayourfriend moved this from 📋 Backlog to 📅 To Do in Openverse Backlog Aug 1, 2024
@sarayourfriend sarayourfriend self-assigned this Aug 1, 2024
@openverse-bot openverse-bot moved this from 📅 To Do to 🏗 In Progress in Openverse Backlog Aug 1, 2024
@openverse-bot openverse-bot moved this from 🏗 In Progress to ✅ Done in Openverse Backlog Aug 8, 2024