
Investigate and decrease image proxy upstream request timeout #4570

Closed
sarayourfriend opened this issue Jun 28, 2024 · 3 comments · Fixed by #4695
Assignees
sarayourfriend
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API

Comments

@sarayourfriend
Collaborator

Problem

Image proxy requests time out at 15 seconds. Similar to #4507, thumbnail requests are highly susceptible to upstream provider problems. If Flickr has a bad day, we'll have a bad day because we can't generate thumbnails (or validate dead links, to tie it back to that example).

Description

We should perform an analysis similar to the one I did for #4507. Add logging to the thumbnail requests to time how long they take so we can analyse them. Log the time, status, and URL, and perhaps also the provider slug specifically, so we can more easily segment the analysis along those lines (I wish we had done so for the dead link validation).

Deploy the logging for a week and follow the same approach to analysing the times as I took for the link validation. Find out how long (average, p95, p99) it takes for successful thumbnails to come back, whether there are meaningful differences between providers that we could modulate the timeout based on, and so on.
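
A minimal sketch of what that timing log could look like, assuming an aiohttp-style session is passed in for the upstream fetch; the function and logger names here are hypothetical, not the actual API code:

```python
import logging
import time

logger = logging.getLogger("thumbnail_timing")


async def timed_upstream_fetch(session, upstream_url, provider):
    """Fetch the upstream image and log how long the request took."""
    start = time.monotonic()
    async with session.get(upstream_url) as response:
        body = await response.read()
        elapsed = time.monotonic() - start
        # Log the time, status, URL, and provider slug so the analysis can
        # later be segmented per provider.
        logger.info(
            "thumbnail_upstream_response",
            extra={
                "time": elapsed,
                "status": response.status,
                "url": upstream_url,
                "provider": provider,
            },
        )
        return body
```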

The goal is to figure out whether we can lower the 15-second timeout without creating significant negative consequences. Some failed thumbnails must be accepted, but one thing to consider: if a request specifically times out, we could increase the timeout the next time that thumbnail is requested, and do so progressively until we hit some maximum (whether that's 15 seconds or otherwise).
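
A rough sketch of that progressive escalation, assuming a Redis-backed record of the last timeout tried per thumbnail; the key pattern, step size, and expiry are assumptions for illustration only:

```python
import redis

BASE_TIMEOUT = 3.0
MAX_TIMEOUT = 15.0
TIMEOUT_STEP = 3.0

redis_client = redis.Redis()


def next_timeout(identifier: str) -> float:
    """Return the upstream timeout to use for this thumbnail request."""
    previous = redis_client.get(f"thumb_timeout:{identifier}")
    return min(float(previous), MAX_TIMEOUT) if previous else BASE_TIMEOUT


def record_timeout_failure(identifier: str) -> None:
    """After a timeout, bump the timeout to try on the next request."""
    bumped = min(next_timeout(identifier) + TIMEOUT_STEP, MAX_TIMEOUT)
    # Expire the escalation after a day so the timeout eventually resets.
    redis_client.set(f"thumb_timeout:{identifier}", bumped, ex=60 * 60 * 24)
```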

Additional context

This is mostly a way to investigate whether thumbnail timeouts really need to be so high. If they do, that's fine, but it'd be good to know.

It is not, however, likely to be an alternative to segmenting our API request timing monitors. Thumbnail requests will probably always take a little longer than our other requests, and, along with search, they will always be the requests we most need visibility into, without the speedy outliers or the volume of one obscuring problems with the other.

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: api Related to the Django API labels Jun 28, 2024
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Jun 28, 2024
@sarayourfriend
Collaborator Author

Actually, I'm realising that I don't think this kind of analysis is really necessary. The dead link checks required it because they are a subset of an otherwise complex route. The thumbnails, however, are served by basically only the thumbnail endpoint.

If we look at the response times from the last week for successful thumbnail requests, we can see that the 15-second timeout is only relevant in very extreme cases. The graph below is for only status 200.

[Graph: response times over the last week for thumbnail requests with status 200]

Even a 2-second timeout would suffice for the p95 outside of exceptional cases (like malicious traffic or upstream provider issues).

Compare that to the graph for requests that fail due to upstream issues (424):

[Graph: response times for thumbnail requests that failed with status 424 due to upstream issues]

I think we could really easily and safely lower the timeout to 3 seconds and lose less than 1% of requests that might have otherwise succeeded.

It would be really great to have a per-provider analysis of this, though. And @obulat mentioned that it would be good to check which file extensions are involved, not just which providers. Both would require individual logging, or scraping our logs for the media identifier from the request URL and then querying for that information from the database. To that end, I've attached a spreadsheet of the 8137 thumbnail requests from the last 12 hours that were status 200 but took over 3 seconds.

thumbnails 200 over 3 seconds 2024-06-28 (12 hour period).csv

@obulat in case you're curious enough to do some data digging, feel free to see what you can query based on that! It should be easy enough to query the catalog for this kind of thing, I think 🤔 The only thing is we won't have the additional file type information that the thumbnail route gets from the initial HEAD request. We cache that in Redis though, so if it ends up being a problem we can pull the information from Redis pretty reliably just using the identifiers.
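
For example, a hypothetical sketch of pulling that cached file type information in bulk by identifier; the "filetype:{identifier}" key pattern is an assumption for illustration, not the actual cache layout:

```python
import redis

redis_client = redis.Redis()


def cached_filetypes(identifiers: list[str]) -> dict[str, str | None]:
    """Look up the cached file type for each identifier, if present."""
    keys = [f"filetype:{identifier}" for identifier in identifiers]
    values = redis_client.mget(keys)
    return {
        identifier: value.decode() if value else None
        for identifier, value in zip(identifiers, values)
    }
```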

@sarayourfriend
Collaborator Author

So here's what I've pulled for the images from that CSV:

+-------+------+--------+-----------+-----------+
| jpegs | pngs | flickr | wikimedia | all_count |
|-------+------+--------+-----------+-----------|
| 7924  | 128  | 7422   | 375       | 8133      |
+-------+------+--------+-----------+-----------+
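
A rough sketch of the kind of catalog query that could produce the summary above, given the identifiers from the CSV. The table and column names ("image", "identifier", "provider", "filetype"), the CSV column name, and the connection string are assumptions about the catalog schema, for illustration only:

```python
import csv

import psycopg2

# Assumes the CSV exposes an "identifier" column; in practice the identifier
# may need to be parsed out of the logged request URL instead.
with open("thumbnails_200_over_3_seconds.csv") as f:
    identifiers = [row["identifier"] for row in csv.DictReader(f)]

query = """
    SELECT
        count(*) FILTER (WHERE filetype = 'jpg') AS jpegs,
        count(*) FILTER (WHERE filetype = 'png') AS pngs,
        count(*) FILTER (WHERE provider = 'flickr') AS flickr,
        count(*) FILTER (WHERE provider = 'wikimedia') AS wikimedia,
        count(*) AS all_count
    FROM image
    WHERE identifier = ANY(%s::uuid[])
"""

with psycopg2.connect("dbname=openledger") as conn, conn.cursor() as cur:
    cur.execute(query, (identifiers,))
    print(cur.fetchone())
```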

I don't think this is very helpful, really: it's hard to cross-reference the timing of each one to see if there is a trend, and because I didn't pull a comparison sample from the set of images that were not slow, it's impossible to know whether this summary indicates a specific trend.

I think the thing to do is lower the timeout to 4 seconds, which is really conservative, but much better than what it is today. Then add more logging if we want to investigate further. What's clear to me is that the current 15-second timeout is unnecessary, and presents more of an SRE problem than it solves any issue of slow-to-respond providers.

In lowering the timeout, however, we should make sure that timed-out responses are not cached in Cloudflare, and that they are treated perhaps a bit more gently than outright upstream failures when it comes to the flaky-thumbnail cache. Similar to the dead link check, it would be good if timed-out thumbnails got to retry fairly soon afterwards if they keep being requested, potentially with progressively higher timeouts if necessary.
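
One way to keep those timeouts out of the CDN cache, sketched under the assumption of a Django view wrapping the upstream fetch; the helper names are hypothetical and the real handling may differ:

```python
import asyncio

from django.http import HttpResponse


async def thumbnail_response(upstream_fetch):
    """Return the proxied thumbnail, marking timeouts as uncacheable."""
    try:
        body, content_type = await upstream_fetch()
    except asyncio.TimeoutError:
        response = HttpResponse(status=424)
        # Keep transient upstream slowdowns from being pinned in Cloudflare.
        response["Cache-Control"] = "no-store"
        return response
    return HttpResponse(body, content_type=content_type)
```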

@AetherUnbound
Collaborator

Thanks for this awesome analysis!

I think the thing to do is lower the timeout to 4 seconds, which is really conservative, but much better than what it is today. Then add more logging if we want to investigate further.
...it would be good if timed-out thumbnails got to retry fairly soon afterwards if they keep being requested, potentially with progressively higher timeouts if necessary.

I think those three courses of action sound perfect here, given the data we have:

  • Decrease the timeout to 4 seconds for thumbnail requests (a configuration sketch follows this list)
  • Add logging to investigate timeouts further
  • Add retries with backoff and progressively higher timeouts to capture the exceptional cases
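
For the first item, a minimal sketch of what the decreased timeout could look like, assuming the thumbnail proxy fetches upstream images with aiohttp and that the value is configurable via an environment variable; both are assumptions for illustration, not the actual configuration:

```python
import os

import aiohttp

# Hypothetical setting; defaults to the proposed 4 seconds.
THUMBNAIL_UPSTREAM_TIMEOUT = float(os.getenv("THUMBNAIL_UPSTREAM_TIMEOUT", "4.0"))


async def fetch_upstream(url: str) -> bytes:
    """Fetch the upstream image, failing fast after the configured timeout."""
    timeout = aiohttp.ClientTimeout(total=THUMBNAIL_UPSTREAM_TIMEOUT)
    # A session per request keeps the sketch self-contained; a real service
    # would reuse a shared session.
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            response.raise_for_status()
            return await response.read()
```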

@sarayourfriend sarayourfriend moved this from 📋 Backlog to 📅 To Do in Openverse Backlog Aug 1, 2024
@sarayourfriend sarayourfriend self-assigned this Aug 1, 2024
@openverse-bot openverse-bot moved this from 📅 To Do to 🏗 In Progress in Openverse Backlog Aug 1, 2024
@openverse-bot openverse-bot moved this from 🏗 In Progress to ✅ Done in Openverse Backlog Aug 8, 2024