Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Be able to correlate timeouts in reverse-proxy layer in front of Synapse (tag specific request headers) #13786

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Sep 12, 2022

Add opentracing.request_headers_to_tag config to specify which request headers extract and tag traces with in order to correlate timeouts in reverse-proxy layers in front of Synapse with traces.

Fix #13685

Alternative solution: #13801


Jaeger span with a test request header tagged

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

@MadLittleMods MadLittleMods added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Sep 12, 2022
Comment on lines +177 to +189
# Tag any headers that we need to extract from the request. This
# is useful to specify any headers that a reverse-proxy in front
# of Synapse may be sending to correlate and match up something
# in that layer to a Synapse trace. ex. when Cloudflare times
# out it gives a `cf-ray` header which we can also tag here to
# find the trace.
for header_name in self.hs.config.tracing.request_headers_to_tag:
headers = request.requestHeaders.getRawHeaders(header_name)
if headers is not None:
parent_span.set_tag(
SynapseTags.REQUEST_HEADER_PREFIX + header_name,
str(headers[0]),
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested that this works locally but I'm not sure if the cf-ray header that we want to use on matrix.org is actually a thing.

Based on the following documentation and that we see it in our haproxy logs already, I assume this will be viable,

Add the CF-Ray header to your origin web server logs to match requests proxied to Cloudflare to requests in your server logs.

@MadLittleMods MadLittleMods marked this pull request as ready for review September 13, 2022 01:51
@MadLittleMods MadLittleMods requested a review from a team as a code owner September 13, 2022 01:51
if headers is not None:
parent_span.set_tag(
SynapseTags.REQUEST_HEADER_PREFIX + header_name,
str(headers[0]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the call to str? If header_name is a string, then headers[0] will be a string.

See:

https://github.com/twisted/twisted/blob/7c00738a0c24032070ce92304b4c8b887666c0fc/src/twisted/web/http_headers.py#L256-L258

headers = request.requestHeaders.getRawHeaders(header_name)
if headers is not None:
parent_span.set_tag(
SynapseTags.REQUEST_HEADER_PREFIX + header_name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really familiar with opentracing tags, is request_header.cf-ray an OK thing to do? Or do we need to normalize that to something like request_header.cf_ray or whatever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tag can be whatever. And since it's the name of a header, I think it's probably easiest to follow with the actual header name. Similar to what we do with function arguments but those happen to conform to the Python snake casing already.

These are the OpenTelemetry docs but here are some naming conventions, https://opentelemetry.io/docs/reference/specification/common/attribute-naming/

Comment on lines +3615 to +3619
* `request_headers_to_tag`: A list of headers to extract from the request and
add to to the top-level servlet tracing span as tags. Useful when you're using
a reverse proxy service like Cloudflare to protect your Synapse instance in
order to correlate and match up requests that timed out at the Cloudflare
layer to the Synapse traces.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to throw a wrench in these plans, but if other reverse proxies do something similar; I wonder if it would make more sense to have a header which could be the source for SynapseRequest.request_id (instead of generating a new one):

def get_request_id(self) -> str:
return "%s-%i" % (self.get_method(), self.request_seq)

Would that be clearer instead of having two separate IDs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Seems reasonable to me and a little cleaner. I'll create a new PR with this solution ⏩

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this would make it harder to track things in the logs or anything, but seems like it would be a lot cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried this out in #13801

Losing the simplicity of the incrementing integer to reference in logs as you scroll kinda sucks though. I'm leaning back to this option I think 🤔

Copy link
Member

@clokep clokep Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Losing the simplicity of the incrementing integer to reference in logs as you scroll kinda sucks though. I'm leaning back to this option I think 🤔

Is that really used though? I usually just copy and paste for grep. Once the number gets beyond a few digits it is useless anyway IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not the one having to log dive and debug big Synapse instances so I don't know. In tests, we will still have the same sequential request ID's since this config won't be defined. We're not changing any default either so it's basically opt-in.

If anyone has a strong desire for the old way, we can always revert as well. If you're happy with #13801, we can move forward with that ⏩

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like we should ask folks real quick then!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MadLittleMods
Copy link
Contributor Author

Closing as it seems we're settling on the alternative #13801

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Be able to correlate traces even when Cloudflare times out
2 participants