Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include request span in spec_compliant mode #5282

Closed
smyrick opened this issue May 29, 2024 · 14 comments · Fixed by #5386
Closed

Include request span in spec_compliant mode #5282

smyrick opened this issue May 29, 2024 · 14 comments · Fixed by #5386
Assignees

Comments

@smyrick
Copy link
Member

smyrick commented May 29, 2024

Is your feature request related to a problem? Please describe.

In deprecated mode the Router adds the request span

SpanMode::SpecCompliant => {
unreachable!("this code path should not be reachable, this is a bug!")
}

Screenshot 2024-05-29 at 1 59 56 PM

In the new spec_compliant mode this span was removed

Screenshot 2024-05-29 at 1 59 13 PM

Describe the solution you'd like

I would like to include this span so that the service entry point from DataDog is consistent

Describe alternatives you've considered

I could possibly add a custom collector to rename spans

Additional context

This will also help keep span name cardinality down even if this span is added back: #5261

@bnjjj
Copy link
Contributor

bnjjj commented May 30, 2024

I'm not sure to understand your need. Indeed there is not request span in spec_compliant mode because it's not compliant with the specs. That's why we introduced this mode. And it's going to be the default mode in router 2.x because that's how it should work to be specs compliant. If you still want to have a request span then you should use the deprecated mode.

@smyrick
Copy link
Member Author

smyrick commented May 30, 2024

@bnjjj Can you share what part of the Otel spec we are complying with? I linked above another issue but currently in spec_compliant mode the service entry span has a different name everytime because it is now the router span and that is messing up DataDog dashboards. This may help solve that #5261

But what is the spec that says there can not be a span called request?

@bnjjj
Copy link
Contributor

bnjjj commented Jun 3, 2024

The one mentioned by Bryn here

@Samjin
Copy link

Samjin commented Jun 4, 2024

The current state of the GraphQL span conventions is experimental, and it lacks explicit guidance on whether a root span should adhere to the same standards considering it originates from an HTTP server request rather than a direct GraphQL operation. IMO, it doesn't make sense to have unlimited number of root spans for a single app https://opentelemetry.io/docs/specs/semconv/graphql/graphql-spans/

Given that the specification for GraphQL spans has not been officially finalized by OTEL, it may not be necessary to strict adherence to the experimental spec. Adopting a more flexible stance could facilitate a smoother transition away from deprecated modes.

Additionally, this approach can provide better support for Datadog's integration as highlighted in this discussion.

@BrynCooke
Copy link
Contributor

BrynCooke commented Jun 6, 2024

I've added a suggestion for us to implement here: #5261 (comment)

I also agree that dynamic span names is strange, not only in the in the semantic conventions, but also in the Datadog exporter in otel rust there is span mapping functionality that also does this.

@abernix
Copy link
Member

abernix commented Jun 7, 2024

@Samjin Could you please confirm that the suggestion in @BrynCooke's comment above would satisfactorily close this issue as well? Thanks! 🙇

@Samjin
Copy link

Samjin commented Jun 7, 2024

@Samjin Could you please confirm that the suggestion in @BrynCooke's comment above would satisfactorily close this issue as well? Thanks! 🙇

Thank you all for adding the option 👍 .
@abernix I think the original issue is request span is missing rather than router span use otel naming pattern. Personally I'd prefer to have request span representing the path or route a request goes to before router, but If apollo's plan is to have router being the root or service entry span, then yes, we can close it.

Currently, I'm uncertain whether the lack of a request span will result in any missing POST/graphql related resource metrics in Datadog. For instance, our gateway categorizes various endpoints such as OPTIONS, POST, and GET as resources, and DD automatically generates service mappings and default metrics for these. Without the request span, this data might not be captured from http layer.

image image

If we designate the router span as the service entry point, a possible workaround could involve updating dd resource mapping with something like map.insert("router", "http.route"), even though the router service isn't technically the HTTP server, to my knowledge. Additionally, supergraph could utilze <graphql.operation.name> or <graphql.operation.type> <graphql.operation.name>. However, If router service has any heavy job or external http calls for proceed certain graphql operations, the supergraph metrics per operation would not be able to capture it.

@bnjjj
Copy link
Contributor

bnjjj commented Jun 10, 2024

@Samjin your suggestion about the resource mapping is valid. I opened a draft PR to fix this. It will require some testing to be able to merge it.

@Samjin
Copy link

Samjin commented Jun 10, 2024

@Samjin your suggestion about the resource mapping is valid. I opened a draft PR to fix this. It will require some testing to be able to merge it.

Based on the draft PR, It looks like request span will not be added back for some reason. Could you please elaborate why that's the case? I was hoping to have request span added back because there is already option to opt out the odd otel spec for root span. That means request can remain as static span name and added back?

router with http.route resource has drawbacks as mentioned above. It'll became a tough choice for user who must lose either http's resource or router's operation resource. If I must choose between them, I think we'd still prefer to have the operation name as router's resource so user can see all the key metrics of a operation starting from router service, such as latency of each operation.

That said, with proper span resource, Datadog user can exclude some custom instrumentations from apollo and save cost eventually. e.g. DD user can use trace.<span_name> <resource_name> for errors, hit as count type and also as a histogram type for latency.

@Samjin
Copy link

Samjin commented Jun 10, 2024

Additionally, if request won't be added, it'd be great if the resource mapping can be configurable so user can choose what resource they need without asking Apollo.

@BrynCooke
Copy link
Contributor

BrynCooke commented Jun 12, 2024

Here's some background to the thinking on this:

We're trying to get to the situation where APMs that natively understand otel specs can work out of the box. So this means following the otel spec as closely as we can.

The lifecycle stages that we have in the router are:

  • router (http server)
  • supergraph (graphql server)
  • subgraph (graphql client)
  • http_client (http client)

The request span was always a synthetic span that didn't really correspond to anything in the code.

The router span doesn't have anything graphql specific in it, it's just bytes at this point, and so we'd like this to match - https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-server

The supergraph span is where we know we are dealing with a graphql request and will be - https://opentelemetry.io/docs/specs/semconv/graphql/graphql-spans/

Subgraph spans are not mentioned in the spec.

Http client - should match https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client

Going forward
So on reading the specs again I think that the following should happen:

I think your idea of allowing more configurable span mappings is a good one would the following be sufficient?

telemetry.spans.router.otel.name = (http|graphql|<hard coded value>) (defaults to http eventually, but initially it'll be still on graphql to match current behaviour)
telemetry.spans.supergraph.otel.name = (graphql|<hard coded value>) (defaults to graphql)

Where:

@Samjin Would this be sufficient? This would give you the span naming patterns that you desire without reintroducing the request span.

@Samjin
Copy link

Samjin commented Jun 13, 2024

where APMs that natively understand otel specs can work out of the box.

Thank you, @BrynCooke, for the clarification provided. I am uncertain that APMs(in our case dd APM) are fully capable of interpreting the otel spec without some form of explicit instrumentation at this stage.

That said I think it's fine we proceed with the approach you've suggested for now, and we'll get better understanding through testing.

@BrynCooke
Copy link
Contributor

I'll be doing some testing on datadog as part of this work, so will circle back when we know more.

@Samjin
Copy link

Samjin commented Jun 14, 2024

https://www.datadoghq.com/blog/node-monitoring-apm/#out-of-the-box-support-for-nodejs-monitoring
https://docs.datadoghq.com/tracing/trace_collection/automatic_instrumentation/dd_libraries/nodejs/#integrations

DD auto instrument common used framework such as Express.js where app would create express.request span out of box, but I don't see Rust being one of it in the list above.

I also looked up universal-service-monitoring which would provide universal.http.server out of box and we already have it today. However, comparing it to express.request from nodejs gateway, the APM results are drastically different. e.g express.request would get 100 million reqs, but USM would get 10 million only for /graphql for a same period. I wouldn't expect to use universal.http.server at all since this is a fairly new feature that's not designed for tracing. Eg. it doesn't collect traces so it can't correlate to them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants