OTLP/HTTP receiver does not respond with google.rpc.Status on 400 #1357

Closed
fbogsany opened this issue Jul 15, 2020 · 4 comments · Fixed by #1788
Labels
bug (Something isn't working), help wanted (Good issue for contributors to OpenTelemetry Service to pick up)

Comments

@fbogsany

Describe the bug
The OTLP/HTTP OTEP states:

Response body for all HTTP 4xx and HTTP 5xx responses MUST be a ProtoBuf-encoded Status message that describes the problem.

The collector's HTTP 400 response does not appear to contain a Protobuf-encoded google.rpc.Status message. The response body is Protobuf-encoded, but I can't tell what type it is.

Steps to reproduce
This was discovered while debugging #1344, so these steps only work with that bug present:

  1. Set up the Collector with the otlp receiver.
  2. Send an OTLP/HTTP request with a “Content-Encoding: gzip” request header and gzipped content.

What did you expect to see?
HTTP 400 response with a Protobuf-encoded google.rpc.Status message. This is specified as:

message Status {
  int32 code = 1;
  string message = 2;
  repeated google.protobuf.Any details = 3;
}
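
For context, this is what a spec-compliant client expects to be able to do with any 4xx/5xx response body (a minimal Go sketch, assuming the genproto google.rpc.Status type; decodeStatus is an illustrative helper, not collector code):

import (
  "io/ioutil"
  "net/http"

  "github.com/golang/protobuf/proto"
  spb "google.golang.org/genproto/googleapis/rpc/status"
)

// decodeStatus parses an OTLP/HTTP error response body as a google.rpc.Status.
func decodeStatus(resp *http.Response) (*spb.Status, error) {
  body, err := ioutil.ReadAll(resp.Body)
  if err != nil {
    return nil, err
  }
  st := &spb.Status{}
  if err := proto.Unmarshal(body, st); err != nil {
    return nil, err // this is the step that fails against v0.5.0
  }
  return st, nil
}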

What did you see instead?
Debugging in Ruby, I saw:

(byebug) Google::Rpc::Status.decode(response.body)
*** Google::Protobuf::ParseError Exception: Error occurred during parsing: Invalid wire type
(byebug) response
#<Net::HTTPBadRequest 400 Bad Request readbody=true>
(byebug) response.body.b
"\n\x19proto: illegal wireType 7\x10\x03\x1A\x19proto: illegal wireType 7"

Picking this apart manually as a Protobuf message, we have:

0b1010 => field_number = 1, wire_type = 2 (length-encoded)
0x19 => length = 25
"proto: illegal wireType 7"
0b10000 => field_number = 2, wire_type = 0 (varint)
0x3
0b11010 => field_number = 3, wire_type = 2 (length-encoded)
0x19 => length = 25
"proto: illegal wireType 7"

which looks like a message type:

message Foo {
  string bar = 1;
  int32 baz = 2;
  string quux = 3;
}
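
The same walk can be done mechanically with Go's protowire package (a sketch; the byte literal is the response body captured above):

package main

import (
  "fmt"

  "google.golang.org/protobuf/encoding/protowire"
)

func main() {
  // Raw 400 response body from the collector.
  b := []byte("\n\x19proto: illegal wireType 7\x10\x03\x1a\x19proto: illegal wireType 7")
  for len(b) > 0 {
    num, typ, n := protowire.ConsumeTag(b)
    if n < 0 {
      panic(protowire.ParseError(n))
    }
    b = b[n:]
    switch typ {
    case protowire.BytesType: // wire_type = 2 (length-encoded)
      v, n := protowire.ConsumeBytes(b)
      fmt.Printf("field %d (bytes): %q\n", num, v)
      b = b[n:]
    case protowire.VarintType: // wire_type = 0 (varint)
      v, n := protowire.ConsumeVarint(b)
      fmt.Printf("field %d (varint): %d\n", num, v)
      b = b[n:]
    default:
      fmt.Printf("unexpected wire type %d\n", typ)
      return
    }
  }
}

It prints field 1 (bytes), field 2 (varint) = 3, and field 3 (bytes), matching the breakdown above.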

What version did you use?
v0.5.0

What config did you use?

receivers:
  otlp:
    protocols:
      http:
exporters:
  jaeger:
    insecure: true
    endpoint: shopify-tracing.railgun:14250
extensions:
  zpages:
service:
  extensions: [zpages]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger]

Environment
OS: macOS Catalina 10.15.5 (19F101)
Compiler (if manually compiled): go version go1.13.3 darwin/amd64

Additional context
N/A

fbogsany added the bug label on Jul 15, 2020
@fbogsany
Author

Digging into the gRPC Gateway code, it looks like this is actually an internal.Error type that aims to be compatible at the interface level with google.rpc.Status, but is not compatible at the wire level.
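
Reconstructed from the wire dump above, that internal type plausibly looks like this (field names are my reading of the gateway source, so treat them as approximate):

message Error {
  string error = 1;                         // "proto: illegal wireType 7"
  int32 code = 2;                           // 3 (grpc codes.InvalidArgument)
  string message = 3;                       // "proto: illegal wireType 7" again
  repeated google.protobuf.Any details = 4;
}

Since google.rpc.Status puts code in field 1 and message in field 2, decoding one type as the other fails with a wire-type mismatch, exactly as seen in the Ruby session above.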

@fbogsany
Author

The fix is likely to pass a custom error handler to NewServeMux, using gatewayruntime.WithProtoErrorHandler:

r.gatewayMux = gatewayruntime.NewServeMux(
  gatewayruntime.WithMarshalerOption("application/x-protobuf", &xProtobufMarshaler{}),
)
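
A minimal sketch of such a handler (assuming grpc-gateway v1's runtime API; protoErrorHandler is illustrative, not necessarily what #1788 ended up doing):

// protoErrorHandler writes a real, wire-compatible google.rpc.Status body.
func protoErrorHandler(ctx context.Context, mux *gatewayruntime.ServeMux, m gatewayruntime.Marshaler,
  w http.ResponseWriter, r *http.Request, err error) {
  s := status.Convert(err) // google.golang.org/grpc/status; s.Proto() is a *google.rpc.Status
  w.Header().Set("Content-Type", m.ContentType())
  w.WriteHeader(gatewayruntime.HTTPStatusFromCode(s.Code()))
  if buf, merr := m.Marshal(s.Proto()); merr == nil {
    w.Write(buf)
  }
}

r.gatewayMux = gatewayruntime.NewServeMux(
  gatewayruntime.WithMarshalerOption("application/x-protobuf", &xProtobufMarshaler{}),
  gatewayruntime.WithProtoErrorHandler(protoErrorHandler),
)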

bogdandrutu added this to the Beta 0.8.0 milestone on Jul 30, 2020
bogdandrutu added the help wanted label on Jul 30, 2020
bogdandrutu modified the milestones: Beta 0.8.0, Beta 0.9.0 on Aug 12, 2020
tigrannajaryan modified the milestones: Beta 0.9.0, Beta 0.10.0, GA 1.0 on Sep 2, 2020
@tigrannajaryan
Member

I believe this is now fixed and we have a test that verifies it:

name: "ProtoGzipUncompressed",

@fbogsany I have not verified the fix myself, but it appears correct. Closing this, please reopen if necessary.

@bogdandrutu
Member

I double-checked; this is solved only for errors in the gzip handler.

MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021
* Move connection logic into grpcConnection object

If we need to maintain more than one connection in the future, this
split will come in handy.

Co-authored-by: Stefan Prisca <stefan.prisca@gmail.com>

* Make another channel a signal channel

There is another channel that serves as a one-time signal, where the
channel's data type does not matter.
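
For illustration, the usual Go signal-channel idiom being described (a sketch, not the exporter's actual code):

done := make(chan struct{}) // element type never matters; only close() is used

go func() {
  <-done // unblocks for every receiver once the channel is closed
  // ... tear down ...
}()

close(done) // the one-time broadcast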

* Reorder and document connection members

This makes clear that the lock guards only the connection, since it
can be changed by multiple goroutines; the other members are either
atomic or read-only.
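
A sketch of the layout being described (names are hypothetical, not the exporter's actual fields):

type grpcConnection struct {
  mu sync.Mutex       // guards cc, which reconnects swap from multiple goroutines
  cc *grpc.ClientConn

  disconnected chan struct{} // signal channels; safe to use without the lock
  stopCh       chan struct{}

  endpoint string // read-only after construction
}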

* Move stop signal into connection

The stop channel was rather useless on the exporter side - the primary
reason for this channel's existence is to stop the background
reconnecting goroutine. Since the goroutine lives entirely within the
grpcConnection object, move the stop channel there. Also expose a
function to unify the stop channel with context cancellation, so the
exporter can use it without knowing anything about stop channels.

Also make export functions a bit more consistent.
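
One way to unify a stop channel with context cancellation looks roughly like this (contextWithStop is a hypothetical helper):

// contextWithStop returns a context that is additionally cancelled when
// stopCh is closed, so callers only ever have to watch ctx.Done().
func contextWithStop(ctx context.Context, stopCh <-chan struct{}) (context.Context, context.CancelFunc) {
  ctx, cancel := context.WithCancel(ctx)
  go func() {
    select {
    case <-stopCh:
      cancel()
    case <-ctx.Done():
    }
  }()
  return ctx, cancel
}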

* Do not run reconnection routine when being stopped too

It's possible that both the disconnected channel and the stop channel
are triggered around the same time, so the goroutine is as likely to
start reconnecting as it is to return. Make sure we return if the
stop channel is closed.
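
Sketched with hypothetical channel names, the guard looks like:

select {
case <-c.disconnected:
  // select picks a ready case at random, so even after a disconnect
  // wake-up, prefer shutdown if it has also been requested.
  select {
  case <-c.stopCh:
    return
  default:
  }
  c.reconnect(ctx)
case <-c.stopCh:
  return
}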

* Nil clients on connection error

Set clients to nil on connection error, so we don't try to send the
data over a bad connection, but return a "no client" error
immediately.

* Do not call new connection handler within critical section

It's rather risky to call a callback coming from outside within a
critical section. Move it out.
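
The standard pattern, sketched with hypothetical names:

c.mu.Lock()
c.cc = newConn
handler := c.newConnectionHandler // copy the callback under the lock...
c.mu.Unlock()

if handler != nil {
  handler(newConn) // ...and invoke it outside, so it cannot deadlock on mu
}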

* Add context parameter to connection routines

Connecting to the collector may also take some time, so it can be
useful in some cases to pass a context with a deadline. Currently we
just pass a background context, so this commit does not really change
any behavior. The follow-up commits will make use of it, though.

* Add context parameter to NewExporter and Start

It makes it possible to limit the time spent on connecting to the
collector.

* Stop connecting on shutdown

Dialling the gRPC service ignored the closing of the stop channel,
but this can easily be changed.
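
With the contextWithStop helper sketched earlier, the dial becomes (hypothetical names again):

ctx, cancel := contextWithStop(context.Background(), c.stopCh)
defer cancel()
cc, err := grpc.DialContext(ctx, c.endpoint, grpc.WithInsecure())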

* Close connection after background is shut down

That way we can make sure there is no window between closing a
connection and the background goroutine returning in which a new
connection could be established.

* Remove unnecessary nil check

This member is never nil unless the Exporter is created like
&Exporter{}, which is not something we support anyway.

* Update changelog

Co-authored-by: Stefan Prisca <stefan.prisca@gmail.com>
Troels51 pushed a commit to Troels51/opentelemetry-collector that referenced this issue Jul 5, 2024