
Timestamp.CheckValid documentation should warn about usage of returned "internal" error #1314

Closed
powerman opened this issue Apr 26, 2021 · 12 comments


@powerman

What version of protobuf and what language are you using?
Version: v1.26.0

What did you do?
Return error which I get from (*timestamppb.Timestamp).CheckValid().
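
A minimal sketch of the kind of code involved (the out-of-range date below is just an arbitrary way to trigger an invalid timestamp):

```go
package main

import (
	"fmt"
	"time"

	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	// Year 10000 is past the maximum the Timestamp well-known type allows,
	// so CheckValid reports an error.
	ts := timestamppb.New(time.Date(10000, 1, 1, 0, 0, 0, 0, time.UTC))
	if err := ts.CheckValid(); err != nil {
		// The printed text begins with "proto: " or "proto:\u00a0"
		// depending on the binary, so matching it by text is unreliable.
		fmt.Printf("%q\n", err)
	}
}
```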

What did you expect to see?
An error message usable by gRPC clients, which have no other choice than to match it by text, because codes.InvalidArgument is not enough to find out which field has an invalid value and how that value is invalid.

OR the documentation for CheckValid must state: "the returned error is internal and suitable only for logging; it must not be returned to clients or matched by error message".

What did you see instead?
An unreliable error message with the prefix "proto: " or "proto:\u00a0", because of this "feature": https://github.com/protocolbuffers/protobuf-go/blob/v1.26.0/internal/errors/errors.go#L29-L33

Anything else we should know about your project / environment?
I lost a noticeable amount of time debugging this. 😞

@neild
Contributor

neild commented Apr 26, 2021

I'm very sorry that this cost you time debugging.

Relying on undocumented properties of a package--such as the exact text of errors--is fundamentally fragile. Error text may change for many reasons, such as to increase the clarity or consistency of messages. In the absence of explicit documentation stating that the text of an error is stable and unchanging, the default assumption should be that it is not.

See also:
https://developers.google.com/protocol-buffers/docs/reference/go/faq#unstable-errors

@powerman
Author

I wholeheartedly agree… as long as we're talking about handling errors in Go, and the errors to handle have exported ErrSomething values or SomethingError types.

But with protobuf we usually communicate with a remote system, which may not be implemented in Go and which won't receive anything other than the exact error text - so it has no option other than handling errors by their text message (gRPC error codes are too limited to avoid this).

@dsnet
Member

dsnet commented Apr 26, 2021

Error message usable by gRPC clients

Sending errors over gRPC that can be semantically interpreted by the client is a hard problem. Errors are meant to be a binary signal indicating that the operation succeeded or failed. Exactly why or how it failed is generally for human consumption and not part of the semantic API for machine consumption. Interpreting the text of the string is not the right solution and I strongly recommend against it in any production code. It is a code pattern akin to a ticking time bomb: it stops working when the error message changes slightly (which does happen).

It's unfortunate that it took time to figure out why you can't depend on the error message, but one of the reasons for the instability is precisely to make it harder for users to depend on the error message for production purposes. It is better to avoid introducing a potentially serious mis-assumption into the code base early on than to realize years later, when the production system is already serving many customers, that an assumption made in the past was invalid.

codes.InvalidArgument is not enough to find out which field has an invalid value and how that value is invalid

I'm not sure I see how a stable error message from timestamppb.Timestamp.CheckValid helps you know which field is invalid. A given message may have many Timestamp messages within it. What you need is a path that uniquely identifies a given field from some given root message. However, CheckValid does not know where in the containing message the timestamp occurs, so it can't encode such information in the error string. If you need the ability to detect exactly which field is invalid, it would be the responsibility of the caller of CheckValid to encode such information in the error.
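
For illustration, a rough sketch of what that could look like at the call site (the message and field names here are made up):

```go
package validate

import (
	"fmt"

	"google.golang.org/protobuf/types/known/timestamppb"
)

// Order stands in for a generated message with two Timestamp fields.
type Order struct {
	CreatedAt *timestamppb.Timestamp
	ShippedAt *timestamppb.Timestamp
}

// Validate records which field failed in the error it returns, since
// CheckValid itself has no way to know where in the message it was called.
func (o *Order) Validate() error {
	if err := o.CreatedAt.CheckValid(); err != nil {
		return fmt.Errorf("order.created_at: %w", err)
	}
	if err := o.ShippedAt.CheckValid(); err != nil {
		return fmt.Errorf("order.shipped_at: %w", err)
	}
	return nil
}
```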

documentation for CheckValid must state "the returned error is internal and suitable only for logging; it must not be returned to clients or matched by error message".

  • The error is not "internal" (I'm not sure exactly what this means).
  • The error is suitable for logging
  • The error is safe to report to clients
  • The error should not be matched by error message

It's nearly universal within the Go ecosystem that errors should be treated as opaque values unless otherwise stated. For example, some packages may state that the error they return may be asserted to some concrete type (e.g., *strconv.NumError), while another might say it will match some sentinel error value (e.g., io.EOF). However, barring any such guidance, the only valid operations on an error are to check whether it is nil and to print it for human consumption.

As such, I don't think we need to or should document this here since it is de-facto behavior of errors.

As long as we're talking about handling errors in Go

This is now a different concern than what the issue was originally about (which is about errors being serialized and sent over gRPC). For error values in Go, we could provide API to better distinguish the semantic meaning of an error since such error values have structured Go type and value information present.

At present, all errors produced by the protobuf module match proto.Error according to errors.Is. For us to add more sentinel errors that could be used with errors.Is, we would need more concrete examples and stronger justification for why to add such cases. Errors are returned in many places in the protobuf module, and I don't think we should provide sentinel errors for every possible error condition. Therefore, we would need guidance for why the errors from a given function should be distinguishable.
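
For example, the only matching promised today is roughly this (a minimal sketch):

```go
package example

import (
	"errors"

	"google.golang.org/protobuf/proto"
)

// fromProtobufModule reports whether err was produced by the protobuf
// module at all; it does not (and cannot) say which check failed.
func fromProtobufModule(err error) bool {
	return errors.Is(err, proto.Error)
}
```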

@powerman
Author

@dsnet The ideas you express sound reasonable… except they don't work in the gRPC world. When I'm using JSON-RPC 2.0 I never match errors by their message, because I have a wide open space of error codes… but not in gRPC. As long as I'm limited to codes.InvalidArgument I have no option other than to parse the error message (which I hate to do, for exactly the reasons you mentioned).

Of course I prefix the error message (including the one returned by CheckValid) with the field path. And this space/nbsp hack does no good except making my tests and client code more complicated: 99% of possible errors use fixed strings anyway, no matter whether they come from the Go stdlib, third-party libraries, or our own code; the only one that tries to "help" us by making error messages unstable is this package. So even if this were a good idea, it won't work, because everyone else doesn't do this and thus tests/clients will continue to match errors by messages… and it's hard to blame them, because with only codes.InvalidArgument available they have no other choice.

Even in the case where all we're doing with error messages is showing them to the user, for good UX we still want to show each message next to the exact field it relates to, highlight it depending on its severity, show recommendations for auto-fixing some concrete errors, etc. That's why I said this "concept" of not parsing error messages won't work in the real world without the ability to use unique error codes for each possible error.

For now, this space/nbsp "feature" looks to me exactly like "the road to hell is paved with good intentions". You implemented this feature trying to improve something, but in practice it does much more harm than good.

I'm not so naıve as to expect you to remove this feature, but I strongly recommend that you mention it in the documentation - this can't do any harm, and it may help other people save time debugging this issue. No matter what you say about "de-facto behavior", I'm not aware of any other package that does the same error message randomization, so this is surprising behaviour (in 2021 for sure; maybe by 2030 it will actually have become the de facto standard, but not today) and it should be documented.

P.S. I'm too sleepy now to pick my words carefully, so some of the above may sound rude to you. If so, please excuse me; I didn't mean it that way.

@powerman
Author

The error is not "internal" (I'm not sure exactly what this means).

I mean an error which is implementation-dependent, not part of the package's exported API, and not supposed to be detected/handled by user code in any way except printing it or comparing it to nil. There are a lot of such errors in the stdlib, but the thing is, when users need to detect/handle them they'll parse error messages anyway, and the right way to solve this is to make such errors part of the exported API (as happened with golang/go#4373 - it took 8 years, but there was no other way).

@neild
Contributor

neild commented Apr 26, 2021

I don't really see how gRPC has anything to do with this.

The Timestamp.CheckValid method returns an error if the timestamp is not valid. It does not return distinguishable errors, because the result of the check is binary: Is this timestamp valid or not? This is not a case like golang/go#4373, where a function may return errors for a number of different reasons and callers have a need to identify one of those reasons specifically.

The CheckValid method does return text explaining why the timestamp is not valid. This text is not part of the CheckValid API and may vary. If code needs to distinguish between different types of invalid timestamps (e.g., underflow vs. overflow), it is almost certainly simpler for that code to check the timestamp directly than to complicate the CheckValid function with a collection of distinguishable errors.
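
For example, something along these lines (a rough sketch; the bounds are the documented valid range for Timestamp, and the categories are purely illustrative):

```go
package validate

import "google.golang.org/protobuf/types/known/timestamppb"

// Documented valid range for the Timestamp well-known type:
// 0001-01-01T00:00:00Z through 9999-12-31T23:59:59.999999999Z.
const (
	minTimestampSeconds = -62135596800
	maxTimestampSeconds = +253402300799
)

// classify distinguishes the cases a hypothetical caller cares about by
// inspecting the timestamp itself, instead of parsing CheckValid's text.
func classify(ts *timestamppb.Timestamp) string {
	switch {
	case ts == nil:
		return "missing"
	case ts.GetSeconds() < minTimestampSeconds:
		return "before minimum"
	case ts.GetSeconds() > maxTimestampSeconds:
		return "after maximum"
	case ts.GetNanos() < 0 || ts.GetNanos() > 999999999:
		return "nanos out of range"
	default:
		return "valid"
	}
}
```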

If you return the text of the error returned by CheckValid from an RPC handler, and write a client which depends on the text of that error, then you are depending on a property of the error which may change in the future without notice. You should avoid doing this, and the deliberate nondeterminism of the error text is intended to aid in doing so.

If you want control over the text of an error included in your RPC response, you should provide your own error text rather than using the content of the error provided by a package not under your control.

Also, this is drifting farther afield, but gRPC status responses permit including arbitrary additional error details, which can be used to pass as many error codes as you might desire:
https://github.com/grpc/grpc/blob/master/src/proto/grpc/status/status.proto#L91
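
A rough sketch of what that can look like on the server side (the helper name and message text are made up; errdetails is google.golang.org/genproto/googleapis/rpc/errdetails):

```go
package server

import (
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// invalidFieldErr builds an InvalidArgument status whose details carry a
// machine-readable field path, independent of the human-readable message.
func invalidFieldErr(fieldPath string, cause error) error {
	st := status.New(codes.InvalidArgument, "invalid request")
	st, err := st.WithDetails(&errdetails.BadRequest{
		FieldViolations: []*errdetails.BadRequest_FieldViolation{
			{Field: fieldPath, Description: cause.Error()},
		},
	})
	if err != nil {
		// If the details can't be attached, fall back to the bare status.
		return status.Error(codes.InvalidArgument, "invalid request")
	}
	return st.Err()
}
```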

@powerman
Author

I know about the Status.details field, but it's too inconvenient to use compared to the usual code+message. It is good to have in case we need to return stack traces or other complex information, but not for returning one extra code with each error and using that extra code as a near-complete replacement for both Status.code and Status.message - that even sounds pointless.

As for CheckValid - initially I didn't want to go there, but you're pushing me: is it the only method which uses protoimpl.X.NewError? No. Can you guarantee that each and every call of protoimpl.X.NewError, today and in the future, will never return errors which users might want to distinguish? I doubt the question "might a user need to distinguish CheckValid errors, and is there another way around it" is productive for this discussion.

This text is not part of the CheckValid API and may vary.

This is true, this is okay, this happens everywhere, but… as you've seen in golang/go#4373, users will depend on exact error text anyway when they think it is the only way to go, and they know the risks. This is what users are used to doing. But our case is unusual, because this package actively resists doing this, and I believe that unusual behaviour should be documented, as we usually do for any unusual and surprising behaviour.

If you want control over the text of an error included in your RPC response, you should provide your own error text rather than using the content of the error provided by a package not under your control.

Well, this is true for most packages, except:

  • no one will bother replacing each and every error returned by a 3rd-party package - this is just impractical, because the amount of work required is vastly greater than the work needed to fix things if some error message changes and breaks something
  • this package is still special and surprising, because here it's not that you "should" do this, you "must" do this, and that is unexpected - which is one more reason why it "should" (read: "must" 😄) be documented

BTW, on a side note, why don't you randomize it on each run? Why did you decide to make this randomness depend on the current binary? The similar randomization used for Go map key iteration order is much easier for debugging, because it changes on each run, so a few runs of go test will discover this issue; with your binary-dependent implementation, serial runs of go test always produce the same results, and the effect of the randomization is only seen, with 50% chance, by switching between go test and go test -race (which isn't something people do often) or by making unrelated code changes. All of this makes it harder to debug.

@puellanivis
Collaborator

This is what users are used to doing. But our case is unusual, because this package actively resists doing this, and I believe that unusual behaviour should be documented, as we usually do for any unusual and surprising behaviour.

I think the matter here is that we disagree about what “unusual and surprising behavior” means. Some of us are well aware that error messages are always flaky, and that needing to change them can sometimes cause a cascade of failures in tests that depend on that specific message.

This has always been known to be an issue in Go, and the community still keeps running headlong into the puddle with scissors. Sometimes that's because “it's the only way to do what I want to do”, without taking a moment to stop, reconsider the approach, and avoid the well-known pitfall. Most of the time they get away with it, which breeds a normalization of deviance, until someday, down the line, something breaks in production and nobody knows why until it is finally distilled down to “we were relying upon the message of an error, and that error message changed”.

If the impetus to change is “everyone, or most everyone, has already changed”, then no one will ever change. Someone has to be the vanguard, the start of, or the push towards, the final goal. You complain about this being done in 2021 rather than 2030, but without some major project pushing ahead and preventing people from relying on error text, we're just going to be stuck in the same situation in 2030, the same as we were in 2012.

and the effect of the randomization is only seen, with 50% chance, by switching between go test and go test -race (which isn't something people do often) or by making unrelated code changes. All of this makes it harder to debug.

I repeatedly run go test throughout the course of development, to ensure that I haven't broken something serious before moving on to the next task. I would be surprised if someone did all their large-scale refactoring/greenfield work and then sent it through a single round of go test and go test -race. I'm sure there must be someone out there doing something so reckless, but should we really treat them as the norm?

To answer another question you asked: the reason for making it deterministic per binary is that this is being introduced into all code, not just test code. Go doesn't randomize short maps outside of go test, so the impact of that randomization on non-test code is limited. In our case, we're doing it for all code, so the impact needs to be minimized. (Perhaps we could actually make it fully random when compiled under test, if we're not doing that already?)

“random” aside, which is secretly still vaguely related to the conversation at hand:

naıve

What happened that made this render with a dotless i rather than, I’m guessing, an intended latin small letter i with diaeresis? Text is flaky and unreliable, am I right?

@powerman
Author

Randomizing at each go test run sounds good to me and will make debugging easier.

As for the rest of your comment, you're missing the point. I'm not saying we should depend on error message text. In general I like the idea of forcing people to write better code and avoid depending on error messages. But the point is:

  • At the moment this is unusual and surprising behaviour, and it must be documented. Not because this behaviour is wrong, but because it's still unusual and surprising today. You're free to remove this from the documentation in 2030, if/when everyone has become used to it.
  • I'd like it to be documented IN BOLD AND USING CAPITAL LETTERS, if you ask me (this never happens, I know, but still). Because there is a difference between trying to improve other people's code and making debugging more complicated for them.
  • While the idea in general is probably good, the current implementation is not, because:
    • gRPC's limited space of error codes makes it hard to avoid depending on error messages (which means gRPC isn't really a good place to demonstrate this technique and sell it to the community)
    • replacing each and every error message from the stdlib and 3rd-party packages is just impractical (so it's worth proposing a solution to that problem first, because you're breaking the way people work now without giving them a usable alternative)
    • binary-dependent randomization makes debugging harder for no reason (at least this one can be fixed)

naıve

I was wondering if anyone would spot this. 😄

@dsnet
Member

dsnet commented May 11, 2021

I believe there is a fundamental difference of opinion about what things should be documented versus what things fall under general knowledge and assumptions. The fact that users should not depend on exact error messages is widely known. Our errors could very well just say "it broken", and we should be allowed to make such a change. These types of details need not be documented. Carrying the thought further, it is unreasonable to expect every API that returns an error to carry a warning that the error messages may not be stable across time.

@powerman
Author

powerman commented May 12, 2021

No one assumes error messages may change between different runs of the same library version.
No one assumes such randomization will depend on the current binary.
These things are unusual and thus should be documented.
I won't complain about "not documented" behaviour if you just change error messages between different releases of this package - that is usual and expected and happens everywhere. I'm not saying error messages must never change. I'm just asking for this unusual behaviour to be documented because it's unusual, not because it's wrong.
I'm afraid anything else I can say about this won't be constructive, so I'll stop here. Feel free to close this if you believe adding this bit of documentation would do any harm or require too much effort.

@dsnet
Member

dsnet commented May 12, 2021

No one assumes error messages may change between different runs of the same library version.
No one assumes such randomization will depend on the current binary.

Both assertions are moot since both assume that error messages can be depended upon.

I'm closing this issue as the heart of this issue is a fundamental difference in philosophies and I don't think more discussion is going to change either party's perspective.

@dsnet dsnet closed this as completed May 12, 2021