Create base error type for ingester per-instance errors and remove logging for them #5585
```diff
@@ -6,13 +6,15 @@
 package ingester
 
 import (
+	"context"
 	"flag"
+	"time"
 
-	"github.com/pkg/errors"
+	"google.golang.org/grpc/codes"
+	"google.golang.org/grpc/status"
 	"gopkg.in/yaml.v3"
 
 	"github.com/grafana/mimir/pkg/util/globalerror"
-	util_log "github.com/grafana/mimir/pkg/util/log"
 )
 
 const (
```
```diff
@@ -22,14 +24,41 @@
 	maxInflightPushRequestsFlag = "ingester.instance-limits.max-inflight-push-requests"
 )
 
+// We don't include values in the messages for per-instance limits to avoid leaking Mimir cluster configuration to users.
 var (
-	// We don't include values in the message to avoid leaking Mimir cluster configuration to users.
-	errMaxIngestionRateReached    = errors.New(globalerror.IngesterMaxIngestionRate.MessageWithPerInstanceLimitConfig("the write request has been rejected because the ingester exceeded the samples ingestion rate limit", maxIngestionRateFlag))
-	errMaxTenantsReached          = errors.New(globalerror.IngesterMaxTenants.MessageWithPerInstanceLimitConfig("the write request has been rejected because the ingester exceeded the allowed number of tenants", maxInMemoryTenantsFlag))
-	errMaxInMemorySeriesReached   = errors.New(globalerror.IngesterMaxInMemorySeries.MessageWithPerInstanceLimitConfig("the write request has been rejected because the ingester exceeded the allowed number of in-memory series", maxInMemorySeriesFlag))
-	errMaxInflightRequestsReached = util_log.DoNotLogError{Err: errors.New(globalerror.IngesterMaxInflightPushRequests.MessageWithPerInstanceLimitConfig("the write request has been rejected because the ingester exceeded the allowed number of inflight push requests", maxInflightPushRequestsFlag))}
+	errMaxIngestionRateReached    = newInstanceLimitError(globalerror.IngesterMaxIngestionRate.MessageWithPerInstanceLimitConfig("the write request has been rejected because the ingester exceeded the samples ingestion rate limit", maxIngestionRateFlag))
+	errMaxTenantsReached          = newInstanceLimitError(globalerror.IngesterMaxTenants.MessageWithPerInstanceLimitConfig("the write request has been rejected because the ingester exceeded the allowed number of tenants", maxInMemoryTenantsFlag))
+	errMaxInMemorySeriesReached   = newInstanceLimitError(globalerror.IngesterMaxInMemorySeries.MessageWithPerInstanceLimitConfig("the write request has been rejected because the ingester exceeded the allowed number of in-memory series", maxInMemorySeriesFlag))
+	errMaxInflightRequestsReached = newInstanceLimitError(globalerror.IngesterMaxInflightPushRequests.MessageWithPerInstanceLimitConfig("the write request has been rejected because the ingester exceeded the allowed number of inflight push requests", maxInflightPushRequestsFlag))
 )
 
+type instanceLimitErr struct {
+	msg    string
+	status *status.Status
+}
+
+func newInstanceLimitError(msg string) error {
+	return &instanceLimitErr{
+		// Errors from hitting per-instance limits are always "unavailable" for gRPC
+		status: status.New(codes.Unavailable, msg),
+		msg:    msg,
+	}
+}
+
+func (e *instanceLimitErr) ShouldLog(context.Context, time.Duration) bool {
+	// We increment metrics when hitting per-instance limits and so there's no need to
+	// log them, the error doesn't contain any interesting information for us.
+	return false
+}
```
Comment on lines +48 to +52

If I read this correctly, then this will remove all logging, and as such it should be prominently noted in the PR description. I have an alternative proposal in #5584, to log 1 in N. I think there is useful information in the specific labels that cause problems.

Correct, it will remove all logging for these per-instance errors. We already don't log anything for in-flight requests, and all of these increment metrics.

I don't understand: there are no labels here. This is only for per-instance (ingester) limits, so there's no extra information to log beyond "this thing happened" -- unlike per-tenant limits, which is what #5584 seems to be addressing (and I agree, sampling is useful for that case).
```diff
+
+func (e *instanceLimitErr) GRPCStatus() *status.Status {
+	return e.status
+}
+
+func (e *instanceLimitErr) Error() string {
+	return e.msg
+}
+
 // InstanceLimits describes limits used by ingester. Reaching any of these will result in Push method to return
 // (internal) error.
 type InstanceLimits struct {
```
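For context on the inline thread above, the sketch below shows how a logging layer can consult an optional ShouldLog method before emitting a line for a failed request. The interceptor shape and interface name are illustrative assumptions, not the exact Mimir/dskit middleware types.

```go
// Illustrative sketch only: mirrors the idea of an "optional logging" check
// that a logging layer can perform before reporting a failed request. The
// interface and function names are assumptions, not the real middleware.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// optionalLogging is the behavior instanceLimitErr opts into by implementing
// ShouldLog and returning false.
type optionalLogging interface {
	ShouldLog(ctx context.Context, duration time.Duration) bool
}

func logIfNeeded(ctx context.Context, err error, duration time.Duration) {
	var ol optionalLogging
	if errors.As(err, &ol) && !ol.ShouldLog(ctx, duration) {
		return // per-instance limit errors are tracked via metrics instead
	}
	fmt.Printf("request failed: %v (took %s)\n", err, duration)
}

func main() {
	logIfNeeded(context.Background(), errors.New("some other failure"), time.Second)
}
```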
The logging middleware can handle wrapped errors, but unfortunately gRPC FromError() does not. I'm wondering if what we should do here, to make the code more generic, is check whether the error implements GRPCStatus() and, if so, not wrap it.

I'm wrong here as well: FromError() can unwrap too.
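A minimal sketch of the corrected behavior, assuming a grpc-go version recent enough that status.FromError unwraps (as the comment above notes):

```go
// Minimal sketch: status.FromError can recover the status even through
// fmt.Errorf("...: %w", err) wrapping, on a sufficiently recent grpc-go.
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	base := status.Error(codes.Unavailable, "ingester limit reached")
	wrapped := fmt.Errorf("failed pushing to ingester: %w", base)

	if s, ok := status.FromError(wrapped); ok {
		fmt.Println(s.Code()) // Unavailable
	}
}
```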
So why do we need this logic at all?

Could you improve the existing tests to assert on the returned error and check that gRPC status.FromError() returns the expected code?
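A hedged sketch of what such a test could look like; the test name and exact assertions are illustrative, not the test that was actually added in this PR. It assumes it compiles inside package ingester, where testify is already used.

```go
// Illustrative test sketch, not the actual test from the PR.
package ingester

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func TestInstanceLimitErrIsGRPCUnavailable(t *testing.T) {
	err := newInstanceLimitError("the write request has been rejected")

	// The error should carry a gRPC status with the Unavailable code.
	s, ok := status.FromError(err)
	require.True(t, ok)
	require.Equal(t, codes.Unavailable, s.Code())

	// Per-instance limit errors opt out of logging; they are tracked via metrics instead.
	ol, ok := err.(interface {
		ShouldLog(context.Context, time.Duration) bool
	})
	require.True(t, ok)
	require.False(t, ol.ShouldLog(context.Background(), time.Second))
}
```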
Because wrapWithUser() as it exists today doesn't actually "wrap" the error. It does something like errors.New(err.Error()), so it returns a brand-new error that has no relation to the original error and doesn't implement the GRPCStatus() or ShouldLog() methods. I could change it, but it seemed like it purposefully didn't wrap the existing error.

Sure, will do.
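To illustrate the distinction with a standalone sketch (this is not Mimir's actual wrapWithUser helper; the error type and tenant ID are made up): re-creating an error drops its methods, while wrapping with %w keeps them reachable.

```go
// Standalone sketch: contrasts re-creating an error with errors.New(err.Error())
// against wrapping with %w. Only the wrapped form keeps the original value
// reachable via errors.As, so only it preserves optional methods like ShouldLog.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type limitErr struct{ msg string }

func (e *limitErr) Error() string                                  { return e.msg }
func (e *limitErr) ShouldLog(context.Context, time.Duration) bool { return false }

func main() {
	var base error = &limitErr{msg: "limit reached"}

	recreated := errors.New("user=tenant-1: " + base.Error()) // new, unrelated error
	wrapped := fmt.Errorf("user=tenant-1: %w", base)           // keeps base in the chain

	var target *limitErr
	fmt.Println(errors.As(recreated, &target)) // false: the methods are lost
	fmt.Println(errors.As(wrapped, &target))   // true: ShouldLog is still reachable
}
```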
Ouf... you're right, again.

No, don't do it. I think it would be unsafe. There's some context in cortexproject/cortex#2004, but the TL;DR is that some errors carry the series labels, and those labels are only safe to read during the execution of push() because they're unmarshalled from the write request into a pool. So before returning, we effectively "make a copy" to ensure the error is still safe after push() has returned.
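A simplified sketch of the hazard being described, with illustrative types rather than the real Mimir push path:

```go
// Simplified sketch of the hazard: an error that keeps referencing label data
// from a pooled request buffer is only safe to read while push() is running.
// Types and names here are illustrative, not the real Mimir code.
package example

import "fmt"

// seriesError formats its message lazily, so it holds on to the labels slice
// it was given. If that slice is backed by pooled request memory, reading the
// error after the request buffers are recycled is unsafe.
type seriesError struct {
	labels []string
}

func (e *seriesError) Error() string {
	return fmt.Sprintf("sample rejected for series %v", e.labels)
}

// newSafeSeriesError "makes a copy" of the labels up front, so the returned
// error no longer depends on pooled memory and can safely escape push().
func newSafeSeriesError(labels []string) error {
	copied := make([]string, len(labels))
	copy(copied, labels)
	return &seriesError{labels: copied}
}
```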
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
errMaxTenantsReached
anderrMaxInMemorySeriesReached
don't carry any label, so they're expected to be safe.As a retrospective, we could have better handled the returned error, maybe implementing a specific interface for unsafe error messages, because it's currently very obscure. But that's for another day :)
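A tiny sketch of that retrospective idea, purely hypothetical since no such interface exists in this PR:

```go
// Hypothetical marker interface: makes "unsafe to retain" errors explicit so
// callers know to copy the message before letting the error escape.
package example

type unsafeToRetain interface {
	error
	// UnsafeToRetain documents that the error references pooled request
	// memory and must be copied before outliving the request handler.
	UnsafeToRetain()
}
```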
IMO we should create an issue and tackle that, as otherwise this is a recipe for disaster, and it's just a matter of time before it explodes somewhere. Errors are meant to escape your function; we shouldn't return errors that aren't safe to escape beyond some point.

Edit: issue #6008