Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Control % of requests faults are applied to with HTTP headers #10724

Merged

Conversation

Augustyniak
Copy link
Contributor

Description: Add an option to control the percentage of requests we apply fault to using HTTP headers. See #10648 for more details.
Risk Level: Low, new functionality.
Testing: Added unit tests
Docs Changes: Updated
Release Notes: Updated

Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
@Augustyniak Augustyniak requested a review from alyssawilk as a code owner April 9, 2020 16:57
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
@mattklein123 mattklein123 self-assigned this Apr 10, 2020
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks really nice work (as usual!). Just a few small comments. Thank you!

/wait

@@ -28,6 +28,7 @@ Version history
* ext_authz: disabled the use of lowercase string matcher for headers matching in HTTP-based `ext_authz`.
Can be reverted temporarily by setting runtime feature `envoy.reloadable_features.ext_authz_http_service_enable_case_sensitive_string_matcher` to false.
* fault: added support for controlling abort faults with :ref:`HTTP header fault configuration <config_http_filters_fault_injection_http_header>` to the HTTP fault filter.
* fault: added support for controlling the percentage of requests that abort, delay and response rate limits faults are applied to using :ref:`HTTP headers <config_http_filters_fault_injection_http_header>` to the HTTP fault filter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a master merge and the move to the new "current" version history file.

@@ -98,7 +125,7 @@ fault.http.abort.abort_percent
<envoy_api_field_config.filter.http.fault.v2.HTTPFault.abort>`.

fault.http.abort.http_status
HTTP status code that will be used as the of requests that will be
HTTP status code that will be used as the of requests that will be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not your bug but this sentence doesn't make sense. Can you fix?

const auto request_delay = fault_settings_->requestDelay();
if (request_delay == nullptr) {
return false;
}

const auto percentage_header =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a perf perspective it would be optimal to not do this lookup unless we are actually using a header provider (similar elsewhere), especially because right now this is going to be an O(N) scan to find the header. Can you figure out a way to only do the lookup on demand? Perhaps by passing the entire map and initializing the header provider with the header to look for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I updated all headers providers to work like this - we pass headers map and we do a specific header lookup only if a header provider is used. After my recent changes, there should be no performance impact related to header lookup at all if fault filter is not configured to use header faults.

One more question regarding the implementation of the percentage header providers. As it is now, we try to read a percentage header (such as x-envoy-fault-abort-request-percentage) and if it's not present or its value is invalid we fallback to the value that's specified in proto config file. Do you think that falling back to 100% would make more sense (as opposed to looking at proto config file)?

envoy::type::v3::FractionalPercent
HeaderPercentageProvider::percentage(const Http::RequestHeaderMap* request_headers) const {
if (!request_headers) {
return percentage_;
}
const auto header = request_headers->get(header_name_);
if (header == nullptr) {
return percentage_;
}
uint64_t header_numerator;
if (!absl::SimpleAtoi(header->value().getStringView(), &header_numerator)) {
return percentage_;
}
if (header_numerator > 100) {
return percentage_;
}
envoy::type::v3::FractionalPercent header_percentage;
header_percentage.set_numerator(header_numerator);
return header_percentage;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that falling back to 100% would make more sense (as opposed to looking at proto config file)?

I think falling back to the config is more flexible, as the operator can configure a default of 100%, 0%, etc. if they want?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my thought too 👍 Leaving it as it is then.

Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
…tp-headers

Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
@mattklein123
Copy link
Member

/azp run envoy-presubmit

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@Augustyniak
Copy link
Contributor Author

Augustyniak commented Apr 13, 2020

envoy-presubmit (Linux-x64 release) is red again (I think that it's a flaky job, not an actual issue with my PR). @mattklein123 Is there any way to re-run only this one particular job?

@mattklein123
Copy link
Member

CI is broken right now. We are trying to figure out how to fix it.

@mattklein123
Copy link
Member

Please merge master.

/wait

…tp-headers

Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
@Augustyniak
Copy link
Contributor Author

Augustyniak commented Apr 14, 2020

I think that 44eedc7 broke some of the git pre-push checks.

Checking format for third_party/statusor/statusor.cc - PASS
Checking spelling for third_party/statusor/statusor.cc
third_party/statusor/statusor.cc:7: * Copyright 2019 Google LLC
third_party/statusor/statusor.cc:41:} // namespace internal_statusor
Checked 1 file(s) and 22 comment(s), found 2 error(s).
ERROR: spell check failed. Run 'tools/spelling/check_spelling_pedantic.py fix and/or add new words to tools/spelling/spelling_dictionary.txt'

@Augustyniak
Copy link
Contributor Author

Fixing spell checking in #10763

Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, a few more comments.

/wait

return percentage_;
}

if (header_numerator > 100) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry just noticed this. Fractional percent supports 1/100, 1/10,000, etc. Do we want to support this somehow in a header? I'm wondering if this should somehow be scaled by the default percentage that is configured? WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could definitely use a denominator of a configured default percentage - that could work. If we do this we could also have a logic that would compute an output percentage's numerator as the minimum of two values - the value of HTTP percentage header and the value defined in default configured percentage.

Why would this additional complexity be helpful?

  • That would be an additional safety mechanism that would allow Envoy users to define the maximum percentage of requests a given type of fault should be applied to.
  • It would ensure that no change in the configuration of a default percentage results in too many faults being ingested via HTTP headers. Example when the lack of this mechanism could lead to unintended behavior is presented below.

Example

Step 1

default percentage configuration:

numerator: 100
denumerator: 1000

Effective fault injection rate is 10% since 100/1000 = 10.

HTTP headers used by clients injecting faults via HTTP headers:

`x-fault-abort-request-abort`: 404
`x-fault-abort-request-abort-percentage`: 100

Effective fault injection rate is 10% since 100/1000 = 10.

Step 2

Somebody on the server side decides to change a default percentage configuration to:

numerator: 10
denumerator: 100

From their perspective there should be no negative results of this change since the effective failure rate doesn't change (it's still equal to 10% since 100/1000 = 10/100 = 10)
The person responsible for a change doesn't update clients that use HTTP headers to ingest faults.

What happens with clients using HTTP headers to inject faults?
Without proposed min logic for the calculation of a numerator
`x-fault-abort-request-abort`: 404
`x-fault-abort-request-abort-percentage`: 100

Effective fault injection rate increases to 100% since 100/100 (new denominator of a default percentage) = 100.

With proposed min logic for the calculation of a numerator
`x-fault-abort-request-abort`: 404
`x-fault-abort-request-abort-percentage`: 100

Effective fault injection rate is still at 100% since min(100/100, 10 [Default Percentage Effective Rate]) = 100.

Let me know what you think about the proposed additional to your proposed change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the super detailed proposal! I agree the capping makes sense, but please make sure it's well documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 57 to 59
if (!request_headers) {
return absl::nullopt;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this can happen? Same elsehwere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will never happen. I added this a safety measure in case we start using this class in more places in the future. That being said, I can remove this check if this kind of defensive programming is not something we do in Envoy repo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes in general we don't put in this type of defensive logic which we know we can't hit currently, so I would remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so that we are on the same page. There is one place where we call percentage method with empty headers (nullptr) in Mongo related code but I don't think that header faults are supported there so we should be fine. The code I'm talking about can be found here https://github.com/envoyproxy/envoy/pull/10724/files#diff-eb7cf1a112138011ff766f2a9dc402f8R405

Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
…tp-headers

Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
@mattklein123
Copy link
Member

Thanks, can you check CI? You can build the docs locally with docs/build.sh if that helps.

Signed-off-by: Rafal Augustyniak <raugustyniak@lyft.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants