http fault: add active fault stat and overflow setting (#6167)
1) Add stat to track number of active injected faults
2) Add config/runtime control over how many concurrent
   faults can be injected. This is useful in cases where
   we want to allow 100% fault injection, but want to
   protect against too many concurrent requests using too
   many resources.
3) Add stat for faults that overflowed.
4) Misc code cleanup / modernization.

Part of #5942.

Signed-off-by: Matt Klein <mklein@lyft.com>
mattklein123 authored Mar 8, 2019
1 parent c69429b commit 191c8b0
Showing 7 changed files with 287 additions and 87 deletions.
16 changes: 16 additions & 0 deletions api/envoy/config/filter/http/fault/v2/fault.proto
@@ -11,6 +11,8 @@ import "envoy/api/v2/route/route.proto";
import "envoy/config/filter/fault/v2/fault.proto";
import "envoy/type/percent.proto";

import "google/protobuf/wrappers.proto";

import "validate/validate.proto";

// [#protodoc-title: Fault Injection]
@@ -63,4 +65,18 @@ message HTTPFault {
// <config_http_conn_man_headers_downstream-service-node>` header and compared
// against downstream_nodes list.
repeated string downstream_nodes = 5;

// The maximum number of faults that can be active at a single time via the configured fault
// filter. Note that because this setting can be overridden at the route level, it's possible
// for the number of active faults to be greater than this value (if injected via a different
// route). If not specified, defaults to unlimited. This setting can be overridden via
// `runtime <config_http_filters_fault_injection_runtime>` and any faults that are not injected
// due to overflow will be indicated via the `faults_overflow
// <config_http_filters_fault_injection_stats>` stat.
//
// .. attention::
// Like other :ref:`circuit breakers <arch_overview_circuit_break>` in Envoy, this is a fuzzy
// limit. It's possible for the number of active faults to rise slightly above the configured
// amount due to the implementation details.
google.protobuf.UInt32Value max_active_faults = 6;
}
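
The "fuzzy limit" behaviour described in the comment above can be illustrated with a minimal sketch. It assumes a shared counter that is checked and then incremented in two separate steps; `ActiveFaultGate`, `tryInject`, and `release` are hypothetical names for illustration and are not part of Envoy's API. Because there is no compare-and-swap between the check and the increment, two concurrent callers may both see room and both proceed, so the active count can briefly exceed the configured maximum.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the filter's shared stats; not Envoy's API.
class ActiveFaultGate {
public:
  explicit ActiveFaultGate(uint64_t max_active) : max_active_(max_active) {}

  // Returns true if a fault may be injected. The load and the increment are
  // two separate steps, so concurrent callers can both pass the check.
  bool tryInject() {
    if (active_.load() >= max_active_) {
      overflow_.fetch_add(1); // Analogous to the faults_overflow counter.
      return false;
    }
    active_.fetch_add(1); // Analogous to the active_faults gauge.
    return true;
  }

  // Called when a faulted request finishes.
  void release() {
    assert(active_.load() > 0);
    active_.fetch_sub(1);
  }

  uint64_t active() const { return active_.load(); }
  uint64_t overflow() const { return overflow_.load(); }

private:
  const uint64_t max_active_;
  std::atomic<uint64_t> active_{0};
  std::atomic<uint64_t> overflow_{0};
};
```

With a gate built as `ActiveFaultGate gate(2)`, the first two `tryInject()` calls succeed and the third is refused and counted as an overflow until `release()` is called.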
13 changes: 13 additions & 0 deletions docs/root/configuration/http_filters/fault_filter.rst
@@ -36,6 +36,8 @@ Configuration
* :ref:`v2 API reference <envoy_api_msg_config.filter.http.fault.v2.HTTPFault>`
* This filter should be configured with the name *envoy.fault*.

.. _config_http_filters_fault_injection_runtime:

Runtime
-------

@@ -62,6 +64,13 @@ fault.http.delay.fixed_duration_ms
is missing from both the runtime and the config, no delays will be
injected.

fault.http.max_active_faults
The maximum number of active faults (of all types) that Envoy will inject via the fault
filter. This can be used in cases where it is desired that faults are 100% injected,
but the user wants to avoid a situation in which too many unexpected concurrent faulting requests
cause resource constraint issues. If not specified, the :ref:`max_active_faults
<envoy_api_field_config.filter.http.fault.v2.HTTPFault.max_active_faults>` setting will be used.
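
The precedence described above (runtime value first, then the configured `max_active_faults`, then unlimited) can be sketched as a small helper. This is an illustration only: `effectiveMaxActiveFaults` is a hypothetical function, and `std::optional` is used here in place of whatever optional type the real code uses.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper; the name and signature are illustrative, not Envoy's.
// runtime_value: fault.http.max_active_faults, if present in runtime.
// config_value: the max_active_faults config field, if set.
inline uint64_t effectiveMaxActiveFaults(std::optional<uint64_t> runtime_value,
                                         std::optional<uint64_t> config_value) {
  if (runtime_value.has_value()) {
    return *runtime_value; // Runtime overrides the config.
  }
  if (config_value.has_value()) {
    return *config_value; // Fall back to the configured limit.
  }
  // Neither is set: treat the limit as unlimited.
  return std::numeric_limits<uint64_t>::max();
}
```

For example, a runtime value of 5 with a configured value of 10 yields 5, while leaving both unset yields the maximum `uint64_t`, which no realistic active count will reach.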

*Note*, fault filter runtime settings for the specific downstream cluster
override the default ones if present. The following are downstream specific
runtime keys:
@@ -76,6 +85,8 @@ Downstream cluster name is taken from
header. If the following settings are not found in the runtime, they default to the global runtime settings,
which in turn default to the config settings.

.. _config_http_filters_fault_injection_stats:

Statistics
----------

@@ -89,5 +100,7 @@ owning HTTP connection manager.

delays_injected, Counter, Total requests that were delayed
aborts_injected, Counter, Total requests that were aborted
faults_overflow, Counter, Total number of faults that were not injected due to overflowing the :ref:`max_active_faults <envoy_api_field_config.filter.http.fault.v2.HTTPFault.max_active_faults>` setting
active_faults, Gauge, Total number of faults active at the current time
<downstream-cluster>.delays_injected, Counter, Total delayed requests for the given downstream cluster
<downstream-cluster>.aborts_injected, Counter, Total aborted requests for the given downstream cluster
4 changes: 4 additions & 0 deletions docs/root/intro/version_history.rst
@@ -25,6 +25,10 @@ Version history
<envoy_api_field_config.filter.http.ext_authz.v2.AuthorizationResponse.allowed_client_headers>` and :ref:`upstream headers
<envoy_api_field_config.filter.http.ext_authz.v2.AuthorizationResponse.allowed_upstream_headers>` replaces the previous *allowed_authorization_headers* object.
All the control header lists now support :ref:`string matcher <envoy_api_msg_type.matcher.StringMatcher>` instead of standard string.
* fault: added the :ref:`max_active_faults
<envoy_api_field_config.filter.http.fault.v2.HTTPFault.max_active_faults>` setting, as well as
:ref:`statistics <config_http_filters_fault_injection_stats>` for the number of active faults
and the number of faults that overflowed.
* governance: extending Envoy deprecation policy from 1 release (0-3 months) to 2 releases (3-6 months).
* health check: expected response codes in http health checks are now :ref:`configurable <envoy_api_msg_core.HealthCheck.HttpHealthCheck>`.
* http: added new grpc_http1_reverse_bridge filter for converting gRPC requests into HTTP/1.1 requests.
4 changes: 2 additions & 2 deletions source/extensions/filters/http/fault/config.cc
@@ -15,8 +15,8 @@ namespace Fault {
Http::FilterFactoryCb FaultFilterFactory::createFilterFactoryFromProtoTyped(
const envoy::config::filter::http::fault::v2::HTTPFault& config,
const std::string& stats_prefix, Server::Configuration::FactoryContext& context) {
FaultFilterConfigSharedPtr filter_config(new FaultFilterConfig(
config, context.runtime(), stats_prefix, context.scope(), context.random()));
FaultFilterConfigSharedPtr filter_config(
new FaultFilterConfig(config, context.runtime(), stats_prefix, context.scope()));
return [filter_config](Http::FilterChainFactoryCallbacks& callbacks) -> void {
callbacks.addStreamDecoderFilter(std::make_shared<FaultFilter>(filter_config));
};
90 changes: 57 additions & 33 deletions source/extensions/filters/http/fault/fault_filter.cc
@@ -26,11 +26,6 @@ namespace Extensions {
namespace HttpFilters {
namespace Fault {

const std::string FaultFilter::DELAY_PERCENT_KEY = "fault.http.delay.fixed_delay_percent";
const std::string FaultFilter::ABORT_PERCENT_KEY = "fault.http.abort.abort_percent";
const std::string FaultFilter::DELAY_DURATION_KEY = "fault.http.delay.fixed_duration_ms";
const std::string FaultFilter::ABORT_HTTP_STATUS_KEY = "fault.http.abort.http_status";

FaultSettings::FaultSettings(const envoy::config::filter::http::fault::v2::HTTPFault& fault) {

if (fault.has_abort()) {
@@ -54,13 +49,17 @@ FaultSettings::FaultSettings(const envoy::config::filter::http::fault::v2::HTTPF
for (const auto& node : fault.downstream_nodes()) {
downstream_nodes_.insert(node);
}

if (fault.has_max_active_faults()) {
max_active_faults_ = fault.max_active_faults().value();
}
}

FaultFilterConfig::FaultFilterConfig(const envoy::config::filter::http::fault::v2::HTTPFault& fault,
Runtime::Loader& runtime, const std::string& stats_prefix,
Stats::Scope& scope, Runtime::RandomGenerator& generator)
Stats::Scope& scope)
: settings_(fault), runtime_(runtime), stats_(generateStats(stats_prefix, scope)),
stats_prefix_(stats_prefix), scope_(scope), generator_(generator) {}
stats_prefix_(stats_prefix), scope_(scope) {}

FaultFilter::FaultFilter(FaultFilterConfigSharedPtr config) : config_(config) {}

@@ -86,6 +85,10 @@ Http::FilterHeadersStatus FaultFilter::decodeHeaders(Http::HeaderMap& headers, b
fault_settings_ = per_route_settings ? per_route_settings : fault_settings_;
}

if (faultOverflow()) {
return Http::FilterHeadersStatus::Continue;
}

if (!matchesTargetUpstreamCluster()) {
return Http::FilterHeadersStatus::Continue;
}
@@ -129,34 +132,37 @@ Http::FilterHeadersStatus FaultFilter::decodeHeaders(Http::HeaderMap& headers, b
return Http::FilterHeadersStatus::Continue;
}

bool FaultFilter::faultOverflow() {
const uint64_t max_faults = config_->runtime().snapshot().getInteger(
RuntimeKeys::get().MaxActiveFaultsKey, fault_settings_->maxActiveFaults().has_value()
? fault_settings_->maxActiveFaults().value()
: std::numeric_limits<uint64_t>::max());
// Note: Since we don't compare/swap here this is a fuzzy limit which is similar to how the
// other circuit breakers work.
if (config_->stats().active_faults_.value() >= max_faults) {
config_->stats().faults_overflow_.inc();
return true;
}

return false;
}

bool FaultFilter::isDelayEnabled() {
bool enabled = config_->runtime().snapshot().featureEnabled(
DELAY_PERCENT_KEY, fault_settings_->delayPercentage().numerator(),
config_->randomGenerator().random(),
ProtobufPercentHelper::fractionalPercentDenominatorToInt(
fault_settings_->delayPercentage().denominator()));
bool enabled = config_->runtime().snapshot().featureEnabled(RuntimeKeys::get().DelayPercentKey,
fault_settings_->delayPercentage());
if (!downstream_cluster_delay_percent_key_.empty()) {
enabled |= config_->runtime().snapshot().featureEnabled(
downstream_cluster_delay_percent_key_, fault_settings_->delayPercentage().numerator(),
config_->randomGenerator().random(),
ProtobufPercentHelper::fractionalPercentDenominatorToInt(
fault_settings_->delayPercentage().denominator()));
enabled |= config_->runtime().snapshot().featureEnabled(downstream_cluster_delay_percent_key_,
fault_settings_->delayPercentage());
}
return enabled;
}

bool FaultFilter::isAbortEnabled() {
bool enabled = config_->runtime().snapshot().featureEnabled(
ABORT_PERCENT_KEY, fault_settings_->abortPercentage().numerator(),
config_->randomGenerator().random(),
ProtobufPercentHelper::fractionalPercentDenominatorToInt(
fault_settings_->abortPercentage().denominator()));
bool enabled = config_->runtime().snapshot().featureEnabled(RuntimeKeys::get().AbortPercentKey,
fault_settings_->abortPercentage());
if (!downstream_cluster_abort_percent_key_.empty()) {
enabled |= config_->runtime().snapshot().featureEnabled(
downstream_cluster_abort_percent_key_, fault_settings_->abortPercentage().numerator(),
config_->randomGenerator().random(),
ProtobufPercentHelper::fractionalPercentDenominatorToInt(
fault_settings_->abortPercentage().denominator()));
enabled |= config_->runtime().snapshot().featureEnabled(downstream_cluster_abort_percent_key_,
fault_settings_->abortPercentage());
}
return enabled;
}
@@ -168,7 +174,7 @@ absl::optional<uint64_t> FaultFilter::delayDuration() {
return ret;
}

uint64_t duration = config_->runtime().snapshot().getInteger(DELAY_DURATION_KEY,
uint64_t duration = config_->runtime().snapshot().getInteger(RuntimeKeys::get().DelayDurationKey,
fault_settings_->delayDuration());
if (!downstream_cluster_delay_duration_key_.empty()) {
duration =
@@ -185,8 +191,8 @@ absl::optional<uint64_t> FaultFilter::delayDuration() {

uint64_t FaultFilter::abortHttpStatus() {
// TODO(mattklein123): check http status codes obtained from runtime.
uint64_t http_status =
config_->runtime().snapshot().getInteger(ABORT_HTTP_STATUS_KEY, fault_settings_->abortCode());
uint64_t http_status = config_->runtime().snapshot().getInteger(
RuntimeKeys::get().AbortHttpStatusKey, fault_settings_->abortCode());

if (!downstream_cluster_abort_http_status_key_.empty()) {
http_status = config_->runtime().snapshot().getInteger(
@@ -206,6 +212,7 @@ void FaultFilter::recordDelaysInjectedStats() {
}

// General stats.
incActiveFaults();
config_->stats().delays_injected_.inc();
}

@@ -219,6 +226,7 @@ void FaultFilter::recordAbortsInjectedStats() {
}

// General stats.
incActiveFaults();
config_->stats().aborts_injected_.inc();
}

@@ -236,11 +244,27 @@ Http::FilterTrailersStatus FaultFilter::decodeTrailers(Http::HeaderMap&) {
}

FaultFilterStats FaultFilterConfig::generateStats(const std::string& prefix, Stats::Scope& scope) {
std::string final_prefix = prefix + "fault.";
return {ALL_FAULT_FILTER_STATS(POOL_COUNTER_PREFIX(scope, final_prefix))};
const std::string final_prefix = prefix + "fault.";
return {ALL_FAULT_FILTER_STATS(POOL_COUNTER_PREFIX(scope, final_prefix),
POOL_GAUGE_PREFIX(scope, final_prefix))};
}

void FaultFilter::onDestroy() { resetTimerState(); }
void FaultFilter::incActiveFaults() {
// Only charge 1 active fault per filter in case we are injecting multiple faults.
if (fault_active_) {
return;
}

config_->stats().active_faults_.inc();
fault_active_ = true;
}

void FaultFilter::onDestroy() {
resetTimerState();
if (fault_active_) {
config_->stats().active_faults_.dec();
}
}

void FaultFilter::postDelayInjection() {
resetTimerState();
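
The "charge at most one active fault per filter" bookkeeping in `incActiveFaults()` and `onDestroy()` can be sketched in isolation. This is a minimal illustration under stated assumptions: `Gauge` and `FaultCharge` are hypothetical types, and the sketch releases the charge in a destructor, whereas the commit does so in `onDestroy()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical gauge; stands in for a stats gauge and is not Envoy's API.
struct Gauge {
  uint64_t value = 0;
  void inc() { ++value; }
  void dec() {
    assert(value > 0);
    --value;
  }
};

// Hypothetical per-request object that charges the gauge at most once,
// even if both a delay and an abort are injected for the same request.
class FaultCharge {
public:
  explicit FaultCharge(Gauge& gauge) : gauge_(gauge) {}
  FaultCharge(const FaultCharge&) = delete;
  FaultCharge& operator=(const FaultCharge&) = delete;

  // Safe to call for every injected fault; only the first call counts.
  void markActive() {
    if (active_) {
      return;
    }
    gauge_.inc();
    active_ = true;
  }

  // Releases the charge when the request goes away, if one was taken.
  ~FaultCharge() {
    if (active_) {
      gauge_.dec();
    }
  }

private:
  Gauge& gauge_;
  bool active_ = false;
};
```

Calling `markActive()` twice on the same `FaultCharge` raises the gauge by one, and destroying the object returns the gauge to its previous value. A `FaultCharge` that never injects a fault leaves the gauge untouched.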
35 changes: 22 additions & 13 deletions source/extensions/filters/http/fault/fault_filter.h
@@ -24,16 +24,18 @@ namespace Fault {
* All stats for the fault filter. @see stats_macros.h
*/
// clang-format off
#define ALL_FAULT_FILTER_STATS(COUNTER) \
#define ALL_FAULT_FILTER_STATS(COUNTER, GAUGE) \
COUNTER(delays_injected) \
COUNTER(aborts_injected)
COUNTER(aborts_injected) \
COUNTER(faults_overflow) \
GAUGE (active_faults)
// clang-format on

/**
* Wrapper struct for fault filter stats. @see stats_macros.h
*/
struct FaultFilterStats {
ALL_FAULT_FILTER_STATS(GENERATE_COUNTER_STRUCT)
ALL_FAULT_FILTER_STATS(GENERATE_COUNTER_STRUCT, GENERATE_GAUGE_STRUCT)
};

/**
@@ -52,6 +54,7 @@ class FaultSettings : public Router::RouteSpecificFilterConfig {
uint64_t abortCode() const { return http_status_; }
const std::string& upstreamCluster() const { return upstream_cluster_; }
const std::unordered_set<std::string>& downstreamNodes() const { return downstream_nodes_; }
absl::optional<uint64_t> maxActiveFaults() const { return max_active_faults_; }

private:
envoy::type::FractionalPercent abort_percentage_;
@@ -61,6 +64,7 @@ class FaultSettings : public Router::RouteSpecificFilterConfig {
std::string upstream_cluster_; // restrict faults to specific upstream cluster
std::vector<Http::HeaderUtility::HeaderData> fault_filter_headers_;
std::unordered_set<std::string> downstream_nodes_{}; // Inject failures for specific downstream
absl::optional<uint64_t> max_active_faults_;
};

/**
@@ -69,15 +73,13 @@ class FaultSettings : public Router::RouteSpecificFilterConfig {
class FaultFilterConfig {
public:
FaultFilterConfig(const envoy::config::filter::http::fault::v2::HTTPFault& fault,
Runtime::Loader& runtime, const std::string& stats_prefix, Stats::Scope& scope,
Runtime::RandomGenerator& generator);
Runtime::Loader& runtime, const std::string& stats_prefix, Stats::Scope& scope);

Runtime::Loader& runtime() { return runtime_; }
FaultFilterStats& stats() { return stats_; }
const std::string& statsPrefix() { return stats_prefix_; }
Stats::Scope& scope() { return scope_; }
const FaultSettings* settings() { return &settings_; }
Runtime::RandomGenerator& randomGenerator() { return generator_; }

private:
static FaultFilterStats generateStats(const std::string& prefix, Stats::Scope& scope);
@@ -87,7 +89,6 @@ class FaultFilterConfig {
FaultFilterStats stats_;
const std::string stats_prefix_;
Stats::Scope& scope_;
Runtime::RandomGenerator& generator_;
};

typedef std::shared_ptr<FaultFilterConfig> FaultFilterConfigSharedPtr;
@@ -110,34 +111,42 @@ class FaultFilter : public Http::StreamDecoderFilter {
void setDecoderFilterCallbacks(Http::StreamDecoderFilterCallbacks& callbacks) override;

private:
class RuntimeKeyValues {
public:
const std::string DelayPercentKey = "fault.http.delay.fixed_delay_percent";
const std::string AbortPercentKey = "fault.http.abort.abort_percent";
const std::string DelayDurationKey = "fault.http.delay.fixed_duration_ms";
const std::string AbortHttpStatusKey = "fault.http.abort.http_status";
const std::string MaxActiveFaultsKey = "fault.http.max_active_faults";
};

using RuntimeKeys = ConstSingleton<RuntimeKeyValues>;

bool faultOverflow();
void recordAbortsInjectedStats();
void recordDelaysInjectedStats();
void resetTimerState();
void postDelayInjection();
void abortWithHTTPStatus();
bool matchesTargetUpstreamCluster();
bool matchesDownstreamNodes(const Http::HeaderMap& headers);

bool isAbortEnabled();
bool isDelayEnabled();
absl::optional<uint64_t> delayDuration();
uint64_t abortHttpStatus();
void incActiveFaults();

FaultFilterConfigSharedPtr config_;
Http::StreamDecoderFilterCallbacks* callbacks_{};
Event::TimerPtr delay_timer_;
std::string downstream_cluster_{};
const FaultSettings* fault_settings_;
bool fault_active_{};

std::string downstream_cluster_delay_percent_key_{};
std::string downstream_cluster_abort_percent_key_{};
std::string downstream_cluster_delay_duration_key_{};
std::string downstream_cluster_abort_http_status_key_{};

const static std::string DELAY_PERCENT_KEY;
const static std::string ABORT_PERCENT_KEY;
const static std::string DELAY_DURATION_KEY;
const static std::string ABORT_HTTP_STATUS_KEY;
};

} // namespace Fault
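
The header change above replaces `const static std::string` members with keys held in a `ConstSingleton<RuntimeKeyValues>`. The general pattern can be sketched as follows. This is an assumption-laden illustration: `ConstSingletonSketch` and `RuntimeKeyValuesSketch` are hypothetical names, and Envoy's actual `ConstSingleton` may differ in detail.

```cpp
#include <cassert>
#include <string>

// Sketch of a const-singleton helper (hypothetical name, not Envoy's class).
template <class T> class ConstSingletonSketch {
public:
  // Constructed on first use and intentionally never destroyed, which
  // sidesteps static initialization and destruction order problems.
  static const T& get() {
    static const T* instance = new T();
    return *instance;
  }
};

// Mirrors the shape of RuntimeKeyValues from the diff, trimmed to two keys.
struct RuntimeKeyValuesSketch {
  const std::string DelayPercentKey = "fault.http.delay.fixed_delay_percent";
  const std::string MaxActiveFaultsKey = "fault.http.max_active_faults";
};

using RuntimeKeysSketch = ConstSingletonSketch<RuntimeKeyValuesSketch>;
```

Callers then read a key as `RuntimeKeysSketch::get().MaxActiveFaultsKey`, and every call returns a reference to the same instance.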