
New implementation of firehose_exporter #63

Merged

Conversation

@ArthurHlt (Contributor) commented Mar 29, 2021

Context

At Orange we found several issues with some of our big Cloud Foundry instances:

  1. The exporter was dropping metrics; this was due to the websocket transport it used.
  2. The http_start_stop implementation was effectively computing a kind of velocity instead of a real counter/gauge. It is hard to explain briefly, but it came from the way those envelopes were retrieved and then deleted.
  3. Retrieving metrics ended up being rather slow.

All of these issues appear when you have a big instance with a lot of traffic on it, but we decided to go further than patching and rewrote parts of the exporter as a new implementation.
We hope it will help the community; the first implementation was great anyway and helped us a lot.

Current status of this implementation

We have run this implementation for two months. During the first month we found high memory consumption, which has now been fixed. During the second month we had no issues; even better, it uses less memory than the previous implementation.

What changed?

  • This new version is fully retro-compatible with the previously generated metrics. You can deactivate
    retro-compatibility with the flag retro_compat.disable or the env var FIREHOSE_EXPORTER_RETRO_COMPAT_DISABLE=true
  • Use the log-api over gRPC for faster metrics retrieval: this fixes the dropped metrics on large Cloud Foundry
    instances

  • Remove the http_start_stop metrics, because they were measuring request velocity instead of request timing and
    count.
    They were replaced by the rollup feature introduced in https://github.com/cloudfoundry/metric-store-release and
    reimplemented here. You now have access to the total number of requests with the metric [namespace]_http_total,
    request timing by quantiles (histogram metrics) with the metrics
    [namespace]_http_duration_seconds_(bucket|sum|count), and finally response sizes with
    [namespace]_response_size_bytes_(count|sum) and [namespace]_response_size_bytes at quantiles 0.2, 0.5, 0.75
    and 0.95.

    We fixed this part to get correct graphs of app traffic; the dashboards have been updated accordingly.

  • The mechanism for exposing and collecting metrics has been reworked: calls to /metrics are faster now and there
    are fewer lines of code

  • By default, metrics of the form [namespace]_counter_event_*_delta have been removed, as they do not seem useful,
    and https://github.com/cloudfoundry/metric-store-release does not collect them either, which confirms this
    feeling.

    You can still get them by using the flag retro_compat.enable_delta or the env
    var FIREHOSE_EXPORTER_RETRO_COMPAT_ENABLE_DELTA=true

How to fix your dashboards and alerts:

Replace the metric names as follows:

  • [namespace]_http_start_stop_requests => [namespace]_http_total

  • [namespace]_http_start_stop_response_size_bytes => [namespace]_response_size_bytes

  • [namespace]_http_start_stop_response_size_bytes_count => [namespace]_response_size_bytes_count

  • [namespace]_http_start_stop_response_size_bytes_sum => [namespace]_response_size_bytes_sum

  • [namespace]_http_start_stop_server_request_duration_seconds => [namespace]_http_duration_seconds_bucket (the
    metric is now a histogram with buckets)

  • [namespace]_http_start_stop_server_request_duration_seconds_count => [namespace]_http_duration_seconds_count

  • [namespace]_http_start_stop_server_request_duration_seconds_sum => [namespace]_http_duration_seconds_sum

  • [namespace]_http_start_stop_last_request_timestamp => removed, to avoid too much CPU work in the exporter for a
    metric not used in the default dashboards or alerts

  • [namespace]_http_start_stop_client_request_duration_seconds_count => removed, because it was no longer reported
    for apps but only as a gorouter metric

  • [namespace]_http_start_stop_client_request_duration_seconds_sum => removed, for the same reason

Associated pull request

We have fixed the dashboards and the job accordingly for this new implementation, and we have also added new panels to the latency board. You will find this PR here: New implementation of firehose_exporter prometheus-boshrelease#414

@frodenas (Contributor)

@ArthurHlt Thanks for this PR! Can you please take a look at the failing tests before I review it?

@ArthurHlt (Contributor, Author)

@frodenas I found race conditions when running the tests with -race (as you may see here; I was also annoyed by the vendor directory being out of sync with promu and ginkgo).

I'd be happy for you to start reviewing it; in the meantime I will give feedback from our production with those race conditions fixed. There were two kinds of race conditions:

  1. One between setting the metric envelope and processing it. It can't occur in real life (it was mostly triggered by the tests), but I've fixed it anyway.
  2. The second was in the rollup metrics themselves: I was overwriting a pointer to quickly clear a sync.Map, which could miss some values if the re-assignment of the pointer was slow enough (it happened on Travis, but I could not reproduce it on my fast Mac or in production). This has also been fixed.

@ArthurHlt (Contributor, Author)

Just to say that everything is working correctly in our production with those race-condition fixes.

@psycofdj (Contributor) left a comment


I honestly cannot review all the code that has changed. However, since this has been running in our production for weeks, I can say that it works far better than the current version.

Request rates are now accurate and communication between the exporter and the firehose is much faster.

As discussed in cloudfoundry/prometheus-boshrelease#419, I suggest that we merge this to master so we can move forward on cloudfoundry/prometheus-boshrelease#414

@psycofdj psycofdj force-pushed the new_implementation branch from 8442056 to 8fbc9ff Compare July 7, 2021 21:19
@psycofdj psycofdj force-pushed the new_implementation branch from 8fbc9ff to f1569a5 Compare July 7, 2021 21:40
@fredga fredga force-pushed the new_implementation branch from 8963d1b to f1569a5 Compare October 18, 2021 14:05
@psycofdj psycofdj force-pushed the new_implementation branch from ea181cd to 314325f Compare June 30, 2022 15:47
@ArthurHlt (Contributor, Author)

Should we merge? It has been running in production at Orange for a long time without any disturbance.

@psycofdj psycofdj force-pushed the new_implementation branch from 423f536 to d93bf9a Compare January 3, 2023 13:35
@benjaminguttmann-avtq benjaminguttmann-avtq merged commit d65988d into cloudfoundry:master Apr 18, 2023
@benjaminguttmann-avtq (Contributor)

thx @ArthurHlt
