
New implementation of firehose_exporter #63

Merged

Conversation

@ArthurHlt (Contributor) commented Mar 29, 2021

Context

At Orange we found several issues with some of our big Cloud Foundry instances:

  1. The exporter was dropping metrics; this was due to the websocket transport it used.
  2. The http_start_stop implementation was effectively computing a kind of velocity instead of a real counter/gauge. It is hard to explain briefly, but it came from the way those envelopes were retrieved and then deleted.
  3. Retrieving metrics ended up being rather slow.

All of these issues appear when you have a big instance with a lot of traffic on it, but we decided to go further than patching and rewrote parts of the exporter as a new implementation.
We hope it will help the community; the first implementation was great anyway and helped us a lot.

Current status of this implementation

We have run this implementation for two months. During the first month we found high memory consumption, which has now been fixed. During the second month we had no issues; even better, it uses less memory than the previous implementation.

What changed?

  • This new version is fully retro-compatible with the previously generated metrics. You can deactivate
    retro-compatibility with the flag retro_compat.disable or the env var FIREHOSE_EXPORTER_RETRO_COMPAT_DISABLE=true
  • Use the log-api over gRPC for faster metrics retrieval: this fixes the dropped metrics on large Cloud Foundry
    instances

  • Remove the http_start_stop metrics, because they were measuring request velocity instead of request timing and
    count.
    They were replaced by the rollup feature introduced in https://github.com/cloudfoundry/metric-store-release and
    reimplemented here. You now have access to the total number of requests with the metric [namespace]_http_total,
    request timing by quantiles (histogram metrics) with the metrics
    [namespace]_http_duration_seconds_(bucket|sum|count), and finally response sizes with
    [namespace]_response_size_bytes_(count|sum) and [namespace]_response_size_bytes at quantiles 0.2, 0.5, 0.75
    and 0.95.

    We fixed this part to get correct graphs of app traffic; the dashboards have been updated accordingly.

  • The mechanism for exposing and collecting metrics has been reworked: calls to /metrics are faster now and there
    are fewer lines of code

  • By default, metrics of the form [namespace]_counter_event_*_delta have been removed, as they do not seem useful,
    and https://github.com/cloudfoundry/metric-store-release does not collect them either, which confirms this
    feeling.

    You can still get them by using the flag retro_compat.enable_delta or the env
    var FIREHOSE_EXPORTER_RETRO_COMPAT_ENABLE_DELTA=true

How to fix your dashboards and alerts:

Replace the metric names as follows:

  • [namespace]_http_start_stop_requests => [namespace]_http_total

  • [namespace]_http_start_stop_response_size_bytes => [namespace]_response_size_bytes

  • [namespace]_http_start_stop_response_size_bytes_count => [namespace]_response_size_bytes_count

  • [namespace]_http_start_stop_response_size_bytes_sum => [namespace]_response_size_bytes_sum

  • [namespace]_http_start_stop_server_request_duration_seconds => [namespace]_http_duration_seconds_bucket (the
    metric is now a histogram with buckets)

  • [namespace]_http_start_stop_server_request_duration_seconds_count => [namespace]_http_duration_seconds_count

  • [namespace]_http_start_stop_server_request_duration_seconds_sum => [namespace]_http_duration_seconds_sum

  • [namespace]_http_start_stop_last_request_timestamp => removed, to avoid too much CPU work in the exporter for a
    metric not used in the default dashboards or alerts

  • [namespace]_http_start_stop_client_request_duration_seconds_count => removed, because it was no longer reported
    for apps but only as a gorouter metric

  • [namespace]_http_start_stop_client_request_duration_seconds_sum => removed, for the same reason

Associated pull request

We have fixed the dashboards and the job accordingly for this new implementation, and we have also added new panels to the latency board. You will find this PR here: New implementation of firehose_exporter prometheus-boshrelease#414

@frodenas (Contributor)

@ArthurHlt Thanks for this PR! Can you please take a look at the failing tests before I review it?

@ArthurHlt (Contributor, Author)

@frodenas I found race conditions when running the tests with -race (as you may see here; I was also annoyed by the vendor directory being out of sync with promu and ginkgo).

I'd be happy for you to start reviewing it; in the meantime I will give feedback from our production with those race conditions fixed. There were two kinds of race conditions:

  1. One between setting the metric envelope and processing it. It can't occur in real life (it was mostly triggered by the tests), but I've fixed it anyway.
  2. The second was in the rollup metrics themselves: I was overwriting a pointer to quickly clear a sync.Map, which could miss some values if the re-assignment of the pointer was slow enough (it happened on Travis, but I could not reproduce it on my fast Mac or in production). This has also been fixed.

@ArthurHlt (Contributor, Author)

Just to say that everything is working correctly in our production with those race-condition fixes.

@psycofdj (Contributor) left a comment


I honestly cannot review all the code that has changed. However, since this has been running in our production for weeks, I can say that it works far better than the current version.

Request rates are now accurate and communication between the exporter and the firehose is much faster.

As discussed in cloudfoundry/prometheus-boshrelease#419, I suggest that we merge this to master so we can move forward on cloudfoundry/prometheus-boshrelease#414

@psycofdj psycofdj force-pushed the new_implementation branch from 8442056 to 8fbc9ff Compare July 7, 2021 21:19
@psycofdj psycofdj force-pushed the new_implementation branch from 8fbc9ff to f1569a5 Compare July 7, 2021 21:40
@fredga fredga force-pushed the new_implementation branch from 8963d1b to f1569a5 Compare October 18, 2021 14:05
@psycofdj psycofdj force-pushed the new_implementation branch from ea181cd to 314325f Compare June 30, 2022 15:47
@ArthurHlt (Contributor, Author)

Should we merge? It has been running in production at Orange for a long time without any disturbance.

@psycofdj psycofdj force-pushed the new_implementation branch from 423f536 to d93bf9a Compare January 3, 2023 13:35
@benjaminguttmann-avtq benjaminguttmann-avtq merged commit d65988d into cloudfoundry:master Apr 18, 2023
@benjaminguttmann-avtq (Contributor)

thx @ArthurHlt
