New implementation of firehose_exporter #63
Conversation
@ArthurHlt Thanks for this PR! Can you please take a look at the failing tests before I review it?
@frodenas I've found some race conditions when running it. I'll be happy for you to start the review; in the meantime I'll give feedback from our production with those race conditions fixed. There were two kinds of race conditions.
Just to say that everything is working correctly in our production with those race-condition fixes.
I honestly cannot review all the code that has been changed. However, since this has been running in our production for weeks, I can say that it works far better than the current version.
Request rates are now accurate and communication between the exporter and the firehose is much faster.
As discussed in cloudfoundry/prometheus-boshrelease#419, I suggest that we merge this to master so we can move forward on cloudfoundry/prometheus-boshrelease#414
Should we merge? It has been in production at Orange for a long time without any issues.
update go version and bump dependencies
update kingpin lib
thx @ArthurHlt
Context
At Orange we have found some issues with some of our big Cloud Foundry instances. All those issues appear when you have a big instance with a lot of traffic on it, so we decided to go further and rewrite some parts to create a new implementation.
We hope that it will help the community; the first implementation was great anyway and helped a lot.
Current status of this implementation
We have run this implementation for two months. During the first month we found high memory consumption, which has now been fixed. During the second month we had no issues; even better, it uses less memory than the previous implementation.
What has changed?
- This new version is fully retro-compatible with the previously generated metrics. You can deactivate retro-compatibility with the flag `retro_compat.disable` or the env var `FIREHOSE_EXPORTER_RETRO_COMPAT_DISABLE=true`.
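As a sketch, the two options above could be used like this (the binary name `firehose_exporter` and the `--` long-flag form are assumptions, not confirmed by this PR):

```shell
# Hypothetical invocation; binary name and flag syntax are assumptions.
# Disable retro-compatible metric names via the flag:
firehose_exporter --retro_compat.disable

# Or via the environment variable:
FIREHOSE_EXPORTER_RETRO_COMPAT_DISABLE=true firehose_exporter
```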
- Use the log-api over gRPC for faster metrics retrieval: this fixes logs being dropped on large Cloud Foundry instances.
- Remove the `http_start_stop` metrics, because those metrics were evaluating request velocity instead of request timing and count. They were replaced by the rollup feature introduced in https://github.com/cloudfoundry/metric-store-release and reimplemented here. You now have access to the total number of requests with the metric `[namespace]_http_total`, to request timing by quantiles (histogram metrics) with the metrics `[namespace]_http_duration_seconds_(bucket|sum|count)`, and finally to response size with `[namespace]_response_size_bytes_(count|sum)` and `[namespace]_response_size_bytes` with quantiles `0.2`, `0.5`, `0.75`, `0.95`. We have fixed this part to get correct graphs about app traffic, and the dashboards have been fixed accordingly.
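For instance, a latency quantile can be derived from the new histogram with PromQL's `histogram_quantile`. This is only a sketch: the concrete namespace `firehose` (in place of `[namespace]`), the Prometheus address, and the 5m rate window are assumptions.

```shell
# Hypothetical query: 95th-percentile request latency from the new histogram.
# Assumes namespace "firehose" and a Prometheus server at localhost:9090.
promtool query instant http://localhost:9090 \
  'histogram_quantile(0.95, sum by (le) (rate(firehose_http_duration_seconds_bucket[5m])))'
```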
- The mechanism for exposing and collecting metrics has been reworked; calls to `/metrics` are now faster and there are fewer lines of code.
- By default, metrics of the form `[namespace]_counter_event_*_delta` have been removed, as they do not seem useful, and https://github.com/cloudfoundry/metric-store-release does not collect them either, which confirms this feeling. You can still get them by using the flag `retro_compat.enable_delta` or the env var `FIREHOSE_EXPORTER_RETRO_COMPAT_ENABLE_DELTA=true`.
How to fix your dashboards and alerts:
Replace the metric names as follows:
- `[namespace]_http_start_stop_requests` => `[namespace]_http_total`
- `[namespace]_http_start_stop_response_size_bytes` => `[namespace]_response_size_bytes`
- `[namespace]_http_start_stop_response_size_bytes_count` => `[namespace]_response_size_bytes_count`
- `[namespace]_http_start_stop_response_size_bytes_sum` => `[namespace]_response_size_bytes_sum`
- `[namespace]_http_start_stop_server_request_duration_seconds` => `[namespace]_http_duration_seconds_bucket` (the metric is now a histogram with buckets)
- `[namespace]_http_start_stop_server_request_duration_seconds_count` => `[namespace]_http_duration_seconds_count`
- `[namespace]_http_start_stop_server_request_duration_seconds_sum` => `[namespace]_http_duration_seconds_sum`
- `[namespace]_http_start_stop_last_request_timestamp` => metric has been removed, to avoid too much CPU work in the exporter for a metric not used in the default dashboards or alerts
- `[namespace]_http_start_stop_client_request_duration_seconds_count` => metric has been removed, because it was already no longer reported for apps but only on the gorouter metric
- `[namespace]_http_start_stop_client_request_duration_seconds_sum` => metric has been removed, because it was already no longer reported for apps but only on the gorouter metric
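As an illustration of applying one of these renames in a dashboard query (a sketch only; the namespace `firehose`, the Prometheus address, and the `rate()` window are assumptions):

```shell
# Before: a panel based on the old metric name
#   rate(firehose_http_start_stop_requests[5m])
# After: the same panel with the new metric name
promtool query instant http://localhost:9090 'rate(firehose_http_total[5m])'
```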
Associated pull request
We have fixed the dashboards and the job accordingly for this new implementation, and we have also added new panels to the latency board. You will find this PR here: New implementation of firehose_exporter prometheus-boshrelease#414