Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more metrics to in_monitor_agent #2450

Merged
merged 6 commits into from
Jul 1, 2019

Conversation

ganmacs
Copy link
Member

@ganmacs ganmacs commented Jun 11, 2019

Which issue(s) this PR fixes:

no

What this PR does / why we need it:

I added metrics which output and buffer have to the response of in_monitor_agent plugin to fetch the fluentd's inner stats.

Docs Changes:

Need to revise examples https://docs.fluentd.org/input/monitor_agent#output-example

Release Note:

Same as the title

@ganmacs ganmacs requested a review from repeatedly June 11, 2019 02:42
}

if @buffer && @buffer.respond_to?(:statistics)
(@buffer.statistics && @buffer.statistics['buffer'] || {}).each do |k, v|
Copy link
Member

@repeatedly repeatedly Jun 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When @buffer.statistics return false/nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks unnecessary. thank you!
7e0941a

stats['newest_timekey'] = m
end

{ 'buffer' => stats }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the merit of {'buffer' => stats} rather than stats?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To have consistent with the result of Output#statistics.
I think it better having { 'output' => stats } than stats since in_monitor_agent.rb can identify which the data come from when the components we want to collect stats( at https://github.com/fluent/fluentd/pull/2450/files#diff-0ad62dd2012fecaa6ed2232f8918c528R382) increase.

@bai
Copy link

bai commented Jun 17, 2019

👋Semi-related question: do you think it would make sense to add flush_time metric as well, that'd answer "how long it takes for Output to perform write/try_write"?

ganmacs added 3 commits June 18, 2019 11:23
they are used for monitoring

Signed-off-by: Yuta Iwama <ganmacs@gmail.com>
Signed-off-by: Yuta Iwama <ganmacs@gmail.com>
Signed-off-by: Yuta Iwama <ganmacs@gmail.com>
@ganmacs ganmacs force-pushed the add-more-metrics branch from 7e0941a to 63c006e Compare June 18, 2019 05:31
@ganmacs
Copy link
Member Author

ganmacs commented Jun 18, 2019

do you think it would make sense to add flush_time metric as well

Looks neat for me. I added flush_time_count and slow_flush_count 63c006e !

@bai
Copy link

bai commented Jun 18, 2019

Fantastic, thank you.

@repeatedly
Copy link
Member

flush_time_count

Is this useful? This value seems the total of flush time. How to use this value in monitoring?

@ganmacs
Copy link
Member Author

ganmacs commented Jun 18, 2019

I think it is useful for getting the value which is like average flush_time in 5 mins.
(in promql query is like increase(flush_time_count[5m] / write_count[5m]) )

@bai
Copy link

bai commented Jun 18, 2019

I think it is useful for getting the value which is like average flush_time in 5 mins.
(in promql query is like increase(flush_time_count[5m] / write_count[5m]) )

This is exactly what I've meant 👍 🎉

@repeatedly
Copy link
Member

repeatedly commented Jun 18, 2019

Okay, I understood. elapsed_time is Float type. How to handle overflow? Reset to 0?
Some fluentd users launch many flush thread and it increases flush_time_count quickly.

@ganmacs
Copy link
Member Author

ganmacs commented Jun 18, 2019

elapsed_time is Float type.

Good catch… Since flush_time_count is counter, I think that this value should increase monotonically. I think it's good to use Integer type (to calling #to_i before adding elapsed_time to flush_time ) since most of flush_time is smaller than 1.


done 628379e

@bai
Copy link

bai commented Jun 19, 2019

@ganmacs Do we need to multiply elapsed_time by 1000 before calling to_i to get millisecond precision?

Signed-off-by: Yuta Iwama <ganmacs@gmail.com>
@ganmacs ganmacs force-pushed the add-more-metrics branch from 628379e to 8b5e34f Compare June 21, 2019 08:01
Signed-off-by: Yuta Iwama <ganmacs@gmail.com>
@@ -1202,7 +1204,10 @@ def backup_chunk(chunk, using_secondary, delayed_commit)

def check_slow_flush(start)
elapsed_time = Fluent::Clock.now - start
# millsec precision
@counters_monitor.synchronize { @flush_time_count += (elapsed_time * 1000).to_i }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about (elapsed_time * 1000).to_i calculation outside of synchronize?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

05d842e fixed

Signed-off-by: Yuta Iwama <ganmacs@gmail.com>
@repeatedly repeatedly merged commit 07fee9b into fluent:master Jul 1, 2019
@repeatedly
Copy link
Member

Thx!

@ganmacs ganmacs deleted the add-more-metrics branch July 1, 2019 02:40
@repeatedly repeatedly mentioned this pull request Jul 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants