Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: reworked varnish_cache plugin #9432

Merged
merged 20 commits into from
Dec 21, 2021
Merged

Conversation

rhajek
Copy link
Contributor

@rhajek rhajek commented Jun 24, 2021

Resolves #5622 #6832

Added reworked version of the VarnishCache input plugin based on json parsing of varnishstat -j.
Conversion to metrics from Varnish stats is reworked, it solves issue with high cardinality when reloading Varnish server, parses backends into tags, supports "VBE.*" metrics.

  • Updated associated README.md.
  • Wrote appropriate unit tests.

@telegraf-tiger telegraf-tiger bot added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin new plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins labels Jun 24, 2021
@rhajek rhajek marked this pull request as draft June 25, 2021 06:22
@rhajek rhajek marked this pull request as ready for review June 25, 2021 12:13

// Find the most recent 'VBE.reload_' prefix using string compare
// 'VBE.reload_20210623_170621_31083'
func findActiveReloadPrefix(countersJSON map[string]interface{}) string {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach only work if you reload services, but if you load vcl manually with custom name doesn't valid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked this PR, now it is part of the varnish plugin, varnishadm tool is used to get active vcl.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ismaelpuerto is this concern resolved?

Copy link
Member

@helenosheaa helenosheaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @rhajek thanks for the PR! It's a significant change to the varnish plugin. I've left review comments with suggestions and some questions that would be great if you could respond to/ resolve.

Are we trying to support a specific version of varnish with this plugin or is it cross-version compatible?

There's also a test failure in circle-ci that would need to be resolved:

FAIL: TestJsonTypes (0.00s)
    varnish_test.go:617: 
        	Error Trace:	varnish_test.go:617
        	Error:      	Not equal: 
        	            	expected: float64(123.45)
        	            	actual  : <nil>(<nil>)
        	Test:       	TestJsonTypes
FAIL
FAIL	github.com/influxdata/telegraf/plugins/inputs/varnish

Has this been tested by someone who was experiencing the cardinality issue?

@ismaelpuerto have you had a chance to test this out and see if it works when you load a vcl manually? It would also be great if we could check that this is backwards compatible when the metric version is set to 1. As I would expect this change in behavior with removal of nonactive VCL's to only happen if a user configures the plugin to be metric version 2.

@@ -139,6 +244,16 @@ func (s *Varnish) Gather(acc telegraf.Accumulator) error {
continue
}

//skip not active vcls
vMetric := parseMetricV2(stat)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is parse metrics v2 inside a function named process metrics v1? I would expect metric 1 to keep backwards compatibility for our users (so not removing nonactive VCL's).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My goal was also to fix https://github.com/influxdata/EAR/issues/2289 and strip the active VCL name from field name for v1 metrics. If we want v1 to stay fully backward compatible, we should revert this and remove also unnecessary varnishadm vcl.list check.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

The issue to keep nonactive VCL's is that it can generate huge of metrics that seems useless.

It can be very good to skip all inactive vcls.

There is a similar PR available on #6832 (but seems abandoned) related to #5622

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand but we try to keep backwards compatibility as up until now nonactive vcl's were reported. The approach is to give users the option to opt in to new behavior rather than changing behavior on an upgrade of Telegraf.

@rhajek rhajek requested a review from helenosheaa September 10, 2021 13:08
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this MP. A couple of small in-line comments and questions and one higher-level question:

Can you explain why version 2 metrics include the id tag while version 1 metrics omit it? This seems to greatly increase the cardinality even on my little system running varnish + apache2 with the default configs. See my comment in the parseMetricV2 function.

Given one purpose of this MP is to reduce cardinality this seems to do the opposite as I have run it above.

Thanks!

edit: updated link to version 2 metrics

//check vcl.list output
if jsonOut[0].(float64) != 2 {
return "", fmt.Errorf("unsupported varnishadm format %v", jsonOut[0])
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you aware of any docs on when and how this version value changes? Even though it is at 2 right now, it will be a breaking surprise when it does in fact change.

Not required, just something for next time: If it is relatively static, then it might have been nice to unmarshal this into an object, as you do below with vclStruct, to make the references below easier to follow.

Copy link
Contributor Author

@rhajek rhajek Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only documentation that I found:

version "2" is still the same from 6.0.3 released in 9 Feb 2019

May be, check for the presence of "vcl.list", "-j" is enough and we can ignore version.

Unmarshal json array like this

[ 2, ["vcl.list", "-j"], 1631878235.913,
  {
    "status": "active",
    "state": "auto",
    "temperature": "warm",
    "busy": 0,
    "name": "boot"
}
]

into object is definitely better, I will try to fix this.

metric.fieldName = val
} else if val != "" {
metric.tags[sub] = val
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where id is getting added as a tag and why I am asking at a high level why we are increasing the cardinality on that value. Is that something users want to index against?

It seems concerning to add anything that doesn't match _vcl or _field as a tag as any change to the data format is going to cause cardinality to increase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review. Id tag is added by the last default regexp rule (?P)

//generic metric like MSE_STORE.store-1-1.g_aio_running_bytes_write
regexp.MustCompile(`([\w\-]*)\.(?P<id>[\w\-.]*)\.([\w\-]*)`),

My idea was not to increase the cardinality, but make the field names corresponding with https://varnish-cache.org/docs/trunk/reference/varnish-counters.html

Cardinality issue was caused because varnish adds timestamps into its metrics names when reloading. This should be solved by tripping vcl name from the name.

Example: we have following varnish metrics

  "MEMPOOL.busyobj.allocs"
  "MEMPOOL.ssl_buf.allocs"
  "MEMPOOL.req0.allocs"
  "MEMPOOL.sess0.allocs"
  "MEMPOOL.req1.allocs"
  "MEMPOOL.sess1.allocs"

Middle part identifies the name of resource. There are limited number of these so it is suitable for tagging/indexing.
We reduce number of tags but increase number of fields. I think cardinality in this case is the same.

varnish,section=MEMPOOL,id=busyobj alloc=xxxx
varnish,section=MEMPOOL,id=ssl_buf alloc=xxxx
varnish,section=MEMPOOL,id=req0 alloc=xxxx

varnish,section=MEMPOOL busyobj.alloc=xxxx
varnish,section=MEMPOOL ssl_buf.alloc=xxxx
varnish,section=MEMPOOL req0.alloc=xxxx

Biggest disadvantage of this approach i see is that this approach will not be backward compatible with v1.
Varnish metrics can have various names, it tried to cover that in 'plugins/inputs/varnish/varnish_test.go:441'

@helenosheaa @ismaelpuerto @anthosz So should we remove the id tag and add middle part into field name?

Copy link
Member

@helenosheaa helenosheaa Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a static number of values that will be set as id=tagvalue I don't think it will be as much of cardinality problem as we were seeing with vcl name where fields included timestamps or ip's so they were constantly increasing.

However what is the advantage of splitting it out into a tag? The more consistency between v1 and v2 unless there is a value add I think the better.

That said I believe the build from this PR is being tested so shall we make a decision on this once we get feedback from that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting varnish metric names into tags can be useful when creating dashboards. Represent a newly added resource as a tag looks more natural to me, it will be automatically shown in the dashboard if you use generic queries and filters.

But, may be, more important is to keep v2 as much as possible backward compatible and not break existing things.
User can always use custom regexps in telegraf configuration to create custom tag splitting if needed.

@helenosheaa
Copy link
Member

hey @ismaelpuerto would you mind testing this build with the two different metric versions?

@ismaelpuerto
Copy link

Hello, I just test with both version and It looks fine, For this test I used varnish-plus-6.0.8r4 and I tested metrics like kvstore, vha and accounting, regarding this metric, maybe it make sense that exists a tag with value of <namespace>, but I can understand that this a special case with a low audience and right now, it's not priority.

Regarding tags, for me is fine have "section" only, I don't need the name of vcl as tag because we have metric "n_vcl" to know if there was loaded a new vcl.

@rhajek
Copy link
Contributor Author

rhajek commented Sep 22, 2021

Hello, I just test with both version and It looks fine, For this test I used varnish-plus-6.0.8r4 and I tested metrics like kvstore, vha and accounting, regarding this metric, maybe it make sense that exists a tag with value of , but I can understand that this a special case with a low audience and right now, it's not priority.

Regarding tags, for me is fine have "section" only, I don't need the name of vcl as tag because we have metric "n_vcl" to know if there was loaded a new vcl.

To add a namespace tag for accounting metrics you can add following regexp into telegraf config. Should I include this rule to the defaults ?

[[inputs.varnish]]
regexps = ['^ACCG.(?P[\w-]).(?P<_field>[\w-.])']

@helenosheaa
Copy link
Member

I think instead of adding it to defaults as it's potentially low audience right now, just add it as an example in the README so people can add if they want to.

@rhajek rhajek requested a review from helenosheaa September 27, 2021 09:07
@ivankudibal
Copy link

@popey - the approver is unclear

Copy link
Member

@helenosheaa helenosheaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivankudibal @rhajek lgtm, however we are just waiting for feedback from the customer on the change before we would merge this.

@dg-nvm
Copy link

dg-nvm commented Nov 5, 2021

@helenosheaa can we get artifacts rebuild again ? Without this I cannot test changes

@powersj
Copy link
Contributor

powersj commented Nov 5, 2021

@rhajek can you rebase on master and push, please? Artifacts are only available for 30-days and there are some folks who want to test this PR out. Updating the PR with a rebase should trigger the tests and package builds again.

If users feel comfortable building the go binary themselves they can also do the following:

git clone https://github.com/rhajek/telegraf -b varnish_cache
cd telegraf
make

The above will produce a binary for use on the system it was run on (i.e. if you build on linux/amd64, it will be a linux/amd64 binary).

@dg-nvm
Copy link

dg-nvm commented Nov 8, 2021

@powersj I had to upgrade Golang to do it but it worked and I will use my compiled one to test it

@dg-nvm
Copy link

dg-nvm commented Nov 8, 2021

@rhajek I still see plenty of reload_20211108_110316_28415 fields on new configuration:

varnishstat (varnish-6.0.6 revision 29a1a8243dbef3d973aec28dc90403188c1dc8e7)
Telegraf unknown (git: varnish_cache 40c4735)

[[inputs.varnish]]
  #use_sudo = false
  #binary = "/bin/varnishstat"
  stats = ["all"]
  #name = instanceName
#  fielddrop = ["reload_*"]

Do we have to add some special configuration to get rid of these multiple fields ?

image

@rhajek
Copy link
Contributor Author

rhajek commented Nov 9, 2021

@rhajek I still see plenty of reload_20211108_110316_28415 fields on new configuration:

@dg-nvm thanks for testing this PR, you have to use metric_version = 2 in your telegraf.config to enable nonactive vcls removal.

@telegraf-tiger
Copy link
Contributor

telegraf-tiger bot commented Nov 9, 2021

@anthosz
Copy link

anthosz commented Nov 9, 2021

Hi @rhajek ,

Thank you very much for the works, I just tested with the artifacts of this day and seems to be good: I have only the active vcls.

In addition, I have one remark concerning the "unhealthy" field, I cannot find it via varnishadm/varnishstat and it's always set to zero (I tried to stop the backend -> instance sick on varnish side -> but unhealthy value is always set to 0)

@anthosz
Copy link

anthosz commented Nov 9, 2021

Another thing concerning custom_arguments (https://github.com/influxdata/telegraf/blob/db40a348a3dbefafdf20557d221522d3751d141d/plugins/inputs/varnish/README.md#custom-arguments):

  • It seems that the instance_name is not take into account when we use custom args: it need to be added manually
  • Once the instance name is set (manually), I have this error:
    2021-11-09T14:05:34Z E! [inputs.varnish] Error in plugin: error gathering metrics: invalid character 'a' looking for beginning of value 2021-11-09T14:05:34Z E! [telegraf] Error running agent: input plugins recorded 1 errors
    I have the feeling that it's due to vcl.list of varnishadm but not sure (debug mode doesn't help)
    available auto cold 0 boot [...]

@dg-nvm
Copy link

dg-nvm commented Nov 9, 2021

@anthosz @rhajek I checked that too and yes, when using instance name command line looks like this:

/bin/varnishadm vcl.list -j -n /instance-name

Instead of:

/bin/varnishadm -n /instance-name vcl.list -j

Order of arguments matters, command and -j have to be at the end

@rhajek
Copy link
Contributor Author

rhajek commented Nov 11, 2021

I fixed how to setup a custom instance name when using custom arguments in README.md.

@telegraf-tiger
Copy link
Contributor

@telegraf-tiger
Copy link
Contributor

@rhajek
Copy link
Contributor Author

rhajek commented Dec 14, 2021

@rhajek can you do a final rebase on master and resolve the conflicts on the readme? We can do a final review and land this if it looks good.

@powersj PR is rebased and ready for approval.

@sspaink sspaink merged commit 6de4d34 into influxdata:master Dec 21, 2021
powersj pushed a commit to powersj/telegraf that referenced this pull request Jan 21, 2022
Co-authored-by: Helen Weller <38860767+helenosheaa@users.noreply.github.com>
@anthosz
Copy link

anthosz commented Feb 22, 2022

Dear @sspaink ,

Do you know when it will be released?

Best regards,

@sspaink
Copy link
Contributor

sspaink commented Feb 22, 2022

It will be included in the next feature release which is scheduled March 16th, if you need it sooner I recommended using a nightly release artifact: https://github.com/influxdata/telegraf/blob/master/docs/NIGHTLIES.md

In the future, we are hoping to move to monthly releases so features should be released quicker: https://community.influxdata.com/t/request-for-community-feedback-rcs-monthy-releases/23703

@anthosz
Copy link

anthosz commented Mar 22, 2022

Hi @sspaink ,

I don't see any new release since the 16th (I prefer to use a "stable" version). Do you have an estimation?

Best regards,

@powersj
Copy link
Contributor

powersj commented Mar 22, 2022

Do you have an estimation?

Plan is to start the release tomorrow - we decided to hold off a little bit so we could land some other snmp changes.

@sspaink
Copy link
Contributor

sspaink commented Mar 23, 2022

@anthosz
Copy link

anthosz commented Mar 24, 2022

@sspaink thank you! \o/

@anthosz
Copy link

anthosz commented Mar 25, 2022

Good job to all for this PR, all works well :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

varnish plugin should cleanup backend metric name
9 participants