metric_version = 1 memory leak #9821

Closed
noarrg opened this issue Sep 26, 2021 · 2 comments · Fixed by #12160
Labels
area/prometheus · bug (unexpected problem or unintended behavior) · help wanted (request for community participation, code, contribution) · size/m (2-4 day effort) · waiting for response (waiting for response from contributor)

Comments

@noarrg

noarrg commented Sep 26, 2021

Relevant telegraf.conf:
[agent]
interval = "1s"
round_interval = false
metric_batch_size = 800
metric_buffer_limit = 1000
collection_jitter = "0s"
flush_interval = "1s"
flush_jitter = "0s"
hostname = ""
omit_hostname = true
debug = true
quiet = false
logfile = "logger"

[[outputs.prometheus_client]]
listen = ":9276"
path = "/metrics"
expiration_interval = "20s"
export_timestamp = false
metric_version = 1

[[inputs.socket_listener]]
service_address = "udp://:8094"
read_buffer_size = "8MB"
data_format = "prometheus"

System info:
telegraf-1.19.3_windows_amd64
Windows Server 2019 Datacenter

Steps to reproduce:
The input was about 800 metrics per second sent to the UDP socket_listener input; the prometheus_client output exposed them and they were forwarded, but the telegraf process memory grew linearly, reaching 4 GB after about an hour of running.

Setting metric_version to 2 fixed this: process memory stayed at about 50 MB over a couple of hours of running.
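For anyone trying to reproduce this, here is a minimal sketch of a load generator that pushes metrics in Prometheus exposition format to the socket_listener UDP port from the config above. The metric name, label, and rate are illustrative assumptions, not the reporter's actual workload.

// reproduce.go - hypothetical load generator (not from the reporter's setup):
// pushes ~800 Prometheus-format samples per second to the socket_listener UDP port.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:8094")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	for {
		// One datagram per sample, in Prometheus exposition format, which the
		// socket_listener parses because data_format = "prometheus".
		for i := 0; i < 800; i++ {
			fmt.Fprintf(conn, "example_metric{index=\"%d\"} %d\n", i, time.Now().UnixNano())
		}
		time.Sleep(time.Second)
	}
}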

@noarrg noarrg added the bug unexpected problem or unintended behavior label Sep 26, 2021
@yummydsky

Please help by adding the following code to the v1 collector.go Add function; you can reference the v2 collector's Add function.

	// Expire metrics; doing this on Add ensures metrics are removed even if no
	// one is querying the data.
	c.Expire(time.Now(), c.ExpirationInterval)
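For context, here is a minimal, self-contained sketch of the expire-on-add pattern being suggested. The type and field names are illustrative, not the actual Telegraf collector types; the point is that pruning inside Add keeps memory bounded even when the /metrics endpoint is never scraped.

package main

import (
	"fmt"
	"sync"
	"time"
)

// sample is a stored metric value plus the time it was last updated.
type sample struct {
	value    float64
	lastSeen time.Time
}

// collector mimics the v1 Prometheus collector's bookkeeping: a map of
// samples guarded by a mutex, plus an expiration interval.
type collector struct {
	mu                 sync.Mutex
	expirationInterval time.Duration
	samples            map[string]sample
}

// Add stores a sample and then expires stale ones. Expiring here, not only
// in Collect, is what prevents unbounded growth when nothing scrapes the endpoint.
func (c *collector) Add(name string, value float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	c.samples[name] = sample{value: value, lastSeen: now}
	c.expire(now)
}

// expire drops every sample that has not been updated within the interval.
func (c *collector) expire(now time.Time) {
	for name, s := range c.samples {
		if now.Sub(s.lastSeen) > c.expirationInterval {
			delete(c.samples, name)
		}
	}
}

func main() {
	c := &collector{
		expirationInterval: 20 * time.Second,
		samples:            map[string]sample{},
	}
	c.Add("example_metric", 1)
	fmt.Println("stored samples:", len(c.samples))
}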

@sspaink sspaink added help wanted Request for community participation, code, contribution size/m 2-4 day effort labels Nov 2, 2022
powersj added a commit to powersj/telegraf that referenced this issue Nov 3, 2022
Currently, we are only expiring data if someone is getting the data.
This means that if data is continuously pushed, but not gathered, the
usage can grow and grow. This change forces expiration during Add,
similar to how v2 handles this as well.

fixes: influxdata#9821
@powersj
Contributor

powersj commented Nov 3, 2022

Hi,

Expire is currently only called during Collect(), which runs when someone loads the prometheus endpoint with the v1 metric type.

I have put up #12160 with a fix that adds the expire call during Add as well. In 20-30 minutes there will be some test artifacts attached to that PR. Could someone please download one of those artifacts and confirm the lower memory usage? If you do run into any issues, could you provide me with an example config plus the metrics pushed to the prometheus input?

Thanks!
