Wavefront output should distinguish between retryable and non-retryable errors #8404

fishy · 2020-11-13T20:04:43Z

Currently we assume all wavefront send errors are retryable, and when an
error happens during the Write function we reject the buffer and keep
retrying on the next tick. This means that when a genuinely non-retryable
error happens, we just keep getting the same error on every tick and
never flush the buffer.

One such error we encountered is the "empty metric name" error.

Add an isRetryable function to detect non-retryable errors. It defaults
to assuming that all errors are retryable (so it matches the current
behavior), but makes it possible to mark certain errors as
non-retryable.

Currently only the "empty metric name" error is handled as non-retryable.
A support ticket has been filed with Wavefront to provide a canonical
way to distinguish between retryable and non-retryable errors.

Signed-off-by: Yuxuan 'fishy' Wang <yuxuan.wang@reddit.com>

Required for all PRs:

Signed CLA.
Associated README.md updated.
Has appropriate unit tests.

ssoroka · 2020-11-13T20:29:56Z

Hey @fishy. It's up to each output to decide what's retryable and what's not. If an output gets an error trying to post a metric and it doesn't want it to be retried, it should log the error and continue instead of returning an error from the Write() function.

fishy · 2020-11-13T21:15:37Z

@ssoroka Thanks for the feedback, Steven! That sounds reasonable, but the problem is that if we go that route we need to make sure the output plugin actually has a working logger configured; otherwise the attempt to log and swallow the error will go into a black hole, or even cause a panic. Looking at the plugin code, it does have the logger in the struct, but I'm not entirely sure whether it's configured (for example, it's not initialized during the init function:

telegraf/plugins/outputs/wavefront/wavefront.go

Lines 348 to 355 in ca04106

    return &Wavefront{
    	Token:           "DUMMY_TOKEN",
    	MetricSeparator: ".",
    	ConvertPaths:    true,
    	ConvertBool:     true,
    	TruncateTags:    false,
    	ImmediateFlush:  true,
    }

). Can you point me to the code that actually sets the Log field in the wavefront plugin?

ssoroka · 2020-11-13T21:19:50Z

If you're looking at Log telegraf.Logger, it gets set automatically by Telegraf. You don't see it because it happens via reflection, in an internal function called SetLoggerOnPlugin. You can use the logger safely anywhere within the struct's methods.
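
To make the reflection point concrete, here is a small Go sketch of how a host program can fill in an exported Log field that the plugin never assigns itself. This is not Telegraf's SetLoggerOnPlugin; the Logger interface, the Plugin type, and the setLogger helper are hypothetical stand-ins that only illustrate the general technique.

```go
package main

import (
	"fmt"
	"reflect"
)

// Logger is a trimmed, hypothetical stand-in for a plugin logger interface.
type Logger interface {
	Errorf(format string, args ...interface{})
}

// stdoutLogger is a toy Logger used only for this illustration.
type stdoutLogger struct{ prefix string }

func (l stdoutLogger) Errorf(format string, args ...interface{}) {
	fmt.Println(l.prefix + fmt.Sprintf(format, args...))
}

// Plugin mimics an output plugin that declares an exported Log field
// but does not initialize it in its constructor.
type Plugin struct {
	Log Logger
}

// setLogger assigns logger to an exported field named "Log" on the struct
// that plugin points to. It reports whether the assignment happened.
func setLogger(plugin interface{}, logger Logger) bool {
	v := reflect.ValueOf(plugin)
	if v.Kind() != reflect.Ptr || v.IsNil() {
		return false // need a non-nil pointer so the field is settable
	}
	v = v.Elem()
	if v.Kind() != reflect.Struct {
		return false
	}
	field := v.FieldByName("Log")
	if !field.IsValid() || !field.CanSet() {
		return false // no such field, or it is unexported
	}
	lv := reflect.ValueOf(logger)
	if !lv.IsValid() || !lv.Type().AssignableTo(field.Type()) {
		return false
	}
	field.Set(lv)
	return true
}

func main() {
	p := &Plugin{}
	if setLogger(p, stdoutLogger{prefix: "E! [outputs.example] "}) {
		p.Log.Errorf("non-retryable error: %v", "empty metric name")
	}
}
```

Because the host sets the field after construction, the plugin's own constructor never mentions it, which is why the initialization is not visible in the plugin source.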

fishy · 2020-11-13T21:32:19Z

Thanks, @ssoroka. I updated the PR to only check retryability inside the wavefront plugin. Please take another look.

(commit message and PR description also updated accordingly)

ssoroka · 2020-11-13T21:55:42Z

plugins/outputs/wavefront/wavefront.go

+					return fmt.Errorf("Wavefront sending error: %v", err)
+				}
+				w.Log.Errorf("non-retryable error during Wavefront.Write: %v", err)
+				return nil


You might not want to return here, and instead finish writing the batches, as some of those metrics are probably still good.

Good point. Done.


ssoroka · 2020-11-13T22:08:11Z

Merged! thank you

fishy · 2020-11-13T22:13:00Z

@ssoroka Thanks! I assume this will be included in the 1.16.3 release? Do you know when that release will happen?

ssoroka · 2020-11-13T22:24:30Z

Yep! I think we might have another one in a week or two. Until then it'll be in the nightly release.


ssoroka changed the title ~~Distinguish between retryable and non-retryable output errors~~ Wavefront output should distinguish between retryable and non-retryable errors Nov 13, 2020

ssoroka added the area/wavefront and bug labels Nov 13, 2020

fishy force-pushed the wavefront-non-retryable-errors branch from 07913a6 to ace138e Compare November 13, 2020 21:31

ssoroka reviewed Nov 13, 2020

View reviewed changes

fishy force-pushed the wavefront-non-retryable-errors branch from ace138e to 12a7d53 Compare November 13, 2020 21:58

ssoroka merged commit 18460e1 into influxdata:master Nov 13, 2020

fishy deleted the wavefront-non-retryable-errors branch November 13, 2020 22:11

This was referenced Nov 13, 2020

SDK to provide canonical retriable given error wavefrontHQ/wavefront-sdk-go#60

Open

Add debug logging for non-retryable metric data on wavefront output. #8405

Closed

ssoroka pushed a commit that referenced this pull request Dec 1, 2020

Wavefront output should distinguish between retryable and non-retryab…

7d3e57f

…le errors (#8404) (cherry picked from commit 18460e1)

arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023

Wavefront output should distinguish between retryable and non-retryab…

ba716d3

…le errors (influxdata#8404)
