-
Notifications
You must be signed in to change notification settings - Fork 235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add simple threading to UDP packet handling, dump message when UDP buffers likely overflowed. #196
Add simple threading to UDP packet handling, dump message when UDP buffers likely overflowed. #196
Conversation
Signed-off-by: SpencerMalone <malone.spencer@gmail.com>
fd0adc9
to
5b125b5
Compare
Co-authored-by: AustinSchoen <austin.schoen@gmail.com> Signed-off-by: SpencerMalone <malone.spencer@gmail.com>
5b125b5
to
7cd5bab
Compare
One thing I'd like to add to this pre-merge: some exposed metrics about the potential # of times it's overflowed I also don't love the package name, trying to think of an alternative, up for suggestions. I'd love to make this a bit cleaner, but am struggling with the how around that. |
@@ -10,6 +10,7 @@ require ( | |||
github.com/howeyc/fsnotify v0.0.0-20151003194602-f0c08ee9c607 | |||
github.com/kr/pretty v0.1.0 // indirect | |||
github.com/mattn/go-isatty v0.0.4 // indirect | |||
github.com/mdlayher/netlink v0.0.0-20190313131330-258ea9dff42c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with this package after it's use in https://github.com/prometheus/node_exporter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good already, thank you for putting it together!
don't love the package name, trying to think of an alternative, up for suggestions
bufwatch
? bufferwatcher
?
I agree that we need a metric for when it happens.
Would it make sense to restrict the whole buffer watching business to platforms where it will work? Or is this portable?
return buf.Bytes(), nil | ||
} | ||
|
||
func convert_addr_to_int32(ip []byte) (ret [4]uint32) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use camelCase instead of snake_case
case sem <- struct{}{}: | ||
{ | ||
go func() { | ||
l.handlePacket(data[0:n], e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it safe to call handlePacket concurrently?
why a semaphore and goroutines per handlePacket
call, vs long-lived goroutines that consume off a (buffered) channel? IIUC, with the buffered channel it would be possible to try to write an event into this channel, but if the buffer is full, increment a drop counter and move on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is safe! At least, I ran this with a -race
build with no complaints, and have ~22 deployments of this happily chugging along at a pretty high volume and no related crashes.
Re: Semaphore - I think that is a great suggestion, this was originally @AustinSchoen's work, lemme collab with him. The idea of reading UDP packets as fast as we can and dropping them if the processing can't keep up instead of blocking (but keeping track of that drop tally) is so elegant, very +++ to headed down that path. That should totally alleviate my need for procfs metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all credit for that idea goes to @rtreffer!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I went with a semaphore because it was the first example of a tunable queue of workers that I could find, there are multiple ways of handling this model. We were still early in the experimentation phase as to what amount and kind of concurrency we needed to handle the firehose of UDP packets coming at our exporters. Regardless I don't think we'd want to just drop data on the floor if we don't have the capacity to handle it in every situation. Those packets should back up into the protocol buffer while waiting to be read by the configured number of handlers.
In situations where a system is overloaded only for a few (mili)seconds it would be better to block read off the protocol buffer and wait to handle them than to drop and log that we lost packets IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we'd want to just drop data on the floor if we don't have the capacity to handle it in every situation
I agree, and I don't propose that we do that. We can buffer again after the initial receive.
However, at some point we have to drop, and we want to know when that happens. I would rather guarantee that that happens in a place where we can easily observe it (i.e. in user-lang Go code) than introduce complex platform-dependent monitoring of the socket buffers themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With "buffer again" I mean stuff them into a buffered channel, we don't have to get fancy about it.
|
||
if err != nil { | ||
shouldWatchBuffer = false | ||
log.Debugf("Unable to watch UDP buffer size due to: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this always happen on non-Linux platforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this always log? Yes, it would. I figured a debug log would be a good place to quietly drop that to minimize confusion over why they can't get the buffer tracking to work, but if you think it's unnecessary, I would be happy to more quietly short circuit that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's fine as it is, just trying to understand! Maybe add a comment to say how we expect the initialization to behave on non-Linux systems?
Agree! The downside of the OS buffer is that we have a hard time knowing
when it overflowed. Go channels can buffer too, so by consuming the packets
quickly and then adding another buffer before the expensive processing, we
can have our cake and measure it too
…On Thu, Apr 11, 2019, 21:06 Austin Schoen ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In exporter.go
<#196 (comment)>
:
> buf := make([]byte, 65535)
for {
n, _, err := l.conn.ReadFromUDP(buf)
if err != nil {
log.Fatal(err)
}
- l.handlePacket(buf[0:n], e)
+ if shouldWatchBuffer && atomic.LoadUint32(&l.bufferWarning) != 1 {
+ go l.watchBufferSize()
+ }
+
+ data := append([]byte(nil), buf[0:n]...)
+ select {
+ case sem <- struct{}{}:
+ {
+ go func() {
+ l.handlePacket(data[0:n], e)
I believe I went with a semaphore because it was the first example of a
tunable queue of workers that I could find, there are multiple ways of
handling this model. We were still early in the experimentation phase as to
what amount and kind of concurrency we needed to handle the firehose of UDP
packets coming at our exporters. Regardless I don't think we'd want to just
drop data on the floor if we don't have the capacity to handle it in every
situation. Those packets should back up into the protocol buffer while
waiting to be read by the configured number of handlers.
In situations where a system is overloaded only for a few (mili)seconds it
would be better to block read off the protocol buffer and wait to handle
them than to drop and log that we lost packets IMO.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#196 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAICBo9VvSKKjVt6Qfeb3Xpc6pVfVk7iks5vf4exgaJpZM4cYzJs>
.
|
I like the implementation in |
great, I'll close this PR then since it already has conflicts. thank you for the impulse and discussion! |
Followup to #187
What this does
How I tested this
There were two parts I wanted to test, first off did the reporting work, and secondly did the threading help. Here's my testing format:
Proving that UDP overflow prediction works:
Terminal 1:
Terminal 2:
Then in terminal 1, I see the following:
At which point I pushed it a little more (verified it did not continue spamming the above message), and decided to verify in procfs:
Drops proven!
Proving that the threading works
This was simple, I redid the last test, but with
--udp-listener.handlers=10000 --udp-listener.threads=1000
. It did not complain about potential drops in my lower volume test. When I pushed it real hard, I still got it to whine, but at least we know it won't spew false positives.