-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: logging behavior policies and rate-limited logging #92
Conversation
This pull request introduces 1 alert when merging 81c8b75 into fd5ac47 - view on LGTM.com new alerts:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start, it would be great if the periods were calculated as absolute (monotonically) mathematically rather than relatively. And then being sure that following a write system call (which can block for very long time periods) we evaluate the current period we are in then.
nexit("negative sleep"); | ||
do { | ||
ret = nanosleep(&sleep, &sleep); | ||
} while (ret == -1 && errno == EINTR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure you want a loop that doesn't check the time change. Shouldn't we be checking to see if the absolute time has reached or passed the calculated absolute time for the sleep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reading of nanosleep(2) is that it sleeps for relative time intervals and when interrupted by a signal, it will store the remaining time to sleep in the timespec pointed to by the second argument, which we will use on the next pass through the loop as the relative time to sleep.
And I think I'm not doing the right thing for the negative sleep - if it's negative, it means we're in a new period and we should just return without sleeping. We start a new period after every sleep anyway.
|
||
void start_new_period() { | ||
bytes_written_this_period = 0; | ||
clock_gettime(CLOCK_MONOTONIC, &start_of_this_period); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this going to cause the periods to drift? Instead, could we just use math to calculate the periods from some initial startup time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you illustrate with a bit of code how you propose to calculate the periods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cautionary note: clock_gettime() is relatively expensive in tight inner loops. I'm not saying this code is calling it needlessly, but keep in mind as you fiddle with the timing calculations. A common strategy is to cache the time for re-use and only refresh it after blocking IO calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code is pretty close to that idea. It finds out the time the first thing on entering and then the usual sequence is
write()
clock_gettime() // how long to sleep for?
nanosleep() // sleep
clock_gettime() // when does the new period start
Not sure if it's worth/good idea eliminating the last clock_gettime() and just adding the sleep time to the previous timestamp. Seems like a topic for a profiling session once we get to that point.
if (diff_nano < SECS_PER_PERIOD * BILLION) { | ||
ssize_t bytes_we_can_write = BYTES_PER_PERIOD - bytes_written_this_period; | ||
if (num_read <= bytes_we_can_write) { | ||
write_io_bufs(pipe, buf_start, num_read); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this assume that after a write we are still in the same period?
It seems that we have to re-evaluate what period we are in following each write_to_logs()
call.
bytes_written_this_period += bytes_we_can_write; | ||
buf_start += bytes_we_can_write; | ||
bytes_remaining = num_read - bytes_we_can_write; | ||
sleep_for_the_rest_of_this_period(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't block the thread execution here.
We should either propagate the condition back (with the remaining time) and set a glibc timer, or we need to use a much shorter period so that it is acceptable, in the worst case, to hang for that duration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code implements the backpressure policy, I'm still adding the others.
I'm not sure I follow your concern or how backpressure is going to be maintained if we return from here without blocking. If we do, aren't we going to quickly enough be reading from the container's pipe again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first reaction to the sleep was also "yeuch" but given that conmon is written as a sync read()/write() loop there's no other way to idle but to sleep.
If conmon was written as a poll/epoll loop then you could set a timer and flip the read fd out of the loop till it fired. That would be neater and more efficient, but a bigger overhaul of conmon.
Added implementations of the drop, ignore and passthrough log policies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start, but seems like we need to be sure the rate limit is applied to what is being written to disk and to journald, since conmon
adds metadata to it.
src/log_rate.c
Outdated
return true; | ||
} | ||
char* endptr; | ||
errno = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have to set errno
to zero here? We don't before nanosleep
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a shortcut, also used in
Line 247 in c8f7443
errno = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and with nanosleep() we don't because we only look at errno if nanosleep() returns an error.
case 'G': | ||
scale = (size_t)1024 * 1024 * 1024; | ||
break; | ||
case 'T': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We really want to allow somebody to specify a "tera-byte" level logging rate? Eek.
No need for passthrough
default if we just make the default rate-limit essentially infinite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, there would be a sysconfig
value for the maximum allowable logging rate for a individual container. That way, nobody could cause problems for the SRE.
Is that possible to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the rate limit suffixes, I was going by what pv(1) does. We can call it future-proofing. Or I can get rid of T.
As far as a config file goes, conmon receives all it's config on the command line. I think this is something that needs to be looked at at the podman and cri-o level, which I don't have a good grasp of yet.
} | ||
break; | ||
case IGNORE: | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if there were metrics collected for how many bytes were ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can collect the metrics, but what should we do with them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be returned up the stack they'd have to be collected and written to a file here, and read by the caller. Alternatively a pipe fd could be sent from the parent similar to how sync_fd is in main. Seems like a reasonable extension but a bit out of scope here
src/log_rate.c
Outdated
start_new_period(); | ||
} | ||
|
||
bool log_rate_write_to_logs(stdpipe_t pipe, char *buf, ssize_t num_read) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it'd be worth a comment here stating that this function's call signature must be kept the same as write_to_logs()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it is not clear this is the interface we want to rate limit. The SRE needs to be able to control the final behavior of conmon
, after it has added all the metadata to the logs. So it would seem we need to add the rate limit to the file writes to disk, or the journald writes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's true. Stay tuned...
Added a check for extra text after the suffix.
The work on making log collection more reliable is ongoing, but this branch as it stands is unlikely to be directly usable. |
A PR to address #84