
[JENKINS-54073] Buffering for FileLogStorage #81

Merged: 15 commits into jenkinsci:master from the buffering branch, Oct 26, 2018

Conversation

@jglick (Member) commented Oct 19, 2018

Checking whether this addresses JENKINS-54073. It does not seem to make any positive difference in little local tests but I doubt those are realistic anyway. Note that as far as I can tell, freestyle builds do not buffer, and neither did pre-JEP-210 Pipeline builds.


I think I finally tracked down the issue. While this PR still buffers writes to the log on master, the main change is ensuring that the PrintStream made available remotely (most significantly, to DurableTaskStep) refrains from sending every little chunk of output one by one, and also refrains from sending gratuitous ProxyOutputStream.Flush packets. More: #81 (comment)

Freestyle builds in fact do send little bits of output. I guess nobody cared. After all, it was harder to get lots of agents spewing log output in parallel.


The main open question for myself here is the expectation that remote code calls the equivalent of listener.getLogger().flush() when it is done with a major task (like a step). If it neglects to do so, any remaining content will get auto-flushed, but out of order in the build log and possibly too late (after the agent has been disconnected or the build completed). This is not exactly compatible (thus the need for jenkinsci/workflow-durable-task-step-plugin#85), but on the other hand I cannot think of many cases where code other than DurableTaskStep was printing content remotely to begin with—most plugins which print messages do so inside perform or similar, on the master.
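To make that expectation concrete, here is a hedged sketch (class name hypothetical, not part of this PR) of remote code that prints via a TaskListener and flushes when its major task is done, rather than relying on the delayed auto-flush:

```java
import hudson.model.TaskListener;
import jenkins.security.MasterToSlaveCallable;

/** Hypothetical example only: remote code flushes its listener when it finishes a major task. */
class RemoteWork extends MasterToSlaveCallable<Void, RuntimeException> {
    private final TaskListener listener;

    RemoteWork(TaskListener listener) {
        this.listener = listener;
    }

    @Override public Void call() throws RuntimeException {
        listener.getLogger().println("doing some work on the agent");
        // ... actual work ...
        listener.getLogger().flush(); // do not rely on the delayed auto-flush
        return null;
    }
}
```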

A possible refinement would be to turn autoflush back on for the remote PrintStream, so that println(String) and the like are sent immediately, but then modify DecoratedTaskListener to extract the underlying OutputStream from the PrintStream delegate.getLogger(). Thus there would only be one PrintStream on top of the stack of delegates, so any calls to it directly would flush, yet DurableTaskStep.HandlerImpl would continue to write to the buffered, non-autoflushing OutputStream stack, which would be ideal. The problem with this scenario is that decorators then lose access to the PrintStream that they sometimes need for synchronization purposes to avoid mangled output under high concurrency, as in jenkinsci/timestamper-plugin@bf1e61c in jenkinsci/timestamper-plugin#25. I am not sure yet if there is a good resolution to this conflict; if there is, it should be possible to do it compatibly later (since the worst that could happen is that something like jenkinsci/workflow-durable-task-step-plugin#85 gratuitously sends one flush packet at the end of a step).

@jglick jglick added the on-hold label Oct 19, 2018
dwnusbaum previously approved these changes Oct 19, 2018

@dwnusbaum (Member) left a comment:

Given that Freestyle builds and pre-JEP-210 Pipeline builds aren't buffered I'd be surprised if this made a significant difference.

Is the main downside to this change that we would lose unflushed data in the buffer during a restart? Since we aren't currently using SYNC or DSYNC to guarantee our writes get to the disk, it doesn't seem like the buffering would meaningfully change the durability of the logs, so the change seems ok to me.

@jglick jglick changed the title from "Buffer the main log output stream" to "[JENKINS-54073] Buffering for FileLogStorage" Oct 25, 2018
@jglick jglick removed the on-hold label Oct 25, 2018
@jglick jglick requested a review from dwnusbaum October 25, 2018 02:12
@@ -93,6 +93,7 @@
step1.getLogger().println("one #3");
overall.getLogger().println("pausing");
overallHtmlPos = assertOverallLog(overallHtmlPos, "<span class=\"pipeline-node-1\">one #2\none #3\n</span>pausing\n", true);
// TODO if we produce output from the middle of a step, we need new span blocks
Member Author (jglick):
An unrelated bug I noticed: the hide links do not work reliably for running builds.

Member Author (jglick), Oct 25, 2018:
As an aside: I checked actual progressiveHtml HTTP responses and they look sane, and I certainly verified this scenario when developing the feature, so I am not sure why it would not be working now. Possibly a client-side CSS issue. Will look into it separately. This TODO is merely to make sure that the functional tests verify that the response is right, something that I think the current test does not exercise.


}

private static final class DelayBufferedOutputStream extends BufferedOutputStream {
Member Author (jglick):
The key part of the fix: buffering.
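For illustration, a minimal standalone sketch of the concept (class and constant names are mine, not the plugin's; the real class also distinguishes its own scheduled flushes from explicit ones via a FlushControlledOutputStream, and uses the Jenkins Timer rather than its own executor):

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.lang.ref.WeakReference;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch only: a BufferedOutputStream that also flushes itself on a timer,
 * with the period backing off over time so that long-running chatty tasks
 * send fewer, larger chunks.
 */
class DelayBufferedOutputStreamSketch extends BufferedOutputStream {

    private static final ScheduledExecutorService TIMER = Executors.newSingleThreadScheduledExecutor();

    private static final long MIN_RECURRENCE_PERIOD = 250;    // ms
    private static final long MAX_RECURRENCE_PERIOD = 10_000; // ms
    private static final double RECURRENCE_PERIOD_BACKOFF = 1.05;

    private long recurrencePeriod = MIN_RECURRENCE_PERIOD;

    DelayBufferedOutputStreamSketch(OutputStream out) {
        super(out, 1 << 16); // 64 kB buffer
        reschedule();
    }

    private void reschedule() {
        // Hold only a weak reference so an abandoned stream can be garbage-collected
        // instead of being kept alive forever by the timer.
        WeakReference<DelayBufferedOutputStreamSketch> ref = new WeakReference<>(this);
        TIMER.schedule(() -> {
            DelayBufferedOutputStreamSketch os = ref.get();
            if (os != null) {
                try {
                    os.flush();
                } catch (IOException x) {
                    // Rescheduling after an error is harmless, so just fall through.
                }
                os.reschedule();
            }
        }, recurrencePeriod, TimeUnit.MILLISECONDS);
        recurrencePeriod = Math.min((long) (recurrencePeriod * RECURRENCE_PERIOD_BACKOFF), MAX_RECURRENCE_PERIOD);
    }
}
```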


BufferedBuildListener(OutputStream out) throws IOException {
this.out = out;
ps = new PrintStream(out, false, "UTF-8");
Member Author (jglick):
Also important: the false (autoflush disabled). Compare this.
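For context, this is plain java.io behavior rather than anything plugin-specific: with autoflush enabled, every println flushes the whole stream stack immediately; with it disabled, output sits in intermediate buffers until something calls flush() or a buffer fills. A tiny self-contained demonstration:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class AutoflushDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // autoflush=false: println leaves the bytes in the 8 kB BufferedOutputStream
        PrintStream quiet = new PrintStream(new BufferedOutputStream(sink), false, "UTF-8");
        quiet.println("hello");
        System.out.println("before flush: " + sink.size() + " bytes"); // 0
        quiet.flush();
        System.out.println("after flush:  " + sink.size() + " bytes"); // 6
        // With autoflush=true, the bytes would have reached the sink as soon as println returned.
    }
}
```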

Member:
For reference: jenkinsci/jenkins#3703

@jglick (Member Author) commented Oct 25, 2018:

we would lose unflushed data in the buffer during a restart

Yes it is possible, especially during an ungraceful shutdown; this would help for graceful shutdowns. Note that data is flushed whenever

  • we switch to writing output from a different node
  • the log is displayed in any way

I would be open to backing out the log buffering, since it seems the critical change is the remote buffering anyway.

@jglick jglick dismissed dwnusbaum’s stale review October 25, 2018 02:30

Predates main change.

dwnusbaum previously approved these changes Oct 25, 2018

@dwnusbaum (Member) left a comment:

No strong opinion about backing out the log buffering. I don't think it should matter either way, so maybe better to keep the code simpler, but maybe it would help a little bit in some cases.


private void reschedule() {
Timer.get().schedule(new Flush(this), recurrencePeriod, TimeUnit.MILLISECONDS);
recurrencePeriod = Math.min((long) (recurrencePeriod * RECURRENCE_PERIOD_BACKOFF), MAX_RECURRENCE_PERIOD);
Member:
Why back the recurrencePeriod off over time unconditionally? My initial thought is to back off only if there were no writes to the buffer since the last call to #run.

Member Author (jglick):
That is what DurableTaskStep did (prior to watch mode). My reasoning is that most builds, most of the time, are not actually being watched by anyone. So if you have a dozen parallel threads all busily spewing log output, as some reporters of JENKINS-54073 seem to, it is just a waste of Remoting packets to send every update promptly.

With the above logic, you get a chunk transmitted whenever the buffer fills up, or there is new text and up to 10s have elapsed since the last chunk. That logic is per TaskListener, basically per running step. This seems a reasonable balance between avoiding extreme chattiness in the Remoting channel on the one hand, and on the other hand having the build log look idle for so long that a user watching it gets bored and concludes that the build is hung. Nothing terribly scientific here, I just played with parameters in a realistic-looking log-heavy build and watched both the console UI and the output from

ssh -p … user@host tail-log hudson.slaves.SlaveComputer -l FINEST

until both looked reasonable.

Member Author (jglick):
Oh I forgot to mention: when wrapping in timestamps {…}, this works fine because the buffering is applied right before transmitting over Remoting, whereas FileMonitoringController.Watcher still runs every 100ms and hudson.plugins.timestamper.pipeline.GlobalDecorator is applied to that immediately. So after the process has been running for a few minutes, you will see a 10s pause, then a big chunk of output appear in the log but with every line marked with a timestamp to ~100ms precision. That is per thread; if there are multiple threads running, you see correspondingly faster output.

(progressiveText.jelly displays new build log content once per second. Blue Ocean seems to do something similar up to some cutoff; once the content for a given step gets even slightly long, it just seems to give up and stop showing anything. So it is only really interesting to look at this stuff in the classic UI.)

Member:
Ok, that reasoning makes sense to me, and it seems like a good trade-off.

}

@Override public void flush() throws IOException {
if (!fauxFlushing.get()) {
Member:
nit: I find the naming kind of confusing here. This looks to me like a FlushUnlessFlushingIsTemporarilyDisabledOutputStream :), and I think fauxFlushing could be more clearly named allowFlushing and its initial value and uses could be inverted.

Member Author (jglick):
Ack, can work on the naming a bit.

@jglick (Member Author) commented Oct 25, 2018:

maybe it would help a little bit in some cases

User thread dumps did show some lock contention writing to log, presumably because each write was unbuffered. I expect that this issue should be greatly reduced by minimizing the number of times writes to log are actually needed, and avoiding excessive interleaving, as the later parts of this PR do. Still, it seems useful to avoid the need for synchronous file writes.

Now that I think about it, since I have already written the DelayBufferedOutputStream for use on the remote side, I may as well use it on the master side too. That would reduce the chance and minimize the extent of text loss in the case of a sudden shutdown while a build is running.

@jglick jglick dismissed dwnusbaum’s stale review October 25, 2018 15:03

a little stale

@jglick jglick requested a review from dwnusbaum October 25, 2018 15:03
}

private Object readResolve() throws IOException {
return new BufferedBuildListener(new DelayBufferedOutputStream(ros));
Member (@dwnusbaum), Oct 25, 2018:
This part of the change causes test failures when running TaskListenerDecoratorTest in workflow-job, which as far as I can tell are caused by the decoration/filtering taking place on the master instead of the remote. If I remove the DelayBufferedOutputStream and use ros directly, the tests pass. Interestingly, if I use a plain BufferedOutputStream, the tests still fail, so the issue appears to be related to buffering in general. Requesting changes because this seems like a legitimate problem.

Member Author (jglick):
OK, will check this.

Member Author (jglick):
Actually it is not that decoration is taking place on the wrong node (you were perhaps confusing an earlier "Running on remote" message, always printed from the master, with the "printed a message on master=false" message being asserted). It is just that with buffering, the remotely printed message is not sent to the build log quickly enough; corrected with

diff --git a/src/test/java/org/jenkinsci/plugins/workflow/job/console/TaskListenerDecoratorTest.java b/src/test/java/org/jenkinsci/plugins/workflow/job/console/TaskListenerDecoratorTest.java
index 2262458..a53b615 100644
--- a/src/test/java/org/jenkinsci/plugins/workflow/job/console/TaskListenerDecoratorTest.java
+++ b/src/test/java/org/jenkinsci/plugins/workflow/job/console/TaskListenerDecoratorTest.java
@@ -200,6 +200,7 @@ public class TaskListenerDecoratorTest {
             }
             @Override public Void call() throws RuntimeException {
                 listener.getLogger().println("printed a message on master=" + JenkinsJVM.isJenkinsJVM());
+                listener.getLogger().flush();
                 return null;
             }
         }

The question is whether this is just a test fix that needs to be made, as I did in LogStorageTestBase, or whether this is a real problem and we need to somehow force a flush of any remote streams when a step ends. Let me think this over.

Member Author (jglick):
f8ef9d3 makes TaskListenerDecoratorTest pass sometimes, but it is flaky; whether the message appears before or after the build is marked complete depends on timing. (If you sleep 1 at the end of that test's script, it passes reliably.) Again, whether this is a problem in a realistic script is an open question, but it would at least be annoying for testability to need such sleeps, so I will continue to think about whether it is practical for the master to reach out to agents at the end of a step and request flushes.

Member Author (jglick):
(The issue only affects steps which actually produce content remotely, so normally sh, though I expect checkout and similar could be affected as of jenkinsci/jenkins#3563.)

Member Author (jglick):
And, some ExportTable.diagnoseInvalidObjectId warnings. Remoting 🤷‍♂️

Member Author (jglick):
whether it is practical for the master to reach out to agents at the end of a step and request flushes

I cannot think of any reasonable way to do that. You would need to have the agents Channel.export some flushing thunk which the BufferedBuildListener would collect so that its close could call them all. But at that point the channel might not even be alive, or remote calls could hang, and there could be complicated garbage collection issues here, etc.

I think it is more straightforward to just ask the step, or whatever is working with a TaskListener on the remote side, to flush at some logical conclusion. This is easily done for DurableTaskStep (I will file a PR for it) and for RemoteLauncher in core; for this little case in the test, the change I mentioned above suffices. On the flip side, cloud implementations of TaskListener need to ensure that flush works, something I apparently still need to handle in CloudWatchSender in jenkinsci/pipeline-cloudwatch-logs-plugin#8.
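To illustrate that last point, a hedged sketch (hypothetical class, not the actual CloudWatchSender) of a cloud-backed TaskListener whose logger delivers whatever it has batched when flush() is called:

```java
import hudson.model.TaskListener;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

/**
 * Sketch only: a cloud log sink exposed as a TaskListener must make flush()
 * actually deliver batched output, because callers now signal the end of a
 * major task by flushing.
 */
abstract class CloudTaskListenerSketch implements TaskListener {

    private transient PrintStream logger;

    /** Append one chunk of output to the current batch. */
    protected abstract void batch(byte[] b, int off, int len);

    /** Deliver all batched output to the remote log service. */
    protected abstract void deliverBatch() throws IOException;

    @Override public synchronized PrintStream getLogger() {
        if (logger == null) {
            try {
                logger = new PrintStream(new OutputStream() {
                    @Override public void write(int b) {
                        batch(new byte[] {(byte) b}, 0, 1);
                    }
                    @Override public void write(byte[] b, int off, int len) {
                        batch(b, off, len);
                    }
                    @Override public void flush() throws IOException {
                        deliverBatch(); // honor flush: push the batch now
                    }
                }, false, "UTF-8");
            } catch (UnsupportedEncodingException x) {
                throw new AssertionError(x);
            }
        }
        return logger;
    }
}
```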

Member:
Ok, having the remote side flush when it is done with the stream seems like a reasonable solution to me.


private void reschedule() {
Timer.get().schedule(new Flush(this), recurrencePeriod, TimeUnit.MILLISECONDS);
recurrencePeriod = Math.min((long) (recurrencePeriod * RECURRENCE_PERIOD_BACKOFF), MAX_RECURRENCE_PERIOD);
Member:
Why do we need an exponential backoff for simply flushing logs (which we're presumably doing to ensure we don't lose data in the event a master goes down)? Wouldn't it be enough to just schedule this to run periodically at a pretty long interval (say 10-30s)? Then it won't run that often, and a few seconds of log data isn't as big a deal.

Member Author (jglick):
which we're presumably doing to ensure we don't lose data in the event a master goes down

No, that is a consideration, but not the driver for the choice of MIN_RECURRENCE_PERIOD. Consider a user watching a build that is doing different stuff, including a number of quick sh steps. You want to see output from those in as close to real time as possible. But then once a given process has been running for a few minutes, the likelihood that it is being watched closely drops off and it is fine to send big chunks at intervals.

Member:
FWIW: I had a similar question, see #81 (comment) for related discussion.

private long recurrencePeriod = MIN_RECURRENCE_PERIOD;

DelayBufferedOutputStream(OutputStream out) {
super(new FlushControlledOutputStream(out)); // default buffer size: 8Kib
Member:
We're keeping these streams open as long as the build is running, right? If not, ignore the following, but if so, we should probably use 64 kB for the buffer size. Larger buffer sizes give considerably better I/O performance (especially for HDDs), and for SSDs, performance graphs showing much better throughput at larger block sizes are extremely common.

64 kB is a kind of sweet spot which maximizes performance without excessive memory use (64 kB per currently running build is a fairly small contribution to overall memory use, and we could always add a system property to let people reduce that for low-memory environments).

Member Author (jglick):
This buffer size seemed to work well in practice. Can always be tuned. (As noted above, it is a bit tricky to allow control via system property when using remotable classes.)

}

private void checkId(String id) throws IOException {
assert Thread.holdsLock(this);
if (!Objects.equals(id, lastId)) {
bos.flush();
Member:
Maybe I misunderstand something here (trying to review fast), but won't it defeat the point if we flush here, since checkId is called with interleaved parallels? Along with that, wouldn't it make sense to buffer the index output stream too?

Member:
Specifically because the index gets a lot of tiny writes with parallel interleaved logging.

Member Author (jglick):
won't it defeat the point if we flush here, since checkId is called with interleaved parallels

Not so often, because

the index gets a lot of tiny writes with parallel interleaved logging

is no longer true to nearly the same extent, due to the remote buffering.

It seemed unsafe to leave content not yet flushed to log when log-index already referred to that transition: your index would then contain offsets greater than the length of the file.

wouldn't it make sense to buffer the index output stream too

Should not be necessary as log-index is orders of magnitude smaller than log.
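To spell out the invariant, here is an illustrative sketch (names and index format are hypothetical, not the actual FileLogStorage code): before recording a node transition in log-index, the buffered log stream is flushed, so the recorded offset never exceeds the current length of log.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

class IndexedLogSketch implements AutoCloseable {
    private final File log;
    private final OutputStream bos; // buffered stream appending to the main log
    private final Writer index;     // log-index is tiny, so left unbuffered
    private String lastId;

    IndexedLogSketch(File log, File indexFile) throws IOException {
        this.log = log;
        this.bos = new BufferedOutputStream(new FileOutputStream(log, true));
        this.index = new OutputStreamWriter(new FileOutputStream(indexFile, true), StandardCharsets.UTF_8);
    }

    /** Appends content attributed to a node (null id meaning the overall build log). */
    synchronized void write(String id, byte[] content) throws IOException {
        if (!Objects.equals(id, lastId)) {
            bos.flush(); // ensure log.length() reflects everything written so far
            index.write(log.length() + (id == null ? "" : " " + id) + "\n");
            index.flush();
            lastId = id;
        }
        bos.write(content);
    }

    @Override public synchronized void close() throws IOException {
        bos.close();
        index.close();
    }
}
```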

Member:
Should not be necessary as log-index is orders of magnitude smaller than log.

Note that this is not necessarily true. To give one example: parallel shell steps that log relatively slowly but constantly will generate a lot of log-index entries until they hit the max backoff (one of the better arguments for a somewhat higher max backoff and growth factor). For example, something like:

randomDownloadTool downloading artifact_foo.jar
.
.
.
.
.
download completed

would generate this result.

However, the other argument is compelling enough in general.

Member Author (jglick):
will generate a lot of log index entries until they hit the max backoff

Yes but the absolute number of interleavings here is in practice small.

@svanoort (Member) left a comment:

I think the idea is generally sound at this point (and huge 👍 for catching a whole bunch of subtle risks along the way). But, we're still flushing too eagerly -- see points about the minimum recurrence period for auto-flush of logs, buffer size, and flushing with checkId (and lack of index buffering).

If those can be addressed, we could cut a Beta release ASAP to hand to users feeling pain.

I really, really would like to know that this has had some serious manual and scaling testing before a full release.

private static final Logger LOGGER = Logger.getLogger(DelayBufferedOutputStream.class.getName());

// TODO make these customizable (not trivial since this system properties would need to be loaded on the master side and then remoted)
private static final long MIN_RECURRENCE_PERIOD = 250; // ¼s
Member Author (jglick):
I would suggest flushing no more than every 10s or so.

I guess you missed the next two constants?

something that guarantees the stream will be flushed at Pipeline shutdown

See #81 (comment) plus linked PRs.


reschedule();
}

private void reschedule() {
Member Author (jglick):
Can you please be more specific about what is more complicated than it could be?



}
}

void run() {
Member Author (jglick):
The naming was just to match that of Flush.run. Note that it would seem simpler to just make this class implement Runnable, but I needed to use the WeakReference link.

}
}

void run() {
Member Author (jglick):
catch errors when rescheduling

I am actually on the fence about this. It is hard to know if the exception in this case was a fatal connection error, in which case it is pointless to reschedule, or something recoverable (and I struggle to come up with an example). At any rate, rescheduling after an exception seemed harmless.


@@ -149,11 +148,6 @@ protected static void close(TaskListener listener) throws Exception {
VirtualChannel channel = r.createOnlineSlave().getChannel();
channel.call(new RemotePrint("overall from agent", overall));
channel.call(new RemotePrint("step from agent", step));
while (!IOUtils.toString(text().readAll()).contains("overall from agent") || !IOUtils.toString(text().readAll()).contains("step from agent")) {
// TODO current cloud implementations may be unable to honor the completed flag on remotely printed messages, pending some way to have all affected loggers confirm they have flushed
Member Author (jglick):
Plan to address this differently, by having the cloud implementation honor flush.

Member Author (jglick):
(which I did)

@jglick (Member Author) commented Oct 26, 2018:

we're still flushing too eagerly

Not really. Addressed point-by-point above.

we could cut a Beta release

Pointless to mark it -beta since

  • the affected code is only active when using workflow-job 2.26, which is not currently published
  • the behavior is demonstrably more efficient than in the published non-beta release
  • too few people use the experimental update center; if we want to point specific people to try specific builds, we offer them incremental releases

[has this] had some serious manual and scaling testing[?]

I have been testing it manually with a local scalability load. That is how the problems in the previous state were identified and how these changes were developed. I do not know if you consider that “serious”.

@dwnusbaum (Member) left a comment:

Latest changes to the buffer size and allowing the tuning parameters to be configured look good to me.
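As a usage sketch: the property names come from the Tuning fields quoted below, but the fully-qualified prefix is an assumption on my part and worth checking against the real class. These would normally be set as -D options on the controller JVM; whether changing them at runtime (for example from the script console, as below) takes effect depends on when Tuning instances are created.

```java
// Assumed prefix; verify against the actual DelayBufferedOutputStream package before relying on it.
String base = "org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream.";
System.setProperty(base + "minRecurrencePeriod", "1000");         // flush no sooner than every 1s
System.setProperty(base + "recurrencePeriodBackoff", "1.5");      // reach the 10s maximum faster
System.setProperty(base + "bufferSize", String.valueOf(1 << 15)); // 32 kB instead of the 64 kB default
```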

static final class Tuning implements SerializableOnlyOverRemoting {
private Tuning() {}
// nonfinal for Groovy scripting:
long minRecurrencePeriod = Long.getLong(DelayBufferedOutputStream.class.getName() + ".minRecurrencePeriod", 250); // ¼s
Member:
🐜 This still seems awfully short -- if it's a fast-completing task that someone would watch, it'll finish fast and get flushed anyway. Given shell steps already come with 1/4 to 1/3 of a second of overhead, I think we can probably get away with 1-2s as a starting point.

Member:
Alternately 0.250s is fine with a higher growth factor (probably the better option, since the first few outputs come fast, but then it quickly slows down to something more sedate so as to avoid spamming the log index with slow-moving logs).

Member Author (jglick):
Given jenkinsci/workflow-durable-task-step-plugin#85 I can try 1s, the same as the delay in progressiveText.jelly.

long minRecurrencePeriod = Long.getLong(DelayBufferedOutputStream.class.getName() + ".minRecurrencePeriod", 250); // ¼s
long maxRecurrencePeriod = Long.getLong(DelayBufferedOutputStream.class.getName() + ".maxRecurrencePeriod", 10_000); // 10s
float recurrencePeriodBackoff = Float.parseFloat(System.getProperty(DelayBufferedOutputStream.class.getName() + ".recurrencePeriodBackoff", "1.05"));
int bufferSize = Integer.getInteger(DelayBufferedOutputStream.class.getName() + ".bufferSize", 1 << 16); // 64Kib
Member:
👍 for 64k buffers

Member Author (jglick):
Note that link-layer packets are typically much smaller than this. However, a larger buffer (divided into packets by TCP) means less overhead in the Remoting layer, for example in (de)serialization.

// nonfinal for Groovy scripting:
long minRecurrencePeriod = Long.getLong(DelayBufferedOutputStream.class.getName() + ".minRecurrencePeriod", 250); // ¼s
long maxRecurrencePeriod = Long.getLong(DelayBufferedOutputStream.class.getName() + ".maxRecurrencePeriod", 10_000); // 10s
float recurrencePeriodBackoff = Float.parseFloat(System.getProperty(DelayBufferedOutputStream.class.getName() + ".recurrencePeriodBackoff", "1.05"));
Member:
@jglick I think maybe you meant to make the backoff 1.5 rather than 1.05? With 1.05 we'll flush 75 times before we hit the maximum backoff (on run 76), whereas with a backoff of 1.5 it's about 9 cycles (enough to give you just under 30s of fast-flushing logs).

The 1.5 constant matches the behavior you seemed to want here, where the 1.05 would still generate a lot of log interleaving (and a big index file) with slow-writing logs.
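For reference, a quick standalone calculation of both schedules (rounding in the real code shifts the exact counts slightly):

```java
// Assumes min=250ms and max=10s as above: counts how many flushes happen
// before the period caps out, and how long that warm-up phase lasts.
public class BackoffMath {
    public static void main(String[] args) {
        for (double backoff : new double[] {1.05, 1.5}) {
            long period = 250, elapsed = 0;
            int cycles = 0;
            while (period < 10_000) {
                elapsed += period;
                period = Math.min((long) (period * backoff), 10_000);
                cycles++;
            }
            System.out.printf("backoff %.2f: %d sub-maximum flushes over ~%ds%n", backoff, cycles, elapsed / 1000);
        }
    }
}
```

With 1.05 the period stays below the 10s cap for a few minutes of elapsed schedule time; with 1.5 it caps out after about ten cycles, under half a minute.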

Member Author (jglick):
No, 1.05 was intentional.

Member Author (jglick):
No, 1.05 was meant.


private static final Logger LOGGER = Logger.getLogger(DelayBufferedOutputStream.class.getName());

static final class Tuning implements SerializableOnlyOverRemoting {
Member:
👍 👍 for making it configurable

@svanoort (Member):
@jglick I didn't miss the next two constants, but the use-case of a user live-tailing the log hadn't been explained (or if so I was too groggy to catch it). Their values make more sense in that light. I think a few tweaks to their values will help ensure we don't get into an index-abuse situation again.

@svanoort svanoort closed this Oct 26, 2018
@svanoort svanoort reopened this Oct 26, 2018
@svanoort (Member):
Sorry, clicked "close and comment" rather than "comment" by accident.

jglick added a commit to jglick/workflow-durable-task-step-plugin that referenced this pull request Oct 26, 2018
@dwnusbaum dwnusbaum merged commit c3d93f1 into jenkinsci:master Oct 26, 2018
@jglick jglick deleted the buffering branch October 26, 2018 20:22
batmat added a commit to batmat/pipeline-utility-steps-plugin that referenced this pull request Jan 24, 2019
Makes a few tests fail, like


```
Using temp dir: /pct/tmp/work/pipeline-utility-steps/target/tmp/jkh8549942268385729296
   2.966 [id=118]	INFO	o.j.h.test.SimpleCommandLauncher#launch: agent launched for slave0
   3.559 [id=2382]	INFO	o.j.p.workflow.job.WorkflowRun#finish: p jenkinsci#1 completed: SUCCESS
   3.578 [id=2875]	INFO	jenkins.model.Jenkins#cleanUp: Stopping Jenkins
   3.611 [id=2875]	INFO	jenkins.model.Jenkins#cleanUp: Jenkins stopped
[ERROR] Tests run: 15, Failures: 10, Errors: 0, Skipped: 0, Time elapsed: 83.319 s <<< FAILURE! - in org.jenkinsci.plugins.pipeline.utility.steps.conf.mf.ReadManifestStepTest
[ERROR] testJarWithGradleManifest(org.jenkinsci.plugins.pipeline.utility.steps.conf.mf.ReadManifestStepTest)  Time elapsed: 3.859 s  <<< FAILURE!
java.lang.AssertionError:

Expected: a string containing "Reading: META-INF/MANIFEST.MF"
     but: was "Started
Running in Durability level: MAX_SURVIVABILITY
[Pipeline] node
Running on slave0 in /pct/tmp/work/pipeline-utility-steps/target/tmp/jkh1255726218456449879/workspace/p
```

Seems to have been introduced by jenkinsci/workflow-api-plugin#81