[JENKINS-54073] Buffering for FileLogStorage #81
Conversation
…e we display the latest output.
Given that Freestyle builds and pre-JEP-210 Pipeline builds aren't buffered I'd be surprised if this made a significant difference.
Is the main downside to this change that we would lose unflushed data in the buffer during a restart? Since we aren't currently using SYNC or DSYNC to guarantee our writes get to the disk, it doesn't seem like the buffering would meaningfully change the durability of the logs, so the change seems OK to me.
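For readers unfamiliar with the options being referenced, here is a minimal standalone illustration (not code from this PR) of what requesting SYNC or DSYNC would look like; since FileLogStorage asks for neither, an OS crash can already lose recently written log data regardless of the new user-space buffering:
```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/**
 * Illustration only: DSYNC forces each write's content through to the storage
 * device before the call returns (SYNC additionally syncs file metadata).
 */
public class DurabilityDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND, StandardOpenOption.DSYNC)) {
            ch.write(ByteBuffer.wrap("a durably written line\n".getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```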
@@ -93,6 +93,7 @@
        step1.getLogger().println("one #3");
        overall.getLogger().println("pausing");
        overallHtmlPos = assertOverallLog(overallHtmlPos, "<span class=\"pipeline-node-1\">one #2\none #3\n</span>pausing\n", true);
        // TODO if we produce output from the middle of a step, we need new span blocks
An unrelated bug I noticed: the hide links do not work reliably for running builds.
As an aside: I checked actual progressiveHtml HTTP responses and they look sane, and I certainly verified this scenario when developing the feature, so I am not sure why it would not be working now. Possibly a client-side CSS issue. Will look into it separately. This TODO is merely to make sure that the functional tests verify that the response is right, something that I think the current test does not exercise.
}
private static final class DelayBufferedOutputStream extends BufferedOutputStream { |
The key part of the fix: buffering.
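To make the mechanism easier to follow, here is a rough standalone sketch of the idea (illustrative names; a plain ScheduledExecutorService stands in for the Jenkins Timer used in the diff, and the constants mirror the 250ms/10s/1.05 defaults discussed below):
```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.lang.ref.WeakReference;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch only: a buffered stream that also flushes itself on a timer, backing
 * off so long-running processes send progressively larger, less frequent chunks.
 */
class PeriodicallyFlushingStream extends BufferedOutputStream {
    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "periodic-log-flush");
                t.setDaemon(true);
                return t;
            });
    private static final long MAX_PERIOD = 10_000; // settle at ~10s per chunk
    private static final double BACKOFF = 1.05;
    private long period = 250; // start fast so quick steps still feel live

    PeriodicallyFlushingStream(OutputStream out) {
        super(out, 1 << 16); // 64 KiB buffer
        reschedule();
    }

    private void reschedule() {
        // Hold only a weak reference so an abandoned stream can be collected.
        WeakReference<PeriodicallyFlushingStream> ref = new WeakReference<>(this);
        TIMER.schedule(() -> {
            PeriodicallyFlushingStream s = ref.get();
            if (s == null) {
                return; // stream was garbage-collected; stop rescheduling
            }
            try {
                s.flush();
            } catch (IOException ignored) {
                // best-effort; the next cycle will retry
            }
            s.reschedule();
        }, period, TimeUnit.MILLISECONDS);
        period = Math.min((long) (period * BACKOFF), MAX_PERIOD);
    }
}
```
The actual class additionally suppresses the downstream flush while draining (see the fauxFlushing discussion below) and routes scheduling through a separate weakly referenced Flush task, as the diff shows.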
BufferedBuildListener(OutputStream out) throws IOException {
    this.out = out;
    ps = new PrintStream(out, false, "UTF-8");
Also important: the false. Compare this.
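For context, a minimal standalone illustration (not the PR's code) of what the false autoFlush argument changes: println no longer forces the buffered bytes (and, over Remoting, a Flush packet) out after every line.
```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class AutoFlushDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // autoFlush=false: println() only fills the 64 KiB buffer; nothing
        // reaches the sink until an explicit flush (or the buffer fills).
        PrintStream ps = new PrintStream(new BufferedOutputStream(sink, 1 << 16), false, "UTF-8");
        ps.println("hello");
        System.out.println("bytes past the buffer before flush: " + sink.size()); // 0
        ps.flush();
        System.out.println("bytes past the buffer after flush:  " + sink.size()); // 6
    }
}
```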
For reference: jenkinsci/jenkins#3703
Yes it is possible, especially during an ungraceful shutdown; this would help for graceful shutdowns. Note that data is flushed whenever …
I would be open to backing out the …
No strong opinion about backing out the log buffering. I don't think it should matter either way, so maybe better to keep the code simpler, but maybe it would help a little bit in some cases.
private void reschedule() {
    Timer.get().schedule(new Flush(this), recurrencePeriod, TimeUnit.MILLISECONDS);
    recurrencePeriod = Math.min((long) (recurrencePeriod * RECURRENCE_PERIOD_BACKOFF), MAX_RECURRENCE_PERIOD);
Why back the recurrencePeriod off over time unconditionally? My initial thought is to back off only if there were no writes to the buffer since the last call to #run.
That is what DurableTaskStep did (prior to watch mode). My reasoning is that most builds, most of the time, are not actually being watched by anyone. So if you have a dozen parallel threads all busily spewing log output, as some reporters of JENKINS-54073 seem to, it is just a waste of Remoting packets to send every update promptly.
With the above logic, you get a chunk transmitted whenever the buffer fills up, or there is new text and up to 10s have elapsed since the last chunk. That logic is per TaskListener, basically per running step. This seems a reasonable balance between avoiding extreme chattiness in the Remoting channel on the one hand, and on the other hand having the build log look idle for so long that a user watching it gets bored and concludes that the build is hung. Nothing terribly scientific here; I just played with parameters in a realistic-looking log-heavy build and watched both the console UI and the output from
ssh -p … user@host tail-log hudson.slaves.SlaveComputer -l FINEST
until both looked reasonable.
Oh, I forgot to mention: when wrapping in timestamps {…}, this works fine because the buffering is applied right before transmitting over Remoting, whereas FileMonitoringController.Watcher still runs every 100ms and hudson.plugins.timestamper.pipeline.GlobalDecorator is applied to that immediately. So after the process has been running for a few minutes, you will see a 10s pause, then a big chunk of output appear in the log, but with every line marked with a timestamp to ~100ms precision. That is per thread; if there are multiple threads running, you see correspondingly faster output.
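For illustration, a rough sketch (hypothetical class, not the Timestamper plugin's code) of per-line decoration applied before buffering, which is why the timestamps stay accurate even though the stamped bytes may sit in the buffer for seconds before being sent in one chunk; it assumes the hudson.console.LineTransformationOutputStream base class from Jenkins core:
```java
import hudson.console.LineTransformationOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.time.LocalTime;

/** Sketch only: stamps each completed line as it is written, before any
 *  downstream buffering delays transmission. */
class StampingStream extends LineTransformationOutputStream {
    private final OutputStream delegate;

    StampingStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    @Override protected void eol(byte[] b, int len) throws IOException {
        delegate.write(("[" + LocalTime.now() + "] ").getBytes(StandardCharsets.UTF_8));
        delegate.write(b, 0, len); // the original line, including its newline
    }

    @Override public void flush() throws IOException {
        delegate.flush();
    }

    @Override public void close() throws IOException {
        super.close(); // emits any trailing partial line
        delegate.close();
    }
}
```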
(progressiveText.jelly displays new build log content once per second. Blue Ocean seems to do something similar up to some cutoff; once the content for a given step gets even slightly long, it just seems to give up and stop showing anything. So it is only really interesting to look at this stuff in the classic UI.)
Ok, that reasoning makes sense to me, and it seems like a good trade-off.
}

@Override public void flush() throws IOException {
    if (!fauxFlushing.get()) {
nit: I find the naming kind of confusing here. This looks to me like a FlushUnlessFlushingIsTemporarilyDisabledOutputStream :), and I think fauxFlushing could be more clearly named allowFlushing and its initial value and uses could be inverted.
Ack, can work on the naming a bit.
User thread dumps did show some lock contention writing to …
Now that I think about it, since I have already written the …
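For readers following the thread, here is a rough sketch of the pattern being named (hypothetical identifiers, not the PR's exact code): a filter stream whose flush() can be temporarily suppressed, so the periodic task can drain the buffer without also propagating a flush downstream.
```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Sketch only: flushes pass through unless suppressed on the current thread. */
class FlushControlledOutputStream extends FilterOutputStream {
    private final ThreadLocal<Boolean> suppressFlush = ThreadLocal.withInitial(() -> false);

    FlushControlledOutputStream(OutputStream out) {
        super(out);
    }

    /** Runs {@code body} with downstream flushes disabled on this thread. */
    void withFlushSuppressed(Runnable body) {
        suppressFlush.set(true);
        try {
            body.run();
        } finally {
            suppressFlush.set(false);
        }
    }

    @Override public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // avoid FilterOutputStream's byte-at-a-time default
    }

    @Override public void flush() throws IOException {
        if (!suppressFlush.get()) {
            super.flush();
        }
    }
}
```
The periodic flusher would wrap its call to the buffered stream's flush() in withFlushSuppressed(...), so buffered bytes are pushed along without a Remoting Flush packet, while an explicit flush() from a caller still passes all the way through.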
}

private Object readResolve() throws IOException {
    return new BufferedBuildListener(new DelayBufferedOutputStream(ros));
This part of the change causes test failures when running TaskListenerDecoratorTest in workflow-job, which as far as I can tell are caused by the decoration/filtering taking place on the master instead of the remote. If I remove the DelayBufferedOutputStream and use ros directly, the tests pass. Interestingly, if I use a plain BufferedOutputStream, the tests still fail, so the issue appears to be related to buffering in general. Requesting changes because this seems like a legitimate problem.
OK, will check this.
Actually it is not that decoration is taking place on the wrong node (you were perhaps confusing an earlier Running on remote message, always printed from master, with the printed a message on master=false message being asserted). It is just that with buffering, the remotely printed message is not sent to the build log quickly enough; corrected with
diff --git a/src/test/java/org/jenkinsci/plugins/workflow/job/console/TaskListenerDecoratorTest.java b/src/test/java/org/jenkinsci/plugins/workflow/job/console/TaskListenerDecoratorTest.java
index 2262458..a53b615 100644
--- a/src/test/java/org/jenkinsci/plugins/workflow/job/console/TaskListenerDecoratorTest.java
+++ b/src/test/java/org/jenkinsci/plugins/workflow/job/console/TaskListenerDecoratorTest.java
@@ -200,6 +200,7 @@ public class TaskListenerDecoratorTest {
}
@Override public Void call() throws RuntimeException {
listener.getLogger().println("printed a message on master=" + JenkinsJVM.isJenkinsJVM());
+ listener.getLogger().flush();
return null;
}
}
The question is whether this is just a test fix that needs to be made, as I did in LogStorageTestBase, or whether this is a real problem and we need to somehow force a flush of any remote streams when a step ends. Let me think this over.
f8ef9d3 makes TaskListenerDecoratorTest pass sometimes, but it is flaky; it depends on timing whether the message appears before or after the build is marked complete. (If you sleep 1 at the end of that test's script, it passes reliably.) Again, whether this is a problem in a realistic script is an open question, but it would at least be annoying for testability to need such sleeps, so I will continue to think about whether it is practical for the master to reach out to agents at the end of a step and request flushes.
(The issue only affects steps which actually produce content remotely, so normally sh, though I expect checkout and similar could be affected as of jenkinsci/jenkins#3563.)
And, some ExportTable.diagnoseInvalidObjectId warnings. Remoting 🤷‍♂️
whether it is practical for the master to reach out to agents at the end of a step and request flushes
I cannot think of any reasonable way to do that. You would need to have the agents Channel.export some flushing thunk which the BufferedBuildListener would collect so that its close could call them all. But at that point the channel might not even be alive, or remote calls could hang, and there could be complicated garbage collection issues here, etc.
I think it is more straightforward to just ask the step, or whatever is working with a TaskListener on the remote side, to flush at some logical conclusion. This is easily done for DurableTaskStep (I will file a PR for it), and for RemoteLauncher in core, and for this little case in the test, the change I mentioned. On the flip side, cloud implementations of TaskListener need to ensure that flush works, something I apparently still need to handle in CloudWatchSender in jenkinsci/pipeline-cloudwatch-logs-plugin#8.
Ok, having the remote side flush when it is done with the stream seems like a reasonable solution to me.
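To make the convention concrete, a sketch of what "the remote side flushes when it is done" looks like in practice (hypothetical callable, assuming standard Jenkins core types):
```java
import hudson.model.TaskListener;
import java.io.IOException;
import jenkins.security.MasterToSlaveCallable;

/**
 * Sketch only: remote code that prints to a TaskListener flushes the logger
 * before it returns, so buffered remote output is not left behind when the
 * step (or the agent connection) ends.
 */
class RemoteWork extends MasterToSlaveCallable<Void, IOException> {
    private static final long serialVersionUID = 1L;
    private final TaskListener listener;

    RemoteWork(TaskListener listener) {
        this.listener = listener;
    }

    @Override public Void call() throws IOException {
        listener.getLogger().println("doing some work on the agent");
        listener.getLogger().flush(); // logical end of this unit of remote work
        return null;
    }
}
```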
private void reschedule() {
    Timer.get().schedule(new Flush(this), recurrencePeriod, TimeUnit.MILLISECONDS);
    recurrencePeriod = Math.min((long) (recurrencePeriod * RECURRENCE_PERIOD_BACKOFF), MAX_RECURRENCE_PERIOD);
Why do we need an exponential backoff for simply flushing logs (which we're presumably doing to ensure we don't lose data in the event a master goes down)? Wouldn't it be enough to just schedule this to run periodically at a pretty long interval (say 10-30s)? Then it won't run that often, and a few seconds of log data isn't as big a deal.
which we're presumably doing to ensure we don't lose data in the event a master goes down
No, that is a consideration, but not the driver for the choice of MIN_RECURRENCE_PERIOD. Consider a user watching a build that is doing different stuff, including a number of quick sh steps. You want to see output from those in as close to real time as possible. But then once a given process has been running for a few minutes, the likelihood that it is being watched closely drops off and it is fine to send big chunks at intervals.
FWIW: I had a similar question, see #81 (comment) for related discussion.
private long recurrencePeriod = MIN_RECURRENCE_PERIOD;

DelayBufferedOutputStream(OutputStream out) {
    super(new FlushControlledOutputStream(out)); // default buffer size: 8Kib
We're keeping these streams open as long as the build is running, right? If not, ignore the following, but if so, we should probably use 64 kB for the buffer size. Larger buffer sizes give considerably better I/O performance (especially for HDDs); for SSDs, performance graphs showing much better throughput at larger block sizes are extremely common.
64 kB is a kind of sweet spot which maximizes performance without excessive memory use (64 kB per currently-running build is a fairly small contribution to overall memory use, and we could always add a system property to let people reduce that for low-memory environments).
This buffer size seemed to work well in practice. Can always be tuned. (As noted above, it is a bit tricky to allow control via system property when using remotable classes.)
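For anyone who wants to sanity-check the buffer-size discussion locally, an illustrative micro-benchmark (rough numbers only; results depend heavily on the OS page cache and disk, and the class name is just for this demo):
```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferSizeDemo {
    public static void main(String[] args) throws IOException {
        byte[] line = "a fairly typical build log line of modest length\n".getBytes(StandardCharsets.UTF_8);
        for (int size : new int[] {8 << 10, 64 << 10}) {
            Path tmp = Files.createTempFile("buf", ".log");
            long start = System.nanoTime();
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream(tmp.toFile()), size)) {
                for (int i = 0; i < 1_000_000; i++) {
                    out.write(line); // many small writes, coalesced by the buffer
                }
            }
            System.out.printf("%d KiB buffer: %d ms%n", size >> 10, (System.nanoTime() - start) / 1_000_000);
            Files.delete(tmp);
        }
    }
}
```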
}

private void checkId(String id) throws IOException {
    assert Thread.holdsLock(this);
    if (!Objects.equals(id, lastId)) {
        bos.flush();
Maybe I misunderstand something here (trying to review fast), but won't it defeat the point if we flush here, since checkId is called with interleaved parallels? Along with that, wouldn't it make sense to buffer the index output stream too?
Specifically b/c the index gets a lot of tiny writes with parallel interleaved logging.
won't it defeat the point if we flush here, since checkId is called with interleaved parallels
Not so often, because
the index gets a lot of tiny writes with parallel interleaved logging
is no longer true to nearly the same extent, due to the remote buffering.
It seemed unsafe to leave content not yet flushed to log when log-index already referred to that transition: your index would then contain offsets greater than the length of the file.
wouldn't it make sense to buffer the index output stream too
Should not be necessary as log-index is orders of magnitude smaller than log.
Should not be necessary as log-index is orders of magnitude smaller than log.
Note that this may not necessarily be true, to give one example: parallel shell steps that log relatively slowly but constantly will generate a lot of log index entries until they hit the max backoff (one of the better arguments for a somewhat higher max backoff and growth factor). For example, something like:
randomDownloadTool downloading artifact_foo.jar
.
.
.
.
.
download completed
Would generate this result.
However the other argument is compelling enough in general.
will generate a lot of log index entries until they hit the max backoff
Yes but the absolute number of interleavings here is in practice small.
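To make the invariant defended above concrete, a standalone sketch (hypothetical file layout and field names, not FileLogStorage itself): the buffered log stream is flushed before an index entry referencing the current file length is appended, so the index can never point past the end of log.
```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Objects;

/** Sketch only: a main log plus an index of (offset, step id) transitions. */
class IndexedLogSketch {
    private final Path log;
    private final Path index;
    private final OutputStream bos; // buffered writer for the main log
    private String lastId;

    IndexedLogSketch(Path log, Path index) throws IOException {
        this.log = log;
        this.index = index;
        this.bos = new BufferedOutputStream(new FileOutputStream(log.toFile(), true));
    }

    synchronized void write(String id, String line) throws IOException {
        if (!Objects.equals(id, lastId)) {
            bos.flush(); // make the recorded offset a valid position in the log file
            String entry = Files.size(log) + (id == null ? "" : " " + id) + System.lineSeparator();
            Files.write(index, entry.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            lastId = id;
        }
        bos.write((line + "\n").getBytes(StandardCharsets.UTF_8));
    }
}
```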
I think the idea is generally sound at this point (and huge 👍 for catching a whole bunch of subtle risks along the way). But, we're still flushing too eagerly -- see points about the minimum recurrence period for auto-flush of logs, buffer size, and flushing with checkId (and lack of index buffering).
If those can be addressed, we could cut a Beta release ASAP to hand to users feeling pain.
I really, really would like to know that this has had some serious manual and scaling testing before a full release.
private static final Logger LOGGER = Logger.getLogger(DelayBufferedOutputStream.class.getName());

// TODO make these customizable (not trivial since this system properties would need to be loaded on the master side and then remoted)
private static final long MIN_RECURRENCE_PERIOD = 250; // ¼s
I would suggest flushing no more than every 10s or so.
I guess you missed the next two constants?
something that guarantees the stream will be flushed at Pipeline shutdown
See #81 (comment) plus linked PRs.
    reschedule();
}

private void reschedule() {
Can you be more specific please about what is more complicated than it could be?
    }
}

void run() {
The naming was just to match that of Flush.run. Note that it would seem simpler to just make this class implement Runnable, but I needed to use the WeakReference link.
catch errors when rescheduling
I am actually on the fence about this. It is hard to know if the exception in this case was a fatal connection error, in which case it is pointless to reschedule, or something recoverable (and I struggle to come up with an example). At any rate, rescheduling after an exception seemed harmless.
@@ -149,11 +148,6 @@ protected static void close(TaskListener listener) throws Exception {
        VirtualChannel channel = r.createOnlineSlave().getChannel();
        channel.call(new RemotePrint("overall from agent", overall));
        channel.call(new RemotePrint("step from agent", step));
        while (!IOUtils.toString(text().readAll()).contains("overall from agent") || !IOUtils.toString(text().readAll()).contains("step from agent")) {
        // TODO current cloud implementations may be unable to honor the completed flag on remotely printed messages, pending some way to have all affected loggers confirm they have flushed
Plan to address this differently, by having the cloud implementation honor flush.
(which I did)
Not really. Addressed point-by-point above.
Pointless to mark it
I have been testing it manually with a local scalability load. That is how the problems in the previous state were identified and how these changes were developed. I do not know if you consider that “serious”.
…able via system property.
Latest changes to the buffer size and allowing the tuning parameters to be configured look good to me.
static final class Tuning implements SerializableOnlyOverRemoting {
    private Tuning() {}
    // nonfinal for Groovy scripting:
    long minRecurrencePeriod = Long.getLong(DelayBufferedOutputStream.class.getName() + ".minRecurrencePeriod", 250); // ¼s
🐜 This still seems awfully short -- if it's a fast-completing task that someone would watch, it'll finish fast and get flushed anyway. Given shell steps already come with 1/4 to 1/3 of a second of overhead, I think we can probably get away with 1-2s as a starting point.
Alternately 0.250s is fine with a higher growth factor (probably the better option, since the first few outputs come fast, but then it quickly slows down to something more sedate so as to avoid spamming the log index with slow-moving logs).
Given jenkinsci/workflow-durable-task-step-plugin#85 I can try 1s, the same as the delay in progressiveText.jelly.
long minRecurrencePeriod = Long.getLong(DelayBufferedOutputStream.class.getName() + ".minRecurrencePeriod", 250); // ¼s
long maxRecurrencePeriod = Long.getLong(DelayBufferedOutputStream.class.getName() + ".maxRecurrencePeriod", 10_000); // 10s
float recurrencePeriodBackoff = Float.parseFloat(System.getProperty(DelayBufferedOutputStream.class.getName() + ".recurrencePeriodBackoff", "1.05"));
int bufferSize = Integer.getInteger(DelayBufferedOutputStream.class.getName() + ".bufferSize", 1 << 16); // 64Kib
👍 for 64k buffers
Note that link-layer packets are typically much smaller than this. However, a larger buffer (divided into packets by TCP) means less overhead in the Remoting layer, for example in (de)serialization.
float recurrencePeriodBackoff = Float.parseFloat(System.getProperty(DelayBufferedOutputStream.class.getName() + ".recurrencePeriodBackoff", "1.05"));
@jglick I think maybe you meant to make the backoff 1.5 rather than 1.05? With 1.05 we'll flush 75 times before we hit the maximum backoff (on run 76), whereas with a backoff of 1.5 it's about 9 cycles (enough to give you just under 30s of fast-flushing logs).
The 1.5 constant matches the behavior you seemed to want here, where the 1.05 would still generate a lot of log interleaving (and a big index file) with slow-writing logs.
No, 1.05 was intentional.
No, 1.05 was meant.
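For anyone checking the arithmetic in this thread, a small standalone calculation mirroring the reschedule formula shown in the diff (constants are the Tuning defaults from this PR):
```java
public class BackoffMath {
    public static void main(String[] args) {
        for (float backoff : new float[] {1.05f, 1.5f}) {
            long period = 250, max = 10_000, elapsed = 0;
            int cycles = 0;
            // Count timer cycles until the period is pinned at the maximum.
            while (period < max) {
                elapsed += period;
                cycles++;
                period = Math.min((long) (period * backoff), max);
            }
            System.out.printf("backoff %.2f: %d cycles (~%ds of ramp-up) before settling at %dms%n",
                    backoff, cycles, elapsed / 1000, max);
        }
    }
}
```
With integer truncation it lands close to the figures quoted above: roughly 75–80 cycles spread over a few minutes for 1.05, versus about 10 cycles in under 30 seconds for 1.5.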
private static final Logger LOGGER = Logger.getLogger(DelayBufferedOutputStream.class.getName());
static final class Tuning implements SerializableOnlyOverRemoting { |
👍 👍 for making it configurable
@jglick I didn't miss the next two constants, but the use-case of a user live-tailing the log hadn't been explained (or if so I was too groggy to catch it). Their values make more sense in that light. I think a few tweaks to their values will help ensure we don't get into an index-abuse situation again.
Sorry, clicked "close and comment" rather than "comment" by accident.
Makes a few tests fail, like
```
Using temp dir: /pct/tmp/work/pipeline-utility-steps/target/tmp/jkh8549942268385729296
2.966 [id=118] INFO o.j.h.test.SimpleCommandLauncher#launch: agent launched for slave0
3.559 [id=2382] INFO o.j.p.workflow.job.WorkflowRun#finish: p jenkinsci#1 completed: SUCCESS
3.578 [id=2875] INFO jenkins.model.Jenkins#cleanUp: Stopping Jenkins
3.611 [id=2875] INFO jenkins.model.Jenkins#cleanUp: Jenkins stopped
[ERROR] Tests run: 15, Failures: 10, Errors: 0, Skipped: 0, Time elapsed: 83.319 s <<< FAILURE! - in org.jenkinsci.plugins.pipeline.utility.steps.conf.mf.ReadManifestStepTest
[ERROR] testJarWithGradleManifest(org.jenkinsci.plugins.pipeline.utility.steps.conf.mf.ReadManifestStepTest) Time elapsed: 3.859 s <<< FAILURE!
java.lang.AssertionError:
Expected: a string containing "Reading: META-INF/MANIFEST.MF"
     but: was "Started
Running in Durability level: MAX_SURVIVABILITY
[Pipeline] node
Running on slave0 in /pct/tmp/work/pipeline-utility-steps/target/tmp/jkh1255726218456449879/workspace/p
```
Seems to have been introduced by jenkinsci/workflow-api-plugin#81
Checking whether this addresses JENKINS-54073. It does not seem to make any positive difference in little local tests, but I doubt those are realistic anyway.

Note that as far as I can tell, freestyle builds do not buffer, and neither did pre-JEP-210 Pipeline builds.

I think I finally tracked down the issue. While this PR still buffers writes to the log on master, the main change is ensuring that the PrintStream made available remotely (most significantly, to DurableTaskStep) refrains from sending every little chunk of output one by one, and also refrains from sending gratuitous ProxyOutputStream.Flush packets. More: #81 (comment)

Freestyle builds in fact do send little bits of output. I guess nobody cared. After all, it was harder to get lots of agents spewing log output in parallel.

The main open question for myself here is the expectation that remote code calls the equivalent of listener.getLogger().flush() when it is done with a major task (like a step). If it neglects to do so, any remaining content will get auto-flushed, but out of order in the build log and possibly too late (after the agent has been disconnected or the build completed). This is not exactly compatible (thus the need for jenkinsci/workflow-durable-task-step-plugin#85), but on the other hand I cannot think of many cases where code other than DurableTaskStep was printing content remotely to begin with; most plugins which print messages do so inside perform or similar, on the master.

A possible refinement would be to turn autoflush back on for the remote PrintStream, so that println(String) and the like are sent immediately, but then modify DecoratedTaskListener to extract the underlying OutputStream from the PrintStream delegate.getLogger(). Thus there would only be one PrintStream on top of the stack of delegates, so any calls to it directly would flush, yet DurableTaskStep.HandlerImpl would continue to write to the buffered, non-autoflushing OutputStream stack, which would be ideal. The problem with this scenario is that decorators then lose access to the PrintStream that they sometimes need for synchronization purposes to avoid mangled output under high concurrency, as in jenkinsci/timestamper-plugin@bf1e61c in jenkinsci/timestamper-plugin#25. I am not sure yet if there is a good resolution to this conflict; if there is, it should be possible to do it compatibly later (since the worst that could happen is that something like jenkinsci/workflow-durable-task-step-plugin#85 gratuitously sends one flush packet at the end of a step).