
Add support for streaming stdout/stderr from Child invocations #75

Open · wants to merge 11 commits into master

Conversation

@brianmario (Collaborator) commented Dec 30, 2016:

As the title says, this adds support for passing a block that receives chunks of stdout/stderr output as they're read.

I'm not super happy with how the API turned out, and I could probably add a few more tests, so suggestions are definitely welcome.

@tmm1 @peff @piki @carlosmn @simonsj @vmg @scottjg @mclark @arthurschreiber @tma in case any of you have time to review this 🙏

@@ -337,6 +337,10 @@ class MaximumOutputExceeded < StandardError
class TimeoutExceeded < StandardError
end

# Exception raised when output streaming is aborted early.
class Aborted < StandardError
@brianmario (author):
This could maybe be CallerAborted or something more specific?

Reply:

I'm fine with either.

@err << chunk
end
end
end
@brianmario (author):

This whole block is pretty gross, but the alternative may involve being tricky (and less readable?) with Ruby.

@mhagger:

This might be the "tricky" Ruby you're talking about, but it seems to me that when streaming is not requested, you could set @stdout_block anyway, to

Proc.new do |chunk|
  @out << chunk
  false
end

(like you do for the tests) and do the equivalent for @stderr_block. Then you could avoid the inner conditionals here.

To shrink the code even further, you could set

@blocks = { stdout => @stdout_block, stderr => @stderr_block }

(in which case you wouldn't even need @stdout_block and @stderr_block anymore, but you get the idea) then this whole processing code could become

if @blocks[fd].call(chunk)
  raise Aborted
end

end

if @streaming && abort
raise Aborted
@brianmario (author):

I don't love raising here, but it enforces proper cleanup (and killing the subprocess) up in the `rescue Object` handler.
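
For context, a minimal sketch of the pattern being described: any exception, including `Aborted`, funnels into a single cleanup path that kills and reaps the child. The spawn/kill details here are illustrative, not this PR's actual code.

```ruby
class Aborted < StandardError; end

def run_and_stream
  pid = Process.spawn('yes', :out => File::NULL)
  # ... read loop; suppose the caller's block asked us to stop ...
  raise Aborted
rescue Object
  # One cleanup path for every failure mode: kill and reap the
  # child before re-raising, so the subprocess is never leaked.
  begin
    Process.kill('TERM', pid)
  rescue Errno::ESRCH
  end
  Process.wait(pid)
  raise
end
```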

@mhagger left a comment:

I added some comments. It would be great to have docs, too.


end
end

if @streaming && abort
@mhagger:

I think `@streaming &&` is redundant here, since when `@streaming` is not set, `abort` retains its initial value, `false`.

})
end
end

@mhagger:

There are no tests that involve reading more than one BUFSIZE worth of output, or reading from both stdout and stderr. Those might be worthwhile additions.
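
For illustration, roughly what such a test might look like, assuming the `:streams` proc API from the later revisions of this PR (the chunk sizes and helper script are made up):

```ruby
def test_streams_output_larger_than_bufsize_on_both_fds
  out_chunks, err_chunks = [], []
  # Write ~1MiB to each fd, well past a single BUFSIZE read.
  # dup each chunk: the read buffer is reused between reads.
  POSIX::Spawn::Child.new('ruby', '-e',
    's = "x" * 65536; 16.times { STDOUT.write(s); STDERR.write(s) }',
    :streams => {
      :stdout => proc { |chunk| out_chunks << chunk.dup; true },
      :stderr => proc { |chunk| err_chunks << chunk.dup; true }
    })
  assert_equal 16 * 65536, out_chunks.join.bytesize
  assert_equal 16 * 65536, err_chunks.join.bytesize
end
```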

@mclark commented Jan 3, 2017:

We should also add a test for passing in a minimal custom object, just to ensure the interface contract is maintained. May be a good time to use a spy. (Edit: no longer applicable now that we are using Procs.)

This requires that the stdout and stderr stream objects passed respond to
`#write` and `#string` methods.
@mclark left a comment:

I'm liking the duck-typed object interface MUCH better than the two Procs. I think that was the right way to go in this case.
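
As a concrete illustration of that contract, a minimal object satisfying the `#write`/`#string` interface (the zero-return abort convention follows the check discussed below; everything else here is made up):

```ruby
class CollectingStream
  def initialize
    @buf = String.new
  end

  # Consume one chunk; return the number of bytes accepted.
  # Returning 0 signals "stop sending me output".
  def write(chunk)
    @buf << chunk
    chunk.bytesize
  end

  # Everything accumulated so far.
  def string
    @buf
  end
end
```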

[stdin, stdout, stderr].each do |fd|
fd.set_encoding('BINARY', 'BINARY')
fd.set_encoding(bin_encoding, bin_encoding)
@mclark:

Can we do this on a per-fd basis?

bin_encoding = Encoding::BINARY
[stdin, stdout, stderr].each do |fd|
  fd.set_encoding(bin_encoding, bin_encoding) if fd.respond_to?(:set_encoding)
end

Also, are we intentionally dropping the force_encoding calls on stdout and stderr here?

@out.force_encoding('BINARY')
@err.force_encoding('BINARY')
input = input.dup.force_encoding('BINARY') if input
@stdout_buffer.set_encoding(bin_encoding)
@mclark:

Are these duplicate calls to the above intentional? We really can't assume these objects respond to these methods any more.

abort = false
if chunk
if fd == stdout
abort = (@stdout_buffer.write(chunk) == 0)
@mclark:

I don't feel like this is a safe way to test for aborting the operation. The output object could simply be refusing to write the current chunk without being done consuming the stream.
Why not use an exception for this test instead? If the consumer raises POSIX::Spawn::Aborted, then we clearly know to abort.
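
A sketch of the convention being proposed here: the consumer raises from `#write` rather than returning 0 (the exception class is the one added in this PR; the limit logic is invented):

```ruby
class HeadStream
  def initialize(limit)
    @buf = String.new
    @limit = limit
  end

  def write(chunk)
    @buf << chunk
    # Raising is unambiguous: "I refuse this chunk" and "I'm done"
    # can no longer be confused with each other.
    raise POSIX::Spawn::Aborted if @buf.bytesize >= @limit
    chunk.bytesize
  end

  def string
    @buf
  end
end
```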

@@ -262,12 +288,10 @@ def read_and_write(input, stdin, stdout, stderr, timeout=nil, max=nil)
end

# maybe we've hit our max output
if max && ready[0].any? && (@out.size + @err.size) > max
if max && ready[0].any? && (@stdout_buffer.size + @stderr_buffer.size) > max
@mclark:

Again, we can't assume there is a #size method on these objects.

I think we should probably keep a local count of the bytes we have written instead of calling #size anyway. We can't trust these objects anymore; they could be anything.
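
A standalone sketch of that suggestion: count bytes as we hand them off instead of asking the destination for its `#size` (method and variable names here are assumed):

```ruby
require 'stringio'

def copy_with_max(io, dest, max)
  bytes_seen = 0
  while chunk = io.read(16 * 1024)
    dest.write(chunk)
    bytes_seen += chunk.bytesize
    # The real code would raise MaximumOutputExceeded here.
    raise 'maximum output exceeded' if bytes_seen > max
  end
  bytes_seen
end

copy_with_max(StringIO.new('x' * 100), StringIO.new, 1024)  # => 100
```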



@mclark commented Jan 3, 2017:

Oh right, we also need some thorough docs on this once we've nailed down the exact 🦆 interface we are using.

begin
buf << fd.readpartial(BUFSIZE)
chunk = fd.readpartial(BUFSIZE)
Comment:

If I'm understanding this right, the old code would always append directly into the final buffer, whereas this one reads a chunk and then appends that chunk to the buffer. Not knowing anything about how Ruby operates under the hood, is this a potential performance problem? It should just be an extra memcpy in the worst case, but I recall that we've hit bottlenecks on reading into Ruby before. I suspect those were mostly about arrays and not buffers, though (e.g., reading each line into its own buffer can be slow).

@brianmario (author):

I could be wrong here (cc @tenderlove), but I'm pretty sure the previous code would actually create a new string (down inside readpartial), and that string would then be appended to buf, which would require potentially resizing that buffer first, then the memcpy.

This new code just keeps that first string in a local var, so we can later decide where to write it. In the default case we're using a StringIO, so the result is essentially the same as before (potential buffer resize, then a copy). Though IIRC we saw some pretty significant speedups by using an array to keep track of chunks and calling join at the end when it was all needed. That avoids reallocating the resulting buffer as we read; join can allocate a buffer exactly the size that's needed and copy all the chunks into it.

Basically this (the old way):

buffer = ""
buffer << "one,"
buffer << "two,"
buffer << "three"

return buffer

vs this (the array optimized version I just mentioned):

buffer = []
buffer << "one,"
buffer << "two,"
buffer << "three"

# buffer is an array with 3 elements at this point
# and this join call figures out how big all of the strings inside are, then creates a single buffer to append them to.
return buffer.join

Using that approach efficiently may change the API contract here slightly though...
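
A quick way to see the effect being described, comparing repeated `String#<<` against collecting chunks and joining once (sizes are arbitrary):

```ruby
require 'benchmark'

chunk = 'x' * 65536

Benchmark.bm(10) do |b|
  b.report('String <<') do
    buf = String.new
    2_000.times { buf << chunk }
  end
  b.report('Array#join') do
    chunks = []
    2_000.times { chunks << chunk }
    chunks.join
  end
end
```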

Reply:

@brianmario Ah, right, that sort of return value optimization would be pretty easy to implement, and would mean we end up with the same number of copies. Though if we're just counting memcpys anyway, I suspect it doesn't matter much either way.

> That avoids reallocating the resulting buffer as we read; join can allocate a buffer exactly the size that's needed and copy all the chunks into it.

Interesting. It sounds like appending doesn't grow the buffer aggressively in that case, since otherwise you should be able to get amortized constant time.

Anyway. We're well out of my level for intelligent discussion of Ruby internals. The interesting result is whether reading the output of a spawned cat some-gigabyte-file is measurably any different. Probably not, but it's presumably easy to test.
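
Something along those lines is nearly a one-liner to try (the file path is a placeholder):

```ruby
require 'benchmark'
require 'posix/spawn'

# Time reading a large file's output through Child, before and
# after this change, and compare.
puts Benchmark.realtime {
  POSIX::Spawn::Child.new('cat', '/path/to/gigabyte-file')
}
```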

@brianmario (author) left a comment:

So, I decided to go back to the proc-based API because the requirements on the caller are much simpler. The objects passed as streams need only respond to `call` with an arity of 1 (the current chunk) and return a boolean: `true` on success, `false` to abort (note this is the opposite of how I originally had it, though I think it makes more sense).

I'll keep going on tests and the documentation changes, but wanted to give folks one last chance to review this direction.
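
From the caller's side, the contract just described might look like this (option names as used elsewhere in this PR; the line-splitting logic is only an example):

```ruby
require 'posix/spawn'

lines = []
partial = String.new

POSIX::Spawn::Child.new('seq', '1', '100000',
  :streams => {
    :stdout => proc { |chunk|
      partial << chunk
      # Re-chunk the raw stream into complete lines.
      while line = partial.slice!(/\A.*\n/)
        lines << line.chomp
      end
      true   # keep streaming; returning false would abort the child
    }
  })
```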

streams = {stdout => @stdout_stream, stderr => @stderr_stream}

bytes_seen = 0
chunk_buffer = ""
@brianmario (author) commented Jan 4, 2017:

This buffer is reused by readpartial below. Internally, as far as I can tell, it will be resized to BUFSIZE on the first call to readpartial, and that underlying buffer will be reused from then on.
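
This is standard `IO#readpartial` behavior when an output buffer is passed: the same string object is filled on every call rather than a new one being allocated. A self-contained demonstration with StringIO:

```ruby
require 'stringio'

BUFSIZE = 16 * 1024
chunk_buffer = String.new
io = StringIO.new('x' * (3 * BUFSIZE))

loop do
  begin
    chunk = io.readpartial(BUFSIZE, chunk_buffer)
  rescue EOFError
    break
  end
  # readpartial returns the buffer it was given, so no per-read
  # string allocation happens here.
  raise unless chunk.equal?(chunk_buffer)
end
```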

Reply:

Due to the issue with appending to strings that was mentioned, we might want to consider having #readpartial allocate a new string and give ownership of it to the stream. But since we're now reusing this buffer, it's probably already more efficient than what we had before, so we can probably leave it until we actually find a perf issue we can trace back to this specifically.

Reply:

Another "" that might be better as String.new here.

raise MaximumOutputExceeded
end
end

[@out, @err]
@brianmario (author):

I decided to just drop returning these, for consistency, since one or both of these ivars are useless if we're streaming.


begin
Child.new('yes', :streams => {:stdout => stdout_stream}, :max => limit)
rescue POSIX::Spawn::MaximumOutputExceeded
end
Comment:

This should probably be an assert_raises block so we assert that the exception was raised; as-is, this would not fail the test even if we never raise the error.
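
I.e., something like this (a sketch; `stdout_stream` and `limit` as in the surrounding test):

```ruby
assert_raises POSIX::Spawn::MaximumOutputExceeded do
  Child.new('yes', :streams => {:stdout => stdout_stream}, :max => limit)
end
```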

@brianmario (author):

Good call 👍

@carlosmn commented Jan 4, 2017:

This doesn't seem to be new, but the semantics of :max seem rather confusing. It doesn't mean "send me at most X bytes" but rather "if the process sends over X bytes, give them to me and then raise an error", which means that for a :max that is a multiple of the effective buffer size, you get a whole extra chunk.

While I was playing around with the added test here, I noticed that all chunks are 16kB (which seems to be the default pipe buffer size here), so the BUFSIZE * 2 :max is four times the chunk size (since yes outputs so quickly, we always find the pipe full), which means the output stream gets a whole extra 16kB beyond what was specified as :max. These semantics already exist, but they seem really hard to plan for if I can get an arbitrary number of bytes beyond what I asked for.

@brianmario (author):

@carlosmn I went ahead and added a failing test for that. I'll get things fixed up so we only ever hand the caller max bytes. I don't think that'll break anything, because it seems like that's been the assumption all along.
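
One possible shape for that fix: truncate the final chunk so the stream never sees more than `:max` bytes (a sketch; the PR's actual fix may differ):

```ruby
# Clamp a freshly read chunk against the remaining byte budget.
def clamp_chunk(chunk, bytes_seen, max)
  remaining = max - bytes_seen
  remaining > 0 ? chunk.byteslice(0, remaining) : ''
end

clamp_chunk('x' * 100, 950, 1000).bytesize  # => 50
```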

@@ -95,6 +95,31 @@ def initialize(*args)
@options[:pgroup] = true
end
@options.delete(:chdir) if @options[:chdir].nil?

@out, @err = "", ""

Comment:

Might want to use String.new to avoid breaking when someone passes --enable-frozen-string-literal.
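
For reference, the failure mode being guarded against (runnable as a file with the magic comment; the command-line flag has the same effect process-wide):

```ruby
# frozen_string_literal: true

out = ""
# out << "chunk"   # raises: can't modify frozen String
out = String.new   # mutable even when string literals are frozen
out << "chunk"
```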

@out << chunk

true
end

Comment:

@stdout_stream = @out.method(:<<)

:trollface: 🚲 🏠

@brianmario (author):

No I love it! Being able to use method was one of the main reasons for going back to the proc-based API ;)
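
The reason `method` slots in so neatly: a `Method` object responds to `#call` just like a proc, so it satisfies the streaming contract directly. A tiny demo:

```ruby
out = String.new
stream = out.method(:<<)

stream.call('chunk')  # appends and returns out itself (truthy)
out                   # => "chunk"
```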


@arrbee (Contributor) commented Jun 28, 2017:

Any plans to merge this? I'd like to use it :-D
