progressutil: refactor to use channels+select #63
Conversation
By setting the value during the Read call, we potentially miss errors in the CopyProgressPrinter's main loop, because the copyReader might not yet have propagated them before it marks itself as done.
There's a race between the time we first check the error slice and when we check for allDone, since a copyReader might finish and try to report an error at any point in between. Correct this by checking for errors again before returning. This is a stopgap - the real fix here is to refactor this a little more invasively.
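To make the race and the stopgap concrete, here is a minimal self-contained sketch. The `printer` type and its fields are hypothetical stand-ins, not the actual progressutil code; only the "check errors again after observing completion" pattern is the point.

```
package progressutil

import (
	"sync"
	"time"
)

// printer is a hypothetical stand-in for the real CopyProgressPrinter.
type printer struct {
	mu   sync.Mutex
	errs []error
	done bool // set once all copyReaders have finished
}

// wait returns once every copy is done, re-checking the error slice a
// second time: a copyReader can record an error after the first check
// but before done is flipped, and it would otherwise be missed.
func (p *printer) wait() error {
	for {
		p.mu.Lock()
		errs, done := p.errs, p.done
		p.mu.Unlock()
		if len(errs) > 0 {
			return errs[0]
		}
		if done {
			// The stopgap: check for errors again, as one may have
			// occurred between the check above and the moment the
			// readers were marked as done.
			p.mu.Lock()
			errs = p.errs
			p.mu.Unlock()
			if len(errs) > 0 {
				return errs[0]
			}
			return nil
		}
		time.Sleep(50 * time.Millisecond)
	}
}
```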
The parallel pull refactoring in 47f2cb9 introduced an unfortunate and subtle race condition. Because i) we were asynchronously starting the fetching of each layer [1], which implies that the fetch is added to the progress bar asynchronously, via getLayerV2 [2] [3], and ii) the progressutil package has no guards against calling PrintAndWait and AddCopy in indeterminate order [4], it was possible for us to enter the PrintAndWait loop [5] [6] before all of the layer fetches had actually been added to it. Then, in the event that the first layer was particularly fast, the CopyProgressPrinter could actually think it was done [7] before the other layer(s) had finished. In this situation, docker2aci would happily continue forward and try to generate an ACI from each layer [8], and for any layer(s) that had not actually finished downloading, the GenerateACI->writeACI->tarball.Walk call [9] would be operating on a partially written file and hence result in the errors we've been seeing ("unexpected EOF", "corrupt input before offset 45274", and so forth).

A great case to reproduce this is the `docker:busybox` image because of its particular characteristics: two layers, one very small (32B) and one relatively larger (676KB). Typical output looks like the following:

```
% $(exit 0); while [ $? -eq 0 ]; do ./bin/docker2aci docker://busybox; done
Downloading sha256:385e281300c [===============================] 676 KB / 676 KB
Downloading sha256:a3ed95caeb0 [===============================] 32 B / 32 B
Generated ACI(s): library-busybox-latest.aci
Downloading sha256:a3ed95caeb0 [==============================] 32 B / 32 B
Downloading sha256:385e281300c [==============================] 676 KB / 676 KB
Generated ACI(s): library-busybox-latest.aci
Downloading sha256:a3ed95caeb0 [===================================] 32 B / 32 B
Error: conversion error: error generating ACI: error writing ACI: Error reading tar entry: unexpected EOF
```

Note that i) the order in which the layers are registered with the progress printer is indeterminate, and ii) every failure case (observed so far) is when the small layer is retrieved first, and the stdout contains no progress output at all from retrieving the other layer. This indeed suggests that the progress printer returns before realising the second layer is still being retrieved.

Tragically, @dgonyeo's test case [10] probably gives a false sense of security (i.e. it cannot reproduce this issue), most likely because the `dgonyeo/test` image contains a number of layers (5) of varying sizes, and I suspect it's much less likely for this particular race to occur.

Almost certainly fixes appc#166 - with this patch I'm unable to reproduce.
[1]: https://github.com/appc/docker2aci/blob/ba503aa9b84b6c1ffbab03ec0589415ef598e5e0/lib/internal/backend/repository/repository2.go#L89
[2]: https://github.com/appc/docker2aci/blob/ba503aa9b84b6c1ffbab03ec0589415ef598e5e0/lib/internal/backend/repository/repository2.go#L115
[3]: https://github.com/appc/docker2aci/blob/ba503aa9b84b6c1ffbab03ec0589415ef598e5e0/lib/internal/backend/repository/repository2.go#L335
[4]: coreos/pkg#63
[5]: https://github.com/appc/docker2aci/blob/ba503aa9b84b6c1ffbab03ec0589415ef598e5e0/lib/internal/backend/repository/repository2.go#L123
[6]: https://github.com/coreos/pkg/blob/master/progressutil/iocopy.go#L115
[7]: https://github.com/coreos/pkg/blob/master/progressutil/iocopy.go#L144
[8]: https://github.com/appc/docker2aci/blob/ba503aa9b84b6c1ffbab03ec0589415ef598e5e0/lib/internal/backend/repository/repository2.go#L149
[9]: https://github.com/appc/docker2aci/blob/4e051449c0079ba8df59a51c14b7d310de1830b8/lib/internal/internal.go#L427
[10]: coreos/pkg#61 (comment)
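A hedged sketch of the problematic call pattern described above, written against the pre-refactor API (where AddCopy returned nothing). `layer`, `fetchLayer`, and `pullLayers` are hypothetical stand-ins for docker2aci's real types and helpers; only the AddCopy/PrintAndWait interplay matters here.

```
package example

import (
	"io"
	"os"
	"time"

	"github.com/coreos/pkg/progressutil"
)

// layer and fetchLayer are hypothetical stand-ins.
type layer struct {
	name string
	size int64
	dest io.Writer
}

func fetchLayer(l layer) io.Reader { return nil /* elided */ }

func pullLayers(layers []layer) error {
	pp := &progressutil.CopyProgressPrinter{}
	for _, l := range layers {
		go func(l layer) {
			r := fetchLayer(l) // the network read begins here
			// The copy is registered with the printer asynchronously...
			pp.AddCopy(r, l.name, l.size, l.dest)
		}(l)
	}
	// ...so nothing stops us from reaching PrintAndWait before every
	// AddCopy above has run. If the first registered copy then finishes
	// quickly, the printer believes all work is done and returns early.
	return pp.PrintAndWait(os.Stdout, 500*time.Millisecond, nil)
}
```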
Helps with appc/docker2aci#167. If I apply this patch to the HEAD of docker2aci (i.e. without 167 - see e.g. https://github.com/jonboulle/docker2aci/tree/progressutil-refactor), I basically consistently get
Rather than maintaining a list of results and individual synchronised booleans signifying goroutine completion, just use a single channel to propagate the result of each copyReader. This means we can use a for/select loop as His Lord and Holiness Rob Pike always intended.

Notably, it is now explicitly prohibited to:
i) add additional copy operations to a CopyProgressPrinter after it has been started (i.e. after PrintAndWait has been called)
ii) call PrintAndWait more than once

If either of these is attempted, ErrAlreadyStarted is returned.

To achieve i), we would need to either change the interface of the CopyProgressPrinter to accept a pre-defined size (i.e. number of copy operations), or otherwise synchronise the "shutting down" of the CopyProgressPrinter (i.e. to block further copies being added just before the CopyProgressPrinter is about to return - otherwise we can never be sure that another will not be added as we're finishing). Both of these seem overly complex for now and I suggest we only explore them if use cases arise.

Similarly, ii) would require more delicate refactoring, and it's hard to imagine a use case since this package is typically designed to be used with a single output stream (i.e. stdout). Adding the safety check in this PR helps to mitigate potential abuse of the package, though (e.g. appc/docker2aci#167).
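Pieced together from the diff fragments quoted in the review below, the new loop looks roughly like this. Field names such as `started` and `readers` are approximations, not necessarily the merged code.

```
func (cpp *CopyProgressPrinter) PrintAndWait(printTo io.Writer, printInterval time.Duration, cancel chan struct{}) error {
	cpp.lock.Lock()
	if cpp.started {
		cpp.lock.Unlock()
		return ErrAlreadyStarted
	}
	cpp.started = true
	n := len(cpp.readers)
	cpp.lock.Unlock()

	t := time.NewTicker(printInterval)
	defer t.Stop()
	for i := 0; i < n; {
		select {
		case <-cancel:
			// Caller gave up; note the copier goroutines may still be
			// blocked sending to cpp.results (see discussion below).
			return nil
		case <-t.C:
			if _, err := cpp.pbp.Print(printTo); err != nil {
				return err
			}
		case err := <-cpp.results:
			i++ // one copyReader has delivered its result
			if err == nil {
				_, err = cpp.pbp.Print(printTo)
			}
			if err != nil {
				return err
			}
		}
	}
	return nil
}
```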
```
	_, err = cpp.pbp.Print(printTo)
}
if err != nil {
	return err
```
I'm not sure about this case snippet. Due to re-using `err`, it is throwing away copy errors, not sure if on purpose. You are also discarding the returned bool; I think it will print multiple times on simultaneous completions.
It's only discarding if the copy error was nil...
I'm not sure about the multiple thing - can you elaborate?
> It's only discarding if the copy error was nil...

You are right, I misread it.

> I'm not sure about the multiple thing - can you elaborate?

`Print`'s boolean return value signals whether all copies have finished. It is ignored here, so even if all copies end within the first call, `Print` will still be called n times. (I am not sure what the observable behavior will be; I guess it prints the completion result n times.)
Ah right, great point! I had this check in one of the versions (that's why I noticed that allDone weirdness) but I guess it fell out during the refactoring.
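A sketch of one way to honour that flag inside the results case of the loop sketched earlier, assuming a `finished` bool declared before the for loop; this is an illustration, not the final commit.

```
case err := <-cpp.results:
	i++
	if err != nil {
		return err
	}
	if !finished {
		// Once Print reports that every bar is complete, stop
		// re-printing on later results; all n results are still
		// drained, so no copier goroutine is left blocked.
		finished, err = cpp.pbp.Print(printTo)
		if err != nil {
			return err
		}
	}
```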
```
t := time.NewTicker(printInterval)
for i := 0; i < n; {
	select {
	case <-cancel:
```
If we cancel, we'll have a dangling goroutine started in `AddCopy` blocking on `cpp.results`.
Since we can't anticipate the size in the current interface, let's just add a cancel channel for the copiers to listen on while they're trying to send to `cpp.results`.
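A hypothetical shape of that fix for the copier goroutine started by AddCopy (`cr.copy` and `cpp.cancel` are illustrative names, not the merged code): it selects between delivering its result and the cancel channel shared with PrintAndWait, so cancellation cannot strand the goroutine.

```
go func() {
	err := cr.copy() // the actual io.Copy work; name is illustrative
	select {
	case cpp.results <- err:
	case <-cpp.cancel:
		// PrintAndWait was cancelled; nobody will read this result.
	}
}()
```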
Based on #62