Fix GFile reading CRLF bug #2791

davidsoergel · 2019-10-17T20:07:15Z

The issue was that our stub GFile implementation confused byte offsets in a file on disk with character offsets in the data read from that file.

These two kinds of offsets become desynchronized when a) the file contains multibyte characters, or b) it contains CRLFs (\r\n), which Python automatically translates to \n when reading in text mode.
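The desynchronization can be reproduced with the standard library alone. The sketch below is illustrative only: the sample contents and the helper name `offsets_after_read` are assumptions, not code from this PR. It writes a small file containing a multibyte character and CRLF line endings, reads it back in text mode, and compares the byte count on disk with the character count returned.

```python
import os
import tempfile


def offsets_after_read(raw_bytes):
    """Return (bytes_on_disk, chars_read) for a text-mode read of raw_bytes.

    Hypothetical helper for illustration; not part of the GFile stub.
    """
    fd, path = tempfile.mkstemp()
    try:
        # Write in binary mode so the bytes land on disk unchanged.
        with os.fdopen(fd, "wb") as f:
            f.write(raw_bytes)
        # newline=None (the default) enables universal newlines, so each
        # "\r\n" is translated to a single "\n" on read.
        with open(path, "r", encoding="utf-8", newline=None) as f:
            text = f.read()
        return len(raw_bytes), len(text)
    finally:
        os.remove(path)


# "é" occupies two bytes in UTF-8, and each CRLF collapses to one character,
# so the byte count and the character count differ.
n_bytes, n_chars = offsets_after_read("é\r\nb\r\n".encode("utf-8"))
print(n_bytes, n_chars)
```

Any code that adds the character count to a byte offset (or the reverse) drifts by that difference on every read.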

testReadLines is modified here to exercise this issue, and indeed it would now fail in the absence of the given fix.

davidsoergel · 2019-10-17T20:08:50Z

@orionr FYI

nfelt

Thanks for wading into this!

I think it's probably sufficient to fix this by returning f.tell() for the LocalFileSystem and offset + len(result) for the S3 file system (that is, the length in either bytes or characters), rather than changing the semantics of the offset to always be in bytes, which leads to other issues as described in the comments.

nfelt · 2019-10-18T21:22:27Z

tensorboard/compat/tensorflow_stub/io/gfile.py

@@ -101,21 +101,26 @@ def read(self, filename, binary_mode=False, size=None, offset=None):
            binary_mode: bool, read as binary if True, otherwise text
            size: int, number of bytes or characters to read, otherwise
                read all the contents of the file from the offset
-            offset: int, offset into file to read from, otherwise read
+            offset: int, offset into file to read from, in bytes; otherwise read


Unfortunately I don't think it works to use a byte offset if we're not in binary mode.

For example, when we pass an encoding to io.open(), what we get is a TextIOBase whose seek() and tell() work with opaque stream positions that do not generally correspond to byte offsets:
https://docs.python.org/3/library/io.html#io.TextIOBase.tell

For consistency, I think if we're in binary mode we want size and offset to represent bytes, and if we're in text mode we want them both to represent characters. Otherwise we might end up doing text reads that return partial values of multibyte characters, which basically defeats the point of having a text mode.
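To illustrate the "partial values of multibyte characters" concern, here is a minimal sketch (the sample string and the helper name `try_decode` are assumptions for illustration). Cutting UTF-8 data at a byte count can split a character, while cutting the decoded text at a character count cannot.

```python
def try_decode(prefix):
    """Decode a byte prefix as UTF-8, or return None if it ends mid-character."""
    try:
        return prefix.decode("utf-8")
    except UnicodeDecodeError:
        return None


text = "héllo"
data = text.encode("utf-8")  # "é" takes two bytes, so 6 bytes for 5 characters

# A size of 2 interpreted as bytes stops in the middle of "é".
byte_cut = try_decode(data[:2])

# The same size interpreted as characters always yields whole characters.
char_cut = text[:2]

print(byte_cut, char_cut)
```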

davidsoergel · Oct 21, 2019

Completely agree. This is actually what the code was already doing, but the variable names and docstrings were wrong/confusing.

I've clarified those, particularly emphasizing that a) we don't care what f.tell() returns, so long as we can pass it back to f.seek(), and b) similarly at the FileSystem.read(...) level, the continuation tokens should be opaque.
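The opaque-token idea can be sketched as follows. This is not the code from gfile.py: the function names `read_chunk` and `read_all_in_chunks` and the parameter `continue_from` are hypothetical, and the sketch covers only a local file read in text mode. The one rule it demonstrates is that the value returned by f.tell() is never inspected or used in arithmetic, only handed back to f.seek().

```python
def read_chunk(path, size, continue_from=None, encoding="utf-8"):
    """Read up to `size` characters and return (text, continuation_token).

    The token is whatever f.tell() returned. Callers must treat it as
    opaque and pass it back unchanged as `continue_from`.
    """
    with open(path, "r", encoding=encoding) as f:
        if continue_from is not None:
            f.seek(continue_from)
        text = f.read(size)
        return text, f.tell()


def read_all_in_chunks(path, size):
    """Reassemble a whole file from successive read_chunk() calls."""
    pieces = []
    token = None
    while True:
        text, token = read_chunk(path, size, token)
        if not text:  # an empty string signals end of file
            break
        pieces.append(text)
    return "".join(pieces)
```

Because the caller never computes with the token, it does not matter whether the file contains multibyte characters or CRLFs: the chunks concatenate to the same text that a single read would return.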

orionr · 2019-10-21T15:38:50Z

cc @sanekmelnikov, @natalialunova and @lanpa

davidsoergel · 2019-10-21T20:12:52Z

Oh one more thing: In this commit, I left a TODO(orionr) regarding how this bug may persist in the S3 case. :) Fixing that one requires a bit more logic (i.e., keeping a carryover for incomplete characters). I don't expect to address that further--at least, until it becomes a real issue for someone. Would you all like to take a look (@orionr, @sanekmelnikov, @natalialunova, @lanpa)?
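For readers wondering what "keeping a carryover for incomplete characters" could look like, here is a hedged sketch using the standard library's incremental decoder. It is not the S3 implementation and makes no S3 calls. The function name `decode_chunks` is hypothetical, and the byte chunks stand in for ranged reads that may end in the middle of a character.

```python
import codecs


def decode_chunks(byte_chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, carrying partial characters over."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for chunk in byte_chunks:
        # The decoder buffers a trailing partial character internally and
        # emits it once the remaining bytes arrive in a later chunk.
        pieces.append(decoder.decode(chunk))
    # final=True raises if the stream ends in the middle of a character.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


data = "naïve café".encode("utf-8")
# Three-byte chunks split the two-byte "ï" across a chunk boundary.
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]
print(decode_chunks(chunks))
```

This sketch handles only multibyte characters. Translating CRLF across chunk boundaries would need similar carryover for a trailing \r, which is not shown here.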

Once this PR is in, I'll likely file a new bug to track that case.

orionr · 2019-10-21T21:08:33Z

Sounds good. Thank you, @davidsoergel!

nfelt

LGTM, thanks again for the fix!

davidsoergel added 2 commits October 17, 2019 15:52

Fix GFile reading CRLF bug

799e555

fix docstring

23818c1

davidsoergel requested review from nfelt and wchargin October 17, 2019 20:07

googlebot added the cla: yes label Oct 17, 2019

nfelt reviewed Oct 18, 2019

Clarify bytes vs. chars, per review

7e156db

davidsoergel requested a review from nfelt October 21, 2019 20:16

nfelt approved these changes Oct 23, 2019

davidsoergel added 7 commits October 23, 2019 12:04

cleanup syntax nit

a3393c5

Don't mask global function 'len()'

c75c2bc

fix offset None bug

7c3d98e

Merge branch 'master' into gfile-wat

f27d7c0

Fix S3 offset+size logic

969669a

whitespace

23f84d7

Merge branch 'master' into gfile-wat

53cf598

davidsoergel merged commit 2b3479b into master Oct 25, 2019

davidsoergel mentioned this pull request Oct 25, 2019

GFile S3 reader may corrupt files containing multibyte characters or CR/LF #2839 (Open)

davidsoergel deleted the gfile-wat branch November 1, 2019 19:28
