Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GFile reading CRLF bug #2791

Merged
merged 10 commits into from
Oct 25, 2019
Merged

Fix GFile reading CRLF bug #2791

merged 10 commits into from
Oct 25, 2019

Conversation

davidsoergel
Copy link
Member

Fixes #2777.

The issue was that our stub GFile implementation confused byte offsets in a file on disk with character offsets in the data read from that file.

These two kinds of offsets become desynchronized from each other when a) there are multibyte characters, or b) there are CRLFs (\r\n), which Python automatically translates to \n on read.

testReadLines is modified here to exercise this issue, and indeed it would now fail in the absence of the given fix.

@davidsoergel
Copy link
Member Author

@orionr FYI

Copy link
Contributor

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for wading into this!

I think probably it's sufficient to fix this just to return f.tell() for the LocalFileSystem and offset + len(result) for the S3 file system (aka the length in either bytes or chars), rather than trying to change the semantics of the offset to always be in bytes, which leads to other issues as described in the comments.

@@ -101,21 +101,26 @@ def read(self, filename, binary_mode=False, size=None, offset=None):
binary_mode: bool, read as binary if True, otherwise text
size: int, number of bytes or characters to read, otherwise
read all the contents of the file from the offset
offset: int, offset into file to read from, otherwise read
offset: int, offset into file to read from, in bytes; otherwise read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I don't think it works to use a byte offset if we're not in binary mode.

For example when we pass an encoding to io.open() what we get is a TextIOBase in which seek() and tell() operate in terms of characters rather than bytes:
https://docs.python.org/3/library/io.html#io.TextIOBase.tell

For consistency, I think if we're in binary mode we want size and offset to represent bytes, and if we're in text mode we want them both to represent characters. Otherwise we might end up doing text reads that return partial values of multibyte characters, which basically defeats the point of having a text mode.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree. This is actually what the code was already doing, but the variable names and docstrings were wrong/confusing.

I've clarified those, particularly emphasizing that a) we don't care what f.tell() returns, so long as we can pass it back to f.seek(), and b) similarly at the FileSystem.read(...) level, the continuation tokens should be opaque.

tensorboard/compat/tensorflow_stub/io/gfile.py Outdated Show resolved Hide resolved
@orionr
Copy link
Contributor

orionr commented Oct 21, 2019

cc @sanekmelnikov, @natalialunova and @lanpa

@davidsoergel
Copy link
Member Author

Oh one more thing: In this commit, I left a TODO(orionr) regarding how this bug may persist in the S3 case. :) Fixing that one requires a bit more logic (i.e., keeping a carryover for incomplete characters). I don't expect to address that further--at least, until it becomes a real issue for someone. Would you all like to take a look (@orionr, @sanekmelnikov, @natalialunova, @lanpa)?

Once this PR is in, I'll likely file a new bug to track that case.

@davidsoergel davidsoergel requested a review from nfelt October 21, 2019 20:16
@orionr
Copy link
Contributor

orionr commented Oct 21, 2019

Sounds good. Thank you, @davidsoergel!

Copy link
Contributor

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks again for the fix!

tensorboard/compat/tensorflow_stub/io/gfile.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tf.GFile read file with more characters at the end of the last line
4 participants