-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix GFile reading CRLF bug #2791
Conversation
@orionr FYI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for wading into this!
I think probably it's sufficient to fix this just to return f.tell()
for the LocalFileSystem and offset + len(result)
for the S3 file system (aka the length in either bytes or chars), rather than trying to change the semantics of the offset to always be in bytes, which leads to other issues as described in the comments.
@@ -101,21 +101,26 @@ def read(self, filename, binary_mode=False, size=None, offset=None): | |||
binary_mode: bool, read as binary if True, otherwise text | |||
size: int, number of bytes or characters to read, otherwise | |||
read all the contents of the file from the offset | |||
offset: int, offset into file to read from, otherwise read | |||
offset: int, offset into file to read from, in bytes; otherwise read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I don't think it works to use a byte offset if we're not in binary mode.
For example when we pass an encoding to io.open()
what we get is a TextIOBase
in which seek()
and tell()
operate in terms of characters rather than bytes:
https://docs.python.org/3/library/io.html#io.TextIOBase.tell
For consistency, I think if we're in binary mode we want size and offset to represent bytes, and if we're in text mode we want them both to represent characters. Otherwise we might end up doing text reads that return partial values of multibyte characters, which basically defeats the point of having a text mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely agree. This is actually what the code was already doing, but the variable names and docstrings were wrong/confusing.
I've clarified those, particularly emphasizing that a) we don't care what f.tell()
returns, so long as we can pass it back to f.seek()
, and b) similarly at the FileSystem.read(...)
level, the continuation tokens should be opaque.
cc @sanekmelnikov, @natalialunova and @lanpa |
Oh one more thing: In this commit, I left a TODO(orionr) regarding how this bug may persist in the S3 case. :) Fixing that one requires a bit more logic (i.e., keeping a carryover for incomplete characters). I don't expect to address that further--at least, until it becomes a real issue for someone. Would you all like to take a look (@orionr, @sanekmelnikov, @natalialunova, @lanpa)? Once this PR is in, I'll likely file a new bug to track that case. |
Sounds good. Thank you, @davidsoergel! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks again for the fix!
Fixes #2777.
The issue was that our stub GFile implementation confused byte offsets in a file on disk with character offsets in the data read from that file.
These two kinds of offsets become desynchronized from each other when a) there are multibyte characters, or b) there are CRLFs (
\r\n
), which Python automatically translates to\n
on read.testReadLines
is modified here to exercise this issue, and indeed it would now fail in the absence of the given fix.