Tokenizer module does not handle backslash characters correctly #90432
A source of one or more backslash-escaped newlines, and one final newline, is not tokenized the same as a source where those lines are "manually joined". A source such as `"\\\n\n"` (a backslash-escaped newline followed by a blank line) produces the tokens NEWLINE, ENDMARKER when piped to the tokenize module, whereas the manually joined source `"\n"` produces the tokens NL, ENDMARKER. What I expect is to receive only one NL token from both sources. As per the documentation, "Two or more physical lines may be joined into logical lines using backslash characters" ... "A logical line that contains only spaces, tabs, formfeeds and possibly a comment, is ignored (i.e., no NEWLINE token is generated)". And, because these logical lines are not being ignored, INDENT and DEDENT tokens are also unexpectedly produced when there are leading spaces/tabs: a source such as `"    \\\n\n"` produces the tokens INDENT, NEWLINE, DEDENT, ENDMARKER, whereas the manually joined source `"    \n"` (spaces only) produces the tokens NL, ENDMARKER.
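For reference, a minimal sketch that reproduces the comparison above with the tokenize module; the two source strings are reconstructions of the sources described in the report, not taken from it verbatim:

```python
import io
import tokenize

# Reconstructed sources: a backslash-escaped newline plus a final newline,
# and its "manually joined" equivalent (a single blank line).
escaped = "\\\n\n"
joined = "\n"

for label, src in [("escaped", escaped), ("joined", joined)]:
    names = [tokenize.tok_name[tok.type]
             for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
    # On affected versions: escaped -> ['NEWLINE', 'ENDMARKER'],
    # joined  -> ['NL', 'ENDMARKER'].
    print(label, names)
```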
another similar example -- which I believe to be the root cause of hhatto/autopep8#669 -- is that these two sources are treated differently:

import sys

def f():
    pass

and

import sys
\

def f():
    pass

the former having a NL token where the latter has a NEWLINE; in this particular case it throws off autopep8.
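A sketch of that comparison (the two source strings are my reconstruction of the sources above, including the blank lines the extraction dropped); on versions with the bug, the token streams differ in exactly one place:

```python
import io
import tokenize

plain = "import sys\n\ndef f():\n    pass\n"         # blank second line
continued = "import sys\n\\\n\ndef f():\n    pass\n"  # escaped second line

def tok_names(src):
    return [tokenize.tok_name[t.type]
            for t in tokenize.generate_tokens(io.StringIO(src).readline)]

# On affected versions, the only difference in the two lists is
# NL (plain) vs NEWLINE (continued) for the blank/escaped line.
print(tok_names(plain))
print(tok_names(continued))
```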
this seems to fix it, but I'm not sure if this is the right direction -- it seems like an overly simple fix for such a complicated piece of code

diff --git a/Lib/test/test_tokenize.py b/Lib/test/test_tokenize.py
index 911b53e581..2280374971 100644
--- a/Lib/test/test_tokenize.py
+++ b/Lib/test/test_tokenize.py
@@ -59,6 +59,26 @@ def test_implicit_newline(self):
         self.assertEqual(tokens[-2].type, NEWLINE)
         self.assertEqual(tokens[-1].type, ENDMARKER)
 
+    def test_line_continuation(self):
+        code = dedent("""\
+        import sys
+
+        \\
+
+        import os
+        """)
+
+        self.check_tokenize(code, """\
+    NAME 'import' (1, 0) (1, 6)
+    NAME 'sys' (1, 7) (1, 10)
+    NEWLINE '\\n' (1, 10) (1, 11)
+    NL '\\n' (2, 0) (2, 1)
+    NL '\\n' (4, 0) (4, 1)
+    NAME 'import' (5, 0) (5, 6)
+    NAME 'os' (5, 7) (5, 9)
+    NEWLINE '\\n' (5, 9) (5, 10)
+    """)
+
     def test_basic(self):
         self.check_tokenize("1 + 1", """\
     NUMBER '1' (1, 0) (1, 1)

diff --git a/Lib/tokenize.py b/Lib/tokenize.py
index 46d2224f5c..cf66912262 100644
--- a/Lib/tokenize.py
+++ b/Lib/tokenize.py
@@ -593,6 +593,8 @@ def _tokenize(readline, encoding):
                 elif initial.isidentifier():               # ordinary name
                     yield TokenInfo(NAME, token, spos, epos, line)
                 elif initial == '\\':                      # continued stmt
+                    if line == '\\\n':  # ignore an empty escaped line
+                        break
                     continued = 1
                 else:
                     if initial in '([{':
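For anyone who wants to poke at this by hand, a small sketch (not part of the patch) that prints the token stream the new test asserts; the source string matches the test's dedented code:

```python
import io
import tokenize

# The same source the test builds with dedent(), as a plain string.
src = "import sys\n\n\\\n\nimport os\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    # With the fix, the escaped blank line yields NL rather than NEWLINE.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```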
@lysnikolaou it seems to fix that -- but it also includes a regression(?) (this is 3.12.0b1):

$ diff -u <(python3.11 -mtokenize t.py) <(python3.12 -mtokenize t.py)
--- /dev/fd/63 2023-05-24 08:44:07.429455783 -0400
+++ /dev/fd/62 2023-05-24 08:44:07.429455783 -0400
@@ -2,7 +2,7 @@
 1,0-1,6: NAME 'import'
 1,7-1,10: NAME 'sys'
 1,10-1,11: NEWLINE '\n'
-3,0-3,1: NEWLINE '\n'
+3,0-3,1: NL '\n'
 4,0-4,3: NAME 'def'
 4,4-4,5: NAME 'f'
 4,5-4,6: OP '('
@@ -12,5 +12,5 @@
 5,0-5,4: INDENT '    '
 5,4-5,8: NAME 'pass'
 5,8-5,9: NEWLINE '\n'
-6,0-6,0: DEDENT ''
+5,9-5,9: DEDENT ''
 6,0-6,0: ENDMARKER ''
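The t.py being diffed is not shown in the thread; judging from the earlier comment and the token positions above, it is presumably:

```python
import sys
\

def f():
    pass
```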
I think that the 3.12 version is the correct one here.
This is now covered in the 3.12 what's new:

> Additionally, there may be some minor behavioral changes as a consequence of the changes required to support PEP 701. Some of these changes include: Some final DEDENT tokens are now emitted within the bounds of the input. This means that for a file containing 3 lines, the old version of the tokenizer returned a DEDENT token in line 4 whilst the new version returns the token in line 3.
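A small sketch illustrating the DEDENT change described in that note; the exact positions in the comment are inferred from the diff above, not from running both versions:

```python
import io
import tokenize

src = "def f():\n    pass\n"  # a two-line file ending in an indented block
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.type == tokenize.DEDENT:
        # On 3.11 the final DEDENT is reported on line 3, past the input;
        # on 3.12 it lands on line 2, within the input's bounds.
        print(tok.start)
```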
I think we can close this issue.

Thanks @pablogsal!
Since the latest fix, the only difference now is the DEDENT.

@pablogsal then this should be reopened

Why? That is an expected difference as I mentioned in my previous comment.

the NEWLINE token should be an NL if this issue is resolved

I'm confused -- I think we're talking about different things -- are you using the source here?
I am. The difference I was mentioning was between Python 3.11 and 3.12. Just to be clear, given
with 3.12:
With 3.11:
The difference between 3.11 and 3.12 for the "correct" version is still the location of the last DEDENT token, which is expected.
hmmm ok but this diff doesn't show the NEWLINE -> NL: #90432 (comment)