
Tokenizer module does not handle backslash characters correctly #90432

Closed
ucodery mannequin opened this issue Jan 5, 2022 · 17 comments
Labels
3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments


ucodery mannequin commented Jan 5, 2022

BPO 46274
Nosy @lysnikolaou, @ucodery

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2022-01-05.23:12:47.391>
labels = ['interpreter-core', '3.8', '3.9', '3.10', '3.11', '3.7']
title = 'Tokenizer module does not handle backslash characters correctly'
updated_at = <Date 2022-01-06.13:57:33.135>
user = 'https://github.com/ucodery'

bugs.python.org fields:

activity = <Date 2022-01-06.13:57:33.135>
actor = 'pablogsal'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Parser']
creation = <Date 2022-01-05.23:12:47.391>
creator = 'ucodery'
dependencies = []
files = []
hgrepos = []
issue_num = 46274
keywords = []
message_count = 1.0
messages = ['409811']
nosy_count = 2.0
nosy_names = ['lys.nikolaou', 'ucodery']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue46274'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10', 'Python 3.11']


ucodery mannequin commented Jan 5, 2022

A source of one or more backslash-escaped newlines, and one final newline, is not tokenized the same as a source where those lines are "manually joined".

The source

\
\
\

produces the tokens NEWLINE, ENDMARKER when piped to the tokenize module.

Whereas the manually joined source (a single blank line)


produces the tokens NL, ENDMARKER.

What I expect is to receive only one NL token from both sources. As per the documentation "Two or more physical lines may be joined into logical lines using backslash characters" ... "A logical line that contains only spaces, tabs, formfeeds and possibly a comment, is ignored (i.e., no NEWLINE token is generated)"

And, because these logical lines are not being ignored, if there are spaces/tabs, INDENT and DEDENT tokens are also being unexpectedly produced.

The source

    \

produces the tokens INDENT, NEWLINE, DEDENT, ENDMARKER.

Whereas the source (with spaces)

    

produces the tokens NL, ENDMARKER.
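The mismatch described above can be reproduced directly with the tokenize module. A minimal sketch (the helper `token_names` is mine, not from the report); which names appear for the escaped variant depends on the interpreter version, since the bug was later fixed:

```python
import io
import tokenize

def token_names(source):
    # List the token type names produced for a source string.
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(io.StringIO(source).readline)]

# A plain blank line tokenizes to NL, ENDMARKER on every version.
print(token_names("\n"))

# A backslash-escaped newline plus one final newline: the report expects
# the same NL, ENDMARKER, but affected interpreters emitted NEWLINE.
print(token_names("\\\n\n"))
```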

@ucodery ucodery mannequin added 3.7 (EOL) end of life 3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Jan 5, 2022
@pablogsal pablogsal changed the title backslash creating statement out of nothing Tokenizer module does not handle backslash characters correctly Jan 6, 2022
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@iritkatriel iritkatriel added 3.12 bugs and security fixes and removed 3.9 only security fixes 3.8 (EOL) end of life 3.7 (EOL) end of life labels Oct 7, 2022

asottile commented May 6, 2023

another similar example -- which I believe to be the root cause of hhatto/autopep8#669 -- is that these two sources are treated differently:

import sys

def f():
    pass
import sys
\

def f():
    pass

the former having a NL token whereas the latter has a NEWLINE

in this particular case it throws off pycodestyle and then autopep8
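The comparison above can be checked programmatically. A sketch (the `token_names` helper is mine): on interpreters with the fix, the two sources produce identical token sequences; before it, the escaped variant shows NEWLINE where the plain one shows NL.

```python
import io
import tokenize

def token_names(source):
    # Token type names for a source string.
    return [tokenize.tok_name[t.type]
            for t in tokenize.generate_tokens(io.StringIO(source).readline)]

plain = "import sys\n\ndef f():\n    pass\n"        # blank separator line
escaped = "import sys\n\\\n\ndef f():\n    pass\n"  # backslash-escaped line

# Same number of tokens either way; only the NL/NEWLINE name differs
# on affected versions.
print(token_names(plain))
print(token_names(escaped))
```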


asottile commented May 6, 2023

this seems to fix it, but I'm not sure if this is the right direction -- it seems like too simple a fix for such a complicated piece of code

diff --git a/Lib/test/test_tokenize.py b/Lib/test/test_tokenize.py
index 911b53e581..2280374971 100644
--- a/Lib/test/test_tokenize.py
+++ b/Lib/test/test_tokenize.py
@@ -59,6 +59,26 @@ def test_implicit_newline(self):
         self.assertEqual(tokens[-2].type, NEWLINE)
         self.assertEqual(tokens[-1].type, ENDMARKER)
 
+    def test_line_continuation(self):
+        code = dedent("""\
+            import sys
+
+            \\
+
+            import os
+        """)
+
+        self.check_tokenize(code, """\
+    NAME       'import'      (1, 0) (1, 6)
+    NAME       'sys'         (1, 7) (1, 10)
+    NEWLINE    '\\n'          (1, 10) (1, 11)
+    NL         '\\n'          (2, 0) (2, 1)
+    NL         '\\n'          (4, 0) (4, 1)
+    NAME       'import'      (5, 0) (5, 6)
+    NAME       'os'          (5, 7) (5, 9)
+    NEWLINE    '\\n'          (5, 9) (5, 10)
+        """)
+
     def test_basic(self):
         self.check_tokenize("1 + 1", """\
     NUMBER     '1'           (1, 0) (1, 1)
diff --git a/Lib/tokenize.py b/Lib/tokenize.py
index 46d2224f5c..cf66912262 100644
--- a/Lib/tokenize.py
+++ b/Lib/tokenize.py
@@ -593,6 +593,8 @@ def _tokenize(readline, encoding):
                 elif initial.isidentifier():               # ordinary name
                     yield TokenInfo(NAME, token, spos, epos, line)
                 elif initial == '\\':                      # continued stmt
+                    if line == '\\\n':  # ignore an empty escaped line
+                        break
                     continued = 1
                 else:
                     if initial in '([{':

@lysnikolaou

This appears to have been fixed in #104323. @asottile Can you check whether that solves the issue for you as well?

@asottile

@lysnikolaou it seems to fix that -- but also includes a regression(?) (this is 3.12.0b1):

$ diff -u <(python3.11 -mtokenize t.py) <(python3.12 -mtokenize t.py)
--- /dev/fd/63	2023-05-24 08:44:07.429455783 -0400
+++ /dev/fd/62	2023-05-24 08:44:07.429455783 -0400
@@ -2,7 +2,7 @@
 1,0-1,6:            NAME           'import'       
 1,7-1,10:           NAME           'sys'          
 1,10-1,11:          NEWLINE        '\n'           
-3,0-3,1:            NEWLINE        '\n'           
+3,0-3,1:            NL             '\n'           
 4,0-4,3:            NAME           'def'          
 4,4-4,5:            NAME           'f'            
 4,5-4,6:            OP             '('            
@@ -12,5 +12,5 @@
 5,0-5,4:            INDENT         '    '         
 5,4-5,8:            NAME           'pass'         
 5,8-5,9:            NEWLINE        '\n'           
-6,0-6,0:            DEDENT         ''             
+5,9-5,9:            DEDENT         ''             
 6,0-6,0:            ENDMARKER      ''     


lysnikolaou commented May 24, 2023

I think that the 3.12 version is the correct one regarding the NEWLINE/NL tokens. Regarding the location of the DEDENT token, I don't have a strong opinion, but feel that it could stay this way. @pablogsal @mgmacias95 What do you think?


pablogsal commented May 24, 2023

This is now covered in the 3.12 what's new:

Additionally, there may be some minor behavioral changes as a consequence of the changes required to support PEP 701. Some of these changes include:

Some final DEDENT tokens are now emitted within the bounds of the input. This means that for a file containing 3 lines, the old version of the tokenizer returned a DEDENT token in line 4 whilst the new version returns the token in line 3.
The type attribute of the tokens emitted when tokenizing some invalid Python characters such as ! has changed from ERRORTOKEN to OP.
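The DEDENT location change quoted above is easy to observe. A minimal sketch (the variable names are mine); the start line of the final DEDENT differs by interpreter version:

```python
import io
import tokenize

source = "def f():\n    pass\n"   # a two-line file
toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
dedent = next(t for t in toks if t.type == tokenize.DEDENT)

# 3.12+ reports the DEDENT within the input (line 2); 3.11 and earlier
# reported it on line 3, one past the end of the file.
print(dedent.start)
```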

@pablogsal

I think we can close this issue

@lysnikolaou

Thanks @pablogsal!

@pablogsal

Since the latest fix, the only difference now is the DEDENT:

❯ diff -u <( ../3.11/python.exe -mtokenize lol.py) <(./python.exe -mtokenize lol.py )
--- /dev/fd/11	2023-05-25 17:08:47
+++ /dev/fd/12	2023-05-25 17:08:47
@@ -12,5 +12,5 @@
 4,0-4,4:            INDENT         '    '
 4,4-4,8:            NAME           'pass'
 4,8-4,9:            NEWLINE        '\n'
-5,0-5,0:            DEDENT         ''
+4,9-4,9:            DEDENT         ''
 5,0-5,0:            ENDMARKER      ''

@asottile

@pablogsal then this should be reopened


pablogsal commented May 25, 2023

@pablogsal then this should be reopened

Why? That is an expected difference as I mentioned in my previous comment.

@asottile

the NEWLINE token should be a NL if this issue is resolved


pablogsal commented May 25, 2023

the NEWLINE token should be a NL if this issue is resolved

No, that was a bug, see #104866 and #104870

@asottile

I'm confused -- I think we're talking about different things -- are you using the source here?

#90432 (comment)


pablogsal commented May 25, 2023

are you using the source here?

I am. The difference I was mentioning was between Python 3.11 and 3.12. Just to be clear, given

$ cat source1.py
import sys

def f():
    pass

$ cat source2.py
import sys
\

def f():
    pass

with 3.12:

./python -m tokenize < source1.py
1,0-1,6:            NAME           'import'
1,7-1,10:           NAME           'sys'
1,10-1,11:          NEWLINE        '\n'
2,0-2,1:            NL             '\n'
3,0-3,3:            NAME           'def'
3,4-3,5:            NAME           'f'
3,5-3,6:            OP             '('
3,6-3,7:            OP             ')'
3,7-3,8:            OP             ':'
3,8-3,9:            NEWLINE        '\n'
4,0-4,4:            INDENT         '    '
4,4-4,8:            NAME           'pass'
4,8-4,9:            NEWLINE        '\n'
4,9-4,9:            DEDENT         ''
5,0-5,0:            ENDMARKER      ''

./python -m tokenize < source2.py
1,0-1,6:            NAME           'import'
1,7-1,10:           NAME           'sys'
1,10-1,11:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,3:            NAME           'def'
4,4-4,5:            NAME           'f'
4,5-4,6:            OP             '('
4,6-4,7:            OP             ')'
4,7-4,8:            OP             ':'
4,8-4,9:            NEWLINE        '\n'
5,0-5,4:            INDENT         '    '
5,4-5,8:            NAME           'pass'
5,8-5,9:            NEWLINE        '\n'
5,9-5,9:            DEDENT         ''
6,0-6,0:            ENDMARKER      ''

With 3.11:

python -m tokenize < source1.py
1,0-1,6:            NAME           'import'
1,7-1,10:           NAME           'sys'
1,10-1,11:          NEWLINE        '\n'
2,0-2,1:            NL             '\n'
3,0-3,3:            NAME           'def'
3,4-3,5:            NAME           'f'
3,5-3,6:            OP             '('
3,6-3,7:            OP             ')'
3,7-3,8:            OP             ':'
3,8-3,9:            NEWLINE        '\n'
4,0-4,4:            INDENT         '    '
4,4-4,8:            NAME           'pass'
4,8-4,9:            NEWLINE        '\n'
5,0-5,0:            DEDENT         ''
5,0-5,0:            ENDMARKER      ''

python -m tokenize < source2.py
1,0-1,6:            NAME           'import'
1,7-1,10:           NAME           'sys'
1,10-1,11:          NEWLINE        '\n'
3,0-3,1:            NEWLINE        '\n'
4,0-4,3:            NAME           'def'
4,4-4,5:            NAME           'f'
4,5-4,6:            OP             '('
4,6-4,7:            OP             ')'
4,7-4,8:            OP             ':'
4,8-4,9:            NEWLINE        '\n'
5,0-5,4:            INDENT         '    '
5,4-5,8:            NAME           'pass'
5,8-5,9:            NEWLINE        '\n'
6,0-6,0:            DEDENT         ''
6,0-6,0:            ENDMARKER      ''

The difference between 3.11 and 3.12 for the "correct" version is still the location of the last DEDENT token, which is expected.

@asottile

hmmm ok but this diff doesn't show the NEWLINE -> NL: #90432 (comment)
