Tokenize generate_tokens regression in CPython 3.12 #111224
Comments
For 3.12, the tokenize function was rewritten to use the real C-coded tokenize function instead of Python code intended to mimic the C function. Regarding "Looking through the release and migration notes": did you search the changelog for 'token'? There were numerous bugfix issues that would not be mentioned in What's New.
Sorry about the extra line, I'll remove it. Of the patches in the 3.12 changelog, I've collected the ones that reference tokenize. I don't see any obvious notes indicating this should have changed, but if the implementation changed from Python to C, then it could be the case that the C variant never errored in the above case.
gh-105564: Don’t include artificial newlines in the line attribute of tokens in the APIs of the tokenize module. Patch by Pablo Galindo.
gh-105549: Tokenize separately NUMBER and NAME tokens that are not ambiguous. Patch by Pablo Galindo.
gh-105390: Correctly raise tokenize.TokenError exceptions instead of SyntaxError for tokenize errors such as incomplete input. Patch by Pablo Galindo.
gh-105259: Don’t include newline character for trailing NEWLINE tokens emitted in the tokenize module. Patch by Pablo Galindo.
gh-105324: Fix the main function of the tokenize module when reading from sys.stdin. Patch by Pablo Galindo.
gh-105017: Show CRLF lines in the tokenize string attribute in both NL and NEWLINE tokens. Patch by Marta Gómez.
gh-105017: Do not include an additional final NL token when parsing files having CRLF lines. Patch by Marta Gómez.
gh-104976: Ensure that trailing DEDENT tokenize.TokenInfo objects emitted by the tokenize module are reported as in Python 3.11. Patch by Pablo Galindo.
gh-104972: Ensure that the line attribute in tokenize.TokenInfo objects in the tokenize module are always correct. Patch by Pablo Galindo.
gh-104825: Tokens emitted by the tokenize module do not include an implicit \n character in the line attribute anymore. Patch by Pablo Galindo.
gh-102856: Implement PEP 701 changes in the tokenize module. Patch by Marta Gómez Macías and Pablo Galindo Salgado.
gh-102856: Implement the required C tokenizer changes for PEP 701. Patch by Pablo Galindo Salgado, Lysandros Nikolaou, Batuhan Taskaya, Marta Gómez Macías and sunmy2019.
gh-99891: Fix a bug in the tokenizer that could cause infinite recursion when showing syntax warnings that happen in the first line of the source. Patch by Pablo Galindo.
gh-99581: Fixed a bug that was causing a buffer overflow if the tokenizer copies a line missing the newline character from a file that is as long as the available tokenizer buffer. Patch by Pablo Galindo.
gh-97997: Add running column offset to the tokenizer state to avoid calculating AST column information with pointer arithmetic.
gh-97973: Modify the tokenizer to return all necessary information the parser needs to set location information in the AST nodes, so that the parser does not have to calculate those doing pointer arithmetic.
gh-94360: Fixed a tokenizer crash when reading encoded files with syntax errors from stdin with non utf-8 encoded text. Patch by Pablo Galindo.
EDIT: Thanks for the pointer about the change in implementation from Python -> C. I was able to patch xdoctest by including a vendored copy of tokenize.py from Python 3.11. This will ensure the library works with 3.12.0. However, it would be good to determine if this new behavior is intended or if the C code should also raise an error in the same instance.
Thanks for opening this issue, I will take a look but unfortunately there isn't much we can promise here. Notice this warning on the tokenize docs: https://docs.python.org/3/library/tokenize.html
Per the warning, the behavior of the functions is not defined when you pass code that is not syntactically valid, which includes this case. Guaranteeing anything for invalid code is very hard for us because different users have different expectations. Indeed, some users have asked us not to raise in this case because they still want to receive the tokens even if they are unbalanced. Doing what you are suggesting would now break another set of users who don't expect us to raise.
Note that vendoring the 3.11 tokenize.py means you won't be able to parse PEP 701 based code with that tokenize implementation.
This may not be the best tool for your use case, because it is still somewhat brittle. You can manually check whether your statement is balanced by inspecting the tokens and tracking the paren balance yourself (adding +1 on each opening bracket and -1 on each closing one). You can also consider the
this raises if it is not balanced on the other end:
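A minimal sketch of the manual balance check suggested above, for illustration only. The name `bracket_depth` is hypothetical and is not part of tokenize or xdoctest. It counts +1 for each opening bracket and -1 for each closing one, and it swallows `tokenize.TokenError` so that the count gathered before the error is still returned. Because the behavior of tokenize on invalid code is undefined, treat the result as a heuristic.

```python
import io
import tokenize

OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}


def bracket_depth(source: str) -> int:
    """Return the net bracket depth of ``source`` (hypothetical helper).

    0 means balanced, a positive value means unclosed opening brackets,
    and a negative value means stray closing brackets.
    """
    depth = 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type != tokenize.OP:
                continue
            if tok.string in OPENERS:
                depth += 1
            elif tok.string in CLOSERS:
                depth -= 1
    except tokenize.TokenError:
        # Tokens are produced lazily, so the depth counted so far is kept.
        pass
    return depth
```

This only tracks brackets. Unterminated string literals are not reflected in the count and would need separate handling.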
Yeah, confirmed that raising in this case breaks some existing code that was adapted after 3.12 was released, so I propose to close this as won't fix. @lysnikolaou what do you think?
Right... that means I do need to find a new solution. That's my bad for relying on undefined behavior. This needs to work not only for braces / brackets, but also for quotes. Also the current solution will mark What I ultimately need to do is: given a set of text lines that is probably valid Python code, label each line as either starting its own statement (a PS1 line) or continuing a previous statement (a PS2 line). Python has come a long way since I first implemented this in 3.7 and had to deal with issue16806, so maybe the current tooling (and better line number management) will help clean up the xdoctest parsing code.
Technically speaking, there is a map that will be able to identify PS1 lines with only the line as input, but there is no such thing for PS2 because that depends on whatever has been written as a PS1 line. You can label them as "potentially" PS2 lines, but that will be very brittle. You can identify PS2 lines if you have something that can look at previous lines.
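One possible way to take the surrounding lines into account is to parse the whole block and read statement start lines from the AST. This is a rough sketch, assuming the text is syntactically valid Python (`ast.parse` raises `SyntaxError` otherwise). The name `label_lines` is hypothetical and this is not how xdoctest works.

```python
import ast


def label_lines(source: str) -> list:
    """Label each line of ``source`` as "PS1" or "PS2" (hypothetical helper).

    A line is "PS1" when a top-level statement starts on it; every other
    line, including blank and comment-only lines, is labeled "PS2".
    """
    tree = ast.parse(source)
    starts = set()
    for node in tree.body:
        first = node.lineno
        # A decorated def/class reports the line of the def/class keyword,
        # so take the earliest decorator line as the statement start.
        for deco in getattr(node, "decorator_list", []):
            first = min(first, deco.lineno)
        starts.add(first)
    return [
        "PS1" if lineno in starts else "PS2"
        for lineno, _ in enumerate(source.splitlines(), start=1)
    ]
```

Since the parser handles brackets and multi-line strings alike, this covers the quote case as well, at the cost of requiring input that parses.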
Agreed. Especially since we did the work to make sure that cases like do not raise due to PyCQA tools etc. Thanks for the report @Erotemic and sorry we can't do more.
FWIW Python 3.12 broke web.py / OpenLibrary as well. webpy/webpy#784
Bug report
Bug description:
I've noticed a regression when adding 3.12 support to xdoctest.
The following MWE has different behavior on 3.11 and 3.12.
On 3.11 and earlier versions this will result in a tokenize.TokenError being raised:
However, on 3.12, this no longer raises an error:
Instead I get:
This is a problem for xdoctest because it uses tokenize to determine if a statement is "balanced" (i.e. if it is part of a line continuation or not). This is the magic I use to autodetect PS1 vs PS2 lines and prevent users from needing to manually specify if a line is a continuation or not.
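For illustration, an exception-based balance check of this kind might look like the sketch below. It is a hypothetical reconstruction, not xdoctest's actual code, and the name `tokenizes_cleanly` is made up. It treats a `tokenize.TokenError` as the signal that a statement is incomplete, which is exactly the behavior that is not guaranteed for invalid code.

```python
import io
import tokenize


def tokenizes_cleanly(lines) -> bool:
    """Return True if ``lines`` tokenize without TokenError (hypothetical).

    ``lines`` is a list of source lines, each ending in a newline.
    """
    readline = io.StringIO("".join(lines)).readline
    try:
        # Exhaust the generator; errors surface only while iterating.
        for _ in tokenize.generate_tokens(readline):
            pass
    except tokenize.TokenError:
        return False
    return True
```

An unclosed opening bracket still fails this check, but other inputs that 3.11 rejected may tokenize without error on 3.12, so the result is not a reliable balance test across versions.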
Looking through the release and migration notes, I don't see anything that would indicate that this new behavior is introduced, so I suspect it is a bug. I'm sorry I didn't catch this before the 3.12 release. I've been busy.
If this is not a bug but an intended change, then it should be documented (please link to the relevant section if I missed it). If there is a way to work around this so xdoctest works on 3.12.0, that would be helpful. (It's probably time some of the parsing code got a rewrite anyway.)
CPython versions tested on:
3.11, 3.12
Operating systems tested on:
Linux