LC_CTYPE incorrectly references case sensitivity of "the functions of module string" #111276

Closed
glyph opened this issue Oct 24, 2023 · 3 comments

glyph commented Oct 24, 2023

Documentation

https://docs.python.org/3.12/library/locale.html#locale.LC_CTYPE says:

Locale category for the character type functions. Depending on the settings of this category, the functions of module string dealing with case change their behaviour.

I believe this is referring to Python 2.7's string.lower et al., which have been gone for quite some time. I think that since Python 3.3, Unicode case-conversion functions have quite intentionally been locale-independent.

Confusion about this issue seems pervasive, even in CPython itself; consider this bit of code with a somewhat misleading comment:

#The map below appears to be trivially lowercasing the key. However,
#there's more to it than meets the eye - in some locales, lowercasing
#gives unexpected results. See SF #1524081: in the Turkish locale,
#"INFO".lower() != "info"
priority_map = {
    "DEBUG" : "debug",
    "INFO" : "info",
    "WARNING" : "warning",
    "ERROR" : "error",
    "CRITICAL" : "critical"
}
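For what it's worth, on Python 3 the map does seem to be nothing more than lowercasing of its keys. A minimal sketch to check that, with the dict copied from the snippet above (the name `lowered` is introduced here purely for illustration):

```python
# Copy of the mapping quoted above.
priority_map = {
    "DEBUG": "debug",
    "INFO": "info",
    "WARNING": "warning",
    "ERROR": "error",
    "CRITICAL": "critical",
}

# str.lower() maps Unicode code points and does not consult LC_CTYPE,
# so rebuilding the map by lowercasing each key yields the same dict.
lowered = {key: key.lower() for key in priority_map}
assert lowered == priority_map
```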

So it would be good to clean up the docs. Earlier in the same document it does say:

There is no way to perform case conversions and character classifications according to the locale. For (Unicode) text strings these are done according to the character value only, while for byte strings, the conversions and classifications are done according to the ASCII value of the byte, and bytes whose high bit is set (i.e., non-ASCII bytes) are never converted or considered part of a character class such as letter or whitespace.
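The behaviour described in that paragraph can be checked directly. The following is a small sketch, assuming Python 3.3 or later; the sample characters and the variable names are chosen here for illustration and do not come from the docs:

```python
# Text strings: case mapping depends only on the code point,
# never on the active locale.
dotless_i = "\u0131"   # LATIN SMALL LETTER DOTLESS I
dotted_I = "\u0130"    # LATIN CAPITAL LETTER I WITH DOT ABOVE

assert "INFO".lower() == "info"
assert dotless_i.upper() == "I"
# Lowercasing dotted capital I gives "i" plus COMBINING DOT ABOVE.
assert dotted_I.lower() == "i\u0307"

# Byte strings: only ASCII letters are converted or classified.
raw = b"\xc4BC"        # 0xC4 is a non-ASCII byte
assert raw.lower() == b"\xc4bc"
assert not b"\xc4".isalpha()
assert b"A".isalpha()
```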

@glyph glyph added the docs Documentation in the Doc dir label Oct 24, 2023
@glyph glyph changed the title from "LC_TYPE incorrectly references case sensitivity of "the functions of module string"" to "LC_CTYPE incorrectly references case sensitivity of "the functions of module string"" Oct 24, 2023

ambv commented Oct 25, 2023

Indeed, the locale documentation sentence you're referring to dates back all the way to 1997 when the module's documentation was originally added in bc12f78.

As for the logging bug workaround (original issue: GH-43683), you're also correct that in Python 3 that is no longer applicable:

$ LC_CTYPE=tr_TR.UTF-8 python3
Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getlocale()
('tr_TR', 'UTF-8')
>>> assert "INFO".lower() == "info"
>>> assert "info".upper() == "INFO"
>>> info_l = "info"
>>> assert info_l.upper().lower() == info_l
>>>
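The same check can be scripted rather than run from a shell with LC_CTYPE preset. This is only a sketch: the helper name `case_roundtrip_under` is hypothetical, and it assumes the `tr_TR.UTF-8` locale may or may not be installed, returning None when it is missing:

```python
import locale


def case_roundtrip_under(locale_name):
    """Return True if str case conversion is unaffected by locale_name,
    or None if that locale is not available on this machine."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        return None  # locale not installed; nothing to check
    try:
        return "INFO".lower() == "info" and "info".upper() == "INFO"
    finally:
        # Restore the previous LC_CTYPE so the rest of the process is unaffected.
        locale.setlocale(locale.LC_CTYPE, saved)


result = case_roundtrip_under("tr_TR.UTF-8")
print(result)
```

Saving and restoring the category keeps the experiment from leaking a Turkish LC_CTYPE into the rest of the process.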

This changed for Python 2.7+ in GH-50043 where Py_TOLOWER, etc. were defined. In fact, in Python 3.13 ctype.h isn't even included in Python.h anymore (part of GH-108765).

Similar issues kept cropping up before Python gave up on ctype.h. Another hilarious example was GH-46138, where codec lookup failed after switching to the Turkish locale due to its i vs I discrepancy. It surfaced as an issue in Fedora, where setting LC_CTYPE=tr_TR.ISO8859-9 failed because the "I" in "ISO" in the encoding part did not match once the locale was Turkish.
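On a current Python 3, codec names appear to be matched without regard to case or locale. A quick sketch, using the encoding name from the Fedora report (the variable names are illustrative only):

```python
import codecs

# Look the codec up under both spellings; the names should match
# regardless of the letter case used in the request.
upper_info = codecs.lookup("ISO8859-9")
lower_info = codecs.lookup("iso8859-9")
assert upper_info.name == lower_info.name

# The Turkish-specific letters round-trip through the codec either way.
turkish_i = "\u0130\u0131"  # dotted capital I, dotless small i
assert turkish_i.encode("ISO8859-9").decode("iso8859-9") == turkish_i
```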

I pushed a PR to fix the docs and the comment.

glyph commented Oct 25, 2023

Thanks a ton @ambv ! Glad to see this obscure little corner get cleaned up.

ambv added a commit that referenced this issue Oct 27, 2023
)

Fix locale.LC_CTYPE documentation to no longer mention string.lower() et al. Those functions were removed in Python 3.0:
https://docs.python.org/2/library/string.html#deprecated-string-functions

Also, fix a comment in logging about locale-specific behavior of `str.lower()`.

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 27, 2023
…pythonGH-111319)

Fix locale.LC_CTYPE documentation to no longer mention string.lower() et al. Those functions were removed in Python 3.0:
https://docs.python.org/2/library/string.html#deprecated-string-functions

Also, fix a comment in logging about locale-specific behavior of `str.lower()`.

(cherry picked from commit 6d42759)

Co-authored-by: Łukasz Langa <lukasz@langa.pl>
Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
ambv added a commit that referenced this issue Oct 27, 2023
GH-111319) (#111391)

Fix locale.LC_CTYPE documentation to no longer mention string.lower() et al. Those functions were removed in Python 3.0:
https://docs.python.org/2/library/string.html#deprecated-string-functions

Also, fix a comment in logging about locale-specific behavior of `str.lower()`.

(cherry picked from commit 6d42759)

Co-authored-by: Łukasz Langa <lukasz@langa.pl>
Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
ambv added a commit that referenced this issue Oct 27, 2023
GH-111319) (#111392)

Fix locale.LC_CTYPE documentation to no longer mention string.lower() et al. Those functions were removed in Python 3.0:
https://docs.python.org/2/library/string.html#deprecated-string-functions

Also, fix a comment in logging about locale-specific behavior of `str.lower()`.

(cherry picked from commit 6d42759)

Co-authored-by: Łukasz Langa <lukasz@langa.pl>
Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>

hugovk commented Nov 9, 2023

Closing because the PRs have been merged. Thanks for the report!

@hugovk hugovk closed this as completed Nov 9, 2023
aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
…python#111319)

Fix locale.LC_CTYPE documentation to no longer mention string.lower() et al. Those functions were removed in Python 3.0:
https://docs.python.org/2/library/string.html#deprecated-string-functions

Also, fix a comment in logging about locale-specific behavior of `str.lower()`.

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
…python#111319)

Fix locale.LC_CTYPE documentation to no longer mention string.lower() et al. Those functions were removed in Python 3.0:
https://docs.python.org/2/library/string.html#deprecated-string-functions

Also, fix a comment in logging about locale-specific behavior of `str.lower()`.

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>