-
Notifications
You must be signed in to change notification settings - Fork 14.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: slug is empty if filename is non-ASCII #22118
fix: slug is empty if filename is non-ASCII #22118
Conversation
Codecov Report
@@ Coverage Diff @@
## master #22118 +/- ##
==========================================
- Coverage 67.12% 65.54% -1.58%
==========================================
Files 1831 1832 +1
Lines 69993 69837 -156
Branches 7570 7570
==========================================
- Hits 46983 45775 -1208
- Misses 21045 22097 +1052
Partials 1965 1965
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! And thanks for the comprehensive tests, great work! LGTM
@EugeneTorap @villebro There is a quite annoying problem with cyrillic languages like Russian. In short, dashboards, charts and databases containing only cyrillic letters cannot be imported into Superset! The function Eugene provided, for instance, converts chart title "Мой график" ("My chart") into an empty string in python, leading to this type of filename: Therefore, charts/dashboards/databases with these types of titles cannot not be imported into Superset. My suggestion is to expand werkzeug's import unicodedata
def secure_filename(filename: str) -> str:
r"""Pass it a filename and it will return a secure version of it. This
filename can then safely be stored on a regular file system and passed
to :func:`os.path.join`.
On windows systems the function also makes sure that the file is not
named after one of the special device files.
The function also takes filenames containing cyrillic letters.
>>> secure_filename("My cool movie.mov")
'My_cool_movie.mov'
>>> secure_filename("../../../etc/passwd")
'etc_passwd'
>>> secure_filename('i contain cool \xfcml\xe4uts.txt')
'i_contain_cool_umlauts.txt'
>>> secure_filename('Мой красивый график.yaml')
'Мой_красивыи_график.yaml'
The function might return an empty filename. It's your responsibility
to ensure that the filename is unique and that you abort or
generate a random filename if the function returned an empty one.
.. versionadded:: 0.5
:param filename: the filename to secure
"""
# If the text contains cyrillic letters, ASCII encoding should not
# be used as it does not contain cyrillic letters
contains_cyrillic_letters = bool(re.search("[\u0400-\u04FF]", filename))
_windows_device_files = (
"CON",
"AUX",
"COM1",
"COM2",
"COM3",
"COM4",
"LPT1",
"LPT2",
"LPT3",
"PRN",
"NUL",
)
_filename_ascii_strip_re = re.compile(r"[^A-Za-z0-9_.-]")
_filename_strip_re = (
re.compile(r"[^A-Za-zа-яА-ЯёЁ0-9_.-]")
if contains_cyrillic_letters
else _filename_ascii_strip_re
)
filename = unicodedata.normalize("NFKD", filename)
if not contains_cyrillic_letters:
filename = filename.encode("ascii", "ignore").decode("ascii")
for sep in os.path.sep, os.path.altsep:
if sep:
filename = filename.replace(sep, " ")
filename = str(_filename_strip_re.sub("", "_".join(filename.split()))).strip("._")
# on nt a couple of special files are present in each folder. We
# have to ensure that the target file is not such a filename. In
# this case we prepend an underline
if (
os.name == "nt"
and filename
and filename.split(".")[0].upper() in _windows_device_files
):
filename = f"_{filename}"
return filename Before: >>>print(secure_filename("Мой красивый график"))
>>>print(secure_filename("My beautiful график"))
>>>print(secure_filename("My beautiful chart"))
My_beautiful
My_beautiful_chart After: >>>print(secure_filename("Мой красивый график"))
>>>print(secure_filename("My beautiful график"))
>>>print(secure_filename("My beautiful chart"))
Мои_красивыи_график
My_beautiful_график
My_beautiful_chart |
Hi @artemonsh! My PR fix the issue with Cyrillic characters. |
@artemonsh I've read you suggested code. You only added Cyrillic support without others utf-8 characters like: hieroglyphs, emojis and so on. My PR supports import/export for all utf-8 characters. |
If create a chart only with non-ASCII characters then we have
_<id>.yaml
filename and can't import this file because the filename starts with underscore.This PR fix the issue just add a condition if chart_slug is empty then use only
<id>.yaml
filename.Fixed the issue for database, dataset, chart and dashboard export.
SUMMARY
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
ADDITIONAL INFORMATION