[Proposal] Add module creation with mypyc to speed up #182
You went a bit too fast. I am reopening this thread. The idea is tempting but needs a thorough analysis of its impacts. Here are the major subjects we have to care about.

**Dropping Python 3.5**
There is a high chance that we drop Python 3.5 and all the specific code associated with it.

**Inherent risks**
Compiling the package means providing ready-to-use whl files; not a big problem using qemu and whatnot.

**Sub-package**
Maybe this should be published under another package name? To be discussed.

**Types**
I don't think that the package has "perfect" typing, so a PR should address the remaining cases using the strict mode. It should not be difficult to do so.

**Task ahead**
Let's wait for the drop of Python 3.5.
I ran some more tests to see how mypyc compilation affects performance.

performance1.py

percentile-plot.py:

```python
#!/bin/python
from glob import glob
from time import time_ns
import argparse
from sys import argv
from os.path import isdir
from charset_normalizer import detect
from chardet import detect as chardet_detect
from cchardet import detect as cchardet_detect
from statistics import mean
from math import ceil
import matplotlib.pyplot as plt


def calc_percentile(data, percentile):
    n = len(data)
    p = n * percentile / 100
    sorted_data = sorted(data)
    return sorted_data[int(p)] if p.is_integer() else sorted_data[int(ceil(p)) - 1]


def performance_compare(arguments):
    parser = argparse.ArgumentParser(
        description="Performance CI/CD check for Charset-Normalizer"
    )
    parser.add_argument('-s', '--size-increase', action="store", default=1, type=int, dest='size_coeff',
                        help="Apply artificial size increase to challenge the detection mechanism further")
    args = parser.parse_args(arguments)

    if not isdir("./char-dataset"):
        print("This script requires https://github.com/Ousret/char-dataset to be cloned on package root directory")
        exit(1)

    chardet_results = []
    cchardet_results = []
    charset_normalizer_results = []
    file_names_list = []

    for tbt_path in sorted(glob("./char-dataset/**/*.*")):
        print(tbt_path)
        file_names_list.append(tbt_path.split('/')[-1])

        # Read bin file
        with open(tbt_path, "rb") as fp:
            content = fp.read() * args.size_coeff

        # Chardet
        before = time_ns()
        chardet_detect(content)
        chardet_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print(" --> Chardet: " + str(chardet_results[-1]) + "s")

        # Cchardet
        before = time_ns()
        cchardet_detect(content)
        cchardet_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print(" --> Cchardet: " + str(cchardet_results[-1]) + "s")

        # Charset_normalizer
        before = time_ns()
        detect(content)
        charset_normalizer_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print(" --> Charset-Normalizer: " + str(charset_normalizer_results[-1]) + "s")

    chardet_avg_delay = mean(chardet_results)
    chardet_99p = calc_percentile(chardet_results, 99)
    chardet_95p = calc_percentile(chardet_results, 95)
    chardet_50p = calc_percentile(chardet_results, 50)

    cchardet_avg_delay = mean(cchardet_results)
    cchardet_99p = calc_percentile(cchardet_results, 99)
    cchardet_95p = calc_percentile(cchardet_results, 95)
    cchardet_50p = calc_percentile(cchardet_results, 50)

    charset_normalizer_avg_delay = mean(charset_normalizer_results)
    charset_normalizer_99p = calc_percentile(charset_normalizer_results, 99)
    charset_normalizer_95p = calc_percentile(charset_normalizer_results, 95)
    charset_normalizer_50p = calc_percentile(charset_normalizer_results, 50)

    print("")
    print("------------------------------")
    print("--> Chardet Conclusions")
    print(" --> Avg: " + str(chardet_avg_delay) + "s")
    print(" --> 99th: " + str(chardet_99p) + "s")
    print(" --> 95th: " + str(chardet_95p) + "s")
    print(" --> 50th: " + str(chardet_50p) + "s")
    print("------------------------------")
    print("--> Cchardet Conclusions")
    print(" --> Avg: " + str(cchardet_avg_delay) + "s")
    print(" --> 99th: " + str(cchardet_99p) + "s")
    print(" --> 95th: " + str(cchardet_95p) + "s")
    print(" --> 50th: " + str(cchardet_50p) + "s")
    print("------------------------------")
    print("--> Charset-Normalizer Conclusions")
    print(" --> Avg: " + str(charset_normalizer_avg_delay) + "s")
    print(" --> 99th: " + str(charset_normalizer_99p) + "s")
    print(" --> 95th: " + str(charset_normalizer_95p) + "s")
    print(" --> 50th: " + str(charset_normalizer_50p) + "s")
    print("------------------------------")
    print("--> Charset-Normalizer / Chardet: Performance Comparison")
    print(" --> Avg: " + str(round(((chardet_avg_delay / charset_normalizer_avg_delay - 1) * 100), 2)) + "%")
    print(" --> 99th: " + str(round(((chardet_99p / charset_normalizer_99p - 1) * 100), 2)) + "%")
    print(" --> 95th: " + str(round(((chardet_95p / charset_normalizer_95p - 1) * 100), 2)) + "%")
    print(" --> 50th: " + str(round(((chardet_50p / charset_normalizer_50p - 1) * 100), 2)) + "%")

    '''
    # time / files plot
    x_chardet, y_chardet = [], []
    for i, v in enumerate(chardet_results):
        x_chardet.append(i)
        y_chardet.append(v)
    x_cchardet, y_cchardet = [], []
    for i, v in enumerate(cchardet_results):
        x_cchardet.append(i)
        y_cchardet.append(v)
    x_charset_normalizer, y_charset_normalizer = [], []
    for i, v in enumerate(charset_normalizer_results):
        x_charset_normalizer.append(i)
        y_charset_normalizer.append(v)

    plt.figure(figsize=(1000, 100), layout='constrained')
    plt.plot(x_chardet, y_chardet, label='Chardet')
    plt.plot(x_cchardet, y_cchardet, label='Cchardet')
    plt.plot(x_charset_normalizer, y_charset_normalizer, label='Charset_normalizer')
    plt.xlabel('files')
    plt.ylabel('time')
    # Create names on the x axis
    plt.xticks(x_chardet, file_names_list, rotation=90)
    plt.title("Simple Plot")
    plt.legend()
    plt.show()
    '''

    # percentile / time plot
    x_chardet, y_chardet = [], []
    for i in range(100):
        x_chardet.append(i)
        y_chardet.append(calc_percentile(chardet_results, i))
    x_cchardet, y_cchardet = [], []
    for i in range(100):
        x_cchardet.append(i)
        y_cchardet.append(calc_percentile(cchardet_results, i))
    x_charset_normalizer, y_charset_normalizer = [], []
    for i in range(100):
        x_charset_normalizer.append(i)
        y_charset_normalizer.append(calc_percentile(charset_normalizer_results, i))

    plt.figure(figsize=(100, 100))
    plt.plot(x_chardet, y_chardet, label='Chardet')
    plt.plot(x_cchardet, y_cchardet, label='Cchardet')
    plt.plot(x_charset_normalizer, y_charset_normalizer, label='Charset_normalizer')
    plt.xlabel('%')
    plt.ylabel('time')
    plt.title("Percentile Plot")
    plt.legend()
    plt.show()
    return


if __name__ == "__main__":
    exit(
        performance_compare(
            argv[1:]
        )
    )
```

The effect is not that great: the speed increases by about 2x.
Compilation is performed when the package is installed on the user's computer. You can test it yourself.
full setup.py
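A minimal sketch of such a setup.py, assuming mypy (which bundles mypyc) and a C compiler are available at build time; the module list and names here are illustrative, not the project's actual configuration:

```python
# Hypothetical setup.py sketch: compile the package with mypyc when it is built.
# mypycify() turns the listed modules into C extension modules; requires mypy
# and a C compiler to be present on the machine performing the build.
from setuptools import setup
from mypyc.build import mypycify

setup(
    name="charset-normalizer",
    packages=["charset_normalizer"],
    ext_modules=mypycify(
        ["charset_normalizer/md.py"]  # example: the hot module(s) to compile
    ),
)
```

With a setup.py along these lines, pip compiles the extensions while installing from the sdist, which is exactly why the prerequisites discussed below matter.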
Compilation requires prerequisites.

**macOS**: Install the Xcode command line tools.

**Linux**: You need a C compiler and the CPython headers and libraries. The specifics of how to install these vary by distribution; on Ubuntu 18.04, for example, you need the python3-dev package and a C compiler such as gcc.

**Windows**: Install Visual C++.

Installing additional software can be a problem for the user, so it is not a good idea to compile by default.
As long as As it is, installing
That is, you can install If
It is not necessary to do this by default. But when you need to process a large number of files with an unknown encoding, there is a performance issue. I tried to improve the processing speed and got some results (#183). The compiled variant could be opt-in, e.g. pip install charset_normalizer[mypyc]. I'm working on rewriting the code of this package in Cython, but so far I'm having trouble understanding the algorithm.
As far as I'm aware, the Setuptools extras syntax (package[extra]) only pulls in optional dependencies; it can't change how the package itself is built or installed.
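To make that concrete, a minimal sketch of what the extras syntax controls in a setup.py, with a hypothetical "mypyc" extra and an illustrative version pin:

```python
# Hypothetical fragment: an extra named "mypyc" can only add optional
# dependencies. `pip install charset_normalizer[mypyc]` would install mypy
# alongside the package, but it would not make pip build the package differently.
from setuptools import setup

setup(
    name="charset-normalizer",
    extras_require={
        "mypyc": ["mypy>=0.950"],  # illustrative pin, not the project's actual one
    },
)
```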
@akx There is a good chance, not negligible, that we could eventually upload platform-specific whl files WHILE always providing the none-any whl. You just have to look at how mypy handles things. By the look of it, they manage it well, unless I am mistaken.

mypy does not impose any compilation as far as I know; coveragepy too.
It might be helpful: |
Something like this:
Surprisingly, mypyc is almost catching up with Cython.

isprime_cython.py

isprime_mypyc.py

test.py

results:
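For reference, a minimal sketch of the kind of fully annotated function such a benchmark compares — hypothetical code, not the exact isprime_mypyc.py above; the annotations are what let mypyc use native integer arithmetic instead of generic PyObject operations:

```python
# Hypothetical stand-in for isprime_mypyc.py: plain type-annotated Python that
# mypyc can compile unchanged.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def count_primes(limit: int) -> int:
    # Simple driver a test script could time from the interpreter.
    return sum(1 for k in range(limit) if is_prime(k))
```

A test.py along these lines would import the function from the Cython, mypyc, and pure-Python builds and time each; the mypyc build can be produced with `mypyc isprime_mypyc.py`.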
Well, charset-normalizer did drop Python 3.5. Some thoughts still need to be considered beforehand. If we engage in this, it would mean, by extrapolation, that we should be about x10 faster than Chardet (we are already roughly x5, and mypyc about doubles it). I expect (3.11) 19 ms on avg and ~9 ms with mypyc, or better.
Used mypy-0.970+dev.914297e9486b141c01b3459393938fdf423d892cef, because mypy 0.961 does not support Python 3.11.

performance1.py:

```python
from glob import glob
from time import time_ns
import argparse
from sys import argv
from os.path import isdir
from charset_normalizer import detect
from chardet import detect as chardet_detect
from statistics import mean
from math import ceil


def calc_percentile(data, percentile):
    n = len(data)
    p = n * percentile / 100
    sorted_data = sorted(data)
    return sorted_data[int(p)] if p.is_integer() else sorted_data[int(ceil(p)) - 1]


def performance_compare(arguments):
    parser = argparse.ArgumentParser(
        description="Performance CI/CD check for Charset-Normalizer"
    )
    parser.add_argument('-s', '--size-increase', action="store", default=1, type=int, dest='size_coeff',
                        help="Apply artificial size increase to challenge the detection mechanism further")
    args = parser.parse_args(arguments)

    if not isdir("./char-dataset"):
        print("This script requires https://github.com/Ousret/char-dataset to be cloned on package root directory")
        exit(1)

    charset_normalizer_results = []

    for tbt_path in sorted(glob("./char-dataset/**/*.*")):
        with open(tbt_path, "rb") as fp:
            content = fp.read() * args.size_coeff

        before = time_ns()
        detect(content)
        charset_normalizer_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print(str(charset_normalizer_results[-1]), tbt_path)

    charset_normalizer_avg_delay = mean(charset_normalizer_results)
    charset_normalizer_99p = calc_percentile(charset_normalizer_results, 99)
    charset_normalizer_95p = calc_percentile(charset_normalizer_results, 95)
    charset_normalizer_50p = calc_percentile(charset_normalizer_results, 50)

    print("------------------------------")
    print("--> Charset-Normalizer Conclusions")
    print(" --> Avg: " + str(charset_normalizer_avg_delay) + "s")
    print(" --> 99th: " + str(charset_normalizer_99p) + "s")
    print(" --> 95th: " + str(charset_normalizer_95p) + "s")
    print(" --> 50th: " + str(charset_normalizer_50p) + "s")

    # percentile / time plot
    print("Percentile data --------------")
    print()
    x_chardet, y_chardet = [], []
    for i in range(100):
        x_chardet.append(i)
        y_chardet.append(calc_percentile(charset_normalizer_results, i))
        print(calc_percentile(charset_normalizer_results, i))
    return


if __name__ == "__main__":
    exit(
        performance_compare(
            argv[1:]
        )
    )
```
I started to work on a potential v3 including optional mypyc. See https://github.com/Ousret/charset_normalizer/tree/3.0

To start testing:
On average 10 ms per file. That is a good performance bump. I am doing some extra research on the subject.
**3.0, Python 3.10**

**I. default 3.0**

```
------------------------------
--> Chardet Conclusions
 --> Avg: 0.12321512765957447s
 --> 99th: 0.74804s
 --> 95th: 0.178s
 --> 50th: 0.01804s
------------------------------
--> Charset-Normalizer Conclusions
 --> Avg: 0.025958744680851065s
 --> 99th: 0.25946s
 --> 95th: 0.14132s
 --> 50th: 0.01095s
------------------------------
--> Charset-Normalizer / Chardet: Performance Comparison
 --> Avg: x4.75
 --> 99th: x2.88
 --> 95th: x1.26
 --> 50th: x1.65
```

**II. BUILD: python3 setup.py --use-mypyc build_ext --inplace**

```
------------------------------
--> Chardet Conclusions
 --> Avg: 0.1224901914893617s
 --> 99th: 0.73647s
 --> 95th: 0.17915s
 --> 50th: 0.01755s
------------------------------
--> Charset-Normalizer Conclusions
 --> Avg: 0.010322106382978723s
 --> 99th: 0.11215s
 --> 95th: 0.05355s
 --> 50th: 0.00428s
------------------------------
--> Charset-Normalizer / Chardet: Performance Comparison
 --> Avg: x11.87
 --> 99th: x6.57
 --> 95th: x3.35
 --> 50th: x4.1
```

**III. Marking constants as Final (#208) + BUILD: python3 setup.py --use-mypyc build_ext --inplace**

```
------------------------------
--> Chardet Conclusions
 --> Avg: 0.12217872340425531s
 --> 99th: 0.72175s
 --> 95th: 0.17481s
 --> 50th: 0.01731s
------------------------------
--> Charset-Normalizer Conclusions
 --> Avg: 0.01016231914893617s
 --> 99th: 0.1085s
 --> 95th: 0.05219s
 --> 50th: 0.00419s
------------------------------
--> Charset-Normalizer / Chardet: Performance Comparison
 --> Avg: x12.02
 --> 99th: x6.65
 --> 95th: x3.35
 --> 50th: x4.13
```

**3.0, Python 3.11b4**

**I. default 3.0**

```
------------------------------
--> Chardet Conclusions
 --> Avg: 0.09048142553191489s
 --> 99th: 0.39248s
 --> 95th: 0.10197s
 --> 50th: 0.01008s
------------------------------
--> Charset-Normalizer Conclusions
 --> Avg: 0.018134829787234043s
 --> 99th: 0.17814s
 --> 95th: 0.09568s
 --> 50th: 0.00761s
------------------------------
--> Charset-Normalizer / Chardet: Performance Comparison
 --> Avg: x4.99
 --> 99th: x2.2
 --> 95th: x1.07
 --> 50th: x1.32
```

**II. Marking constants as Final (#208) + BUILD: python3.11 setup.py --use-mypyc build_ext --inplace**

```
------------------------------
--> Chardet Conclusions
 --> Avg: 0.09024136170212767s
 --> 99th: 0.39338s
 --> 95th: 0.10097s
 --> 50th: 0.00999s
------------------------------
--> Charset-Normalizer Conclusions
 --> Avg: 0.009643191489361703s
 --> 99th: 0.10487s
 --> 95th: 0.04999s
 --> 50th: 0.004s
------------------------------
--> Charset-Normalizer / Chardet: Performance Comparison
 --> Avg: x9.36
 --> 99th: x3.75
 --> 95th: x2.02
 --> 50th: x2.5
```

Summary of Charset-Normalizer Conclusions:

| Configuration | Avg | 99th | 95th | 50th |
| --- | --- | --- | --- | --- |
| 3.10, default 3.0 | 0.02596s | 0.25946s | 0.14132s | 0.01095s |
| 3.10, mypyc build | 0.01032s | 0.11215s | 0.05355s | 0.00428s |
| 3.10, Final constants + mypyc build | 0.01016s | 0.10850s | 0.05219s | 0.00419s |
| 3.11b4, default 3.0 | 0.01813s | 0.17814s | 0.09568s | 0.00761s |
| 3.11b4, Final constants + mypyc build | 0.00964s | 0.10487s | 0.04999s | 0.00400s |
Optimizing md.py only is sufficient. I could get the final whl size down to 80 kB.
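A sketch of how that restriction could look in setup.py, assuming an opt-in flag matching the `python3 setup.py --use-mypyc build_ext --inplace` command quoted above; the flag handling shown here is illustrative, not the project's exact implementation:

```python
# Illustrative fragment: only compile charset_normalizer/md.py, and only when
# the build is invoked with --use-mypyc. The custom flag is stripped from
# sys.argv before setuptools parses the remaining commands.
import sys
from setuptools import setup

ext_modules = []
if "--use-mypyc" in sys.argv:
    sys.argv.remove("--use-mypyc")
    from mypyc.build import mypycify
    ext_modules = mypycify(["charset_normalizer/md.py"])

setup(
    name="charset-normalizer",
    packages=["charset_normalizer"],
    ext_modules=ext_modules,
)
```

A plain build without the flag skips compilation entirely, which keeps the pure-Python none-any wheel intact.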
Update on the topic: the first beta is available at https://pypi.org/project/charset-normalizer/3.0.0b1 and https://github.com/Ousret/charset_normalizer/releases/tag/3.0.0b1. The whl size is no longer an obstacle to pursuing this.
For me, everything is ok. Answered by #209 |
Hello.
I ran some tests to find bottlenecks and speed up the package.
The easiest option, since you are already using mypy, is to compile the module during installation using mypyc.
In this case, the acceleration is about 2x.
Here are the results of the tests using your bin/performance.py file:
test_log.txt
I think the acceleration would be greater if all functions were annotated.
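To illustrate why: mypyc leans on type annotations for early binding and native arithmetic, so an unannotated hot loop mostly falls back to generic object operations. A small sketch with a hypothetical helper (not actual charset_normalizer code):

```python
# Hypothetical helper, not actual charset_normalizer code: with full annotations,
# mypyc can use native integer/float operations instead of generic PyObject calls,
# which is typically where the compilation speedup comes from.
def mean_byte_value(payload: bytes) -> float:
    total: int = 0
    for b in payload:  # iterating over bytes yields ints
        total += b
    return total / len(payload) if payload else 0.0
```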