[Proposal] Add module creation with mypyc to speed up #182

Closed
deedy5 opened this issue Apr 29, 2022 · 20 comments
Labels
enhancement New feature or request

Comments

@deedy5
Contributor

deedy5 commented Apr 29, 2022

Hello.
I ran some tests to find bottlenecks and speed up the package.
The easiest option, since you are already using mypy, is to compile the module during installation using mypyc.
In this case the speedup is about 2x.
Here are the results of the tests using your bin/performance.py file:

------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.03485252343844548s
   --> 99th: 0.2629306570015615s
   --> 95th: 0.14874039799906313s
   --> 50th: 0.02182378301222343s
------------------------------
--> Charset-Normalizer_m Conclusions (Charset-Normalizer, compiled with mypyc )
   --> Avg: 0.01605459922575392s
   --> 99th: 0.12211546800972428s
   --> 95th: 0.06977643301070202s
   --> 50th: 0.009204783011227846s
------------------------------
--> Chardet Conclusions
   --> Avg: 0.12291852888552735s
   --> 99th: 0.6617688919941429s
   --> 95th: 0.17344348499318585s
   --> 50th: 0.023028297000564635s
------------------------------
--> Cchardet Conclusions
   --> Avg: 0.003174804929368931s
   --> 99th: 0.04868195200106129s
   --> 95th: 0.008641656007966958s
   --> 50th: 0.0005420649977168068s

test_log.txt
I think the speedup would be greater if all functions were annotated.
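
As a rough check of the "about 2x" figure, the ratios can be computed directly from the numbers reported above. This is only a sketch: the variable and function names are illustrative, and the values are copied from the benchmark output in this comment.

```python
# Timings in seconds, copied from the bin/performance.py output above.
pure_python = {
    "avg": 0.03485252343844548,
    "99th": 0.2629306570015615,
    "95th": 0.14874039799906313,
    "50th": 0.02182378301222343,
}
mypyc_compiled = {
    "avg": 0.01605459922575392,
    "99th": 0.12211546800972428,
    "95th": 0.06977643301070202,
    "50th": 0.009204783011227846,
}


def speedups(baseline, candidate):
    """Return baseline/candidate per metric (values above 1 mean candidate is faster)."""
    return {key: baseline[key] / candidate[key] for key in baseline}


for metric, ratio in speedups(pure_python, mypyc_compiled).items():
    print(f"{metric}: x{ratio:.2f}")
```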

@deedy5 deedy5 added the enhancement New feature or request label Apr 29, 2022
@deedy5 deedy5 closed this as completed May 1, 2022
@deedy5
Contributor Author

deedy5 commented May 1, 2022

#183

@Ousret
Member

Ousret commented May 1, 2022

You went a bit too fast. I am reopening that thread.

The idea is tempting but needs a thorough analysis of its impacts.
First of all, I have never used mypyc, so I would need to catch up a bit on the subject (already started.)

Here are some major subjects we have to care about.

Dropping Python 3.5

There is a high chance that this requires dropping Python 3.5 and all the specific code associated with it.
I am in favor of dropping its support BEFORE attempting this optimization.

Inherent risks

Compiling the package means providing ready-to-use whl files, which is not a big problem using qemu and the like.
But are we capable of falling back to pure Python code when a user's architecture is not covered?
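
One way to picture the fallback question: a compiled module can be treated as optional, with the program dropping back to pure Python when the binary is absent. The sketch below only illustrates the general try/except import pattern; `_hypothetical_speedups` and `count_non_ascii` are invented names and are not part of charset_normalizer.

```python
import importlib


def _count_non_ascii_py(payload: bytes) -> int:
    """Pure-Python reference implementation (illustrative only)."""
    return sum(1 for byte in payload if byte > 0x7F)


def load_implementation():
    """Prefer a compiled module when importable, else fall back to pure Python."""
    try:
        # Hypothetical compiled module; it will not exist on most machines.
        speedups = importlib.import_module("_hypothetical_speedups")
    except ImportError:
        return "pure-python", _count_non_ascii_py
    return "compiled", speedups.count_non_ascii


kind, count_non_ascii = load_implementation()
print(kind, count_non_ascii("héllo".encode("utf-8")))
```

With mypyc specifically, the compiled extension is built under the same module name as the `.py` source, so, as far as I understand, a wheel without the binary simply imports the source file and no explicit fallback code is needed.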

Sub-package

Maybe, this should be published under another package name? To be discussed.

Types

I don't think that the package has "perfect" typing, so I think a PR should address the remaining cases using strict mode. It should not be difficult to do.

Task ahead

  • Dropping Python 3.5
  • Improving typing
  • Making a solid proof of concept
  • Decide whether it should be published under a different package name. Answer: no
  • Writing the required actions (GHA) according to our needs
  • Heavy testing on every supported Python version (3.6 to 3.11)

@Ousret Ousret reopened this May 1, 2022
@deedy5
Contributor Author

deedy5 commented May 1, 2022

Let's wait for the drop of Python 3.5

@deedy5
Contributor Author

deedy5 commented May 5, 2022

I ran some more tests to see how mypyc compilation affects performance.

mypyc_performance.xlsx

performance1.py
#!/bin/python
from glob import glob
from time import time_ns
import argparse
from sys import argv
from os.path import isdir

from charset_normalizer import detect
from chardet import detect as chardet_detect

from statistics import mean
from math import ceil


def calc_percentile(data, percentile):
    n = len(data)
    p = n * percentile / 100
    sorted_data = sorted(data)

    return sorted_data[int(p)] if p.is_integer() else sorted_data[int(ceil(p)) - 1]


def performance_compare(arguments):
    parser = argparse.ArgumentParser(
        description="Performance CI/CD check for Charset-Normalizer"
    )

    parser.add_argument('-s', '--size-increase', action="store", default=1, type=int, dest='size_coeff',
                        help="Apply artificial size increase to challenge the detection mechanism further")

    args = parser.parse_args(arguments)

    if not isdir("./char-dataset"):
        print("This script require https://github.com/Ousret/char-dataset to be cloned on package root directory")
        exit(1)

    charset_normalizer_results = []

    for tbt_path in sorted(glob("./char-dataset/**/*.*")):

        with open(tbt_path, "rb") as fp:
            content = fp.read() * args.size_coeff

        before = time_ns()
        detect(content)
        charset_normalizer_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print(str(charset_normalizer_results[-1]), tbt_path)

    charset_normalizer_avg_delay = mean(charset_normalizer_results)
    charset_normalizer_99p = calc_percentile(charset_normalizer_results, 99)
    charset_normalizer_95p = calc_percentile(charset_normalizer_results, 95)
    charset_normalizer_50p = calc_percentile(charset_normalizer_results, 50)

    print("------------------------------")
    print("--> Charset-Normalizer Conclusions")
    print("   --> Avg: " + str(charset_normalizer_avg_delay) + "s")
    print("   --> 99th: " + str(charset_normalizer_99p) + "s")
    print("   --> 95th: " + str(charset_normalizer_95p) + "s")
    print("   --> 50th: " + str(charset_normalizer_50p) + "s")
    
    # percentile / time data: one value per percentile, 0 through 99
    print("Percentile data --------------")
    print()
    for i in range(100):
        print(calc_percentile(charset_normalizer_results, i))

    return


if __name__ == "__main__":
    exit(
        performance_compare(
            argv[1:]
        )
    )

[image: mypyc_compare]
Enlarged:
[image: screen2]

Percentile plot (matplotlib):
[image: percentile_matplotlib]

percentile-plot.py
#!/bin/python
from glob import glob
from time import time_ns
import argparse
from sys import argv
from os.path import isdir

from charset_normalizer import detect
from chardet import detect as chardet_detect
from cchardet import detect as cchardet_detect

from statistics import mean
from math import ceil

import matplotlib.pyplot as plt


def calc_percentile(data, percentile):
    n = len(data)
    p = n * percentile / 100
    sorted_data = sorted(data)

    return sorted_data[int(p)] if p.is_integer() else sorted_data[int(ceil(p)) - 1]


def performance_compare(arguments):
    parser = argparse.ArgumentParser(
        description="Performance CI/CD check for Charset-Normalizer"
    )

    parser.add_argument('-s', '--size-increase', action="store", default=1, type=int, dest='size_coeff',
                        help="Apply artificial size increase to challenge the detection mechanism further")

    args = parser.parse_args(arguments)

    if not isdir("./char-dataset"):
        print("This script require https://github.com/Ousret/char-dataset to be cloned on package root directory")
        exit(1)

    chardet_results = []
    cchardet_results = []
    charset_normalizer_results = []
    file_names_list = []

    for tbt_path in sorted(glob("./char-dataset/**/*.*")):
        print(tbt_path)
        file_names_list.append(tbt_path.split('/')[-1])
        
        # Read Bin file
        with open(tbt_path, "rb") as fp:
            content = fp.read() * args.size_coeff
        #Chardet
        before = time_ns()
        chardet_detect(content)
        chardet_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print("  --> Chardet: " + str(chardet_results[-1]) + "s")
        #Cchardet
        before = time_ns()
        cchardet_detect(content)
        cchardet_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print("  --> Cchardet: " + str(cchardet_results[-1]) + "s")
        #Charset_normalizer
        before = time_ns()
        detect(content)
        charset_normalizer_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print("  --> Charset-Normalizer: " + str(charset_normalizer_results[-1]) + "s")
        

    chardet_avg_delay = mean(chardet_results)
    chardet_99p = calc_percentile(chardet_results, 99)
    chardet_95p = calc_percentile(chardet_results, 95)
    chardet_50p = calc_percentile(chardet_results, 50)

    cchardet_avg_delay = mean(cchardet_results)
    cchardet_99p = calc_percentile(cchardet_results, 99)
    cchardet_95p = calc_percentile(cchardet_results, 95)
    cchardet_50p = calc_percentile(cchardet_results, 50)

    charset_normalizer_avg_delay = mean(charset_normalizer_results)
    charset_normalizer_99p = calc_percentile(charset_normalizer_results, 99)
    charset_normalizer_95p = calc_percentile(charset_normalizer_results, 95)
    charset_normalizer_50p = calc_percentile(charset_normalizer_results, 50)

    print("")

    print("------------------------------")
    print("--> Chardet Conclusions")
    print("   --> Avg: " + str(chardet_avg_delay) + "s")
    print("   --> 99th: " + str(chardet_99p) + "s")
    print("   --> 95th: " + str(chardet_95p) + "s")
    print("   --> 50th: " + str(chardet_50p) + "s")

    print("------------------------------")
    print("--> Cchardet Conclusions")
    print("   --> Avg: " + str(cchardet_avg_delay) + "s")
    print("   --> 99th: " + str(cchardet_99p) + "s")
    print("   --> 95th: " + str(cchardet_95p) + "s")
    print("   --> 50th: " + str(cchardet_50p) + "s")

    print("------------------------------")
    print("--> Charset-Normalizer Conclusions")
    print("   --> Avg: " + str(charset_normalizer_avg_delay) + "s")
    print("   --> 99th: " + str(charset_normalizer_99p) + "s")
    print("   --> 95th: " + str(charset_normalizer_95p) + "s")
    print("   --> 50th: " + str(charset_normalizer_50p) + "s")
    
    print("------------------------------")
    print("--> Charset-Normalizer / Chardet: Performance Сomparison")
    print("   --> Avg: " + str(round(((chardet_avg_delay / charset_normalizer_avg_delay - 1) * 100), 2)) + "%")        
    print("   --> 99th: " + str(round(((chardet_99p / charset_normalizer_99p - 1) * 100), 2)) + "%")
    print("   --> 95th: " + str(round(((chardet_95p / charset_normalizer_95p - 1) * 100), 2)) + "%")
    print("   --> 50th: " + str(round(((chardet_50p / charset_normalizer_50p - 1) * 100), 2)) + "%")

    '''
    # time / files plot
    x_chardet, y_chardet = [], []
    for i,v in enumerate(chardet_results):
        x_chardet.append(i)
        y_chardet.append(v)

    x_cchardet, y_cchardet = [], []
    for i,v in enumerate(cchardet_results):
        x_cchardet.append(i)
        y_cchardet.append(v)

    x_charset_normalizer, y_charset_normalizer = [], []
    for i,v in enumerate(charset_normalizer_results):
        x_charset_normalizer.append(i)
        y_charset_normalizer.append(v)
        
    plt.figure(figsize=(1000, 100), layout='constrained')
    plt.plot(x_chardet, y_chardet, label='Chardet') 
    plt.plot(x_cchardet, y_cchardet, label='Cchardet')
    plt.plot(x_charset_normalizer, y_charset_normalizer, label='Charset_normalizer')
    plt.xlabel('files')
    plt.ylabel('time')
    # Create names on the x axis
    plt.xticks(x_chardet, file_names_list, rotation=90)
    plt.title("Simple Plot")
    plt.legend()
    plt.show()
    '''

    # percentile / time plot
    x_chardet, y_chardet = [], []
    for i in range(100):
        x_chardet.append(i)
        y_chardet.append(calc_percentile(chardet_results, i))

    x_cchardet, y_cchardet = [], []
    for i in range(100):
        x_cchardet.append(i)
        y_cchardet.append(calc_percentile(cchardet_results, i))

    x_charset_normalizer, y_charset_normalizer = [], []
    for i in range(100):
        x_charset_normalizer.append(i)
        y_charset_normalizer.append(calc_percentile(charset_normalizer_results, i))
        
    plt.figure(figsize=(100, 100))
    plt.plot(x_chardet, y_chardet, label='Chardet') 
    plt.plot(x_cchardet, y_cchardet, label='Cchardet')
    plt.plot(x_charset_normalizer, y_charset_normalizer, label='Charset_normalizer')
    plt.xlabel('%')
    plt.ylabel('time')
    # Create names on the x axis
    plt.title("Percentile Plot")
    plt.legend()
    plt.show()
    
    return

if __name__ == "__main__":
    exit(
        performance_compare(
            argv[1:]
        )
    )

The effect is modest: the speed increases by about 2x.
But when processing a large number of files, I think it will be very noticeable.
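
For readers comparing these percentile figures with other tools, here is how the `calc_percentile` helper from the scripts above behaves on a tiny sample. The function body is copied from the script; the sample data is made up for illustration.

```python
from math import ceil


def calc_percentile(data, percentile):
    n = len(data)
    p = n * percentile / 100
    sorted_data = sorted(data)
    # Whole-number p: take the element at index p.
    # Otherwise: round p up and step back one index.
    return sorted_data[int(p)] if p.is_integer() else sorted_data[int(ceil(p)) - 1]


sample = [0.5, 0.1, 0.4, 0.2, 0.3]  # made-up timings in seconds

print(calc_percentile(sample, 50))  # p = 2.5  -> index 2
print(calc_percentile(sample, 99))  # p = 4.95 -> index 4
print(calc_percentile(sample, 0))   # p = 0.0  -> index 0
```

Note that `percentile=100` gives `p == n`, an index one past the end of the list, so the helper would raise `IndexError`. The scripts only call it with values from 0 to 99, so this does not affect the results shown here.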

@deedy5
Contributor Author

deedy5 commented May 5, 2022

Compilation is performed when the package is installed on the user's computer.
The source files are not deleted, and functionality will not be affected if a compilation error occurs.

You can test it yourself.
Mypyc docs


  1. mypy must be installed
pip install -U mypy

  1. Add to setup.py to compile during installation
ext_modules=mypycify([
        "charset_normalizer/__init__.py",
        "charset_normalizer/api.py",
        "charset_normalizer/cd.py",
        "charset_normalizer/constant.py",
        "charset_normalizer/legacy.py",
        "charset_normalizer/md.py",
        "charset_normalizer/models.py",
        "charset_normalizer/utils.py",
        "charset_normalizer/assets/__init__.py",
        "charset_normalizer/cli/normalizer.py",
    ]),
full setup.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import io
import os
from re import search

from setuptools import find_packages, setup

from mypyc.build import mypycify


def get_version():
    with open('charset_normalizer/version.py') as version_file:
        return search(r"""__version__\s+=\s+(['"])(?P<version>.+?)\1""",
                      version_file.read()).group('version')


# Package meta-data.
NAME = 'charset-normalizer'
DESCRIPTION = 'The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.'
URL = 'https://github.com/ousret/charset_normalizer'
EMAIL = 'ahmed.tahri@cloudnursery.dev'
AUTHOR = 'Ahmed TAHRI @Ousret'
REQUIRES_PYTHON = '>=3.5.0'
VERSION = get_version()

REQUIRED = []

EXTRAS = {
    'unicode_backport': ['unicodedata2']
}

here = os.path.abspath(os.path.dirname(__file__))

try:
    with io.open(os.path.join(here, 'README.md'), encoding='utf-8') as f:
        long_description = '\n' + f.read()
except FileNotFoundError:
    long_description = DESCRIPTION

setup(
    name=NAME,
    version=VERSION,
    description=DESCRIPTION,
    long_description=long_description.replace(':heavy_check_mark:', '✅'),
    long_description_content_type='text/markdown',
    author=AUTHOR,
    author_email=EMAIL,
    python_requires=REQUIRES_PYTHON,
    url=URL,
    keywords=['encoding', 'i18n', 'txt', 'text', 'charset', 'charset-detector', 'normalization', 'unicode', 'chardet'],
    packages=find_packages(exclude=["tests", "*.tests", "*.tests.*", "tests.*"]),
    install_requires=REQUIRED,
    extras_require=EXTRAS,
    include_package_data=True,
    package_data={"charset_normalizer": ["py.typed"]},
    license='MIT',
    entry_points={
        'console_scripts':
            [
                'normalizer = charset_normalizer.cli.normalizer:cli_detect'
            ]
    },
    classifiers=[
        'License :: OSI Approved :: MIT License',
        'Intended Audience :: Developers',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
        'Programming Language :: Python :: 3.9',
        'Programming Language :: Python :: 3.10',
        'Programming Language :: Python :: 3.11',
        'Topic :: Text Processing :: Linguistic',
        'Topic :: Utilities',
        'Programming Language :: Python :: Implementation :: PyPy',
        'Typing :: Typed'
    ],
    project_urls={
        'Bug Reports': 'https://github.com/Ousret/charset_normalizer/issues',
        'Documentation': 'https://charset-normalizer.readthedocs.io/en/latest',
    },
    ext_modules=mypycify([
        "charset_normalizer/__init__.py",
        "charset_normalizer/api.py",
        "charset_normalizer/cd.py",
        "charset_normalizer/constant.py",
        "charset_normalizer/legacy.py",
        "charset_normalizer/md.py",
        "charset_normalizer/models.py",
        "charset_normalizer/utils.py",
        "charset_normalizer/assets/__init__.py",
        "charset_normalizer/cli/normalizer.py",
    ]),
)

  1. run
python3 setup.py build_ext --inplace
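
After an in-place build, it can be useful to confirm which variant Python actually imports. A minimal sketch, assuming that a compiled module's `__file__` ends with one of the interpreter's extension suffixes; the helper name `is_compiled` is made up, and `json` merely stands in for the module being checked.

```python
import importlib
from importlib.machinery import EXTENSION_SUFFIXES


def is_compiled(module_name: str) -> bool:
    """Return True if the imported module was loaded from a C extension file."""
    module = importlib.import_module(module_name)
    # Built-in modules may have no __file__ at all; treat that as "not a file".
    path = getattr(module, "__file__", None) or ""
    return path.endswith(tuple(EXTENSION_SUFFIXES))


# `json` is a pure-Python package in CPython's standard library.
print(is_compiled("json"))
```

Once the build step has run, calling `is_compiled("charset_normalizer.md")` would be the analogous check, assuming the package is importable from the current directory.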

@deedy5
Contributor Author

deedy5 commented May 5, 2022

Compilation requires prerequisites

macOS

Install Xcode command line tools:

xcode-select --install
Linux

You need a C compiler and CPython headers and libraries. The specifics of how to install these vary by distribution. Here are instructions for Ubuntu 18.04, for example:

sudo apt install python3-dev
Windows

Install Visual C++.


Installing additional software can be a problem for the user, so it is not a good idea to compile by default.
But as an option, it would be nice to add such a feature to the installation.

@akx
Contributor

akx commented May 12, 2022

As long as charset_normalizer is a hard dependency for requests (see psf/requests#5875, psf/requests#5871 etc.), I really don't think this should be done.

As it is, installing requests does not install any packages with binary components (all .whls are -none-):

# pip install requests
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
  Downloading urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
  Downloading certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
  Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
  Downloading idna-3.3-py3-none-any.whl (61 kB)

That is, you can install requests wherever Python runs even if you don't have a C compiler.

If charset_normalizer starts including a binary module, then installing requests will require a C compiler, or the maintainers of charset_normalizer will need to start shipping binary wheels on multiple platforms and architectures (even more esoteric ones such as manylinux on arm64, since Raspberry Pis are a thing :) ) unless they wish to be inundated by issues asking why their particular installation fails with an obscure C compiler error.

@deedy5
Contributor Author

deedy5 commented May 12, 2022

It is not necessary to do this by default.
The average Internet user will not notice any difference whether the library is compiled or not.

But when you need to process a large number of files with an unknown encoding, there is a performance issue.
This package has the largest number of supported encodings, and today there is no alternative.

I tried to improve the processing speed and got some results (#183).
But I also found that compiling the library with mypyc makes it more than twice as fast.
I suggest adding compilation as an option during installation. That is, when installing requests, no compilation will take place. But I would like to be able to compile the library using a command like

pip install charset_normalizer[mypyc]

I'm working on rewriting the code of this package in Cython, but so far I'm having trouble understanding the algorithm.

@akx
Contributor

akx commented May 12, 2022

As far as I'm aware, the Setuptools extras syntax ([mypyc]) won't allow for optional compilation, just additional packages to be installed. The mypyc-compilable version could thus be packaged as a separate "charset-normalizer-speedups" package, and installed via the extra.

@Ousret
Member

Ousret commented May 12, 2022

> I really don't think this should be done...unless they wish to be inundated by issues asking why their particular installation fails with an obscure C compiler error.

@akx
While I appreciate your concerns, there is next to no chance that this project would compromise our integrators. We are very much aware of the risks and opportunities.

There is a good, non-negligible chance that we could eventually upload platform-specific whl files WHILE always providing the none-any whl.

You just have to look at how mypy handles things. By the look of it, they manage it well, unless I am mistaken.

mypy-0.950-py3-none-any.whl
mypy-0.950-cp310-cp310-win_amd64.whl
....

mypy does not impose any compilation as far as I know. Neither does coveragepy.
A proper study is required, and it is going to take some time.
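
The two mypy file names above differ only in their trailing tags. As a rough illustration, the sketch below splits a wheel file name into its Python, ABI, and platform tags and flags the pure-Python case (`none-any`). The helper names are made up, and the parsing assumes the standard wheel naming scheme, in which the last three dash-separated fields are those tags.

```python
def wheel_tags(filename: str):
    """Return (python_tag, abi_tag, platform_tag) from a wheel file name."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    stem = filename[: -len(".whl")]
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def is_pure_python(filename: str) -> bool:
    """A none-any wheel is not tied to a specific ABI or platform."""
    _, abi_tag, platform_tag = wheel_tags(filename)
    return abi_tag == "none" and platform_tag == "any"


for name in ("mypy-0.950-py3-none-any.whl", "mypy-0.950-cp310-cp310-win_amd64.whl"):
    print(name, wheel_tags(name), is_pure_python(name))
```

As far as I understand pip's behavior, it prefers the most specific compatible wheel and otherwise falls back to the `none-any` one, which is why publishing both keeps compiler-free installs working.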

@deedy5
Contributor Author

deedy5 commented May 12, 2022

It might be helpful:
psf/black#1009
psf/black#2431
mypyc/mypyc#886

@deedy5
Contributor Author

deedy5 commented May 12, 2022

> As far as I'm aware, the Setuptools extras syntax ([mypyc]) won't allow for optional compilation, just additional packages to be installed. The mypyc-compilable version could thus be packaged as a separate "charset-normalizer-speedups" package, and installed via the extra.

something like this
https://github.com/psf/black/blob/main/setup.py

USE_MYPYC = False
# To compile with mypyc, a mypyc checkout must be present on the PYTHONPATH
if len(sys.argv) > 1 and sys.argv[1] == "--use-mypyc":
    sys.argv.pop(1)
    USE_MYPYC = True
if os.getenv("BLACK_USE_MYPYC", None) == "1":
    USE_MYPYC = True

if USE_MYPYC:
    from mypyc.build import mypycify
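
Adapted to this project, the same opt-in switch might look roughly like the sketch below. It only decides whether to compile and returns the cleaned argument list. The function name `wants_mypyc` and the environment variable name `CHARSET_NORMALIZER_USE_MYPYC` are assumptions for illustration, and the actual call to `mypycify` is left out so the snippet runs without mypy installed.

```python
import os
import sys


def wants_mypyc(argv, environ):
    """Return (use_mypyc, remaining_argv) without mutating the inputs."""
    remaining = list(argv)
    use_mypyc = False
    # Command-line opt-in, mirroring the black setup.py excerpt above.
    if len(remaining) > 1 and remaining[1] == "--use-mypyc":
        remaining.pop(1)
        use_mypyc = True
    # Environment opt-in; the variable name is hypothetical.
    if environ.get("CHARSET_NORMALIZER_USE_MYPYC") == "1":
        use_mypyc = True
    return use_mypyc, remaining


if __name__ == "__main__":
    print(wants_mypyc(sys.argv, os.environ))
```

Keeping the default at "no compilation" means a plain `pip install` never needs a C compiler, which addresses the concern raised earlier in the thread.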

@deedy5
Contributor Author

deedy5 commented May 13, 2022

Surprisingly, mypyc almost catches up with Cython.

isprime_cython.py
import cython

@cython.cdivision(True)
@cython.ccall
def is_prime(n: cython.ulonglong) -> cython.bint:
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    isqrt: cython.ulong = int(n**0.5)
    sqrtn: cython.ulong = isqrt + 1
    i: cython.ulong = 0
    for i in range(5, sqrtn, 6):
        if n % i == 0 or n % (i + 2) == 0:
            return False
    return True
cythonize -a -i isprime_cython.py
isprime_mypyc.py
def is_prime(n: int) -> bool:
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    for i in range(5, int(n**0.5) + 1, 6):
        if n % i == 0 or n % (i + 2) == 0:
            return False
    return True
mypyc isprime_mypyc.py
test.py
from time import monotonic
from isprime_cython import is_prime as is_prime_cython
from isprime_mypyc import is_prime as is_prime_mypyc


def is_prime(n: int) -> bool:
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    for i in range(5, int(n**0.5) + 1, 6):
        if n % i == 0 or n % (i + 2) == 0:
            return False
    return True

START, END = 0, 10_000_000

t0 = monotonic()
r = sum(x for x in range(START, END) if is_prime(x))
print(f"is_prime: {monotonic() - t0}")

t0 = monotonic()
r = sum(x for x in range(START, END) if is_prime_cython(x))
print(f"is_prime_cython: {monotonic() - t0}")

t0 = monotonic()
r = sum(x for x in range(START, END) if is_prime_mypyc(x))
print(f"is_prime_mypyc: {monotonic() - t0}")
python3 test.py

results:

is_prime: 21.9134322960017
is_prime_cython: 3.197920835002151
is_prime_mypyc: 3.577863503996923
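
Single wall-clock runs like these can be noisy. A hedged sketch of a more repeatable measurement with the standard `timeit` module is shown below. It times only the pure-Python `is_prime` from the test above, over a much smaller range (an arbitrary choice to keep it quick); the compiled variants could be timed the same way once built.

```python
from timeit import repeat


def is_prime(n: int) -> bool:
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    for i in range(5, int(n**0.5) + 1, 6):
        if n % i == 0 or n % (i + 2) == 0:
            return False
    return True


def run(limit: int = 20_000) -> int:
    """Sum all primes below `limit`, as in the test script but on a smaller range."""
    return sum(x for x in range(limit) if is_prime(x))


# Three runs of one call each; taking the minimum reduces scheduling noise.
best = min(repeat(run, number=1, repeat=3))
print(f"is_prime best of 3: {best:.4f}s")
```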

@Ousret
Member

Ousret commented Jun 30, 2022

Well, charset-normalizer did drop Python 3.5.

Some thoughts that need to be considered beforehand.
Python 3.11 roughly doubled performance and favored Chardet a bit, but not by much (probably due to the simplicity of Chardet's code).

If we go ahead with this, it would mean, by extrapolation, that we should be about 10x faster. I expect (on 3.11) 19 ms on average, and ~9 ms or better with mypyc.
Mypy has more than half a million downloads per day, so mypyc as a whole inspires some confidence.

@deedy5
Contributor Author

deedy5 commented Jul 2, 2022

I used mypy-0.970+dev.914297e9486b141c01b3459393938fdf423d892cef, because mypy 0.961 does not support Python 3.11.

performance1.py
from glob import glob
from time import time_ns
import argparse
from sys import argv
from os.path import isdir

from charset_normalizer import detect
from chardet import detect as chardet_detect

from statistics import mean
from math import ceil


def calc_percentile(data, percentile):
    n = len(data)
    p = n * percentile / 100
    sorted_data = sorted(data)

    return sorted_data[int(p)] if p.is_integer() else sorted_data[int(ceil(p)) - 1]


def performance_compare(arguments):
    parser = argparse.ArgumentParser(
        description="Performance CI/CD check for Charset-Normalizer"
    )

    parser.add_argument('-s', '--size-increase', action="store", default=1, type=int, dest='size_coeff',
                        help="Apply artificial size increase to challenge the detection mechanism further")

    args = parser.parse_args(arguments)

    if not isdir("./char-dataset"):
        print("This script require https://github.com/Ousret/char-dataset to be cloned on package root directory")
        exit(1)

    charset_normalizer_results = []

    for tbt_path in sorted(glob("./char-dataset/**/*.*")):

        with open(tbt_path, "rb") as fp:
            content = fp.read() * args.size_coeff

        before = time_ns()
        detect(content)
        charset_normalizer_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print(str(charset_normalizer_results[-1]), tbt_path)

    charset_normalizer_avg_delay = mean(charset_normalizer_results)
    charset_normalizer_99p = calc_percentile(charset_normalizer_results, 99)
    charset_normalizer_95p = calc_percentile(charset_normalizer_results, 95)
    charset_normalizer_50p = calc_percentile(charset_normalizer_results, 50)

    print("------------------------------")
    print("--> Charset-Normalizer Conclusions")
    print("   --> Avg: " + str(charset_normalizer_avg_delay) + "s")
    print("   --> 99th: " + str(charset_normalizer_99p) + "s")
    print("   --> 95th: " + str(charset_normalizer_95p) + "s")
    print("   --> 50th: " + str(charset_normalizer_50p) + "s")

    # percentile / time data: one value per percentile, 0 through 99
    print("Percentile data --------------")
    print()
    for i in range(100):
        print(calc_percentile(charset_normalizer_results, i))

    return


if __name__ == "__main__":
    exit(
        performance_compare(
            argv[1:]
        )
    )

[image: comparison2]
[image: comparison]
Attachment: mypyc_performance.xlsx

@Ousret
Member

Ousret commented Aug 14, 2022

I started to work on a potential v3 including optional Mypyc. See https://github.com/Ousret/charset_normalizer/tree/3.0

To start testing:

git clone https://github.com/Ousret/charset_normalizer.git
cd charset_normalizer
git checkout 3.0
pip install -r dev-requirements.txt
python setup.py --use-mypyc install

On average, 10 ms per file. That is a good performance bump.
But I am worried about the final whl size: charset_normalizer-3.0.0b1-cp310-cp310-win_amd64.whl is about 500 kB to 1 MB (depending on the configuration), which is heavier than the Chardet whl.

I am doing some extra research on the subject.

@deedy5
Contributor Author

deedy5 commented Aug 14, 2022

3.0 python3.10

I. default 3.0

------------------------------
--> Chardet Conclusions
   --> Avg: 0.12321512765957447s
   --> 99th: 0.74804s
   --> 95th: 0.178s
   --> 50th: 0.01804s
------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.025958744680851065s
   --> 99th: 0.25946s
   --> 95th: 0.14132s
   --> 50th: 0.01095s
------------------------------
--> Charset-Normalizer / Chardet: Performance Сomparison
   --> Avg: x4.75
   --> 99th: x2.88
   --> 95th: x1.26
   --> 50th: x1.65

II. BUILD: python3 setup.py --use-mypyc build_ext --inplace

------------------------------
--> Chardet Conclusions
   --> Avg: 0.1224901914893617s
   --> 99th: 0.73647s
   --> 95th: 0.17915s
   --> 50th: 0.01755s
------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.010322106382978723s
   --> 99th: 0.11215s
   --> 95th: 0.05355s
   --> 50th: 0.00428s
------------------------------
--> Charset-Normalizer / Chardet: Performance Сomparison
   --> Avg: x11.87
   --> 99th: x6.57
   --> 95th: x3.35
   --> 50th: x4.1

III. Marking constants as Final (#208) + BUILD: python3 setup.py --use-mypyc build_ext --inplace

------------------------------
--> Chardet Conclusions
   --> Avg: 0.12217872340425531s
   --> 99th: 0.72175s
   --> 95th: 0.17481s
   --> 50th: 0.01731s
------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.01016231914893617s
   --> 99th: 0.1085s
   --> 95th: 0.05219s
   --> 50th: 0.00419s
------------------------------
--> Charset-Normalizer / Chardet: Performance Сomparison
   --> Avg: x12.02
   --> 99th: x6.65
   --> 95th: x3.35
   --> 50th: x4.13

3.0 python3.11b4

I. default 3.0

------------------------------
--> Chardet Conclusions
   --> Avg: 0.09048142553191489s
   --> 99th: 0.39248s
   --> 95th: 0.10197s
   --> 50th: 0.01008s
------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.018134829787234043s
   --> 99th: 0.17814s
   --> 95th: 0.09568s
   --> 50th: 0.00761s
------------------------------
--> Charset-Normalizer / Chardet: Performance Сomparison
   --> Avg: x4.99
   --> 99th: x2.2
   --> 95th: x1.07
   --> 50th: x1.32

II. Marking constants as Final (#208) + BUILD: python3.11 setup.py --use-mypyc build_ext --inplace

------------------------------
--> Chardet Conclusions
   --> Avg: 0.09024136170212767s
   --> 99th: 0.39338s
   --> 95th: 0.10097s
   --> 50th: 0.00999s
------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.009643191489361703s
   --> 99th: 0.10487s
   --> 95th: 0.04999s
   --> 50th: 0.004s
------------------------------
--> Charset-Normalizer / Chardet: Performance Сomparison
   --> Avg: x9.36
   --> 99th: x3.75
   --> 95th: x2.02
   --> 50th: x2.5

Summary of Charset-Normalizer Conclusions:

version       mypyc  Avg., s               99th, s  95th, s  50th, s
python3.10    -      0.025958744680851065  0.25946  0.14132  0.01095
python3.10    +      0.01016231914893617   0.1085   0.05219  0.00419
python3.11b4  -      0.018134829787234043  0.17814  0.09568  0.00761
python3.11b4  +      0.009643191489361703  0.10487  0.04999  0.004
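
From the summary above, the per-version gain from mypyc can be derived by dividing the uncompiled average by the compiled one. A small sketch using only the averages from the table (the variable names are illustrative):

```python
# Average seconds per file, copied from the summary table above.
averages = {
    "python3.10": {"pure": 0.025958744680851065, "mypyc": 0.01016231914893617},
    "python3.11b4": {"pure": 0.018134829787234043, "mypyc": 0.009643191489361703},
}

# Ratio above 1 means the mypyc build is faster on average.
gains = {version: row["pure"] / row["mypyc"] for version, row in averages.items()}

for version, gain in gains.items():
    print(f"{version}: x{gain:.2f} faster on average with mypyc")
```

The ratio is smaller on 3.11b4 because the uncompiled baseline is already faster there, while the two compiled averages are close to each other (roughly 10 ms per file).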

@Ousret
Member

Ousret commented Aug 15, 2022

Compiling md.py alone is sufficient. I could get the final whl size down to 80 kB.
Initial benchmarks show an insignificant difference, as I expected.
Now it is time to generate the compiled whl for all platforms (as many as possible).
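
Restricting compilation to `md.py` would amount to passing a single path to `mypycify`. The sketch below is hedged: `build_ext_modules` is a made-up helper, not code from the 3.0 branch, and the import is deferred so that the pure-Python path works even when mypy is not installed.

```python
def build_ext_modules(use_mypyc: bool):
    """Return the ext_modules list for setup(); empty means a pure-Python build."""
    if not use_mypyc:
        return []
    # Imported lazily: mypyc is only needed when compilation is requested.
    from mypyc.build import mypycify

    # Only the module named in the comment above is compiled.
    return mypycify(["charset_normalizer/md.py"])


# No compilation requested, so nothing is imported from mypyc here.
print(build_ext_modules(use_mypyc=False))
```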

@Ousret
Member

Ousret commented Aug 17, 2022

Update on the topic.

The first beta is available on https://pypi.org/project/charset-normalizer/3.0.0b1 and https://github.com/Ousret/charset_normalizer/releases/tag/3.0.0b1
The first results from a personal server are good. It is running around the clock to stress the solution. So far, no issues.

The whl size is no longer an obstacle to pursuing this.

@Ousret
Member

Ousret commented Aug 19, 2022

For me, everything is ok.
Scheduled for release when mypy/c is ready for 3.11

Answered by #209


3 participants