Use freelist for range object, iterator objects and other often used objects #126703

Open
eendebakpt opened this issue Nov 11, 2024 · 5 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), type-feature (A feature request or enhancement)

@eendebakpt
Contributor

eendebakpt commented Nov 11, 2024

Feature or enhancement

Proposal:

We can add freelists for the range object and various iterator objects to improve performance. Using the new methods from #121934, the amount of code needed to add a freelist is quite small. A prototype implementation is here:

main...eendebakpt:cpython:iter_freelists
main...eendebakpt:cpython:range_freelist

@markshannon In faster-cpython/ideas#132 (comment) you noted that freelists should be used for all types. Is adding freelists in the same style as the already existing freelists a good approach? We could also try a more generic construction (for example inside PyObject_New/PyObject_Free), but that would have a much larger impact and I cannot foresee all the consequences.
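
For reference, the per-type pattern looks roughly as follows. This is only a sketch, not the code from the prototype branches: the "ranges" freelist slot, its size limit, and the exact semantics of the _Py_FREELIST_POP/_Py_FREELIST_FREE helpers from #121934 are assumptions here.

/* Sketch only: assumes a hypothetical "ranges" entry in struct _Py_freelists
   (Include/internal/pycore_freelist_state.h) and the helpers added in #121934. */

static rangeobject *
range_alloc(void)
{
    /* Try to reuse a previously freed range object. */
    rangeobject *obj = _Py_FREELIST_POP(rangeobject, ranges);
    if (obj == NULL) {
        /* Freelist empty: fall back to a regular allocation. */
        obj = PyObject_New(rangeobject, &PyRange_Type);
    }
    return obj;
}

static void
range_dealloc(rangeobject *r)
{
    Py_DECREF(r->start);
    Py_DECREF(r->stop);
    Py_DECREF(r->step);
    Py_XDECREF(r->length);
    /* Return the memory to the freelist instead of freeing it; the helper is
       assumed to fall back to PyObject_Free once the freelist is full. */
    _Py_FREELIST_FREE(ranges, r, PyObject_Free);
}

The freelist storage itself is declared once in the shared freelist state struct, so the per-type code stays small.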

The freelists from the prototype improve performance of range in microbenchmarks:

range(1): Mean +- std dev: [rp_main] 50.0 ns +- 0.4 ns -> [rp_pr] 40.6 ns +- 0.5 ns: 1.23x faster
iter(range(1)): Mean +- std dev: [rp_main] 91.7 ns +- 2.4 ns -> [rp_pr] 68.5 ns +- 0.7 ns: 1.34x faster
list(range(1)): Mean +- std dev: [rp_main] 165 ns +- 1 ns -> [rp_pr] 144 ns +- 1 ns: 1.15x faster
list(iter(range(1))): Mean +- std dev: [rp_main] 240 ns +- 17 ns -> [rp_pr] 229 ns +- 13 ns: 1.05x faster
range(2, 1): Mean +- std dev: [rp_main] 61.6 ns +- 0.6 ns -> [rp_pr] 46.7 ns +- 0.4 ns: 1.32x faster
iter(range(2, 1)): Mean +- std dev: [rp_main] 100 ns +- 4 ns -> [rp_pr] 79.6 ns +- 4.0 ns: 1.26x faster
list(iter(range(2, 1))): Mean +- std dev: [rp_main] 227 ns +- 3 ns -> [rp_pr] 209 ns +- 4 ns: 1.08x faster
for loop length 1: Mean +- std dev: [rp_main] 146 ns +- 4 ns -> [rp_pr] 126 ns +- 6 ns: 1.16x faster
range(10): Mean +- std dev: [rp_main] 51.7 ns +- 2.4 ns -> [rp_pr] 41.0 ns +- 0.9 ns: 1.26x faster
iter(range(10)): Mean +- std dev: [rp_main] 92.7 ns +- 3.5 ns -> [rp_pr] 68.3 ns +- 2.2 ns: 1.36x faster
list(range(10)): Mean +- std dev: [rp_main] 205 ns +- 4 ns -> [rp_pr] 184 ns +- 9 ns: 1.11x faster
list(iter(range(10))): Mean +- std dev: [rp_main] 279 ns +- 7 ns -> [rp_pr] 261 ns +- 5 ns: 1.07x faster
range(2, 10): Mean +- std dev: [rp_main] 57.7 ns +- 1.5 ns -> [rp_pr] 48.5 ns +- 2.2 ns: 1.19x faster
iter(range(2, 10)): Mean +- std dev: [rp_main] 101 ns +- 3 ns -> [rp_pr] 78.1 ns +- 1.1 ns: 1.29x faster
list(iter(range(2, 10))): Mean +- std dev: [rp_main] 284 ns +- 17 ns -> [rp_pr] 262 ns +- 10 ns: 1.08x faster
for loop length 10: Mean +- std dev: [rp_main] 329 ns +- 7 ns -> [rp_pr] 295 ns +- 5 ns: 1.12x faster
range(100): Mean +- std dev: [rp_main] 52.3 ns +- 2.2 ns -> [rp_pr] 41.1 ns +- 1.3 ns: 1.27x faster
iter(range(100)): Mean +- std dev: [rp_main] 91.7 ns +- 3.5 ns -> [rp_pr] 69.6 ns +- 2.1 ns: 1.32x faster
list(range(100)): Mean +- std dev: [rp_main] 653 ns +- 27 ns -> [rp_pr] 603 ns +- 29 ns: 1.08x faster
list(iter(range(100))): Mean +- std dev: [rp_main] 701 ns +- 17 ns -> [rp_pr] 671 ns +- 17 ns: 1.04x faster
range(2, 100): Mean +- std dev: [rp_main] 58.5 ns +- 2.8 ns -> [rp_pr] 48.1 ns +- 2.3 ns: 1.22x faster
iter(range(2, 100)): Mean +- std dev: [rp_main] 99.6 ns +- 1.5 ns -> [rp_pr] 78.4 ns +- 0.8 ns: 1.27x faster
list(iter(range(2, 100))): Mean +- std dev: [rp_main] 710 ns +- 28 ns -> [rp_pr] 674 ns +- 16 ns: 1.05x faster
for loop length 100: Mean +- std dev: [rp_main] 2.06 us +- 0.07 us -> [rp_pr] 1.98 us +- 0.11 us: 1.04x faster
range(400): Mean +- std dev: [rp_main] 71.3 ns +- 2.2 ns -> [rp_pr] 61.7 ns +- 5.1 ns: 1.16x faster
iter(range(400)): Mean +- std dev: [rp_main] 112 ns +- 5 ns -> [rp_pr] 89.5 ns +- 2.8 ns: 1.25x faster
list(iter(range(400))): Mean +- std dev: [rp_main] 3.88 us +- 0.18 us -> [rp_pr] 3.77 us +- 0.08 us: 1.03x faster
range(2, 400): Mean +- std dev: [rp_main] 79.3 ns +- 1.7 ns -> [rp_pr] 66.0 ns +- 3.7 ns: 1.20x faster
iter(range(2, 400)): Mean +- std dev: [rp_main] 119 ns +- 2 ns -> [rp_pr] 99.5 ns +- 3.1 ns: 1.19x faster
for loop length 400: Mean +- std dev: [rp_main] 12.1 us +- 0.7 us -> [rp_pr] 10.9 us +- 0.4 us: 1.11x faster

Benchmark hidden because not significant (2): list(range(400)), list(iter(range(2, 400)))

Geometric mean: 1.16x faster
Benchmark script:

import pyperf

runner = pyperf.Runner()

loop = """
def g(n):
    x = 0
    for ii in range(n):
        x += 1
"""

for s in [1, 10, 100, 400]:
    time = runner.timeit(name=f'range({s})', stmt=f"range({s})")
    time = runner.timeit(name=f'iter(range({s}))', stmt=f"iter(range({s}))")
    time = runner.timeit(name=f'list(range({s}))', stmt=f"list(range({s}))")
    time = runner.timeit(name=f'list(iter(range({s})))', stmt=f"list(iter(range({s})))")

    time = runner.timeit(name=f'range(2, {s})', stmt=f"range(2, {s})")
    time = runner.timeit(name=f'iter(range(2, {s}))', stmt=f"iter(range(2, {s}))")
    time = runner.timeit(name=f'list(iter(range(2, {s})))', stmt=f"list(iter(range(2, {s})))")

    time = runner.timeit(name=f'for loop length {s}', stmt=f"g({s})", setup=loop)

On the pyperformance test suite (actually a subset of the suite, since not all benchmarks run on my system), the percentage of successful freelist allocations increases by about 1%.

Main:

Allocations from freelist 2,004,971,371 39.8%
Frees to freelist 2,005,350,418
Allocations 3,034,877,938 60.2%
Allocations to 512 bytes 3,008,791,812 59.7%
Allocations to 4 kbytes 18,648,072 0.4%
Allocations over 4 kbytes 7,438,054 0.1%
Frees 3,142,033,922

PR:

Allocations from freelist 2,064,126,652 40.8%
Frees to freelist 2,064,499,538
Allocations 2,989,239,063 59.2%
Allocations to 512 bytes 2,963,207,194 58.6%
Allocations to 4 kbytes 18,596,157 0.4%
Allocations over 4 kbytes 7,435,712 0.1%
Frees 3,096,349,170

(Some quick testing shows that most of the allocations that do not go through a freelist are due to Python int objects, so that is another candidate for a freelist.)



### Has this already been discussed elsewhere?

No response given

### Links to previous discussion of this feature:

Some references to discussions on freelists:

* https://github.com/python/cpython/pull/121934
* https://github.com/faster-cpython/ideas/discussions/132
* https://github.com/python/cpython/issues/91912
* https://github.com/python/cpython/pull/101453

### Linked PRs
* gh-128368
* gh-128592
* gh-128594
* gh-128619
* gh-128692
@eendebakpt added the type-feature (A feature request or enhancement) label Nov 11, 2024
@tomasr8 added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 11, 2024
@Fidget-Spinner
Member

I would be willing to shepherd/champion such a PR. Back in the day we chose not to do this, presumably because the added code complexity was not worth it. However, with Sam's consistent implementation of freelists, that point is now moot. Freelists are now mostly really simple to add.

However, may I suggest trying medium ints first? We can re-benchmark pyperformance with that. If you're not interested, just let me know and I can take over the medium ints.

@eendebakpt
Contributor Author

I started with range because the ints are more complex (either cover only the medium ints, or use a more elaborate scheme with size-based freelists).

I already have a working branch for the medium int case, see main...eendebakpt:cpython:int_freelist. Once I have some more benchmarks and statistics I will open a PR.

@eendebakpt changed the title from "Use freelist for range and range_iter objects" to "Use freelist for range and iter objects" on Dec 21, 2024
@eendebakpt
Contributor Author

For the list and tuple iterator objects we can also gain performance in microbenchmarks:

bench_list: Mean +- std dev: [iter_main] 180 ns +- 9 ns -> [iter_pr] 161 ns +- 2 ns: 1.12x faster
bench_tuple: Mean +- std dev: [iter_main] 186 ns +- 16 ns -> [iter_pr] 169 ns +- 22 ns: 1.10x faster
bench_range: Mean +- std dev: [iter_main] 186 ns +- 1 ns -> [iter_pr] 180 ns +- 3 ns: 1.04x faster

Geometric mean: 1.09x faster

Script:


import pyperf


def bench_list(loops):
    range_it = iter(range(loops))

    lst = list(range(5))
    t0 = pyperf.perf_counter()
    for ii in range_it:
        x = 0
        for ii in lst:
            x += ii
    return pyperf.perf_counter() - t0


def bench_tuple(loops):
    range_it = iter(range(loops))

    tpl = tuple(range(5))
    t0 = pyperf.perf_counter()
    for ii in range_it:
        x = 0
        for ii in tpl:
            x += ii
    return pyperf.perf_counter() - t0


def bench_range(loops):
    range_it = iter(range(loops))

    r = range(5)
    t0 = pyperf.perf_counter()
    for ii in range_it:
        x = 0
        for ii in r:
            x += ii
    return pyperf.perf_counter() - t0


# %timeit bench_list(1000)

if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.bench_time_func("bench_list", bench_list)
    runner.bench_time_func("bench_tuple", bench_tuple)
    runner.bench_time_func("bench_range", bench_range)

@eendebakpt
Contributor Author

eendebakpt commented Dec 30, 2024

To determine for which objects a freelist should be added (or removed), some statistics have been gathered. The data was gathered by adding some additional statistics functionality (branch: main...eendebakpt:cpython:wip_stats) and running the pyperformance benchmarks (minus some benchmarks that do not run on my system).

Most object allocations are tracked via generic calls such as _PyObject_New or _PyObject_NewVar (using the type->tp_name). Some objects allocate their memory directly (e.g. lists and tuples); for these objects, statistics have been added manually. Not all objects have been covered manually, so a few will be missing from the statistics.
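
As an illustration of that kind of per-type counting (this is not the code from the wip_stats branch, just a sketch; the count_allocation helper and the fixed-size table are made up for this example):

/* Illustration only -- the real instrumentation is in the wip_stats branch.
   A crude per-type allocation counter keyed on tp_name, to be called from the
   generic allocators (_PyObject_New / _PyObject_NewVar). */
#include "Python.h"
#include <string.h>

#define MAX_TRACKED_TYPES 1024

static struct {
    const char *name;               /* borrowed pointer to type->tp_name */
    unsigned long long count;
} alloc_stats[MAX_TRACKED_TYPES];

static void
count_allocation(PyTypeObject *type)
{
    const char *name = type->tp_name;
    for (int i = 0; i < MAX_TRACKED_TYPES; i++) {
        if (alloc_stats[i].name == NULL) {
            /* First allocation of this type: claim a free slot. */
            alloc_stats[i].name = name;
            alloc_stats[i].count = 1;
            return;
        }
        if (strcmp(alloc_stats[i].name, name) == 0) {
            alloc_stats[i].count++;
            return;
        }
    }
    /* Table full: drop the sample; good enough for rough statistics. */
}

Dumping the table (e.g. at interpreter shutdown) is left out of the sketch.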

Here are the results for the most frequently allocated types, together with the number of freelist allocations:

Allocations Counts
_PyLong_New 514.4M
_PyLong_New_digits_2 369.5M
list_iterator 215.6M
builtin_function_or_method 165.6M
generator 146.7M
tuple_iterator 133.0M
method 120.2M
_PyLong_New_digits_1 104.2M
StopIteration 94.7M
function 88.7M
_safe_key 72.0M
range 65.8M
coroutine 59.4M
cell 46.7M
range_iterator 44.1M
frame 42.5M
dict_items 41.7M
Fraction 30.0M
xml.etree.ElementTree.Element 30.0M
_PyLong_New_digits_3 28.4M
builtin_method 28.1M
Vector 27.2M
traceback 24.7M
decimal.Decimal 24.0M
dict_itemiterator 21.5M
PyFloat_FromDouble_alloc 19.9M
tuple 16.5M
PyList_New_allocate 16.0M
list 10.7M
Tree 12.0M
IndexError 11.0M
iterator 10.0M
set_iterator 9.3M
re.Match 8.6M
Node 8.1M
weakref.ReferenceType 8.0M
AttributeError 7.9M
dict 7.8M
set 6.7M
Point 6.3M
frozenset 6.3M
StopAsyncIteration 6.0M
async_generator 6.0M
Ray 5.9M
KeyError 4.8M
mappingproxy 4.7M
collections._deque_iterator 4.6M
_PyLong_New_digits_4 4.2M
GVector 4.2M
map 3.8M
halfstate_t 3.3M
_pickle.Pickler 3.2M
Markup 2.9M
enumerate 2.7M
str_ascii_iterator 2.4M
posix.DirEntry 2.1M
sqlite3.Cursor 2.0M
C 2.0M
_json.Encoder 1.9M
list_reverseiterator 1.8M
Text 1.8M
Freelist allocations Counts
ints 2064.0M
floats 1461.5M
tuples[index] 1251.7M
lists 451.0M
slices 347.0M
dicts 150.0M
async_gen_asends 100.1M
async_gens 94.1M
dictkeys 42.9M
unicode_writers 1.9M
contexts 511282
futureiters 495349

Some observations:

  • Most allocations are of type int (via _PyLong_New). There is already a freelist for compact ints that is working well (the ratio of allocations to freelist hits is about 1:4). There is a large number of allocations with 2 or 3 digits (not covered by the freelist), so perhaps extending the freelist to cover ints with 2 (and 3) digits is worthwhile. Note that most hash values are ints with 2 digits. There are still some allocations of ints with 1 digit; perhaps the freelist size can be increased, or freelist objects can be allocated in bulk (see faster-cpython/ideas#132, "Use a principled, and consistent, implementation of freelists").

  • Also high in the number of allocations are iterators (tuple_iter, list_iter, range_iterator, PySeqIter_New). Adding freelists for these is relatively easy. Perhaps we can share one freelist between these iterators (we then need to update the type on reuse, but otherwise they can be shared); a rough sketch of this idea follows after this list. A prototype of a shared freelist for different iterators is in main...eendebakpt:cpython:iter_freelists.
    For these iterators the freelist size can probably be small.

  • There are many allocations for the types builtin_function_or_method, generator, method and function. For method, adding a freelist is standard (in Objects/classobject.c). The builtin_function_or_method case happens often (for example, a new object of this type is created when x.append is looked up on a list x = [1, 2]). The adaptive interpreter seems to reduce the number of builtin_function_or_method objects significantly, but we could still add a freelist. This requires some thought, as PyCMethod_New can allocate either a PyCMethodObject or a PyCFunctionObject:

    if (ml->ml_flags & METH_METHOD) {
        if (!cls) {
            PyErr_SetString(PyExc_SystemError,
                            "attempting to create PyCMethod with a METH_METHOD "
                            "flag but no class");
            return NULL;
        }
        PyCMethodObject *om = PyObject_GC_New(PyCMethodObject, &PyCMethod_Type);
        if (om == NULL) {
            return NULL;
        }
        om->mm_class = (PyTypeObject*)Py_NewRef(cls);
        op = (PyCFunctionObject *)om;
    } else {
        if (cls) {
            PyErr_SetString(PyExc_SystemError,
                            "attempting to create PyCFunction with class "
                            "but no METH_METHOD flag");
            return NULL;
        }
        op = PyObject_GC_New(PyCFunctionObject, &PyCFunction_Type);
        if (op == NULL) {
            return NULL;
        }
    }
    The function and generator objects don't seem to fit the standard freelist pattern at first glance.

  • StopIteration is in the top stats and seems like a good candidate for a freelist, together with IndexError. The implementation (in Objects/exceptions.c) uses macros to define several variations of exceptions. Maybe they can be combined in one freelist, but this requires some more thought.

  • The _safe_key object also shows up. This is probably from pprint. Is that a performance bottleneck? Maybe not: it is a Python class, so the overhead of the allocation itself is probably small. I am not sure which benchmark triggers this one.

  • tuple and list are still allocated a lot. However, the 8 million tuple allocations seem fine next to the 625 million freelist allocations for tuple!

  • There are some objects specific to the pyperformance benchmarks (e.g. the Vector class from bm_raytrace and the GVector class from bm_chaos). Fraction is there as well. Perhaps we can create a size-based freelist for PyType_GenericAlloc? Code is here: https://github.com/python/cpython/blob/71de839ec987bb67b98fcfecfc687281841a713c/Objects/typeobject.c#L2213C1-L2213C21

  • For lists we can use a size-based freelist (just as for tuple). The reason for doing this is to avoid having to allocate/deallocate the space for the list items on each freelist pop/push. The data suggests that most list allocations are for fewer than 20 items (see the figure below). One issue with lists is that they can be resized, so perhaps we allocate many lists of small size, but by the time they are deallocated the lists can be larger.
    [Figure: histogram of list allocation sizes (screenshot from 2024-12-30)]
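
As referenced in the iterator bullet above, here is a minimal sketch of the shared iterator freelist idea. The "iters" freelist slot and the list_iter_new helper name are hypothetical, and the iter_freelists prototype branch is the authoritative version; the sketch also assumes the freed memory is large enough for each of the iterator structs that share the freelist.

/* Sketch only: several iterator types (list, tuple, range) could share one
   freelist (hypothetical "iters" slot), as long as each freed block is large
   enough for any of them.  On reuse only the type pointer needs to be
   corrected.  The dealloc side (not shown) untracks the object from the GC
   and pushes it onto the same freelist. */

static PyObject *
list_iter_new(PyListObject *seq)            /* hypothetical helper name */
{
    _PyListIterObject *it = _Py_FREELIST_POP(_PyListIterObject, iters);
    if (it == NULL) {
        /* Freelist empty: allocate a fresh, GC-enabled iterator. */
        it = PyObject_GC_New(_PyListIterObject, &PyListIter_Type);
        if (it == NULL) {
            return NULL;
        }
    }
    else if (Py_TYPE((PyObject *)it) != &PyListIter_Type) {
        /* The freed object may have last been a tuple or range iterator. */
        Py_SET_TYPE((PyObject *)it, &PyListIter_Type);
    }
    it->it_index = 0;
    it->it_seq = (PyListObject *)Py_NewRef((PyObject *)seq);
    _PyObject_GC_TRACK(it);
    return (PyObject *)it;
}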

@Fidget-Spinner
Member

Wow, excellent investigative work!

@eendebakpt changed the title from "Use freelist for range and iter objects" to "Use freelist for range object, iterator objects and other often used objects" on Jan 7, 2025