Use freelist for range object, iterator objects and other often used objects #126703

Open
eendebakpt opened this issue Nov 11, 2024 · 5 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), type-feature (A feature request or enhancement)

@eendebakpt
Contributor

eendebakpt commented Nov 11, 2024

Feature or enhancement

Proposal:

We can add freelists for the range object and various iterator objects to improve performance. Using the new methods from #121934, the amount of code needed to add a freelist is quite small. A prototype implementation is here:

main...eendebakpt:cpython:iter_freelists
main...eendebakpt:cpython:range_freelist

@markshannon In faster-cpython/ideas#132 (comment) you noted that freelists should be used for all types. Is adding freelists in the same style as the already existing freelists a good approach? We could also try a more generic construction (for example inside PyObject_New/PyObject_Free), but that would have a much larger impact and I cannot foresee all the consequences.
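
For reference, the per-type pattern looks roughly as follows. This is only a sketch, not the code from the prototype branches: the "ranges" freelist slot, its size limit, and the exact semantics of the _Py_FREELIST_POP/_Py_FREELIST_FREE helpers from #121934 are assumptions here.

/* Sketch only: assumes a hypothetical "ranges" entry in struct _Py_freelists
   (Include/internal/pycore_freelist_state.h) and the helpers added in #121934. */

static rangeobject *
range_alloc(void)
{
    /* Try to reuse a previously freed range object. */
    rangeobject *obj = _Py_FREELIST_POP(rangeobject, ranges);
    if (obj == NULL) {
        /* Freelist empty: fall back to a regular allocation. */
        obj = PyObject_New(rangeobject, &PyRange_Type);
    }
    return obj;
}

static void
range_dealloc(rangeobject *r)
{
    Py_DECREF(r->start);
    Py_DECREF(r->stop);
    Py_DECREF(r->step);
    Py_XDECREF(r->length);
    /* Return the memory to the freelist instead of freeing it; the helper is
       assumed to fall back to PyObject_Free once the freelist is full. */
    _Py_FREELIST_FREE(ranges, r, PyObject_Free);
}

The freelist storage itself is declared once in the shared freelist state struct, so the per-type code stays small.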

The freelists from the prototype improve performance of range in microbenchmarks:

range(1): Mean +- std dev: [rp_main] 50.0 ns +- 0.4 ns -> [rp_pr] 40.6 ns +- 0.5 ns: 1.23x faster
iter(range(1)): Mean +- std dev: [rp_main] 91.7 ns +- 2.4 ns -> [rp_pr] 68.5 ns +- 0.7 ns: 1.34x faster
list(range(1)): Mean +- std dev: [rp_main] 165 ns +- 1 ns -> [rp_pr] 144 ns +- 1 ns: 1.15x faster
list(iter(range(1))): Mean +- std dev: [rp_main] 240 ns +- 17 ns -> [rp_pr] 229 ns +- 13 ns: 1.05x faster
range(2, 1): Mean +- std dev: [rp_main] 61.6 ns +- 0.6 ns -> [rp_pr] 46.7 ns +- 0.4 ns: 1.32x faster
iter(range(2, 1)): Mean +- std dev: [rp_main] 100 ns +- 4 ns -> [rp_pr] 79.6 ns +- 4.0 ns: 1.26x faster
list(iter(range(2, 1))): Mean +- std dev: [rp_main] 227 ns +- 3 ns -> [rp_pr] 209 ns +- 4 ns: 1.08x faster
for loop length 1: Mean +- std dev: [rp_main] 146 ns +- 4 ns -> [rp_pr] 126 ns +- 6 ns: 1.16x faster
range(10): Mean +- std dev: [rp_main] 51.7 ns +- 2.4 ns -> [rp_pr] 41.0 ns +- 0.9 ns: 1.26x faster
iter(range(10)): Mean +- std dev: [rp_main] 92.7 ns +- 3.5 ns -> [rp_pr] 68.3 ns +- 2.2 ns: 1.36x faster
list(range(10)): Mean +- std dev: [rp_main] 205 ns +- 4 ns -> [rp_pr] 184 ns +- 9 ns: 1.11x faster
list(iter(range(10))): Mean +- std dev: [rp_main] 279 ns +- 7 ns -> [rp_pr] 261 ns +- 5 ns: 1.07x faster
range(2, 10): Mean +- std dev: [rp_main] 57.7 ns +- 1.5 ns -> [rp_pr] 48.5 ns +- 2.2 ns: 1.19x faster
iter(range(2, 10)): Mean +- std dev: [rp_main] 101 ns +- 3 ns -> [rp_pr] 78.1 ns +- 1.1 ns: 1.29x faster
list(iter(range(2, 10))): Mean +- std dev: [rp_main] 284 ns +- 17 ns -> [rp_pr] 262 ns +- 10 ns: 1.08x faster
for loop length 10: Mean +- std dev: [rp_main] 329 ns +- 7 ns -> [rp_pr] 295 ns +- 5 ns: 1.12x faster
range(100): Mean +- std dev: [rp_main] 52.3 ns +- 2.2 ns -> [rp_pr] 41.1 ns +- 1.3 ns: 1.27x faster
iter(range(100)): Mean +- std dev: [rp_main] 91.7 ns +- 3.5 ns -> [rp_pr] 69.6 ns +- 2.1 ns: 1.32x faster
list(range(100)): Mean +- std dev: [rp_main] 653 ns +- 27 ns -> [rp_pr] 603 ns +- 29 ns: 1.08x faster
list(iter(range(100))): Mean +- std dev: [rp_main] 701 ns +- 17 ns -> [rp_pr] 671 ns +- 17 ns: 1.04x faster
range(2, 100): Mean +- std dev: [rp_main] 58.5 ns +- 2.8 ns -> [rp_pr] 48.1 ns +- 2.3 ns: 1.22x faster
iter(range(2, 100)): Mean +- std dev: [rp_main] 99.6 ns +- 1.5 ns -> [rp_pr] 78.4 ns +- 0.8 ns: 1.27x faster
list(iter(range(2, 100))): Mean +- std dev: [rp_main] 710 ns +- 28 ns -> [rp_pr] 674 ns +- 16 ns: 1.05x faster
for loop length 100: Mean +- std dev: [rp_main] 2.06 us +- 0.07 us -> [rp_pr] 1.98 us +- 0.11 us: 1.04x faster
range(400): Mean +- std dev: [rp_main] 71.3 ns +- 2.2 ns -> [rp_pr] 61.7 ns +- 5.1 ns: 1.16x faster
iter(range(400)): Mean +- std dev: [rp_main] 112 ns +- 5 ns -> [rp_pr] 89.5 ns +- 2.8 ns: 1.25x faster
list(iter(range(400))): Mean +- std dev: [rp_main] 3.88 us +- 0.18 us -> [rp_pr] 3.77 us +- 0.08 us: 1.03x faster
range(2, 400): Mean +- std dev: [rp_main] 79.3 ns +- 1.7 ns -> [rp_pr] 66.0 ns +- 3.7 ns: 1.20x faster
iter(range(2, 400)): Mean +- std dev: [rp_main] 119 ns +- 2 ns -> [rp_pr] 99.5 ns +- 3.1 ns: 1.19x faster
for loop length 400: Mean +- std dev: [rp_main] 12.1 us +- 0.7 us -> [rp_pr] 10.9 us +- 0.4 us: 1.11x faster

Benchmark hidden because not significant (2): list(range(400)), list(iter(range(2, 400)))

Geometric mean: 1.16x faster
Benchmark script:

import pyperf

runner = pyperf.Runner()

loop = """
def g(n):
    x = 0
    for ii in range(n):
        x += 1
"""

for s in [1, 10, 100, 400]:
    time = runner.timeit(name=f'range({s})', stmt=f"range({s})")
    time = runner.timeit(name=f'iter(range({s}))', stmt=f"iter(range({s}))")
    time = runner.timeit(name=f'list(range({s}))', stmt=f"list(range({s}))")
    time = runner.timeit(name=f'list(iter(range({s})))', stmt=f"list(iter(range({s})))")

    time = runner.timeit(name=f'range(2, {s})', stmt=f"range(2, {s})")
    time = runner.timeit(name=f'iter(range(2, {s}))', stmt=f"iter(range(2, {s}))")
    time = runner.timeit(name=f'list(iter(range(2, {s})))', stmt=f"list(iter(range(2, {s})))")

    time = runner.timeit(name=f'for loop length {s}', stmt=f"g({s})", setup=loop)

On the pyperformance test suite (actually a subset of the suite, since not all benchmarks run on my system), the percentage of successful freelist allocations increases by about 1%.

Main:

Allocations from freelist 2,004,971,371 39.8%
Frees to freelist 2,005,350,418
Allocations 3,034,877,938 60.2%
Allocations to 512 bytes 3,008,791,812 59.7%
Allocations to 4 kbytes 18,648,072 0.4%
Allocations over 4 kbytes 7,438,054 0.1%
Frees 3,142,033,922

PR:

Allocations from freelist 2,064,126,652 40.8%
Frees to freelist 2,064,499,538
Allocations 2,989,239,063 59.2%
Allocations to 512 bytes 2,963,207,194 58.6%
Allocations to 4 kbytes 18,596,157 0.4%
Allocations over 4 kbytes 7,435,712 0.1%
Frees 3,096,349,170

(Some quick testing shows that most of the allocations that do not go through a freelist are due to Python int objects, so that is another candidate for a freelist.)



### Has this already been discussed elsewhere?

No response given

### Links to previous discussion of this feature:

Some references to discussions on freelists:

* https://github.com/python/cpython/pull/121934
* https://github.com/faster-cpython/ideas/discussions/132
* https://github.com/python/cpython/issues/91912
* https://github.com/python/cpython/pull/101453

### Linked PRs
* gh-128368
* gh-128592
* gh-128594
* gh-128619
* gh-128692
@eendebakpt added the type-feature (A feature request or enhancement) label Nov 11, 2024
@tomasr8 added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 11, 2024
@Fidget-Spinner
Member

I would be willing to shepherd/champion such a PR. Back in the day we chose not to do this, presumably because the added code complexity was not worth it. However, with Sam's consistent implementation of freelists, that point is now moot. Freelists are now mostly really simple to add.

However, may I suggest trying medium ints first? We can re-benchmark pyperformance with that. If you're not interested, just let me know and I can take over the medium ints.

@eendebakpt
Contributor Author

I started with range because the ints are more complex (either cover only the medium ints, or use a more elaborate scheme with size-based freelists).

I already have a working branch for the medium int case, see main...eendebakpt:cpython:int_freelist. Once I have some more benchmarks and statistics I will open a PR.

@eendebakpt changed the title from "Use freelist for range and range_iter objects" to "Use freelist for range and iter objects" on Dec 21, 2024
@eendebakpt
Contributor Author

For the list and tuple iterator objects we can also gain performance in microbenchmarks:

bench_list: Mean +- std dev: [iter_main] 180 ns +- 9 ns -> [iter_pr] 161 ns +- 2 ns: 1.12x faster
bench_tuple: Mean +- std dev: [iter_main] 186 ns +- 16 ns -> [iter_pr] 169 ns +- 22 ns: 1.10x faster
bench_range: Mean +- std dev: [iter_main] 186 ns +- 1 ns -> [iter_pr] 180 ns +- 3 ns: 1.04x faster

Geometric mean: 1.09x faster

Script:


import pyperf


def bench_list(loops):
    range_it = iter(range(loops))

    lst = list(range(5))
    t0 = pyperf.perf_counter()
    for ii in range_it:
        x = 0
        for ii in lst:
            x += ii
    return pyperf.perf_counter() - t0


def bench_tuple(loops):
    range_it = iter(range(loops))

    tpl = tuple(range(5))
    t0 = pyperf.perf_counter()
    for ii in range_it:
        x = 0
        for ii in tpl:
            x += ii
    return pyperf.perf_counter() - t0


def bench_range(loops):
    range_it = iter(range(loops))

    r = range(5)
    t0 = pyperf.perf_counter()
    for ii in range_it:
        x = 0
        for ii in r:
            x += ii
    return pyperf.perf_counter() - t0


# %timeit bench_list(1000)

if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.bench_time_func("bench_list", bench_list)
    runner.bench_time_func("bench_tuple", bench_tuple)
    runner.bench_time_func("bench_range", bench_range)

@eendebakpt
Contributor Author

eendebakpt commented Dec 30, 2024

To determine for which objects a freelist should be added (or removed), some statistics have been gathered. The data was gathered by adding some additional statistics functionality (branch: main...eendebakpt:cpython:wip_stats) and running the pyperformance benchmarks (minus some benchmarks that do not run on my system).

Most object allocations are tracked via generic calls such as _PyObject_New or _PyObject_NewVar (using the type->tp_name). Some objects allocate their memory directly (e.g. lists and tuples); for these objects, statistics have been added manually. Not all objects have been covered manually, so a few will be missing from the statistics.
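
As an illustration of that kind of per-type counting (this is not the code from the wip_stats branch, just a sketch; the count_allocation helper and the fixed-size table are made up for this example):

/* Illustration only -- the real instrumentation is in the wip_stats branch.
   A crude per-type allocation counter keyed on tp_name, to be called from the
   generic allocators (_PyObject_New / _PyObject_NewVar). */
#include "Python.h"
#include <string.h>

#define MAX_TRACKED_TYPES 1024

static struct {
    const char *name;               /* borrowed pointer to type->tp_name */
    unsigned long long count;
} alloc_stats[MAX_TRACKED_TYPES];

static void
count_allocation(PyTypeObject *type)
{
    const char *name = type->tp_name;
    for (int i = 0; i < MAX_TRACKED_TYPES; i++) {
        if (alloc_stats[i].name == NULL) {
            /* First allocation of this type: claim a free slot. */
            alloc_stats[i].name = name;
            alloc_stats[i].count = 1;
            return;
        }
        if (strcmp(alloc_stats[i].name, name) == 0) {
            alloc_stats[i].count++;
            return;
        }
    }
    /* Table full: drop the sample; good enough for rough statistics. */
}

Dumping the table (e.g. at interpreter shutdown) is left out of the sketch.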

Here are the results for the most frequently allocated types, together with the number of freelist allocations:

Allocations Counts
_PyLong_New 514.4M
_PyLong_New_digits_2 369.5M
list_iterator 215.6M
builtin_function_or_method 165.6M
generator 146.7M
tuple_iterator 133.0M
method 120.2M
_PyLong_New_digits_1 104.2M
StopIteration 94.7M
function 88.7M
_safe_key 72.0M
range 65.8M
coroutine 59.4M
cell 46.7M
range_iterator 44.1M
frame 42.5M
dict_items 41.7M
Fraction 30.0M
xml.etree.ElementTree.Element 30.0M
_PyLong_New_digits_3 28.4M
builtin_method 28.1M
Vector 27.2M
traceback 24.7M
decimal.Decimal 24.0M
dict_itemiterator 21.5M
PyFloat_FromDouble_alloc 19.9M
tuple 16.5M
PyList_New_allocate 16.0M
list 10.7M
Tree 12.0M
IndexError 11.0M
iterator 10.0M
set_iterator 9.3M
re.Match 8.6M
Node 8.1M
weakref.ReferenceType 8.0M
AttributeError 7.9M
dict 7.8M
set 6.7M
Point 6.3M
frozenset 6.3M
StopAsyncIteration 6.0M
async_generator 6.0M
Ray 5.9M
KeyError 4.8M
mappingproxy 4.7M
collections._deque_iterator 4.6M
_PyLong_New_digits_4 4.2M
GVector 4.2M
map 3.8M
halfstate_t 3.3M
_pickle.Pickler 3.2M
Markup 2.9M
enumerate 2.7M
str_ascii_iterator 2.4M
posix.DirEntry 2.1M
sqlite3.Cursor 2.0M
C 2.0M
_json.Encoder 1.9M
list_reverseiterator 1.8M
Text 1.8M
Freelist allocations Counts
ints 2064.0M
floats 1461.5M
tuples[index] 1251.7M
lists 451.0M
slices 347.0M
dicts 150.0M
async_gen_asends 100.1M
async_gens 94.1M
dictkeys 42.9M
unicode_writers 1.9M
contexts 511282
futureiters 495349

Some observations:

  • Most allocations are of type int (via _PyLong_New). There is already a freelist for compact ints that is working well (the ratio of allocations to freelist hits is about 1:4). There is a large number of allocations with 2 or 3 digits (not covered by the freelist), so perhaps extending the freelist to cover ints with 2 (and 3) digits is worthwhile. Note that most hash values are ints with 2 digits. There are still some allocations of ints with 1 digit; perhaps the freelist size can be increased, or freelist objects can be allocated in bulk (see faster-cpython/ideas#132, "Use a principled, and consistent, implementation of freelists").

  • Also high in the number of allocations are iterators (tuple_iter, list_iter, range_iterator, PySeqIter_New). Adding freelists for these is relatively easy. Perhaps we can share one freelist between these iterators (we then need to update the type on reuse, but otherwise they can be shared); a rough sketch of this idea follows after this list. A prototype of a shared freelist for different iterators is in main...eendebakpt:cpython:iter_freelists.
    For these iterators the freelist size can probably be small.

  • There are many allocations for the types builtin_function_or_method, generator, method and function. For method, adding a freelist is standard (in Objects/classobject.c). The builtin_function_or_method case happens often (for example, a new object of this type is created when x.append is looked up on a list x = [1, 2]). The adaptive interpreter seems to reduce the number of builtin_function_or_method objects significantly, but we could still add a freelist. This requires some thought, as PyCMethod_New can allocate either a PyCMethodObject or a PyCFunctionObject:

    if (ml->ml_flags & METH_METHOD) {
        if (!cls) {
            PyErr_SetString(PyExc_SystemError,
                            "attempting to create PyCMethod with a METH_METHOD "
                            "flag but no class");
            return NULL;
        }
        PyCMethodObject *om = PyObject_GC_New(PyCMethodObject, &PyCMethod_Type);
        if (om == NULL) {
            return NULL;
        }
        om->mm_class = (PyTypeObject*)Py_NewRef(cls);
        op = (PyCFunctionObject *)om;
    } else {
        if (cls) {
            PyErr_SetString(PyExc_SystemError,
                            "attempting to create PyCFunction with class "
                            "but no METH_METHOD flag");
            return NULL;
        }
        op = PyObject_GC_New(PyCFunctionObject, &PyCFunction_Type);
        if (op == NULL) {
            return NULL;
        }
    }
    The function and generator objects don't seem to fit the standard freelist pattern at first glance.

  • StopIteration is in the top stats and seems like a good candidate for a freelist, together with IndexError. The implementation (in Objects/exceptions.c) uses macros to define several variations of exceptions. Maybe they can be combined in one freelist, but this requires some more thought.

  • The _safe_key object also shows up. This is probably from pprint. Is that a performance bottleneck? Maybe not: it is a Python class, so the overhead of the allocation itself is probably small. I am not sure which benchmark triggers this one.

  • tuple and list are still allocated a lot. However, the 8 million tuple allocations seem fine next to the 625 million freelist allocations for tuple!

  • There are some objects specific to the pyperformance benchmarks (e.g. the Vector class from bm_raytrace and the GVector class from bm_chaos). Fraction is there as well. Perhaps we can create a size-based freelist for PyType_GenericAlloc? Code is here: https://github.com/python/cpython/blob/71de839ec987bb67b98fcfecfc687281841a713c/Objects/typeobject.c#L2213C1-L2213C21

  • For lists we can use a size-based freelist (just as for tuple). The reason for doing this is to avoid having to allocate/deallocate the space for the list items on each freelist pop/push. The data suggests that most list allocations are for fewer than 20 items (see the figure below). One issue with lists is that they can be resized, so perhaps we allocate many lists of small size, but by the time they are deallocated the lists can be larger.
    [Figure: histogram of list allocation sizes (screenshot from 2024-12-30)]
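
As referenced in the iterator bullet above, here is a minimal sketch of the shared iterator freelist idea. The "iters" freelist slot and the list_iter_new helper name are hypothetical, and the iter_freelists prototype branch is the authoritative version; the sketch also assumes the freed memory is large enough for each of the iterator structs that share the freelist.

/* Sketch only: several iterator types (list, tuple, range) could share one
   freelist (hypothetical "iters" slot), as long as each freed block is large
   enough for any of them.  On reuse only the type pointer needs to be
   corrected.  The dealloc side (not shown) untracks the object from the GC
   and pushes it onto the same freelist. */

static PyObject *
list_iter_new(PyListObject *seq)            /* hypothetical helper name */
{
    _PyListIterObject *it = _Py_FREELIST_POP(_PyListIterObject, iters);
    if (it == NULL) {
        /* Freelist empty: allocate a fresh, GC-enabled iterator. */
        it = PyObject_GC_New(_PyListIterObject, &PyListIter_Type);
        if (it == NULL) {
            return NULL;
        }
    }
    else if (Py_TYPE((PyObject *)it) != &PyListIter_Type) {
        /* The freed object may have last been a tuple or range iterator. */
        Py_SET_TYPE((PyObject *)it, &PyListIter_Type);
    }
    it->it_index = 0;
    it->it_seq = (PyListObject *)Py_NewRef((PyObject *)seq);
    _PyObject_GC_TRACK(it);
    return (PyObject *)it;
}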

@Fidget-Spinner
Member

Wow, excellent investigative work!

@eendebakpt changed the title from "Use freelist for range and iter objects" to "Use freelist for range object, iterator objects and other often used objects" on Jan 7, 2025