
Pre-allocate list based on source length #206

Merged (14 commits) on Sep 25, 2024

Conversation

atifaziz
Contributor

@atifaziz atifaziz commented Sep 17, 2024

This PR addresses issue #173.

It consolidates tuple and list construction since they shared the bulk of the implementation and both have the same semantics and signatures in CPython (and both, PyTuple_SetItem and PyList_SetItem steal references). It uses a similar approach to the one applied in PR #204. That is, the C APIs for each type are abstracted behind a new interface (with static members only):

private interface IListOrTupleBuilder
{
    static abstract nint New(nint size);
    static abstract int SetItemRaw(nint ob, nint pos, nint o);
}

Then there's an implementation for tuples (TupleBuilder) and another for lists (ListBuilder) that just delegate to the corresponding APIs. I also tried an approach with function pointers in d584188, but it didn't seem to bring any advantage and also read less clearly for now (it was reverted with d45b2a0).
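As a rough sketch, the two builders could look like this, delegating straight to the CPython C APIs (CPythonAPI.* are assumed binding names; the actual CSnakes internals may differ):

```csharp
// Hedged sketch: CPythonAPI.* are assumed names for the bindings to the
// corresponding CPython C functions (PyTuple_New, PyList_New, etc.).
private sealed class TupleBuilder : IListOrTupleBuilder
{
    public static nint New(nint size) => CPythonAPI.PyTuple_New(size);

    // PyTuple_SetItem steals the reference to `o`.
    public static int SetItemRaw(nint ob, nint pos, nint o) =>
        CPythonAPI.PyTuple_SetItem(ob, pos, o);
}

private sealed class ListBuilder : IListOrTupleBuilder
{
    public static nint New(nint size) => CPythonAPI.PyList_New(size);

    // PyList_SetItem also steals the reference to `o`.
    public static int SetItemRaw(nint ob, nint pos, nint o) =>
        CPythonAPI.PyList_SetItem(ob, pos, o);
}
```

Because both C functions steal references, the shared caller can treat success the same way for either type and only decref on failure.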

The CreateTuple and CreateList methods now just call CreateListOrTuple with the right builder type:

internal static class Pack
{
    internal static PyObject CreateTuple(Span<PyObject> items) =>
        PyObject.Create(CreateListOrTuple<TupleBuilder>(items));

    internal static PyObject CreateList(Span<PyObject> items) =>
        PyObject.Create(CreateListOrTuple<ListBuilder>(items));

    // ...

    private static nint CreateListOrTuple<TBuilder>(Span<PyObject> items)
        where TBuilder : IListOrTupleBuilder
    {
        // ...
    }
}

Internally, CreateListOrTuple optimises memory allocations by using the stack for tuples and lists with 8 items or fewer. For larger structures, it spills to an array allocated on the heap. This is done for handles (as before), but also for marshallers, via fixed-length arrays that are inlined on the stack. Spilling means that for a list of 16 elements, 8 handles and marshallers go on the stack and 8 into heap-allocated arrays. Ideally, this could be further optimised by an array pool. For now, however, the situation is better than before, when marshallers always ended up on the heap with a list + array:

List<SafeHandleMarshaller<PyObject>.ManagedToUnmanagedIn> marshallers = new(items.Length);
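The stack-or-heap decision described above can be sketched roughly as follows (the threshold constant's name is illustrative, not the one used in the PR):

```csharp
private static nint CreateListOrTuple<TBuilder>(Span<PyObject> items)
    where TBuilder : IListOrTupleBuilder
{
    const int StackThreshold = 8; // illustrative name for the stack limit

    // Small collections keep their handles on the stack; larger ones
    // spill to a heap-allocated array.
    Span<nint> handles = items.Length <= StackThreshold
        ? stackalloc nint[items.Length]
        : new nint[items.Length];

    // ... marshal each item into `handles`, then build via
    // TBuilder.New(items.Length) and TBuilder.SetItemRaw(...) ...
}
```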

Since the majority of the code is shared, any improvements to the core approach will benefit lists and tuples without additional effort.

The one potential performance regression that this PR introduces is that CreateTuple previously allocated handles on the stack for tuples up to the maximum .NET tuple size of 17:

var handles = items.Length < 18 // .NET tuples are max 17 items. This is a performance optimization.
    ? stackalloc IntPtr[items.Length]
    : new IntPtr[items.Length];

This PR uses 8 for both tuples and lists. I think tuples with 9+ elements are going to be extremely rare, so this could be a reasonable compromise, but if we want to maintain a different threshold for tuples and lists then this is something that could be addressed in the future.


Possible things to consider before publishing this draft (which could also be deferred to a future PR as improvements):

  • Consolidate duplication with Pack.CreateTuple
  • Allocate marshallers (SafeHandleMarshaller<PyObject>.ManagedToUnmanagedIn) on the stack for small lists

@atifaziz atifaziz changed the title 🚧 Pre-allocate list based on source length Pre-allocate list based on source length Sep 18, 2024
@atifaziz atifaziz marked this pull request as ready for review September 18, 2024 14:35
@tonybaloney
Owner

This PR uses 8 for both tuples and lists. I think tuples with 9+ elements are going to be extremely rare, so this could be a reasonable compromise

Agreed. There was very little thought behind the selection of 17 as the boundary, other than the .NET constraint. Also, .NET requires nesting tuples beyond a certain size (IIRC 7).

Owner

@tonybaloney tonybaloney left a comment


This is really well designed. I'll be keen to run the benchmarks on this branch as well

@tonybaloney
Owner

No regressions on the benchmarks, although I think the list one uses large lists

@atifaziz
Contributor Author

atifaziz commented Sep 19, 2024

Ideally, this could be further optimised by an array pool.

I've gone ahead and implemented this too with 97f73c0 and 46dc33b.

The array pool is used for tuples and lists of up to 100 elements, so lists/tuples within that size should be allocation-free (for handles and marshallers) once the pool is hydrated. This can be tuned later, but it should be a good starting point. There's not much to say about the implementation otherwise, except that some new types (ArrayPools, RentedArray<> and RentalState) are introduced to encapsulate the complexity, such that CreateListOrTuple remains largely the same and readable.
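A minimal sketch of the renting pattern, assuming the shared pool for brevity (the PR uses a custom-configured pool wrapped in RentedArray<> rather than ArrayPool<T>.Shared directly):

```csharp
using System.Buffers;

nint[] rented = ArrayPool<nint>.Shared.Rent(items.Length);
try
{
    // Rent can return a larger array than requested, so slice to length.
    Span<nint> handles = rented.AsSpan(0, items.Length);
    // ... populate handles and build the tuple/list ...
}
finally
{
    ArrayPool<nint>.Shared.Return(rented);
}
```

Once the pool is warm, the Rent/Return cycle allocates nothing, which is where the allocation-free numbers below come from.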

I ran the benchmarks with memory diagnostics:

diff --git a/src/Profile/MarshallingBenchmarks.cs b/src/Profile/MarshallingBenchmarks.cs
index 5394d7f..c1b47cc 100644
--- a/src/Profile/MarshallingBenchmarks.cs
+++ b/src/Profile/MarshallingBenchmarks.cs
@@ -5,6 +5,7 @@ using Microsoft.VSDiagnostics;
 namespace Profile;
 
 [CPUUsageDiagnoser]
+[MemoryDiagnoser]
 [MarkdownExporter]
 public class MarshallingBenchmarks: BaseBenchmark
 {

For main, this is what I got:

| Method | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|
| ComplexReturn | 2,041.9 ns | 40.69 ns | 66.86 ns | 0.0725 | 0.0687 | 952 B |
| ComplexReturnLazy | 1,667.9 ns | 33.38 ns | 31.22 ns | 0.0591 | 0.0572 | 744 B |
| FunctionReturnsList | 2,524.5 ns | 50.27 ns | 55.87 ns | 0.0191 | 0.0153 | 272 B |
| FunctionTakesList | 19,484.9 ns | 347.16 ns | 324.74 ns | 0.7019 | 0.6714 | 8824 B |
| FunctionTakesDictionary | 37,480.4 ns | 719.03 ns | 769.35 ns | 1.7700 | 1.7090 | 22384 B |
| FunctionReturnsDictionary | 10,556.8 ns | 169.37 ns | 158.43 ns | 0.0153 | - | 320 B |
| FunctionReturnsTuple | 804.9 ns | 14.29 ns | 13.37 ns | 0.0467 | - | 592 B |
| FunctionTakesTuple | 1,030.8 ns | 18.37 ns | 16.28 ns | 0.0648 | 0.0629 | 816 B |
| FunctionTakesValueTypes | 571.2 ns | 11.30 ns | 22.30 ns | 0.0210 | - | 272 B |
| EmptyFunction | 189.0 ns | 3.48 ns | 7.64 ns | 0.0024 | - | 32 B |

Up to 7f61cd1, before adding the array pooling optimisations, the benchmarks are:

| Method | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|
| ComplexReturn | 2,166.8 ns | 42.88 ns | 57.25 ns | 0.0687 | 0.0648 | 864 B |
| ComplexReturnLazy | 1,828.8 ns | 35.34 ns | 52.90 ns | 0.0515 | 0.0496 | 656 B |
| FunctionReturnsList | 2,492.0 ns | 42.42 ns | 39.68 ns | 0.0191 | 0.0153 | 272 B |
| FunctionTakesList | 18,754.5 ns | 363.73 ns | 521.65 ns | 0.8545 | 0.8240 | 11080 B |
| FunctionTakesDictionary | 38,709.9 ns | 726.10 ns | 993.90 ns | 1.7700 | 1.7090 | 22384 B |
| FunctionReturnsDictionary | 10,706.9 ns | 173.63 ns | 162.42 ns | 0.0153 | - | 320 B |
| FunctionReturnsTuple | 809.6 ns | 9.46 ns | 8.39 ns | 0.0467 | - | 592 B |
| FunctionTakesTuple | 1,145.9 ns | 22.69 ns | 25.22 ns | 0.0553 | 0.0534 | 696 B |
| FunctionTakesValueTypes | 566.3 ns | 10.76 ns | 12.81 ns | 0.0210 | - | 272 B |
| EmptyFunction | 187.3 ns | 3.68 ns | 4.52 ns | 0.0024 | - | 32 B |

While memory usage is the same or better in most cases, the one that stands out is FunctionTakesList, where it increases from 8,824 to 11,080 bytes (about 25% worse).

With array pooling, the benchmarks are:

| Method | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|
| ComplexReturn | 2,096.6 ns | 41.80 ns | 42.93 ns | 0.0687 | 0.0648 | 864 B |
| ComplexReturnLazy | 1,871.3 ns | 37.15 ns | 48.31 ns | 0.0515 | 0.0496 | 656 B |
| FunctionReturnsList | 2,387.1 ns | 47.65 ns | 108.53 ns | 0.0191 | 0.0153 | 272 B |
| FunctionTakesList | 18,893.7 ns | 360.33 ns | 442.52 ns | 0.7019 | 0.6714 | 8824 B |
| FunctionTakesDictionary | 39,269.9 ns | 754.89 ns | 839.05 ns | 1.7700 | 1.7090 | 22384 B |
| FunctionReturnsDictionary | 10,416.8 ns | 207.16 ns | 203.46 ns | 0.0153 | - | 320 B |
| FunctionReturnsTuple | 766.8 ns | 8.03 ns | 7.12 ns | 0.0467 | - | 592 B |
| FunctionTakesTuple | 1,203.0 ns | 23.00 ns | 22.59 ns | 0.0553 | 0.0534 | 696 B |
| FunctionTakesValueTypes | 553.1 ns | 9.34 ns | 9.59 ns | 0.0210 | - | 272 B |
| EmptyFunction | 184.7 ns | 3.39 ns | 5.17 ns | 0.0024 | - | 32 B |

Here, memory is better across the board where expected (where tuples & lists are involved).

The table below shows the overall comparison:

| Method | Δ Mean 1 | Δ Mean 2 | Δ Memory 1 | Δ Memory 2 |
|---|---|---|---|---|
| ComplexReturn | 6.1% | 2.7% | -88 | -88 |
| ComplexReturnLazy | 9.6% | 12.2% | -88 | -88 |
| FunctionReturnsList | -1.3% | -5.4% | 0 | 0 |
| FunctionTakesList | -3.7% | -3.0% | 2,256 | 0 |
| FunctionTakesDictionary | 3.3% | 4.8% | 0 | 0 |
| FunctionReturnsDictionary | 1.4% | -1.3% | 0 | 0 |
| FunctionReturnsTuple | 0.6% | -4.7% | 0 | 0 |
| FunctionTakesTuple | 11.2% | 16.7% | -120 | -120 |
| FunctionTakesValueTypes | -0.9% | -3.2% | 0 | 0 |
| EmptyFunction | -0.9% | -2.3% | 0 | 0 |

Legend:

  • Δ Mean 1 = Mean delta with main without array pooling
  • Δ Mean 2 = Mean delta with main with array pooling
  • Δ Memory 1 = Memory delta with main without array pooling
  • Δ Memory 2 = Memory delta with main with array pooling

In short, there's an approximate 10% time cost to the memory savings for this profile of benchmarks.


  • BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4169/23H2/2023Update/SunValley3)
  • 13th Gen Intel Core i7-1370P, 1 CPU, 20 logical and 14 physical cores
  • .NET SDK 8.0.400:
    • [Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
    • DefaultJob : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2

@tonybaloney
Owner

I looked at ArrayPool, but based on what I heard from @stephentoub, with only 100 items it wouldn't add a huge amount of value. Stephen, if you see this, I'd be interested in your thoughts.

That said, it might be worth mixing up the benchmarks to parameterize the number of elements in the list, e.g. FunctionTakesList[1], FunctionTakesList[100], FunctionTakesList[1000]. All are likely use cases.
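With BenchmarkDotNet, parameterizing the element count could be a [Params] attribute on the benchmark class; a sketch, with a hypothetical benchmark body (the real one lives in MarshallingBenchmarks):

```csharp
using System.Linq;
using BenchmarkDotNet.Attributes;

public class MarshallingBenchmarks : BaseBenchmark
{
    // BenchmarkDotNet runs the benchmark once per value.
    [Params(1, 100, 1000)]
    public int Length { get; set; }

    [Benchmark]
    public void FunctionTakesList()
    {
        // Hypothetical body: marshal a list of `Length` items into Python.
        var items = Enumerable.Range(0, Length).ToList();
        // ... call the Python function with `items` ...
    }
}
```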

@atifaziz
Contributor Author

atifaziz commented Sep 23, 2024

I looked at ArrayPool, but based on what I heard from @stephentoub, with only 100 items it wouldn't add a huge amount of value. Stephen, if you see this, I'd be interested in your thoughts.

The decision to use 100 in this PR was somewhat arbitrary, to be honest, and is therefore open to debate. The overall idea was to have something in place to prevent allocating arrays for small lists and tuples. I used a custom-configured ArrayPool<> to avoid reinventing the wheel. The ArrayPool<> could be replaced with simply an array in a thread-static (which I tried with 9e8f98f), eventually held behind a weak reference so it acts like a cache whose array can be evicted by the GC under memory pressure.
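The thread-static-behind-a-weak-reference idea could be sketched like this (names are illustrative, not necessarily what 9e8f98f does):

```csharp
[ThreadStatic]
private static WeakReference<nint[]>? cache;

private static nint[] GetScratchArray(int minimumLength)
{
    // Reuse the cached array if the GC hasn't collected it
    // and it's big enough for this request.
    if (cache is not null
        && cache.TryGetTarget(out var array)
        && array.Length >= minimumLength)
    {
        return array;
    }

    var fresh = new nint[minimumLength];
    cache = new WeakReference<nint[]>(fresh);
    return fresh;
}
```

Being thread-static, it needs no locking, and the weak reference lets the GC reclaim the array under memory pressure instead of pinning it for the thread's lifetime.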

That said, it might be worth mixing up the benchmarks to parameterize the number of elements in the list. i.e. FunctionTakesList[1], FunctionTakesList[100], FunctionTakesList[1000]. All are likely use cases.

Good idea!

Collaborator

@AaronRobinsonMSFT AaronRobinsonMSFT left a comment


Thanks!

@tonybaloney tonybaloney merged commit 5423d6b into tonybaloney:main Sep 25, 2024
37 checks passed
@atifaziz atifaziz deleted the prealloc-list branch September 26, 2024 06:03