
Pre-allocate list based on source length #206

Merged (14 commits) on Sep 25, 2024

Conversation

atifaziz
Contributor

@atifaziz atifaziz commented Sep 17, 2024

This PR addresses issue #173.

It consolidates tuple and list construction since they shared the bulk of the implementation and both have the same semantics and signatures in CPython (and both, PyTuple_SetItem and PyList_SetItem steal references). It uses a similar approach to the one applied in PR #204. That is, the C APIs for each type are abstracted behind a new interface (with static members only):

private interface IListOrTupleBuilder
{
    static abstract nint New(nint size);
    static abstract int SetItemRaw(nint ob, nint pos, nint o);
}

Then there's an implementation for tuples (TupleBuilder) and another for lists (ListBuilder) that just delegate to the corresponding APIs. I also tried an approach with function pointers in d584188, but it didn't seem to bring any advantage and also read less clearly for now (it was reverted with d45b2a0).
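As a rough sketch, the two builders could look like this, delegating straight to the CPython C APIs (CPythonAPI.* are assumed binding names; the actual CSnakes internals may differ):

```csharp
// Hedged sketch: CPythonAPI.* are assumed names for the bindings to the
// corresponding CPython C functions (PyTuple_New, PyList_New, etc.).
private sealed class TupleBuilder : IListOrTupleBuilder
{
    public static nint New(nint size) => CPythonAPI.PyTuple_New(size);

    // PyTuple_SetItem steals the reference to `o`.
    public static int SetItemRaw(nint ob, nint pos, nint o) =>
        CPythonAPI.PyTuple_SetItem(ob, pos, o);
}

private sealed class ListBuilder : IListOrTupleBuilder
{
    public static nint New(nint size) => CPythonAPI.PyList_New(size);

    // PyList_SetItem also steals the reference to `o`.
    public static int SetItemRaw(nint ob, nint pos, nint o) =>
        CPythonAPI.PyList_SetItem(ob, pos, o);
}
```

Because both C functions steal references, the shared caller can treat success the same way for either type and only decref on failure.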

The CreateTuple and CreateList methods now just call CreateListOrTuple with the right builder type:

internal static class Pack
{
    internal static PyObject CreateTuple(Span<PyObject> items) =>
        PyObject.Create(CreateListOrTuple<TupleBuilder>(items));

    internal static PyObject CreateList(Span<PyObject> items) =>
        PyObject.Create(CreateListOrTuple<ListBuilder>(items));

    // ...

    private static nint CreateListOrTuple<TBuilder>(Span<PyObject> items)
        where TBuilder : IListOrTupleBuilder
    {
        // ...
    }
}

Internally, CreateListOrTuple optimises memory allocations by using the stack for tuples and lists with 8 items or fewer. For larger structures, it spills to an array allocated on the heap. This is done for handles (as before), but also for marshallers, via fixed-length arrays that are inlined on the stack. Spilling means that for a list of 16 elements, 8 handles and marshallers go on the stack and 8 into heap-allocated arrays. Ideally, this could be further optimised by an array pool. For now, however, the situation is better than before, when marshallers always ended up on the heap with a list + array:

List<SafeHandleMarshaller<PyObject>.ManagedToUnmanagedIn> marshallers = new(items.Length);
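The stack-or-heap decision described above can be sketched roughly as follows (the threshold constant's name is illustrative, not the one used in the PR):

```csharp
private static nint CreateListOrTuple<TBuilder>(Span<PyObject> items)
    where TBuilder : IListOrTupleBuilder
{
    const int StackThreshold = 8; // illustrative name for the stack limit

    // Small collections keep their handles on the stack; larger ones
    // spill to a heap-allocated array.
    Span<nint> handles = items.Length <= StackThreshold
        ? stackalloc nint[items.Length]
        : new nint[items.Length];

    // ... marshal each item into `handles`, then build via
    // TBuilder.New(items.Length) and TBuilder.SetItemRaw(...) ...
}
```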

Since the majority of the code is shared, any improvements to the core approach will benefit lists and tuples without additional effort.

The one potential performance regression that this PR introduces is that CreateTuple previously allocated handles on the stack for tuples up to the maximum .NET tuple size of 17:

var handles = items.Length < 18 // .NET tuples are max 17 items. This is a performance optimization.
    ? stackalloc IntPtr[items.Length]
    : new IntPtr[items.Length];

This PR uses 8 for both tuples and lists. I think tuples with 9+ elements are going to be extremely rare, so this could be a reasonable compromise, but if we want to maintain a different threshold for tuples and lists then this is something that could be addressed in the future.


Possible things to consider before publishing this draft (which could also be deferred to a future PR as improvements):

  • Consolidate duplication with Pack.CreateTuple
  • Allocate marshallers (SafeHandleMarshaller<PyObject>.ManagedToUnmanagedIn) on the stack for small lists

@atifaziz atifaziz changed the title 🚧 Pre-allocate list based on source length Pre-allocate list based on source length Sep 18, 2024
@atifaziz atifaziz marked this pull request as ready for review September 18, 2024 14:35
@tonybaloney
Owner

This PR uses 8 for both tuples and lists. I think tuples with 9+ elements are going to be extremely rare, so this could be a reasonable compromise

Agreed. There was very little thought behind the selection of 17 as the boundary, other than the .NET constraint. Also, .NET requires nesting tuples beyond a certain size (IIRC 7).

Owner

@tonybaloney tonybaloney left a comment


This is really well designed. I'll be keen to run the benchmarks on this branch as well

@tonybaloney
Owner

No regressions on the benchmarks, although I think the list one uses large lists

@atifaziz
Contributor Author

atifaziz commented Sep 19, 2024

Ideally, this could be further optimised by an array pool.

I've gone ahead and implemented this too with 97f73c0 and 46dc33b.

The array pool is used for tuples and lists of up to 100 elements, so lists/tuples within that size should be allocation-free (for handles and marshallers) once the pool is hydrated. This can be tuned later, but it should be a good starting point. There's not much to say about the implementation otherwise, except that some new types (ArrayPools, RentedArray<> and RentalState) are introduced to encapsulate the complexity, such that CreateListOrTuple remains largely the same and readable.
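A minimal sketch of the renting pattern, assuming the shared pool for brevity (the PR uses a custom-configured pool wrapped in RentedArray<> rather than ArrayPool<T>.Shared directly):

```csharp
using System.Buffers;

nint[] rented = ArrayPool<nint>.Shared.Rent(items.Length);
try
{
    // Rent can return a larger array than requested, so slice to length.
    Span<nint> handles = rented.AsSpan(0, items.Length);
    // ... populate handles and build the tuple/list ...
}
finally
{
    ArrayPool<nint>.Shared.Return(rented);
}
```

Once the pool is warm, the Rent/Return cycle allocates nothing, which is where the allocation-free numbers below come from.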

I ran the benchmarks with memory diagnostics:

diff --git a/src/Profile/MarshallingBenchmarks.cs b/src/Profile/MarshallingBenchmarks.cs
index 5394d7f..c1b47cc 100644
--- a/src/Profile/MarshallingBenchmarks.cs
+++ b/src/Profile/MarshallingBenchmarks.cs
@@ -5,6 +5,7 @@ using Microsoft.VSDiagnostics;
 namespace Profile;
 
 [CPUUsageDiagnoser]
+[MemoryDiagnoser]
 [MarkdownExporter]
 public class MarshallingBenchmarks: BaseBenchmark
 {

For main, this is what I got:

| Method | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|
| ComplexReturn | 2,041.9 ns | 40.69 ns | 66.86 ns | 0.0725 | 0.0687 | 952 B |
| ComplexReturnLazy | 1,667.9 ns | 33.38 ns | 31.22 ns | 0.0591 | 0.0572 | 744 B |
| FunctionReturnsList | 2,524.5 ns | 50.27 ns | 55.87 ns | 0.0191 | 0.0153 | 272 B |
| FunctionTakesList | 19,484.9 ns | 347.16 ns | 324.74 ns | 0.7019 | 0.6714 | 8824 B |
| FunctionTakesDictionary | 37,480.4 ns | 719.03 ns | 769.35 ns | 1.7700 | 1.7090 | 22384 B |
| FunctionReturnsDictionary | 10,556.8 ns | 169.37 ns | 158.43 ns | 0.0153 | - | 320 B |
| FunctionReturnsTuple | 804.9 ns | 14.29 ns | 13.37 ns | 0.0467 | - | 592 B |
| FunctionTakesTuple | 1,030.8 ns | 18.37 ns | 16.28 ns | 0.0648 | 0.0629 | 816 B |
| FunctionTakesValueTypes | 571.2 ns | 11.30 ns | 22.30 ns | 0.0210 | - | 272 B |
| EmptyFunction | 189.0 ns | 3.48 ns | 7.64 ns | 0.0024 | - | 32 B |

Up to 7f61cd1, before adding the array pooling optimisations, the benchmarks are:

| Method | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|
| ComplexReturn | 2,166.8 ns | 42.88 ns | 57.25 ns | 0.0687 | 0.0648 | 864 B |
| ComplexReturnLazy | 1,828.8 ns | 35.34 ns | 52.90 ns | 0.0515 | 0.0496 | 656 B |
| FunctionReturnsList | 2,492.0 ns | 42.42 ns | 39.68 ns | 0.0191 | 0.0153 | 272 B |
| FunctionTakesList | 18,754.5 ns | 363.73 ns | 521.65 ns | 0.8545 | 0.8240 | 11080 B |
| FunctionTakesDictionary | 38,709.9 ns | 726.10 ns | 993.90 ns | 1.7700 | 1.7090 | 22384 B |
| FunctionReturnsDictionary | 10,706.9 ns | 173.63 ns | 162.42 ns | 0.0153 | - | 320 B |
| FunctionReturnsTuple | 809.6 ns | 9.46 ns | 8.39 ns | 0.0467 | - | 592 B |
| FunctionTakesTuple | 1,145.9 ns | 22.69 ns | 25.22 ns | 0.0553 | 0.0534 | 696 B |
| FunctionTakesValueTypes | 566.3 ns | 10.76 ns | 12.81 ns | 0.0210 | - | 272 B |
| EmptyFunction | 187.3 ns | 3.68 ns | 4.52 ns | 0.0024 | - | 32 B |

While memory usage is the same or better in most cases, the one that stands out is FunctionTakesList, where it increases from 8,824 to 11,080 bytes (about 25% worse).

With array pooling, the benchmarks are:

| Method | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|
| ComplexReturn | 2,096.6 ns | 41.80 ns | 42.93 ns | 0.0687 | 0.0648 | 864 B |
| ComplexReturnLazy | 1,871.3 ns | 37.15 ns | 48.31 ns | 0.0515 | 0.0496 | 656 B |
| FunctionReturnsList | 2,387.1 ns | 47.65 ns | 108.53 ns | 0.0191 | 0.0153 | 272 B |
| FunctionTakesList | 18,893.7 ns | 360.33 ns | 442.52 ns | 0.7019 | 0.6714 | 8824 B |
| FunctionTakesDictionary | 39,269.9 ns | 754.89 ns | 839.05 ns | 1.7700 | 1.7090 | 22384 B |
| FunctionReturnsDictionary | 10,416.8 ns | 207.16 ns | 203.46 ns | 0.0153 | - | 320 B |
| FunctionReturnsTuple | 766.8 ns | 8.03 ns | 7.12 ns | 0.0467 | - | 592 B |
| FunctionTakesTuple | 1,203.0 ns | 23.00 ns | 22.59 ns | 0.0553 | 0.0534 | 696 B |
| FunctionTakesValueTypes | 553.1 ns | 9.34 ns | 9.59 ns | 0.0210 | - | 272 B |
| EmptyFunction | 184.7 ns | 3.39 ns | 5.17 ns | 0.0024 | - | 32 B |

Here, memory is better across the board where expected (where tuples & lists are involved).

The table below shows the overall comparison:

| Method | Δ Mean 1 | Δ Mean 2 | Δ Memory 1 | Δ Memory 2 |
|---|---|---|---|---|
| ComplexReturn | 6.1% | 2.7% | -88 | -88 |
| ComplexReturnLazy | 9.6% | 12.2% | -88 | -88 |
| FunctionReturnsList | -1.3% | -5.4% | 0 | 0 |
| FunctionTakesList | -3.7% | -3.0% | 2,256 | 0 |
| FunctionTakesDictionary | 3.3% | 4.8% | 0 | 0 |
| FunctionReturnsDictionary | 1.4% | -1.3% | 0 | 0 |
| FunctionReturnsTuple | 0.6% | -4.7% | 0 | 0 |
| FunctionTakesTuple | 11.2% | 16.7% | -120 | -120 |
| FunctionTakesValueTypes | -0.9% | -3.2% | 0 | 0 |
| EmptyFunction | -0.9% | -2.3% | 0 | 0 |

Legend:

  • Δ Mean 1 = Mean delta with main without array pooling
  • Δ Mean 2 = Mean delta with main with array pooling
  • Δ Memory 1 = Memory delta with main without array pooling
  • Δ Memory 2 = Memory delta with main with array pooling

In short, there's an approximate 10% time cost to the memory savings for this profile of benchmarks.


  • BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4169/23H2/2023Update/SunValley3)
  • 13th Gen Intel Core i7-1370P, 1 CPU, 20 logical and 14 physical cores
  • .NET SDK 8.0.400:
    • [Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
    • DefaultJob : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2

@tonybaloney
Owner

I looked at ArrayPool, but based on what I heard from @stephentoub, with only 100 items it wouldn't add a huge amount of value. Stephen, if you see this, I'd be interested in your thoughts.

That said, it might be worth mixing up the benchmarks to parameterize the number of elements in the list, e.g. FunctionTakesList[1], FunctionTakesList[100], FunctionTakesList[1000]. All are likely use cases.
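With BenchmarkDotNet, parameterizing the element count could be a [Params] attribute on the benchmark class; a sketch, with a hypothetical benchmark body (the real one lives in MarshallingBenchmarks):

```csharp
using System.Linq;
using BenchmarkDotNet.Attributes;

public class MarshallingBenchmarks : BaseBenchmark
{
    // BenchmarkDotNet runs the benchmark once per value.
    [Params(1, 100, 1000)]
    public int Length { get; set; }

    [Benchmark]
    public void FunctionTakesList()
    {
        // Hypothetical body: marshal a list of `Length` items into Python.
        var items = Enumerable.Range(0, Length).ToList();
        // ... call the Python function with `items` ...
    }
}
```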

@atifaziz
Contributor Author

atifaziz commented Sep 23, 2024

I looked at ArrayPool, but based on what I heard from @stephentoub, with only 100 items it wouldn't add a huge amount of value. Stephen, if you see this, I'd be interested in your thoughts.

The decision to use 100 in this PR was somewhat arbitrary, to be honest, and is therefore open to debate. The overall idea was to have something in place to prevent allocating arrays for small lists and tuples. I used a custom-configured ArrayPool<> to avoid reinventing the wheel. The ArrayPool<> could be replaced with simply an array in a thread-static (which I tried with 9e8f98f), eventually held behind a weak reference so it acts like a cache whose array can be evicted by the GC under memory pressure.
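The thread-static-behind-a-weak-reference idea could be sketched like this (names are illustrative, not necessarily what 9e8f98f does):

```csharp
[ThreadStatic]
private static WeakReference<nint[]>? cache;

private static nint[] GetScratchArray(int minimumLength)
{
    // Reuse the cached array if the GC hasn't collected it
    // and it's big enough for this request.
    if (cache is not null
        && cache.TryGetTarget(out var array)
        && array.Length >= minimumLength)
    {
        return array;
    }

    var fresh = new nint[minimumLength];
    cache = new WeakReference<nint[]>(fresh);
    return fresh;
}
```

Being thread-static, it needs no locking, and the weak reference lets the GC reclaim the array under memory pressure instead of pinning it for the thread's lifetime.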

That said, it might be worth mixing up the benchmarks to parameterize the number of elements in the list. i.e. FunctionTakesList[1], FunctionTakesList[100], FunctionTakesList[1000]. All are likely use cases.

Good idea!

Collaborator

@AaronRobinsonMSFT AaronRobinsonMSFT left a comment


Thanks!

@tonybaloney tonybaloney merged commit 5423d6b into tonybaloney:main Sep 25, 2024
37 checks passed
@atifaziz atifaziz deleted the prealloc-list branch September 26, 2024 06:03