Pre-allocate list based on source length #206
Conversation
This reverts commit d584188.
Agreed. There was very little thought put into the selection of 17 as the boundary, beyond the .NET constraint. Also, .NET requires nested tuples beyond a certain size (IIRC 7).
This is really well designed. I'll be keen to run the benchmarks on this branch as well.
No regressions on the benchmarks, although I think the list one uses large lists.
I've gone ahead and implemented this too with 97f73c0 and 46dc33b. The array pool is used for tuples and lists of up to 100 elements, so lists/tuples within that size should be allocation-free (for handles and marshallers) once the pool is hydrated. This can be tuned later, but should be a good starting point. There's not much to comment on in the implementation otherwise, except that some new types are introduced to encapsulate the complexity.

I ran the benchmarks with memory diagnostics:

```diff
diff --git a/src/Profile/MarshallingBenchmarks.cs b/src/Profile/MarshallingBenchmarks.cs
index 5394d7f..c1b47cc 100644
--- a/src/Profile/MarshallingBenchmarks.cs
+++ b/src/Profile/MarshallingBenchmarks.cs
@@ -5,6 +5,7 @@ using Microsoft.VSDiagnostics;
 namespace Profile;
 
 [CPUUsageDiagnoser]
+[MemoryDiagnoser]
 [MarkdownExporter]
 public class MarshallingBenchmarks: BaseBenchmark
 {
```
Up to 7f61cd1, before adding the array-pooling optimisations, the benchmarks are:

While memory usage is the same or better in most cases, the one that stands out is …

With array pooling, the benchmarks are:
Here, memory is better across the board where expected (where tuples & lists are involved). The table below shows the overall comparison:
In short, there's an approximate 10% cost to saving the memory for this profile of benchmarks.
I looked at ArrayPool, but based on what I heard from @stephentoub, with only 100 items it wouldn't add a huge amount of value. Stephen, if you see this, I'd be interested in your thoughts. That said, it might be worth mixing up the benchmarks to parameterize the number of elements in the list, e.g. along the lines of the sketch below.
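A hypothetical illustration of that parameterization with BenchmarkDotNet; the class, the `ElementCount` values, and the `PyObject.From` call are assumptions for the sketch, not the repository's actual benchmark code:

```csharp
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;

namespace Profile;

[MemoryDiagnoser]
public class ParameterizedListBenchmarks : BaseBenchmark
{
    private List<long> input = [];

    // Sweep sizes on both sides of the stack (8) and pool (100) thresholds.
    [Params(4, 8, 9, 100, 101, 1_000)]
    public int ElementCount { get; set; }

    [GlobalSetup]
    public void Setup() =>
        input = Enumerable.Range(0, ElementCount).Select(i => (long)i).ToList();

    [Benchmark]
    public void MarshalList()
    {
        // Hypothetical call site: marshal the list into a Python object
        // through whichever conversion API the runtime exposes.
        using var pyList = PyObject.From(input);
    }
}
```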
The decision to use 100 in this PR was somewhat arbitrary, to be honest, and therefore open to debate. The overall idea was to have something in place to prevent allocating arrays for small lists and tuples. I used a custom-configured …
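For reference, a custom-configured pool along these lines can be created with `ArrayPool<T>.Create`; this is a minimal sketch assuming the 100-element threshold mentioned above, with illustrative names and bucket counts only:

```csharp
using System;
using System.Buffers;

internal static class PackPools
{
    // Buckets cover arrays up to ~100 elements. Larger Rent requests are
    // plain allocations inside the pool and are simply dropped on Return,
    // so callers can treat every size uniformly.
    public static readonly ArrayPool<IntPtr> Handles =
        ArrayPool<IntPtr>.Create(maxArrayLength: 100, maxArraysPerBucket: 16);
}
```

Usage follows the usual rent/try/finally pattern, clearing on return so stale handles don't linger in the pool:

```csharp
IntPtr[] buffer = PackPools.Handles.Rent(itemCount);
try
{
    // fill buffer[0..itemCount) and build the Python list/tuple
}
finally
{
    PackPools.Handles.Return(buffer, clearArray: true);
}
```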
Good idea!
This reverts back to commit 7f61cd1.
Thanks!
This PR addresses issue #173.
It consolidates tuple and list construction, since they shared the bulk of the implementation and both have the same semantics and signatures in CPython (both `PyTuple_SetItem` and `PyList_SetItem` steal references). It uses a similar approach to the one applied in PR #204. That is, the C APIs for each type are abstracted behind a new interface (with static members only). Then there's an implementation for tuples (`TupleBuilder`) and another for lists (`ListsBuilder`) that just delegate to the corresponding APIs; a sketch of both follows below. I also tried an approach with function pointers in d584188, but it didn't seem to bring any advantage and also read less clearly for now (it was reverted with d45b2a0).
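A minimal sketch of what that interface and the two builders might look like, assuming .NET 7+ static abstract interface members; the interface name, member signatures, and the `CPythonAPI` binding class are illustrative guesses, not necessarily the PR's exact code:

```csharp
using System;

// Abstraction over the CPython C APIs that tuples and lists share.
// Both PyTuple_SetItem and PyList_SetItem steal the item reference,
// which is what lets the two types share one construction loop.
internal interface IListOrTupleBuilder
{
    static abstract IntPtr New(nint size);
    static abstract int SetItemRaw(IntPtr container, nint pos, IntPtr item);
}

internal readonly struct TupleBuilder : IListOrTupleBuilder
{
    public static IntPtr New(nint size) => CPythonAPI.PyTuple_New(size);
    public static int SetItemRaw(IntPtr container, nint pos, IntPtr item) =>
        CPythonAPI.PyTuple_SetItem(container, pos, item);
}

internal readonly struct ListsBuilder : IListOrTupleBuilder
{
    public static IntPtr New(nint size) => CPythonAPI.PyList_New(size);
    public static int SetItemRaw(IntPtr container, nint pos, IntPtr item) =>
        CPythonAPI.PyList_SetItem(container, pos, item);
}
```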
The `CreateTuple` and `CreateList` methods now just call `CreateListOrTuple` with the right builder type, along the lines of the sketch below:
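A hedged sketch of those entry points (assuming the builder types above and a generic `CreateListOrTuple<TBuilder>`; the PR's exact signatures may differ):

```csharp
internal static partial class Pack
{
    // Both entry points funnel into the same implementation, differing
    // only in the builder supplied as a type parameter.
    internal static PyObject CreateTuple(Span<PyObject> items) =>
        CreateListOrTuple<TupleBuilder>(items);

    internal static PyObject CreateList(Span<PyObject> items) =>
        CreateListOrTuple<ListsBuilder>(items);
}
```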
Internally, `CreateListOrTuple` optimises memory allocations by using the stack for tuples and lists with 8 items or less. For larger structures, it spills to an array allocated on the heap. This is done for handles (like before), but also for marshallers, via fixed-length arrays that are inlined on the stack. Spilling means that for a list of 16 elements, 8 handles and marshallers go on the stack and 8 into heap-allocated arrays. Ideally, this could be further optimised by an array pool. For now, however, the situation is better than before, when marshallers always ended up on the heap with a list + array:

CSnakes/src/CSnakes.Runtime/Python/Pack.cs, line 15 in 3d94d12
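As a rough, self-contained illustration of that stack-then-spill pattern (the 8-item threshold comes from the description above; everything else here is invented for the example and is not the PR's Pack.cs code):

```csharp
using System;

internal static class StackSpillExample
{
    private const int StackThreshold = 8;

    // Stores one value per input item: the first 8 slots live in a
    // stackalloc'd span, and any remainder spills to a heap array,
    // mirroring how the PR describes storing handles and marshallers.
    public static long SumOfSquares(ReadOnlySpan<int> items)
    {
        Span<long> stackPart = stackalloc long[StackThreshold];
        long[]? heapPart = items.Length > StackThreshold
            ? new long[items.Length - StackThreshold]
            : null;

        for (int i = 0; i < items.Length; i++)
        {
            long square = (long)items[i] * items[i];
            if (i < StackThreshold)
                stackPart[i] = square;
            else
                heapPart![i - StackThreshold] = square;
        }

        long total = 0;
        int onStack = Math.Min(items.Length, StackThreshold);
        for (int i = 0; i < onStack; i++) total += stackPart[i];
        if (heapPart is not null)
            foreach (long v in heapPart) total += v;
        return total;
    }
}
```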
Since the majority of the code is shared, any improvements to the core approach will benefit both lists and tuples without additional effort.
The one potential performance regression that this PR introduces is that `CreateTuple` previously allocated handles on the stack for tuples up to the max size of 17:

CSnakes/src/CSnakes.Runtime/Python/Pack.cs, lines 18 to 20 in 3d94d12

This PR uses 8 for both tuples and lists. I think tuples with 9+ elements are going to be extremely rare, so this could be a reasonable compromise, but if we want to maintain a different threshold between tuples and lists then this is something that could be addressed in the future.
Possible list of things to consider before publishing this draft (but which could also be deferred to a future PR as improvements):

- `Pack.CreateTuple`: …
- Allocating the marshallers (`SafeHandleMarshaller<PyObject>.ManagedToUnmanagedIn`) on the stack for small lists