
smaller Task object #36802

Merged
JeffBezanson merged 2 commits into master from jb/smallertask on Aug 5, 2020
Conversation

JeffBezanson (Member) commented Jul 25, 2020

With this, sizeof(jl_task_t) goes from 616 bytes to 352 bytes. Who likes free performance?

master, pfib(31), 4 threads:

  1.879936 seconds (13.24 M allocations: 1.902 GiB, 41.12% gc time)

This PR:

  1.617563 seconds (13.22 M allocations: 1.499 GiB, 36.26% gc time)

master, spawn/wait test from #36699, 4 threads:

  3.090069 seconds (21.87 M allocations: 2.872 GiB, 2.63% gc time)

This PR:

  2.852884 seconds (21.90 M allocations: 2.218 GiB, 3.54% gc time)

I would consider it unlikely/inadvisable for a task to hold 29 locks 😂
I'm planning to remove a couple more redundant fields from Task, but this should be a big obvious win for now.

JeffBezanson added the performance (Must go faster) and multithreading (Base.Threads and related functionality) labels on Jul 25, 2020
// does not need to be defined until the task runs
int16_t tid;
/* for the multiqueue */
int16_t prio;
Member

Moving the fields tid and prio will sadly break our C access to the Julia kernel in GAP.jl. I'll look into adding an API that will allow us to avoid direct access to jl_task_t members, but in the meantime, could I request that these members are not moved around? This change doesn't seem to be required by this PR (and is mentioned neither in the PR description nor the commit message).

Perhaps this reordering is meant to make access to tid and prio faster by moving them to the start of the struct, but it's not clear whether that's the case, and ideally it'd be benchmarked separately anyway?

Member

It packs the struct by optimizing field alignment. Our general policy for internal changes is "caveat emptor": users who poke at internals need to use version checks.
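
To make the mechanism concrete, here is a minimal, self-contained sketch of how moving two int16_t fields into existing alignment padding shrinks a struct. The field names are borrowed from the PR, but the structs are hypothetical, not the real jl_task_t layout:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout: on a typical LP64 ABI each pointer must start on an
 * 8-byte boundary, so a lone int16_t between pointers drags 6 bytes of
 * padding along with it. */
struct task_before {
    void   *next;   /* offset 0  */
    int16_t tid;    /* offset 8, followed by 6 bytes of padding */
    void   *stkbuf; /* offset 16 */
    int16_t prio;   /* offset 24, followed by 6 bytes of tail padding */
};                  /* sizeof == 32 */

/* Same fields, reordered so the two 2-byte members share one 8-byte slot. */
struct task_after {
    void   *next;   /* offset 0  */
    void   *stkbuf; /* offset 8  */
    int16_t tid;    /* offset 16 */
    int16_t prio;   /* offset 18, then 4 bytes of tail padding */
};                  /* sizeof == 24 */

int main(void)
{
    printf("before: %zu bytes, after: %zu bytes\n",
           sizeof(struct task_before), sizeof(struct task_after));
    return 0;
}
```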

Member

First off, if that's the motivation for the change, shouldn't it still be explicitly mentioned in the commit message?

Secondly, of course I understand that this is the general rule, but given that we have no choice but to poke at this, and given that it's unclear to me how much that minor part of the commit contributes to the overall performance win, I think it's not unreasonable to at least ask whether this change could be delayed for the time being? "Just a version check" is not enough to deal with it: one needs to compute the correct offsets for all target platforms and all relevant Julia versions (here: < 1.6 and >= 1.6, so just two for now), then hardcode these in the C code accessing them. I can of course do that, but it's not quite as trivial as "use a version check" makes it sound :-).

See also discussion in PR #36064
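
For illustration, this is roughly what the "hard-coded offsets plus a version check" workaround described above ends up looking like. The offsets are invented placeholders and the helper name is made up; this is a sketch of the pattern, not code from GAP.jl:

```c
/* Sketch only: both offsets below are invented. A single compiled binary
 * that must serve several Julia versions has to select the field offset at
 * run time, because the header it was built against describes only one of
 * those layouts. */
#include <stdint.h>
#include <string.h>
#include <julia.h>

static int16_t task_tid_via_hardcoded_offsets(jl_task_t *t)
{
    size_t off = (jl_ver_major() == 1 && jl_ver_minor() < 6)
                     ? 0x120   /* hypothetical pre-1.6 offset of tid  */
                     : 0x008;  /* hypothetical post-reorder offset    */
    int16_t tid;
    memcpy(&tid, (char *)t + off, sizeof(tid));
    return tid;
}
```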

Member

> First off, if that's the motivation for the change, shouldn't it still be explicitly mentioned in the commit message?

I mean, the commit message is "Smaller task objects", which is the motivation for this. Sure, it could perhaps be even more descriptive, but it isn't like it is an unrelated change.

Member Author

> one needs to compute the correct offsets for all target platforms and all relevant Julia versions (here: < 1.6 and >= 1.6, so just two for now), then hardcode these in the C code accessing them

Why is that necessary? The natural way to solve this is to include the header file.
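
As a hedged sketch of what "include the header file" buys you: the compiler derives the offsets from julia.h, so nothing is hand-maintained, but the result is only valid for a Julia build whose struct layout matches that header (which is the point raised in the next comment). The function below is illustrative, not part of any existing API:

```c
#include <stddef.h>
#include <stdio.h>
#include <julia.h>

/* With the struct definition visible, member access and offsetof are
 * computed by the compiler from julia.h; nothing is hard-coded by hand,
 * but the values are baked in at compile time for that header's layout. */
void print_task_fields(jl_task_t *t)
{
    printf("tid offset = %zu, prio offset = %zu\n",
           offsetof(jl_task_t, tid), offsetof(jl_task_t, prio));
    printf("task tid = %d, prio = %d\n", (int)t->tid, (int)t->prio);
}
```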

Member

First off: I only now fully realized what @vchuravy pointed out, namely that this is about "packing" the struct jl_task_t. As such, it may very well be that the offset of the copy_stack member does not change, in which case my worries are for nothing and this PR actually causes no problems for us. As soon as I am at a computer again, I'll verify this.

@JeffBezanson Of course we are using the header file. But that only shows us the layout of the struct in that specific version of Julia. If the layout of the struct changes between Julia versions in a way that affects the generated machine code (e.g. because the offset of a member we access changes), then we either need to recompile for each (set of) Julia versions where the struct differs (i.e. have multiple versions of the binary), or else use tricks like hardcoding offsets (effectively bypassing the header).

For details, see #36064 and specifically #36064 (comment) (that talks about ptls->safe_restore, but the same applies to task->copy_stack).

I've also started work on #36823 in an attempt to allow us to get rid of direct access to task->copy_stack, but that's not yet ready (I hope it'll be ready in time for 1.6, we'll see).

Member

OK, so I checked on various systems, and indeed on 64-bit systems this does not change the offset of copy_stack. Since we are only planning on supporting 64-bit anyway, I withdraw my concerns and apologize for the noise.
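
A small probe along these lines, compiled against each Julia version's headers, is one way to automate such a check: diff the output across builds and the offsets either match or they don't. The member list is an assumption about the header (copy_stack is declared as a bit-field, so the surrounding plain members such as stkbuf stand in for it here):

```c
#include <stddef.h>
#include <stdio.h>
#include <julia.h>

/* Layout probe: build once per Julia version of interest and compare the
 * output. If the numbers match, a binary compiled against either header
 * reads these members at the same addresses. */
int main(void)
{
    printf("sizeof(jl_task_t)            = %zu\n", sizeof(jl_task_t));
    printf("offsetof(jl_task_t, tid)     = %zu\n", offsetof(jl_task_t, tid));
    printf("offsetof(jl_task_t, prio)    = %zu\n", offsetof(jl_task_t, prio));
    printf("offsetof(jl_task_t, stkbuf)  = %zu\n", offsetof(jl_task_t, stkbuf));
    return 0;
}
```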

Member Author

I'm glad you lucked out this time, but these offsets will probably change in other PRs. I will try to keep the offset of copy_stack if possible. But in general we do not guarantee ABI compatibility at this level between minor versions.

Member

Yeah, sure -- we initially had the plan of compiling the code once for each major Julia version for this reason. But with the advent of JLLs, it is now attractive to use those. One model there is to keep a separate version of the JLL for each Julia version, but the logistics of that are complicated. So now we are trying to support all Julia versions with a single compilation, and in the course of that, we are trying to get rid of all access to members of internal structs. One step was taken with PR #36064, and the final bit (assuming we didn't miss something...) will come with PR #36823.

- move locks array to thread-local state
- move tid,prio to occupy existing alignment padding
JeffBezanson merged commit 48602c5 into master on Aug 5, 2020
JeffBezanson deleted the jb/smallertask branch on August 5, 2020 at 20:55
simeonschaub pushed a commit to simeonschaub/julia that referenced this pull request Aug 11, 2020
int16_t tid;
// multiqueue priority
int16_t prio;

jl_ucontext_t ctx; // saved thread state
Member

We could potentially store this inside the stkbuf object too, if that's helpful for making it cheaper
