
Add ITTAPI hooks to jl_mutex_wait #49434

Merged 7 commits into master on Apr 21, 2023
Conversation

@pchintalapudi (Member)
This allows VTune to identify our jl_mutex_t locks as mutexes.
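
For readers unfamiliar with the ITT API, the sketch below shows roughly how its synchronization hooks are used to register a lock and annotate waits so that VTune classifies the object as a mutex. The demo_mutex_t type and helper names are illustrative stand-ins, not the PR's actual jl_mutex_t code.

```c
#include <ittnotify.h>   /* ITT API: __itt_sync_* annotations */
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative stand-in for jl_mutex_t. */
typedef struct {
    _Atomic(void *) owner;
} demo_mutex_t;

/* Register the lock with the profiler and tag it as a mutex,
 * so VTune shows it as a synchronization object rather than raw memory. */
static inline void demo_mutex_init(demo_mutex_t *lock, const char *name)
{
    atomic_store(&lock->owner, NULL);
    __itt_sync_create(lock, "demo_mutex_t", name, __itt_attr_mutex);
}

static inline void demo_mutex_lock(demo_mutex_t *lock, void *self)
{
    __itt_sync_prepare(lock);        /* about to wait on this object */
    void *expected = NULL;
    while (!atomic_compare_exchange_weak(&lock->owner, &expected, self))
        expected = NULL;             /* retry; a real lock would back off */
    __itt_sync_acquired(lock);       /* the wait ended with the lock held */
}

static inline void demo_mutex_unlock(demo_mutex_t *lock)
{
    __itt_sync_releasing(lock);      /* about to release the lock */
    atomic_store(&lock->owner, NULL);
}
```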

@vchuravy (Member) left a comment

@topolarity you might want to take a look at this for tracy integration

@vchuravy (Member)

@pchintalapudi can you split this into two PRs?
This one for jl_mutex_wait, and a second one where, instead of hooking the probe points, we use ittapi as a timings backend like tracy (#49140).

@vchuravy requested a review from topolarity on April 20, 2023 at 17:49
@vchuravy (Member) left a comment

@pchintalapudi Can you also share some of the screenshots we looked at earlier?

@pchintalapudi (Member Author)

Examples:

[two screenshots attached]

@pchintalapudi (Member Author)

The profiling events have been split into #49448.

@topolarity (Member)

Thanks for this work! The bulk of the changes look great to me, although I did request a bit of refactoring to better align with jl_timing_*.

Nice to see some example data already too 🙂 Out of curiosity/ignorance, are any of these mutexes considerably "hot" or are they rare enough that we essentially don't need to worry about overhead?

P.S. Can you think of any significant locks that this change leaves out? I believe I/O has non-jl_mutex_t locks, or am I mistaken?

src/threading.c Outdated

#ifdef USE_ITTAPI

#define profile_lock_init(lock, name) __itt_sync_create(lock, "jl_mutex_t", name, __itt_attr_mutex)
Member

Would it be possible to move these definitions into timing.c/.h?

Ideally this would follow the conventions established there to provide backend-specific implementations (in particular, it should be possible to turn on multiple back-end implementations at once)

Member

I'm also OK if these stay local to threading.c, but I would like to try to support multiple back-ends being enabled at once

Member Author

I believe @vchuravy preferred them as macros here specifically to avoid them showing up in the profile.

Member

Would a static inline function fix that problem?

Member

Yeah... Let's go with the static_inline
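
For illustration, a minimal sketch (assuming ITTAPI plus a placeholder second backend) of what static inline hooks that allow multiple back-ends to be enabled at once might look like. Only profile_lock_init appears in the diff above; the other hook names, PROFILE_OTHER_BACKEND, and the other_backend_* calls are hypothetical, not the code that was merged.

```c
#include <ittnotify.h>

/* Each hook is a static inline function so the annotation itself does not
 * show up as a separate frame in profiles, and each enabled backend adds
 * its own call.  PROFILE_OTHER_BACKEND / other_backend_* are placeholders
 * for a second backend such as Tracy. */

static inline void profile_lock_init(void *lock, const char *name)
{
#ifdef USE_ITTAPI
    __itt_sync_create(lock, "jl_mutex_t", name, __itt_attr_mutex);
#endif
#ifdef PROFILE_OTHER_BACKEND
    other_backend_register_lock(lock, name);   /* hypothetical */
#endif
}

static inline void profile_lock_start_wait(void *lock)
{
#ifdef USE_ITTAPI
    __itt_sync_prepare(lock);
#endif
#ifdef PROFILE_OTHER_BACKEND
    other_backend_lock_wait(lock);             /* hypothetical */
#endif
}

static inline void profile_lock_acquired(void *lock)
{
#ifdef USE_ITTAPI
    __itt_sync_acquired(lock);
#endif
#ifdef PROFILE_OTHER_BACKEND
    other_backend_lock_acquired(lock);         /* hypothetical */
#endif
}
```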

@pchintalapudi (Member Author)

I/O and task switching have uv_mutex_t locks, but VTune at least picks those up automatically by tapping pthread_mutex_wait.

@pchintalapudi (Member Author)

WRT hot locks, this workload was an intentionally bad one for the codegen lock, and I've seen the method's write lock get heavily contended when type inference runs in parallel trying to infer the same method with different arguments.

@topolarity (Member)

I think I'm less worried about contention (which I'd want the profiler to show) and more worried about hot locks that are uncontended, where the profiler overhead might be significant relative to the lock / critical section

@pchintalapudi (Member Author)

I can't think of a case where we acquire and release a lock in a hot loop, but if there was such a case I think it would be better to fix the acquire and release in the user code? We could also just insert a single trial of the cmpxchg that doesn't run the profiler to skip out on the uncontended case as well.

@topolarity (Member)

> We could also just insert a single trial of the cmpxchg that doesn't run the profiler to skip out on the uncontended case as well.

Yeah, that's a nice idea - I think there are a lot of effective tricks we could use to catch exceptional events for very hot locks.

If our existing locks aren't that hot though, let's just record all the events for now. If the overhead becomes suspect later, we can go back and try to use some tricks to reduce it.
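
A hedged sketch of the fast-path trick mentioned above: attempt the cmpxchg once with no profiler calls, and only annotate the wait when that first attempt fails. The demo_mutex_t type and hook names are illustrative, not the code that was merged.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative stand-ins for the profiler hooks discussed above. */
static inline void profile_lock_start_wait(void *lock) { (void)lock; }
static inline void profile_lock_acquired(void *lock)   { (void)lock; }

typedef struct {
    _Atomic(void *) owner;
} demo_mutex_t;

static inline void demo_mutex_lock(demo_mutex_t *lock, void *self)
{
    void *expected = NULL;
    /* Fast path: a single cmpxchg attempt with no profiler calls, so the
     * uncontended case pays no instrumentation overhead at all. */
    if (atomic_compare_exchange_strong(&lock->owner, &expected, self))
        return;

    /* Slow path: we really are waiting, so record the event. */
    profile_lock_start_wait(lock);
    do {
        expected = NULL;   /* a real lock would back off or park the task */
    } while (!atomic_compare_exchange_weak(&lock->owner, &expected, self));
    profile_lock_acquired(lock);
}
```

As noted above, the PR keeps the simpler record-everything behaviour; a trick like this would only be worth it if the instrumentation overhead ever became measurable.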

@topolarity (Member) left a comment

All good on my end 👍

@vchuravy merged commit ecc3751 into master on Apr 21, 2023
@vchuravy deleted the pc/ittapi-uds branch on April 21, 2023 at 14:54