Dirty tracking performance improvements #210

Merged: 56 commits, Jan 7, 2022

Shillaker
Collaborator

@Shillaker Shillaker commented Dec 29, 2021

This PR is the third (and final) in a series of overhauls of the threading, snapshotting and dirty tracking model.

This PR addresses the underlying issue of the performance of the dirty tracking, which is currently based on soft dirty PTEs. While it works, it has a couple of shortcomings:

  • Soft dirty PTEs can only be reset globally, creating a synchronisation point and preventing us from performing dirty tracking on more than one application at a time on each host.
  • Resetting and reading soft dirty PTEs requires writing to /proc/self/clear_refs and reading from /proc/self/pagemap respectively, which can be a drain on performance when done repeatedly in a tight loop.
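As a rough illustration of the two /proc operations mentioned above, here is a minimal sketch. It assumes a Linux kernel built with soft-dirty support; the function names (`isSoftDirty`, `clearSoftDirty`, `readSoftDirty`) are hypothetical and not taken from Faabric. The soft-dirty flag is bit 55 of each 64-bit pagemap entry, and writing `4` to clear_refs resets it for the whole process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <stdexcept>
#include <unistd.h>
#include <vector>

// Bit 55 of a 64-bit /proc/<pid>/pagemap entry is the soft-dirty flag.
constexpr uint64_t SOFT_DIRTY_BIT = 1ULL << 55;

// Hypothetical helper: test the soft-dirty flag of one pagemap entry.
inline bool isSoftDirty(uint64_t pagemapEntry)
{
    return (pagemapEntry & SOFT_DIRTY_BIT) != 0;
}

// Hypothetical helper: reset the soft-dirty flags of the whole process.
inline void clearSoftDirty()
{
    int fd = ::open("/proc/self/clear_refs", O_WRONLY);
    if (fd < 0) {
        throw std::runtime_error("Cannot open /proc/self/clear_refs");
    }
    ssize_t n = ::write(fd, "4", 1);
    ::close(fd);
    if (n != 1) {
        throw std::runtime_error("Cannot write to /proc/self/clear_refs");
    }
}

// Hypothetical helper: return one flag per page (1 = soft-dirty) for
// nPages pages starting at the page-aligned address `start`.
inline std::vector<uint8_t> readSoftDirty(const void* start, size_t nPages)
{
    long pageSize = ::sysconf(_SC_PAGESIZE);
    int fd = ::open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) {
        throw std::runtime_error("Cannot open /proc/self/pagemap");
    }

    // The pagemap holds one 8-byte entry per virtual page
    std::vector<uint64_t> entries(nPages);
    size_t bytes = nPages * sizeof(uint64_t);
    off_t offset = static_cast<off_t>(
      (reinterpret_cast<uintptr_t>(start) / pageSize) * sizeof(uint64_t));
    ssize_t n = ::pread(fd, entries.data(), bytes, offset);
    ::close(fd);
    if (n != static_cast<ssize_t>(bytes)) {
        throw std::runtime_error("Short read from /proc/self/pagemap");
    }

    std::vector<uint8_t> flags(nPages, 0);
    for (size_t i = 0; i < nPages; i++) {
        flags[i] = isSoftDirty(entries[i]) ? 1 : 0;
    }
    return flags;
}
```

Each call opens and reads or writes a /proc file, which is the per-iteration cost the second bullet refers to.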

Ideally we would do this tracking with userfaultfd write-protected pages; however, this is only available in kernels 5.7+, i.e. Ubuntu 22+, which isn't yet stable. In the interim we can use an mprotect + SIGSEGV approach, where we make pages read-only using mprotect, then catch the resulting segfault when they're written to, and mark them as dirty.
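The mprotect + SIGSEGV idea can be sketched roughly as follows. This is a simplified, single-region, process-wide version with hypothetical names (`startTracking`, `stopTracking`, `g_dirty`), not the code in this PR, which tracks per thread. It also assumes Linux, where calling mprotect from a signal handler works in practice even though POSIX does not list it as async-signal-safe.

```cpp
#include <cassert>
#include <csignal>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical global state for a single tracked region
constexpr size_t MAX_TRACKED_PAGES = 1024;
static uint8_t* g_regionStart = nullptr;
static size_t g_regionPages = 0;
static size_t g_pageSize = 0;
static volatile uint8_t g_dirty[MAX_TRACKED_PAGES];

// On a write fault inside the region: mark the page dirty and make it
// writable again, so the faulting instruction succeeds when retried.
static void onWriteFault(int, siginfo_t* info, void*)
{
    auto addr = reinterpret_cast<uintptr_t>(info->si_addr);
    auto base = reinterpret_cast<uintptr_t>(g_regionStart);
    if (addr < base || addr >= base + g_regionPages * g_pageSize) {
        // Not our fault: restore default handling so the retry crashes
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    size_t page = (addr - base) / g_pageSize;
    g_dirty[page] = 1;
    mprotect(g_regionStart + page * g_pageSize,
             g_pageSize,
             PROT_READ | PROT_WRITE);
}

// Write-protect nPages pages at the page-aligned address `start`
bool startTracking(uint8_t* start, size_t nPages)
{
    if (nPages > MAX_TRACKED_PAGES) {
        return false;
    }
    g_pageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    g_regionStart = start;
    g_regionPages = nPages;
    for (size_t i = 0; i < nPages; i++) {
        g_dirty[i] = 0;
    }

    struct sigaction sa = {};
    sa.sa_sigaction = onWriteFault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, nullptr) != 0) {
        return false;
    }
    return mprotect(start, nPages * g_pageSize, PROT_READ) == 0;
}

// Restore write access and default SIGSEGV handling
void stopTracking()
{
    mprotect(g_regionStart,
             g_regionPages * g_pageSize,
             PROT_READ | PROT_WRITE);
    signal(SIGSEGV, SIG_DFL);
}
```

Only the first write to each page faults, so the cost scales with the number of distinct pages written rather than with the number of writes.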

Although in our current use-cases the PTEs are less performant than the segfault approach, this may not be the case for all workloads, so I'd like to keep the soft PTEs approach around.

PTEs, segfaults and userfaultfd all result in a different approach to aggregating diffs for a batch of threads. PTEs are system-wide, while segfaults are handled by individual threads (so we have to introduce thread-local tracking), and userfaultfd would use a single background tracker thread per application. To abstract the boilerplate for doing this and support switching more easily, I've introduced a single DirtyTracker class that abstracts all the details, and a configuration parameter DIRTY_TRACKING_MODE which can be set to softpte or segfault for now, and uffd in future.
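To make the abstraction concrete, here is a minimal sketch of how a mode string could select a tracker implementation. The method names, the two subclasses and the `makeDirtyTracker` factory are assumptions for illustration only; the real DirtyTracker interface in this PR may differ, and the tracking bodies are left as stubs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical interface: method names are illustrative, not Faabric's API
class DirtyTracker
{
  public:
    virtual ~DirtyTracker() = default;

    virtual std::string mode() const = 0;

    virtual void startTracking(uint8_t* start, size_t len) = 0;

    // Returns one flag per page (1 = dirty)
    virtual std::vector<uint8_t> stopTracking(uint8_t* start, size_t len) = 0;
};

// Stub: would reset and read soft dirty PTEs via /proc
class SoftPteTracker final : public DirtyTracker
{
  public:
    std::string mode() const override { return "softpte"; }
    void startTracking(uint8_t*, size_t) override {}
    std::vector<uint8_t> stopTracking(uint8_t*, size_t) override { return {}; }
};

// Stub: would write-protect pages and record faults per thread
class SegfaultTracker final : public DirtyTracker
{
  public:
    std::string mode() const override { return "segfault"; }
    void startTracking(uint8_t*, size_t) override {}
    std::vector<uint8_t> stopTracking(uint8_t*, size_t) override { return {}; }
};

// Hypothetical factory keyed on the value of DIRTY_TRACKING_MODE
std::unique_ptr<DirtyTracker> makeDirtyTracker(const std::string& mode)
{
    if (mode == "softpte") {
        return std::make_unique<SoftPteTracker>();
    }
    if (mode == "segfault") {
        return std::make_unique<SegfaultTracker>();
    }
    // "uffd" is planned for the future, so it is rejected here too
    throw std::invalid_argument("Unsupported dirty tracking mode: " + mode);
}
```

Callers only depend on the base class, so adding a uffd implementation later would mean one more subclass and one more branch in the factory.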

This results in quite a few changes to the code:

  • Abstract all dirty tracking logic to an interface encapsulated in the DirtyTracker class.
  • Support both global and thread-local dirty tracking.
  • Move all logic related to handling thread snapshots, invoking threads and dirty tracking into the Executor class (previously this was scattered across Faabric and Faasm).
  • Introduce the concept of a "main thread snapshot", used by threaded applications and controlled by Faabric. Previously this snapshot had to be created and updated outside of Faabric, which was confusing and error-prone.
  • When repeatedly executing batches of threads from the same application, cache each scheduling decision and pass it as a hint to the next batch of the same size.
  • Stop using dirty page tracking to track changes made to snapshots. We can do this manually with higher precision as all changes to each snapshot now go through methods on the SnapshotData class.
  • Improve the performance of inner loops related to dirty tracking and diffing.

@Shillaker Shillaker self-assigned this Dec 29, 2021
@Shillaker Shillaker mentioned this pull request Dec 29, 2021
@Shillaker Shillaker changed the title from "Dirty tracking with userfaultfd experiment" to "Dirty tracking performance improvements" Dec 30, 2021
@Shillaker Shillaker marked this pull request as draft January 4, 2022 14:26
@Shillaker Shillaker marked this pull request as ready for review January 4, 2022 17:27
@Shillaker Shillaker marked this pull request as draft January 4, 2022 18:01
@Shillaker Shillaker marked this pull request as ready for review January 5, 2022 15:36
@Shillaker Shillaker requested a review from csegarragonz January 6, 2022 17:58
Collaborator

@csegarragonz csegarragonz left a comment

LGTM, very minor changes but will approve so that you don't have to re-request a review.

/*
* Returns a list of flags marking which bytes differ between the two arrays.
*/
std::vector<bool> diffArrays(std::span<const uint8_t> a,
Collaborator

Isn't this a bit-wise XOR?

Collaborator Author

It's currently byte-wise, not bit-wise, but this may change. I noticed that this function isn't actually used, so I should probably just delete it.

Resolved threads: src/scheduler/Executor.cpp (6), src/util/dirty.cpp (1)

uint32_t diffPageStart = 0;
bool diffInProgress = false;
for (int i = 0; i < nPages; i++) {
Collaborator

Same, isn't this essentially the same as faabric::util::getDiffRegions()?
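For context, the loop quoted above follows a common pattern: collapsing per-page dirty flags into contiguous regions. A minimal sketch of that pattern is below. The `PageRegion` struct and `mergeDirtyPages` function are hypothetical names, and this is not claimed to match either the code under review or faabric::util::getDiffRegions().

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type: a run of consecutive dirty pages
struct PageRegion
{
    uint32_t startPage;
    uint32_t nPages;
};

// Collapse per-page flags (non-zero = dirty) into contiguous regions
std::vector<PageRegion> mergeDirtyPages(const std::vector<uint8_t>& flags)
{
    std::vector<PageRegion> regions;
    uint32_t diffPageStart = 0;
    bool diffInProgress = false;

    for (uint32_t i = 0; i < flags.size(); i++) {
        bool dirty = flags[i] != 0;
        if (dirty && !diffInProgress) {
            // A new run of dirty pages starts here
            diffInProgress = true;
            diffPageStart = i;
        } else if (!dirty && diffInProgress) {
            // The current run ended on the previous page
            diffInProgress = false;
            regions.push_back({ diffPageStart, i - diffPageStart });
        }
    }

    // Close a run that extends to the last page
    if (diffInProgress) {
        uint32_t end = static_cast<uint32_t>(flags.size());
        regions.push_back({ diffPageStart, end - diffPageStart });
    }

    return regions;
}
```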

Collaborator

@eigenraven eigenraven left a comment

Just had a quick look at this, because it sounds very similar to what I'm doing to reduce the size of incremental snapshots for offloading in my project. I'd suggest not creating byte-wise diffs, as that's pretty costly performance-wise; you can play around with variants in the quickbench link I put in a comment. The code in faabric/util/delta uses a configurable "page" size (any difference in a page = emit a diff for the whole page). In my testing, a size of around 64 yielded the best performance without making the diffs more than a few percent larger; you might want to test this for your applications. The code could be modified relatively easily to take in an array of modified OS pages and skip known-unmodified pages. I'm happy to do this if you're interested, as it's on my todo list anyway, and then you could use the generate/applyDelta functions.
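The chunk-granularity idea can be sketched as below. This is an illustration only, not the code in faabric/util/delta: `ChunkDiff` and `diffChunks` are hypothetical names, and the default of 64 is assumed to be a size in bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical type: a byte range that differs between the two buffers
struct ChunkDiff
{
    size_t offset;
    size_t length;
};

// Compare two equally sized buffers chunk by chunk. Any difference in a
// chunk emits a diff covering the whole chunk.
std::vector<ChunkDiff> diffChunks(const uint8_t* a,
                                  const uint8_t* b,
                                  size_t len,
                                  size_t chunkSize = 64)
{
    if (chunkSize == 0) {
        throw std::invalid_argument("chunkSize must be non-zero");
    }

    std::vector<ChunkDiff> diffs;
    for (size_t offset = 0; offset < len; offset += chunkSize) {
        // The last chunk may be shorter than chunkSize
        size_t n = std::min(chunkSize, len - offset);
        if (std::memcmp(a + offset, b + offset, n) != 0) {
            diffs.push_back({ offset, n });
        }
    }
    return diffs;
}
```

A single memcmp per chunk replaces per-byte bookkeeping, at the cost of sending some unchanged bytes within each modified chunk.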


std::vector<bool> diffs(a.size(), false);
for (int i = 0; i < a.size(); i++) {
diffs[i] = a.data()[i] != b.data()[i];
Collaborator

This might be very slow due to the vector<bool> specialization. Just swapping the type to uint8_t (at the cost of 8x more memory) makes it 11x faster: https://quick-bench.com/q/d8INA76hIDM3m4gfC4xJKI4HcQM
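The suggested type swap would look roughly like this. `diffArraysBytes` is a hypothetical name, and the sketch takes std::vector inputs rather than the std::span used in the code under review. Each flag occupies a full byte, so writes avoid the bit packing that vector<bool> performs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Returns one byte per input byte: 1 where the arrays differ, 0 otherwise
std::vector<uint8_t> diffArraysBytes(const std::vector<uint8_t>& a,
                                     const std::vector<uint8_t>& b)
{
    if (a.size() != b.size()) {
        throw std::invalid_argument("Arrays must be the same size");
    }

    std::vector<uint8_t> diffs(a.size(), 0);
    for (size_t i = 0; i < a.size(); i++) {
        diffs[i] = (a[i] != b[i]) ? 1 : 0;
    }
    return diffs;
}
```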

Resolved thread: src/util/dirty.cpp
@Shillaker
Collaborator Author

I'm going to merge this and take discussions of improvements and tweaks offline.

@Shillaker Shillaker merged commit b3229aa into master Jan 7, 2022
@Shillaker Shillaker deleted the userfault branch January 7, 2022 12:02
3 participants