
Add file read ahead through AsyncDataCache #3389

Closed
wants to merge 1 commit into from

Conversation

@oerling (Contributor) commented Dec 1, 2022

Adds a code sample and supporting functions for using CachedBufferedInput for smart prefetching of sequential files. A SeekableInputStream is registered for each file of interest. Each time the read proceeds past a given fraction of the current load quantum, the load of the next quantum is scheduled. The load prefetches into AsyncDataCache, where the data will be found when the read moves past the end of the current quantum. Prefetch fails silently if there is no memory or if more than half the cache is taken by prefetched data that has not yet been accessed.

Adds a special StreamIdentifier to denote expected sequential access. When enqueue is called with no StreamIdentifier, the default is to preload the whole enqueued range.

Adds a mode where each visited cache entry is made evictable immediately after unpinning. This avoids polluting the cache with large one-time sequential accesses.

Adds a test that simulates a multi-file merge. Each thread has 100 files and consumes a chunk of each in turn.

@netlify netlify bot commented Dec 1, 2022

Deploy Preview for meta-velox canceled.

Latest commit: d0b326b
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/638a4655461f0a000b54a8b6

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Dec 1, 2022
@oerling oerling requested a review from Yuhta December 1, 2022 04:27
@facebook-github-bot
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@Yuhta Yuhta left a comment


There are some suggestions regarding removing doneIndices entries from allCoalescedLoads_. Also, the new test is not passing TSAN.

auto nextQuantum = position_ - offsetInQuantum + loadQuantum_;
auto prefetchThreshold = loadQuantum_ * prefetchPct_ / 100;
if (!prefetchStarted_ && offsetInQuantum + *size > prefetchThreshold &&
    position_ - offsetInQuantum + loadQuantum_ < region_.length) {

Should this be nextQuantum < region_.length? The expression recomputes the value already assigned to nextQuantum above.

// Remove the loads that were done. There can be done loads if the same
// CachedBufferedInput has multiple cycles of enqueues and loads.
for (int32_t i = doneIndices.size() - 1; i >= 0; --i) {
  allCoalescedLoads_.erase(allCoalescedLoads_.begin() + doneIndices[i]);

This should be outside the loop over allCoalescedLoads_.

Also, this is more efficient if allCoalescedLoads_ is large:

std::swap(allCoalescedLoads_[doneIndices[i]], allCoalescedLoads_.back());
allCoalescedLoads_.pop_back();

@oerling oerling force-pushed the cache-ra-pr branch 2 times, most recently from c1c8bb0 to fb50810 Compare December 1, 2022 21:42

oerling pushed a commit to oerling/velox-1 that referenced this pull request Dec 1, 2022
Pull Request resolved: facebookincubator#3389

Reviewed By: Yuhta

Differential Revision: D41642368

Pulled By: oerling

fbshipit-source-id: 3272aca53cc3beb7b8a5489418802d552b92441e
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D41642368

