Add Layer::overlaps method and use it in count_deltas to avoid unnecessary image layer generation #3348
Conversation
For pgbench -i 1000 with a subsequent 1000 seconds of updates, the difference in storage size is not so large:

main: 22GB

It is not so easy to reproduce the behavior reported in #2948 (where the number of image layers exceeds the number of delta layers). update.sql is the following:
Now we can pile up an unlimited number of deltas. The last-LSN cache will help avoid false positives and locate the first relevant layer, but it won't help find the second relevant layer. It's a solvable layer map problem with some effort, but IMO the bigger problem is if we need to download any of these layers on demand. It doesn't feel good to have no bound on how many layers need to be downloaded to serve a get_page if the layers are not on the pageserver. This was the whole point of compaction from the start: to make on-demand download feasible. If we don't care about this we shouldn't be creating L1 layers at all, just use L0.
Yes, I also thought about it.
I tried my method of changing the definition of L1 to mean "sufficiently dense layer", such that those 2 out of 10 sparse layers would get re-compacted. I somewhat prefer this approach to adding L2 layers (which was probably the original plan) because at least the L1 layers that come out dense from the start don't need to be compacted. It was a 2-line change in
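For illustration, a minimal sketch of what such a density check could look like (hypothetical names and thresholds, not the actual two-line patch in the linked branch):

```rust
/// Hypothetical sketch of a density-based L1 test.
/// A layer counts as L1 ("sufficiently dense") only if it carries enough
/// bytes per unit of key space it spans; sparse layers stay eligible for
/// re-compaction.
fn is_sufficiently_dense(file_size_bytes: u64, key_range_len: u64, min_bytes_per_key: f64) -> bool {
    // Bytes of data per unit of key space covered by this layer.
    let density = file_size_bytes as f64 / key_range_len.max(1) as f64;
    density >= min_bytes_per_key
}
```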
Here's a branch if you'd like to help debug. I'm getting

https://github.com/neondatabase/neon/compare/density-based-l0?expand=1 (after the bug is found I'll have to tweak the threshold and then we can see if it's effective)
I do not think that this approach will really work. Also, as I found myself with the PR trying to limit the layer's key range, it is hard, if at all possible, to find a reasonable key range threshold. There can be just one big relation (i.e. 1TB), but a layer's key range will be just 256MB (so not satisfying your criterion for a sparse layer), while it can contain just one page of this relation. On the other side, there can be thousands of relatively small relations, and splitting them into separate layers will lead to a large number of very small layers, which is also not good. We discussed these problems yesterday with @hlinnaka and @shanyp. So, summarizing all of the above: right now we can propose four different approaches for solving the problems with image layers:
Here's the effect on pgbench 10GB: it's not the best workload to test this on, because I don't think this change has much effect on the random updates phase (top of the diagram). But at least we can see that layers are being created later than usual, and all at the same time. So the two long L1s at the bottom probably have large holes and are not affecting the image heuristic. I don't know if there's a better way to test this than to deploy and see what happens. The cost is not that big (only log_2(10) increased pressure on the new layer map; not sure about the current layer map).

Off topic: I don't know how to interpret all the layers on the right end of the key space. Something changed on the main branch since fe8cef3. See for example #2995 (comment), which is a PR branched from an older version of main and doesn't have those layers. I also see some 24KB L1 deltas now. There might be some regression on main.
What would be the worst case workload to demonstrate the problem?
Worst case is probably synthetic uniform random updates. The top 10 holes wouldn't take up a big percentage of the range. But this seems adversarial, and in real workloads I'd expect that the top 10 holes would take up a significant percentage of the range.
I meant, worst case to demonstrate the original problem that this fixes.
The bigger the database, the worse it is. If we have a 1TB database we will spontaneously create 1TB of images once in a while even if the compute is doing nothing but vacuum. We routinely see databases with 5-10x their size in useless images. It's hard to reproduce this because of nondeterminism. But generally, because we do a full reimage of the database once we get 3 batches of L1s that cover the entire key space, and most L1 batches cover the entire keyspace, 30-50GB of WAL is (my guess) enough to trigger reimaging. But it can also trigger just from the passage of time: we can periodically flush tiny L0s, then compact them into tiny L1s, and reimage. We've also seen this in prod.
We could write a binary that we execute against a timeline on prod to download layer metadata: layer range + 100 largest holes. This gives us enough information to run the reimaging simulation locally.

Yes, all this testing is extra effort compared to just merging this, waiting a week and then looking at prod metrics, but IMO that local reimaging simulation would make a nice unit test anyway. We should be testing compaction/reimaging/gc/branching/upload/download in unit tests, using just metadata. There's an old issue for this assigned to me that I somehow got distracted from :) #2031
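A rough sketch of the hole-extraction step such a binary could use (assumptions: keys reduced to u64 for illustration; the real pageserver Key type and layer metadata APIs differ):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Given the start keys of entries present in a layer, keep the k largest
/// gaps ("holes") between consecutive keys.
fn largest_holes(mut keys: Vec<u64>, k: usize) -> Vec<(u64, u64)> {
    keys.sort_unstable();
    // Min-heap on hole size: once it holds k holes, the smallest is evicted.
    let mut heap: BinaryHeap<Reverse<(u64, u64, u64)>> = BinaryHeap::new();
    for w in keys.windows(2) {
        let (start, end) = (w[0], w[1]);
        heap.push(Reverse((end - start, start, end)));
        if heap.len() > k {
            heap.pop(); // drop the smallest hole kept so far
        }
    }
    heap.into_iter().map(|Reverse((_, s, e))| (s, e)).collect()
}
```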
I have implemented the proposed binary.

So it looks like this hole optimization should really help to reduce the number of extra image layers.
I left a few nits.
I'm curious about results from at least one of these databases: https://neondb.slack.com/archives/C03UVBDM40J/p1674749418292749
pageserver/src/tenant/layer_map.rs
let image_exact_match = img_lsn + 1 == end_lsn;
if image_is_newer || image_exact_match {

pub fn search(&self, key: Key, mut end_lsn: Lsn) -> Option<SearchResult<L>> {
    loop {
It's more efficient to change the insert method instead of search (insert the occupied ranges separately). Maybe leave a TODO if that's not planned for this PR.
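A sketch of what "insert the occupied ranges separately" could mean (hypothetical types; the real layer map uses a persistent coverage structure rather than a flat Vec):

```rust
type LayerId = usize;

/// Hypothetical: register one coverage entry per occupied sub-range of the
/// layer (its key range minus the holes), so that search stays a plain
/// lookup with no hole filtering at all.
fn insert_layer(
    coverage: &mut Vec<(std::ops::Range<u64>, LayerId)>,
    layer: LayerId,
    occupied: &[std::ops::Range<u64>],
) {
    for r in occupied {
        coverage.push((r.clone(), layer));
    }
}
```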
Will it be OK to have multiple references to the same layers?
Can it cause returning the same layer twice when iterating through layers?
When iterating the coverage it's good to return the same layer multiple times. But iter_historic should return each layer only once.
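In other words (a toy illustration, not the real iterator): coverage lookups may surface one layer several times, so a historic iterator has to dedup, e.g. by layer id:

```rust
use std::collections::HashSet;

/// Toy illustration: yield each layer id only once, even if the coverage
/// references it for several disjoint key ranges.
fn iter_unique(coverage_hits: impl IntoIterator<Item = u64>) -> impl Iterator<Item = u64> {
    let mut seen = HashSet::new();
    coverage_hits.into_iter().filter(move |id| seen.insert(*id))
}
```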
pageserver/src/tenant/layer_map.rs
@@ -417,18 +433,21 @@ where
     /// TODO The optimal number should probably be slightly higher than 1, but to
     /// implement that we need to plumb a lot more context into this function
     /// than just the current partition_range.
-    pub fn is_reimage_worthy(layer: &L, partition_range: &Range<Key>) -> bool {
+    pub fn is_reimage_worthy(layer: &L, partition_range: &Range<Key>) -> Result<bool> {
nit: It never returns Err. Is there a reason for this change?
overlaps may return an error.
Just so the comment doesn't get lost in history: these projects have the potential for more than 50% of images disappearing, and I'm wondering if that will happen.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
A quick comment while trying to understand the on-demand calculation of holes without actually holding any locks: are there intentional changes for vendor/postgres-v1{4,5} as well? I cannot see how these changes could require postgres changes.
anyhow::bail!("replacing downloaded layer into layermap failed because layer was not found");
}
Replacement::RemovalBuffered => {
    unreachable!("current implementation does not remove anything")
This cannot be pulled into replace_historic_noflush, because this was true only for the one callsite.
But I didn't find any other call where Replacement::RemovalBuffered is handled differently, did you?
The buffered updates API allows you to do remove_historic and replace, which would make this path viable.
Layers are read-only. So why do we need any locks here?
Sorry, they are completely unrelated.
## Describe your changes

This is yet another attempt to address the problem of storage size ballooning (#2948).

The previous PR #3348 tried to address this problem by maintaining a list of holes for each layer. The problem with that approach is that we have to load all layers on pageserver start; lazy loading of layers is no longer possible.

This PR instead collects information about the N largest holes at compaction time and excludes those holes from the produced layers. It can cause the generation of a larger number of layers (up to 2x) and produce small layers, but it requires minimal changes in the code and doesn't affect the storage format. For a graphical explanation please see the thread: #3597 (comment)

## Issue ticket number and link

#2948 #3348

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
- [ ] If this PR requires a public announcement, mark it with the /release-notes label and add several sentences in this section.
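A minimal sketch of that compaction-time idea (assumptions: u64 keys, and holes given sorted and non-overlapping; the real code works on the pageserver's Key ranges):

```rust
/// Split a layer's key range so that the produced layers skip the given
/// holes. `holes` must be sorted and non-overlapping.
fn split_at_holes(range: (u64, u64), holes: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let (mut cursor, end) = range;
    let mut pieces = Vec::new();
    for &(hole_start, hole_end) in holes {
        // Emit the occupied stretch before this hole, if any.
        if hole_start > cursor {
            pieces.push((cursor, hole_start));
        }
        cursor = cursor.max(hole_end);
    }
    // Emit the tail after the last hole.
    if cursor < end {
        pieces.push((cursor, end));
    }
    pieces
}
```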
Refer to #2948.