assertion failed: self.historic_layers.remove(&LayerRTreeObject::new(layer)).is_some() #3387
Comments
Most probably this assertion doesn't play well with on-demand download |
But it seems to me that even with on-demand download, the layer map at the pageserver should contain all layers, shouldn't it? |
It should. The error happens when we've downloaded a layer and are trying to remove the existing remote layer and replace it with the downloaded one. Looking further |
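For readers unfamiliar with the failing step, here is a minimal sketch of the "remove the remote entry, insert the downloaded one" swap described above. It is not the pageserver's code: `Layer`, `LayerMap`, and `replace` are hypothetical stand-ins, and a plain `HashMap` is assumed in place of the R-tree behind `LayerRTreeObject`. The one difference from the asserting version in the issue title is that a missing entry is reported as an error instead of panicking the task.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a layer entry (not the pageserver's type).
#[derive(Debug, Clone, PartialEq)]
struct Layer {
    name: String,
    remote: bool,
}

/// Hypothetical layer map keyed by layer name.
#[derive(Default)]
struct LayerMap {
    historic: HashMap<String, Layer>,
}

impl LayerMap {
    fn insert(&mut self, layer: Layer) {
        self.historic.insert(layer.name.clone(), layer);
    }

    /// Swap `old` for `new`. If `old` is not present exactly as given,
    /// return an error and leave the map untouched.
    fn replace(&mut self, old: &Layer, new: Layer) -> Result<(), String> {
        match self.historic.get(&old.name) {
            Some(found) if found == old => {}
            _ => return Err(format!("layer {} not found in layer map", old.name)),
        }
        self.historic.remove(&old.name);
        self.insert(new);
        Ok(())
    }
}
```

With this shape, a lookup that unexpectedly misses surfaces as an `Err` for one download rather than as a panic.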
This is the sequence:
|
I have a hypothesis. In |
Update: per sync with @shanyp and @problame, the reason why this led to unavailability is the incorrectly set
The above hypothesis still looks valid to me. @hlinnaka, wdyt? Do you think there can be other reasons for this behavior? We need to decide what the proper fix would be. Available options:
The idea is not to rush the fix and to do it properly. In the meantime I'll use a workaround for migration (the download-all-layers script by @problame) |
Correct me if I misremember, but we discussed that we should first try to figure out how to reproduce this, then move on to ignoring/fixing it. Triple-checked the logs; there's nothing there to support the above hypothesis for today's case (assuming there was nothing for the earlier one). Doesn't reproduce locally with:
Did manage to reproduce "compute refuses to start with the sent tarball", but didn't record it. This was on a local dev build with incremental compilation (many cgus => better chance of hitting Arc::ptr_eq). |
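On the Arc::ptr_eq remark: for trait objects, a pointer carries both a data address and a vtable address, and the standard library's pointer-comparison docs warn that vtables may be duplicated (for example across codegen units), so two pointers to the same object can have different vtable parts. Whether `Arc::ptr_eq` itself takes the vtable into account depends on the toolchain version. The sketch below shows one way to compare only the data address. The `Layer` trait, `RemoteLayer`, and `same_allocation` are hypothetical names for illustration, not the pageserver's.

```rust
use std::sync::Arc;

/// Hypothetical trait standing in for a layer trait object.
trait Layer {
    fn name(&self) -> String;
}

struct RemoteLayer(String);

impl Layer for RemoteLayer {
    fn name(&self) -> String {
        self.0.clone()
    }
}

/// Compare two trait-object Arcs by data address only.
/// Casting the fat pointer to `*const ()` drops the vtable part,
/// so duplicated vtables cannot influence the result.
fn same_allocation(a: &Arc<dyn Layer>, b: &Arc<dyn Layer>) -> bool {
    let pa = Arc::as_ptr(a) as *const ();
    let pb = Arc::as_ptr(b) as *const ();
    pa == pb
}
```

A clone of an `Arc` compares equal to the original under `same_allocation`, while a separately allocated layer with the same name does not.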
This comment was marked as outdated.
@problame suggested I use a failpoint to lengthen the logical_size_calculation so that it covers gc, but I don't think that would be enough, even if layers were removed, because we didn't see the affected layer being removed anywhere. I am thinking of the following solutions:
|
Don't want to edit #3387 (comment) completely, so adding fixes here: Looking now, it seems that
Tenant::gather_size_inputs calls Tenant::refresh_gc_info first and only later starts to calculate logical sizes, which explains why there is no
So to clarify, logical size calculation was not running throughout the gc's. Marking my previous comment outdated. Will now inspect the pitr calculation to see if there's anything obviously wrong there. |
This comment was marked as outdated.
Reproduced the layer download itself with the release-2722 binary (sha256:
[tenant_config]
pitr_interval = "713007s"
gc_horizon = 87772040
Where pitr interval is near |
As later found out in #3589 we always fail with "cannot iterate a remote layer" for |
Let's close this issue because the code is gone? |
The related problematic/missed log line has now changed after #3664, as in #3431 (comment). I think this could be closed, unless we still want to try to reproduce it with the old binary. To recap, since then or inspired by this issue we've:
|
Let's close this one
Steps to reproduce
Investigating.
Expected result
No panics
Actual result
One task panicked. Additionally this caused problems for other tenants. Other tenants became unavailable.
Part of the log containing stacktrace:
Environment
prod
Logs, links