[RFC] Simplified Snapshot v2 with Timestamp Pinning in Remote Store #15057
Comments
Nice proposal @sachinpkale, the timestamp pinning approach sounds much better than the locking mechanism we have today for shallow snapshots. Couple of questions:
@sachinpkale thanks for the RFC, I think I got the idea but have a question (my apologies if I am missing something): where do the timestamps (or epochs as you refer to them) come from? UPD: really sorry for the timing, but this is the same question @linuxpi is asking (one of them).
Thanks for the review @linuxpi and @reta
Initially, index level snapshots will not be supported for snapshots that use pinned timestamps. Index level restore will be supported in the same way it works today. I haven't given a lot of thought to how to support index level snapshots, but the format of pinned timestamps would need to be changed in the way you suggested.
Good question. Timestamps on different servers in a cluster need not be exactly the same (explained in the next para), but yes, users need to make sure that the difference is not very high. Existing cloud services promise microsecond-level accuracy (example: https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/).
Why don't we need timestamps to be synchronised on different nodes in the cluster?
Yes. In remote backed storage, we purge remote translog on refresh. This means, we will be holding translog data since last refresh in the remote store.
Timestamp Pinning would be owned by remote backed storage. Snapshot would be one of the users of it. Initially, only snapshot would be pinning the timestamp but we plan to expose an API if required.
Also, how snapshot will pin the timestamp will be covered in another RFC.
@sachinpkale Thanks for the RFC. Looking forward to lower level information regarding the garbage cleanup for pinned timestamp information during failure scenarios.
Thanks for the RFC @sachinpkale. Couple of questions.
Thanks for the review @backslasht
With remote store, we retain remote translog since last refresh. So, in this case, if we pin timestamp at
We will be retaining translog data since last refresh for a given snapshot.
We will be supporting lock based snapshots at least in 2.x to retain backwards compatibility. We can think of deprecating it as part of 3.x.
Thanks @sachinpkale for this detailed RFC. Very excited to see this. With this feature, we will be very close to supporting PITR. I have the following comments/queries:
Why is this the case? Are we saying we will mark a snapshot as failed if the snapshot metadata of any index of that snapshot fails?
Looks like we support this pinning for segment data and translog data. Since we support capturing a cluster state snapshot as well, do we plan to do something similar for remote cluster state in the future?
Not totally related (or maybe it will be discussed as part of the design), but another issue we had with shallow snapshots was that once an index is deleted, the snapshot layer had to take care of remote store cleanup. With this approach, I can see that we do not need any direct communication between the snapshot layer and remote store. So are we planning to introduce some other cluster level garbage collector that would take care of pinned metadata cleanup after the index is deleted?
Yes, we plan to support pinning of cluster state as well.
We still need the same cleanup approach as of today.
Goal
Today, for clusters with the remote backed storage feature, we use a variant of snapshot called shallow snapshot. Shallow snapshots refer to data that is already uploaded as part of remote store. To prevent deletion of data in remote store that is referred to by shallow snapshots, we need a locking mechanism that is used by remote store garbage collection. In this RFC, we discuss the current locking mechanism and its shortcomings, and propose a new mechanism that scales independently of the number of shards/indices/nodes in the cluster. We also discuss how this new approach can be evolved into PITR (point-in-time restore).
Current Locking Mechanism
<metadata_filename>__<snapshot_id> file under lock directory in remote store.
Sequence Diagram
Issues with Current Locking Mechanism
lightweight shallow snapshots becomes bulky.
Requirements
Timestamp Based Implicit Locking
In this approach, we will move away from explicit lock file creation for a given metadata file. Instead, we will use the timestamp in the metadata filename to acquire an implicit lock (refer to the Metadata Filename Format section in the Appendix for more details on the metadata filename). We call it Timestamp Pinning.
Proposed Pinned Timestamp Format
Approach
We maintain a list of pinned timestamps at the cluster level. For each timestamp in this list, garbage collection for segments as well as translog will skip deletion of the metadata file that matches (Appendix: Metadata file matching a timestamp) the pinned timestamp. To avoid triggering flush/refresh on each shard and handling potential failures, in this approach we make the translog garbage collector aware of snapshot locks.
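The skip rule in this paragraph can be sketched as follows. This is a minimal illustration in Python, not the remote store implementation: the function names, the use of sortable timestamp strings in place of full metadata filenames, and the `keep_latest` parameter are all assumptions made for the example. A metadata file is treated as matching a pinned timestamp when it has the largest timestamp that is at most the pinned timestamp, as defined in the Appendix.

```python
from bisect import bisect_right


def matching_metadata(metadata_timestamps, pinned_timestamp):
    """Return the largest metadata timestamp <= pinned_timestamp, or None."""
    ordered = sorted(metadata_timestamps)
    idx = bisect_right(ordered, pinned_timestamp)
    return ordered[idx - 1] if idx > 0 else None


def deletable_metadata(metadata_timestamps, pinned_timestamps, keep_latest=1):
    """List the metadata timestamps that garbage collection may delete.

    keep_latest is a hypothetical knob: the newest N files are always kept.
    """
    ordered = sorted(metadata_timestamps)
    pinned = {matching_metadata(ordered, t) for t in pinned_timestamps}
    pinned.discard(None)  # a pin older than every file matches nothing
    latest = set(ordered[-keep_latest:]) if keep_latest > 0 else set()
    return [ts for ts in ordered if ts not in pinned and ts not in latest]


# Timestamps from the Appendix example; YYYY_MM_DD_HH_MM_SS strings sort chronologically.
files = [
    "2024_07_05_16_05_51",
    "2024_07_05_16_25_34",
    "2024_07_05_16_56_47",
    "2024_07_05_16_58_21",
    "2024_07_05_16_59_35",
    "2024_07_05_17_00_09",
    "2024_07_05_17_45_12",
]
pins = ["2024_07_05_17_00_00"]
print(matching_metadata(files, pins[0]))
print(deletable_metadata(files, pins))
```

With the single pin at 17:00:00, only the 16:59:35 file is protected by the pin; the newest file survives because of the assumed `keep_latest=1`, and the rest are eligible for deletion. The cost of the check depends on the number of pinned timestamps, not on the number of snapshots or lock files.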
Steps
remote_store_pinned_timestamps
remote_store_pinned_timestamps is > X mins, skip garbage collection.
md1 is > pinned_timestamp_a and the timestamp of next metadata file md2 <= pinned_timestamp_a, add md2 to pinned_metadata_files
pinned_metadata_files and corresponding data files
timestamp to restore data to.
Sequence Diagram
Pros
Cons
Extending the approach to PITR
As this approach uses timestamp based pinning, it can be extended to point-in-time restore. As pinning timestamp does not involve multiple remote store or node-node calls, we can support timestamp pinning at lower granularity.
To avoid the synchronisation delay between pinning and communicating it to data node, in PITR, we can provide capability of fixed intervals. With this, we can support PITR granularity as low as 1 minute (we need to control retention based on granularity). Pinning the timestamp can still be supported for on-demand cases.
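One way to picture fixed-interval pinning with retention is the sketch below. It is only an illustration under stated assumptions: the function name `interval_pins`, the use of epoch seconds, and the rule that pins fall on multiples of the granularity inside a retention window are hypothetical and are not specified by this RFC.

```python
def interval_pins(now_s, granularity_s, retention_s):
    """Return the timestamps implicitly pinned at fixed intervals.

    Pins are assumed to sit on multiples of granularity_s (epoch seconds)
    and to expire once they are older than retention_s.
    """
    if granularity_s <= 0 or retention_s < 0:
        raise ValueError("granularity must be positive and retention non-negative")
    oldest = now_s - retention_s
    # Round the oldest allowed time up, and the current time down, to an interval boundary.
    first = -(-oldest // granularity_s) * granularity_s
    last = (now_s // granularity_s) * granularity_s
    return list(range(first, last + 1, granularity_s))


# 1-minute granularity with a 5-minute retention window.
print(interval_pins(now_s=1000, granularity_s=60, retention_s=300))
```

Because every data node can compute the same list locally, no pin has to be communicated before garbage collection runs. The number of retained pins is roughly retention divided by granularity, which is why retention has to be controlled as the granularity gets finer.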
Appendix
Metadata file matching a timestamp
A metadata file matches a timestamp T, if it has the max timestamp among all metadata files with timestamp at most T.
For example, if T is 2024/07/05 17:00:00 and we have following options:
metadata_2024_07_05_16_05_51
metadata_2024_07_05_16_25_34
metadata_2024_07_05_16_56_47
metadata_2024_07_05_16_58_21
metadata_2024_07_05_16_59_35
metadata_2024_07_05_17_00_09
metadata_2024_07_05_17_45_12
metadata_2024_07_05_16_59_35 is considered as the metadata file that matches given timestamp T.
Metadata Filename Format
Remote Segment Store
metadata__<Inverted Primary Term>__<Inverted Commit Generation>__<Inverted Translog Generation>__<Inverted Refresh Counter>__<Node ID>__<Inverted EPOCH>__<Metadata Version>
metadata__9223372036854775806__9223372036854775796__9223372036854775647__9223372036854775883__-396831118__9223370334830299234__1
Remote Translog
metadata__<Inverted Primary Term>__<Inverted Translog Generation>__<Inverted EPOCH>__<Node ID>__<Metadata Version>
metadata__9223372036854775806__9223372036854775648__9223370334830643807__-396831118__1
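The filename formats above can be decoded with a short sketch like the one below. It assumes that "inverted" means the maximum signed 64-bit value minus the original value; this is an assumption, although it is consistent with the samples (the inverted primary term 9223372036854775806 decodes to 1). The helper names and field labels are made up for the example.

```python
LONG_MAX = 2**63 - 1  # assumed base for the "inverted" fields

TRANSLOG_FIELDS = ("primary_term", "translog_generation", "epoch",
                   "node_id", "metadata_version")
SEGMENT_FIELDS = ("primary_term", "commit_generation", "translog_generation",
                  "refresh_counter", "node_id", "epoch", "metadata_version")
# Fields stored as-is; every other field is assumed to be inverted.
PLAIN_FIELDS = {"node_id", "metadata_version"}


def parse_metadata_filename(name, fields):
    """Split a metadata filename on '__' and undo the assumed inversion."""
    prefix, *parts = name.split("__")
    if prefix != "metadata" or len(parts) != len(fields):
        raise ValueError(f"unexpected metadata filename: {name}")
    decoded = {}
    for field, raw in zip(fields, parts):
        value = int(raw)
        decoded[field] = value if field in PLAIN_FIELDS else LONG_MAX - value
    return decoded


translog = parse_metadata_filename(
    "metadata__9223372036854775806__9223372036854775648"
    "__9223370334830643807__-396831118__1",
    TRANSLOG_FIELDS,
)
print(translog["primary_term"], translog["translog_generation"], translog["epoch"])
```

Inverting the numeric fields makes newer files sort first lexicographically, so a listing in name order returns the latest metadata file first. Under the same assumption the epoch in the translog sample decodes to 1702024132000, which is consistent with milliseconds since the Unix epoch. Note that the refresh counter field in the segment sample is larger than 2^63 - 1, so it does not decode to a non-negative value under this assumption.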
Existing Remote Store Garbage Collection Example
Remote Store Garbage Collection Example with Pinned Timestamps