Track which timestamp each LSN corresponds to #1361
Comments
My initial thoughts (I'm not too attached to this approach, just starting the discussion; feel free to disagree):
Commit time seems to make the most sense, since it's the monotonic time that's closest to the user's action. If that's the case, then safekeepers need to send timestamp metadata with the WAL entries. We don't need more than one timestamp per second per tenant (for PITR), so this shouldn't be a performance problem. (Off topic: providing edit timestamps on each database cell would be a sick feature, but even that probably wouldn't need more than one-second accuracy.)
We need to persist these timestamps with almost the same durability semantics that we use for persisting data: once it's committed, it can't be lost. The cost of losing timestamp data is lower than the cost of losing actual data, but let's still try not to lose it. Users may need PITR in dire situations, so we should be as helpful as possible. The easiest way to guarantee durability is to include timestamp data in delta layers. AFAIK all we need to do is add a new variant to the layered repository's key tagged union, and we get storage, compaction, branch logic, and billing for free.
Pageservers should already have the latest value of every page on local disk. This metadata is part of the key space, so it should be on local disk at some pageserver.
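A rough sketch, under assumptions, of what the tagged-union extension mentioned above could look like; the type and variant names here are hypothetical and do not match the actual repository code:

```rust
// Hypothetical sketch only: extending the layered repository's key space with
// a commit-timestamp variant so the mapping is persisted, compacted, and
// branched like ordinary page data. Names are invented for illustration.
#![allow(dead_code)]

/// One mapping entry: the commit timestamp observed for a given transaction.
struct CommitTimestampValue {
    unix_millis: u64,
}

/// The key space the layered repository stores. Adding a variant gives the
/// new metadata the same durability path (delta layers, compaction, GC,
/// branching) as page versions.
enum RepositoryKey {
    /// Ordinary page data: (relation id, block number).
    Page { rel: u32, blkno: u32 },
    /// New variant: commit-timestamp metadata, keyed by transaction id.
    CommitTimestamp { xid: u32 },
}

fn main() {
    let _key = RepositoryKey::CommitTimestamp { xid: 1234 };
    let _value = CommitTimestampValue { unix_millis: 1_700_000_000_000 };
}
```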
Only commits and their corresponding timestamps matter. PITR requests will come either at a precise time (the customer knows when things went wrong) or at an approximate time; either way the time is compute time, since the customer doesn't know any other time in our system. XLOG_XACT_COMMIT already carries a timestamp, which solves the problem of the data source. Now let's look at the important phases in PITR (assuming a restore with no pageserver data):
This means that the PITR workflow should start with downloading and processing commit-ts-lsn before it proceeds with the rest of the restore. Given the data density of the structure (XID, timestamp, LSN), it may need to be stored in a more granular form in S3 (than the current 1 GB/segment we're planning). Given that an S3 [list object and] download is a much longer process than reading local data, I suspect there will be very little benefit from indexing or caching (even if data downloaded from S3 is evicted from memory).
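As an illustration of the flat (XID, timestamp, LSN) structure and the timestamp-to-LSN conversion it enables, here is a minimal sketch; the field widths and names are assumptions, not the actual on-disk format:

```rust
// Hypothetical sketch of a flat (xid, timestamp, lsn) structure and of the
// timestamp -> LSN conversion used to pick a PITR branch point. Field widths
// (32-bit XID, 64-bit unix-millis timestamp, 64-bit LSN) are assumptions.

#[derive(Clone, Copy)]
struct CommitRecord {
    xid: u32,
    timestamp_millis: u64, // commit time taken from the XLOG_XACT_COMMIT record
    lsn: u64,
}

/// Given commit records sorted by timestamp, find the latest LSN whose
/// commit time is <= the requested PITR target time.
fn lsn_for_timestamp(records: &[CommitRecord], target_millis: u64) -> Option<u64> {
    // partition_point returns the index of the first record *after* the target.
    let idx = records.partition_point(|r| r.timestamp_millis <= target_millis);
    if idx == 0 { None } else { Some(records[idx - 1].lsn) }
}

fn main() {
    let records = vec![
        CommitRecord { xid: 100, timestamp_millis: 1_000, lsn: 0x10 },
        CommitRecord { xid: 101, timestamp_millis: 2_000, lsn: 0x20 },
        CommitRecord { xid: 102, timestamp_millis: 3_000, lsn: 0x30 },
    ];
    // Restore "as of" t = 2500 ms: the branch point is the LSN of xid 101.
    assert_eq!(lsn_for_timestamp(&records, 2_500), Some(0x20));
}
```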
Summary of a 1:1 with @antons. Points of agreement:
Remaining problem: if our LSM implementation always maintains a clean state (deletes redundant top-level layers from S3 after compaction), then we just need to download the latest image and a few layers on top of it. We can take frequent images (say, every L1 layer) of the timestamps relish if needed, since it's a small and important relish. However, if we have a lot of top-level layer junk (L0/L1 files that haven't been deleted yet), then we'll have to pay for an expensive S3 ls operation. Some observations, leading towards a solution:
Next steps:
I still think that maintaining a timestamp->LSN mapping is too expensive (otherwise, why is it switched off in Postgres by default?) and should be avoided as much as possible. Right now, in the PiTR PR I am using the filesystem modification time of layer files.
In any case, the first problem with using filesystem time can be easily solved by including the timestamp in the filename itself (as well as the LSN and key range). In this case the timestamp will not be lost after restoring data from S3. Some more arguments against maintaining an LSN->timestamp map:
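A minimal sketch of the filename idea from the comment above, assuming a made-up naming scheme (this is not the actual zenith layer file format): the newest commit timestamp travels with the key range and LSN range, so it survives a round trip through S3 and does not depend on filesystem mtime.

```rust
// Hypothetical sketch: encoding the newest commit timestamp into a layer file
// name alongside the key range and LSN range. The naming scheme is invented
// for illustration only.

struct LayerFileName {
    key_start: u64,
    key_end: u64,
    lsn_start: u64,
    lsn_end: u64,
    max_commit_ts_millis: u64, // newest commit timestamp contained in the layer
}

impl LayerFileName {
    fn encode(&self) -> String {
        format!(
            "{:016X}-{:016X}__{:016X}-{:016X}__ts{}",
            self.key_start, self.key_end, self.lsn_start, self.lsn_end,
            self.max_commit_ts_millis
        )
    }
}

fn main() {
    let name = LayerFileName {
        key_start: 0, key_end: 0xFFFF,
        lsn_start: 0x1_0000, lsn_end: 0x2_0000,
        max_commit_ts_millis: 1_648_000_000_000,
    };
    // GC or PiTR can compare this embedded timestamp against the horizon
    // without opening the file or trusting filesystem mtime.
    println!("{}", name.encode());
}
```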
Regarding maintenance cost: given that PITR requests are not frequent, scanning CLOG records locally is not a big deal (even if it's on disk, the worst case for the flat structure is 32 GB); https://aws.amazon.com/blogs/aws/s3-glacier-select/
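For reference, one way to arrive at a worst-case figure of that order, assuming a flat array with one 8-byte timestamp slot per possible 32-bit XID (an illustrative upper bound, not a measurement of the real structure):

```rust
// Rough arithmetic behind a ~32 GB worst case, assuming one 8-byte commit
// timestamp per possible 32-bit transaction id.
fn main() {
    let xid_space: u64 = 1 << 32;    // 2^32 possible transaction ids
    let bytes_per_entry: u64 = 8;    // e.g. a 64-bit commit timestamp
    let total = xid_space * bytes_per_entry;
    println!("{} GiB", total >> 30); // prints 32
}
```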
Yes, the timestamp is included in the commit record. It actually costs almost nothing (an extra 8 bytes compared with the > 100 bytes of a typical commit record). But maintaining the map is really expensive.
Sorry, I do not understand you. First of all, the maximal size of the CLOG is 1 GB (2 bits per XID). But you are right: instead of scanning the original WAL, we can scan only the WAL related to CLOG pages (but please notice that for some workloads the total size of commit records is comparable with the total WAL size, so in principle it changes nothing, and scanning the original WAL may even be faster than scanning the scattered WAL associated with CLOG pages). Also, in this discussion we frequently refer to interaction with S3: how a timestamp->LSN mapping can help to retrieve older snapshots from S3. This is based on the assumption that the pageserver stores only the most recent versions and historical data is swapped out to S3. That may really be the right approach, but right now we have implemented a different model. There is no swapping to S3: we upload data to S3, but we are not able to retrieve layers on demand from S3. So, summarizing all of the above:
Do you agree with these statements?
Agree on most points. Storing this information in filenames is a viable option from a durability perspective, as long as the timestamp in the filename represents the range of commit timestamps in the layer, rather than file modification metadata (as was suggested earlier outside of this thread). My only concern is that scanning and parsing these files will slow down recovery significantly. We have to eagerly scan all of them to find the exact LSN to recover from. We can't trim different L1 files at different LSNs (or maybe we can? It should be proven if so). This means we have to download even files that are not immediately needed during recovery. If smgr is asking for only one relation in the beginning, we can't start from that relation without downloading other relations first.
Do we really need to truncate files? Can we treat this the same way as branching? We need to locate the branch point and then create a branch in the usual way. Am I missing something?
Yes, I didn't mean literally truncate. But there needs to be a single "branch point" or "restore point". Would restore behave correctly if different layers restore from different LSNs (as a result of independently performing the time -> LSN conversion based on file contents)?
Sorry, I do not understand this. First of all, the timestamp->LSN mapping is not needed during recovery. It is needed by GC (to determine the cutoff horizon) and perhaps to specify the branching point (not by LSN, but by timestamp). Second, if the timestamp is included in file names, there is no need to scan and parse any files. Third, the timestamp -> LSN mapping is built from the XLOG_XACT_COMMIT records associated with the CLOG; that is the only source of this data. A situation where each layer stores its own "timestamp->LSN" map is not possible. By storing a timestamp in the layer file name or in the layer's metadata, we just indicate the most recent timestamp contained in that layer. This information is not used by recovery; it is needed by GC to delete layers which are outside the PiTR interval.
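A sketch of the GC use described above, with invented names: each layer advertises the newest commit timestamp it contains, and GC can consider dropping layers whose newest timestamp falls outside the PiTR interval (a real GC would also have to check that the data is superseded by newer layers).

```rust
// Hypothetical sketch of timestamp-based GC candidate selection. Names and
// the selection criterion are simplified for illustration.

struct LayerInfo {
    name: String,
    max_commit_ts_millis: u64, // newest commit timestamp in this layer
}

/// Return the layers whose newest commit timestamp is older than the PiTR
/// horizon; these are candidates for garbage collection (subject to the
/// usual check that newer layers cover their contents).
fn gc_candidates(layers: &[LayerInfo], pitr_horizon_millis: u64) -> Vec<&LayerInfo> {
    layers
        .iter()
        .filter(|l| l.max_commit_ts_millis < pitr_horizon_millis)
        .collect()
}

fn main() {
    let layers = vec![
        LayerInfo { name: "old".into(), max_commit_ts_millis: 1_000 },
        LayerInfo { name: "recent".into(), max_commit_ts_millis: 9_000 },
    ];
    let doomed = gc_candidates(&layers, 5_000);
    assert_eq!(doomed.len(), 1);
    assert_eq!(doomed[0].name, "old");
}
```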
Do you agree that:
By recovery you mean the "R" in "PiTR"? The next question is how to specify the branch LSN. Right now it is possible only by specifying an LSN, but we may also need to specify it as a timestamp. This is why we need the timestamp->LSN mapping. This is what I am implementing now.
I have already explained my position concerning PiTR from S3 (see my comment from 4 days ago). We may need it, but right now we do not have swapping to S3: the content of the pageserver and of S3 is identical, and GC is performed locally at the pageserver. So before we implement PiTR from S3, we need to implement more sophisticated interaction with S3.
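A hypothetical sketch of the interface being discussed: a branch point given either as an LSN or as a timestamp, where the latter is resolved through the timestamp->LSN mapping. Names are invented and this is not the actual pageserver API.

```rust
// Hypothetical sketch: accepting a branch point as either an LSN or a
// timestamp. `lookup` stands in for whatever timestamp -> LSN mapping the
// pageserver maintains.

enum BranchPoint {
    Lsn(u64),
    Timestamp(u64), // unix millis; must be resolved via commit-record metadata
}

/// Resolve a requested branch point to a concrete LSN.
fn resolve_branch_point(
    point: BranchPoint,
    lookup: impl Fn(u64) -> Option<u64>,
) -> Option<u64> {
    match point {
        BranchPoint::Lsn(lsn) => Some(lsn),
        BranchPoint::Timestamp(ts) => lookup(ts),
    }
}

fn main() {
    // Toy mapping: any timestamp resolves to a fixed LSN.
    let lookup = |_ts: u64| -> Option<u64> { Some(0x1234) };
    assert_eq!(resolve_branch_point(BranchPoint::Lsn(0x42), lookup), Some(0x42));
    assert_eq!(resolve_branch_point(BranchPoint::Timestamp(1_000), lookup), Some(0x1234));
}
```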
In PR #1386 I have implemented on-demand mapping from XID/timestamp to LSN. I have also implemented mapping from XID to LSN (not only from timestamp), because it also seems to be a possible use case when a DBA checks when some record was updated (by
Implemented in PR #1590