
track which timestamp each LSN corresponds to #1361

Closed
Tracked by #1320
stepashka opened this issue Mar 14, 2022 · 15 comments
Labels
c/storage Component: storage

Comments

@stepashka
Member

No description provided.

@stepashka
Member Author

we could store this in a special table?
there were some ideas discussed, cc @hlinnaka @kelvich

@bojanserafimov
Contributor

My initial thoughts (not too attached to the approach, just starting the discussion; feel free to disagree):

  1. What's the best definition of time? Query time? Commit time? Storage time?

Commit time seems to make most sense, since it's the monotonic time that's closest to user action. If that's the case, then safekeepers need to send timestamp metadata with the wal entries. We don't need more than one timestamp per second per tenant (for PITR), so this shouldn't be a performance problem. (off topic: providing edit timestamps on each database cell would be a sick feature, but maybe even that wouldn't need more than second accuracy)

  2. How do we store this?

We need to persist these timestamps with almost the same durability semantics that we use for persisting data: once it's committed it can't be lost. There's a lower cost to losing timestamp data than losing actual data, but still let's try not to lose it. Users may need PITR in dire situations so we should be as helpful as possible.

Easiest way to guarantee durability is to include the timestamp data in delta layers. AFAIK all we need to do is add a new variant behind the layered repository key tagged union, and we get storage, compaction, branch logic, and billing for free (see the sketch at the end of this comment).

  3. Is there a need for indexing/caching this metadata?

Pageservers should already have the latest value of every page on local disk. This metadata is part of the key space, so it should be on local disk at some pageserver.
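
To make point 2 concrete, here is a minimal sketch of the tagged-union idea. The `RelishTag` name and its fields are hypothetical, not the actual pageserver types:

```rust
// Hypothetical sketch only: this does not match the real pageserver code, it
// just illustrates "add a new variant behind the layered repository key
// tagged union".
#[allow(dead_code)]
enum RelishTag {
    /// Ordinary relation data, addressed by relfilenode and block number.
    RelationBlock {
        spc_node: u32,
        db_node: u32,
        rel_node: u32,
        blknum: u32,
    },
    /// A per-timeline "timestamps relish": values stored under this key would
    /// be (commit timestamp, LSN) pairs taken from XLOG_XACT_COMMIT records.
    CommitTimestamps,
}

fn main() {
    // Because the timestamps live in the same key space as relation data,
    // they inherit the existing storage, compaction, branching and billing
    // code paths without any new machinery.
    let key = RelishTag::CommitTimestamps;
    if matches!(key, RelishTag::CommitTimestamps) {
        println!("timestamps relish key");
    }
}
```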

@antons-antons
Contributor

Only commits and their corresponding timestamps matter. PITR requests will come with either a precise time (the customer knows when things went wrong) or an approximate time; either way it is compute time, since the customer doesn't know any other clock in our system.

XLOG_XACT_COMMIT already carries a timestamp, which solves the data-source problem.

Now let's look at the important phases of PITR (assuming a restore with no pageserver data):

  1. Identify what set of objects to download
    1.1 Start with the latest page images prior to the PITR time
    1.2 The end is defined by an LSN less than or equal to the highest xact commit LSN whose commit timestamp is <= the request timestamp
  2. Download (solved)
  3. Truncate log records past the LSN defined in 1.2 (given that we upload data in chunks, downloaded image layers will contain data beyond the PITR time; such data will likely be harmless, except for the "commit-ts-lsn" relish)

This means the PITR workflow should start by downloading and processing commit-ts-lsn before it proceeds with the rest of the restore. Given the data density of the structure (XID, timestamp, LSN), it may need to be stored in a more granular form in S3 than the 1GB/segment we're currently planning. (A sketch of the LSN lookup in step 1.2 follows at the end of this comment.)

Given that an S3 [list object and] download is a much slower process than reading local data, I suspect there will be very little benefit from indexing or caching (even if the data downloaded from S3 is evicted from memory).
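
To illustrate step 1.2, here is a minimal sketch assuming the (commit timestamp, LSN) pairs from XLOG_XACT_COMMIT have already been collected into a timestamp-sorted slice (the types and the in-memory representation are assumptions, not part of the design above):

```rust
// Sketch: find the highest commit LSN whose timestamp is <= the requested
// PITR time. Relies on commit timestamps being (roughly) monotonic in LSN
// order, as discussed above.
type Lsn = u64;
type TimestampMillis = u64;

fn restore_lsn(commits: &[(TimestampMillis, Lsn)], pitr_time: TimestampMillis) -> Option<Lsn> {
    // partition_point returns the index of the first entry with ts > pitr_time.
    let idx = commits.partition_point(|&(ts, _)| ts <= pitr_time);
    if idx == 0 {
        None
    } else {
        Some(commits[idx - 1].1)
    }
}

fn main() {
    let commits = vec![
        (1_000, 0x0016_0000),
        (2_000, 0x0016_4000),
        (3_500, 0x0016_8000),
    ];
    assert_eq!(restore_lsn(&commits, 2_500), Some(0x0016_4000));
    assert_eq!(restore_lsn(&commits, 500), None); // requested time predates all commits
    println!("ok");
}
```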

@bojanserafimov
Contributor

Summary of 1-1 with @antons:

Points of agreement:

  1. For getting the timestamp data, as he previously said, we can just read from XLOG_XACT_COMMIT. Users shouldn't care about PITR between commits (there's no observable difference), and these timestamps are monotonic enough for now (or maybe forever).
  2. It makes sense to store this data as some localized chunk of keyspace. Either modify how we store "commit-ts-lsn" relish, or add a new relish. Both are feasible approaches and this is not an important decision yet. Let's call this "the timestamps relish" for the sake of discussion for now.
  3. It would be nice to allocate a special segment (in terms of image generation) for this relish, so that it can be recovered independently. However, that would be a very small file, and would (probably) have a high s3 PUT cost. There are alternative solutions, using s3 SELECT over parquet or binary json files. That way we can write big files but read small chunks.
  4. During normal operation the pageserver cache of this relish will be enough to find the PITR lsn, given timestamp. However, ...

Remaining problem:
If pageserver is down or overloaded, we'll need to start a new pageserver to read this relish from s3. This needs to be feasible.

If our LSM implementation always maintains a clean state (deletes redundant top level layers from s3 after compaction), then we just need to download the latest image and a few layers on top of it. We can take frequent images (say, every L1 layer) of the timestamps relish if needed, since it's a small and important relish.

However if we have lots of top-level layer junk (L0, L1 files that haven't been deleted yet), then we'll have to pay for an expensive s3 ls operation.

Some observations, leading towards solution:

  1. This is a general "how do we quick-start a pageserver, giving priority to some relishes" problem that should be solved anyway. @hlinnaka should know if we can ignore the problem for the sake of this issue. AFAIK it's solvable.
  2. Cleaning up s3 junk should be done by some background process/service, and we should rarely pay the ls cost for it. It's fine if we do 1% of the time, as long as it's not a very inconvenient time :)
  3. If we sometimes have to list many files in order to find an image file (which is costly), at least we can clean up right afterwards, so we don't have a big amortized cost.

Next steps:

  1. Give people a chance to comment on this approach
  2. Research the s3 put cost of multipart uploads. The result might change our approach. I'll leave this to @antons-antons, unless we have someone more knowledgeable on s3.
  3. Build a simple prototype
  4. Consider optimizations, like the s3 SELECT trick (sketched below) and others if necessary
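
For illustration, here is the shape of query the s3 SELECT trick implies: write one big object, but fetch only the rows relevant to a single PITR request. The column names (`commit_ts`, `lsn`) and picking the max LSN client-side are assumptions, not a worked-out design:

```rust
// Hypothetical: build the S3 Select expression for one PITR request. Only a
// simple filter is pushed down here, so the caller would take the maximum
// LSN from the returned rows on the client side.
fn pitr_select_expression(pitr_time_unix: u64) -> String {
    format!(
        "SELECT s.commit_ts, s.lsn FROM S3Object s WHERE s.commit_ts <= {}",
        pitr_time_unix
    )
}

fn main() {
    println!("{}", pitr_select_expression(1_651_000_000));
}
```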

@neondatabase-bot added this to the 0.6 Towards Tech Prev milestone Mar 15, 2022
@knizhnik
Contributor

knizhnik commented Mar 16, 2022

I still think that maintaining a timestamp->LSN mapping is too expensive (otherwise why would it be switched off in Postgres by default?) and should be avoided as much as possible.
The fact that we only need to take commit records into account doesn't help much, because many workloads consist of a large number of small transactions (up to millions per second). So we would have to maintain some ordered collection (a tree) growing at roughly 1M inserts per second, and if the user sets the PITR window to several months, its size will also become very large. We would also have to provide efficient truncation of this collection. That is quite a challenging task. Yes, it can be done, but do we really need it?

Right now, in the PiTR PR, I am using the filesystem modification time of the layer files.
I have heard only two arguments against this approach:

  1. If we restore data from S3, then the file creation time will not be correct.
    First of all, I do not think this is actually critical. The user will not lose any data because of it: the check is conservative, i.e. PiTR is local to the pageserver. If we restore some data from S3 and it is outside the PiTR interval, we will not remove it (although we could), but then why hadn't we removed it before? In any case it should be removed after some time, unless the pageserver is constantly migrating from one node to another... which would be really strange behavior.
  2. We may need the timestamp->LSN mapping not only for GC, but also to create a branch in the past at a specified timestamp (rather than at an LSN). I do not think such "branching in the past" will be a really popular use case. IMHO users will want either to rewind to some specific snapshot (an explicitly created earlier branch), or to undo the last N transactions, or to undo to some particular database state (which can be expressed by an SQL query, like checking for the presence of some table).

In any case, the first problem with using the filesystem time can easily be solved by including the timestamp in the filename itself (alongside the LSN and key range); the timestamp would then not be lost after restoring data from S3 (see the sketch at the end of this comment).
The second problem can be solved by building the timestamp->LSN map on demand, when it is actually needed. We just need to keep the original (unscattered) WAL for the whole PiTR interval; since, at least for the beginning, we are going to store WAL forever, this should not be a problem.

Some more arguments against maintaining a timestamp->LSN map:

  1. GC can only delete a layer completely, so there is no point in providing a more precise timestamp mapping. If a 128MB layer contains millions of commits, storing all their timestamps is just a waste of space.
  2. Time is an unreliable instrument. Even ignoring the theory of relativity, there are still IT-specific problems with timezones and time synchronization across hosts. What is the source of truth in our case: the pageserver or the compute node? If we grab the timestamp from the commit record, then we rely on the compute node's local time. But the probability of a compute node restart is much higher than that of a pageserver restart or migration, and the clocks on the servers where compute nodes are launched may not be synchronized. Admittedly, I do not expect large time differences on EC2 nodes, and the PiTR interval is not expected to be high-precision anyway. But it would still be better to eliminate these problems altogether.
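
To illustrate the filename idea from above, a minimal sketch assuming a made-up naming scheme (this is not the real layer file format):

```rust
// Hypothetical layer file naming: encode the key range, LSN range, and the
// newest commit timestamp covered by the layer, so the timestamp survives a
// round-trip through S3 regardless of filesystem metadata.
struct LayerFileName {
    key_start: u64,
    key_end: u64,
    lsn_start: u64,
    lsn_end: u64,
    /// Newest commit timestamp (unix seconds) in this layer; GC can compare
    /// it against the PiTR cutoff without opening the file.
    max_commit_ts: u64,
}

impl LayerFileName {
    fn file_name(&self) -> String {
        format!(
            "{:016X}-{:016X}__{:016X}-{:016X}__{}",
            self.key_start, self.key_end, self.lsn_start, self.lsn_end, self.max_commit_ts
        )
    }
}

fn main() {
    let layer = LayerFileName {
        key_start: 0,
        key_end: 0xFFFF,
        lsn_start: 0x0016_0000,
        lsn_end: 0x0016_8000,
        max_commit_ts: 1_647_360_000,
    };
    println!("{}", layer.file_name());
}
```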

@antons-antons
Contributor

Regarding maintenance cost:
The Postgres example is somewhat irrelevant, since Postgres already tracks (XID, commit LSN, timestamp) by default (XLOG_XACT_COMMIT stores exactly that mapping in the WAL); it just doesn't index this data.
Given that we store all of the log writes and back them up to S3, the question is how we can use this data.

Given that PITR requests are not frequent, scanning the CLOG log records locally is not a big deal (even if they are on disk; in the worst case the flat structure is 32GB).
It is important, though, to avoid downloading that 32GB (more if the XID epoch is > 0) from S3 when doing PITR, hence the idea of uploading the data in a format digestible by S3 Select or Athena.
This will speed up (and potentially reduce the cost of) discovering the exact LSN to restore to.

https://aws.amazon.com/blogs/aws/s3-glacier-select/
https://aws.amazon.com/athena/features/?nc=sn&loc=2

@knizhnik
Contributor

The Postgres example is somewhat irrelevant, since Postgres already tracks (XID, commit LSN, timestamp) by default

Yes, the timestamp is included in the commit record. But that costs almost nothing (an extra 8 bytes compared with the >100 bytes of a typical commit record). Maintaining a map, however, is really expensive.

Given that PITR requests are not frequent, scanning the CLOG log records locally is not a big deal (even if they are on disk; in the worst case the flat structure is 32GB)

Sorry, I do not understand you. First of all, the maximal size of the CLOG is 1GB (2 bits per XID).
But it is just a bitmap, so scanning it makes no sense. And if you mean the XLOG_XACT_* records, then the total size of these records is unlimited and depends only on the PiTR interval. If the PiTR interval is unlimited, then the number of transactions we have to scan is also unlimited.

But you are right: instead of scanning the original WAL, we can scan only the WAL related to CLOG pages (though note that in some workloads the total size of commit records is comparable to the total WAL size, so in principle it changes nothing, and scanning the original WAL might even be faster than scanning the scattered WAL associated with CLOG pages).

Also, in this discussion we frequently refer to interaction with S3: how the timestamp->LSN mapping can help retrieve older snapshots from S3. That is based on the assumption that the pageserver stores only the most recent versions while historical data is swapped out to S3. That may well be the right approach, but right now we have implemented a different model. There is no swapping to S3: we upload data to S3, but we are not able to retrieve layers on demand from S3.
What we can do is simply restore a pageserver from S3. Certainly, in architecture discussions we should take the further roadmap into account, but in many cases the main principle of "extreme programming" (do not try to solve a nonexistent problem) is still very relevant.

So, summarizing all of the above:

  1. I do not think we need to persist the timestamp->LSN mapping on the pageserver or in S3.
  2. To support PiTR for GC, it is enough to include the timestamp in the layer metadata (for example in the layer file name).
  3. If we need a more precise timestamp->LSN mapping (to locate a particular XID), we can scan the WAL on demand: either the original (plain) WAL or the XLOG_XACT_* WAL records associated with CLOG pages.

Do you agree with these statements?

@bojanserafimov
Contributor

Agree on most points. Storing this information in filenames is a viable option from a durability perspective, as long as the timestamp in the filename represents the range of commit timestamps in the layer, rather than file modification metadata (as was suggested earlier outside of this thread).

My only concern is that scanning and parsing these files will slow down recovery significantly. We have to eagerly scan all of them to find an exact LSN to recover from. We can't trim different L1 files at different LSNs (or maybe we can? It should be proven if so.). This means we have to download even files that are not immediately needed during recovery. If smgr is asking for only 1 relation in the beginning, we can't start from that relation without downloading other relations first.

@LizardWizzard
Contributor

Truncate log records past LSN

We can't trim different L1 files at different LSNs

Do we really need to truncate files? Can we treat this the same way as branching? We need to locate the branch point and then create a branch in the usual way. Am I missing something?

@bojanserafimov
Contributor

Yes, I didn't mean literally truncate. But there needs to be a single "branch point" or "restore point". Would restore behave correctly if different layers restore from a different LSN (as a result of independently performing time -> LSN conversion based on file contents)?

@knizhnik
Contributor

My only concern is that scanning and parsing these files will slow down recovery significantly.

Sorry, I do not understand this. First of all, the timestamp->LSN mapping is not needed during recovery. It is needed by GC (to determine the cutoff horizon), and perhaps to specify a branching point (by timestamp rather than by LSN).

Second: if the timestamp is included in the file name, there is no need to scan or parse any files.

Third: the timestamp->LSN mapping is built from the XLOG_XACT_COMMIT records associated with the CLOG; that is the only source of the data. A situation where each layer stores its own "timestamp->LSN" map is not possible. By storing the timestamp in the layer file name or in the layer's metadata, we just record the most recent timestamp contained in that layer. This information is not used by recovery; it is needed by GC to delete layers that are outside the PiTR interval.

@bojanserafimov
Contributor

Do you agree that:

  1. Recovery has a definite LSN starting point
  2. We need to be able to PITR from s3

@knizhnik
Contributor

Do you agree that:

1. Recovery has a definite LSN starting point

By recovery, do you mean the "R" in "PiTR"?
Because in terms of the Zenith API it is not a recovery: it is the creation of a branch at some particular LSN. Zenith allows creating a branch in the past, provided the LSN is within the PiTR interval.
Before #1332 it was possible to specify the PiTR interval only as an LSN range. With that PR it is also possible to specify it as a time range.

The next question is how to specify the branch point. Right now it is possible only by specifying an LSN, but we may also need to specify it as a timestamp. This is why we need the timestamp->LSN mapping, and this is what I am implementing now.

2. We need to be able to PITR from s3

I have already explained my position concerning PiTR from S3 (see my comment from 4 days ago). We may need it, but right now we do not have swapping to S3: the content of the pageserver and S3 is identical, and GC is performed locally on the pageserver. So before we implement PiTR from S3, we need to implement more sophisticated interaction with S3.

@knizhnik
Contributor

In PR #1386 I have implemented on-demand mapping from XID/timestamp to LSN.
It extracts the XID/timestamp from the ClogSetCommitted records associated with CLOG segments.
Note that we store the minimal possible set of information in ClogSetCommitted (just the XIDs and the timestamp).
This means that if we tried to store transaction timestamps in some other way, we would have to traverse a comparable amount of data.

I have also implemented mapping from XID to LSN (not only from timestamp), because another plausible use case is a DBA checking when some records were updated (with `select xmin, xmax from ...`) and then wanting to revert to a particular transaction. A transaction's XID can also be grabbed from the logs. (A rough sketch of this kind of on-demand lookup follows below.)
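
For illustration only (this is not the code from the PR; the record type and its fields are assumptions), a sketch of answering both timestamp->LSN and XID->LSN queries from one pass over decoded commit records:

```rust
// Sketch: each record carries the LSN it was written at, the commit
// timestamp, and the XIDs it marks as committed (as in ClogSetCommitted /
// XLOG_XACT_COMMIT).
type Lsn = u64;

struct CommitRecord {
    lsn: Lsn,
    timestamp: u64, // commit time, unix seconds
    xids: Vec<u32>, // transactions marked committed by this record
}

/// Highest commit LSN whose timestamp is <= the requested time.
fn lsn_for_timestamp(records: &[CommitRecord], time: u64) -> Option<Lsn> {
    records
        .iter()
        .filter(|r| r.timestamp <= time)
        .map(|r| r.lsn)
        .max()
}

/// LSN of the commit record that contains the given XID.
fn lsn_for_xid(records: &[CommitRecord], xid: u32) -> Option<Lsn> {
    records.iter().find(|r| r.xids.contains(&xid)).map(|r| r.lsn)
}

fn main() {
    let records = vec![
        CommitRecord { lsn: 0x0010_0000, timestamp: 100, xids: vec![701, 702] },
        CommitRecord { lsn: 0x0010_4000, timestamp: 250, xids: vec![703] },
    ];
    assert_eq!(lsn_for_timestamp(&records, 200), Some(0x0010_0000));
    assert_eq!(lsn_for_xid(&records, 703), Some(0x0010_4000));
    println!("ok");
}
```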

@hlinnaka
Contributor

hlinnaka commented May 3, 2022

Implemented in PR #1590

@hlinnaka closed this as completed May 3, 2022