Make files without Metadata readable with Qbeast #121

Closed
osopardo1 opened this issue Jul 22, 2022 · 6 comments · Fixed by #152
Labels
type: enhancement Improvement of existing feature or code

Comments

@osopardo1
Member

osopardo1 commented Jul 22, 2022

To be more compatible with underlying Table Formats and set up an easier conversion to Qbeast, we should be able to process files that do not have any Qbeast Metadata on them.

For example

This is a File with Qbeast Metadata:

{
  "add": {
    "path": "d54ba0cd-c315-4388-9bce-fe573f5d0a64.parquet",
    ...
    "tags": {
      "state": "FLOODED",
      "cube": "gw",
      "revision": "1",
      "elementCount": "10836",
      "minWeight": "-1253864150",
      "maxWeight": "1254740128"
    }
  }
}

And this is a file without Qbeast Metadata:

{
  "add": {
    "path": "d54ba0cd-c315-4388-9bce-fe573f5d0a64.parquet",
    ...
    "tags": ""
  }
}

One solution could be the following:

When reading the Delta Log and encountering a file without Qbeast tags, we assign it the following synthetic metadata:

val rootTags = Map(
  "maxWeight" -> Weight.MaxValue.value.toString,
  "minWeight" -> Weight.MinValue.value.toString,
  "cube" -> "",
  "state" -> State.FLOODED,
  "revision" -> lastRevisionID.toString,
  "elementCount" -> "0")

This means we place all the unknown files in the root cube of the last revision, with a weight range of [MinValue, MaxValue] ([0.0, 1.0]).
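A minimal sketch of how this fallback could look when reading the log. The `Weight` and `State` definitions below are hypothetical stand-ins (the weight bounds are assumed to be the `Int` limits), and `SyntheticTags`, `requiredKeys` and `effectiveTags` are names invented for illustration, not part of the project:

```scala
// Hypothetical stand-ins for the project's Weight and State definitions.
final case class Weight(value: Int)

object Weight {
  val MinValue: Weight = Weight(Int.MinValue) // assumed lower bound of the weight range
  val MaxValue: Weight = Weight(Int.MaxValue) // assumed upper bound of the weight range
}

object State {
  val FLOODED: String = "FLOODED"
}

object SyntheticTags {
  // Keys taken from the tagged example file above.
  private val requiredKeys =
    Set("state", "cube", "revision", "elementCount", "minWeight", "maxWeight")

  /** Tags that place a file in the root cube of the given revision, covering the full weight range. */
  def rootTags(lastRevisionID: Long): Map[String, String] = Map(
    "maxWeight" -> Weight.MaxValue.value.toString,
    "minWeight" -> Weight.MinValue.value.toString,
    "cube" -> "",
    "state" -> State.FLOODED,
    "revision" -> lastRevisionID.toString,
    "elementCount" -> "0")

  /** Returns the file's own tags when they are complete, otherwise the synthetic root tags. */
  def effectiveTags(
      tags: Option[Map[String, String]],
      lastRevisionID: Long): Map[String, String] =
    tags
      .filter(t => requiredKeys.subsetOf(t.keySet))
      .getOrElse(rootTags(lastRevisionID))
}
```

Files that already carry complete Qbeast tags pass through unchanged, so the fallback only affects files written outside Qbeast.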

Questions/design decisions:

  • What happens with elementCount? Is it necessary to know the value? If so, how can we compute it without wasting time?
  • Is it fair to use the last revision as a placeholder for this data? Would it be better to choose a non-existing empty revision with ID 0?
  • When optimizing/compacting, should we process those files and convert them to Qbeast?
@osopardo1 osopardo1 added type: bug Something isn't working type: enhancement Improvement of existing feature or code and removed type: bug Something isn't working labels Jul 22, 2022
@osopardo1 osopardo1 self-assigned this Jul 22, 2022
@osopardo1 osopardo1 added the high label Jul 22, 2022
@osopardo1 osopardo1 mentioned this issue Jul 27, 2022
6 tasks
@alexeiakimov
Contributor

Regarding elementCount

  1. If the DeltaTable file has Stats, then the value can be obtained as Stats.num_records.
  2. The number of elements in the file is used by the sharing protocol to limit the number of records the client can download. In more detail, the sharing server adds file links to the query result while the sum of elementCount is less than a specified limit.
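One possible reading of the second point, as a sketch: a file link is added as long as the running sum of elementCount over the files already added is below the limit. `FileEntry`, `SharingLimit` and `selectFiles` are hypothetical names used only for illustration, not the sharing server's API:

```scala
// Hypothetical view of a file entry: its path and the elementCount from its tags.
final case class FileEntry(path: String, elementCount: Long)

object SharingLimit {

  /** Keeps adding files while the sum of elementCount of the files already added is below the limit. */
  def selectFiles(files: Seq[FileEntry], limit: Long): Seq[FileEntry] = {
    // Running sum of elementCount *before* each file is added.
    val seenBefore = files.scanLeft(0L)(_ + _.elementCount)
    files
      .zip(seenBefore)
      .takeWhile { case (_, seen) => seen < limit }
      .map { case (file, _) => file }
  }
}
```

Under this reading, files that carry a synthetic elementCount of "0" never advance the running sum, which is one reason the real value matters.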

@alexeiakimov
Contributor

Maybe I am wrong, but the last revision can have (min, max) ranges of the values (later used by the linear transformation) which are smaller than the corresponding value ranges of the records from the file. As I remember, while indexing, if the given data does not fit the latest revision space, then a new revision is created.
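When per-column min/max values for a file happen to be available, the condition described here can be checked directly. A minimal sketch, where `ColumnRange`, `RevisionSpace` and `fits` are hypothetical names and the ranges are simplified to `Double`:

```scala
// Hypothetical (min, max) range of one indexed column.
final case class ColumnRange(min: Double, max: Double) {
  /** True when `other` lies entirely inside this range. */
  def contains(other: ColumnRange): Boolean = min <= other.min && other.max <= max
}

object RevisionSpace {

  /**
   * True when, for every indexed column of the revision, the file's observed
   * range lies inside the revision's range. A column without file statistics
   * counts as "does not fit", because containment cannot be confirmed.
   */
  def fits(revision: Map[String, ColumnRange], file: Map[String, ColumnRange]): Boolean =
    revision.forall { case (column, range) => file.get(column).exists(range.contains) }
}
```

Treating missing statistics as "does not fit" is a conservative choice made for this sketch: it never claims a file respects the revision space without evidence.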

@alexeiakimov
Contributor

Let me formulate the last item a different way: can we treat files without Qbeast metadata as indexed? Possibly they are indexed badly, but if they do not violate any invariant, then it is safe to add them to the index as if they were indexed.

@osopardo1
Member Author

osopardo1 commented Aug 30, 2022

  1. On element count: unfortunately, we cannot assume that the DeltaTable has stats, but using them is a workaround for the cases where it does. If no Stats.num_records is written, we could compute a count() for the file, which would have a performance cost. Another possible solution is to investigate whether the Parquet files have metadata we could read to avoid the computation.
  2. We want to be able to Convert to Qbeast without the overhead of indexing, and also let the user run other Lakehouse operations of the underlying formats without losing information. Yes, the goal of the issue is to treat those files as indexed (badly, as you said), and slowly index them correctly (as the index grows). There can be two cases:
    1. A Revision already exists. In this case, the user has performed an operation in Delta that affected the DeltaLog, and now they cannot read it correctly. If we put the files in the last revision, we need to ensure they are in the [min, max] range. But doing this computation at read time is too expensive (if we don't have any metadata like Stats). That's why putting them in the last revision without knowledge of the space could violate the constraint.
    2. A Revision does not exist. This is the case in which we convert the table to Qbeast from scratch. Here we have more freedom to write the DeltaLog with extra metadata like min-max and element count. But this process belongs more to issue #102 (Add Convert To Qbeast).
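The fallback order for the element count described in the first point could be sketched as follows. `ElementCount.resolve` and its parameters are hypothetical; how the footer row count or the full count is actually obtained is left to the caller:

```scala
object ElementCount {

  /**
   * Resolves the element count of a file, trying the cheapest source first:
   *   1. the record count from the Delta stats, when present;
   *   2. a row count read from the Parquet file metadata, when available;
   *   3. a full count of the file, the expensive last resort.
   * The last two arguments are by-name, so they are evaluated only if needed.
   */
  def resolve(
      statsNumRecords: Option[Long],
      footerRowCount: => Option[Long],
      fullCount: => Long): Long =
    statsNumRecords.orElse(footerRowCount).getOrElse(fullCount)
}
```

Because the more expensive sources are passed by name, a file that has stats never triggers a footer read or a scan.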

@osopardo1
Member Author

osopardo1 commented Jan 20, 2023

UPDATE

From the latest conversations, we agreed that:

  • A table that is fully written in Parquet or Delta would not be readable through Qbeast.
  • The user would need to execute Convert To Qbeast to trigger the creation of the first revision.
  • All files that aren't indexed (the "staging" ones) would be assigned to the root of the last available revision. If they don't belong to the revision space, we are going to read them anyway and filter the records in memory (or use the file-skipping techniques of the underlying format, in this case, Delta Lake).

This issue is a dependency of #102

@osopardo1 osopardo1 removed the high label Jan 24, 2023
@osopardo1 osopardo1 added this to the Benchmarking Qbeast Format milestone Jan 25, 2023
@osopardo1
Member Author

Fixed on #152

@Jiaweihu08 Jiaweihu08 mentioned this issue Jan 27, 2023
5 tasks
@Jiaweihu08 Jiaweihu08 linked a pull request Jan 27, 2023 that will close this issue
5 tasks
3 participants