Add Convert To Qbeast #102

Closed
osopardo1 opened this issue May 9, 2022 · 2 comments · Fixed by #152
Assignees
Labels
type: enhancement Improvement of existing feature or code

Comments

@osopardo1
Member

osopardo1 commented May 9, 2022

Currently, the only way to write data in Qbeast Format is to load it and write it again with the Spark DataFrame API.
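
The current load-and-rewrite path can be sketched roughly as follows. This is a minimal sketch, not the project's implementation: the helper name `rewrite_as_qbeast` and its parameters are hypothetical, and it assumes a Spark session with the Qbeast data source available and the `columnsToIndex` write option as shown in the project's README.

```python
def rewrite_as_qbeast(spark, source_path, source_format, target_path, columns_to_index):
    """Load an existing dataset and write it again in Qbeast Format.

    `spark` is expected to be a SparkSession with the Qbeast data source
    on the classpath (an assumption of this sketch).
    """
    # Read the whole dataset in its original format (e.g. "parquet" or "delta").
    df = spark.read.format(source_format).load(source_path)

    # Rewrite every row through the Qbeast writer; this is the full
    # rewrite that a conversion command would try to avoid.
    (
        df.write.mode("overwrite")
        .format("qbeast")
        .option("columnsToIndex", ",".join(columns_to_index))
        .save(target_path)
    )

# Hypothetical usage, given an active SparkSession named `spark`:
# rewrite_as_qbeast(spark, "/data/events", "parquet", "/data/events_qbeast", ["col1", "col2"])
```

The cost of this path grows with the size of the dataset, since all of it is read and written again in a single job.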

It would be good to have an easier way to convert data in other formats to Qbeast, one that stays compatible with reading files for which no Qbeast Metadata is found.

For that, we can think of two approaches:

  1. Write the data in the same place but organized with the Qbeast index. If more data is added while the conversion is taking place, we treat that data as non-indexed and read all of it when needed.
  2. Write the data in the same place and mark it as replicated cubes, so that we only duplicate the data we need for optimizing.

Doubts/things we need to figure out:

  • How to specify the columns to index in the API
  • How to handle partitioning? Would it be useful to index the columns that are used as partition values?
  • Study the feasibility of the second approach
  • Other design problems that could arise
@osopardo1 osopardo1 added type: enhancement Improvement of existing feature or code high labels May 9, 2022
@osopardo1 osopardo1 mentioned this issue May 10, 2022
14 tasks
@osopardo1 osopardo1 self-assigned this May 25, 2022
@osopardo1 osopardo1 removed the high label Nov 3, 2022
@osopardo1
Member Author

osopardo1 commented Jan 20, 2023

UPDATE

The Convert To Qbeast command would be a naïve implementation that only marks the table with Qbeast Metadata. It would not index any of the existing files, nor add extra metadata to each entry.
The objective is to convert the table into the Qbeast Format gradually, avoiding a rewrite of the whole dataset in a single process.

The files without Qbeast metadata in their tags would be read as usual, and we need to finish #121 to make this operation feasible. The idea is that those files sit in a "staging" area and would eventually be indexed in batches.
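
The "staging" idea can be illustrated with a small sketch that splits the `add` actions of a Delta log into staged and indexed files. This is only an illustration under stated assumptions: the function name is hypothetical, and the tag key used to recognize an indexed file (`"cube"` here) is an assumed placeholder, not a confirmed part of the Qbeast metadata layout.

```python
import json

def split_staging_and_indexed(log_lines, qbeast_tag="cube"):
    """Split Delta `add` actions into staged (untagged) and indexed file paths.

    `qbeast_tag` is an assumed marker key; the real tag names may differ.
    """
    staging, indexed = [], []
    for line in log_lines:
        action = json.loads(line)
        add = action.get("add")
        if add is None:
            # Skip non-add actions such as metaData, remove or commitInfo.
            continue
        tags = add.get("tags") or {}
        if qbeast_tag in tags:
            indexed.append(add["path"])
        else:
            # No Qbeast tags: the file belongs to the staging area.
            staging.append(add["path"])
    return staging, indexed

log = [
    '{"add": {"path": "part-0.parquet", "tags": null}}',
    '{"add": {"path": "part-1.parquet", "tags": {"cube": "A"}}}',
    '{"metaData": {"id": "example"}}',
]
staging, indexed = split_staging_and_indexed(log)
print(staging)  # ['part-0.parquet']
print(indexed)  # ['part-1.parquet']
```

A batch indexer could then pick files from the staged list a few at a time and rewrite them with Qbeast tags, leaving the rest readable as plain files in the meantime.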

The usage would be something like:

QbeastTable.convertToQbeast(columnsToIndex="col1,col2", cubeSize=500)

The operation would trigger a Metadata update that adds an entry like the following to the Delta Log:

{
  "metaData": {
    "id": "aa43874a-9688-4d14-8168-e16088641fdb",
    ...
    "configuration": {
      "qbeast.lastRevisionID": "1",
      "qbeast.revision.1": "{\"revisionID\":1,\"timestamp\":1637851757680,\"tableID\":\"/tmp/qb-testing1584592925006274975\",\"desiredCubeSize\":500,\"columnTransformers\":..}"
    },
    "createdTime": 1637851765848
  }
}
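
To make the shape of that entry concrete, here is a minimal sketch of how the configuration keys could be assembled. The function name and parameters are hypothetical, and the content of `columnTransformers` is elided in the entry above, so the sketch treats it as an opaque value supplied by the caller rather than guessing its structure.

```python
import json
import time

def with_qbeast_revision(metadata, table_id, desired_cube_size,
                         column_transformers, revision_id=1, timestamp_ms=None):
    """Return a copy of a Delta metaData action carrying a Qbeast revision."""
    revision = {
        "revisionID": revision_id,
        "timestamp": int(time.time() * 1000) if timestamp_ms is None else timestamp_ms,
        "tableID": table_id,
        "desiredCubeSize": desired_cube_size,
        # Opaque, caller-supplied value: its real structure is not shown in the issue.
        "columnTransformers": column_transformers,
    }
    # Copy the existing configuration so the input action is not mutated.
    configuration = dict(metadata.get("configuration", {}))
    configuration["qbeast.lastRevisionID"] = str(revision_id)
    # The revision is stored as a JSON string inside the configuration map.
    configuration[f"qbeast.revision.{revision_id}"] = json.dumps(
        revision, separators=(",", ":")
    )
    updated = dict(metadata)
    updated["configuration"] = configuration
    return updated

meta = {"id": "example-id", "configuration": {}, "createdTime": 0}
updated = with_qbeast_revision(meta, "/tmp/example-table", 500, [], timestamp_ms=0)
print(json.dumps(updated, indent=2))
```

Only the table-level metadata changes in this sketch; no file entries are touched, which matches the intent of marking the table without indexing existing files.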

@osopardo1
Member Author

osopardo1 commented Jan 23, 2023

Other aspects/scope of the command:

  • A table written entirely in Delta would not be readable from Qbeast unless we trigger the Convert To Qbeast command. All the details are in "Make files without Metadata readable with Qbeast" (#121).
  • The files not converted to Qbeast would not be optimized, analyzed or compacted. Managing the compaction of files without Qbeast metadata can be complex. The goal of the conversion is that those files would be gradually written in Qbeast Format.
  • How to manage updates? This is a complex topic, and it is not yet clear how it should be handled. We have not tested it yet and need to explore it further.

@osopardo1 osopardo1 added this to the Benchmarking Qbeast Format milestone Jan 25, 2023
@Jiaweihu08 Jiaweihu08 mentioned this issue Jan 25, 2023
5 tasks
2 participants