Implement data format versioning #44

Closed
3 tasks
osopardo1 opened this issue Nov 16, 2021 · 5 comments
Labels
type: bug Something isn't working

Comments

@osopardo1
Member

osopardo1 commented Nov 16, 2021

What went wrong?
During development and cleanup of the software, we failed to version the data format written by each implementation.
As a result, the main branch has two different ways of writing the metadata, with nothing in the data to tell which release wrote it.

For example, a dataset written before commit 7eb77dd would contain this information:

{
  "add" : {
    "..." : {},
    "tags" : {
      "cube" : "A",
      "indexedColumns" : "ss_sales_price,ss_ticket_number",
      "maxWeight" : "462168771",
      "minWeight" : "-2147483648",
      "rowCount" : "508765",
      "space" : "{\"timestamp\":1631692406506,\"transformations\":[{\"min\":-99.76,\"max\":299.28000000000003,\"scale\":0.0025060144346431435},{\"min\":-119998.5,\"max\":359999.5,\"scale\":2.083342013925058E-6}]}",
      "state" : "FLOODED"
    }
  }
}

Newer datasets have a different structure:

{
  "add": {
    "path": "4b36340e-0bf7-44ec-97da-f7a03bf06ea3.parquet",
    ...
    "tags": {
      "state": "FLOODED",
      "rowCount": "3",
      "cube": "gA",
      "revision": "1634196697656",
      "minWeight": "-2147483648",
      "maxWeight": "-857060062"
    }
  }
}
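
Until an explicit version exists, the only way to tell the two layouts apart is to inspect the tag keys. The following is a minimal and fragile sketch of that heuristic, based only on the two examples above: it assumes `space` and `indexedColumns` appear only in the older layout and `revision` only in the newer one. The function name `guess_layout` is hypothetical and is not part of qbeast-spark.

```python
import json

# Tag keys taken only from the two examples in this issue; not a full spec.
OLD_ONLY_TAGS = {"space", "indexedColumns"}
NEW_ONLY_TAGS = {"revision"}


def guess_layout(add_action_line: str) -> str:
    """Guess which metadata layout an `add` action uses, from its tag keys."""
    action = json.loads(add_action_line)
    tags = action.get("add", {}).get("tags") or {}
    keys = set(tags)
    if keys & NEW_ONLY_TAGS:
        return "new"
    if keys & OLD_ONLY_TAGS:
        return "old"
    # Neither marker is present, so the heuristic cannot decide.
    return "unknown"


old_line = json.dumps({"add": {"tags": {
    "cube": "A",
    "indexedColumns": "ss_sales_price,ss_ticket_number",
    "state": "FLOODED",
}}})
new_line = json.dumps({"add": {"tags": {
    "cube": "gA",
    "revision": "1634196697656",
    "state": "FLOODED",
}}})

print(guess_layout(old_line), guess_layout(new_line))
```

A heuristic like this breaks as soon as a later layout reuses or drops one of these keys, which is the argument for recording an explicit version instead.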

We should: [EDITED]

  • Add the file format version to the Delta log. The version should be an increasing number starting from 1.
  • Check if we are reading the right version when loading a new DataFrame
  • Update the docs, listing the versions and tagging the latest version of the code that can read it.
@osopardo1 osopardo1 added the type: bug Something isn't working label Nov 16, 2021
@osopardo1 osopardo1 added this to the Christmas party milestone Nov 16, 2021
@osopardo1 osopardo1 added the type: enhancement Improvement of existing feature or code label Nov 24, 2021
@osopardo1 osopardo1 removed this from the Christmas party milestone Dec 2, 2021
@cugni cugni changed the title Implement version control Implement data format versioning Dec 7, 2021
@cugni
Member

cugni commented Dec 7, 2021

I don't think the goal right now is to support multiple versions in the same release, but we must make the version explicit. We should record in the metadata which version of the Qbeast format we are using. The reader should then check whether it supports that version and throw an exception if it does not. The user would then have to find the matching version in git and compile it. In the future we can support upgrades, but that is not necessary now.

Therefore the to-do list should be more like:

Start versioning the format evolution. I guess the first one was 0.0.1 and now we use 0.0.2?

  • Add the file format version somewhere in the Delta log. I guess it should be a global property for the whole table.
  • Check that we are reading the right version when loading a new DataFrame.
  • Update the docs, listing the versions and tagging the latest version of the code that can read it.
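
The write-then-check flow described above can be sketched as follows. This is only an illustration under stated assumptions: the property key `qbeast.formatVersion`, the set of supported versions, and all function and class names are hypothetical, and the table properties are modelled as a plain dictionary rather than a real Delta log.

```python
# Hypothetical names throughout: none of these exist in qbeast-spark or Delta.
FORMAT_VERSION_KEY = "qbeast.formatVersion"  # assumed table-level property key
SUPPORTED_FORMAT_VERSIONS = {2}              # assumed: this build reads only version 2


class UnsupportedFormatVersionError(Exception):
    """Raised when a table was written with a format this build cannot read."""


def stamp_format_version(table_properties: dict, version: int) -> dict:
    """Return a copy of the table properties with the format version recorded."""
    return {**table_properties, FORMAT_VERSION_KEY: str(version)}


def check_format_version(table_properties: dict) -> int:
    """Return the table's format version, or raise if this build cannot read it."""
    raw = table_properties.get(FORMAT_VERSION_KEY)
    if raw is None:
        # Assumption: a table without the property predates versioning.
        raise UnsupportedFormatVersionError("table has no recorded format version")
    version = int(raw)
    if version not in SUPPORTED_FORMAT_VERSIONS:
        raise UnsupportedFormatVersionError(
            f"format version {version} is not supported; "
            f"supported versions: {sorted(SUPPORTED_FORMAT_VERSIONS)}"
        )
    return version


props = stamp_format_version({}, 2)
print(check_format_version(props))
```

How to treat tables with no recorded version is a design choice the thread leaves open; the sketch rejects them, but mapping them to the first format version would be an equally valid reading.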

@eavilaes
Contributor

start versioning the format evolution. I guess the first one was 0.0.1 and now we use 0.0.2?

We may not have enough format versions to justify semantic versioning, so I'd say the first one was v1 and now we are using v2. This way it won't be confused with the actual qbeast-spark version.
What do you think?

The rest looks fine to me; it will also help solve issue #53.

@osopardo1
Member Author

Mmh, I think you're right! Just the number (1, 2, ...) should be enough, because it's more like a protocol version.

@osopardo1 osopardo1 added high and removed type: enhancement Improvement of existing feature or code labels Dec 21, 2021
@eavilaes
Contributor

eavilaes commented Jan 5, 2022

I've been thinking about developing this, and I thought of doing it the following way:

Use DeltaLog's protocol tag to store information about our protocol versioning.
Having a look at Action Reconciliation and Protocol Evolution from Delta can help to understand the solution.

So basically, my idea is to include our tags inside the protocol tag:

{
  "protocol":{
    "minReaderVersion":1,
    "minWriterVersion":2,
    "minQbeastReaderVersion":1,
    "minQbeastWriterVersion":1
  }
}

I think it is a clean solution; I did some really quick tests with it and everything seems to work fine. The problem I found is how to extend or modify the Protocol class from Delta. Could someone provide a hint?

In summary, I've seen that committing a transaction takes a Seq[Action], which contains the Protocol information. But the Protocol information is generated inside the commit() method of OptimisticTransaction, when it calls prepareCommit(). I don't really know how to proceed from here.
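
On the reading side, the proposal could be checked roughly as in the sketch below. It only parses the JSON lines of a log and does not touch Delta's Protocol class, so it says nothing about the extension problem raised above. The `minQbeast*` fields are the ones proposed in this comment, not part of Delta; the function names and the default for a missing field are assumptions.

```python
import json
from typing import Iterable, Optional

# Highest versions this hypothetical reader understands.
MAX_DELTA_READER_VERSION = 1
MAX_QBEAST_READER_VERSION = 1


def latest_protocol(log_lines: Iterable[str]) -> Optional[dict]:
    """Return the last `protocol` action found in the given log lines, if any."""
    protocol = None
    for line in log_lines:
        action = json.loads(line)
        if "protocol" in action:
            protocol = action["protocol"]
    return protocol


def can_read(protocol: dict) -> bool:
    """Check both the Delta and the proposed Qbeast minimum reader versions."""
    delta_ok = protocol.get("minReaderVersion", 1) <= MAX_DELTA_READER_VERSION
    # Assumption: a missing Qbeast field means no Qbeast-specific constraint.
    qbeast_ok = protocol.get("minQbeastReaderVersion", 0) <= MAX_QBEAST_READER_VERSION
    return delta_ok and qbeast_ok


log = [
    json.dumps({"protocol": {
        "minReaderVersion": 1,
        "minWriterVersion": 2,
        "minQbeastReaderVersion": 1,
        "minQbeastWriterVersion": 1,
    }}),
    json.dumps({"add": {"path": "part-0.parquet"}}),
]

print(can_read(latest_protocol(log)))
```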

@osopardo1
Member Author

As this issue has been inactive for an extended period, we will close it.
