Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Update DatasetVersion versioning #2071

Open
RNHTTR opened this issue Aug 12, 2022 · 4 comments
Open

Proposal: Update DatasetVersion versioning #2071

RNHTTR opened this issue Aug 12, 2022 · 4 comments
Milestone

Comments

@RNHTTR
Copy link
Contributor

RNHTTR commented Aug 12, 2022

There has been some discussion (mostly in #1977) about reworking the versioning system for DatasetVersion.

Motivation

The current DatasetVersion versioning system leads to confusion (e.g. #1883). DatasetVersion has a uuid field (of type UUID) and a version field (also of type UUID). In a practical sense, I think these fields are redundant.

Additionally, external data systems might already support dataset versioning (e.g. delta, iceberg). It'd make sense for Marquez to support these.

Proposal

I propose that a Version's uuid field should assume the functionality currently provided by Version's version field, and add an additional field external_version to support dataset versions provided by external applications. This would have a downstream impact on JobVersion.

Work required

  1. Update Version.getValue() to be of type String
  2. Drop DatasetVersion's version field
  3. Add a field to DatasetVersion: external_version (String)
  4. Drop JobVersion's version field
  5. Add a field to JobVersion: external_version (String).
    1. I'm not sure if this is currently necessary, but it seems reasonable to assume that data applications might support job versions tied to code in the future if they don't already.
  6. Use OpenLineage's DatasetVersionDatasetFacet facet to support external dataset versions.
  7. Upstream/downstream code changes to support 1-6 (e.g. updating queries to use dv.uuid instead of dv.version)
  8. Database migrations

If this proposal is accepted, I'll open an official proposal.

@collado-mike
Copy link
Collaborator

What's the reasoning behind steps 2 and 3? I can see the usefulness of the external_version column to note specifically that the dataset version is assigned by some other system. But what column do we use when we compute the dataset version? Will we still write to the external_version column?

@RNHTTR
Copy link
Contributor Author

RNHTTR commented Sep 2, 2022

What's the reasoning behind steps 2 and 3?

In my opinion, the version field is confusing. We're dealing with a DatasetVersion already; does it make sense that a version has a version (genuinely asking)? Also, version is redundant to uuid, and the behavior of these two fields is the same: When a dataset is changed, each become a new UUID value.

But what column do we use when we compute the dataset version? Will we still write to the external_version column?

Each dataset will have its uuid (required). external_version will be nullable, so if it comes from an external source, it will be populated with that dataset version. Otherwise, it will be null.

@RNHTTR
Copy link
Contributor Author

RNHTTR commented Sep 28, 2022

@collado-mike @wslulciuc Is there anything additional I need to do to bring this to a vote or anything like that?

@collado-mike
Copy link
Collaborator

Sorry for the radio silence here, Ryan. Please open a PR with the proposal and we can approve it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

3 participants