Proposal: Update DatasetVersion versioning #2071

RNHTTR · 2022-08-12T18:12:03Z

There has been some discussion (mostly in #1977) about reworking the versioning system for DatasetVersion.

Motivation

The current DatasetVersion versioning system leads to confusion (e.g. #1883). DatasetVersion has a uuid field (of type UUID) and a version field (also of type UUID). In a practical sense, I think these fields are redundant.

Additionally, external data systems might already support dataset versioning (e.g. Delta Lake, Iceberg). It'd make sense for Marquez to support these external versions.

Proposal

I propose that a Version's uuid field assume the functionality currently provided by its version field, and that we add a new field, external_version, to support dataset versions provided by external applications. This would have a downstream impact on JobVersion.

Work required

1. Update Version.getValue() to be of type String
2. Drop DatasetVersion's version field
3. Add a field to DatasetVersion: external_version (String)
4. Drop JobVersion's version field
5. Add a field to JobVersion: external_version (String).
   1. I'm not sure if this is currently necessary, but it seems reasonable to assume that data applications might support job versions tied to code in the future if they don't already.
6. Use OpenLineage's DatasetVersionDatasetFacet facet to support external dataset versions.
7. Upstream/downstream code changes to support 1-6 (e.g. updating queries to use dv.uuid instead of dv.version)
8. Database migrations
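
For step 6, here is a minimal sketch of how an external version might be read from an incoming event's dataset facets. It assumes the facets arrive as parsed JSON (nested maps) and that the facet sits under the `version` key with a `datasetVersion` field, which is my reading of the OpenLineage spec. The class and method names are hypothetical and are not existing Marquez code.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Marquez.
final class ExternalVersionSketch {
  private ExternalVersionSketch() {}

  // Returns the externally assigned dataset version, if the event carries one.
  // Assumption: the facet key is "version" and its field is "datasetVersion".
  static Optional<String> externalVersion(Map<String, Object> datasetFacets) {
    if (datasetFacets == null) {
      return Optional.empty();
    }
    Object facet = datasetFacets.get("version");
    if (!(facet instanceof Map)) {
      return Optional.empty();
    }
    Object value = ((Map<?, ?>) facet).get("datasetVersion");
    // Anything other than a string is treated as "no external version".
    return value instanceof String ? Optional.of((String) value) : Optional.empty();
  }
}
```

An empty result would map to a null external_version column, so datasets without the facet keep working unchanged.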

If this proposal is accepted, I'll open an official proposal.


collado-mike · 2022-08-24T16:37:10Z

What's the reasoning behind steps 2 and 3? I can see the usefulness of the external_version column to note specifically that the dataset version is assigned by some other system. But what column do we use when we compute the dataset version? Will we still write to the external_version column?

RNHTTR · 2022-09-02T11:07:42Z

> What's the reasoning behind steps 2 and 3?

In my opinion, the version field is confusing. We're dealing with a DatasetVersion already; does it make sense that a version has a version (genuinely asking)? Also, version is redundant with uuid, and the two fields behave the same way: when a dataset changes, each becomes a new UUID value.

> But what column do we use when we compute the dataset version? Will we still write to the external_version column?

Each dataset will have its uuid (required). external_version will be nullable, so if it comes from an external source, it will be populated with that dataset version. Otherwise, it will be null.
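
To illustrate, here is a rough sketch of the shape I have in mind. The class and factory names are made up for this example and are not the actual Marquez model. It only shows uuid as required and external_version as nullable, and it deliberately says nothing about how the uuid itself is computed.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.UUID;

// Hypothetical illustration of the proposed fields; not the real Marquez class.
final class DatasetVersionSketch {
  private final UUID uuid;              // required, identifies the version
  private final String externalVersion; // nullable, set only by external systems

  private DatasetVersionSketch(UUID uuid, String externalVersion) {
    this.uuid = Objects.requireNonNull(uuid, "uuid is required");
    this.externalVersion = externalVersion;
  }

  // Version with no externally assigned value: external_version stays null.
  static DatasetVersionSketch internal(UUID uuid) {
    return new DatasetVersionSketch(uuid, null);
  }

  // Version reported by an external system such as a table format.
  static DatasetVersionSketch external(UUID uuid, String externalVersion) {
    return new DatasetVersionSketch(
        uuid, Objects.requireNonNull(externalVersion, "externalVersion is required"));
  }

  UUID getUuid() {
    return uuid;
  }

  Optional<String> getExternalVersion() {
    return Optional.ofNullable(externalVersion);
  }
}
```

Queries would always key on uuid, while external_version is only carried along as extra information when it exists.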

RNHTTR · 2022-09-28T15:25:49Z

@collado-mike @wslulciuc Is there anything additional I need to do to bring this to a vote or anything like that?

collado-mike · 2022-09-29T23:46:26Z

Sorry for the radio silence here, Ryan. Please open a PR with the proposal and we can approve it.

RNHTTR mentioned this issue Sep 30, 2022

Proposal/2071 update version versioning #2153

Merged

7 tasks

davidjgoss mentioned this issue Dec 6, 2023

add proposal for dataset schema versions #2696

Merged

7 tasks

wslulciuc added this to Marquez Oct 23, 2024

wslulciuc added this to the 0.51.0 milestone Oct 23, 2024
