-
Notifications
You must be signed in to change notification settings - Fork 328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discrepancy in building DatasetVersionIds and which version
is used
#1977
Comments
Of note - it is expected that both the |
The
The |
If that was the original intention, it is definitely no longer the case. See the |
The We do return the So, I do agree the names are confusing as the users sees It really comes down to naming here. Maybe we should have gone with |
Definitely not just a naming issue - look at the code I linked to. The $ curl "http://localhost:5020/api/v1/namespaces/food_delivery/datasets/public.categories"
{"id":{"namespace":"food_delivery","name":"public.categories"},"type":"DB_TABLE","name":"public.categories","physicalName":"public.categories","createdAt":"2022-05-05T22:48:12.288321Z","updatedAt":"2022-05-05T22:55:46.524400Z","namespace":"food_delivery","sourceName":"analytics_db","fields":[{"name":"menu_id","type":"INTEGER","tags":[],"description":"The ID of the menu related to the category."},{"name":"description","type":"TEXT","tags":[],"description":"The description of the category."},{"name":"id","type":"INTEGER","tags":[],"description":"The unique ID of the category."},{"name":"name","type":"VARCHAR","tags":[],"description":"The name of the category."}],"tags":[],"lastModifiedAt":"2022-05-05T22:52:01.524400Z","lastLifecycleState":null,"description":null,"currentVersion":"e247cf97-d567-4c51-81e4-7c3eded64dd6","facets":{"stats":{"size":4430400,"rowCount":1000},"schema":{"fields":[{"name":"id","type":"INTEGER","description":"The unique ID of the category."},{"name":"name","type":"VARCHAR","description":"The name of the category."},{"name":"menu_id","type":"INTEGER","description":"The ID of the menu related to the category."},{"name":"description","type":"TEXT","description":"The description of the category."}],"_producer":"com.datakin.cli.SeedCommand","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet"},"dataSource":{"uri":"jdbc:postgresql://localhost:3306/deliveries","name":"analytics_db","_producer":"com.datakin.cli.SeedCommand","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet"},"description":"A table for categories.","dataQuality":{"bytes":4430400,"rowCount":1000,"columnMetrics":{"id":{"nullCount":0,"distinctCount":1000},"name":{"nullCount":0,"distinctCount":700}}},"greatExpectations_assertions":{"assertions":[{"success":true,"expectationType":"expect_table_row_count_to_be_between"},{"success":true,"expectationType":"expect_column_to_exist"},{"column":"id","success":true,"expectationType":"expect_column_values_to_be_unique"},{"column":"id","success":true,"expectationType":"expect_column_values_to_not_be_null"},{"column":"name","success":true,"expectationType":"expect_column_values_to_not_match_regex"},{"column":"menu_id","success":true,"expectationType":"expect_column_values_to_not_be_null"}]}},"deleted":false} Here, the $ curl "http://localhost:5020/api/v1/namespaces/food_delivery/datasets/public.categories/versions/e247cf97-d567-4c51-81e4-7c3eded64dd6"
{"code":404,"message":"Dataset version 'e247cf97-d567-4c51-81e4-7c3eded64dd6' not found."} And if I query all versions for that dataset
the response is empty. So, I was wrong that the |
Right, I agree with you that there's an issue, and certain queries are using the wrong column, what I meant was that it's a side effect of the confusion around the column naming. Here's a related issue #1883 |
Ultimately I agree with @wslulciuc. The queries I agree that this is pretty confusing, and This update would also pave the way to resolve #1883 Thoughts? |
My concern with using the In general, I think it's better to hide the primary key from users when there's another user-facing unique key that can be used instead (e.g., we use job names and dataset names, not primary keys, which aren't even part of the user-facing service model). |
I might be misunderstanding, but |
@collado-mike: I view this concern very similar to runIDs. Given that a runID must be a valid The Marquez API doesn't provide the option to query a run based on an external runID, but that can be something we can eventually support and even display in the UI (which is probably a good idea anyways). Similarly for versioning, we can support an external version that is provided by the user as an OpenLineage facet as part of the event (that can possibly be queried). As @RNHTTR, the version must be a valid Now, it really comes down to naming here. I think the How about we just use |
TBH, I don't see a lot of value coming from an internal I agree there are similarities to the job
Because From that perspective, I don't agree that the version must be a valid UUID. Especially, if the original intention was to keep the So I think that if we can eliminate artificial constraints on what the version column contains, we can start to consider alternatives that provide more value for the user. I think everyone agrees that random UUIDs are not human-friendly or intuitive URL values. For that reason, we use job names and namespaces in our URLs rather than their primary keys. That URL design means it is possible to one day support cross-linking between different tools - e.g., imagine a link in the Airflow UI that says "see lineage for this task in Marquez" and the Airflow code immediately knows how to construct the link without knowing any of the details about how Marquez constructs its primary keys. We can offer that same experience for Dataset versions - given a dataset version in any tool, we can easily construct a URL to the lineage of that dataset without knowing anything about the internal details of how we Marquez constructs a checksum or primary key. A company that had multiple Marquez deployments could rely on deterministic Dataset URLs between deployments in cases where lineage crossed installations (Willy, I'm thinking of a specific case you'll be familiar with). Primary keys are great for creating foreign keys and indexes (it's easier for another table to simply reference a record's UUID rather than using composite keys), but I don't think they're great user-facing values. I think we already have better alternatives. |
In the
OpenLineageService
, we construct aDatasetVersionId
using theuuid
property of the record - that is, the primary key. However, in theRunDao
, when we construct theDatasetVersionId
of the inputs and outputs of aRun
, we use theversion
property. This leads to confusion, as code that depends on the return value of theOpenLineageService
will expect aDatasetVersionId
that can be matched to the values returned by theRunDao
and it turns out they never match.OpenLineageService code: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/service/OpenLineageService.java#L208
RunDao code: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/RunDao.java#L85-L87
The text was updated successfully, but these errors were encountered: