add proposal for dataset schema versions #2696

Merged
merged 2 commits into from
Feb 21, 2024

Conversation

davidjgoss
Contributor

Official proposal for #2676.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)


netlify bot commented Dec 5, 2023

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
🔨 Latest commit e0ddcbe
🔍 Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/65d625d6ec5c650008fb01d6


codecov bot commented Dec 5, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (a9b0b3e) 84.45% compared to head (e0ddcbe) 84.45%.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2696   +/-   ##
=========================================
  Coverage     84.45%   84.45%           
  Complexity     1416     1416           
=========================================
  Files           251      251           
  Lines          6447     6447           
  Branches        291      291           
=========================================
  Hits           5445     5445           
  Misses          850      850           
  Partials        152      152           


@davidjgoss davidjgoss force-pushed the proposal/2676 branch 3 times, most recently from 3f2376f to 05a4d4d on December 6, 2023 09:24
@davidjgoss
Contributor Author

davidjgoss commented Dec 6, 2023

@wslulciuc @pawel-big-lebowski I just pushed some updates to this based on our discussion yesterday:

  • version column omitted from new table, with reasoning (see Proposal: Update DatasetVersion versioning #2071)
  • Detail added on how the equality check and matching against existing schema versions will work. The idea of using a hash is borrowed from Pact, which uses it to good effect
  • Detail added on how the migration script would work, roughly
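
As a rough illustration of the hash idea mentioned above, here is a minimal sketch. It is not taken from the proposal: the function name `schema_version_hash`, the choice of SHA-256, and the decision to treat a schema as an order-insensitive set of (name, type) pairs are all assumptions made for the example.

```python
import hashlib
import json


def schema_version_hash(fields):
    """Return a deterministic digest for a list of (name, type) field pairs.

    Hypothetical helper: two schemas with the same fields produce the same
    digest, so matching an existing schema version becomes a single
    equality lookup on the hash.
    """
    # Assumption: field order is not significant, so sort before hashing.
    canonical = sorted((name, type_) for name, type_ in fields)
    # Compact separators keep the serialized form stable and unambiguous.
    payload = json.dumps(canonical, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

With this shape, checking whether an incoming schema already has a version reduces to computing the digest and looking it up, rather than comparing field lists row by row.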

@pawel-big-lebowski
Collaborator

Superb proposal with A1 diagram 👍

The proposal contains an example of a job that runs every 10 minutes, resulting in 864,000 rows in dataset_versions_field_mapping. Would this result in 21,600 dataset schema versions after the migration? Are you planning just to migrate the data, or also to clean up redundant entries?

@davidjgoss
Contributor Author

@pawel-big-lebowski

> Would this result in 21600 dataset schema versions after the migration?

For migrating existing data, the simplest approach would be to create a schema version for every existing dataset version, even though that would cause duplication. I would definitely prefer a smarter script that creates only distinct schema versions; in the example above, that would result in just 1 schema version. I think my next step should be to experiment with whether this can be scripted in SQL alone, and how slow it would be for, e.g., millions of records.
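
The "distinct schema versions only" approach can be sketched in Python to show the intended result, independent of whether the real script ends up in pure SQL. Everything here is hypothetical: the function names, the input shape (rows of dataset version id, field name, field type, all for a single dataset), and the SHA-256 fingerprint are assumptions for illustration only.

```python
import hashlib
import json
from collections import defaultdict


def _fingerprint(fields):
    """Order-insensitive digest of (name, type) pairs (assumed scheme)."""
    payload = json.dumps(sorted(fields), separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_schema_migration(field_mapping_rows):
    """Plan a deduplicating migration for one dataset.

    field_mapping_rows: iterable of (dataset_version_id, field_name, field_type).
    Returns (schema_versions, version_to_schema), where schema_versions maps
    each fingerprint to its sorted field list and version_to_schema maps every
    dataset version id to the fingerprint of its schema.
    """
    # Group the flat mapping rows back into one field list per dataset version.
    fields_by_version = defaultdict(list)
    for version_id, name, type_ in field_mapping_rows:
        fields_by_version[version_id].append((name, type_))

    schema_versions = {}
    version_to_schema = {}
    for version_id, fields in fields_by_version.items():
        fp = _fingerprint(fields)
        # Only the first dataset version with a given schema creates an entry.
        schema_versions.setdefault(fp, sorted(fields))
        version_to_schema[version_id] = fp
    return schema_versions, version_to_schema
```

If every dataset version in the example has identical fields, `schema_versions` ends up with a single entry while `version_to_schema` still has one row per dataset version.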

@davidjgoss
Contributor Author

Added a section regarding the behaviour with input datasets, where the current dataset version is updated, and how we will handle this differently for dataset schema versions.

@wslulciuc wslulciuc added the db.perf This issue or pull request improves DB performance label Feb 1, 2024
Signed-off-by: David Goss <david.goss@matillion.com>
Member

@wslulciuc wslulciuc left a comment

🚀 🚀 💯

@wslulciuc wslulciuc merged commit 29a6794 into MarquezProject:main Feb 21, 2024
16 checks passed
@davidjgoss davidjgoss deleted the proposal/2676 branch February 21, 2024 17:02
@wslulciuc wslulciuc modified the milestones: 0.45.0, Roadmap Apr 16, 2024
jonathanpmoraes pushed a commit to nubank/NuMarquez that referenced this pull request Feb 6, 2025
Signed-off-by: David Goss <david.goss@matillion.com>
Labels
db.perf This issue or pull request improves DB performance docs proposal
Projects
Archived in project
3 participants