Convert to using mashumaro jsonschema with acceptable performance #8437

Merged
gshank merged 12 commits into main from ct-3000-mashumaro_jsonschema on Aug 30, 2023

Conversation

gshank
Contributor

@gshank gshank commented Aug 16, 2023

resolves #8426

Problem

The original conversion was done in #8132 but had performance issues. This PR uses caching to improve performance.
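
As a rough illustration of the caching idea, here is a minimal sketch that memoizes schema generation per class with `functools.lru_cache`. It is not the code from this PR: `build_schema`, `SchemaMixin`, and `Catalog` are hypothetical stand-ins, and the builder is a toy rather than mashumaro's generator.

```python
import dataclasses
import functools
from typing import Any, Dict


def build_schema(cls: type) -> Dict[str, Any]:
    """Hypothetical stand-in for an expensive JSON Schema builder."""
    build_schema.calls += 1  # count invocations so the cache effect is visible
    return {
        "type": "object",
        "title": cls.__name__,
        "properties": {f.name: {} for f in dataclasses.fields(cls)},
    }


build_schema.calls = 0


class SchemaMixin:
    @classmethod
    @functools.lru_cache(maxsize=None)
    def json_schema(cls) -> Dict[str, Any]:
        # The cache is keyed on `cls`, so each class is built once and reused.
        return build_schema(cls)


@dataclasses.dataclass
class Catalog(SchemaMixin):
    # Hypothetical artifact class used only for this illustration.
    nodes: dict = dataclasses.field(default_factory=dict)
    sources: dict = dataclasses.field(default_factory=dict)


first = Catalog.json_schema()
second = Catalog.json_schema()  # served from the cache, no rebuild
```

One trade-off of this approach is that every caller receives the same dict object, so callers should treat the cached schema as read-only.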

See the comments in #8132 for additional context for code reviews.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@gshank gshank requested review from a team as code owners August 16, 2023 22:45
@gshank gshank requested review from mikealfare, emmyoop and aranke and removed request for a team August 16, 2023 22:45
@cla-bot cla-bot bot added the cla:yes label Aug 16, 2023
@codecov

codecov bot commented Aug 16, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (07372db) 86.34% compared to head (6122517) 86.34%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8437   +/-   ##
=======================================
  Coverage   86.34%   86.34%           
=======================================
  Files         174      174           
  Lines       25579    25531   -48     
=======================================
- Hits        22087    22046   -41     
+ Misses       3492     3485    -7     
Flag          Coverage Δ
integration   83.14% <100.00%> (+0.01%) ⬆️
unit          65.10% <95.04%> (-0.05%) ⬇️


Files Changed                              Coverage Δ
core/dbt/parser/base.py                    93.45% <ø> (ø)
core/dbt/utils.py                          81.38% <ø> (ø)
core/setup.py                              0.00% <ø> (ø)
core/dbt/context/context_config.py         94.11% <100.00%> (+0.26%) ⬆️
core/dbt/contracts/connection.py           96.03% <100.00%> (ø)
core/dbt/contracts/graph/model_config.py   92.09% <100.00%> (-1.72%) ⬇️
core/dbt/contracts/graph/nodes.py          95.25% <100.00%> (ø)
core/dbt/contracts/graph/unparsed.py       93.10% <100.00%> (ø)
core/dbt/contracts/project.py              97.68% <100.00%> (+0.06%) ⬆️
core/dbt/contracts/util.py                 93.83% <100.00%> (+1.47%) ⬆️
... and 2 more

... and 2 files with indirect coverage changes


  )
  pre_hook: List[Hook] = field(
      default_factory=list,
-     metadata=MergeBehavior.Append.meta(),
+     metadata={"merge": MergeBehavior.Append, "alias": "pre-hook"},
Member

Why are aliases needed for Append now? Why does packages not need it on line 466?

Contributor Author

The "alias" is for handling the dashes in the names properly. Most of the other field definitions use that kind of hacky metadata=MergeBehavior.DictKeyAppend.meta() thing, which doesn't allow setting additional metadata.
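
Because a Python identifier cannot contain a dash, a field named `pre_hook` needs some mapping to accept the key `pre-hook`. The sketch below shows one way an `alias` entry in field metadata could drive that mapping. `HookConfig` and `from_aliased_dict` are hypothetical, and `"merge"` is a plain string here rather than dbt's `MergeBehavior` enum.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class HookConfig:
    # Hypothetical config class; the metadata dict carries both keys at once,
    # which a helper that returns a fixed metadata dict could not do.
    pre_hook: List[str] = field(
        default_factory=list,
        metadata={"merge": "append", "alias": "pre-hook"},
    )
    post_hook: List[str] = field(
        default_factory=list,
        metadata={"merge": "append", "alias": "post-hook"},
    )


def from_aliased_dict(cls, data: Dict[str, Any]):
    """Build `cls` from a dict whose keys may use the dashed aliases."""
    alias_to_name = {f.metadata.get("alias", f.name): f.name for f in fields(cls)}
    # Keys that are not aliases (for example "post_hook") pass through unchanged.
    kwargs = {alias_to_name.get(key, key): value for key, value in data.items()}
    return cls(**kwargs)


config = from_aliased_dict(
    HookConfig, {"pre-hook": ["grant select"], "post_hook": ["analyze"]}
)
```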

core/dbt/contracts/project.py (resolved)
@@ -72,12 +72,12 @@
# ----
# These are major-version-0 packages also maintained by dbt-labs. Accept patches.
"dbt-extractor~=0.5.0",
"hologram~=0.0.16", # includes transitive dependencies on python-dateutil and jsonschema
Member

🎉

Contributor

🎊

@gshank gshank marked this pull request as draft August 17, 2023 13:27
@gshank gshank marked this pull request as ready for review August 17, 2023 14:14
@gshank
Contributor Author

gshank commented Aug 17, 2023

@jtcohen6 @graciegoheen I tagged you on this because the output jsonschema is different than what was generated by hologram in a number of ways. Do people actually read it? Do we have any concerns there?

For example, the resource_type shows up as a const, and the use of OneOf vs AnyOf is different.
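
To make the `oneOf` versus `anyOf` distinction concrete, here is a toy evaluator for a very small subset of JSON Schema. It is an illustration only, not a real validator, and the helper names `matches_type`, `count_matches`, and `is_valid` are made up for this sketch. `anyOf` passes when at least one subschema matches, `oneOf` passes only when exactly one matches, and `const` pins a single allowed value.

```python
from typing import Any, Dict, List


def matches_type(value: Any, type_name: str) -> bool:
    """Check a value against a tiny subset of the JSON Schema `type` keyword."""
    if isinstance(value, bool):
        return False  # bool is an int subclass in Python; booleans are out of scope here
    checks = {
        "integer": lambda v: isinstance(v, int),
        "number": lambda v: isinstance(v, (int, float)),
        "string": lambda v: isinstance(v, str),
    }
    return checks[type_name](value)


def count_matches(value: Any, subschemas: List[Dict[str, Any]]) -> int:
    """Count how many subschemas accept the value."""
    return sum(1 for subschema in subschemas if is_valid(value, subschema))


def is_valid(value: Any, schema: Dict[str, Any]) -> bool:
    if "const" in schema:
        return value == schema["const"]
    if "anyOf" in schema:
        return count_matches(value, schema["anyOf"]) >= 1
    if "oneOf" in schema:
        return count_matches(value, schema["oneOf"]) == 1
    return matches_type(value, schema["type"])


# An integer satisfies both "integer" and "number", so the two keywords disagree.
branches = [{"type": "integer"}, {"type": "number"}]
any_of_result = is_valid(1, {"anyOf": branches})  # True: at least one branch matches
one_of_result = is_valid(1, {"oneOf": branches})  # False: two branches match
```

The practical consequence is that a schema using `oneOf` with overlapping branches can reject data that the equivalent `anyOf` schema accepts, which is worth checking when comparing generated schemas.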

@jtcohen6
Contributor

@gshank I think that's fine, as long as this is a forward-looking change for new versions of dbt-core, and not a change to the existing published & versioned schemas.

We know that the jsonschemas generated by hologram were not always even technically correct, which could lead to edge cases if used for programmatic validation (e.g. #4657). I am hoping that the ones produced by mashumaro can achieve better correctness!

I just want to clarify that:

  • The actual contents of our contracted artifacts (manifest.json, run_results.json, catalog.json, sources.json) are not changing
  • The new JSONSchemas (produced by mashumaro) can be used to successfully validate those artifacts, in the ways that we know some users try to do programmatically (e.g. [CT-2268] [Bug] dbt-core >= 1.4.2 manifests not passing v8 schema validation #7119)
  • We will still be able to publish these JSONSchemas, and the "human-readable" versions, at schemas.getdbt.com

cc @dbt-labs/cloud-artifacts for visibility

@gshank
Contributor Author

gshank commented Aug 23, 2023

That's right, the other schemas will change too. Should I update them as well? Probably nothing has changed as far as validation goes... Or should we wait for an actual change and just verify that newly generated schemas still work?

@jtcohen6
Contributor

@gshank Good point re: artifacts that won't actually be changing their schema in v1.7 (most likely catalog.json + run_results.json + sources.json). Let's verify that the new generated schemas actually work for validating instances of those artifacts, as produced by older versions of dbt-core — we can use our internal-analytics project as a real-world example. That sense-check would make me feel much better about updating the jsonschemas that we have published at schemas.getdbt.com.

Contributor

@mikealfare mikealfare left a comment

I have a few questions, no serious concerns, and multiple nits (take them or leave them, just things I noticed).

        return updated

    def translate_hook_names(self, project_dict):
        # This is a kind of kludge because the fix for #6411 specifically allowed misspelling
Contributor

If this is not the intended input format, should we raise a warning here indicating that? It wouldn't cause anything to fail, but providing some direction would make it easier for us to deprecate the incorrect spelling in the future (likely one less thing for folks to change for 2.0).

Contributor Author

That ticket specifically allowed the "incorrect" spellings, so it's now a feature.

Contributor

There's no :lolsob: emoji, why is there no :lolsob: emoji when I need one so badly.

That being said, we don't intend on ever migrating folks off of the "incorrect" spelling either?

Contributor Author

You'd have to ask product and Doug :). If you want to open a ticket, go ahead. Not in scope for this one though...

Contributor

The misspelling here we mean is, we'll accept either kebab case or snake case for these two configs, in the several places they could be potentially defined:

  • post-hook or post_hook
  • pre-hook or pre_hook
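
Accepting both spellings can be sketched as a recursive key rename over a nested config dict. This is a minimal illustration, not the `translate_hook_names` implementation from this PR: `HOOK_ALIASES` and `normalize_hook_names` are hypothetical names, and treating the kebab-case form as canonical is an assumption.

```python
from typing import Any, Dict

# Assumed canonical form: kebab case. Snake-case keys are renamed to match.
HOOK_ALIASES = {"pre_hook": "pre-hook", "post_hook": "post-hook"}


def normalize_hook_names(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with snake-case hook keys renamed to kebab case."""
    normalized: Dict[str, Any] = {}
    for key, value in config.items():
        if isinstance(value, dict):
            value = normalize_hook_names(value)  # recurse into nested config blocks
        normalized[HOOK_ALIASES.get(key, key)] = value
    return normalized


project = {"models": {"pre_hook": ["a"], "staging": {"post-hook": ["b"]}}}
normalized = normalize_hook_names(project)
```

In this sketch the input dict is left untouched, and if both spellings appear in the same block, whichever comes later in iteration order wins.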

Contributor

The misspelling here we mean is, we'll accept either kebab case or snake case for these two configs

Agreed, I'm asking if we ever want to back out of that ditch, or support that for the foreseeable future.

core/dbt/contracts/graph/model_config.py (resolved)
core/dbt/contracts/util.py (resolved)
core/dbt/dataclass_schema.py (resolved)
core/dbt/dataclass_schema.py (outdated, resolved)
core/dbt/dataclass_schema.py (resolved)
core/dbt/dataclass_schema.py (resolved)
core/dbt/dataclass_schema.py (outdated, resolved)
tests/unit/test_graph.py (outdated, resolved)
tests/unit/utils.py (outdated, resolved)
core/dbt/parser/base.py (outdated, resolved)
@gshank gshank requested a review from a team as a code owner August 25, 2023 14:09
@gshank gshank requested review from heysweet and removed request for a team August 25, 2023 14:09
@heysweet heysweet requested review from eddowh and removed request for heysweet August 25, 2023 14:29

# Check that catalog validates with jsonschema
catalog_dict = catalog.to_dict()
try:
Contributor

I can't explain it, but this feels like an odd flow to me. Would something like this work?

assert catalog.validate(catalog_dict), "Catalog validation failed"

or even

assert catalog.validate(catalog.to_dict()), "Catalog validation failed"
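
Whether the one-line `assert` form works depends on what `validate` returns. The sketch below assumes, without confirmation from this thread, a validator that raises on failure and returns `None` on success. Under that assumption a bare `assert validate(...)` would fail even for valid data, because `None` is falsy. `ValidationError`, `validate`, and `is_valid` here are hypothetical stand-ins, not dbt's API.

```python
from typing import Any, Dict


class ValidationError(Exception):
    """Hypothetical error type raised when data does not match the schema."""


def validate(data: Dict[str, Any]) -> None:
    """Hypothetical validator: raises on bad input, returns None on success."""
    if "nodes" not in data:
        raise ValidationError("missing required key: nodes")


def is_valid(data: Dict[str, Any]) -> bool:
    """Adapt the raising validator to a boolean for use in a one-line assert."""
    try:
        validate(data)
    except ValidationError:
        return False
    return True


good_catalog = {"nodes": {}}
result = validate(good_catalog)  # None on success, so `assert result` would fail
assert is_valid(good_catalog), "Catalog validation failed"
```

If the validator does return a truthy value on success, the direct `assert` suggested above works as written and the adapter is unnecessary.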

@@ -81,6 +82,10 @@ def _assert_freshness_results(self, path, state):
with open(path) as fp:
data = json.load(fp)

try:
Contributor

Same comment as in test_docs_generate_defer.

@gshank gshank merged commit f063e4e into main Aug 30, 2023
@gshank gshank deleted the ct-3000-mashumaro_jsonschema branch August 30, 2023 18:07
peterallenwebb added a commit that referenced this pull request Aug 30, 2023
* Add compiled node properties to run_results.json

* Include compiled-node attributes in run_results.json

* Fix typo

* Bump schema version of run_results

* Fix test assertions

* Update expected run_results to reflect new attributes

* Code review changes

* Fix mypy warnings for ManifestLoader.load() (#8443)

* revert python version for docker images (#8445)

* revert python version for docker images

* add comment to not update python version, update changelog

* Bumping version to 1.7.0b1 and generate changelog

* [CT-3013]  Fix parsing of `window_groupings` (#8454)

* Update semantic model parsing tests to check measure non_additive_dimension spec

* Make `window_groupings` default to empty list if not specified on `non_additive_dimension`

* Add changie doc for `window_groupings`  parsing fix

* update `Number` class to handle integer values (#8306)

* add show test for json data

* oh changie my changie

* revert unnecessary change to fixture

* keep decimal class for precision methods, but return __int__ value

* jerco updates

* update integer type

* update other tests

* Update .changes/unreleased/Fixes-20230803-093502.yaml

---------

Co-authored-by: Emily Rockman <emily.rockman@dbtlabs.com>

* Improve docker image README (#8212)

* Improve docker image README

- Fix unnecessary/missing newline escapes
- Remove double whitespace between parameters
- 2-space indent for extra lines in image build commands

* Add changelog entry for #8212

* ADAP-814: Refactor prep for MV updates (#8459)

* apply reformatting changes only for #8449
* add logging back to get_create_materialized_view_as_sql
* changie

* swap trigger (#8463)

* update the implementation template (#8466)

* update the implementation template

* add colon

* Split tests into classes (#8474)

* add flaky decorator

* split up tests into classes

* revert update agate for int (#8478)

* updated typing and methods to meet mypy standards (#8485)

* Convert error to conditional warning for unversioned contracted model, fix msg format (#8451)

* first pass, tests need updates

* update proto defn

* fixing tests

* more test fixes

* finish fixing test file

* reformat the message

* formatting messages

* changelog

* add event to unit test

* feedback on message structure

* WIP

* fix up event to take in all fields

* fix test

* Fix ambiguous reference error for duplicate model names across packages with tests (#8488)

* Safely remove external nodes from manifest (#8495)

* [CT-2840] Improved semantic layer protocol satisfaction tests (#8456)

* Test `SemanticModel` satisfies protocol when none of its `Optionals` are specified

* Add tests ensuring SourceFileMetadata and FileSlice satisfy DSI protocols

* Add test asserting Defaults obj satisfies protocol

* Add test asserting SemanticModel with optionals specified satisfies protocol

* Split dimension protocol satisfaction tests into with and without optionals

* Simplify DSI Protocol import strategy in protocol satisfaction tests

* Add test asserting DimensionValidtyParams satisfies protocol

* Add test asserting DimensionTypeParams satisfies protocol

* Split entity protocol satisfaction tests into with and without optionals

* Split measure protocol satisfaction tests and add measure aggregation params satisfaction test

* Split metric protocol satisfaction test into optional specified and unspecified

Additionally, create where_filter pytest fixture

* Improve protocol satisfaction tests for MetricTypeParams and sub protocols

Specifically we added/improved protocol satisfaction tests for
- MetricTypeParams
- MetricInput
- MetricInputMeasure
- MetricTimeWindow

* Convert to using mashumaro jsonschema with acceptable performance (#8437)

* Regenerate run_results schema after merging in changes from main.

---------

Co-authored-by: Gerda Shank <gerda@dbtlabs.com>
Co-authored-by: Matthew McKnight <91097623+McKnight-42@users.noreply.github.com>
Co-authored-by: Github Build Bot <buildbot@fishtownanalytics.com>
Co-authored-by: Quigley Malcolm <QMalcolm@users.noreply.github.com>
Co-authored-by: dave-connors-3 <73915542+dave-connors-3@users.noreply.github.com>
Co-authored-by: Emily Rockman <emily.rockman@dbtlabs.com>
Co-authored-by: Jaime Martínez Rincón <jaime@jamezrin.name>
Co-authored-by: Mike Alfare <13974384+mikealfare@users.noreply.github.com>
Co-authored-by: Michelle Ark <MichelleArk@users.noreply.github.com>
Successfully merging this pull request may close these issues.

[CT-3000] Handle performance/caching issues with using mashumaro jsonschema generation
5 participants