Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BAD-308 schema and table description updates #589

Merged
merged 1 commit into from
Nov 24, 2022
Merged

Conversation

keegansmith21
Copy link
Contributor

@keegansmith21 keegansmith21 commented Nov 2, 2022

This update was initially scoped to fix an issue where a table's descriptions are wiped if the table is updated.
In looking through the foundational utility functions, the following things have been changed:

find_schema()

The function has been altered in the following ways:

  • Now returns dateless schema files where a date is not supplied. The date is now an optional argument
  • Removed the versioning utility of the function as it is not used in any case other than in tests (which have been updated accordingly)

This results in the following changes to underlying code:

  • The schema finding function has been abstracted out of the underlying table creation, making it more obvious where the schema is coming from during table creation.
  • Schema files should now be explicitly supplied to the table creation functions. This gives the user more control over what schema is being supplied at the telescope/workflow level. This is significantly more important now as we are forcing more tables to comply with the supplied schema. Some cases disallow this use however, such as the stream release telescope's partition load . This could be changed, but would require reworking some parts of the telescope.

create_bigquery_table_from_query()

The function has been altered in the following ways:

  • If a schema file is supplied, any existing table with the supplied name will be deleted before being recreated

This results in the following behaviour changes when calling the function

  • This ensures that the table descriptions are updated when a table exists already and is being updated with a schema. Since bigquery will wipe existing descriptions when updating a table (for some reason).

Table loading functions

prepare_bq_load(), prepare_bq_load_v2(), bq_load_shard() and bq_load_shard_v2()

These functions have changed in the following ways:

  1. Removed prepare_bq_load and prepare_bq_load_v2
  2. Merged the functionality of the two bq_load_shard functions into bq_load_shard()

This was done for the following reasons:

  1. prepare_bq_load was used to wrap the find_schema function and create_bigquery_dataset. Since find_schema has been abstracted out, this meant it was only wrapping a single function, making it redundant.
  2. The two bq_load_shard functions did very similar things but gathered the necessary information differently. I have forced the information to be supplied explicitly in all cases. This removes duplication of code and also makes it more clear what tables and datasets are being made at the telescope/workflow level

bq_load_ingestion_partition(), bq_load_partition(), bq_append_from_file()

These functions have changed in the following ways:

  • Each function now expects a supplied schema, project_id, bucket and data_location

This was done for the following reasons:

  • Schema files should now be defined more explicitly wherever possible for reasons given above
  • project_id, bucket and data_location were normally inferred from prepare_bq_load(). They need to be supplied in a different manner now. I think that this way is better as this information is always available to the telescope/workflow anyway and should make it more clear where the table is going to be created.

Relates to:

@codecov
Copy link

codecov bot commented Nov 2, 2022

Codecov Report

Base: 93.72% // Head: 93.63% // Decreases project coverage by -0.09% ⚠️

Coverage data is based on head (e396a88) compared to base (0c99e5f).
Patch coverage: 80.95% of modified lines in pull request are covered.

❗ Current head e396a88 differs from pull request most recent head 13f6726. Consider uploading reports for the commit 13f6726 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #589      +/-   ##
===========================================
- Coverage    93.72%   93.63%   -0.10%     
===========================================
  Files           50       50              
  Lines         6866     6847      -19     
  Branches       781      781              
===========================================
- Hits          6435     6411      -24     
- Misses         315      320       +5     
  Partials       116      116              
Impacted Files Coverage Δ
...-platform/observatory/platform/utils/test_utils.py 93.59% <40.00%> (-0.05%) ⬇️
...servatory/platform/workflows/snapshot_telescope.py 78.94% <40.00%> (-4.08%) ⬇️
...ry-platform/observatory/platform/utils/gc_utils.py 83.59% <60.00%> (-0.30%) ⬇️
...latform/observatory/platform/utils/config_utils.py 96.55% <100.00%> (-0.06%) ⬇️
...tform/observatory/platform/utils/workflow_utils.py 75.17% <100.00%> (-2.22%) ⬇️
...observatory/platform/workflows/stream_telescope.py 97.20% <100.00%> (+0.12%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@keegansmith21 keegansmith21 force-pushed the BAD-308_schema_updates branch from 5fac679 to f2b3f4a Compare November 3, 2022 00:46
@keegansmith21 keegansmith21 requested a review from jdddog November 3, 2022 02:23
@keegansmith21 keegansmith21 marked this pull request as ready for review November 3, 2022 02:23
Copy link
Contributor

@jdddog jdddog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Keegan, the changes look good, just a couple of points for feedback:

  • I think we should change the logic so that if you don't supply a release_date, then it just finds the schema with the exact table name. Then if you supply the release date then it finds the latest version of the schema. That would make the logic easier to understand.
  • How are schemas with versions handled? Some workflows in the academic observatory have a schema_version parameter. I don't know if they are used, maybe we could just remove them.

Copy link
Contributor

@jdddog jdddog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Keegan, these changes look good to me, nice work!

@jdddog jdddog merged commit 79a20ef into develop Nov 24, 2022
@jdddog jdddog deleted the BAD-308_schema_updates branch November 24, 2022 23:52
keegansmith21 added a commit that referenced this pull request Dec 2, 2022
jdddog pushed a commit that referenced this pull request Feb 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants