Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TIMX 403 - Add "run_id" CLI argument and Transformer attribute #210

Merged
merged 1 commit into from
Nov 20, 2024

Conversation

ghukill
Copy link
Contributor

@ghukill ghukill commented Nov 19, 2024

Purpose and background context

As we move into Transmogrifier writing to a parquet dataset, one important bit of information it will need is the concept of a "run id". This correlates directly to an "Execution UUID" that every StepFunction invocation produces. This identifier is then used when writing the records to the parquet dataset, allowing for quick and easy access to records associated with that identifier.

There is a small many-to-one relationship that makes naming a bit awkward: each StepFunction invocation may run Transmogrifier multiple times (e.g. multiple input files). Each time it invokes Transmogrifier, the same run_id would be passed. This effectively groups the outputs of all Transmogrifier invocations under the same run_id partition in the parquet dataset. The language of this new run_id in Transmogrifier is intentionally somewhat high level, indicating it's just an identifier to associate with that invocation of the application.

How this addresses that need:

  • Adds new CLI argument -r / --run-id
  • Transformer gets new attribute run_id
  • Transformer mints a UUID of a run id is not passed, making the change backwards compatible and inconsequential if a run id is not passed

Side effects of this change:

  • Going forward, invocations of Transmogrifier can use the run_id as part of the parquet record writing. Until then, it has no effect.

How can a reviewer manually see the effects of these changes?

Until the run_id is utilized during writing, it has no effect on the output of records at this time. See the new unit tests for some examples of when and how it's passed.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO (not yet)

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

As we move into Transmogrifier writing to a parquet dataset, one
important bit of information it will need is the concept of a
"run id".  This correlates directly to an "Execution UUID" that
every StepFunction invocation produces.  This identifier is then
used when writing the records to the parquet dataset, allowing
for quick and easy access to records associated with that identifier.

There is a small many-to-one relationship that makes naming a bit
awkward: each StepFunction invocation may run Transmogrifier multiple
times (e.g. multiple input files).  Each time it invokes Transmogrifier,
the same "run_id" would be passed.  This effectively groups the outputs
of all Transmogrifier invocations in the same location in the parquet
dataset.  The language of this new "run_id" in Transmogrifier is
intentionally somewhat high level, indicating it's just an identifier
to associate with that invocation of the run.

How this addresses that need:
* Adds new CLI argument -r / --run-id
* Transformer gets new attribute 'run_id'
* Transformer mints a UUID of a run id is not passed, making
the change backwards compatible and inconsequential if a run
id is not passed

Side effects of this change:
* Going forward, invocations of Transmogrifier can use the run id
as part of the parquet record writing.  Until then, it has no effect.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-403
Comment on lines +83 to +85
if not run_id:
logger.info("explicit run_id not passed, minting new UUID")
run_id = str(uuid.uuid4())
Copy link
Contributor Author

@ghukill ghukill Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, specifically, it what keeps this backwards compatible. If a run_id is not passed via the CLI, which our current StepFunction will not do, a UUID is minted... and then just never used in any meaningful way.

But in addition to being backwards compatible, this is also kind of a nice-to-have for development, where passing a --run-id is not necessary when invoking Transmog locally. A UUID will get minted, and if writing to parquet, that will get used in the output.

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just one question. 🤓

with mock.patch(
"transmogrifier.sources.transformer.Transformer.transform_and_write_output_files"
) as mocked_transform_and_write:
mocked_transform_and_write.side_effect = Exception("stopping transformation")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the purpose of this side_effect to force stop the transform CLI command after loading the transformer (i.e., running the command until we can confirm that the run_id attrib is set? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct! I'm sure there are other ways to do it... but this felt kind of convenient.

And, I suspect that method transform_and_write_output_files() will be removed entirely by the parquet work, so I'm not feeling too precious about tests that interact with it.

@ghukill ghukill merged commit 68bef31 into main Nov 20, 2024
4 checks passed
@ehanson8 ehanson8 deleted the TIMX-403-inputs-support-parquet-writing branch November 25, 2024 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants