TIMX 403 - Add "run_id" CLI argument and Transformer attribute #210

ghukill · 2024-11-19T20:30:09Z

Purpose and background context

As we move into Transmogrifier writing to a parquet dataset, one important bit of information it will need is the concept of a "run id". This correlates directly to an "Execution UUID" that every StepFunction invocation produces. This identifier is then used when writing the records to the parquet dataset, allowing for quick and easy access to records associated with that identifier.

There is a small many-to-one relationship that makes naming a bit awkward: each StepFunction invocation may run Transmogrifier multiple times (e.g. multiple input files). Each time it invokes Transmogrifier, the same run_id would be passed. This effectively groups the outputs of all Transmogrifier invocations under the same run_id partition in the parquet dataset. The language of this new run_id in Transmogrifier is intentionally somewhat high level, indicating it's just an identifier to associate with that invocation of the application.
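A minimal sketch of the grouping described above, using only the standard library. The in-memory `dataset` dict, the `write_records` helper, and the file names are all hypothetical stand-ins for illustration; the real parquet layout and writer are not shown in this PR.

```python
import uuid
from collections import defaultdict

# Hypothetical stand-in for a partitioned dataset: run_id -> list of records.
dataset = defaultdict(list)


def write_records(run_id, input_file, records):
    """Tag each record with the run_id and store it under that run's group."""
    for record in records:
        dataset[run_id].append({**record, "run_id": run_id, "input_file": input_file})


# One orchestrating execution produces a single identifier...
run_id = str(uuid.uuid4())

# ...and passes the same value to every per-file invocation.
write_records(run_id, "file-a.xml", [{"id": "a1"}, {"id": "a2"}])
write_records(run_id, "file-b.xml", [{"id": "b1"}])

# All records from both invocations can be retrieved with one identifier.
records_for_run = dataset[run_id]
print(len(records_for_run))
```

Because both invocations share the identifier, a single lookup returns the records from every input file in that run.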

How this addresses that need:

Adds new CLI argument -r / --run-id
Transformer gets new attribute run_id
Transformer mints a UUID if a run id is not passed, which keeps the change backwards compatible and makes omitting the argument inconsequential
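For illustration only, an optional `-r` / `--run-id` argument could look like the sketch below. It uses the standard library's `argparse` as a stand-in; the CLI framework that Transmogrifier actually uses, the `transform-sketch` program name, and the help text are assumptions not taken from this PR.

```python
import argparse


def build_parser():
    """Build a parser with an optional run id (hypothetical stand-in CLI)."""
    parser = argparse.ArgumentParser(prog="transform-sketch")
    parser.add_argument(
        "-r",
        "--run-id",
        dest="run_id",
        default=None,  # None signals "not passed", so downstream code can mint one
        help="Identifier to associate with this invocation.",
    )
    return parser


parser = build_parser()
long_form = parser.parse_args(["--run-id", "abc-123"])
short_form = parser.parse_args(["-r", "abc-123"])
omitted = parser.parse_args([])

print(long_form.run_id, short_form.run_id, omitted.run_id)
```

Defaulting to `None` rather than generating a value in the parser leaves the decision to mint an identifier to the code that receives the argument.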

Side effects of this change:

Going forward, invocations of Transmogrifier can use the run_id as part of the parquet record writing. Until then, it has no effect.

How can a reviewer manually see the effects of these changes?

Until the run_id is used during writing, it has no effect on the output of records. See the new unit tests for examples of when and how it is passed.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO (not yet)

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-403

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed and verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced:

As we move into Transmogrifier writing to a parquet dataset, one important bit of information it will need is the concept of a "run id". This correlates directly to an "Execution UUID" that every StepFunction invocation produces. This identifier is then used when writing the records to the parquet dataset, allowing for quick and easy access to records associated with that identifier.

There is a small many-to-one relationship that makes naming a bit awkward: each StepFunction invocation may run Transmogrifier multiple times (e.g. multiple input files). Each time it invokes Transmogrifier, the same "run_id" would be passed. This effectively groups the outputs of all Transmogrifier invocations in the same location in the parquet dataset. The language of this new "run_id" in Transmogrifier is intentionally somewhat high level, indicating it's just an identifier to associate with that invocation of the run.

How this addresses that need:
* Adds new CLI argument -r / --run-id
* Transformer gets new attribute 'run_id'
* Transformer mints a UUID if a run id is not passed, making the change backwards compatible and inconsequential if a run id is not passed

Side effects of this change:
* Going forward, invocations of Transmogrifier can use the run id as part of the parquet record writing. Until then, it has no effect.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-403

ghukill · 2024-11-19T21:14:42Z

transmogrifier/sources/transformer.py

+        if not run_id:
+            logger.info("explicit run_id not passed, minting new UUID")
+            run_id = str(uuid.uuid4())


This, specifically, is what keeps this backwards compatible. If a run_id is not passed via the CLI, which our current StepFunction will not do, a UUID is minted... and then just never used in any meaningful way.

But in addition to being backwards compatible, this is also kind of a nice-to-have for development, where passing a --run-id is not necessary when invoking Transmog locally. A UUID will get minted, and if writing to parquet, that will get used in the output.
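A small, self-contained sketch of the fallback in the diff above. `TransformerSketch` is a hypothetical stand-in, not the real `Transformer` class, and its constructor signature is an assumption. One detail worth noting: because the check is `if not run_id`, an empty string is treated the same as `None`.

```python
import logging
import uuid
from typing import Optional

logger = logging.getLogger(__name__)


class TransformerSketch:
    """Hypothetical stand-in showing only the run_id fallback."""

    def __init__(self, run_id: Optional[str] = None) -> None:
        # Falsy values (None or "") both trigger minting a new identifier.
        if not run_id:
            logger.info("explicit run_id not passed, minting new UUID")
            run_id = str(uuid.uuid4())
        self.run_id = run_id


explicit = TransformerSketch(run_id="abc-123")  # caller-supplied value is kept
minted = TransformerSketch()                    # no value: a UUID4 string is minted
from_empty = TransformerSketch(run_id="")       # empty string: also minted

print(explicit.run_id)
```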

ehanson8

Looks good to me!

jonavellecuerdo

Looks good to me! Just one question. 🤓

jonavellecuerdo · 2024-11-20T18:34:54Z

tests/test_cli.py

+    with mock.patch(
+        "transmogrifier.sources.transformer.Transformer.transform_and_write_output_files"
+    ) as mocked_transform_and_write:
+        mocked_transform_and_write.side_effect = Exception("stopping transformation")


Is the purpose of this side_effect to force stop the transform CLI command after loading the transformer (i.e., running the command until we can confirm that the run_id attrib is set)? 🤔

ghukill

That's correct! I'm sure there are other ways to do it... but this felt kind of convenient.

And, I suspect that method transform_and_write_output_files() will be removed entirely by the parquet work, so I'm not feeling too precious about tests that interact with it.
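The pattern discussed in this thread can be sketched in isolation as follows. `TransformerSketch`, `run_command`, and `run_until_stopped` are hypothetical stand-ins rather than code from this repository, and the use of `autospec=True` to recover the instance is one possible way to inspect `run_id`, not necessarily what the actual test does.

```python
import uuid
from unittest import mock


class TransformerSketch:
    """Hypothetical stand-in for the real Transformer class."""

    def __init__(self, run_id=None):
        self.run_id = run_id or str(uuid.uuid4())

    def transform_and_write_output_files(self):
        raise AssertionError("the expensive step should be patched out in this test")


def run_command(run_id=None):
    """Hypothetical stand-in for the CLI command body."""
    transformer = TransformerSketch(run_id=run_id)
    transformer.transform_and_write_output_files()


def run_until_stopped(run_id=None):
    """Run the command until the patched step raises; return the instance it was called on."""
    with mock.patch.object(
        TransformerSketch, "transform_and_write_output_files", autospec=True
    ) as mocked_transform_and_write:
        # Raising from the mock halts the command right after the transformer is built.
        mocked_transform_and_write.side_effect = Exception("stopping transformation")
        try:
            run_command(run_id=run_id)
        except Exception as exc:
            assert str(exc) == "stopping transformation"
        # With autospec=True the mock records the instance as its first argument.
        return mocked_transform_and_write.call_args.args[0]


explicit = run_until_stopped(run_id="abc-123")
minted = run_until_stopped()
print(explicit.run_id)
```

The call is recorded before the side effect raises, so the instance remains available for assertions even though the command itself never completes.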

ghukill requested review from ehanson8 and jonavellecuerdo November 19, 2024 20:30

ghukill commented Nov 19, 2024

ehanson8 approved these changes Nov 19, 2024

jonavellecuerdo approved these changes Nov 20, 2024

ghukill merged commit 68bef31 into main Nov 20, 2024
4 checks passed

ehanson8 deleted the TIMX-403-inputs-support-parquet-writing branch November 25, 2024 16:22
