-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TIMX 403 - Add "run_id" CLI argument and Transformer attribute #210
Conversation
Why these changes are being introduced: As we move into Transmogrifier writing to a parquet dataset, one important bit of information it will need is the concept of a "run id". This correlates directly to an "Execution UUID" that every StepFunction invocation produces. This identifier is then used when writing the records to the parquet dataset, allowing for quick and easy access to records associated with that identifier. There is a small many-to-one relationship that makes naming a bit awkward: each StepFunction invocation may run Transmogrifier multiple times (e.g. multiple input files). Each time it invokes Transmogrifier, the same "run_id" would be passed. This effectively groups the outputs of all Transmogrifier invocations in the same location in the parquet dataset. The language of this new "run_id" in Transmogrifier is intentionally somewhat high level, indicating it's just an identifier to associate with that invocation of the run. How this addresses that need: * Adds new CLI argument -r / --run-id * Transformer gets new attribute 'run_id' * Transformer mints a UUID of a run id is not passed, making the change backwards compatible and inconsequential if a run id is not passed Side effects of this change: * Going forward, invocations of Transmogrifier can use the run id as part of the parquet record writing. Until then, it has no effect. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-403
if not run_id: | ||
logger.info("explicit run_id not passed, minting new UUID") | ||
run_id = str(uuid.uuid4()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This, specifically, it what keeps this backwards compatible. If a run_id
is not passed via the CLI, which our current StepFunction will not do, a UUID is minted... and then just never used in any meaningful way.
But in addition to being backwards compatible, this is also kind of a nice-to-have for development, where passing a --run-id
is not necessary when invoking Transmog locally. A UUID will get minted, and if writing to parquet, that will get used in the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Just one question. 🤓
with mock.patch( | ||
"transmogrifier.sources.transformer.Transformer.transform_and_write_output_files" | ||
) as mocked_transform_and_write: | ||
mocked_transform_and_write.side_effect = Exception("stopping transformation") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the purpose of this side_effect
to force stop the transform
CLI command after loading the transformer (i.e., running the command until we can confirm that the run_id
attrib is set? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct! I'm sure there are other ways to do it... but this felt kind of convenient.
And, I suspect that method transform_and_write_output_files()
will be removed entirely by the parquet work, so I'm not feeling too precious about tests that interact with it.
Purpose and background context
As we move into Transmogrifier writing to a parquet dataset, one important bit of information it will need is the concept of a "run id". This correlates directly to an "Execution UUID" that every StepFunction invocation produces. This identifier is then used when writing the records to the parquet dataset, allowing for quick and easy access to records associated with that identifier.
There is a small many-to-one relationship that makes naming a bit awkward: each StepFunction invocation may run Transmogrifier multiple times (e.g. multiple input files). Each time it invokes Transmogrifier, the same
run_id
would be passed. This effectively groups the outputs of all Transmogrifier invocations under the samerun_id
partition in the parquet dataset. The language of this newrun_id
in Transmogrifier is intentionally somewhat high level, indicating it's just an identifier to associate with that invocation of the application.How this addresses that need:
-r / --run-id
run_id
Side effects of this change:
run_id
as part of the parquet record writing. Until then, it has no effect.How can a reviewer manually see the effects of these changes?
Until the
run_id
is utilized during writing, it has no effect on the output of records at this time. See the new unit tests for some examples of when and how it's passed.Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO (not yet)
What are the relevant tickets?
Developer
Code Reviewer(s)