-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timx 288 marc field method refactor 4 #203
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, mostly 1:1 refactoring of the code as-was to field methods.
I did leave a fairly length question re: get_publishers()
about the relatively complex list comprehension.
To summarize here: I'm more than comfortable proceeding with it as-is, but curious others thoughts on this.
transmogrifier/sources/xml/marc.py
Outdated
@classmethod | ||
def get_numbering(cls, source_record: Tag) -> str | None: | ||
if numbering_values := [ | ||
cls.create_subfield_value_string_from_datafield(datafield, "a", " ") | ||
for datafield in source_record.find_all("datafield", tag="362") | ||
]: | ||
return " ".join(numbering_values) or None | ||
return None | ||
|
||
@classmethod | ||
def get_physical_description(cls, source_record: Tag) -> str | None: | ||
if physical_description_values := [ | ||
cls.create_subfield_value_string_from_datafield(datafield, "abcefg", " ") | ||
for datafield in source_record.find_all("datafield", tag="300") | ||
]: | ||
return " ".join(physical_description_values) or None | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely okay with these as-is, but I'm noticing they are nearly identical (and maybe others?).
Wonder if a utility method on Marc
could be worth exploring that does this:
- get a list of subfield values for a given tag
- concatenate with a space
These might then collapse down into something like:
return concatenate_subfield_values_for_tag(tag:str, subfields:str)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, not proposing this needs to happen now. @jonavellecuerdo, I know you've mentioned some "cleanup" PRs, maybe this consideration could fall under that work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to apply this change now. See latest commit which includes the addition of a concatenate_subfield_value_strings_from_datafield
utility method. Field methods get_numbering
and get_physicial_description
are updated.
transmogrifier/sources/xml/marc.py
Outdated
[ | ||
timdex.Publisher( | ||
name=publisher_name.rstrip(".,") if publisher_name else None, | ||
date=publisher_date.rstrip(".,") if publisher_date else None, | ||
location=( | ||
publisher_location.rstrip(" :") | ||
if publisher_location | ||
else None | ||
), | ||
) | ||
for datafield in source_record.find_all( | ||
"datafield", tag=publisher_marc_field | ||
) | ||
if any( | ||
[ | ||
( | ||
publisher_name := cls.get_single_subfield_string( | ||
datafield, "b" | ||
) | ||
), | ||
( | ||
publisher_date := cls.get_single_subfield_string( | ||
datafield, "c" | ||
) | ||
), | ||
( | ||
publisher_location := cls.get_single_subfield_string( | ||
datafield, "a" | ||
) | ||
), | ||
] | ||
) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty wild list comprehension; starting to feel like it's own DSL!
I'm onboard with the degree to which we've leaned into list comprehensions for Transmog transforms, just given how common that is. And, I know we've added some if...
logic into a handful of them as well, to good effect.
I think where this one starts to tip for me are the ternary expressions like:
name=publisher_name.rstrip(".,") if publisher_name else None
which are based on the walrus operator variables in the if any(...)
block at the end of the comprehension.
If I'm thinking about this right, the only thing that would prevent us from something like the following is that we need to strip certain characters -- .,:
from the end of the string but we can't do that for None
values?
publishers.extend(
[
timdex.Publisher(
name=cls.get_single_subfield_string(datafield, "b"),
date=cls.get_single_subfield_string(datafield, "c"),
location=cls.get_single_subfield_string(datafield, "a"),
)
for datafield in source_record.find_all(
"datafield", tag=publisher_marc_field
)
]
)
Just to throw out another approach, what if we were to:
- create a "messy" list of publishers which may or may not have
None
for all three subfields, or strings with trailing puncuation.,:
- loop through this list and prune all
None
's, or.rstrip()
values if they do exist
My thinking is that it might feel a little more approachable. But curious what others think, as this is fairly easy enough to reason about after you take a moment to take it in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly minimal consideration, but worth mentioning: this list comprehension approach is possibly more performant than the "messy" list + prune approach outlined above.
If time and luxury affords, it could be interesting to add some time.time()
debugging variables in there to see how much of a difference it makes. If it's 0.01
difference, when you're talking 200k records in a single MARC batch, that ends up being 2000 seconds = 33 minutes!
Extend that across 20 MARC batches, and suddenly that adds 11 hours!? That can't possibly be right...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so I tried creating another method without list comprehension (basically what we had before but using walrus operators inside the if any ...
statement.
@classmethod
def get_publishers_updated(cls, source_record: Tag) -> list[timdex.Publisher] | None:
publishers = []
for publisher_marc_tag in ["260", "264"]:
for datafield in source_record.find_all("datafield", tag=publisher_marc_tag):
if any(
[
publisher_name := cls.get_single_subfield_string(datafield, "b"),
publisher_date := cls.get_single_subfield_string(datafield, "c"),
publisher_location := cls.get_single_subfield_string(
datafield, "a"
),
]
):
publishers.append(
timdex.Publisher(
name=publisher_name.rstrip(".,") if publisher_name else None,
date=publisher_date.rstrip(".,") if publisher_date else None,
location=(
publisher_location.rstrip(" :")
if publisher_location
else None
),
)
)
return publishers or None
Then I tried to time the difference between the functions and they are pretty similar. Worth noting, is that Marc.get_publishers_updated
is faster than the current field method with list comprehension. 😅 Here are the results from 5 sample runs:
RUN 1
=================================
Using current Marc.get_publishers
Time elapsed: 0.0016551017761230469
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015499591827392578
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
RUN 2
=================================
Using current Marc.get_publishers
Time elapsed: 0.0016760826110839844
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015530586242675781
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
RUN 3
=================================
Using current Marc.get_publishers
Time elapsed: 0.0016961097717285156
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0016319751739501953
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
RUN 4
=================================
Using current Marc.get_publishers
Time elapsed: 0.0016372203826904297
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015740394592285156
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
RUN 5
=================================
Using current Marc.get_publishers
Time elapsed: 0.001680135726928711
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015420913696289062
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Given this information, is it sufficient -- for purposes of the field method refactor -- to remove the list comprehension for this field method specifically? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_publisher_updated
is definitely readable to me and I think the walrus operators are a nice improvement, I'm fully in favor of that. I think we'll be better off if we tackle the speed issues comprehensively as a later step when we can exclusively focus on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah fascinating! Thanks for running the test @jonavellecuerdo. Agree with @ehanson8: the upated version is pretty easy to scan and understand (with my own reservations that a variable defined in a if any(...)
block is still kind of spooky-action-at-a-distance).
Re: speed, if the difference truly is 0.0001s
, that works out to shaving about 6 minutes for 3+ million records! But in all seriousness, clearly very similar, so feels like going with readability is best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you both for your input! I will change get_publisher
to the code example I shared above.
@ghukill I was surprised that it works! From this article on walrus operators , it reads:
the := operator gives you a new syntax for assigning variables in the middle of expressions.
The way I interpret it is the code inside the list passed to any(..)
are expressions and we're assigning the values of those expressions to a variable via the walrus operator. Some of the examples under "Walrus Operator Use Cases" look more complex than what we're doing with the if any(...)
block, which makes me feel that it's okay to use it as we do here! Let me know if that helps, @ghukill 🤔
[ | ||
timdex.RelatedItem( | ||
description=related_item_value.rstrip(" ."), | ||
relationship=related_item_marc_field["relationship"], | ||
) | ||
for datafield in source_record.find_all( | ||
"datafield", tag=related_item_marc_field["tag"] | ||
) | ||
if ( | ||
related_item_value := ( | ||
cls.create_subfield_value_string_from_datafield( | ||
datafield, | ||
related_item_marc_field["subfields"], | ||
" ", | ||
) | ||
) | ||
) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though this comprehension is significantly simpler, probably worth simultaneously considering with the commnent above for get_publishers()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if it's alright, I'll leave it as is for now. 🤔 Other field methods that use list comprehension and create a timdex.<field>
object with some string sanitation are get_identifiers
, get_locations
, and get_notes
. In thinking about what makes the case of get_publishers
different from these is the complex if any ...
block. 🤔 Definitely like where we ended up with get_publishers
though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments
transmogrifier/sources/xml/marc.py
Outdated
@classmethod | ||
def get_physical_description(cls, source_record: Tag) -> str | None: | ||
if physical_description_values := [ | ||
cls.create_subfield_value_string_from_datafield(datafield, "abcefg", " ") | ||
for datafield in source_record.find_all("datafield", tag="300") | ||
]: | ||
return " ".join(physical_description_values) or None | ||
return None | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern shows up enough that we might consider abstracting it out further for easier re-use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abstracted it has!
transmogrifier/sources/xml/marc.py
Outdated
name=publisher_name.rstrip(".,") if publisher_name else None, | ||
date=publisher_date.rstrip(".,") if publisher_date else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good update to .rstrip
, should have caught that given the ℗2022,
in the all fields test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing and running unit tests per field method really allows us to pick up on these small things that are easily missed when scrolling through the lengthy assertions of the older tests. 🤓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
return None | ||
return ( | ||
cls.concatenate_subfield_value_strings_from_datafield( | ||
source_record, tag="362", subfield_codes="abcefg" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good use of named args
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment about @statichmethod
vs @classmethod
on the new utility method, but otherwise looking great to me. I'm comfortable approving with or without this change.
transmogrifier/sources/xml/marc.py
Outdated
def concatenate_subfield_value_strings_from_datafield( | ||
source_record: Tag, tag: str, subfield_codes: str | ||
) -> str: | ||
return " ".join( | ||
Marc.create_subfield_value_string_from_datafield( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hesitant to request more changes on this PR, as this is kind of minor, but I might propose:
- making this a
@classmethod
- having it then call
cls.create_subfield_value_string_from_datafield(...)
I think unless there is good reason, having a staticmethod
call that class that it's defined on is probably a good indication it could/should be a @classmethod
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, nice change! I wouldn't be surprised if other parts of the MARC record could potentially use this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good point, I agree @classmethod
would be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for the heads up! Updated. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
f797a7a
to
61a84f2
Compare
Why these changes are being introduced: * These updates are required to implement the architecture described in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md How this addresses that need: * Added field methods and corresponding unit tests: numbering, physical_description, publication_frequency, publishers, related_items, subjects, summary * Add class method 'concatenate_subfield_value_strings_from_datafield' Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-288
61a84f2
to
19f03f5
Compare
Purpose and background context
Field method refactor for transform class Marc (Part 4).
Added field methods and corresponding unit tests for the following fields: [
numbering
,physical_description
,publication_frequency
,publishers
,related_items
,subjects
,summary
].As of this PR, all optional fields now have field methods. A final PR for tidying / cleanup will follow.
How can a reviewer manually see the effects of these changes?
Run
make test
and verify all unit tests are passing.Run CLI command
Output:
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
https://mitlibraries.atlassian.net/browse/TIMX-288
Developer
Code Reviewer(s)