
TIMX-288 marc field method refactor 4 #203

Merged
1 commit merged into main on Aug 1, 2024

Conversation

jonavellecuerdo (Contributor)

Purpose and background context

Field method refactor for transform class Marc (Part 4).

Added field methods and corresponding unit tests for the following fields: [numbering, physical_description, publication_frequency, publishers, related_items, subjects, summary].

As of this PR, all optional fields now have field methods. A final PR for tidying / cleanup will follow.

How can a reviewer manually see the effects of these changes?

  1. Run make test and verify all unit tests are passing.

  2. Run the CLI command:

    pipenv run transform -i tests/fixtures/marc/marc_record_all_fields.xml -o output/marc-transformed-records.json -s alma
    

    Output:

    2024-07-30 10:37:44,721 INFO transmogrifier.cli.main(): Logger 'root' configured with level=INFO
    2024-07-30 10:37:44,721 INFO transmogrifier.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
    2024-07-30 10:37:44,721 INFO transmogrifier.cli.main(): Running transform for source alma
    2024-07-30 10:37:45,261 INFO transmogrifier.cli.main(): Completed transform, total records processed: 1, transformed records: 1, skipped records: 0, deleted records: 0
    2024-07-30 10:37:45,261 INFO transmogrifier.cli.main(): Total time to complete transform: 0:00:00.540431
    

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-288

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this Jul 30, 2024
@jonavellecuerdo jonavellecuerdo requested a review from ghukill July 30, 2024 14:39
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review July 30, 2024 14:39
@jonavellecuerdo jonavellecuerdo requested a review from ehanson8 July 30, 2024 14:39
ghukill (Contributor) left a comment

Overall looks good, mostly a 1:1 refactoring of the code as it was into field methods.

I did leave a fairly lengthy question re: get_publishers() about the relatively complex list comprehension.

To summarize here: I'm more than comfortable proceeding with it as-is, but curious about others' thoughts on this.

Comment on lines 826 to 842
@classmethod
def get_numbering(cls, source_record: Tag) -> str | None:
    if numbering_values := [
        cls.create_subfield_value_string_from_datafield(datafield, "a", " ")
        for datafield in source_record.find_all("datafield", tag="362")
    ]:
        return " ".join(numbering_values) or None
    return None

@classmethod
def get_physical_description(cls, source_record: Tag) -> str | None:
    if physical_description_values := [
        cls.create_subfield_value_string_from_datafield(datafield, "abcefg", " ")
        for datafield in source_record.find_all("datafield", tag="300")
    ]:
        return " ".join(physical_description_values) or None
    return None
Contributor

I'm definitely okay with these as-is, but I'm noticing they are nearly identical (and maybe others are too?).

Wonder if a utility method on Marc could be worth exploring that does this:

  1. get a list of subfield values for a given tag
  2. concatenate with a space

These might then collapse down into something like:

return concatenate_subfield_values_for_tag(tag:str, subfields:str)
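
A minimal sketch of one way such a helper might look (hypothetical name and signature, just mirroring the existing field methods quoted above):

@classmethod
def concatenate_subfield_values_for_tag(
    cls, source_record: Tag, tag: str, subfield_codes: str
) -> str | None:
    # Collect the subfield value string for every datafield with the given
    # tag, then join the per-datafield strings with a single space.
    values = [
        cls.create_subfield_value_string_from_datafield(datafield, subfield_codes, " ")
        for datafield in source_record.find_all("datafield", tag=tag)
    ]
    return " ".join(values) or None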

Contributor

FWIW, not proposing this needs to happen now. @jonavellecuerdo, I know you've mentioned some "cleanup" PRs, maybe this consideration could fall under that work.

Contributor Author

Decided to apply this change now. See the latest commit, which includes the addition of a concatenate_subfield_value_strings_from_datafield utility method. Field methods get_numbering and get_physical_description are updated.
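
As a rough illustration only (the exact merged code is in the diff excerpts below), the collapsed field methods read roughly like:

@classmethod
def get_physical_description(cls, source_record: Tag) -> str | None:
    # Delegate the tag lookup and space-joining to the new utility method.
    return (
        cls.concatenate_subfield_value_strings_from_datafield(
            source_record, tag="300", subfield_codes="abcefg"
        )
        or None
    )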

Comment on lines 861 to 893
[
    timdex.Publisher(
        name=publisher_name.rstrip(".,") if publisher_name else None,
        date=publisher_date.rstrip(".,") if publisher_date else None,
        location=(
            publisher_location.rstrip(" :")
            if publisher_location
            else None
        ),
    )
    for datafield in source_record.find_all(
        "datafield", tag=publisher_marc_field
    )
    if any(
        [
            (
                publisher_name := cls.get_single_subfield_string(
                    datafield, "b"
                )
            ),
            (
                publisher_date := cls.get_single_subfield_string(
                    datafield, "c"
                )
            ),
            (
                publisher_location := cls.get_single_subfield_string(
                    datafield, "a"
                )
            ),
        ]
    )
]
Contributor

This is a pretty wild list comprehension; it's starting to feel like its own DSL!

I'm onboard with the degree to which we've leaned into list comprehensions for Transmog transforms, just given how common that is. And, I know we've added some if... logic into a handful of them as well, to good effect.

I think where this one starts to tip for me are the ternary expressions like:

name=publisher_name.rstrip(".,") if publisher_name else None

which are based on the walrus operator variables in the if any(...) block at the end of the comprehension.

If I'm thinking about this right, the only thing that would prevent us from doing something like the following is that we need to strip certain characters -- .,: -- from the end of the string, but we can't do that for None values?

publishers.extend(
    [
        timdex.Publisher(
            name=cls.get_single_subfield_string(datafield, "b"),
            date=cls.get_single_subfield_string(datafield, "c"),
            location=cls.get_single_subfield_string(datafield, "a"),
        )
        for datafield in source_record.find_all(
            "datafield", tag=publisher_marc_field
        )
    ]
)

Just to throw out another approach, what if we were to:

  1. create a "messy" list of publishers which may or may not have None for all three subfields, or strings with trailing puncuation .,:
  2. loop through this list and prune all None's, or .rstrip() values if they do exist

My thinking is that it might feel a little more approachable. But curious what others think, as this is easy enough to reason about once you take a moment to take it in.
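
A rough sketch of that two-step idea (illustrative only, not merged code; it reuses the get_single_subfield_string helper and the publishers list from the surrounding method):

# Step 1: build a "messy" list of candidate publisher values per datafield.
messy_publishers = [
    {
        "name": cls.get_single_subfield_string(datafield, "b"),
        "date": cls.get_single_subfield_string(datafield, "c"),
        "location": cls.get_single_subfield_string(datafield, "a"),
    }
    for datafield in source_record.find_all("datafield", tag=publisher_marc_field)
]

# Step 2: prune datafields with no publisher subfields at all, and strip
# trailing punctuation only where a value exists.
publishers.extend(
    timdex.Publisher(
        name=entry["name"].rstrip(".,") if entry["name"] else None,
        date=entry["date"].rstrip(".,") if entry["date"] else None,
        location=entry["location"].rstrip(" :") if entry["location"] else None,
    )
    for entry in messy_publishers
    if any(entry.values())
)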

Contributor

Possibly minimal consideration, but worth mentioning: this list comprehension approach is possibly more performant than the "messy" list + prune approach outlined above.

If time and luxury afford, it could be interesting to add some time.time() debugging variables in there to see how much of a difference it makes. If it's a 0.01s difference, when you're talking 200k records in a single MARC batch, that ends up being 2,000 seconds = 33 minutes!

Extend that across 20 MARC batches, and suddenly that adds 11 hours!? That can't possibly be right...
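
For example, a quick timing check could look something like this (hypothetical sketch; get_publishers_alternative stands in for whatever the second variant ends up being called):

import time

start = time.perf_counter()
result_current = Marc.get_publishers(source_record)
elapsed_current = time.perf_counter() - start

start = time.perf_counter()
result_alternative = Marc.get_publishers_alternative(source_record)
elapsed_alternative = time.perf_counter() - start

print(f"current: {elapsed_current:.6f}s | alternative: {elapsed_alternative:.6f}s")
assert result_current == result_alternative  # sanity check: same output either way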

Contributor Author

Hmm, so I tried creating another method without the list comprehension (basically what we had before, but using walrus operators inside the if any(...) statement).

    @classmethod
    def get_publishers_updated(cls, source_record: Tag) -> list[timdex.Publisher] | None:
        publishers = []
        for publisher_marc_tag in ["260", "264"]:
            for datafield in source_record.find_all("datafield", tag=publisher_marc_tag):
                if any(
                    [
                        publisher_name := cls.get_single_subfield_string(datafield, "b"),
                        publisher_date := cls.get_single_subfield_string(datafield, "c"),
                        publisher_location := cls.get_single_subfield_string(
                            datafield, "a"
                        ),
                    ]
                ):
                    publishers.append(
                        timdex.Publisher(
                            name=publisher_name.rstrip(".,") if publisher_name else None,
                            date=publisher_date.rstrip(".,") if publisher_date else None,
                            location=(
                                publisher_location.rstrip(" :")
                                if publisher_location
                                else None
                            ),
                        )
                    )
        return publishers or None

Then I tried to time the difference between the functions and they are pretty similar. Worth noting is that Marc.get_publishers_updated is consistently a bit faster than the current field method with the list comprehension. 😅 Here are the results from 5 sample runs:

RUN 1

=================================
Using current Marc.get_publishers
Time elapsed: 0.0016551017761230469
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015499591827392578
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================

RUN 2

=================================
Using current Marc.get_publishers
Time elapsed: 0.0016760826110839844
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015530586242675781
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================

RUN 3

=================================
Using current Marc.get_publishers
Time elapsed: 0.0016961097717285156
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0016319751739501953
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================

RUN 4

=================================
Using current Marc.get_publishers
Time elapsed: 0.0016372203826904297
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015740394592285156
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================

RUN 5

=================================
Using current Marc.get_publishers
Time elapsed: 0.001680135726928711
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015420913696289062
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================

Given this information, is it sufficient -- for purposes of the field method refactor -- to remove the list comprehension for this field method specifically? 🤔

Contributor

get_publishers_updated is definitely readable to me, and I think the walrus operators are a nice improvement; I'm fully in favor of that. I think we'll be better off if we tackle the speed issues comprehensively as a later step, when we can exclusively focus on that.

Contributor

Ah, fascinating! Thanks for running the test @jonavellecuerdo. Agree with @ehanson8: the updated version is pretty easy to scan and understand (with my own reservation that a variable defined in an if any(...) block is still kind of spooky action at a distance).

Re: speed, if the difference truly is 0.0001s, that works out to shaving only about 5 minutes across 3 million records! But in all seriousness, they're clearly very similar, so it feels like going with readability is best.

Contributor Author

Thank you both for your input! I will change get_publishers to the code example I shared above.

@ghukill I was surprised that it works! From this article on walrus operators, it reads:

the := operator gives you a new syntax for assigning variables in the middle of expressions.

The way I interpret it, the items in the list passed to any(...) are expressions, and we're assigning the values of those expressions to variables via the walrus operator. Some of the examples under "Walrus Operator Use Cases" look more complex than what we're doing with the if any(...) block, which makes me feel that it's okay to use it as we do here! Let me know if that helps, @ghukill 🤔
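
A toy example (not from the codebase) of that pattern:

subfields = {"b": "Wiley", "c": None, "a": None}

# Each list item is an assignment expression: the walrus operator binds the
# name, and any() then checks whether at least one bound value is truthy.
if any(
    [
        name := subfields.get("b"),
        date := subfields.get("c"),
        location := subfields.get("a"),
    ]
):
    print(name, date, location)  # -> Wiley None None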

Comment on lines +944 to +958
[
    timdex.RelatedItem(
        description=related_item_value.rstrip(" ."),
        relationship=related_item_marc_field["relationship"],
    )
    for datafield in source_record.find_all(
        "datafield", tag=related_item_marc_field["tag"]
    )
    if (
        related_item_value := (
            cls.create_subfield_value_string_from_datafield(
                datafield,
                related_item_marc_field["subfields"],
                " ",
            )
        )
    )
]
Contributor

Though this comprehension is significantly simpler, it's probably worth considering alongside the comment above for get_publishers().

Contributor Author

Hmm, if it's alright, I'll leave it as is for now. 🤔 Other field methods that use a list comprehension and create a timdex.<field> object with some string sanitization are get_identifiers, get_locations, and get_notes. What makes get_publishers different from these, I think, is the complex if any(...) block. 🤔 Definitely like where we ended up with get_publishers though!

ehanson8 (Contributor) left a comment

A few comments

Comment on lines 960 to 1021
@classmethod
def get_physical_description(cls, source_record: Tag) -> str | None:
    if physical_description_values := [
        cls.create_subfield_value_string_from_datafield(datafield, "abcefg", " ")
        for datafield in source_record.find_all("datafield", tag="300")
    ]:
        return " ".join(physical_description_values) or None
    return None

Contributor

This pattern shows up enough that we might consider abstracting it out further for easier re-use

Contributor Author

Abstracted it has!

Comment on lines 964 to 965
name=publisher_name.rstrip(".,") if publisher_name else None,
date=publisher_date.rstrip(".,") if publisher_date else None,
Contributor

Good update to .rstrip(); should have caught that given the ℗2022, in the all-fields test.

Contributor Author

Writing and running unit tests per field method really allows us to pick up on these small things that are easily missed when scrolling through the lengthy assertions of the older tests. 🤓

transmogrifier/sources/xml/marc.py (outdated comment thread, resolved)
ehanson8 (Contributor) left a comment

Great work!

return None
return (
    cls.concatenate_subfield_value_strings_from_datafield(
        source_record, tag="362", subfield_codes="abcefg"
Contributor

Good use of named args

ghukill (Contributor) left a comment

Left a comment about @staticmethod vs @classmethod on the new utility method, but otherwise looking great to me. I'm comfortable approving with or without this change.

Comment on lines 157 to 161
def concatenate_subfield_value_strings_from_datafield(
    source_record: Tag, tag: str, subfield_codes: str
) -> str:
    return " ".join(
        Marc.create_subfield_value_string_from_datafield(
Contributor

I'm hesitant to request more changes on this PR, as this is kind of minor, but I might propose:

  • making this a @classmethod
  • having it then call cls.create_subfield_value_string_from_datafield(...)

I think unless there is good reason, having a staticmethod call the class it's defined on is probably a good indication that it could/should be a @classmethod.
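
Roughly what that might look like (a sketch; the body beyond the truncated snippet above is assumed from the other field methods):

@classmethod
def concatenate_subfield_value_strings_from_datafield(
    cls, source_record: Tag, tag: str, subfield_codes: str
) -> str:
    # Same behavior, but bound to the class so the helper it calls is
    # referenced via cls rather than the hardcoded Marc name.
    return " ".join(
        cls.create_subfield_value_string_from_datafield(datafield, subfield_codes, " ")
        for datafield in source_record.find_all("datafield", tag=tag)
    )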

Contributor

That said, nice change! I wouldn't be surprised if other parts of the MARC record could potentially use this...

Contributor

Ah, good point, I agree @classmethod would be better

Contributor Author

Ah, thanks for the heads up! Updated. 😄

Contributor

Looks good!

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-288-marc-field-method-refactor-4 branch from f797a7a to 61a84f2 on August 1, 2024 14:44
Why these changes are being introduced:
* These updates are required to implement the architecture described
in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md

How this addresses that need:
* Added field methods and corresponding unit tests:
  numbering, physical_description, publication_frequency,
  publishers, related_items, subjects, summary
* Add class method 'concatenate_subfield_value_strings_from_datafield'

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-288
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-288-marc-field-method-refactor-4 branch from 61a84f2 to 19f03f5 on August 1, 2024 14:46
@jonavellecuerdo jonavellecuerdo merged commit c6ac913 into main Aug 1, 2024
5 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-288-marc-field-method-refactor-4 branch August 1, 2024 15:01