
TIMX-288 marc field method refactor 4 #203

Merged
1 commit merged into main on Aug 1, 2024

Conversation

jonavellecuerdo (Contributor)

Purpose and background context

Field method refactor for transform class Marc (Part 4).

Added field methods and corresponding unit tests for the following fields: [numbering, physical_description, publication_frequency, publishers, related_items, subjects, summary].

As of this PR, all optional fields now have field methods. A final PR for tidying / cleanup will follow.

How can a reviewer manually see the effects of these changes?

  1. Run make test and verify all unit tests are passing.

  2. Run the CLI command:

    pipenv run transform -i tests/fixtures/marc/marc_record_all_fields.xml -o output/marc-transformed-records.json -s alma
    

    Output:

    2024-07-30 10:37:44,721 INFO transmogrifier.cli.main(): Logger 'root' configured with level=INFO
    2024-07-30 10:37:44,721 INFO transmogrifier.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
    2024-07-30 10:37:44,721 INFO transmogrifier.cli.main(): Running transform for source alma
    2024-07-30 10:37:45,261 INFO transmogrifier.cli.main(): Completed transform, total records processed: 1, transformed records: 1, skipped records: 0, deleted records: 0
    2024-07-30 10:37:45,261 INFO transmogrifier.cli.main(): Total time to complete transform: 0:00:00.540431
    

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-288

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this Jul 30, 2024
@jonavellecuerdo jonavellecuerdo requested a review from ghukill July 30, 2024 14:39
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review July 30, 2024 14:39
@jonavellecuerdo jonavellecuerdo requested a review from ehanson8 July 30, 2024 14:39
ghukill (Contributor) left a comment

Overall looks good, mostly a 1:1 refactoring of the code as it was into field methods.

I did leave a fairly lengthy question re: get_publishers() about the relatively complex list comprehension.

To summarize here: I'm more than comfortable proceeding with it as-is, but curious about others' thoughts on this.

Comment on lines 826 to 842
@classmethod
def get_numbering(cls, source_record: Tag) -> str | None:
    if numbering_values := [
        cls.create_subfield_value_string_from_datafield(datafield, "a", " ")
        for datafield in source_record.find_all("datafield", tag="362")
    ]:
        return " ".join(numbering_values) or None
    return None

@classmethod
def get_physical_description(cls, source_record: Tag) -> str | None:
    if physical_description_values := [
        cls.create_subfield_value_string_from_datafield(datafield, "abcefg", " ")
        for datafield in source_record.find_all("datafield", tag="300")
    ]:
        return " ".join(physical_description_values) or None
    return None
Contributor

I'm definitely okay with these as-is, but I'm noticing they are nearly identical (and maybe others are too?).

Wonder if a utility method on Marc could be worth exploring that does this:

  1. get a list of subfield values for a given tag
  2. concatenate with a space

These might then collapse down into something like:

return concatenate_subfield_values_for_tag(tag:str, subfields:str)
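
A minimal sketch of one way such a helper might look (hypothetical name and signature, just mirroring the existing field methods quoted above):

@classmethod
def concatenate_subfield_values_for_tag(
    cls, source_record: Tag, tag: str, subfield_codes: str
) -> str | None:
    # Collect the subfield value string for every datafield with the given
    # tag, then join the per-datafield strings with a single space.
    values = [
        cls.create_subfield_value_string_from_datafield(datafield, subfield_codes, " ")
        for datafield in source_record.find_all("datafield", tag=tag)
    ]
    return " ".join(values) or None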

Contributor

FWIW, not proposing this needs to happen now. @jonavellecuerdo, I know you've mentioned some "cleanup" PRs, maybe this consideration could fall under that work.

Contributor Author

Decided to apply this change now. See the latest commit, which includes the addition of a concatenate_subfield_value_strings_from_datafield utility method. Field methods get_numbering and get_physical_description are updated.
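
As a rough illustration only (the exact merged code is in the diff excerpts below), the collapsed field methods read roughly like:

@classmethod
def get_physical_description(cls, source_record: Tag) -> str | None:
    # Delegate the tag lookup and space-joining to the new utility method.
    return (
        cls.concatenate_subfield_value_strings_from_datafield(
            source_record, tag="300", subfield_codes="abcefg"
        )
        or None
    )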

Comment on lines 861 to 893
[
    timdex.Publisher(
        name=publisher_name.rstrip(".,") if publisher_name else None,
        date=publisher_date.rstrip(".,") if publisher_date else None,
        location=(
            publisher_location.rstrip(" :")
            if publisher_location
            else None
        ),
    )
    for datafield in source_record.find_all(
        "datafield", tag=publisher_marc_field
    )
    if any(
        [
            (
                publisher_name := cls.get_single_subfield_string(
                    datafield, "b"
                )
            ),
            (
                publisher_date := cls.get_single_subfield_string(
                    datafield, "c"
                )
            ),
            (
                publisher_location := cls.get_single_subfield_string(
                    datafield, "a"
                )
            ),
        ]
    )
]
Contributor

This is a pretty wild list comprehension; it's starting to feel like its own DSL!

I'm onboard with the degree to which we've leaned into list comprehensions for Transmog transforms, just given how common that is. And, I know we've added some if... logic into a handful of them as well, to good effect.

I think where this one starts to tip for me are the ternary expressions like:

name=publisher_name.rstrip(".,") if publisher_name else None

which are based on the walrus operator variables in the if any(...) block at the end of the comprehension.

If I'm thinking about this right, the only thing that would prevent us from doing something like the following is that we need to strip certain characters -- .,: -- from the end of the string, but we can't do that for None values?

publishers.extend(
    [
        timdex.Publisher(
            name=cls.get_single_subfield_string(datafield, "b"),
            date=cls.get_single_subfield_string(datafield, "c"),
            location=cls.get_single_subfield_string(datafield, "a"),
        )
        for datafield in source_record.find_all(
            "datafield", tag=publisher_marc_field
        )
    ]
)

Just to throw out another approach, what if we were to:

  1. create a "messy" list of publishers which may or may not have None for all three subfields, or strings with trailing puncuation .,:
  2. loop through this list and prune all None's, or .rstrip() values if they do exist

My thinking is that it might feel a little more approachable. But curious what others think, as this is easy enough to reason about once you take a moment to take it in.
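
A rough sketch of that two-step idea (illustrative only, not merged code; it reuses the get_single_subfield_string helper and the publishers list from the surrounding method):

# Step 1: build a "messy" list of candidate publisher values per datafield.
messy_publishers = [
    {
        "name": cls.get_single_subfield_string(datafield, "b"),
        "date": cls.get_single_subfield_string(datafield, "c"),
        "location": cls.get_single_subfield_string(datafield, "a"),
    }
    for datafield in source_record.find_all("datafield", tag=publisher_marc_field)
]

# Step 2: prune datafields with no publisher subfields at all, and strip
# trailing punctuation only where a value exists.
publishers.extend(
    timdex.Publisher(
        name=entry["name"].rstrip(".,") if entry["name"] else None,
        date=entry["date"].rstrip(".,") if entry["date"] else None,
        location=entry["location"].rstrip(" :") if entry["location"] else None,
    )
    for entry in messy_publishers
    if any(entry.values())
)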

Contributor

Possibly minimal consideration, but worth mentioning: this list comprehension approach is possibly more performant than the "messy" list + prune approach outlined above.

If time and luxury afford, it could be interesting to add some time.time() debugging variables in there to see how much of a difference it makes. If it's a 0.01s difference, when you're talking 200k records in a single MARC batch, that ends up being 2,000 seconds = 33 minutes!

Extend that across 20 MARC batches, and suddenly that adds 11 hours!? That can't possibly be right...
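
For example, a quick timing check could look something like this (hypothetical sketch; get_publishers_alternative stands in for whatever the second variant ends up being called):

import time

start = time.perf_counter()
result_current = Marc.get_publishers(source_record)
elapsed_current = time.perf_counter() - start

start = time.perf_counter()
result_alternative = Marc.get_publishers_alternative(source_record)
elapsed_alternative = time.perf_counter() - start

print(f"current: {elapsed_current:.6f}s | alternative: {elapsed_alternative:.6f}s")
assert result_current == result_alternative  # sanity check: same output either way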

Contributor Author

Hmm, so I tried creating another method without the list comprehension (basically what we had before, but using walrus operators inside the if any(...) statement).

    @classmethod
    def get_publishers_updated(cls, source_record: Tag) -> list[timdex.Publisher] | None:
        publishers = []
        for publisher_marc_tag in ["260", "264"]:
            for datafield in source_record.find_all("datafield", tag=publisher_marc_tag):
                if any(
                    [
                        publisher_name := cls.get_single_subfield_string(datafield, "b"),
                        publisher_date := cls.get_single_subfield_string(datafield, "c"),
                        publisher_location := cls.get_single_subfield_string(
                            datafield, "a"
                        ),
                    ]
                ):
                    publishers.append(
                        timdex.Publisher(
                            name=publisher_name.rstrip(".,") if publisher_name else None,
                            date=publisher_date.rstrip(".,") if publisher_date else None,
                            location=(
                                publisher_location.rstrip(" :")
                                if publisher_location
                                else None
                            ),
                        )
                    )
        return publishers or None

Then I tried to time the difference between the functions and they are pretty similar. Worth noting is that Marc.get_publishers_updated is consistently a bit faster than the current field method with the list comprehension. 😅 Here are the results from 5 sample runs:

RUN 1

=================================
Using current Marc.get_publishers
Time elapsed: 0.0016551017761230469
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015499591827392578
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================

RUN 2

=================================
Using current Marc.get_publishers
Time elapsed: 0.0016760826110839844
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015530586242675781
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================

RUN 3

=================================
Using current Marc.get_publishers
Time elapsed: 0.0016961097717285156
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0016319751739501953
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================

RUN 4

=================================
Using current Marc.get_publishers
Time elapsed: 0.0016372203826904297
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015740394592285156
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================

RUN 5

=================================
Using current Marc.get_publishers
Time elapsed: 0.001680135726928711
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================
Using updated Marc.get_publishers
Time elapsed: 0.0015420913696289062
Transformed: [Publisher(name='New Press', date='2005', location='New York'), Publisher(name='Wiley', date='c1992', location='New York'), Publisher(name='Alpha', date='[2022]', location='France'), Publisher(name=None, date='℗2022', location=None)]
=================================

Given this information, is it sufficient -- for purposes of the field method refactor -- to remove the list comprehension for this field method specifically? 🤔

Contributor

get_publishers_updated is definitely readable to me, and I think the walrus operators are a nice improvement; I'm fully in favor of that. I think we'll be better off if we tackle the speed issues comprehensively as a later step, when we can exclusively focus on that.

Contributor

Ah, fascinating! Thanks for running the test @jonavellecuerdo. Agree with @ehanson8: the updated version is pretty easy to scan and understand (with my own reservation that a variable defined in an if any(...) block is still kind of spooky action at a distance).

Re: speed, if the difference truly is 0.0001s, that works out to shaving only about 5 minutes across 3 million records! But in all seriousness, they're clearly very similar, so it feels like going with readability is best.

Contributor Author

Thank you both for your input! I will change get_publishers to the code example I shared above.

@ghukill I was surprised that it works! From this article on walrus operators, it reads:

the := operator gives you a new syntax for assigning variables in the middle of expressions.

The way I interpret it, the items in the list passed to any(...) are expressions, and we're assigning the values of those expressions to variables via the walrus operator. Some of the examples under "Walrus Operator Use Cases" look more complex than what we're doing with the if any(...) block, which makes me feel that it's okay to use it as we do here! Let me know if that helps, @ghukill 🤔
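
A toy example (not from the codebase) of that pattern:

subfields = {"b": "Wiley", "c": None, "a": None}

# Each list item is an assignment expression: the walrus operator binds the
# name, and any() then checks whether at least one bound value is truthy.
if any(
    [
        name := subfields.get("b"),
        date := subfields.get("c"),
        location := subfields.get("a"),
    ]
):
    print(name, date, location)  # -> Wiley None None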

Comment on lines +944 to +958
[
    timdex.RelatedItem(
        description=related_item_value.rstrip(" ."),
        relationship=related_item_marc_field["relationship"],
    )
    for datafield in source_record.find_all(
        "datafield", tag=related_item_marc_field["tag"]
    )
    if (
        related_item_value := (
            cls.create_subfield_value_string_from_datafield(
                datafield,
                related_item_marc_field["subfields"],
                " ",
            )
        )
    )
]
Contributor

Though this comprehension is significantly simpler, it's probably worth considering alongside the comment above for get_publishers().

Contributor Author

Hmm, if it's alright, I'll leave it as is for now. 🤔 Other field methods that use a list comprehension and create a timdex.<field> object with some string sanitization are get_identifiers, get_locations, and get_notes. What makes get_publishers different from these, I think, is the complex if any(...) block. 🤔 Definitely like where we ended up with get_publishers though!

ehanson8 (Contributor) left a comment

A few comments

Comment on lines 960 to 1021
@classmethod
def get_physical_description(cls, source_record: Tag) -> str | None:
    if physical_description_values := [
        cls.create_subfield_value_string_from_datafield(datafield, "abcefg", " ")
        for datafield in source_record.find_all("datafield", tag="300")
    ]:
        return " ".join(physical_description_values) or None
    return None

Contributor

This pattern shows up enough that we might consider abstracting it out further for easier re-use

Contributor Author

Abstracted it has!

Comment on lines 964 to 965
name=publisher_name.rstrip(".,") if publisher_name else None,
date=publisher_date.rstrip(".,") if publisher_date else None,
Contributor

Good update to .rstrip(); should have caught that given the ℗2022, in the all-fields test.

Contributor Author

Writing and running unit tests per field method really allows us to pick up on these small things that are easily missed when scrolling through the lengthy assertions of the older tests. 🤓

transmogrifier/sources/xml/marc.py (outdated comment thread, resolved)
ehanson8 (Contributor) left a comment

Great work!

return None
return (
    cls.concatenate_subfield_value_strings_from_datafield(
        source_record, tag="362", subfield_codes="abcefg"
Contributor

Good use of named args

ghukill (Contributor) left a comment

Left a comment about @staticmethod vs @classmethod on the new utility method, but otherwise looking great to me. I'm comfortable approving with or without this change.

Comment on lines 157 to 161
def concatenate_subfield_value_strings_from_datafield(
    source_record: Tag, tag: str, subfield_codes: str
) -> str:
    return " ".join(
        Marc.create_subfield_value_string_from_datafield(
Contributor

I'm hesitant to request more changes on this PR, as this is kind of minor, but I might propose:

  • making this a @classmethod
  • having it then call cls.create_subfield_value_string_from_datafield(...)

I think unless there is good reason, having a staticmethod call the class it's defined on is probably a good indication that it could/should be a @classmethod.
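
Roughly what that might look like (a sketch; the body beyond the truncated snippet above is assumed from the other field methods):

@classmethod
def concatenate_subfield_value_strings_from_datafield(
    cls, source_record: Tag, tag: str, subfield_codes: str
) -> str:
    # Same behavior, but bound to the class so the helper it calls is
    # referenced via cls rather than the hardcoded Marc name.
    return " ".join(
        cls.create_subfield_value_string_from_datafield(datafield, subfield_codes, " ")
        for datafield in source_record.find_all("datafield", tag=tag)
    )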

Contributor

That said, nice change! I wouldn't be surprised if other parts of the MARC record could potentially use this...

Contributor

Ah, good point, I agree @classmethod would be better

Contributor Author

Ah, thanks for the heads up! Updated. 😄

Contributor

Looks good!

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-288-marc-field-method-refactor-4 branch from f797a7a to 61a84f2 on August 1, 2024 14:44
Why these changes are being introduced:
* These updates are required to implement the architecture described
in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md

How this addresses that need:
* Added field methods and corresponding unit tests:
  numbering, physical_description, publication_frequency,
  publishers, related_items, subjects, summary
* Add class method 'concatenate_subfield_value_strings_from_datafield'

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-288
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-288-marc-field-method-refactor-4 branch from 61a84f2 to 19f03f5 on August 1, 2024 14:46
@jonavellecuerdo jonavellecuerdo merged commit c6ac913 into main Aug 1, 2024
5 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-288-marc-field-method-refactor-4 branch August 1, 2024 15:01