-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add 1st set of Datacite field methods #178
Conversation
Why these changes are being introduced: * Refactor Datacite to use field methods How this addresses that need: * Add comments to organize conftest.py by source * Add additional Datacite fixtures to conftest.py * Update datacite_record_all_fields.xml to align with other source fixtures * Add create_datacite_source_record_stub function to Datacite test module * Rename param xml > source_record * Add field methods and associated private methods for alternate_titles, content_type, and contributors * Add unit tests for new field methods * Shift note-related code block from content_type to notes Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-284
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think it looks great!
Left a few comments, but mostly just open questions and musings given this is one of the first PRs in this series.
I'd likely just outright approve with my questions for discussion, but I was curious about one particular comment about the self.get_<field_method>() or None
pattern, pondering if it might be a good practice early here to make the field methods fully encapsulated to provide the final value.
tests/sources/xml/test_datacite.py
Outdated
def create_datacite_source_record_stub(xml_insert: str) -> BeautifulSoup: | ||
xml_string = f""" | ||
<records> | ||
<record xmlns="http://www.openarchives.org/OAI/2.0/" | ||
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> | ||
<metadata> | ||
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" | ||
xmlns="http://datacite.org/schema/kernel-4" | ||
xsi:schemaLocation="http://datacite.org/schema/kernel-4 | ||
http://schema.datacite.org/meta/kernel-4.1/metadata.xsd"> | ||
{xml_insert} | ||
</resource> | ||
</metadata> | ||
</record> | ||
</records> | ||
""" | ||
return BeautifulSoup(xml_string, "xml") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I still like this pattern. I like that the XML template (let's call it) is fairly complete, such that any BS4 (or lxml) logic is accurately tested, including namespaces, DOM traversal, etc.
That said, wondering if there is an opportunity to abstract this a bit for more shared code between source tests? Imagining standlone XML files for each source as the template, then a shared utility function to a) read the template file, and b) pass in whatever elements will be tested.
Might be an interesting discussion at some point. But I think this method is readily understandable, lightweight, and could be a good stepping stone to that kind of centralization once all the edge cases are known.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion and that feels like an efficiency that would serve us better earlier in the refactor than later, @jonavellecuerdo your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a ticket and I would propose one of us pick it up soon to save future refactoring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assigned myself to the ticket!
* Shift "or None" logic to field methods and update type hinting * Update unit tests to all use stub records * Remove unnecessary fixtures from conftest.py
@ghukill @jonavellecuerdo Pushed a new commit with the changes we discussed |
* Add default value to create_datacite_source_record_stub's xml_insert param * Replace calls of source_record_id variable with calls of get_source_record_id method inside field methods
@ghukill @jonavellecuerdo Another update after our very helpful discussion this morning! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ehanson8 Just have a couple of questions and suggestions for changes. Let me know what you think! The Datacite
transform has some complex logic, and I appreciate your efforts to make these updates as clear and digestible as possible. :)
# aardvark ########################## | ||
|
||
|
||
@pytest.fixture | ||
def aardvark_records(): | ||
return JSONTransformer.parse_source_file("tests/fixtures/aardvark_records.jsonl") | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not relevant to this transform, can it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set up those comments (including marc, oaidc, and timdex) to organize the existing fixtures conftest
by source so I'd prefer to leave it
alternate_titles.extend(list(cls._get_additional_titles(source_record))) | ||
return alternate_titles or None | ||
|
||
@classmethod | ||
def _get_additional_titles( | ||
cls, source_record: Tag | ||
) -> Iterator[timdex.AlternateTitle]: | ||
"""Get additional titles from get_main_titles method.""" | ||
for index, title in enumerate(cls.get_main_titles(source_record)): | ||
if index > 0: | ||
yield timdex.AlternateTitle(value=title) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the creation of the private method _get_additional_titles()
brings the number of title
-related methods to 4:
get_valid_title
get_main_titles
get_alternate_titles
_get_additional_titles
Because of this, I personally prefer moving the logic in _get_additional_titles()
private method into get_alternate_titles()
and adding a comment instead. If not, maybe rename to _get_additional_alternate_titles()
.
That said, I think we did kinda' discuss that get_valid_title()
might be revisited at some point...
Curious what @ghukill thinks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising @jonavellecuerdo. Without comment on your proposal above, I see something else in these title parsings that also stood out, after some hopping back and forth.
Looks like maybe the same XML parsing is used here in get_main_titles() and then again here in get_alternate_titles()? The difference being the first is looking for strings, and the second is returning timdex.AlternateTitle
instances.
I don't know what the solution is, but something is feeling a bit wonky here.
Wouldn't this result in duplicating values for "main titles" and "alternate titles"? Does that have some bearing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another one where I agree 100% but I think we're better off creating an issue to address this separately. This is getting too close to core logic that I'm hesitant to mess with it during the field method refactor. Agree or disagree with that approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghukill I think the difference between get_main_titles()
and get_alternate_titles
is:
get_main_titles()
retrieves values where attribute@titleType
is None.get_alternate_titles()
retrieves values where attribute@titleType
exists._get_additional_titles
retrieves all values fromget_main_titles()
, excluding the first item in the list (if there are multiple values returned forget_main_titles()
, the first value is used fortimdexRecord.title
.
That said, @ehanson8 you make a good point! This is stepping into the logic as opposed to the field method refactor work. I started an issue for Datacite transform logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed @ehanson8, and thanks for the eagle eyes @jonavellecuerdo! This is like the 100th time this week I've missed a not
in code. I might need new glasses.
Thanks for starting the issue. It does feel like something that could get untangled and streamlined, but agree with all comments above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for creating the issue, though I would argue that title
issue goes much deeper than just Datacite
so you could frame it more broadly as Refactor title logic
. I think other transforms may be equally messy re: titles
else: | ||
logger.warning( | ||
"Datacite record %s missing required Datacite field resourceType", | ||
cls.get_source_record_id(source_record), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that content_type
is not a required field, I would opt to remove lines 287-291. The warning is confusing as it is technically "not required". This might've been related to our discussion earlier about "useful logging"? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this should go but propose we create an issue of "evaluate logging for usefulness" to handle it globally (with other items flagged in the comments as they are encountered) and not bog down the field method refactor. The origin of this is that resourceType
is required by the Datacite
schema, but we've never done anything with it and it's incongruous with how transmogrifier
logging has developed since. Datacite
was the first transform written and the context was the RDI project, which is no longer our context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you for clarifying! Created an issue for future work and discussion. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for creating the issue!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
@@ -261,7 +261,7 @@ def get_alternate_titles(cls, source_record: Tag) -> list[timdex.AlternateTitle] | |||
] | |||
) | |||
alternate_titles.extend(list(cls._get_additional_titles(source_record))) | |||
return alternate_titles | |||
return alternate_titles or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this time, I do like this full encapsulation, thanks for making these changes.
Relevant here and elsewhere...
Once we get into orchestration, which I think touches on the instantiation of the TIMDEXRecord
instance, it might be worth considering if maybe it's okay to return an empty list? If so, and that was filtered out during serialization to JSON, then we could probably drop a bunch of these conversions to None
. Just noting for the future, but again, think this is great for now, and in the spirit of the incremental approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am most likely onboard with that approach when we get there!
Purpose and background context
Refactors the first set of
Datacite
fields as separate field methods. As this is the first of many of these PRs that DataEng will be writing/reviewing, I encourage nit-picking so we can refine the overall approach early.How can a reviewer manually see the effects of these changes?
Run the following command to see that the
Datacite
transform still transforms a source file:Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)