-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
date-whitespace-bug-fix #56
Conversation
Why these changes are being introduced: * There was a bug related to whitespace in date values. strip() was performed before the validation but not on the original value thus allowing values with whitespace to be passed to TIM where they caused errors. This update ensures that all date values are stripped before being passed to validate_date and removes the strip() call from that function. How this addresses that need: * Remove strip() call from validate_date function and update log message * Update source transforms with strip() calls on all date values * Update unit tests and add new test for whitespace Side effects of this change: * None Relevant ticket(s): * NA
tests/test_helpers.py
Outdated
) in caplog.text | ||
|
||
|
||
def test_validate_date_whitespace_date_logs_error(caplog): | ||
assert validate_date("30 ", "1234") is False | ||
assert ("Record # '1234' has a date that couldn't be parsed: '30 '") in caplog.text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not directly related to this line, but are we sure OpenSearch is treating "30" as a date the same way it might be meant in MARC? Adding the strip() prevents the problem... but are we ending up with data that matches the intent of the cataloger? I'm not saying we aren't, but since there is no test here to clarify these types of edge cases and whether we want "30" to be valid I'm not sure if we are doing the correct thing by just stripping and not confirming the intent remains for these yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a record id we can compare in OpenSearch with the record in Alma to ensure the two systems are treating two digit dates the same way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed a new commit so that 2 character year values are no longer accepted and will now trigger the date-can't-be-parsed message
* Remove 2 character year values as an accepted date format * Update corresponding unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any time you are using a NavigableString
as a value, you need to convert it to a regular Python string per the bs4 docs. There's a ticket to make that change across the repo, but you should go ahead and update any code you're already touching, including all of these dates.
* Convert NavgiableStrings to str in datacite, dspace_dim, and dspace_mets transforms * Add date validation to dspace_dim transform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See inline comments.
* Update datacite transform to fix errors and add validation to publicationYear * Update dspace_dim transform for code consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Helpful background context
Though
Marc
is the transform most likely to encounter this issue, I updated the date processing for all of the transforms. Let me know if this overkill.How can a reviewer manually see the effects of these changes?
I uploaded a file with a whitespace date to dev1. Run the following command on
main
to see that the whitespace is in the output and then withdate-whitespace-bug-fix
to see that the date appears without whitespace:Developer
Code Reviewer
(not just this pull request message)
Includes new or updated dependencies?
NO