Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

date-whitespace-bug-fix #56

Merged
merged 4 commits into from
Mar 15, 2023
Merged

date-whitespace-bug-fix #56

merged 4 commits into from
Mar 15, 2023

Conversation

ehanson8
Copy link
Contributor

Helpful background context

Though Marc is the transform most likely to encounter this issue, I updated the date processing for all of the transforms. Let me know if this overkill.

How can a reviewer manually see the effects of these changes?

I uploaded a file with a whitespace date to dev1. Run the following command on main to see that the whitespace is in the output and then with date-whitespace-bug-fix to see that the date appears without whitespace:

pipenv run transform -i s3://timdex-extract-dev-222053980223/alma/date-error.xml -o output/date-error-transformed-records.json -s alma

Developer

  • All new ENV is documented in README
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Includes new or updated dependencies?

NO

Why these changes are being introduced:
* There was a bug related to whitespace in date values. strip() was performed before the validation but not on the original value thus allowing values with whitespace to be passed to TIM where they caused errors. This update ensures that all date values are stripped before being passed to validate_date and removes the strip() call from that function.

How this addresses that need:
* Remove strip() call from validate_date function and update log message
* Update source transforms with strip() calls on all date values
* Update unit tests and add new test for whitespace

Side effects of this change:
* None

Relevant ticket(s):
* NA
@ehanson8 ehanson8 requested review from JPrevost and hakbailey March 10, 2023 18:24
) in caplog.text


def test_validate_date_whitespace_date_logs_error(caplog):
assert validate_date("30 ", "1234") is False
assert ("Record # '1234' has a date that couldn't be parsed: '30 '") in caplog.text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly related to this line, but are we sure OpenSearch is treating "30" as a date the same way it might be meant in MARC? Adding the strip() prevents the problem... but are we ending up with data that matches the intent of the cataloger? I'm not saying we aren't, but since there is no test here to clarify these types of edge cases and whether we want "30" to be valid I'm not sure if we are doing the correct thing by just stripping and not confirming the intent remains for these yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a record id we can compare in OpenSearch with the record in Alma to ensure the two systems are treating two digit dates the same way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a new commit so that 2 character year values are no longer accepted and will now trigger the date-can't-be-parsed message

* Remove 2 character year values as an accepted date format
* Update corresponding unit tests
Copy link
Contributor

@hakbailey hakbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any time you are using a NavigableString as a value, you need to convert it to a regular Python string per the bs4 docs. There's a ticket to make that change across the repo, but you should go ahead and update any code you're already touching, including all of these dates.

transmogrifier/sources/dspace_dim.py Outdated Show resolved Hide resolved
* Convert NavgiableStrings to str in datacite, dspace_dim, and dspace_mets transforms
* Add date validation to dspace_dim transform
@ehanson8 ehanson8 requested a review from hakbailey March 14, 2023 17:14
Copy link
Contributor

@hakbailey hakbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments.

transmogrifier/sources/datacite.py Outdated Show resolved Hide resolved
transmogrifier/sources/datacite.py Outdated Show resolved Hide resolved
transmogrifier/sources/dspace_dim.py Outdated Show resolved Hide resolved
* Update datacite transform to fix errors and add validation to publicationYear
* Update dspace_dim transform for code consistency
@ehanson8 ehanson8 requested a review from hakbailey March 14, 2023 18:26
Copy link
Contributor

@hakbailey hakbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ehanson8 ehanson8 merged commit 0949a82 into main Mar 15, 2023
@ehanson8 ehanson8 deleted the date-whitespace-bug-fix branch March 15, 2023 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants