Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timx 332 dedupe function #206

Merged
merged 3 commits into from
Aug 15, 2024
Merged

Timx 332 dedupe function #206

merged 3 commits into from
Aug 15, 2024

Conversation

jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented Aug 9, 2024

Purpose and background context

Add a global dedupe method using attrs converters to remove duplicated data in TimdexRecord fields that comprise of item lists. The following approach was taken:

How can a reviewer manually see the effects of these changes?

  1. Review new tests added in commit: add dedupe tests for dates and locations

  2. Review screenshots showing Docker runs using pulled Docker image from Dev (representing latest code in this branch) and Docker image from Prod (representing current state). Instructions are provided in the attached Markdown file: simple_ab_testing.md.

image

Note: All three of us (@ghukill , @ehanson8 , myself) tried this out on our local machines and got similar results! More formal A/B testing will be explored as a separate effort, but all three of us agreed this method is sufficient for purposes of this PR.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-332

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this Aug 9, 2024
Base automatically changed from TIMX-291-orchestration to main August 9, 2024 15:12
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-332-dedupe-function branch 2 times, most recently from 148c95b to f78ba8a Compare August 9, 2024 19:42
Why these changes are being introduced:
* Improve data quality of TIMDEX records
by reducing duplication of data in list fields.

How this addresses that need:
* Create an attrs converter function to dedupe list of items
* Create ListFields abstract class with hash method
* Set hash methods in custom classes to ListFields.__hash__
* Set 'converter=dedupe' for every list field in TimdexRecord
* Add unit tests verifying deduplication of list fields

Side effects of this change:
* Deduplication is highly likely to result in diffs
when comparing transformed records before and after this
change. However (and more importantly), reducing duplicates
improves the data quality of TIMDEX records.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-332
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-332-dedupe-function branch from f78ba8a to 2871769 Compare August 12, 2024 15:54
Comment on lines -3 to -12
from transmogrifier.models import (
AlternateTitle,
Contributor,
Date,
DateRange,
Identifier,
Link,
Note,
Subject,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allows us to easily use timdex.<field> without having to explicitly import each field class, similar to what we do for other test modules.

Comment on lines +376 to +383
def test_timdex_record_dedupe_alternate_titles(timdex_record_required_fields):
timdex_record_required_fields.alternate_titles = [
timdex.AlternateTitle(value="My Octopus Teacher"),
timdex.AlternateTitle(value="My Octopus Teacher"),
]
assert timdex_record_required_fields.alternate_titles == [
timdex.AlternateTitle(value="My Octopus Teacher")
]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the assertions, lists are used because list allow duplicates; with these assertions, the test confirm that a single instance of an object remains in the list after the converter function is applied.

@@ -35,6 +35,12 @@ def list_of(item_type: Any) -> Callable: # noqa: ANN401
)


def dedupe(item_list: list | Any) -> list | None: # noqa: ANN401
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The item_list: list | Any type hint is to support scenarios where a non-list type is set for a list field. Since converters are applied before validators, the dedupe converter is set up to "pass" these values so that the validator can effectively handle the values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart!

@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review August 12, 2024 17:04
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, loving this pattern of allowing attrs classes to be hashable, and then applying a dedupe() method for the converter attribute on a field.

It feels like we're leaning into a built-in and documented feature of attrs classes by using Converters to manipulate the values getting set for a field; in our case deduping a list.

I did leave a comment about the __hash__ method override pattern, and reflecting that it doesn't appear that we can just extend ListField by all the other classes. Would be interested in discussing this a bit, but definitely open to this approach.

@define
class AlternateTitle:
value: str = field(validator=instance_of(str)) # Required subfield
kind: str | None = field(default=None, validator=optional(instance_of(str)))

__hash__ = ListField.__hash__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picking on this one, but applies to all...

My assumption had been that given the new ListField class, we could just extend that for all field classes that were list type, e.g.:

@define
class AlternateTitle(ListField):
    value: str = field(validator=instance_of(str))  # Required subfield
    kind: str | None = field(default=None, validator=optional(instance_of(str)))

But when I try that, I am seeing this error crop up:

TypeError: unhashable type: 'AlternateTitle'

I see from the attrs docs, that they indicate you can reset a hash method by doing the following, which is what I see in this PR:

The easiest way to reset __hash__ on a class is adding __hash__ = object.__hash__ in the class body.

But, they also mention the ability to use @define(frozen=True). When I set that for this AlternateTitle class, tests appear to pass... but maybe because deduping is not tested on this particular class?

@define(frozen=True)  # <------- added here
class AlternateTitle:
    value: str = field(validator=instance_of(str))  # Required subfield
    kind: str | None = field(default=None, validator=optional(instance_of(str)))

Is it worth discussing / exploring if @define(frozen=True) is sufficient? or does that introduce other issues?

UPDATE: After testing on Date and DateRange, a whole host of errors cropped up with the frozen=True approach, so maybe not a good fit. Keeping this comment here though, as I'm curious if @jonavellecuerdo has other insights from hacking on this...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related: is @define(unsafe_hash=True) an option we can work with? In some messy testing, tests are still passing when applied to AlternateTitles, Date, and DateRange:

Note the following:

  • @define(unsafe_hash=True) for all
  • none of them are defining __hash__ or even using ListField at all
@define(unsafe_hash=True)
class AlternateTitle:
    value: str = field(validator=instance_of(str))  # Required subfield
    kind: str | None = field(default=None, validator=optional(instance_of(str)))


@define(unsafe_hash=True)
class DateRange:
    gt: str | None = field(default=None, validator=optional(instance_of(str)))
    gte: str | None = field(default=None, validator=optional(instance_of(str)))
    lt: str | None = field(default=None, validator=optional(instance_of(str)))
    lte: str | None = field(default=None, validator=optional(instance_of(str)))


@define(unsafe_hash=True)
class Date:
    kind: str | None = field(default=None, validator=optional(instance_of(str)))
    note: str | None = field(default=None, validator=optional(instance_of(str)))
    range: DateRange | None = field(  # type: ignore[misc]
        default=None,
        validator=[optional(instance_of(DateRange)), check_range],
    )
    value: str | None = field(default=None, validator=optional(instance_of(str)))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does feel off to define a class and not use it for inheritance, I think it could be worth a huddle to discuss this further

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if the class isn't used for inheritance, is it better to just defining a timdex_object_hash() function and setting __hash__ = timdex_object_hash()? Not fully sure of the implications of that though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per this comment above, I wonder if it's passing because the hash created is always unique, so while no runtime exceptions, it's not actually deduping anything?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Circled back to this after doing some more reading (this thread was particularly helpful), I think either unsafe_hash=True or timdex_object_hash() works here, the ListField class feels like too much overhead if it can't properly be an inherited class due to the hash issues

Copy link
Contributor Author

@jonavellecuerdo jonavellecuerdo Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you both for looking into this more closely! I agree about the ListField class feeling redundant as it's not accomplishing inheritance in the way we expected it to. 🤔

Just wanted to share my learnings that led me to the current state of this PR!:

Regarding @ghukill 's initial comment:

  • The child classes do not inherit the __hash__ method from the parent class ListField because attrs default behavior will write an __eq__ method for the class. the Python 3 docs, it reads:

    A class that overrides eq() and does not define hash() will have its hash() implicitly set to None. When the hash() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value. . .

    I believe this is why we can't simply inherit the ListField class. 🤔

  • @define(frozen=True) will make the list field (it decorates) immutable. The cause of all those errors cropping up when this setting is configured are field methods that involve creating the instance of the object and then proceeding to update attributes (e.g. dates field method in Ead transform). Given the current state of our transforms, I agree that this approach is not feasible.

  • When I originally came across the documentation regarding the unsafe_hash in attrs, I think I chose not to use it after seeing the repeated warnings to not use this setting, in particular 😅:

    Setting @define(unsafe_hash=False) (which implies eq=True) is almost certainly a bug.

    It sounds like a config that the authors of attrs don't really like / wish they didn't have to support (maybe deprecate in the future).
    That said, I do think it's achieving the same thing as the current state of this PR:

    • the list field classes remain mutable
    • attrs by default writes an __eq__ method for every list field class
    • a __hash__ method is assigned to every list field class

In short, I personally feel more drawn to the solution that @ehanson8 proposed in #206 (comment):

  • Move ListField.__hash__ into an independent method like timdex_object_hash().
  • Remove ListField
  • Use @define as-is (i.e., attrs writes __eq__ method for every list field class)

Copy link
Contributor

@ghukill ghukill Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jonavellecuerdo for all the detailed notes and explanations here, super helpful.

I can confirm that using unsafe_hash=True is, in fact, unsafe!

from attrs import define

@define(unsafe_hash=True)
class DogA:
    name: str

@define(unsafe_hash=True)
class DogB:
    name: str
    
"""
In [5]: DogA('rex') == DogB('rex')
Out[5]: False

# And...

In [7]: DogA('rex').__hash__() == DogB('rex').__hash__()
Out[7]: False
"""

To keep this moving along, I'm definitely in support of the explicit __hash__ = ... approach from this PR, whether or not it comes from a class method or just a standalone method, I'm amenable to either.

I must admit, I still do find this a little confusing however:

import attrs
from attrs import define

def custom_hash(instance) -> int:
    values = tuple(
        [
            tuple(attrib) if isinstance(attrib, list) else attrib
            for attrib in attrs.astuple(instance)
        ]
    )
    return hash(values)


@define
class DogC:
    name: str
    __hash__ = custom_hash


@define
class DogD:
    name: str
    __hash__ = custom_hash


"""
In [5]: DogC('rex') == DogD('rex')
Out[5]: False

# But....

In [6]: DogC('rex').__hash__() == DogD('rex').__hash__()
Out[6]: True
"""

I wonder if this is because 'name' is not a list? or that it's not utilizing attrs.field explicitly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to clarify my comment a bit here after a discussion yesterday with @jonavellecuerdo.

I had not considered in this toy example that different classes (e.g. DogC vs DogD) should likely not be equal, given they are different classes.

While I was focused on instance equality vs hash method equality... @jonavellecuerdo observed that the class type should be a consideration.

Example, the following should not be equal:

AlternateTitle(value="x", kind="y") == Location(value="x", kind="y")

If we are thinking only of field names + values, then they would be considered equal. But it's clear that to dedupe, the class type is an important variable. I got the impression that @jonavellecuerdo was exploring changes to reflect this, so won't dig in more here, but did want to include these noets that stemmed from that conversation yesterday.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghukill Thank you for providing this summary! TLDR: From #206 (comment), I noticed that what unsafe_hash=True did differently (at least prior to the updates introduced f81475b) is that it took class types into account when generating the hash method.

Prior to the updates in f81475b, this would've occurred:

AlternateTitle(value="x", kind="y") == Location(value="x", kind="y") # returns True!

To fix this, the class type is now taken into account my timdex_object_hash. Unit tests to assert that: if x == y, it must follow that hash(x) == y (and vice versa) have also been added.

transmogrifier/models.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is pleasantly concise, I share @ghukill 's concerns about the ListField.__hash__ pattern so I'd appreciate a chance to discuss further to see if there's a better option!

tests/test_models.py Show resolved Hide resolved
@define
class AlternateTitle:
value: str = field(validator=instance_of(str)) # Required subfield
kind: str | None = field(default=None, validator=optional(instance_of(str)))

__hash__ = ListField.__hash__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does feel off to define a class and not use it for inheritance, I think it could be worth a huddle to discuss this further

* Remove 'ListField' class and create standalone hash method
* Add additional tests for hash and dedupe methods
@jonavellecuerdo
Copy link
Contributor Author

@ghukill @ehanson8 Please review the updates in the latest commit f81475b and the updated description, detailing the simple A/B test we used for reviewing this PR. Thank you for your reviews!

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks great!

Requesting a change re: the tuple created from the class name. Also made another stylistic comment, but happy either way on that one.

tests/test_models.py Show resolved Hide resolved
By making TIMDEX objects hashable, dedupe methods
can be applied to a list of TIMDEX objects.
"""
values = tuple(type(timdex_object).__name__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple requests here.

First, I think this approach might actually creating a tuple with the individual characters of the class name, e.g.

from transmogrifier.models import *

timdex_object = AlternateTitle(value='horse',kind='animal')

tuple(type(timdex_object).__name__)
# Out[13]: ('A', 'l', 't', 'e', 'r', 'n', 'a', 't', 'e', 'T', 'i', 't', 'l', 'e')

Think this could be resolved via an explicit list, and then converting to tuple:

tuple([type(timdex_object).__name__])
# Out[2]: ('AlternateTitle',)

Second, any objection with using X.__class__ vs type(X)?

Example:

values = tuple([timdex_object.__class__.__name__])

Personally, when we know it's some kind of well-known class instance (like classes we define), I think using .__class__ is more readable. I'm having trouble finding supporting documentation or reasoning, where it sounds like type(X) and X.__class__ are mostly interchangeable; this SO post being a good example where it's kind of fluid, matter of taste, and some pretty edge-y gotchas.

I tend to think of type(X) as when I really don't know anything about an object. But in this case, where we know it's a class instance of some kind, and we wrote the classes, it feels like accessing it directly is kind of nice.

TL/DR: I'd propose we make the first change to avoid a tuple of each character, but the second type(X) vs X.__class__ happy either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ALL that said -- and should have led with this -- this is a nice and elegant solution to this equality checking! Nice work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider values = tuple([timdex_object.__class__.__name__]) more readable as well, good suggestion!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EGAD 😱

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghukill Thank you for catching this erroneous error 🙈 . This has been fixed by 426b185. I also prefer using __class__ for the same reason you described:

we know it's a class instance of some kind, and we wrote the classes

(Also remembering you suggested this during our check-in yesterday so apologies for forgetting!)

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work, fully approved after that change!

tests/test_models.py Show resolved Hide resolved
By making TIMDEX objects hashable, dedupe methods
can be applied to a list of TIMDEX objects.
"""
values = tuple(type(timdex_object).__name__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider values = tuple([timdex_object.__class__.__name__]) more readable as well, good suggestion!

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-332-dedupe-function branch from 426b185 to 0448d3b Compare August 15, 2024 15:40
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved!! Nice work on this. Seems like a great addition.

@jonavellecuerdo jonavellecuerdo merged commit 395e612 into main Aug 15, 2024
5 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-332-dedupe-function branch August 15, 2024 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants