-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timx 332 dedupe function #206
Conversation
148c95b
to
f78ba8a
Compare
Why these changes are being introduced: * Improve data quality of TIMDEX records by reducing duplication of data in list fields. How this addresses that need: * Create an attrs converter function to dedupe list of items * Create ListFields abstract class with hash method * Set hash methods in custom classes to ListFields.__hash__ * Set 'converter=dedupe' for every list field in TimdexRecord * Add unit tests verifying deduplication of list fields Side effects of this change: * Deduplication is highly likely to result in diffs when comparing transformed records before and after this change. However (and more importantly), reducing duplicates improves the data quality of TIMDEX records. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-332
f78ba8a
to
2871769
Compare
from transmogrifier.models import ( | ||
AlternateTitle, | ||
Contributor, | ||
Date, | ||
DateRange, | ||
Identifier, | ||
Link, | ||
Note, | ||
Subject, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allows us to easily use timdex.<field>
without having to explicitly import each field class, similar to what we do for other test modules.
def test_timdex_record_dedupe_alternate_titles(timdex_record_required_fields): | ||
timdex_record_required_fields.alternate_titles = [ | ||
timdex.AlternateTitle(value="My Octopus Teacher"), | ||
timdex.AlternateTitle(value="My Octopus Teacher"), | ||
] | ||
assert timdex_record_required_fields.alternate_titles == [ | ||
timdex.AlternateTitle(value="My Octopus Teacher") | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the assertions, lists are used because list allow duplicates; with these assertions, the test confirm that a single instance of an object remains in the list after the converter function is applied.
transmogrifier/models.py
Outdated
@@ -35,6 +35,12 @@ def list_of(item_type: Any) -> Callable: # noqa: ANN401 | |||
) | |||
|
|||
|
|||
def dedupe(item_list: list | Any) -> list | None: # noqa: ANN401 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The item_list: list | Any
type hint is to support scenarios where a non-list type is set for a list field. Since converters are applied before validators, the dedupe
converter is set up to "pass" these values so that the validator can effectively handle the values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smart!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually, loving this pattern of allowing attrs classes to be hashable, and then applying a dedupe()
method for the converter
attribute on a field.
It feels like we're leaning into a built-in and documented feature of attrs
classes by using Converters to manipulate the values getting set for a field; in our case deduping a list.
I did leave a comment about the __hash__
method override pattern, and reflecting that it doesn't appear that we can just extend ListField
by all the other classes. Would be interested in discussing this a bit, but definitely open to this approach.
transmogrifier/models.py
Outdated
@define | ||
class AlternateTitle: | ||
value: str = field(validator=instance_of(str)) # Required subfield | ||
kind: str | None = field(default=None, validator=optional(instance_of(str))) | ||
|
||
__hash__ = ListField.__hash__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Picking on this one, but applies to all...
My assumption had been that given the new ListField
class, we could just extend that for all field classes that were list type, e.g.:
@define
class AlternateTitle(ListField):
value: str = field(validator=instance_of(str)) # Required subfield
kind: str | None = field(default=None, validator=optional(instance_of(str)))
But when I try that, I am seeing this error crop up:
TypeError: unhashable type: 'AlternateTitle'
I see from the attrs docs, that they indicate you can reset a hash method by doing the following, which is what I see in this PR:
The easiest way to reset
__hash__
on a class is adding__hash__ = object.__hash__
in the class body.
But, they also mention the ability to use @define(frozen=True)
. When I set that for this AlternateTitle
class, tests appear to pass... but maybe because deduping is not tested on this particular class?
@define(frozen=True) # <------- added here
class AlternateTitle:
value: str = field(validator=instance_of(str)) # Required subfield
kind: str | None = field(default=None, validator=optional(instance_of(str)))
Is it worth discussing / exploring if @define(frozen=True)
is sufficient? or does that introduce other issues?
UPDATE: After testing on Date
and DateRange
, a whole host of errors cropped up with the frozen=True
approach, so maybe not a good fit. Keeping this comment here though, as I'm curious if @jonavellecuerdo has other insights from hacking on this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related: is @define(unsafe_hash=True)
an option we can work with? In some messy testing, tests are still passing when applied to AlternateTitles
, Date
, and DateRange
:
Note the following:
@define(unsafe_hash=True)
for all- none of them are defining
__hash__
or even usingListField
at all
@define(unsafe_hash=True)
class AlternateTitle:
value: str = field(validator=instance_of(str)) # Required subfield
kind: str | None = field(default=None, validator=optional(instance_of(str)))
@define(unsafe_hash=True)
class DateRange:
gt: str | None = field(default=None, validator=optional(instance_of(str)))
gte: str | None = field(default=None, validator=optional(instance_of(str)))
lt: str | None = field(default=None, validator=optional(instance_of(str)))
lte: str | None = field(default=None, validator=optional(instance_of(str)))
@define(unsafe_hash=True)
class Date:
kind: str | None = field(default=None, validator=optional(instance_of(str)))
note: str | None = field(default=None, validator=optional(instance_of(str)))
range: DateRange | None = field( # type: ignore[misc]
default=None,
validator=[optional(instance_of(DateRange)), check_range],
)
value: str | None = field(default=None, validator=optional(instance_of(str)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does feel off to define a class and not use it for inheritance, I think it could be worth a huddle to discuss this further
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if the class isn't used for inheritance, is it better to just defining a timdex_object_hash()
function and setting __hash__ = timdex_object_hash()
? Not fully sure of the implications of that though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per this comment above, I wonder if it's passing because the hash created is always unique, so while no runtime exceptions, it's not actually deduping anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Circled back to this after doing some more reading (this thread was particularly helpful), I think either unsafe_hash=True
or timdex_object_hash()
works here, the ListField
class feels like too much overhead if it can't properly be an inherited class due to the hash issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you both for looking into this more closely! I agree about the ListField
class feeling redundant as it's not accomplishing inheritance in the way we expected it to. 🤔
Just wanted to share my learnings that led me to the current state of this PR!:
Regarding @ghukill 's initial comment:
-
The child classes do not inherit the
__hash__
method from the parent classListField
becauseattrs
default behavior will write an__eq__
method for the class. the Python 3 docs, it reads:A class that overrides eq() and does not define hash() will have its hash() implicitly set to None. When the hash() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value. . .
I believe this is why we can't simply inherit the
ListField
class. 🤔 -
@define(frozen=True)
will make the list field (it decorates) immutable. The cause of all those errors cropping up when this setting is configured are field methods that involve creating the instance of the object and then proceeding to update attributes (e.g. dates field method in Ead transform). Given the current state of our transforms, I agree that this approach is not feasible. -
When I originally came across the documentation regarding the
unsafe_hash
inattrs
, I think I chose not to use it after seeing the repeated warnings to not use this setting, in particular 😅:Setting @define(unsafe_hash=False) (which implies eq=True) is almost certainly a bug.
It sounds like a config that the authors of
attrs
don't really like / wish they didn't have to support (maybe deprecate in the future).
That said, I do think it's achieving the same thing as the current state of this PR:- the list field classes remain mutable
attrs
by default writes an__eq__
method for every list field class- a
__hash__
method is assigned to every list field class
In short, I personally feel more drawn to the solution that @ehanson8 proposed in #206 (comment):
- Move
ListField.__hash__
into an independent method liketimdex_object_hash()
. - Remove
ListField
- Use
@define
as-is (i.e.,attrs
writes__eq__
method for every list field class)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jonavellecuerdo for all the detailed notes and explanations here, super helpful.
I can confirm that using unsafe_hash=True
is, in fact, unsafe!
from attrs import define
@define(unsafe_hash=True)
class DogA:
name: str
@define(unsafe_hash=True)
class DogB:
name: str
"""
In [5]: DogA('rex') == DogB('rex')
Out[5]: False
# And...
In [7]: DogA('rex').__hash__() == DogB('rex').__hash__()
Out[7]: False
"""
To keep this moving along, I'm definitely in support of the explicit __hash__ = ...
approach from this PR, whether or not it comes from a class method or just a standalone method, I'm amenable to either.
I must admit, I still do find this a little confusing however:
import attrs
from attrs import define
def custom_hash(instance) -> int:
values = tuple(
[
tuple(attrib) if isinstance(attrib, list) else attrib
for attrib in attrs.astuple(instance)
]
)
return hash(values)
@define
class DogC:
name: str
__hash__ = custom_hash
@define
class DogD:
name: str
__hash__ = custom_hash
"""
In [5]: DogC('rex') == DogD('rex')
Out[5]: False
# But....
In [6]: DogC('rex').__hash__() == DogD('rex').__hash__()
Out[6]: True
"""
I wonder if this is because 'name' is not a list? or that it's not utilizing attrs.field
explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to clarify my comment a bit here after a discussion yesterday with @jonavellecuerdo.
I had not considered in this toy example that different classes (e.g. DogC
vs DogD
) should likely not be equal, given they are different classes.
While I was focused on instance equality vs hash method equality... @jonavellecuerdo observed that the class type should be a consideration.
Example, the following should not be equal:
AlternateTitle(value="x", kind="y") == Location(value="x", kind="y")
If we are thinking only of field names + values, then they would be considered equal. But it's clear that to dedupe, the class type is an important variable. I got the impression that @jonavellecuerdo was exploring changes to reflect this, so won't dig in more here, but did want to include these noets that stemmed from that conversation yesterday.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghukill Thank you for providing this summary! TLDR: From #206 (comment), I noticed that what unsafe_hash=True
did differently (at least prior to the updates introduced f81475b) is that it took class types into account when generating the hash method.
Prior to the updates in f81475b, this would've occurred:
AlternateTitle(value="x", kind="y") == Location(value="x", kind="y") # returns True!
To fix this, the class type is now taken into account my timdex_object_hash
. Unit tests to assert that: if x == y
, it must follow that hash(x) == y
(and vice versa) have also been added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is pleasantly concise, I share @ghukill 's concerns about the ListField.__hash__
pattern so I'd appreciate a chance to discuss further to see if there's a better option!
transmogrifier/models.py
Outdated
@define | ||
class AlternateTitle: | ||
value: str = field(validator=instance_of(str)) # Required subfield | ||
kind: str | None = field(default=None, validator=optional(instance_of(str))) | ||
|
||
__hash__ = ListField.__hash__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does feel off to define a class and not use it for inheritance, I think it could be worth a huddle to discuss this further
* Remove 'ListField' class and create standalone hash method * Add additional tests for hash and dedupe methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks great!
Requesting a change re: the tuple created from the class name. Also made another stylistic comment, but happy either way on that one.
transmogrifier/models.py
Outdated
By making TIMDEX objects hashable, dedupe methods | ||
can be applied to a list of TIMDEX objects. | ||
""" | ||
values = tuple(type(timdex_object).__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple requests here.
First, I think this approach might actually creating a tuple with the individual characters of the class name, e.g.
from transmogrifier.models import *
timdex_object = AlternateTitle(value='horse',kind='animal')
tuple(type(timdex_object).__name__)
# Out[13]: ('A', 'l', 't', 'e', 'r', 'n', 'a', 't', 'e', 'T', 'i', 't', 'l', 'e')
Think this could be resolved via an explicit list, and then converting to tuple:
tuple([type(timdex_object).__name__])
# Out[2]: ('AlternateTitle',)
Second, any objection with using X.__class__
vs type(X)
?
Example:
values = tuple([timdex_object.__class__.__name__])
Personally, when we know it's some kind of well-known class instance (like classes we define), I think using .__class__
is more readable. I'm having trouble finding supporting documentation or reasoning, where it sounds like type(X)
and X.__class__
are mostly interchangeable; this SO post being a good example where it's kind of fluid, matter of taste, and some pretty edge-y gotchas.
I tend to think of type(X)
as when I really don't know anything about an object. But in this case, where we know it's a class instance of some kind, and we wrote the classes, it feels like accessing it directly is kind of nice.
TL/DR: I'd propose we make the first change to avoid a tuple of each character, but the second type(X)
vs X.__class__
happy either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ALL that said -- and should have led with this -- this is a nice and elegant solution to this equality checking! Nice work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consider values = tuple([timdex_object.__class__.__name__])
more readable as well, good suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EGAD 😱
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghukill Thank you for catching this erroneous error 🙈 . This has been fixed by 426b185. I also prefer using __class__
for the same reason you described:
we know it's a class instance of some kind, and we wrote the classes
(Also remembering you suggested this during our check-in yesterday so apologies for forgetting!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work, fully approved after that change!
transmogrifier/models.py
Outdated
By making TIMDEX objects hashable, dedupe methods | ||
can be applied to a list of TIMDEX objects. | ||
""" | ||
values = tuple(type(timdex_object).__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consider values = tuple([timdex_object.__class__.__name__])
more readable as well, good suggestion!
426b185
to
0448d3b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved!! Nice work on this. Seems like a great addition.
Purpose and background context
Add a global dedupe method using
attrs
converters to remove duplicated data inTimdexRecord
fields that comprise of item lists. The following approach was taken:attrs
converter method to dedupe list of itemsattrs
class attribute.dedupe
method is defined. The method accepts anitem_list
and useslist(dict.fromkeys(item_list))
to remove duplicates while retaining the order of items as they were passed into the list.ListField
abstract class with a__hash__
methodListField
was implemented as we discussed, the syntax to have the custom classes inherit the hash method was different than expected. As the Python 3 docs explains:ListField.__hash__
? This is required by the custom classes whose attributes also comprise of list fields (i.e., Contributor, Note, and Subject).How can a reviewer manually see the effects of these changes?
Review new tests added in commit: add dedupe tests for dates and locations
Review screenshots showing Docker runs using pulled Docker image from Dev (representing latest code in this branch) and Docker image from Prod (representing current state). Instructions are provided in the attached Markdown file: simple_ab_testing.md.
Note: All three of us (@ghukill , @ehanson8 , myself) tried this out on our local machines and got similar results! More formal A/B testing will be explored as a separate effort, but all three of us agreed this method is sufficient for purposes of this PR.
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
https://mitlibraries.atlassian.net/browse/TIMX-332
Developer
Code Reviewer(s)