CLN: prepare unifying hashtable.factorize and .unique; add doc-strings #22986
Conversation
Hello @h-vetinari! Thanks for updating the PR.
Comment last updated on October 11, 2018 at 20:09 UTC.
Force-pushed from 8464750 to 9918d52.
this absolutely needs performance tests run. You might be able to always compute the labels and remove the return_inverse keyword (and just always return them to the calling function), as these are all private functions - but perf will dictate this.
Here's the full ASV run on a cleanly restarted machine with nothing else running. Bit surprised at the spread (both positive and negative):
Since #22978 wasn't merged yet 10h ago, I tested against master + #22978.
I'm guessing that the perf decrease for some indexing methods will be a dealbreaker, and I don't believe that always returning the inverse will be faster than reading the bint flag.
Yes (which is basically what it was before? but with some additions to get_labels to satisfy the behaviour of unique). But first, in addition to asv, I would recommend running a few targeted benchmarks (e.g. unique and factorize on a single array of some dtype) manually.
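A minimal sketch of such a targeted benchmark, assuming an IPython session with the development build of pandas importable (the array size and dtype are arbitrary examples, not taken from the PR):

```python
import numpy as np
import pandas as pd

# a single large array of a single dtype, as suggested above
values = np.random.randint(0, 1000, size=1_000_000)

# in IPython, time the two code paths this PR touches:
# %timeit pd.unique(values)
# %timeit pd.factorize(values)
```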
Codecov Report
@@ Coverage Diff @@
## master #22986 +/- ##
==========================================
- Coverage 92.19% 92.19% -0.01%
==========================================
Files 169 169
Lines 50954 50952 -2
==========================================
- Hits 46975 46973 -2
Misses 3979 3979
Continue to review full report at Codecov.
After trying to force compilation of the different
Following the advice of @jorisvandenbossche, I ran the following benchmarks for the commits in question (first is master + #22978; then the fully unified function; then the different functions). Relative to master:
Absolute values:
Code:
Here's the ASV run of 8481e19 against master + #22978. The spread still indicates to me that the results are noisy, but in any case, it's clearly skewed towards being faster now. I'll probably rerun against just master.
BTW, is it possible to do an ASV run for a PR on a server? Or could someone else rerun it locally? Not sure how representative my machine is.
And here's the run against just master. Purely from what's happening in the code:
To me, these changes shouldn't amount to much difference at all - I'm guessing the spread should mostly be the result of randomness (i.e. what else is happening on the computer in that exact moment), although the mean is (luckily) skewed towards being faster. What do you all think? Also repeating my previous point:
this change is just moving code around w/o combining anything (which IIUC was the original intention), so why should this PR be accepted?
@jreback: However, this still gives
and how does this differ from factorize, which requires no code changes?
The difference between And for implementing a
then ok with adding the keyword to the existing functions. I just see a giant moving-around of code with new names for little gain here. let's try to minimize the changes.
I don't want to implement this within
@@ -851,20 +885,40 @@ cdef class PyObjectHashTable(HashTable):
    val = values[i]
    hash(val)

    if ((val != val or val is None) or
            (use_na_value and val == na_value)):
    if ignore_na and ((val != val or val is None)
this keyword is completely untested
I would argue that it is (at least for ignore_na=True), because all the tests for factorize need to run through this.
Do we explicitly test cython code? I can if necessary. This PR does the required preparations on the cython level to continue with return_inverse later on. It will be extensively tested there (because unique(return_inverse=True) calls _unique_with_inverse(ignore_na=False)).
yes of course, you added a keyword.
I am talking about the ignore_na keyword
I know that I added this keyword to switch between .unique/.factorize behaviour.
Can you point me to where the cython hashtable code is tested? I could add some tests there. That's why I asked "Do we explicitly test cython code?", because I haven't come across it yet.
Otherwise, I would add copious tests for the behaviour as soon as the keyword makes its way into the pandas API, but this is just the preparation for that.
@h-vetinari: IMO tests/generic/test_generic.py or tests/generic/test_frame.py could be good places to put any kind of tests for this refactoring for now.
If it is possible to hit those keyword arguments from Python land, then do add some tests.
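A hedged sketch of what such a Python-land test could look like (the test name and expected values are illustrative, not the PR's actual tests; pd.factorize is the public entry point that reaches the ignore_na=True path):

```python
import numpy as np
import pandas as pd

def test_factorize_skips_na():
    # NaN should get the sentinel code -1 and not show up in the uniques
    arr = np.array([1.0, np.nan, 1.0, 2.0])
    codes, uniques = pd.factorize(arr)
    assert codes.tolist() == [0, -1, 0, 1]
    assert uniques.tolist() == [1.0, 2.0]
```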
let's drop the ignore_na for now. and add it when I see that you actually need it.
@jreback This aspect is the whole point of the PR - to not duplicate the code between .factorize (= hashtable._unique_with_inverse(ignore_na=True)) and the not-yet-implemented-because-this-PR-is-starting-the-process .unique(return_inverse=True) (= hashtable._unique_with_inverse(ignore_na=False)).
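To make that mapping concrete, here is a pure-Python sketch of the idea; the real implementation is the templated cython in hashtable_class_helper.pxi.in, and this only illustrates how a single ignore_na flag switches between the two behaviours:

```python
import numpy as np

def _unique_with_inverse(values, ignore_na=False):
    # one pass over the data; a single flag decides how NaN is treated
    uniques, labels, seen = [], [], {}
    for val in values:
        is_na = val != val  # NaN check for float data
        if ignore_na and is_na:
            # factorize-style: NaN gets the sentinel -1 and is not a unique
            labels.append(-1)
            continue
        # unique-style (ignore_na=False): NaN becomes a regular unique value
        key = "__na__" if is_na else val
        if key not in seen:
            seen[key] = len(uniques)
            uniques.append(val)
        labels.append(seen[key])
    return np.asarray(uniques), np.asarray(labels, dtype=np.intp)

# .factorize behaviour:                  _unique_with_inverse(values, ignore_na=True)
# .unique(return_inverse=True) behaviour: _unique_with_inverse(values, ignore_na=False)
```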
Thanks for review; undid the moving. Hopefully clearer now.
I think we should stop arguing about that point, as I completely understood your point, but clearly didn't make my point clear :-)
But is it that complicated?
Absolutely (I added this test for you). It's just that then you're only testing the object dtype, plus there might be bugs that only occur for larger examples etc. So I've made another try at simplifying but keeping the other test - I switched to testing the inverse implicitly through reconstruction - that's essentially just as thorough, but the test becomes much shorter/easier.
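For reference, a short sketch of what "testing the inverse implicitly through reconstruction" amounts to (illustrative only, not the PR's exact test code):

```python
import numpy as np
import pandas as pd

arr = np.random.randint(0, 10, size=100).astype(np.float64)
codes, uniques = pd.factorize(arr)

# indexing the uniques with the inverse must reproduce the original values
# (this works directly as long as there are no NaNs, i.e. no -1 codes)
np.testing.assert_array_equal(uniques[codes], arr)
```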
@jorisvandenbossche @jreback How does this look to you now?
i will look soon
Returns
-------
uniques : ndarray[object]
should be dtype here
s_duplicated.values.setflags(write=writable)
na_mask = s_duplicated.isna().values

result_inverse, result_unique = htable().factorize(s_duplicated.values)
i would rather you not test unrelated things directly here; these are pure unit tests
it just makes the test confusing. add with the other duplicated tests if you must
though i think these are already well covered, but if not you can add
This is not unrelated - it's the central functionality of factorize, and a bit more thorough than the few hand-made tests further up. The code for duplicated is completely separate (in cython) and so this should have a separate test as well IMO.
i know you really want to show that this works, but these tests need hard-coded expectations. It is very hard for a reader to assert what you are doing, because you are computing the results here as well. I think it's pretty easy to cook an example where you start with a duplicated array and can just assert the results directly. We do lots of these kinds of things already in the drop_duplicates testing.
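A hedged sketch of such a hard-coded test at the hashtable level (the values, expected arrays, and test name are made up; the call and return order follow the snippet quoted above):

```python
import numpy as np
import pandas._libs.hashtable as ht

def test_factorize_object_hardcoded():
    # duplicates written out by hand rather than computed
    values = np.array(["a", "b", "a", "c", "b"], dtype=object)
    inverse, uniques = ht.PyObjectHashTable().factorize(values)
    assert uniques.tolist() == ["a", "b", "c"]
    assert inverse.tolist() == [0, 1, 0, 2, 1]
```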
Thx for the review
Failures are one hypothesis timeout and one server timeout, so unrelated.
If you merge in master, that should be solved.
This has been passing (aside from flaky CI) for the last 3 commits - anything left to do @jreback @jorisvandenbossche?
i am uncomfortable with these tests as is (the last one). I want a bit less computation to produce duplicates and more hard coding of the actual indexers. I don't mind these in addition.
we have limited reviewers and 230-odd PRs. Certainly welcome for you to review other PRs to help out.
So, adding hard-coded tests, and leaving the ones already in the PR as-is? Do I understand that correctly?
Didn't mean to be pushy - was really just asking what's left (after several rounds of reviews).
yes that's fine.
Cool. I already did that in the latest commit.
thanks @h-vetinari
For adding a return_inverse-kwarg to unique (#21357 / #21645), I originally didn't want to have to add something to the templated cython code, preferring to add a solution using numpy.unique as a first step (with the idea of improving the performance later). @jorisvandenbossche then remarked:

I didn't know what factorize was doing, so I had no idea about this connection. From looking at the cython code in hashtable_class_helper.pxi.in, I found that there's substantial overlap between unique and get_labels (the core of factorize) - the only difference is the handling of NaN/None/etc.

I think that these could/should be unified for less code duplication. Alternatively, the first commit of this PR could be split off into a separate PR - it is just adding a return_inverse-kwarg to unique, though effectively by duplicating the factorize code.

Both variants pass the test suite locally; the only question is whether there are appreciable differences in the ASVs. Since the only extra complexity of unifying the code is reading the value of some bint within the loop, I don't think that it'll be relevant. I got some noisy data the first time 'round, will let it run again when I have the possibility.

So long story short, this PR prepares the hashtable backend to support return_inverse=True, which plays into #21357 / #21645 / #22824, and will also allow #21720 to be solved easily.

Side note: this already builds on #22978 to be able to run the ASVs at all.
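For reference, a rough sketch of what the numpy.unique-based first step mentioned above could look like, assuming the order-of-appearance behaviour of pandas.unique has to be preserved (not the approach this PR ends up taking):

```python
import numpy as np

def unique_with_inverse(values):
    # np.unique returns sorted uniques, so reorder them to first-occurrence
    # order (as pandas.unique does) and remap the inverse accordingly
    uniques_sorted, first_idx, inverse_sorted = np.unique(
        values, return_index=True, return_inverse=True)
    order = np.argsort(first_idx)         # sorted position -> appearance order
    uniques = uniques_sorted[order]
    remap = np.empty_like(order)
    remap[order] = np.arange(len(order))  # old code -> new code
    return uniques, remap[inverse_sorted]
```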