CLN: prepare unifying hashtable.factorize and .unique; add doc-strings #22986
Conversation
Hello @h-vetinari! Thanks for updating the PR.
Comment last updated on October 11, 2018 at 20:09 UTC.
Force-pushed from 8464750 to 9918d52.
this absolutely needs performance tests run. You might be able to always compute the labels and remove the return_inverse keyword (and just always return them to the calling function), as these are all private functions - but perf will dictate this.
Here's the full ASV run on a cleanly restarted machine with nothing else running. Bit surprised at the spread (both positive and negative):
Since #22978 wasn't merged yet 10h ago, I tested against master + #22978.
I'm guessing that the perf decrease for some indexing methods will be a dealbreaker, and I don't believe that always returning the inverse will be faster than reading the bint flag.
Yes (which is basically what it was before? but with some additions to get_labels to satisfy the behaviour of unique). But first, in addition to asv, I would recommend running a few targeted benchmarks (e.g. unique and factorize on a single array of some dtype) manually.
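A minimal sketch of such a targeted benchmark, assuming an IPython session with the development build of pandas importable (the array size and dtype are arbitrary examples, not taken from the PR):

```python
import numpy as np
import pandas as pd

# a single large array of a single dtype, as suggested above
values = np.random.randint(0, 1000, size=1_000_000)

# in IPython, time the two code paths this PR touches:
# %timeit pd.unique(values)
# %timeit pd.factorize(values)
```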
Codecov Report
@@ Coverage Diff @@
## master #22986 +/- ##
==========================================
- Coverage 92.19% 92.19% -0.01%
==========================================
Files 169 169
Lines 50954 50952 -2
==========================================
- Hits 46975 46973 -2
Misses 3979 3979
Continue to review full report at Codecov.
After trying to force compilation of the different
Following the advice of @jorisvandenbossche, I ran the following benchmarks for the commits in question (first is master + #22978; then the fully unified function; then the different functions). Relative to master:
Absolute values:
Code:
Here's the ASV run of 8481e19 against master + #22978. The spread still indicates to me that the results are noisy, but in any case, it's clearly skewed towards being faster now. I'll probably rerun against just master.
BTW, is it possible to do an ASV run for a PR on a server? Or could someone else rerun it locally? Not sure how representative my machine is.
And here's the run against just master. Purely from what's happening in the code:
To me, these changes shouldn't amount to much difference at all - I'm guessing the spread should mostly be the result of randomness (i.e. what else is happening on the computer in that exact moment), although the mean is (luckily) skewed towards being faster. What do you all think? Also repeating my previous point:
this change is just moving code around w/o combining anything (which IIUC was the original intention), so why should this PR be accepted?
@jreback: However, this still gives
and how does this differ from factorize, which requires no code changes?
The difference between And for implementing a
then ok with adding the keyword to the existing functions. I just see a giant moving-around of code with new names for little gain here. let's try to minimize the changes.
I don't want to implement this within
@@ -851,20 +885,40 @@ cdef class PyObjectHashTable(HashTable):
    val = values[i]
    hash(val)

    if ((val != val or val is None) or
            (use_na_value and val == na_value)):
    if ignore_na and ((val != val or val is None)
this keyword is completely untested
I would argue that it is (at least for ignore_na=True), because all the tests for factorize need to run through this.
Do we explicitly test cython code? I can if necessary. This PR does the required preparations on the cython level to continue with return_inverse later on. It will be extensively tested there (because unique(return_inverse=True) calls _unique_with_inverse(ignore_na=False)).
yes of course, you added a keyword.
I am talking about the ignore_na keyword
I know that I added this keyword to switch between .unique/.factorize behaviour.
Can you point me to where the cython hashtable code is tested? I could add some tests there. That's why I asked "Do we explicitly test cython code?", because I haven't come across it yet.
Otherwise, I would add copious tests for the behaviour as soon as the keyword makes its way into the pandas API, but this is just the preparation for that.
@h-vetinari: IMO tests/generic/test_generic.py or tests/generic/test_frame.py could be good places to put any kind of tests for this refactoring for now.
If it is possible to hit those keyword arguments from Python land, then do add some tests.
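A hedged sketch of what such a Python-land test could look like (the test name and expected values are illustrative, not the PR's actual tests; pd.factorize is the public entry point that reaches the ignore_na=True path):

```python
import numpy as np
import pandas as pd

def test_factorize_skips_na():
    # NaN should get the sentinel code -1 and not show up in the uniques
    arr = np.array([1.0, np.nan, 1.0, 2.0])
    codes, uniques = pd.factorize(arr)
    assert codes.tolist() == [0, -1, 0, 1]
    assert uniques.tolist() == [1.0, 2.0]
```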
let's drop the ignore_na for now. and add it when I see that you actually need it.
@jreback This aspect is the whole point of the PR - to not duplicate the code between .factorize (= hashtable._unique_with_inverse(ignore_na=True)) and the not-yet-implemented-because-this-PR-is-starting-the-process .unique(return_inverse=True) (= hashtable._unique_with_inverse(ignore_na=False)).
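To make that mapping concrete, here is a pure-Python sketch of the idea; the real implementation is the templated cython in hashtable_class_helper.pxi.in, and this only illustrates how a single ignore_na flag switches between the two behaviours:

```python
import numpy as np

def _unique_with_inverse(values, ignore_na=False):
    # one pass over the data; a single flag decides how NaN is treated
    uniques, labels, seen = [], [], {}
    for val in values:
        is_na = val != val  # NaN check for float data
        if ignore_na and is_na:
            # factorize-style: NaN gets the sentinel -1 and is not a unique
            labels.append(-1)
            continue
        # unique-style (ignore_na=False): NaN becomes a regular unique value
        key = "__na__" if is_na else val
        if key not in seen:
            seen[key] = len(uniques)
            uniques.append(val)
        labels.append(seen[key])
    return np.asarray(uniques), np.asarray(labels, dtype=np.intp)

# .factorize behaviour:                  _unique_with_inverse(values, ignore_na=True)
# .unique(return_inverse=True) behaviour: _unique_with_inverse(values, ignore_na=False)
```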
Thanks for review; undid the moving. Hopefully clearer now.
I think we should stop arguing about that point, as I completely understood your point, but clearly didn't make my point clear :-)
But is it that complicated?
Absolutely (I added this test for you). It's just that then you're only testing the object dtype, plus there might be bugs that only occur for larger examples etc. So I've made another try at simplifying but keeping the other test - I switched to testing the inverse implicitly through reconstruction - that's essentially just as thorough, but the test becomes much shorter/easier.
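For reference, a short sketch of what "testing the inverse implicitly through reconstruction" amounts to (illustrative only, not the PR's exact test code):

```python
import numpy as np
import pandas as pd

arr = np.random.randint(0, 10, size=100).astype(np.float64)
codes, uniques = pd.factorize(arr)

# indexing the uniques with the inverse must reproduce the original values
# (this works directly as long as there are no NaNs, i.e. no -1 codes)
np.testing.assert_array_equal(uniques[codes], arr)
```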
@jorisvandenbossche @jreback How does this look to you now?
i will look soon
Returns
-------
uniques : ndarray[object]
should be dtype here
s_duplicated.values.setflags(write=writable)
na_mask = s_duplicated.isna().values

result_inverse, result_unique = htable().factorize(s_duplicated.values)
i would rather you not test unrelated things directly here; these are pure unit tests
it just makes the test confusing. add with the other duplicated tests if you must
though i think these are already well covered, but if not you can add
This is not unrelated - it's the central functionality of factorize, and a bit more thorough than the few hand-made tests further up. The code for duplicated is completely separate (in cython) and so this should have a separate test as well IMO.
i know you really want to show that this works, but these tests need hard-coded expectations. It is very hard for a reader to assert what you are doing, because you are computing the results here as well. I think it's pretty easy to cook an example where you start with a duplicated array and can just assert the results directly. We do lots of these kinds of things already in the drop_duplicates testing.
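A hedged sketch of such a hard-coded test at the hashtable level (the values, expected arrays, and test name are made up; the call and return order follow the snippet quoted above):

```python
import numpy as np
import pandas._libs.hashtable as ht

def test_factorize_object_hardcoded():
    # duplicates written out by hand rather than computed
    values = np.array(["a", "b", "a", "c", "b"], dtype=object)
    inverse, uniques = ht.PyObjectHashTable().factorize(values)
    assert uniques.tolist() == ["a", "b", "c"]
    assert inverse.tolist() == [0, 1, 0, 2, 1]
```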
Thx for the review
Failures are one hypothesis timeout and one server timeout, so unrelated.
If you merge in master, that should be solved.
This has been passing (aside from flaky CI) for the last 3 commits - anything left to do @jreback @jorisvandenbossche?
i am uncomfortable with these tests as is (the last one). I want a bit less computation to produce duplicates and more hard coding of the actual indexers. I don't mind these in addition.
we have limited reviewers and 230-odd PRs. Certainly welcome for you to review other PRs to help out.
So, adding hard-coded tests, and leaving the ones already in the PR as-is? Do I understand that correctly?
Didn't mean to be pushy - was really just asking what's left (after several rounds of reviews).
yes that's fine.
Cool. I already did that in the latest commit.
thanks @h-vetinari
For adding a return_inverse-kwarg to unique (#21357 / #21645), I originally didn't want to have to add something to the templated cython code, preferring to add a solution using numpy.unique as a first step (with the idea of improving the performance later). @jorisvandenbossche then remarked:

I didn't know what factorize was doing, so I had no idea about this connection. From looking at the cython code in hashtable_class_helper.pxi.in, I found that there's substantial overlap between unique and get_labels (the core of factorize) - the only difference is the handling of NaN/None/etc.

I think that these could/should be unified for less code duplication. Alternatively, the first commit of this PR could be split off into a separate PR - it is just adding a return_inverse-kwarg to unique, though effectively by duplicating the factorize code.

Both variants pass the test suite locally; the only question is whether there are appreciable differences in the ASVs. Since the only extra complexity of unifying the code is reading the value of some bint within the loop, I don't think that it'll be relevant. I got some noisy data the first time 'round, will let it run again when I have the possibility.

So long story short, this PR prepares the hashtable backend to support return_inverse=True, which plays into #21357 / #21645 / #22824, and will also allow #21720 to be solved easily.

Side note: this already builds on #22978 to be able to run the ASVs at all.
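For reference, a rough sketch of what the numpy.unique-based first step mentioned above could look like, assuming the order-of-appearance behaviour of pandas.unique has to be preserved (not the approach this PR ends up taking):

```python
import numpy as np

def unique_with_inverse(values):
    # np.unique returns sorted uniques, so reorder them to first-occurrence
    # order (as pandas.unique does) and remap the inverse accordingly
    uniques_sorted, first_idx, inverse_sorted = np.unique(
        values, return_index=True, return_inverse=True)
    order = np.argsort(first_idx)         # sorted position -> appearance order
    uniques = uniques_sorted[order]
    remap = np.empty_like(order)
    remap[order] = np.arange(len(order))  # old code -> new code
    return uniques, remap[inverse_sorted]
```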