Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement libmissing; untangles _libs dependencies #18357

Merged
merged 16 commits into from
Nov 22, 2017

Conversation

jbrockmendel
Copy link
Member

In the status quo, algos, hashtable, parsers, and period depend on lib. On top of that groupby, join, and index depend on algos and intervaltree depends on hashtable. (In fairness, some of these dependency relationships are python imports, not cimports.)

lib depends on tslib which depends on essentially all of tslibs. So we've got a whole bunch of dependencies (many of which aren't declared in setup.py).

But the lib functions needed by algos, hashtable, and period are basically all a small handful of null-checking checnull, isnaobj, isnaobj2d which depend only on util and NaT.

So this PR takes those functions and puts them into _libs.missing, then updates the relevant imports.

@codecov
Copy link

codecov bot commented Nov 18, 2017

Codecov Report

Merging #18357 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18357      +/-   ##
==========================================
- Coverage   91.38%   91.36%   -0.02%     
==========================================
  Files         164      164              
  Lines       49790    49791       +1     
==========================================
- Hits        45501    45493       -8     
- Misses       4289     4298       +9
Flag Coverage Δ
#multiple 89.17% <100%> (ø) ⬆️
#single 39.5% <50%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/format.py 96.01% <100%> (ø) ⬆️
pandas/core/dtypes/missing.py 90.62% <100%> (ø) ⬆️
pandas/io/formats/excel.py 96.72% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cfad581...5fea502. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 18, 2017

Codecov Report

Merging #18357 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18357      +/-   ##
==========================================
- Coverage   91.35%   91.33%   -0.02%     
==========================================
  Files         163      163              
  Lines       49714    49716       +2     
==========================================
- Hits        45415    45408       -7     
- Misses       4299     4308       +9
Flag Coverage Δ
#multiple 89.13% <100%> (ø) ⬆️
#single 39.64% <58.82%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/missing.py 90.74% <100%> (+0.11%) ⬆️
pandas/io/formats/excel.py 96.71% <100%> (ø) ⬆️
pandas/io/formats/format.py 96.01% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d421a09...e94e01f. Read the comment docs.

return get_timedelta64_value(val) == NPY_NAT
elif util.is_array(val):
return False
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason for not pulling in util._checknull here as well, as seems logical? (or just future PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean defining it here instead of in util? Or importing it into the namespace? I'd be +1 on the former, indifferent to the latter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for sure should define it here. but then this puts missing as a dep of things like hashing.pyx. ok with it being a dep of any of the tslibs though.

(pandas) bash-3.2$ find pandas -name '*.pyx' | xargs grep _checknull
pandas/_libs/hashing.pyx:from util cimport _checknull
pandas/_libs/hashing.pyx:        elif _checknull(val):
pandas/_libs/interval.pyx:        if util._checknull(interval):
pandas/_libs/lib.pyx:from util cimport is_array, _checknull, _checknan
pandas/_libs/lib.pyx:        return _checknull(val)
pandas/_libs/lib.pyx:        return _checknull(val)
pandas/_libs/lib.pyx:        result[i] = val is NaT or util._checknull_old(val)
pandas/_libs/lib.pyx:                _checknull(x) and _checknull(y)):
pandas/_libs/lib.pyx:    if _checknull(val):
pandas/_libs/lib.pyx:        if _checknull(x):
pandas/_libs/lib.pyx:            if _checknull(x):
pandas/_libs/lib.pyx:            elif _checknull(y):
pandas/_libs/src/inference.pyx:        if util._checknull(val):
pandas/_libs/src/inference.pyx:        elif util._checknull(v):
pandas/_libs/src/inference.pyx:    if util._checknull(v):
pandas/_libs/src/inference.pyx:    if util._checknull(v):
pandas/_libs/src/inference.pyx:    if util._checknull(v):
pandas/_libs/src/inference.pyx:    if util._checknull(v):
pandas/_libs/src/inference.pyx:        return util._checknull(value)
pandas/_libs/src/inference.pyx:            bint is_generic_null = util._checknull(value)
pandas/_libs/tslib.pyx:from tslibs.nattype cimport _checknull_with_nat, NPY_NAT
pandas/_libs/tslib.pyx:        if _checknull_with_nat(val):
pandas/_libs/tslib.pyx:            if _checknull_with_nat(val):
pandas/_libs/tslib.pyx:        if _checknull_with_nat(val):
pandas/_libs/tslib.pyx:            if _checknull_with_nat(val):
pandas/_libs/tslib.pyx:            if _checknull_with_nat(val):
pandas/_libs/tslib.pyx:            if _checknull_with_nat(val):
pandas/_libs/tslibs/nattype.pyx:cdef inline bint _checknull_with_nat(object val):
pandas/_libs/tslibs/strptime.pyx:from nattype cimport _checknull_with_nat, NPY_NAT
pandas/_libs/tslibs/strptime.pyx:            if _checknull_with_nat(val):
pandas/_libs/tslibs/timedeltas.pyx:from nattype cimport _checknull_with_nat, NPY_NAT
pandas/_libs/tslibs/timedeltas.pyx:    if _checknull_with_nat(ts):
pandas/_libs/tslibs/timedeltas.pyx:    if _checknull_with_nat(other):
pandas/_libs/tslibs/timedeltas.pyx:        elif _checknull_with_nat(value):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're moving util._checknull anyway, I'd advocate renaming it to e.g. check_none_or_nan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for sure should define it here. but then this puts missing as a dep of things like hashing.pyx. ok with it being a dep of any of the tslibs though.

I'll take a look and see which util funcs can be moved without messing with dependencies.

FWIW this PR already adds missing to the 'pxdfiles` key of hashtable, which cimports missing.checknull. Previously it was an un-declared dependency on lib.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like util._checknull_old can be moved to missing (is used there once, nowhere else). Let's saving util._checknull for later, since it is used in a bunch of places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok with leaving these for later as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Just pushed commit with docstrings.

@cython.wraparound(False)
@cython.boundscheck(False)
def isnaobj(ndarray arr):
cdef:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally add some doc-strings

cdef int64_t NPY_NAT = util.get_nat()


cdef inline bint is_null_datetimelike(v):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob should rename this for consistency (checknull_datetimelike), can be TODO

@@ -533,22 +533,6 @@ cpdef object infer_datetimelike_array(object arr):
return 'mixed'


cdef inline bint is_null_datetimelike(v):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason these are not in nattype? (and maybe namespaces to missing, but actual definition in nattype)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine putting this in nattype. The one below I considered moving but decided against because it's only used once here in inference.pyx.

@@ -830,24 +829,6 @@ class Timestamp(_Timestamp):
# ----------------------------------------------------------------------


cdef inline bint _check_all_nulls(object val):
""" utility to check if a value is any type of null """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used anywhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, its used once here in _libs.missing.

@@ -208,14 +208,14 @@ class TestNAObj(object):

def _check_behavior(self, arr, expected):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should move these tests to test_missing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Existing test_missing in tests.dtypes or a new test_libmissing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think tests.dtypes is ok, maybe add test_libmissing at later point (orthogonal to this)

@jreback jreback added Clean Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Nov 19, 2017
@jbrockmendel jbrockmendel mentioned this pull request Nov 19, 2017
59 tasks
@@ -83,7 +83,7 @@ def rank_1d_{{dtype}}(object in_arr, ties_method='average', ascending=True,
nan_value = {{neg_nan_value}}

{{if dtype == 'object'}}
mask = lib.isnaobj(values)
mask = missing.isnaobj(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these could be cimports instead?

Copy link
Member Author

@jbrockmendel jbrockmendel Nov 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATM these are not cdef

@@ -1860,7 +1860,7 @@ def _format_strings(self):
(lambda x: pprint_thing(x, escape_chars=('\t', '\r', '\n'))))

def _format(x):
if self.na_rep is not None and lib.checknull(x):
if self.na_rep is not None and libmissing.checknull(x):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in followup (or here), change this to

is_scalar(x) and isna(x); this is reaching too much into the internals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and all below here

maybe worth defining a is_scalar_na function

@@ -50,7 +50,7 @@ def isna(obj):

def _isna_new(obj):
if is_scalar(obj):
return lib.checknull(obj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see this here is ok, as this is really the python api for missing values.

@@ -381,12 +381,12 @@ def __init__(self, df, na_rep='', float_format=None, cols=None,
self.inf_rep = inf_rep

def _format_value(self, val):
if lib.checknull(val):
if libmissing.checknull(val):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use is_scalar and isna

val = self.na_rep
elif is_float(val):
if lib.isposinf_scalar(val):
if libmissing.isposinf_scalar(val):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isposinf_scalar should be exposed in core/dtypes/missing.py as a python function

_1d_methods = ['isnaobj', 'isnaobj_old']
_2d_methods = ['isnaobj2d', 'isnaobj2d_old']

def _check_behavior(self, arr, expected):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can parametrize/use fixtures would be good

@jreback
Copy link
Contributor

jreback commented Nov 20, 2017

rebase

@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

pls rebase

cimport util

from tslibs.np_datetime cimport get_timedelta64_value, get_datetime64_value
from tslibs.nattype import NaT, iNaT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of importing iNaT, just use NPY_NAT to avoid perf issues.

cdef int64_t NPY_NAT = util.get_nat()


cdef inline bint is_null_datetimelike(v):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type as object

@jreback jreback added this to the 0.22.0 milestone Nov 22, 2017
@jbrockmendel
Copy link
Member Author

instead of importing iNaT, just use NPY_NAT to avoid perf issues.

type as object

Just pushed with both these changes. Also:

  • changed variable v --> val (there's a recent issue regarding linting too-short variable names that I very much agree with)
  • changed for i from 0 <= i < n: --> for i in range(n): in a couple places

setup.py Outdated
@@ -516,6 +517,10 @@ def pxd(name):
'_libs.lib': {
'pyxfile': '_libs/lib',
'depends': lib_depends + tseries_depends},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add missing to lib as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do. At the moment we've basically only implemented the pxfiles "rule" for tslibs. Should we go through and fill them out for the others?

In case I haven't mentioned it recently: I'm really looking forward to using cythonize and not worrying about this.

@jreback jreback merged commit bd145c8 into pandas-dev:master Nov 22, 2017
@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

thanks!

@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

not sure this started showing up on this PR, but pls add to the list

/Users/jreback/miniconda3/envs/pandas/lib/python3.6/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: "Using deprecated NumPy API, disable it by "          "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
#warning "Using deprecated NumPy API, disable it by " \
 ^
pandas/_libs/src/datetime/np_datetime.c:571:12: warning: expression result unused [-Wunused-value]
  meta.num - 1;
  ~~~~~~~~ ^ ~
pandas/_libs/src/datetime/np_datetime.c:1004:15: warning: unused variable 'perday' [-Wunused-variable]
    npy_int64 perday;
              ^
3 warnings generated.

@jbrockmendel jbrockmendel deleted the libmissing branch November 23, 2017 01:13
@jbrockmendel
Copy link
Member Author

not sure this started showing up on this PR, but pls add to the list

Looks like these are in the new timedelta_struct functions, likely leftovers from copy/paste. Easy to fix in-place, but I'd advocate not having them in src/ to begin with. Put them directly in tslibs.np_datetime (or tslibs.np_timedelta).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants