Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: SparseDataFrame/SparseSeries value assignment #17785

Closed
wants to merge 6 commits into from

Conversation

kernc
Copy link
Contributor

@kernc kernc commented Oct 4, 2017

Works by as much as possible using SparseBlock.setitem() which calls into SparseArray.set_values(), which returns a new (replacement) SparseArray object.

@pep8speaks
Copy link

pep8speaks commented Oct 4, 2017

Hello @kernc! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 14, 2018 at 01:24 Hours UTC


# allow arbitrary setting
if is_setter:
return list(key)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied from _AtIndexer above. Probably to make a test work.


if isinstance(new, np.ndarray) and len(new) == len(mask):
new = new[mask]

mask = _safe_reshape(mask, new_values.shape)
mask = _safe_reshape(np.asarray(mask), new_values.shape)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer necessary since sparse short-circuit above.

@@ -2753,6 +2761,18 @@ def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
return self.make_block_same_class(values=values,
placement=self.mgr_locs)

def _can_hold_element(self, element):
return np.can_cast(np.asarray(element).dtype, self.sp_values.dtype)
Copy link
Contributor Author

@kernc kernc Oct 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable default. @jreback your recent commit 4efe656#diff-e705e723b2d6e7c0e2a0443f80916abfR609 indicates this must for some reason be strict(er)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this looks ok


def _try_coerce_result(self, result):
if (
# isinstance(result, np.ndarray) and
Copy link
Contributor Author

@kernc kernc Oct 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A failing test where a SparseDataFrame is sliced horizontally requires either this line ...

@@ -3750,7 +3770,8 @@ def fast_xs(self, loc):
# Such assignment may incorrectly coerce NaT to None
# result[blk.mgr_locs] = blk._slice((slice(None), loc))
for i, rl in enumerate(blk.mgr_locs):
result[rl] = blk._try_coerce_result(blk.iget((i, loc)))
# result[rl] = blk._try_coerce_result(blk.iget((i, loc)))
result[rl] = blk.iget((i, loc))
Copy link
Contributor Author

@kernc kernc Oct 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or not calling _try_coerce_result() here.

@gfyoung
Copy link
Member

gfyoung commented Oct 4, 2017

@kernc : A couple of points:

  • Travis and company are complaining pretty loudly about some failing tests. Could you address those?
  • Could you point to the issue number that you're fixing? It looks like you pointed to your own PR.

@gfyoung gfyoung added the Sparse Sparse Data Type label Oct 4, 2017
@kernc kernc force-pushed the sparse-assign branch 2 times, most recently from b106895 to 5605e87 Compare October 5, 2017 18:05
return np.can_cast(np.asarray(element).dtype, self.sp_values.dtype)

def _try_coerce_result(self, result):
if (isinstance(result, np.ndarray) and
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A failing test where a SparseDataFrame is sliced horizontally requires this check. In that test, the items of a dtype=object array are empty lists. Were they ndarray objects, this method would coerce them into SparseArrays (i.e. no longer pure ndarrays).

@kernc
Copy link
Contributor Author

kernc commented Oct 10, 2017

Handling the warnings now. What do I do when there are two warnings to catch with assert_produces_warning (FutureWarning and PerformanceWarning)? Should I even emit PerformanceWarning as I do?

@jreback
Copy link
Contributor

jreback commented Oct 11, 2017

the FutureWarning is prob enough

@kernc
Copy link
Contributor Author

kernc commented Oct 11, 2017

The FutureWarning is about the recent .set_value() deprecation, PerformanceWarning is about setting values on a sparse structure (makes a copy of data, sometimes densifies data).

The problem with assert_produces_warning is that if any other warnings are encountered, the whole thing fails.

Can I amend assert_produces_warning to also accept a tuple of expected warning types? Alternatively, I can add a flag error_unexpected=True which, when set to False, won't error on additional unexpected warnings?

@jreback
Copy link
Contributor

jreback commented Nov 10, 2017

can you rebase / update

@codecov
Copy link

codecov bot commented Nov 11, 2017

Codecov Report

Merging #17785 into master will decrease coverage by 0.21%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17785      +/-   ##
==========================================
- Coverage   91.42%   91.21%   -0.22%     
==========================================
  Files         163      163              
  Lines       50064    50036      -28     
==========================================
- Hits        45773    45640     -133     
- Misses       4291     4396     +105
Flag Coverage Δ
#multiple 89.02% <100%> (-0.2%) ⬇️
#single 40.22% <21.27%> (-0.2%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 94.98% <ø> (-0.03%) ⬇️
pandas/core/frame.py 97.76% <100%> (-0.14%) ⬇️
pandas/core/sparse/series.py 95.84% <100%> (+0.56%) ⬆️
pandas/core/sparse/array.py 91.96% <100%> (-0.33%) ⬇️
pandas/core/sparse/frame.py 94.92% <100%> (+0.14%) ⬆️
pandas/core/internals.py 94.26% <100%> (-0.28%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/generic.py 92.09% <0%> (-3.64%) ⬇️
pandas/core/indexes/range.py 92.83% <0%> (-2.83%) ⬇️
pandas/util/_decorators.py 78% <0%> (-2.71%) ⬇️
... and 58 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3493aba...137304e. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 11, 2017

Codecov Report

Merging #17785 into master will decrease coverage by 0.16%.
The diff coverage is 93.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17785      +/-   ##
==========================================
- Coverage   92.08%   91.91%   -0.17%     
==========================================
  Files         169      164       -5     
  Lines       50706    50018     -688     
==========================================
- Hits        46691    45975     -716     
- Misses       4015     4043      +28
Flag Coverage Δ
#multiple 90.3% <93.33%> (-0.19%) ⬇️
#single 42.14% <21.66%> (-0.19%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 94.11% <ø> (+0.39%) ⬆️
pandas/core/sparse/series.py 95.77% <100%> (+0.55%) ⬆️
pandas/core/sparse/array.py 91.81% <100%> (+0.43%) ⬆️
pandas/core/sparse/frame.py 94.91% <100%> (+0.12%) ⬆️
pandas/util/testing.py 85.44% <100%> (-0.62%) ⬇️
pandas/core/frame.py 97.19% <100%> (-0.05%) ⬇️
pandas/core/internals.py 95.2% <77.77%> (ø)
pandas/util/_doctools.py 0% <0%> (-12.88%) ⬇️
pandas/core/reshape/util.py 90.32% <0%> (-9.68%) ⬇️
pandas/util/_decorators.py 82.25% <0%> (-9.09%) ⬇️
... and 74 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffae158...4779c36. Read the comment docs.

@kernc
Copy link
Contributor Author

kernc commented Nov 11, 2017

Of course; thanks for coming around to it!

There's still the issue of two distinct Warnings being emitted but assert_produces_warning only tests for one, raising on the other. Leaning towards extending assert_produces_warning for such a case.

whatsnew is also missing as this is all highly preliminary. Thanks.

# For SparseBlock, self.values is always 1D. If cond was a frame,
# it's 2D values would incorrectly broadcast later on.
if values.ndim == 1 and any(ax == 1 for ax in cond.shape):
cond = cond.ravel()
Copy link
Contributor Author

@kernc kernc Nov 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes #17198, but likely in an incorrect way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, can you define .where() in SparseBlock and do a super call here instead?

@kernc
Copy link
Contributor Author

kernc commented Nov 14, 2017

The tests in fact pass (only the linter complains about unused imports I don't actually intend to keep). @jreback I think this PR while working is full of incorrect solutions so I'd really appreciate if you could have a look-over.

@jreback
Copy link
Contributor

jreback commented Jan 21, 2018

closing as stale. if you want to continue working, pls ping.

@kernc happy to have sparse fixes. pls ping if you want to fixup.

@jreback jreback closed this Jan 21, 2018
@kernc kernc changed the title SparseDataFrame/SparseSeries value assignment ENH: SparseDataFrame/SparseSeries value assignment Jan 21, 2018
@kernc
Copy link
Contributor Author

kernc commented Jan 21, 2018

This was a planned enhancement. I guess I should wait for #19239 (#19174). If SparseArray survives, the bits are there.

@jreback
Copy link
Contributor

jreback commented Jan 21, 2018

@kernc actually no, #19174 is a long ways off. I would certainly like to do this before if at all possible. that way the tests (and code) are there. I am happy to have this. let's try to get over the finish line.

@jreback jreback reopened this Jan 21, 2018
@kernc
Copy link
Contributor Author

kernc commented Jan 21, 2018

Ok. Then please give it a first pass as it is! 😉

@kernc kernc force-pushed the sparse-assign branch 2 times, most recently from 5012bcc to 5b7d0f9 Compare February 23, 2018 06:24
@kernc kernc force-pushed the sparse-assign branch 2 times, most recently from 69d0e81 to 7d3a577 Compare July 8, 2018 06:04
@kernc
Copy link
Contributor Author

kernc commented Jul 8, 2018

So the travis tests pass. Now if anyone could have a first look over.

I'm particularly interested in _try_coerce_result, _can_hold_element, extending of assert_produces_warning ... i.e. anything an insider finds obviously wrong or lacking.

# If label already in sparse index, just switch the value on a copy
idx = self.sp_index.lookup(indexer)
if idx != -1:
obj = self.copy()
Copy link
Contributor Author

@kernc kernc Jul 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should indeed make a copy or might we reuse (without the warning)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure if you can do inplace w/o the copy then it is ok. yeah I wouldn't warn unless you are actually copying

@kernc kernc force-pushed the sparse-assign branch 2 times, most recently from 4cb1ec7 to e5950ed Compare July 9, 2018 01:38
@kernc
Copy link
Contributor Author

kernc commented Jul 12, 2018

@jreback please have a look. It's all green. 🙏 The first four commits are where the magic is, the rest was mostly figuring out warnings in CI tests.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only partially looked

# For SparseBlock, self.values is always 1D. If cond was a frame,
# it's 2D values would incorrectly broadcast later on.
if values.ndim == 1 and any(ax == 1 for ax in cond.shape):
cond = cond.ravel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, can you define .where() in SparseBlock and do a super call here instead?

@@ -1809,6 +1817,11 @@ def putmask(self, mask, new, align=True, inplace=False, axis=0,
new_values = self.values if inplace else self.copy().values
new_values, _, new, _ = self._try_coerce_args(new_values, new)

if is_sparse(new_values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll get used to it. :) Done. I hope that's what you meant. Much simpler now altogether.

@@ -2753,6 +2761,18 @@ def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
return self.make_block_same_class(values=values,
placement=self.mgr_locs)

def _can_hold_element(self, element):
return np.can_cast(np.asarray(element).dtype, self.sp_values.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this looks ok

@kernc kernc force-pushed the sparse-assign branch 3 times, most recently from 93a0fc0 to c86e0ec Compare July 12, 2018 02:28
tm.assert_index_equal(res2.columns,
pd.Index(list(self.frame.columns) + ['qux']))
pd.Index(list(self.frame.columns)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To justify, this test changed because (deprecated) SparseDataFrame.set_value() was removed in favor of superclass frame's (deprecated) set_value() which edits and returns the same object.

@kernc kernc force-pushed the sparse-assign branch 3 times, most recently from 1a5c3df to 13a033e Compare July 12, 2018 15:25
@kernc
Copy link
Contributor Author

kernc commented Jul 13, 2018

@jreback Please have another look. The docstrings for the two new override methods have been copied nearly verbatim from the superclass, without clean-up. Also the tm.assert_produces_warning() addition was done simpler in another way.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. just a few questions.

# If label already in sparse index, just switch the value on a copy
idx = self.sp_index.lookup(indexer)
if idx != -1:
obj = self.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure if you can do inplace w/o the copy then it is ok. yeah I wouldn't warn unless you are actually copying


indices = np.insert(indices, pos, indexer)
sp_values = np.insert(self.sp_values, pos, value)
# Length can be increased when adding a new value into index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a line before comment

sp_values = np.insert(self.sp_values, pos, value)
# Length can be increased when adding a new value into index
length = max(self.sp_index.length, indexer + 1)
sp_index = _make_index(length, indices, self.kind)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no copy here AFICT ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is above with

sp_values = np.insert(self.sp_values, pos, value)

@@ -544,6 +592,10 @@ def astype(self, dtype=None, copy=True):
return self._simple_new(sp_values, self.sp_index,
fill_value=fill_value)

def tolist(self):
"""Return *dense* self as list"""
return self.values.tolist()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@@ -277,8 +276,13 @@ def __array_wrap__(self, result, context=None):
else:
fill_value = self.fill_value

# Assume: If result size matches, old sparse index is valid (ok???)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give an example here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~sparseseries is an example. Put in the comment.

kind=self.kind)
self._data = SingleBlockManager(values, new_index)
self._index = new_index
self._data = self._data.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to copy?

# ok, as the index gets converted to object
frame = self.frame.copy()
with tm.assert_produces_warning(FutureWarning,
check_stacklevel=False):
check_stacklevel=False,
ignore_extra=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you adding ignore_extra?

with tm.assert_produces_warning(FutureWarning,
check_stacklevel=False):
check_stacklevel=False,
ignore_extra=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -2465,6 +2466,8 @@ class for all warnings. To check that no warning is returned,
If True, displays the line that called the function containing
the warning to show were the function is called. Otherwise, the
line that implements the function is displayed.
ignore_extra : bool, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no we don't want to add this

kernc added 6 commits August 14, 2018 03:21
Besides hstacking cols (data copy), this densified SparseDataFrame.
Also fix .where for sparse blocks.
Discrepancy comes from:

	 dense_frame._data.blocks[0].values  # this is 2D even for 1D block
	sparse_frame._data.blocks[0].values  # this is always 1D

I'm sure this had worked before and was unneeded in Oct 2017.
@kernc
Copy link
Contributor Author

kernc commented Aug 14, 2018

This PR is likely obsoleted by #21978 and its PR #22325. Understood and approved.

@TomAugspurger
Copy link
Contributor

I’m hoping to steal pieces from this once #22325 is in.

@mroeschke
Copy link
Member

@TomAugspurger can this PR build upon the SparseArray work or has that superseded this PR?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 13, 2018 via email

@WillAyd
Copy link
Member

WillAyd commented Nov 26, 2018

Closing based off previous comments - seems like this PR is dead (though may inspire others).

@kernc if you want to reopen please ping

@WillAyd WillAyd closed this Nov 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants