ENH: SparseDataFrame/SparseSeries value assignment #17785
Hello @kernc! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 14, 2018 at 01:24 Hours UTC |
pandas/core/indexing.py
Outdated
|
||
# allow arbitrary setting | ||
if is_setter: | ||
return list(key) |
Copied from _AtIndexer above. Probably to make a test work.
pandas/core/internals.py
Outdated
|
||
if isinstance(new, np.ndarray) and len(new) == len(mask): | ||
new = new[mask] | ||
|
||
mask = _safe_reshape(mask, new_values.shape) | ||
mask = _safe_reshape(np.asarray(mask), new_values.shape) |
No longer necessary since sparse short-circuit above.
pandas/core/internals.py
Outdated
@@ -2753,6 +2761,18 @@ def _astype(self, dtype, copy=False, raise_on_error=True, values=None, | |||
return self.make_block_same_class(values=values, | |||
placement=self.mgr_locs) | |||
|
|||
def _can_hold_element(self, element): | |||
return np.can_cast(np.asarray(element).dtype, self.sp_values.dtype) |
This seems like a reasonable default. @jreback your recent commit 4efe656#diff-e705e723b2d6e7c0e2a0443f80916abfR609 indicates this must for some reason be strict(er)?
yes this looks ok
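For readers following along, here is a minimal sketch of what the proposed check does, written as a standalone function. The name `can_hold_element` and the `target_dtype` parameter are hypothetical stand-ins for the method and `self.sp_values.dtype`; only `np.can_cast` and `np.asarray` come from the snippet above.

```python
import numpy as np


def can_hold_element(element, target_dtype):
    """Hypothetical standalone version of the check quoted above."""
    # Compare dtypes, not values: np.can_cast applies NumPy's default
    # "safe" casting rule between the element's dtype and the target.
    return np.can_cast(np.asarray(element).dtype, np.dtype(target_dtype))


# An integer fits into float64 storage under the "safe" rule.
float_ok = can_hold_element(1, np.float64)
# A float does not fit into int64 storage without losing information.
int_rejects_float = can_hold_element(1.5, np.int64)
# A string cannot be safely cast to float64.
string_rejected = can_hold_element("a", np.float64)
```

Because the check looks only at dtypes, it is fairly permissive for numeric data, which seems to be what the question about strictness is getting at.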
pandas/core/internals.py
Outdated
|
||
def _try_coerce_result(self, result): | ||
if ( | ||
# isinstance(result, np.ndarray) and |
A failing test where a SparseDataFrame is sliced horizontally requires either this line ...
pandas/core/internals.py
Outdated
@@ -3750,7 +3770,8 @@ def fast_xs(self, loc): | |||
# Such assignment may incorrectly coerce NaT to None | |||
# result[blk.mgr_locs] = blk._slice((slice(None), loc)) | |||
for i, rl in enumerate(blk.mgr_locs): | |||
result[rl] = blk._try_coerce_result(blk.iget((i, loc))) | |||
# result[rl] = blk._try_coerce_result(blk.iget((i, loc))) | |||
result[rl] = blk.iget((i, loc)) |
... or not calling _try_coerce_result() here.
@kernc : A couple of points:
|
Force-pushed from b106895 to 5605e87.
pandas/core/internals.py
Outdated
return np.can_cast(np.asarray(element).dtype, self.sp_values.dtype) | ||
|
||
def _try_coerce_result(self, result): | ||
if (isinstance(result, np.ndarray) and |
A failing test where a SparseDataFrame is sliced horizontally requires this check. In that test, the items of a dtype=object array are empty lists. Were they ndarray objects, this method would coerce them into SparseArrays (i.e. no longer pure ndarrays).
Handling the warnings now. What do I do when there are two warnings to catch with |
the FutureWarning is prob enough |
The FutureWarning is about the recent The problem with Can I amend |
can you rebase / update |
Codecov Report
@@ Coverage Diff @@
## master #17785 +/- ##
==========================================
- Coverage 91.42% 91.21% -0.22%
==========================================
Files 163 163
Lines 50064 50036 -28
==========================================
- Hits 45773 45640 -133
- Misses 4291 4396 +105
Codecov Report
@@ Coverage Diff @@
## master #17785 +/- ##
==========================================
- Coverage 92.08% 91.91% -0.17%
==========================================
Files 169 164 -5
Lines 50706 50018 -688
==========================================
- Hits 46691 45975 -716
- Misses 4015 4043 +28
Of course; thanks for coming around to it! There's still the issue of two distinct Warnings being emitted but whatsnew is also missing as this is all highly preliminary. Thanks. |
pandas/core/internals.py
Outdated
# For SparseBlock, self.values is always 1D. If cond was a frame, | ||
# it's 2D values would incorrectly broadcast later on. | ||
if values.ndim == 1 and any(ax == 1 for ax in cond.shape): | ||
cond = cond.ravel() |
This fixes #17198, but likely in an incorrect way.
hmm, can you define .where() in SparseBlock and do a super call here instead?
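The broadcasting problem that the `cond.ravel()` workaround targets can be reproduced with plain NumPy. This is only an illustrative sketch: the arrays below are made up, and `np.where` stands in for whatever the block's where logic does internally.

```python
import numpy as np

# 1D values, as a SparseBlock's values are described in the code comment above.
values = np.array([1.0, 2.0, 3.0])
# A 2D single-column mask, as would come from a one-column frame.
cond = np.array([[True], [False], [True]])

# Shape (3, 1) against shape (3,) broadcasts to (3, 3), which is not wanted.
broadcast = np.where(cond, values, np.nan)

# Flattening the mask first keeps the result 1D and aligned with the values.
flattened = np.where(cond.ravel(), values, np.nan)
```

Here `broadcast.shape` is `(3, 3)` while `flattened.shape` is `(3,)`, which is why the condition has to be reduced to one dimension somewhere, whether in the base method or in a SparseBlock override.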
The tests in fact pass (only the linter complains about unused imports I don't actually intend to keep). @jreback I think this PR, while working, is full of incorrect solutions, so I'd really appreciate it if you could look it over. |
closing as stale. if you want to continue working, pls ping. @kernc happy to have sparse fixes. pls ping if you want to fixup. |
Ok. Then please give it a first pass as it is! 😉 |
Force-pushed from 5012bcc to 5b7d0f9.
Force-pushed from 69d0e81 to 7d3a577.
So the travis tests pass. Now if anyone could have a first look over. I'm particularly interested in |
pandas/core/sparse/array.py
Outdated
# If label already in sparse index, just switch the value on a copy | ||
idx = self.sp_index.lookup(indexer) | ||
if idx != -1: | ||
obj = self.copy() |
Should this indeed make a copy, or might we reuse the object (without the warning)?
sure if you can do inplace w/o the copy then it is ok. yeah I wouldn't warn unless you are actually copying
Force-pushed from 4cb1ec7 to e5950ed.
@jreback please have a look. It's all green. 🙏 The first four commits are where the magic is, the rest was mostly figuring out warnings in CI tests. |
only partially looked
pandas/core/internals.py
Outdated
@@ -1809,6 +1817,11 @@ def putmask(self, mask, new, align=True, inplace=False, axis=0, | |||
new_values = self.values if inplace else self.copy().values | |||
new_values, _, new, _ = self._try_coerce_args(new_values, new) | |||
|
|||
if is_sparse(new_values): |
similar to above
I'll get used to it. :) Done. I hope that's what you meant. Much simpler now altogether.
Force-pushed from 93a0fc0 to c86e0ec.
tm.assert_index_equal(res2.columns, | ||
pd.Index(list(self.frame.columns) + ['qux'])) | ||
pd.Index(list(self.frame.columns))) |
To justify, this test changed because (deprecated) SparseDataFrame.set_value() was removed in favor of superclass frame's (deprecated) set_value(), which edits and returns the same object.
Force-pushed from 1a5c3df to 13a033e.
@jreback Please have another look. The docstrings for the two new override methods have been copied nearly verbatim from the superclass, without clean-up. Also the |
looks pretty good. just a few questions.
pandas/core/sparse/array.py
Outdated
|
||
indices = np.insert(indices, pos, indexer) | ||
sp_values = np.insert(self.sp_values, pos, value) | ||
# Length can be increased when adding a new value into index |
add a line before comment
pandas/core/sparse/array.py
Outdated
sp_values = np.insert(self.sp_values, pos, value) | ||
# Length can be increased when adding a new value into index | ||
length = max(self.sp_index.length, indexer + 1) | ||
sp_index = _make_index(length, indices, self.kind) |
no copy here AFAICT?
The copy is made above, with
sp_values = np.insert(self.sp_values, pos, value)
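To make the exchange above concrete, here is a rough sketch of the insertion logic using plain NumPy arrays in place of a SparseArray. The function `set_sparse_value` is hypothetical, and locating `pos` with `np.searchsorted` is an assumption (the quoted diff does not show how `pos` is computed); the `np.insert` and `max(...)` lines mirror the snippet.

```python
import numpy as np


def set_sparse_value(indices, sp_values, length, indexer, value):
    """Hypothetical helper: return new (indices, sp_values, length)."""
    # Assumption: `indices` is sorted, so searchsorted finds the slot.
    pos = np.searchsorted(indices, indexer)
    if pos < len(indices) and indices[pos] == indexer:
        # Position is already stored: overwrite the value on a copy.
        new_values = sp_values.copy()
        new_values[pos] = value
        return indices, new_values, length
    # np.insert returns new arrays, so the inputs are left untouched.
    new_indices = np.insert(indices, pos, indexer)
    new_values = np.insert(sp_values, pos, value)
    # Length can grow when the new position lies past the current end.
    new_length = max(length, indexer + 1)
    return new_indices, new_values, new_length


indices = np.array([1, 4], dtype=np.int32)
sp_values = np.array([10.0, 40.0])

new_indices, new_values, new_length = set_sparse_value(
    indices, sp_values, 5, 2, 20.0)
grown_indices, grown_values, grown_length = set_sparse_value(
    indices, sp_values, 5, 7, 70.0)
```

Since `np.insert` always allocates a new array, the original `sp_values` is unchanged after both calls, which is the copy referred to in the reply.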
@@ -544,6 +592,10 @@ def astype(self, dtype=None, copy=True): | |||
return self._simple_new(sp_values, self.sp_index, | |||
fill_value=fill_value) | |||
|
|||
def tolist(self): | |||
"""Return *dense* self as list""" | |||
return self.values.tolist() |
test for this?
Added.
pandas/core/sparse/series.py
Outdated
@@ -277,8 +276,13 @@ def __array_wrap__(self, result, context=None): | |||
else: | |||
fill_value = self.fill_value | |||
|
|||
# Assume: If result size matches, old sparse index is valid (ok???) |
can you give an example here?
~sparseseries is an example. Put in the comment.
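As an illustration of why the old sparse index can stay valid for an elementwise operation such as `~`, the sketch below models a sparse boolean array with plain NumPy. The helper `to_dense` and the `(indices, sp_values, fill_value)` layout are assumptions made for this example, not the actual SparseSeries internals.

```python
import numpy as np


def to_dense(indices, sp_values, fill_value, length):
    """Hypothetical helper: expand a sparse representation to a dense array."""
    dense = np.full(length, fill_value, dtype=sp_values.dtype)
    dense[indices] = sp_values
    return dense


indices = np.array([1, 4])          # positions of the stored values
sp_values = np.array([True, True])  # the stored (non-fill) values
fill_value = False
length = 5

dense = to_dense(indices, sp_values, fill_value, length)

# Invert the stored values and the fill value, but keep the old index.
inverted = to_dense(indices, ~sp_values, not fill_value, length)
```

The reconstructed `inverted` array equals `~dense`, so in this elementwise case reusing the index gives the right answer. Whether matching result size alone is a sufficient condition in general is exactly the open question flagged in the code comment.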
pandas/core/sparse/series.py
Outdated
kind=self.kind) | ||
self._data = SingleBlockManager(values, new_index) | ||
self._index = new_index | ||
self._data = self._data.copy() |
do we need to copy?
# ok, as the index gets converted to object | ||
frame = self.frame.copy() | ||
with tm.assert_produces_warning(FutureWarning, | ||
check_stacklevel=False): | ||
check_stacklevel=False, | ||
ignore_extra=True): |
why are you adding ignore_extra?
with tm.assert_produces_warning(FutureWarning, | ||
check_stacklevel=False): | ||
check_stacklevel=False, | ||
ignore_extra=True): |
same
pandas/util/testing.py
Outdated
@@ -2465,6 +2466,8 @@ class for all warnings. To check that no warning is returned, | |||
If True, displays the line that called the function containing | |||
the warning to show were the function is called. Otherwise, the | |||
line that implements the function is displayed. | |||
ignore_extra : bool, default False |
no we don't want to add this
Besides hstacking cols (data copy), this densified SparseDataFrame.
Also fix .where for sparse blocks.
Discrepancy comes from:
    dense_frame._data.blocks[0].values   # this is 2D even for 1D block
    sparse_frame._data.blocks[0].values  # this is always 1D
I'm sure this had worked before and was unneeded in Oct 2017.
I’m hoping to steal pieces from this once #22325 is in. |
@TomAugspurger can this PR build upon the SparseArray work or has that superseded this PR? |
SparseArray doesn't support setting values yet. I haven't had a chance to look into this. I think the core of the approach in
https://github.com/pandas-dev/pandas/pull/17785/files#diff-71caf9627e9687e837e4b1f86ecc6271R373
should still be valid on top of SparseArray.
Closing based off previous comments - seems like this PR is dead (though may inspire others). @kernc if you want to reopen please ping |
git diff upstream/master -u -- "*.py" | flake8 --diff
Works by using, as much as possible, SparseBlock.setitem(), which calls into SparseArray.set_values(), which returns a new (replacement) SparseArray object.