ENH: SparseDataFrame/SparseSeries value assignment #17785

kernc · 2017-10-04T16:31:31Z

closes BUG: Sparse structures don't fully support value assignment #21818
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Works by as much as possible using SparseBlock.setitem() which calls into SparseArray.set_values(), which returns a new (replacement) SparseArray object.

pep8speaks · 2017-10-04T16:31:41Z

Hello @kernc! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 14, 2018 at 01:24 Hours UTC

kernc · 2017-10-04T16:33:10Z

pandas/core/indexing.py

+
+        # allow arbitrary setting
+        if is_setter:
+            return list(key)


Copied from _AtIndexer above. Probably to make a test work.

kernc · 2017-10-04T16:34:22Z

pandas/core/internals.py


        if isinstance(new, np.ndarray) and len(new) == len(mask):
            new = new[mask]

-        mask = _safe_reshape(mask, new_values.shape)
+        mask = _safe_reshape(np.asarray(mask), new_values.shape)


No longer necessary since sparse short-circuit above.

kernc · 2017-10-04T16:37:02Z

pandas/core/internals.py

@@ -2753,6 +2761,18 @@ def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
        return self.make_block_same_class(values=values,
                                          placement=self.mgr_locs)

+    def _can_hold_element(self, element):
+        return np.can_cast(np.asarray(element).dtype, self.sp_values.dtype)


This seems like a reasonable default. @jreback your recent commit 4efe656#diff-e705e723b2d6e7c0e2a0443f80916abfR609 indicates this must for some reason be strict(er)?

yes this looks ok

kernc · 2017-10-04T16:37:24Z

pandas/core/internals.py

+
+    def _try_coerce_result(self, result):
+        if (
+                    # isinstance(result, np.ndarray) and


A failing test where a SparseDataFrame is sliced horizontally requires either this line ...

kernc · 2017-10-04T16:38:29Z

pandas/core/internals.py

@@ -3750,7 +3770,8 @@ def fast_xs(self, loc):
            # Such assignment may incorrectly coerce NaT to None
            # result[blk.mgr_locs] = blk._slice((slice(None), loc))
            for i, rl in enumerate(blk.mgr_locs):
-                result[rl] = blk._try_coerce_result(blk.iget((i, loc)))
+                # result[rl] = blk._try_coerce_result(blk.iget((i, loc)))
+                result[rl] = blk.iget((i, loc))


... or not calling _try_coerce_result() here.

gfyoung · 2017-10-04T19:17:48Z

@kernc : A couple of points:

Travis and company are complaining pretty loudly about some failing tests. Could you address those?
Could you point to the issue number that you're fixing? It looks like you pointed to your own PR.

kernc · 2017-10-10T18:40:18Z

pandas/core/internals.py

+        return np.can_cast(np.asarray(element).dtype, self.sp_values.dtype)
+
+    def _try_coerce_result(self, result):
+        if (isinstance(result, np.ndarray) and


A failing test where a SparseDataFrame is sliced horizontally requires this check. In that test, the items of a dtype=object array are empty lists. Were they ndarray objects, this method would coerce them into SparseArrays (i.e. no longer pure ndarrays).

kernc · 2017-10-10T23:31:15Z

Handling the warnings now. What do I do when there are two warnings to catch with assert_produces_warning (FutureWarning and PerformanceWarning)? Should I even emit PerformanceWarning as I do?

jreback · 2017-10-11T13:09:58Z

the FutureWarning is prob enough

kernc · 2017-10-11T14:04:07Z

The FutureWarning is about the recent .set_value() deprecation, PerformanceWarning is about setting values on a sparse structure (makes a copy of data, sometimes densifies data).

The problem with assert_produces_warning is that if any other warnings are encountered, the whole thing fails.

Can I amend assert_produces_warning to also accept a tuple of expected warning types? Alternatively, I can add a flag error_unexpected=True which, when set to False, won't error on additional unexpected warnings?

jreback · 2017-11-10T21:13:01Z

can you rebase / update

codecov · 2017-11-11T16:40:08Z

Codecov Report

Merging #17785 into master will decrease coverage by 0.21%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17785      +/-   ##
==========================================
- Coverage   91.42%   91.21%   -0.22%     
==========================================
  Files         163      163              
  Lines       50064    50036      -28     
==========================================
- Hits        45773    45640     -133     
- Misses       4291     4396     +105

Flag	Coverage Δ
#multiple	`89.02% <100%> (-0.2%)`	⬇️
#single	`40.22% <21.27%> (-0.2%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/series.py	`94.98% <ø> (-0.03%)`	⬇️
pandas/core/frame.py	`97.76% <100%> (-0.14%)`	⬇️
pandas/core/sparse/series.py	`95.84% <100%> (+0.56%)`	⬆️
pandas/core/sparse/array.py	`91.96% <100%> (-0.33%)`	⬇️
pandas/core/sparse/frame.py	`94.92% <100%> (+0.14%)`	⬆️
pandas/core/internals.py	`94.26% <100%> (-0.28%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/generic.py	`92.09% <0%> (-3.64%)`	⬇️
pandas/core/indexes/range.py	`92.83% <0%> (-2.83%)`	⬇️
pandas/util/_decorators.py	`78% <0%> (-2.71%)`	⬇️
... and 58 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3493aba...137304e. Read the comment docs.

codecov · 2017-11-11T16:40:11Z

Codecov Report

Merging #17785 into master will decrease coverage by 0.16%.
The diff coverage is 93.33%.

@@            Coverage Diff             @@
##           master   #17785      +/-   ##
==========================================
- Coverage   92.08%   91.91%   -0.17%     
==========================================
  Files         169      164       -5     
  Lines       50706    50018     -688     
==========================================
- Hits        46691    45975     -716     
- Misses       4015     4043      +28

Flag	Coverage Δ
#multiple	`90.3% <93.33%> (-0.19%)`	⬇️
#single	`42.14% <21.66%> (-0.19%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/series.py	`94.11% <ø> (+0.39%)`	⬆️
pandas/core/sparse/series.py	`95.77% <100%> (+0.55%)`	⬆️
pandas/core/sparse/array.py	`91.81% <100%> (+0.43%)`	⬆️
pandas/core/sparse/frame.py	`94.91% <100%> (+0.12%)`	⬆️
pandas/util/testing.py	`85.44% <100%> (-0.62%)`	⬇️
pandas/core/frame.py	`97.19% <100%> (-0.05%)`	⬇️
pandas/core/internals.py	`95.2% <77.77%> (ø)`
pandas/util/_doctools.py	`0% <0%> (-12.88%)`	⬇️
pandas/core/reshape/util.py	`90.32% <0%> (-9.68%)`	⬇️
pandas/util/_decorators.py	`82.25% <0%> (-9.09%)`	⬇️
... and 74 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffae158...4779c36. Read the comment docs.

kernc · 2017-11-11T16:45:25Z

Of course; thanks for coming around to it!

There's still the issue of two distinct Warnings being emitted but assert_produces_warning only tests for one, raising on the other. Leaning towards extending assert_produces_warning for such a case.

whatsnew is also missing as this is all highly preliminary. Thanks.

kernc · 2017-11-14T21:38:50Z

pandas/core/internals.py

+        # For SparseBlock, self.values is always 1D. If cond was a frame,
+        # it's 2D values would incorrectly broadcast later on.
+        if values.ndim == 1 and any(ax == 1 for ax in cond.shape):
+            cond = cond.ravel()


This fixes #17198, but likely in an incorrect way.

hmm, can you define .where() in SparseBlock and do a super call here instead?

kernc · 2017-11-14T21:41:31Z

The tests in fact pass (only the linter complains about unused imports I don't actually intend to keep). @jreback I think this PR while working is full of incorrect solutions so I'd really appreciate if you could have a look-over.

jreback · 2018-01-21T18:21:18Z

closing as stale. if you want to continue working, pls ping.

@kernc happy to have sparse fixes. pls ping if you want to fixup.

kernc · 2018-01-21T18:26:51Z

This was a planned enhancement. I guess I should wait for #19239 (#19174). If SparseArray survives, the bits are there.

jreback · 2018-01-21T18:47:18Z

@kernc actually no, #19174 is a long ways off. I would certainly like to do this before if at all possible. that way the tests (and code) are there. I am happy to have this. let's try to get over the finish line.

kernc · 2018-01-21T18:49:32Z

Ok. Then please give it a first pass as it is! 😉

kernc · 2018-07-08T18:18:06Z

So the travis tests pass. Now if anyone could have a first look over.

I'm particularly interested in _try_coerce_result, _can_hold_element, extending of assert_produces_warning ... i.e. anything an insider finds obviously wrong or lacking.

kernc · 2018-07-08T20:08:56Z

pandas/core/sparse/array.py

+        # If label already in sparse index, just switch the value on a copy
+        idx = self.sp_index.lookup(indexer)
+        if idx != -1:
+            obj = self.copy()


Should indeed make a copy or might we reuse (without the warning)?

sure if you can do inplace w/o the copy then it is ok. yeah I wouldn't warn unless you are actually copying

kernc · 2018-07-12T00:29:28Z

@jreback please have a look. It's all green. 🙏 The first four commits are where the magic is, the rest was mostly figuring out warnings in CI tests.

jreback

only partially looked

jreback · 2018-07-12T00:34:37Z

pandas/core/internals.py

+        # For SparseBlock, self.values is always 1D. If cond was a frame,
+        # it's 2D values would incorrectly broadcast later on.
+        if values.ndim == 1 and any(ax == 1 for ax in cond.shape):
+            cond = cond.ravel()


hmm, can you define .where() in SparseBlock and do a super call here instead?

jreback · 2018-07-12T00:34:49Z

pandas/core/internals.py

@@ -1809,6 +1817,11 @@ def putmask(self, mask, new, align=True, inplace=False, axis=0,
        new_values = self.values if inplace else self.copy().values
        new_values, _, new, _ = self._try_coerce_args(new_values, new)

+        if is_sparse(new_values):


similar to above

I'll get used to it. :) Done. I hope that's what you meant. Much simpler now altogether.

jreback · 2018-07-12T00:34:57Z

pandas/core/internals.py

@@ -2753,6 +2761,18 @@ def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
        return self.make_block_same_class(values=values,
                                          placement=self.mgr_locs)

+    def _can_hold_element(self, element):
+        return np.can_cast(np.asarray(element).dtype, self.sp_values.dtype)


yes this looks ok

kernc · 2018-07-12T02:27:36Z

pandas/tests/sparse/frame/test_frame.py

        tm.assert_index_equal(res2.columns,
-                              pd.Index(list(self.frame.columns) + ['qux']))
+                              pd.Index(list(self.frame.columns)))


To justify, this test changed because (deprecated) SparseDataFrame.set_value() was removed in favor of superclass frame's (deprecated) set_value() which edits and returns the same object.

kernc · 2018-07-13T08:11:16Z

@jreback Please have another look. The docstrings for the two new override methods have been copied nearly verbatim from the superclass, without clean-up. Also the tm.assert_produces_warning() addition was done simpler in another way.

jreback

looks pretty good. just a few questions.

jreback · 2018-07-14T15:42:22Z

pandas/core/sparse/array.py

+        # If label already in sparse index, just switch the value on a copy
+        idx = self.sp_index.lookup(indexer)
+        if idx != -1:
+            obj = self.copy()


sure if you can do inplace w/o the copy then it is ok. yeah I wouldn't warn unless you are actually copying

jreback · 2018-07-14T15:42:42Z

pandas/core/sparse/array.py

+
+        indices = np.insert(indices, pos, indexer)
+        sp_values = np.insert(self.sp_values, pos, value)
+        # Length can be increased when adding a new value into index


add a line before comment

jreback · 2018-07-14T15:43:37Z

pandas/core/sparse/array.py

+        sp_values = np.insert(self.sp_values, pos, value)
+        # Length can be increased when adding a new value into index
+        length = max(self.sp_index.length, indexer + 1)
+        sp_index = _make_index(length, indices, self.kind)


no copy here AFICT ?

It is above with

sp_values = np.insert(self.sp_values, pos, value)

jreback · 2018-07-14T15:43:47Z

pandas/core/sparse/array.py

@@ -544,6 +592,10 @@ def astype(self, dtype=None, copy=True):
        return self._simple_new(sp_values, self.sp_index,
                                fill_value=fill_value)

+    def tolist(self):
+        """Return *dense* self as list"""
+        return self.values.tolist()


test for this?

jreback · 2018-07-14T15:44:20Z

pandas/core/sparse/series.py

@@ -277,8 +276,13 @@ def __array_wrap__(self, result, context=None):
        else:
            fill_value = self.fill_value

+        # Assume: If result size matches, old sparse index is valid (ok???)


can you give an example here?

~sparseseries is an example. Put in the comment.

jreback · 2018-07-14T15:44:29Z

pandas/core/sparse/series.py

-                             kind=self.kind)
-        self._data = SingleBlockManager(values, new_index)
-        self._index = new_index
+        self._data = self._data.copy()


do we need to copy?

jreback · 2018-07-14T15:45:52Z

pandas/tests/sparse/frame/test_frame.py

        # ok, as the index gets converted to object
        frame = self.frame.copy()
        with tm.assert_produces_warning(FutureWarning,
-                                        check_stacklevel=False):
+                                        check_stacklevel=False,
+                                        ignore_extra=True):


why are you adding ignore_extra?

jreback · 2018-07-14T15:46:51Z

pandas/tests/sparse/frame/test_frame.py

        with tm.assert_produces_warning(FutureWarning,
-                                        check_stacklevel=False):
+                                        check_stacklevel=False,
+                                        ignore_extra=True):


jreback · 2018-07-14T15:47:32Z

pandas/util/testing.py

@@ -2465,6 +2466,8 @@ class for all warnings. To check that no warning is returned,
        If True, displays the line that called the function containing
        the warning to show were the function is called. Otherwise, the
        line that implements the function is displayed.
+    ignore_extra : bool, default False


no we don't want to add this

Besides hstacking cols (data copy), this densified SparseDataFrame.

Also fix .where for sparse blocks. Discrepancy comes from: dense_frame._data.blocks[0].values # this is 2D even for 1D block sparse_frame._data.blocks[0].values # this is always 1D I'm sure this had worked before and was unneeded in Oct 2017.

kernc · 2018-08-14T01:27:11Z

This PR is likely obsoleted by #21978 and its PR #22325. Understood and approved.

TomAugspurger · 2018-08-14T02:19:19Z

I’m hoping to steal pieces from this once #22325 is in.

mroeschke · 2018-11-13T05:56:33Z

@TomAugspurger can this PR build upon the SparseArray work or has that superseded this PR?

TomAugspurger · 2018-11-13T12:23:22Z

SparseArray doesn't support setting values yet. I haven't had a chance to into this. I think the core of the approach in https://github.com/pandas-dev/pandas/pull/17785/files#diff-71caf9627e9687e837e4b1f86ecc6271R373 should still be valid on top of SparseArray.

…

On Mon, Nov 12, 2018 at 11:56 PM Matthew Roeschke ***@***.***> wrote: @TomAugspurger <https://github.com/TomAugspurger> can this PR build upon the SparseArray work or has that superseded this PR? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#17785 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIg9fUZocG2EOwoSJyj3PmuI0Btuqks5uul8WgaJpZM4Pt5Os> .

WillAyd · 2018-11-26T03:55:35Z

Closing based off previous comments - seems like this PR is dead (though may inspire others).

@kernc if you want to reopen please ping

kernc commented Oct 4, 2017

View reviewed changes

kernc mentioned this pull request Oct 4, 2017

SparseSeries accepts scipy.sparse.spmatrix in constructor #16617

Closed

4 tasks

gfyoung added the Sparse Sparse Data Type label Oct 4, 2017

kernc force-pushed the sparse-assign branch 2 times, most recently from b106895 to 5605e87 Compare October 5, 2017 18:05

kernc commented Oct 10, 2017

View reviewed changes

kernc force-pushed the sparse-assign branch from aa49161 to da09073 Compare October 10, 2017 20:08

kernc force-pushed the sparse-assign branch from da09073 to 137304e Compare November 11, 2017 16:39

kernc force-pushed the sparse-assign branch from 49f83ca to 23a2e92 Compare November 14, 2017 01:26

kernc commented Nov 14, 2017

View reviewed changes

jreback closed this Jan 21, 2018

kernc changed the title ~~SparseDataFrame/SparseSeries value assignment~~ ENH: SparseDataFrame/SparseSeries value assignment Jan 21, 2018

jreback reopened this Jan 21, 2018

kernc force-pushed the sparse-assign branch 2 times, most recently from 5012bcc to 5b7d0f9 Compare February 23, 2018 06:24

kernc force-pushed the sparse-assign branch 2 times, most recently from 69d0e81 to 7d3a577 Compare July 8, 2018 06:04

kernc commented Jul 8, 2018

View reviewed changes

kernc force-pushed the sparse-assign branch 2 times, most recently from 4cb1ec7 to e5950ed Compare July 9, 2018 01:38

jreback requested changes Jul 12, 2018

View reviewed changes

kernc force-pushed the sparse-assign branch 3 times, most recently from 93a0fc0 to c86e0ec Compare July 12, 2018 02:28

kernc commented Jul 12, 2018

View reviewed changes

kernc force-pushed the sparse-assign branch 3 times, most recently from 1a5c3df to 13a033e Compare July 12, 2018 15:25

jreback requested changes Jul 14, 2018

View reviewed changes

kernc added 6 commits August 14, 2018 03:21

PERF: frame: avoid unnecessary .values calls

b4889c8

Besides hstacking cols (data copy), this densified SparseDataFrame.

SparseDataFrame: inherit default_kind and default_fill_value

22b0346

fixup! ENH: Allow SparseDataFrame/SparseSeries values assignment

35e8c61

fixup! ENH: Allow SparseDataFrame/SparseSeries values assignment

83c1f2b

TST: SparseArray.tolist()

4779c36

kernc force-pushed the sparse-assign branch from 39e2caf to 4779c36 Compare August 14, 2018 01:24

WillAyd closed this Nov 26, 2018

ENH: SparseDataFrame/SparseSeries value assignment #17785

ENH: SparseDataFrame/SparseSeries value assignment #17785

Conversation

kernc commented Oct 4, 2017 • edited Loading

pep8speaks commented Oct 4, 2017 • edited Loading

Comment last updated on August 14, 2018 at 01:24 Hours UTC

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kernc Oct 4, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kernc Oct 4, 2017 • edited Loading

Choose a reason for hiding this comment

kernc Oct 4, 2017 • edited Loading

Choose a reason for hiding this comment

gfyoung commented Oct 4, 2017

Choose a reason for hiding this comment

kernc commented Oct 10, 2017

jreback commented Oct 11, 2017

kernc commented Oct 11, 2017 • edited Loading

jreback commented Nov 10, 2017

codecov bot commented Nov 11, 2017

Codecov Report

codecov bot commented Nov 11, 2017 • edited Loading

Codecov Report

kernc commented Nov 11, 2017

kernc Nov 14, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kernc commented Nov 14, 2017

jreback commented Jan 21, 2018

kernc commented Jan 21, 2018 • edited Loading

jreback commented Jan 21, 2018

kernc commented Jan 21, 2018 • edited Loading

kernc commented Jul 8, 2018 • edited Loading

kernc Jul 8, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kernc commented Jul 12, 2018 • edited Loading

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kernc commented Jul 13, 2018

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kernc commented Aug 14, 2018

TomAugspurger commented Aug 14, 2018

mroeschke commented Nov 13, 2018

TomAugspurger commented Nov 13, 2018 via email

WillAyd commented Nov 26, 2018

kernc commented Oct 4, 2017 •

edited

Loading

pep8speaks commented Oct 4, 2017 •

edited

Loading

kernc Oct 4, 2017 •

edited

Loading

kernc Oct 4, 2017 •

edited

Loading

kernc Oct 4, 2017 •

edited

Loading

kernc commented Oct 11, 2017 •

edited

Loading

codecov bot commented Nov 11, 2017 •

edited

Loading

kernc Nov 14, 2017 •

edited

Loading

kernc commented Jan 21, 2018 •

edited

Loading

kernc commented Jan 21, 2018 •

edited

Loading

kernc commented Jul 8, 2018 •

edited

Loading

kernc Jul 8, 2018 •

edited

Loading

kernc commented Jul 12, 2018 •

edited

Loading