Skip to content

Commit

Permalink
make Table.to_dataframe create real sparse frames (#809)
Browse files Browse the repository at this point in the history
* specify fill value

* fix flake8

* add to changelog

* fix test for Python 2
  • Loading branch information
cdiener authored and ElDeveloper committed Mar 7, 2019
1 parent 2932695 commit ac2f835
Show file tree
Hide file tree
Showing 3 changed files with 21 additions and 11 deletions.
14 changes: 8 additions & 6 deletions ChangeLog.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,21 +10,23 @@ New Features:

Bug fixes:

* `Table.to_dataframe(dense=False)` does now correctly produce sparse data frames (and not accidentally dense ones as before)

biom 2.1.7
----------

New features and bug fixes, released on 28 September 2018.

Important:
Important:

* Python 3.4 support has been dropped. We now only support Python 2.7, 3.5, 3.6 and 3.7.
* We will be dropping Python 2.7 support on the next release.
* Pandas >= 0.20.0 is now the minimum required version.
* Pandas >= 0.20.0 is now the minimum required version.
* pytest is now used instead of nose.

New Features:

* Massive performance boost to `Table.collapse` with the default collapse function. The difference was 10s of milliseconds vs. minutes stemming from prior use of `operator.add`. See [issue #761](https://github.com/biocore/biom-format/issues/761).
* Massive performance boost to `Table.collapse` with the default collapse function. The difference was 10s of milliseconds vs. minutes stemming from prior use of `operator.add`. See [issue #761](https://github.com/biocore/biom-format/issues/761).
* `Table.align_to` for aligning one table to another. This is useful in multi-omic analyses where multiple preparations have been performed on the sample physical samples. This is essentially a helper method around `Table.sort_order`. See [issue #747](https://github.com/biocore/biom-format/issues/747).
* Added additional sanity checks when calling `Table.to_hdf5`, see [PR #769](https://github.com/biocore/biom-format/pull/769).
* `Table.subsample()` can optionally perform subsampling with replacement. See [issue #774](https://github.com/biocore/biom-format/issues/774).
Expand All @@ -47,7 +49,7 @@ New Features:
* `Table.from_hdf5` now supports a rapid subset in the event that metadata is
not needed. In benchmarking against the Earth Microbiome Project BIOM table,
the reduction in runtime was multiple orders of magnitude while additionally
preserving substantial memory.
preserving substantial memory.
* `Table.rankdata` has been added to convert values to ranked abundances on
either axis. See [issue #645](https://github.com/biocore/biom-format/issues/639).
* Format of numbers in ``biom summarize-table`` output is now more readable and localized. See [issue #679](https://github.com/biocore/biom-format/issues/679).
Expand Down Expand Up @@ -105,8 +107,8 @@ Bug fixes:
* `biom --version` now prints the software version (previously the individual
commands did this, but not the base command).
* `Table.vlen_list_of_str_formatter` was considering a `str` to be valid for
formatting resulting in an obscure error when a `str`, as opposed to a
`list` of `str`, was used for taxonomy. See
formatting resulting in an obscure error when a `str`, as opposed to a
`list` of `str`, was used for taxonomy. See
[issue #709](https://github.com/biocore/biom-format/issues/709).

biom 2.1.4
Expand Down
9 changes: 5 additions & 4 deletions biom/table.py
Original file line number Diff line number Diff line change
Expand Up @@ -178,7 +178,7 @@
from copy import deepcopy
from datetime import datetime
from json import dumps
from functools import reduce
from functools import reduce, partial
from operator import itemgetter
from future.builtins import zip
from future.utils import viewitems
Expand Down Expand Up @@ -4045,9 +4045,10 @@ def to_dataframe(self, dense=False):
mat = self.matrix_data.toarray()
constructor = pd.DataFrame
else:
mat = [pd.SparseSeries(r.toarray().squeeze())
for r in self.matrix_data.tocsr()]
constructor = pd.SparseDataFrame
mat = self.matrix_data
constructor = partial(pd.SparseDataFrame,
default_fill_value=0,
copy=True)

return constructor(mat, index=index, columns=columns)

Expand Down
9 changes: 8 additions & 1 deletion biom/tests/test_table.py
Original file line number Diff line number Diff line change
Expand Up @@ -1475,10 +1475,17 @@ def test_add_group_metadata_w_existing_metadata(self):
def test_to_dataframe(self):
exp = pd.SparseDataFrame(np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]),
index=['O1', 'O2'],
columns=['S1', 'S2', 'S3'])
columns=['S1', 'S2', 'S3'],
default_fill_value=0.0)
obs = example_table.to_dataframe()
pdt.assert_frame_equal(obs, exp)

def test_to_dataframe_is_sparse(self):
df = example_table.to_dataframe()
density = (float(example_table.matrix_data.getnnz()) /
np.prod(example_table.shape))
assert np.allclose(df.density, density)

def test_to_dataframe_dense(self):
exp = pd.DataFrame(np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]),
index=['O1', 'O2'],
Expand Down

0 comments on commit ac2f835

Please sign in to comment.