Skip to content

Commit

Permalink
ENH: Use sparse DataFrames for the BIOM table DF!
Browse files Browse the repository at this point in the history
This should make rankratioviz use a lot less memory, at least in
the python side of things. I'm uh not confident that this will
make huge BIOM tables usable in the viz interface (if python will
choke on a dense representation, then JS will almost certainly also),
but at least this will make the generation script faster.
  • Loading branch information
fedarko committed Apr 30, 2019
1 parent 39767ab commit beaf464
Showing 1 changed file with 17 additions and 6 deletions.
23 changes: 17 additions & 6 deletions rankratioviz/generate.py
Original file line number Diff line number Diff line change
Expand Up @@ -113,12 +113,23 @@ def process_input(feature_ranks, sample_metadata, biom_table,
ensure_df_headers_unique(sample_metadata, "sample metadata")
logging.debug("Ensured uniqueness of sample metadata.")

# Old versions of BIOM accidentally produce an effectively-dense DataFrame,
# due to not specifying 0 as the default fill_value. TODO: explicitly
# convert the DataFrame to a sparse one, or use a different representation
# entirely.
table = biom_table.to_dataframe().to_dense()
logging.debug("Converted BIOM table to DataFrame.")
logging.debug("Creating a SparseDataFrame from the BIOM table.")

# Old versions of BIOM accidentally produce an effectively-dense DataFrame
# when using biom_table.to_dataframe() (see
# https://github.com/biocore/biom-format/issues/808). To get around this,
# we extract the scipy.sparse.csr_matrix data from the BIOM table and
# directly convert that to a pandas SparseDataFrame.
sparse_matrix_data = biom_table.matrix_data
table = pd.SparseDataFrame(sparse_matrix_data, default_fill_value=0.0)

# The csr_matrix doesn't include column/index IDs, so we manually add them
# in to the SparseDataFrame.
table.index = biom_table.ids(axis="observation")
table.columns = biom_table.ids(axis="sample")

logging.debug("Converted BIOM table to SparseDataFrame.")

# Match features to BIOM table, and then match samples to BIOM table.
# This should bring us to a point where every feature/sample is
# supported in the BIOM table. (Note that the input BIOM table might
Expand Down

0 comments on commit beaf464

Please sign in to comment.