Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

barplot: make taxonomy optional #153

Merged
merged 5 commits into from
May 17, 2023
Merged

Conversation

nbokulich
Copy link
Member

fixes #116 and #108

This PR makes FeatureData[Taxonomy] an optional input to barplot. In this case, feature labels are parsed from the feature IDs (either single-level, e.g., ASV IDs, or semicolon-delimited for multi-level)

In practice, this means that various other use cases are possible, e.g.,:

  1. input a collapsed table. Taxonomy is parsed from feature IDs.
  2. input an ASV/OTU table to look at ASV frequencies in a barplot.

I tested both of these use cases (and added appropriate unit tests). To fully validate, I ran the moving pictures tutorial and made barplots on the collapsed table. The output visualizations are identical.

@ebolyen
Copy link
Member

ebolyen commented Apr 21, 2023

cc @gregcaporaso and @colinvwood who've discussed this recently in the context of Kraken

@nbokulich
Copy link
Member Author

hey @ebolyen @gregcaporaso @colinvwood

I have added a parameter parse_ids, which is False by default. @ebolyen I think this is what you requested when we discussed yesterday?

Is this acceptable with this change? This change does not ever require hierarchical labels — it just attempts to parse those labels if the user sets parse_ids=True.

@lizgehret lizgehret requested a review from ebolyen May 8, 2023 17:15
Copy link
Member

@ebolyen ebolyen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @nbokulich,

We were discussing this and @lizgehret came up with an idea that we all like on our end.

What if instead of parse_ids, there was something like level_delimiter=None (which would be equivalent to parse_ids=False).

The intention would be that a user could provide this delimiter if they knew the IDs had hierarchical ranks. In most cases they would pass ;, but the reason we like this is it no longer imbues the ID with an implicit schema (the taxonomy format). Instead it's up to the user to decide to interpret their IDs as ranks and they need to provide the schema to do so.


I've sketched the idea below. Let us know what you think!

def barplot(output_dir: str, table: biom.Table, taxonomy: pd.Series,
metadata: Metadata = None) -> None:
def barplot(output_dir: str, table: biom.Table, taxonomy: pd.Series = None,
metadata: Metadata = None, parse_ids: bool = False) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metadata: Metadata = None, parse_ids: bool = False) -> None:
metadata: Metadata = None, level_delimiter: str = None) -> None:

Comment on lines 40 to 43
_ids = table.ids('observation')
taxonomy = pd.Series(_ids, index=_ids)
if not parse_ids:
collapse = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ids = table.ids('observation')
taxonomy = pd.Series(_ids, index=_ids)
if not parse_ids:
collapse = False
if level_delimiter is None:
collapse = False
else:
_ids = table.ids('observation')
ranks = [';'.join(r.split(level_delimiter)) for r in _ids]
taxonomy = pd.Series(ranks, index=_ids)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @ebolyen sounds good, thanks for the suggestion @lizgehret !

The delimiter could be passed to the _collapse_table function... then instead of just splitting and rejoining the ids, this also could be used to parse taxonomies with other delimiter (after all, I don't think that the semicolon delimiter is a requirement of the type, right?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just seeing this comment now - sorry. I commented on this below.

@nbokulich
Copy link
Member Author

hey @ebolyen please see what you think of my latest commit. This adds the level_delimiter parameter, but also allows users to use this param to parse taxonomies that have non-semicolon level delimiters. This seems like a not-so-rare edge case, as some taxonomies we have seen are, e.g., pipe-delimited or comma-delimited. See my note about how the default (None) is handled.

@gregcaporaso
Copy link
Member

@nbokulich, I don't think the delimiter option should be applied to FeatureData[Taxonomy] artifacts. This is effectively expanding the definition of the underlying format, which has always been assumed to be semi-colon delimited taxonomic levels (though I don't know if that is explicitly defined anywhere), to allow arbitrary delimiters. If we want to support this, creating a new format would be a better option, in which case the delimiter could even be defined as part of the format so the user doesn't need to know what it is. In any case though, this feels outside the scope of this PR and like something we should discuss if/how to support.

num_metadata_cols = metadata.column_count
metadata = metadata.to_dataframe()
jsonp_files, csv_files = [], []
collapsed_tables = _extract_to_level(taxonomy, table)
if collapse:
print(level_delimiter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove print statement.

'and a level_delimiter is passed, it will attempt '
'to parse hierarchical taxonomic information from '
'the feature ID labels. If no level_delimiter is '
'provided and a taxonomy is passed, simicolon (;) '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simicolon -> semicolon

@nbokulich
Copy link
Member Author

@gregcaporaso @ebolyen thanks for the feedback. I have reverted that last commit and replaced the level_delimiter param more or less as @ebolyen suggested above. I opted for the simpler str.replace(level_delimiter, ';') instead of ';'.join(str.split(level_delimiter)) as suggested by @ebolyen . Sound okay?

Copy link
Member

@gregcaporaso gregcaporaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me, thanks for enduring our multiple iterations of comments on this one @nbokulich!

Your deviation from @ebolyen's suggestion seems good to me, but I'll let him comment in case there was an edge case he was thinking about when making that suggestion.

I will note that this won't behave correctly in the pathological case of taxonomy with ; as part of the taxonomy labels (not as delimiters), but that's something that wouldn't have been handled well before anyway so I don't think we should worry about that.

I have tested with several data sets locally:

  1. No taxonomy provided, | as delimiters in feature ids, level_delimiter='|'
  2. No taxonomy provided, | as delimiters in feature ids
  3. Taxonomy provided, ; as level delimiters in taxonomy (our typical usage)
  4. Taxonomy provided, ; as level delimiters in taxonomy, level_delimiter='|' (should give the same output as 3)

This all worked as expected.

gregcaporaso

This comment was marked as duplicate.

@gregcaporaso
Copy link
Member

Just chatted with @ebolyen and he didn't have any particular motivation for split/join versus replace, so we're good to go. Thanks again @nbokulich!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

Make taxonomy optional in barplot
4 participants