LAPIS data: cannot reindex on an axis with duplicate labels #83

Closed
ktmeaton opened this issue Jun 28, 2022 · 3 comments
Assignees: victorlin
Labels: bug (Something isn't working), dependency (If bug is in dependency and needs to be fixed there)


@ktmeaton

Context

When using LAPIS data (data_source: "lapis"), the filter rule exits with the error: ValueError: cannot reindex on an axis with duplicate labels (see the full traceback below). I think augur is unhappy that a year column already exists in the LAPIS data.

Additional Context

I'm using a conda environment rather than the Docker image. The conda environment works flawlessly for Nextstrain data, just not for LAPIS data. My guess is that the newer pandas version (v1.4.2) plays a role, since augur also raises FutureWarning: reindexing with a non-unique Index is deprecated.
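
For readers unfamiliar with the error, here is a minimal sketch of how duplicate column labels behave in pandas. It assumes (this thread does not confirm it) that the metadata ends up with two columns both named year, one already present in the LAPIS data and one derived from date. The frames and values are made up for illustration.

```python
import pandas as pd

# Hypothetical metadata that already carries a 'year' column, as LAPIS data does.
metadata = pd.DataFrame(
    {"strain": ["a", "b"], "date": ["2022-06-01", None], "year": [2022, 2021]}
)

# Assumption: a second 'year' column is derived from 'date' and appended.
derived = pd.DataFrame({"year": [2022, None]})
combined = pd.concat([metadata, derived], axis=1)

# The column labels are no longer unique ...
duplicated_labels = combined.columns[combined.columns.duplicated()].tolist()
print(duplicated_labels)

# ... so selecting 'year' returns a two-column DataFrame, not a Series.
selection = combined["year"]
print(type(selection).__name__, selection.shape)
```

A boolean mask built from such a selection is itself a DataFrame, which matches the path in the traceback through DataFrame.where and reindex.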

Possible Solution

One way to solve this would be to drop the year column before the filter rule runs. Adding the following segment to scripts/wrangle_metadata.py fixes the issue for me:

# Remove the year column, because it will break augur filter.
# Assumes pandas is imported as pd and metadata is the DataFrame
# already loaded earlier in the script.
if "year" in metadata.columns:
  new_dates = []
  # Iterate through the 'date' and 'year' columns
  for s_date, s_year in zip(metadata["date"], metadata["year"]):

    # If date is null but year is not, build an ambiguous date from the year
    if pd.isna(s_date) and not pd.isna(s_year):
      new_dates.append("{}-XX-XX".format(int(s_year)))

    # If date is not null, use it as-is
    elif not pd.isna(s_date):
      new_dates.append(s_date)

    # Otherwise (both null), use None
    else:
      new_dates.append(None)

  metadata["date"] = new_dates
  metadata.drop(columns=["year"], inplace=True)
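
The same logic can be wrapped in a function so it can be tried on a toy frame before it touches real metadata. This is only a sketch: the function name fold_year_into_date and the example rows are hypothetical, and it returns a copy rather than modifying the frame in place.

```python
import pandas as pd


def fold_year_into_date(metadata: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where 'year' fills missing 'date' values and is then dropped."""
    if "year" not in metadata.columns:
        return metadata.copy()

    def pick(s_date, s_year):
        # Prefer the existing date; fall back to an ambiguous date built from the year.
        if not pd.isna(s_date):
            return s_date
        if not pd.isna(s_year):
            return "{}-XX-XX".format(int(s_year))
        return None

    result = metadata.copy()
    result["date"] = [pick(d, y) for d, y in zip(result["date"], result["year"])]
    return result.drop(columns=["year"])


# Toy rows: a full date, a year-only record, and a record with neither.
example = pd.DataFrame(
    {
        "strain": ["a", "b", "c"],
        "date": ["2022-06-01", None, None],
        "year": [2022.0, 2021.0, None],
    }
)
cleaned = fold_year_into_date(example)
print(cleaned)
```

Returning a copy keeps the original frame available for comparison, at the cost of some extra memory on large metadata tables.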

Steps to Reproduce

Here is the shell command in isolation (after LAPIS download):

augur filter \
  --sequences data/sequences.fasta \
  --metadata results/metadata.tsv \
  --exclude config/exclude_accessions_hmpxv1.txt \
  --output-sequences results/hmpxv1_lapis/filtered.fasta \
  --output-metadata results/hmpxv1_lapis/metadata.tsv \
  --group-by country year \
  --sequences-per-group 1000 \
  --min-date 2017 \
  --min-length 10000 \
  --output-log results/hmpxv1_lapis/filtered.log
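
Before running the command, it may help to check whether the metadata file already has a year column, since --group-by includes year. The sketch below reads only the header row. It assumes a tab-separated file with a header line. The helper name metadata_columns and the toy file are hypothetical stand-ins for results/metadata.tsv.

```python
import csv
import tempfile
from pathlib import Path


def metadata_columns(path):
    """Read only the header row of a tab-separated metadata file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle, delimiter="\t"), [])


# Toy file standing in for results/metadata.tsv.
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / "metadata.tsv"
    toy.write_text(
        "strain\tdate\tyear\tcountry\nA\t2022-06-01\t2022\tCanada\n",
        encoding="utf-8",
    )
    columns = metadata_columns(toy)

has_year = "year" in columns
print(columns, has_year)
```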

Environment

name: nextstrain-mpx
channels:
  - bioconda
  - conda-forge
  - anaconda
  - defaults
dependencies:
  - anaconda::python=3.9.10
  - anaconda::pip=22.0.3
  - conda-forge::pandas=1.4.2
  # Workflow
  - bioconda::snakemake=7.3.6
  # Phylogeny
  - bioconda::iqtree=2.2.0.3
  # Misc
  - bioconda::epiweeks=2.1.4
  - conda-forge::gzip>=1.6
  - pip:
    - nextstrain-augur==16.0.1

# Notes:
# - nextclade and nextalign: v2 must be manually installed and renamed to nextclade2 and nextalign2
#     wget -O $CONDA_PREFIX/bin/nextclade2 https://github.com/nextstrain/nextclade/releases/download/2.0.0-beta.5/nextclade-x86_64-unknown-linux-gnu
#     wget -O $CONDA_PREFIX/bin/nextalign2 https://github.com/nextstrain/nextclade/releases/download/2.0.0-beta.5/nextalign-x86_64-unknown-linux-gnu

Full Traceback

/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/augur/filter.py:953: FutureWarning: reindexing with a non-unique Index is deprecated and will raise in a future version.
  df_skip = metadata[metadata['year'].isnull()]
Traceback (most recent call last):
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/augur/__init__.py", line 81, in run
    return args.__command__.run(args)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/augur/filter.py", line 1424, in run
    group_by_strain, skipped_strains = get_groups_for_subsampling(
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/augur/filter.py", line 953, in get_groups_for_subsampling
    df_skip = metadata[metadata['year'].isnull()]
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/frame.py", line 3492, in __getitem__
    return self.where(key)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/frame.py", line 10955, in where
    return super().where(cond, other, inplace, axis, level, errors, try_cast)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/generic.py", line 9308, in where
    return self._where(cond, other, inplace, axis, level, errors=errors)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/generic.py", line 9075, in _where
    cond = cond.reindex(self._info_axis, axis=self._info_axis_number, copy=False)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/util/_decorators.py", line 324, in wrapper
    return func(*args, **kwargs)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/frame.py", line 4804, in reindex
    return super().reindex(**kwargs)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/generic.py", line 4966, in reindex
    return self._reindex_axes(
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/frame.py", line 4617, in _reindex_axes
    frame = frame._reindex_columns(
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/frame.py", line 4662, in _reindex_columns
    return self._reindex_with_indexers(
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/generic.py", line 5032, in _reindex_with_indexers
    new_data = new_data.reindex_indexer(
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 679, in reindex_indexer
    self.axes[axis]._validate_can_reindex(indexer)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4107, in _validate_can_reindex
    raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels
@ktmeaton ktmeaton added the bug Something isn't working label Jun 28, 2022
@victorlin victorlin self-assigned this Jun 28, 2022
@victorlin
Member

victorlin commented Jun 28, 2022

Thanks for flagging! This is a known bug in Augur (nextstrain/augur#871). It should be fixed by nextstrain/augur#967.

Since you are using a conda environment, you can install the fix locally if you want to use it before it's released:

git clone https://github.com/nextstrain/augur; cd augur/
git pull origin pull/967/head
pip install -e '.[dev]'

@ktmeaton
Author

Ah sorry for the duplicate issue post! Thanks for the link and temporary fix 😀

@corneliusroemer corneliusroemer added the dependency If bug is in dependency and needs to be fixed there label Jun 29, 2022
@victorlin
Member

This should be fixed with the release of Augur 16.0.2.
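
A quick way to check whether an environment already has that release is sketched below. It assumes the package is installed under the name nextstrain-augur (as in the environment file above) and that the version is a plain X.Y.Z string. The helper names parse_release and has_fix are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (16, 0, 2)  # release named in the comment above


def parse_release(text):
    """Parse a plain 'X.Y.Z' version string into a tuple of ints (no pre-release tags)."""
    return tuple(int(part) for part in text.split("."))


def has_fix(version_text):
    """True if the given version is at or after the release containing the fix."""
    return parse_release(version_text) >= FIXED_IN


try:
    installed = version("nextstrain-augur")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("nextstrain-augur is not installed in this environment")
else:
    try:
        print(installed, "includes fix:", has_fix(installed))
    except ValueError:
        print(installed, "is not a plain X.Y.Z version; check it by hand")
```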

Repository owner moved this from New to Done in Nextstrain planning (archived) Jul 1, 2022