Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter: --query broken with backtick quoting #1415

Closed
victorlin opened this issue Feb 15, 2024 · 5 comments · Fixed by #1417
Closed

filter: --query broken with backtick quoting #1415

victorlin opened this issue Feb 15, 2024 · 5 comments · Fixed by #1417
Assignees
Labels
bug Something isn't working

Comments

@victorlin
Copy link
Member

victorlin commented Feb 15, 2024

flagged on Slack

Description

Previously, a metadata column could be queried using backtick quoting¹ supported by pandas.DataFrame.query. It no longer works as of d0f36a1.

¹ e.g. --query '(`region name` == "A")'

Affected versions

Augur 24.2.0, 24.2.1

Workaround

Use Augur <=24.1.0, or <=24.0.0 to avoid #1411. For users of Nextstrain's Docker runtime, NEXTSTRAIN_DOCKER_IMAGE=nextstrain/base:build-20240122T233533Z is the latest image that has Augur 24.0.0.

@victorlin victorlin added the bug Something isn't working label Feb 15, 2024
@victorlin victorlin self-assigned this Feb 15, 2024
@tsibley
Copy link
Member

tsibley commented Feb 15, 2024

If you want ideas for approaches here, let me know.

@victorlin
Copy link
Member Author

victorlin commented Feb 15, 2024

@tsibley thanks. Function in question is augur.filter.include_exclude_rules.extract_variables. I'm looking to see if we can leverage Pandas internals to do this. So far I have this, but it produces false-positives such as str and endswith given the value "coverage.str.endswith('.95')". It needs more context like how ast.walk can tell that str and endswith are property names, but tokenize_string lacks that info.

import keyword
from pandas.core.computation.parsing import tokenize_string, BACKTICK_QUOTED_STRING

# Backtick-quoted string or 1 (NAME)
return set(tokval for toknum, tokval in tokenize_string(pandas_query)
           if toknum == BACKTICK_QUOTED_STRING or (toknum == 1 and not keyword.iskeyword(tokval)))

Do you have other ideas?

@victorlin
Copy link
Member Author

Actually, false positives are manageable since the column names are known. I'll make a PR with what I drafted, but still open to suggestions!

@tsibley
Copy link
Member

tsibley commented Feb 15, 2024

I'd likely preprocess the query as a string to replace backtick-quoted names with generated names, then proceed with parsing as-is now, then translate generated names in resulting set back to the originals.

The preprocessing of backtick-quoted names can be done with a regex (it's a rather constrained problem to tackle) or with something more formal like a parsing expression grammar via pyparsing.

@victorlin
Copy link
Member Author

@tsibley I've implemented your suggestion as #1417.

There are now 3 PRs that address this bug:

The first two are mutually exclusive. I think the last one is good to have regardless.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants