Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fasta id should split after ANY whitespace #16

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

nh13
Copy link
Contributor

@nh13 nh13 commented Nov 29, 2023

No description provided.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@markschl
Copy link
Owner

Thanks for the pull request. I apologize for not yet looking at this. The reason is that, even though I agree that this is a good idea, I would prefer to treat byte strings (&[u8]) and &str types in a consistent way. Right now, the implementation mostly recognizes spaces and tabs, but id_desc() recognizes additional ASCII and Unicode characters by its use of char::is_whitespace. This may also have performance implications.

It may be worth doing some investigation on how other parsers handle the problem. Biopython uses title.split(None, 1)[0], which is similar to u8::is_ascii_whitespace resp. char::is_whitespace according to my (superficial) research.

In addition, str.split in Biopython recognizes runs of consecutive whitespace, so there will not be any descriptions starting with whitespace. However, I'm inclined not to follow that behaviour, since it is computationally more intensive and users wanting that behaviour can still trim the description by themselves using str::trim [unfortunately, slice::trim_ascii is not stable yet...].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants