Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify the name tokeniser uncomp_len calculation (PR #803) #803

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jkbonfield
Copy link
Contributor

This includes all visible read name bytes plus 1 termination byte per name (e.g. '\0').

Fixes #802

Copy link

github-actions bot commented Jan 7, 2025

Changed PDFs as of bece1f7: CRAMcodecs (diff).

CRAMcodecs.tex Outdated Show resolved Hide resolved
@cmnbroad
Copy link
Contributor

cmnbroad commented Jan 7, 2025

Looks good - thanks!

@jkbonfield jkbonfield added sam cram and removed sam labels Jan 7, 2025
jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Jan 7, 2025
This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes samtools#802
Copy link

github-actions bot commented Jan 7, 2025

Changed PDFs as of 4982e03: CRAMcodecs (diff).

CRAMcodecs.tex Outdated
Comment on lines 2455 to 2456
the number of read names. This is followed the array elements
themselves. Note the uncompressed size is calculated as the sum of
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialised data stream starts with two unsigned little endian 32-bit integers... This is followed the array elements themselves.

This is unrelated to the length calculation, but note there is also a 1 byte flag between the 2 integers and the data stream:

Bytes Type Name
4 uint32 uncomp_length
4 uint32 num_reads
1 uint8 use_arith

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It also had a poor and nebulous "array elements" term which I expanded on.

This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes samtools#802

Also clarify the name tokeniser serialisation description.
Acknowledge the 1-byte "use_arith" field and replace the nebulous
"array elements" with a more descriptive text about token streams.
Copy link

Changed PDFs as of 7de0ae0: CRAMcodecs (diff).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Needs review
Development

Successfully merging this pull request may close these issues.

Name tokenizer codec clarification
3 participants