Clarify the name tokeniser uncomp_len calculation (PR samtools#803)
This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes samtools#802

Also clarify the name tokeniser serialisation description.
Acknowledge the 1-byte "use_arith" field and replace the nebulous
"array elements" with a more descriptive text about token streams.
jkbonfield committed Jan 28, 2025
1 parent a6a4504 commit 7de0ae0
Showing 1 changed file with 12 additions and 5 deletions.
17 changes: 12 additions & 5 deletions CRAMcodecs.tex
@@ -2450,11 +2450,18 @@ \section{Name tokenisation codec}
a format within a format, as the multiple byte streams $B_{pos,type}$
are serialised into a single byte stream.

-The serialised data stream starts with two unsigned little endiand 32-bit
-integers holding the total size of uncompressed name buffer and the
-number of read names. This is followed the array elements
-themselves.
-
+The serialised data stream starts with two unsigned little endian
+32-bit integers holding the total size of uncompressed name buffer and
+the number of read names, and a flag byte indicating whether data is
+compressed with arithmetic coding or rANS Nx16.
+Note the uncompressed size is calculated as the sum of
+all name lengths including a termination byte per name (e.g. the nul
+char). This is irrespective of whether the implementation produces
+data in this form or whether it returns separate name and name-length
+arrays.
+
+This is then followed by serialised data and meta-data for each token
+stream.
Token types, $ttype$ holds one of the token ID values listed above
in the list above, plus special values to indicate certain additional
flags. Bit 6 (64) set indicates that this entire token data stream is a
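
The header layout described in the added text can be sketched as follows. This is a minimal illustration, not the reference implementation: it computes the uncompressed length as the sum of all name lengths plus one termination byte per name, then packs the two unsigned little endian 32-bit integers followed by the flag byte. The helper name `pack_header` is hypothetical, and the choice of 1 for arithmetic coding and 0 for rANS Nx16 is an assumption made for the example; the diff above only states that the flag byte distinguishes the two.

```python
import struct


def uncomp_len(names):
    """Sum of all name lengths plus one termination byte per name."""
    return sum(len(name) + 1 for name in names)


def pack_header(names, use_arith):
    """Pack the start of the serialised stream (hypothetical helper).

    Layout: uint32 LE uncompressed size, uint32 LE number of names,
    then one flag byte. The flag encoding (1 = arithmetic coding,
    0 = rANS Nx16) is an assumption for this sketch.
    """
    flag = 1 if use_arith else 0
    # "<" selects little endian with no padding: 4 + 4 + 1 = 9 bytes.
    return struct.pack("<IIB", uncomp_len(names), len(names), flag)


if __name__ == "__main__":
    names = [b"read1", b"read22"]
    header = pack_header(names, use_arith=False)
    print(uncomp_len(names), len(header), header.hex())
```

The length is the same whether an implementation stores names as one nul-terminated buffer or as separate name and name-length arrays, since only the per-name byte counts plus one terminator each enter the sum.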