twid.tex

\documentclass[10pt,letterpaper]{article}
\usepackage[top=0.85in,left=2.75in,footskip=0.75in,marginparwidth=2in]{geometry}
\usepackage[utf8x]{inputenc}
\usepackage{listings}
\usepackage{nameref,hyperref}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{pgfplots}
\usepackage{pgfplotstable}
% clean citations
\usepackage{cite}


% line numbers
\usepackage[right]{lineno}

% improves typesetting in LaTeX % NOT SURE IF VALID
\usepackage{microtype}
% \DisableLigatures[f]{encoding = *, family = * }

% text layout - change as needed
\raggedright
\setlength{\parindent}{0.5cm}
\textwidth 5.25in 
\textheight 8.75in

% use adjustwidth environment to exceed text width (see examples in text)
\usepackage{changepage}

% adjust caption style
\usepackage[aboveskip=1pt,labelfont=bf,labelsep=period,singlelinecheck=off]{caption}

% remove brackets from references
\makeatletter
% \renewcommand{\@biblabel}[1]{\quad#1.}
\makeatother

% headrule, footrule and page numbers
\usepackage{lastpage,fancyhdr,graphicx}
\usepackage{epstopdf}
\pagestyle{myheadings}
\pagestyle{fancy}
\fancyhf{}
\rfoot{\thepage/\pageref{LastPage}}
\renewcommand{\footrule}{\hrule height 2pt \vspace{2mm}}
\fancyheadoffset[L]{2.25in}
\fancyfootoffset[L]{2.25in}

\usepackage{color}

% define custom colors (this one is for figure captions)
\definecolor{Gray}{gray}{.25}

% use if you want to put caption to the side of the figure - see example in text
\usepackage{sidecap}

% use for have text wrap around figures
\usepackage{wrapfig}
\usepackage[pscoord]{eso-pic}
\usepackage[fulladjust]{marginnote}
\reversemarginpar


\include{gccontent}
\include{hash}
\include{revcomp}
\include{transversions}


% document begins here
\begin{document}
\vspace*{0.35in}

% title goes here:
\begin{flushleft}
{\Large
\textbf\newline{Bit-twiddling on Nucleotides}
}
\newline
% authors go here:
\\
Fabian Klötzl\textsuperscript{1,*}
\\
\bigskip
\bf{1} MPI for Evolutionary Biology
\\
\bigskip
* kloetzl@evolbio.mpg.de
\\
\today
\end{flushleft}

\section*{Abstract}
Bits, nucleotides and speed.

\linenumbers


\section{Introduction}

Analyzing sequence data is one of the main focuses of bioinformatics. In this note we shall analyze four common tasks: computing the GC-content, hashing k-mers, computing the reverse complement and counting transversions. We present improved algorithms that show superior performance compared to naïve approaches.

The sources of this paper including \LaTeX\ and program code are available at \url{https://github.com/kloetzl/biotwiddle}.

\section{Materials and Methods}

A lot of bioinformatics tools work with nucleotide data. Usually, each nucleotide is represented by one character in IUBUB nomenclature. For the four standard DNA bases these are \texttt{A}, \texttt{C}, \texttt{G}, and \texttt{T}. These letters are commonly encoded in ASCII and take up one byte of memory. Assuming a C memory model, a sequence of nucleotides is stored as a string, continuous memory, with each byte representing one character delimited by a \texttt{NUL}-byte.

\subsection{Definitions}

An alphabet is a non-empty set of characters, commonly written as $\Sigma$. A finite sequence of characters $w = w_1 w_2 \ldots w_n$ with $w_i \in \Sigma$ is called a string with length $n = |w|$. In this note the alphabet is assumed to consist of the four nucleotides from DNA: $\Sigma = \{\texttt{A},\texttt{C},\texttt{G},\texttt{T}\}$.

Each character is also the representative for a seven bit sequence, namely their ASCII encoding. We allow the execution of bitwise logical operators on these sequences. For example, let $c$ and $d$ be two bit sequences. Then $c \mathbin{\&} d$ is the result of a bitwise-logical-and.


\subsection{GC-content}
\label{sec:gccontent}

The GC-content of DNA is the proportion of guanin and cytosin among all nucleotides. Let $w$ be a DNA sequence of length $n$. Furthermore, $S\colon \Sigma \to \{0,1\}$ maps a nucleotide to $1$ iff it is a $\texttt{C}$ or $\texttt{G}$. So the GC-content of the sequence $w$ is $\sum_{i=1}^n S(w_i) / n$.

This definition easily translates into code. For every character in the given string, check if it is \texttt{C} or \texttt{G}; if so, increase a counter. Once all characters are processed, compute the final ratio.

\lstdefinestyle{algo}{
    language=c,
    mathescape=true,
    columns=flexible,
    identifierstyle=\emph,
    inputencoding=utf8x,
    tabsize=4,
    % escapeinside={(*}{*)},
    % numbers=left,
    xleftmargin=2em,
    % numberstyle=\scriptsize,
    morekeywords=size_t
}

\lstset{style=algo}

\begin{lstlisting}
double gc(const char *seq) {  
    size_t gc = 0;
    const char *ptr = seq;

    for (; *ptr; ptr++) {
        if (*ptr == 'G' || *ptr == 'C') {
            gc++;
        }
    }

    return (double)gc / (ptr - seq);
}
\end{lstlisting}

This looks like three comparisons are made against each character, one for \texttt{NUL} and two against \texttt{C} and \texttt{G}. However, compilers can optimize them into just two comparisons. Luckily, in ASCII, \texttt{C} and \texttt{G} differ by only one bit. This enables optimizing compilers (or us) to rewrite the comparisons to use a bit mask. We ignore the bit which differs between \texttt{C} and \texttt{G} and check if the remaining bits equal the common bit pattern.

\begin{lstlisting}
double gc(const char *seq) {  
    size_t gc = 0;
    const char *ptr = seq;

    for (; *ptr; ptr++) {
    	char masked = *ptr & 'G' & 'C';
        if ( masked == ('G' & 'C')) {
            gc++;
        }
    }

    return (double)gc / (ptr - seq);
}
\end{lstlisting}

As most sequence data is encoded as ASCII, computing the GC-content the new way may result in great performance benefits for a whole lot of bioinformatics applications.


\subsection{Hashing}
\label{sec:hash}

A common procedure on k-mers is to hash them. This allows for compact representation in memory, or can be used as an index into hash-based data structures. As the DNA-alphabet consists of only four characters, two bits suffice to represent one nucleotide. With this we can define the following mapping for $k \ge |w|$.

\begin{align*}
	H(w, k) &= h(w_1) \cdot 4^{k} + h(w_2) \cdot 4^{k-1} + \cdots + h(w_k) \\
	&= \sum_{i=1}^k h(w_i)\cdot 4^{k-i+1} \\
\end{align*}
\begin{align*}
	h(x) &=
		\begin{cases}
			0 & \text{if } x = \texttt{A} \\
			1 & \text{if } x = \texttt{C} \\
			2 & \text{if } x = \texttt{G} \\
			3 & \text{if } x = \texttt{T}
		\end{cases}
\end{align*}

Again, this definition easily translates into code: Iterate over the first $k$ characters of a string, compute the map, and combine them into one number.

\begin{lstlisting}
size_t hash(const char *seq, size_t k) {  
    size_t return_value = 0;

    while (k--) {
    	char c = *seq++
    	size_t val = 0;
    	switch (c) {
    		case 'A': val = 0; break;
    		case 'C': val = 1; break;
    		case 'G': val = 2; break;
    		case 'T': val = 3; break;
    	}
    	return_value <<= 2;
    	return_value |= val;
    }

    return return_value;
}
\end{lstlisting}

The \lstinline!switch!-statement is convenient for humans to read, but not the most compact way to achieve the desired mapping. Instead of essentially doing four comparisons, we can resort to bit-twiddling, giving us the same mapping with fewer machine instructions.

\marginpar{
\vspace{.7cm}
\textbf{Table 1.} ASCII Table (excerpt)
% \vspace{5cm}
\begin{tabular}{ccc}
    \toprule
    Char & ASCII & bit code \\
    \hline
    A & 0x41 & 100\,0001 \\
    C & 0x43 & 100\,0011 \\
    G & 0x47 & 100\,0111 \\
    T & 0x54 & 101\,0100 \\
    \bottomrule
\end{tabular}
}

Remember, the nucleotides are stored as ASCII characters in memory. The actual values are shown in the table above. In the table, it can be seen that the lower bits 2 and 3 of each character uniquely identify it \cite{up2bit}. Further more, they almost achieve the desired mapping given as $h(x)$ above. However, if bit 3 is set (i.\,e. \texttt{G} or \texttt{T}) the bit 2 is wrong. Thus it needs to be flipped conditionally using a bitwise-exclusive-or as shown in the following code.

\begin{lstlisting}
size_t hash(const char *seq, size_t k) {  
    size_t return_value = 0;

    while (k--) {
    	char c = *seq++;
    	c &= 6;
    	c ^= c >> 1;
    	size_t val = c >> 1;
    	return_value <<= 2;
    	return_value |= val;
    }

    return return_value;
}
\end{lstlisting}

Finally, we have achieved the mapping we set out for. In C, this code compiles down to just five bit-twiddling instructions, making it the fastest order-preserving method available. Furthermore, unlike the switch-statement, this code can be vectorized.


\subsection{Reverse Complement}
\label{sec:revcomp}

DNA consists of a forward and a reverse complement strain. For this note we use the following, technical, definition. Let $w\in \Sigma^*$ be a DNA sequence with $\bar w$ being its reverse complement, given

% if $|\bar w| = |w|$ and its characters have the following property:

\begin{align*}
	\forall i\in[1,|w|]\colon \quad {\bar w}_{|w|-i-1} &=
		\begin{cases}
			\texttt{T} & \text{if } w_i = \texttt{A} \\
			\texttt{G} & \text{if } w_i = \texttt{C} \\
			\texttt{C} & \text{if } w_i = \texttt{G} \\
			\texttt{A} & \text{if } w_i = \texttt{T}
		\end{cases}
\end{align*}

A simple implementation for the reverse complement follows a similar scheme as the hashing procedure above. Iterate over all characters in the forward sequence, map everyone to its complement, and write the result to the reverse string.

This time we focus our efforts on the complementing. Specifically, complementing a nucleotide can be compactly written as $\texttt{A} \leftrightarrow \texttt{T}$ and $\texttt{C} \leftrightarrow \texttt{G}$. Let $c$ be the nucleotide to be complemented. Then the following is true $\bar c = \bar c \otimes 0 = \bar c \otimes (c \otimes c) = (\bar c \otimes c) \otimes c$, but also $c = (\bar c \otimes c) \otimes c$. So $c$ can be complemented back and forth with the same operation. Furthermore, $(\bar c \otimes c)$ simplifies to a constant value, reducing the complement to one machine instruction.

So two magic constants suffice to complement all four nucleotides: $21 = (\texttt{A} \otimes \texttt{T})$ and $4 = (\texttt{C} \otimes \texttt{G})$. To pick the right constant for complementing, a trick similar that is in the hashing section above, can be used; $\texttt{C}$ and $\texttt{G}$ have their bit 2 set, whereas $\texttt{A}$ and $\texttt{T}$ do not.


\begin{lstlisting}
char *revcomp(const char *forward, char *reverse, size_t len)
{
	for (size_t k = 0; k < len; k++) {
		char c = forward[k];
		char magic = c & 2 ? 4 : 21;

		reverse[len - k - 1] = c ^ magic;
	}

	reverse[len] = '\0';
	return reverse;
}
\end{lstlisting}

This code is surprisingly short and could be compacted even further. It contains only one branch and very simple instructions, making it fast and vectorizing-friendly.


\subsection{Mutations and Transversions}
\label{sec:transversions}

As a final task we focus our attention on comparing genomes, specifically, counting and classifying mutations. Given two genomes $s,q \in \Sigma^*$, with equal length, we are interested in the number of transversions separating the two sequences.

\begin{align*}
    \mathit{trans}(a,b) &= \sum_{i=1}^{|a|} \delta(a_i,b_i) \\
    \delta(a,b) &= \begin{cases}
        1 & \text{if } a \in \{\texttt{C},\texttt{T}\} \text{ and } b \not\in \{\texttt{C},\texttt{T}\} \\
        1 & \text{if } b \in \{\texttt{C},\texttt{T}\} \text{ and } a \not\in \{\texttt{C},\texttt{T}\} \\
        0 & \text{otherwise}
    \end{cases}
\end{align*}

To account for a transversion, exactly one of the two bases has to be a pyrimidine, and the other purine. A character $c$ is a pyrimidine in ASCII if $c + 1$ has a $1$ at bit 3. For purines that bit has the value $0$. So incrementing the characters by one allows for easy characterization of bases. With a bitwise-exclusive-or we can check if only one of the two bases is a pyrimidine.

\begin{lstlisting}
double transversions(const char *subject, size_t length, const char *query)
{
    size_t transversions = 0;

    for (size_t k = 0; k < length; k++) {
        if (((subject[k] + 1) ^ (query[k] + 1)) & 4) {
            transversions++;
        }
    }

    return (double)transversions / length;
}
\end{lstlisting}


\section{Results}

To evaluate the performance of the given methods, we simulated sequences with 100{,}000 nucleotides. On each sequence a method is run often enough to gain statistical confidence using the benchmark library by Google \cite{bench}. This process is repeated a number of runs, from which the \emph{minimum} is chosen as the best run-time \cite{min}.

Also, we inspect the runtime characteristics of different methods using the \textsc{perf} tools \cite{perf}. These allow measuring the instructions per cycle, branches and other features of a program.


\subsection{GC-Content}
{
In Section~\ref{sec:gccontent} we describe, how the GC-content of a sequence can be computed with

\marginpar{
\vspace{.7cm} % adjust vertical position relative to text with \vspace{} - note that you can enter negative numbers to move margin caption up
\color{Gray} % this gives caption a grey color to set it apart from text body
\textbf{Figure \ref{fig:gccontent}. Computing the GC-Content.} Runtimes of different methods to determine the GC-content from from sequences with 100{,}000 nucleotides. \textbf{Todo}: fix the y-axis.
}

\begin{wrapfigure}[12]{l}{53mm}
    \begin{tikzpicture}
        \begin{semilogyaxis}[
            width =5cm,
            height=5cm,
            enlarge x limits=.3,
            % title={GC-Content},
            ylabel={Runtime (ns)},
            xtick=data,
            % xlabel={Method},
            symbolic x coords={simple,twiddle,table},
            ybar
        ]

            \addplot plot coordinates {
                (simple, \gZgccontentZlength) (twiddle, \gZgccontentZlengthZtwiddle) (table, \gZgccontentZtable)
            };
            \addlegendentry{GCC}

            \addplot plot coordinates {
                (simple, \clangZgccontentZlength) (twiddle, \clangZgccontentZlengthZtwiddle) (table, \clangZgccontentZtable)
            };
            \addlegendentry{Clang}

        \end{semilogyaxis}
    \end{tikzpicture}
    \captionsetup{labelformat=empty} % makes sure dummy caption is blank
    \caption{}
    \label{fig:gccontent}
\end{wrapfigure}

\noindent fewer comparisons than~necessary~at~first~glance. We created two methods, one representing the simple way of counting and the other the explicit twiddling way. As modern compilers optimize the generated code, we do not expect any significant difference between the two methods. As a third method we use a table look up.

Figure~\ref{fig:gccontent} shows the benchmarking results for the three methods. 
Using the GNU Compiler all methods are almost equally fast. However, Clang fails to optimize the simple method. It still makes one extra comparison leading to many more branches and thus a high number of branch misses. This leads to a roughly 25 times slower function.

}

\vspace*{2cm}
\newpage{}
\subsection{Hashing}

We compare the two hashing procedures defined in Section~\ref{sec:hash}, with one using a look-up table. One uses a big switch statement to map the nucleotides to an index. The other

\begin{wrapfigure}[16]{r}{65mm}
    \begin{tikzpicture}
        \begin{axis}[
            width =5cm,
            height=5cm,
            enlarge x limits=.3,
        	% title={Hashing},
        	ylabel={Runtime (ns)},
        	xtick=data,
        	% xlabel={Method},
        	symbolic x coords={switch,twiddle,table},
        	ybar
        ]

            \addplot plot coordinates {
            	(switch, \gZhashZsimple) (twiddle, \gZhashZtwiddle) (table, \gZhashZtable)
            };
            \addlegendentry{GCC}

            \addplot plot coordinates {
            	(switch, \clangZhashZsimple) (twiddle, \clangZhashZtwiddle) (table, \clangZhashZtable)
            };
            \addlegendentry{Clang}

        \end{axis}
    \end{tikzpicture}
    \caption{Runtime of different k-mer hashing procedures.}
\end{wrapfigure}
\noindent exploits~the~ASCII~representation~of~the characters~to~achieve~the~same~result, much~faster.~We~measured~the~performance by~having~both~methods~compute~the hash of~all~$k$-mers~in~a~long~sequence with $k = 16$.

As we would have hoped, the twiddling method is faster than the simple switch, by a factor of 3. This is due to different reasons. The twiddle method utilizes more instruction-level parallelism. On an Intel Core i5-5200U the simple method executes about 1.19 instructions per cycle. However, using twiddling, that number can be ramped up to 3.75. On the contrary, the number of branches and especially the number of branch misses goes down. Whereas for the simple method 25\% of all branches are missed, it is only 0.02\% for twiddle.

Using a look-up table is even faster than twiddling by roughly 20\%. This is mainly due to a greatly reduced number of instructions.


\subsection{Reverse Complement}
{
Here we compare four different ways of computing the reverse complement. As a baseline, again we use our simple switch-statement. twiddle is the method using xors as presented in Section~\ref{sec:revcomp}. A third method used by programs such as \textsc{Blast} and

% A third method uses addition and subtraction instead of

\marginpar{
\vspace{.7cm} % adjust vertical position relative to text with \vspace{} - note that you can enter negative numbers to move margin caption up
\color{Gray} % this gives caption a grey color to set it apart from text body
\textbf{Figure \ref{fig:revcomp}. Reverse Complement.} Runtime of different methods for computing the reverse complement of a sequence with 100{,}000 nucleotides.
}
\begin{wrapfigure}{l}{75mm}
    \begin{tikzpicture}
        \begin{semilogyaxis}[
            width =7cm,
            height=7cm,
            enlarge x limits=.2,
            % title={Reverse Complement},
            ylabel={Runtime (ns)},
            xtick=data,
            % xlabel={Method},
            symbolic x coords={switch,twiddle,table, two step},
            ybar
        ]

            \addplot plot coordinates {
                (switch, \gZrevcompZsimple) (twiddle, \gZrevcompZtwiddle) (table, \gZrevcompZtable) (two step, \gZrevcompZtwostep)
            };
            \addlegendentry{GCC}

            \addplot plot coordinates {
                (switch, \clangZrevcompZsimple) (twiddle, \clangZrevcompZtwiddle) (table, \clangZrevcompZtable) (two step, \clangZrevcompZtwostep)
            };
            \addlegendentry{Clang}

        \end{semilogyaxis}
    \end{tikzpicture}
    \captionsetup{labelformat=empty} % makes sure dummy caption is blank
    \caption{}
    \label{fig:revcomp}
\end{wrapfigure}
\noindent  \textsc{MUMmer}~uses a table lookup. Lastly, one method (two step) splits the process into two parts. First, all nucleotides are complemented, then the string is reversed.

It can be seen in Figure~\ref{fig:revcomp} that the switch method is by far the slowest. All other methods are at least one order of magnitude faster. Among them the table lookup is the slowest, followed by a two-step process. Twiddling leads to the best performance. All results are the same across compilers.

}

\newpage
\subsection{Transversions}

{
To compare the performance of different ways of counting transversions, we simply

\marginpar{
    \vspace{.7cm}
    \color{Gray}
    \textbf{Figure \ref{fig:transversions}. Counting Transversions.} Runtime of different methods for counting mutations and transversions. Shown is the minimum of \gZhashZruns~runs.
}

\begin{wrapfigure}{l}{75mm}
    \begin{tikzpicture}
        \begin{semilogyaxis}[
            width=7cm,
            height=6cm,
            title={Transversions},
            ylabel={Runtime (ns)},
            xtick=data,
            enlarge x limits=.2,
            % xlabel={Method},
            symbolic x coords={mutations,trans,twiddle,table},
            ybar
        ]

            \addplot plot coordinates {
                (mutations, \gZtransversionsZmutationsZlength) (twiddle, \gZtransversionsZtwiddle) (trans, \gZtransversionsZtransversions) (table, \gZtransversionsZtable)
            };
            \addlegendentry{GCC}

            \addplot plot coordinates {
                (mutations, \clangZtransversionsZmutationsZlength) (twiddle, \clangZtransversionsZtwiddle) (trans, \clangZtransversionsZtransversions) (table, \clangZtransversionsZtable)
            };
            \addlegendentry{Clang}

        \end{semilogyaxis}
    \end{tikzpicture}
    \captionsetup{labelformat=empty}
    \caption{}
    \label{fig:transversions}
\end{wrapfigure}
\noindent choose~counting~mutations as a baseline.

The~simple~way~of~counting transversions~cannot be vectorized and thus is more than ten times slower than counting only mutations. However, the new twiddling method presented in Section~\ref{sec:transversions} can be vectorized, an thus is almost as fast as the first method. Likewise, using a look-up table cannot be vectorized as thus is slower than twiddling, but ten times faster than a simple comparison.


}

% \section*{Acknowledgments}
% Schlegel for the \LaTeX template %\cite{Schlegel2016}.

\vspace*{3cm}

\nolinenumbers

\bibliography{twid}

\bibliographystyle{abbrv}

\end{document}