-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtwid.tex
538 lines (411 loc) · 21 KB
/
twid.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
\documentclass[10pt,letterpaper]{article}
\usepackage[top=0.85in,left=2.75in,footskip=0.75in,marginparwidth=2in]{geometry}
\usepackage[utf8x]{inputenc}
\usepackage{listings}
\usepackage{nameref,hyperref}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{pgfplots}
\usepackage{pgfplotstable}
% clean citations
\usepackage{cite}
% line numbers
\usepackage[right]{lineno}
% improves typesetting in LaTeX % NOT SURE IF VALID
\usepackage{microtype}
% \DisableLigatures[f]{encoding = *, family = * }
% text layout - change as needed
\raggedright
\setlength{\parindent}{0.5cm}
\textwidth 5.25in
\textheight 8.75in
% use adjustwidth environment to exceed text width (see examples in text)
\usepackage{changepage}
% adjust caption style
\usepackage[aboveskip=1pt,labelfont=bf,labelsep=period,singlelinecheck=off]{caption}
% remove brackets from references
\makeatletter
% \renewcommand{\@biblabel}[1]{\quad#1.}
\makeatother
% headrule, footrule and page numbers
\usepackage{lastpage,fancyhdr,graphicx}
\usepackage{epstopdf}
\pagestyle{myheadings}
\pagestyle{fancy}
\fancyhf{}
\rfoot{\thepage/\pageref{LastPage}}
\renewcommand{\footrule}{\hrule height 2pt \vspace{2mm}}
\fancyheadoffset[L]{2.25in}
\fancyfootoffset[L]{2.25in}
\usepackage{color}
% define custom colors (this one is for figure captions)
\definecolor{Gray}{gray}{.25}
% use if you want to put caption to the side of the figure - see example in text
\usepackage{sidecap}
% use for have text wrap around figures
\usepackage{wrapfig}
\usepackage[pscoord]{eso-pic}
\usepackage[fulladjust]{marginnote}
\reversemarginpar
\include{gccontent}
\include{hash}
\include{revcomp}
\include{transversions}
% document begins here
\begin{document}
\vspace*{0.35in}
% title goes here:
\begin{flushleft}
{\Large
\textbf\newline{Bit-twiddling on Nucleotides}
}
\newline
% authors go here:
\\
Fabian Klötzl\textsuperscript{1,*}
\\
\bigskip
\bf{1} MPI for Evolutionary Biology
\\
\bigskip
* kloetzl@evolbio.mpg.de
\\
\today
\end{flushleft}
\section*{Abstract}
Bits, nucleotides and speed.
\linenumbers
\section{Introduction}
Analyzing sequence data is one of the main focuses of bioinformatics. In this note we shall analyze four common tasks: computing the GC-content, hashing k-mers, computing the reverse complement and counting transversions. We present improved algorithms that show superior performance compared to naïve approaches.
The sources of this paper including \LaTeX\ and program code are available at \url{https://github.com/kloetzl/biotwiddle}.
\section{Materials and Methods}
A lot of bioinformatics tools work with nucleotide data. Usually, each nucleotide is represented by one character in IUBUB nomenclature. For the four standard DNA bases these are \texttt{A}, \texttt{C}, \texttt{G}, and \texttt{T}. These letters are commonly encoded in ASCII and take up one byte of memory. Assuming a C memory model, a sequence of nucleotides is stored as a string, continuous memory, with each byte representing one character delimited by a \texttt{NUL}-byte.
\subsection{Definitions}
An alphabet is a non-empty set of characters, commonly written as $\Sigma$. A finite sequence of characters $w = w_1 w_2 \ldots w_n$ with $w_i \in \Sigma$ is called a string with length $n = |w|$. In this note the alphabet is assumed to consist of the four nucleotides from DNA: $\Sigma = \{\texttt{A},\texttt{C},\texttt{G},\texttt{T}\}$.
Each character is also the representative for a seven bit sequence, namely their ASCII encoding. We allow the execution of bitwise logical operators on these sequences. For example, let $c$ and $d$ be two bit sequences. Then $c \mathbin{\&} d$ is the result of a bitwise-logical-and.
\subsection{GC-content}
\label{sec:gccontent}
The GC-content of DNA is the proportion of guanin and cytosin among all nucleotides. Let $w$ be a DNA sequence of length $n$. Furthermore, $S\colon \Sigma \to \{0,1\}$ maps a nucleotide to $1$ iff it is a $\texttt{C}$ or $\texttt{G}$. So the GC-content of the sequence $w$ is $\sum_{i=1}^n S(w_i) / n$.
This definition easily translates into code. For every character in the given string, check if it is \texttt{C} or \texttt{G}; if so, increase a counter. Once all characters are processed, compute the final ratio.
\lstdefinestyle{algo}{
language=c,
mathescape=true,
columns=flexible,
identifierstyle=\emph,
inputencoding=utf8x,
tabsize=4,
% escapeinside={(*}{*)},
% numbers=left,
xleftmargin=2em,
% numberstyle=\scriptsize,
morekeywords=size_t
}
\lstset{style=algo}
\begin{lstlisting}
double gc(const char *seq) {
size_t gc = 0;
const char *ptr = seq;
for (; *ptr; ptr++) {
if (*ptr == 'G' || *ptr == 'C') {
gc++;
}
}
return (double)gc / (ptr - seq);
}
\end{lstlisting}
This looks like three comparisons are made against each character, one for \texttt{NUL} and two against \texttt{C} and \texttt{G}. However, compilers can optimize them into just two comparisons. Luckily, in ASCII, \texttt{C} and \texttt{G} differ by only one bit. This enables optimizing compilers (or us) to rewrite the comparisons to use a bit mask. We ignore the bit which differs between \texttt{C} and \texttt{G} and check if the remaining bits equal the common bit pattern.
\begin{lstlisting}
double gc(const char *seq) {
size_t gc = 0;
const char *ptr = seq;
for (; *ptr; ptr++) {
char masked = *ptr & 'G' & 'C';
if ( masked == ('G' & 'C')) {
gc++;
}
}
return (double)gc / (ptr - seq);
}
\end{lstlisting}
As most sequence data is encoded as ASCII, computing the GC-content the new way may result in great performance benefits for a whole lot of bioinformatics applications.
\subsection{Hashing}
\label{sec:hash}
A common procedure on k-mers is to hash them. This allows for compact representation in memory, or can be used as an index into hash-based data structures. As the DNA-alphabet consists of only four characters, two bits suffice to represent one nucleotide. With this we can define the following mapping for $k \ge |w|$.
\begin{align*}
H(w, k) &= h(w_1) \cdot 4^{k} + h(w_2) \cdot 4^{k-1} + \cdots + h(w_k) \\
&= \sum_{i=1}^k h(w_i)\cdot 4^{k-i+1} \\
\end{align*}
\begin{align*}
h(x) &=
\begin{cases}
0 & \text{if } x = \texttt{A} \\
1 & \text{if } x = \texttt{C} \\
2 & \text{if } x = \texttt{G} \\
3 & \text{if } x = \texttt{T}
\end{cases}
\end{align*}
Again, this definition easily translates into code: Iterate over the first $k$ characters of a string, compute the map, and combine them into one number.
\begin{lstlisting}
size_t hash(const char *seq, size_t k) {
size_t return_value = 0;
while (k--) {
char c = *seq++
size_t val = 0;
switch (c) {
case 'A': val = 0; break;
case 'C': val = 1; break;
case 'G': val = 2; break;
case 'T': val = 3; break;
}
return_value <<= 2;
return_value |= val;
}
return return_value;
}
\end{lstlisting}
The \lstinline!switch!-statement is convenient for humans to read, but not the most compact way to achieve the desired mapping. Instead of essentially doing four comparisons, we can resort to bit-twiddling, giving us the same mapping with fewer machine instructions.
\marginpar{
\vspace{.7cm}
\textbf{Table 1.} ASCII Table (excerpt)
% \vspace{5cm}
\begin{tabular}{ccc}
\toprule
Char & ASCII & bit code \\
\hline
A & 0x41 & 100\,0001 \\
C & 0x43 & 100\,0011 \\
G & 0x47 & 100\,0111 \\
T & 0x54 & 101\,0100 \\
\bottomrule
\end{tabular}
}
Remember, the nucleotides are stored as ASCII characters in memory. The actual values are shown in the table above. In the table, it can be seen that the lower bits 2 and 3 of each character uniquely identify it \cite{up2bit}. Further more, they almost achieve the desired mapping given as $h(x)$ above. However, if bit 3 is set (i.\,e. \texttt{G} or \texttt{T}) the bit 2 is wrong. Thus it needs to be flipped conditionally using a bitwise-exclusive-or as shown in the following code.
\begin{lstlisting}
size_t hash(const char *seq, size_t k) {
size_t return_value = 0;
while (k--) {
char c = *seq++;
c &= 6;
c ^= c >> 1;
size_t val = c >> 1;
return_value <<= 2;
return_value |= val;
}
return return_value;
}
\end{lstlisting}
Finally, we have achieved the mapping we set out for. In C, this code compiles down to just five bit-twiddling instructions, making it the fastest order-preserving method available. Furthermore, unlike the switch-statement, this code can be vectorized.
\subsection{Reverse Complement}
\label{sec:revcomp}
DNA consists of a forward and a reverse complement strain. For this note we use the following, technical, definition. Let $w\in \Sigma^*$ be a DNA sequence with $\bar w$ being its reverse complement, given
% if $|\bar w| = |w|$ and its characters have the following property:
\begin{align*}
\forall i\in[1,|w|]\colon \quad {\bar w}_{|w|-i-1} &=
\begin{cases}
\texttt{T} & \text{if } w_i = \texttt{A} \\
\texttt{G} & \text{if } w_i = \texttt{C} \\
\texttt{C} & \text{if } w_i = \texttt{G} \\
\texttt{A} & \text{if } w_i = \texttt{T}
\end{cases}
\end{align*}
A simple implementation for the reverse complement follows a similar scheme as the hashing procedure above. Iterate over all characters in the forward sequence, map everyone to its complement, and write the result to the reverse string.
This time we focus our efforts on the complementing. Specifically, complementing a nucleotide can be compactly written as $\texttt{A} \leftrightarrow \texttt{T}$ and $\texttt{C} \leftrightarrow \texttt{G}$. Let $c$ be the nucleotide to be complemented. Then the following is true $\bar c = \bar c \otimes 0 = \bar c \otimes (c \otimes c) = (\bar c \otimes c) \otimes c$, but also $c = (\bar c \otimes c) \otimes c$. So $c$ can be complemented back and forth with the same operation. Furthermore, $(\bar c \otimes c)$ simplifies to a constant value, reducing the complement to one machine instruction.
So two magic constants suffice to complement all four nucleotides: $21 = (\texttt{A} \otimes \texttt{T})$ and $4 = (\texttt{C} \otimes \texttt{G})$. To pick the right constant for complementing, a trick similar that is in the hashing section above, can be used; $\texttt{C}$ and $\texttt{G}$ have their bit 2 set, whereas $\texttt{A}$ and $\texttt{T}$ do not.
\begin{lstlisting}
char *revcomp(const char *forward, char *reverse, size_t len)
{
for (size_t k = 0; k < len; k++) {
char c = forward[k];
char magic = c & 2 ? 4 : 21;
reverse[len - k - 1] = c ^ magic;
}
reverse[len] = '\0';
return reverse;
}
\end{lstlisting}
This code is surprisingly short and could be compacted even further. It contains only one branch and very simple instructions, making it fast and vectorizing-friendly.
\subsection{Mutations and Transversions}
\label{sec:transversions}
As a final task we focus our attention on comparing genomes, specifically, counting and classifying mutations. Given two genomes $s,q \in \Sigma^*$, with equal length, we are interested in the number of transversions separating the two sequences.
\begin{align*}
\mathit{trans}(a,b) &= \sum_{i=1}^{|a|} \delta(a_i,b_i) \\
\delta(a,b) &= \begin{cases}
1 & \text{if } a \in \{\texttt{C},\texttt{T}\} \text{ and } b \not\in \{\texttt{C},\texttt{T}\} \\
1 & \text{if } b \in \{\texttt{C},\texttt{T}\} \text{ and } a \not\in \{\texttt{C},\texttt{T}\} \\
0 & \text{otherwise}
\end{cases}
\end{align*}
To account for a transversion, exactly one of the two bases has to be a pyrimidine, and the other purine. A character $c$ is a pyrimidine in ASCII if $c + 1$ has a $1$ at bit 3. For purines that bit has the value $0$. So incrementing the characters by one allows for easy characterization of bases. With a bitwise-exclusive-or we can check if only one of the two bases is a pyrimidine.
\begin{lstlisting}
double transversions(const char *subject, size_t length, const char *query)
{
size_t transversions = 0;
for (size_t k = 0; k < length; k++) {
if (((subject[k] + 1) ^ (query[k] + 1)) & 4) {
transversions++;
}
}
return (double)transversions / length;
}
\end{lstlisting}
\section{Results}
To evaluate the performance of the given methods, we simulated sequences with 100{,}000 nucleotides. On each sequence a method is run often enough to gain statistical confidence using the benchmark library by Google \cite{bench}. This process is repeated a number of runs, from which the \emph{minimum} is chosen as the best run-time \cite{min}.
Also, we inspect the runtime characteristics of different methods using the \textsc{perf} tools \cite{perf}. These allow measuring the instructions per cycle, branches and other features of a program.
\subsection{GC-Content}
{
In Section~\ref{sec:gccontent} we describe, how the GC-content of a sequence can be computed with
\marginpar{
\vspace{.7cm} % adjust vertical position relative to text with \vspace{} - note that you can enter negative numbers to move margin caption up
\color{Gray} % this gives caption a grey color to set it apart from text body
\textbf{Figure \ref{fig:gccontent}. Computing the GC-Content.} Runtimes of different methods to determine the GC-content from from sequences with 100{,}000 nucleotides. \textbf{Todo}: fix the y-axis.
}
\begin{wrapfigure}[12]{l}{53mm}
\begin{tikzpicture}
\begin{semilogyaxis}[
width =5cm,
height=5cm,
enlarge x limits=.3,
% title={GC-Content},
ylabel={Runtime (ns)},
xtick=data,
% xlabel={Method},
symbolic x coords={simple,twiddle,table},
ybar
]
\addplot plot coordinates {
(simple, \gZgccontentZlength) (twiddle, \gZgccontentZlengthZtwiddle) (table, \gZgccontentZtable)
};
\addlegendentry{GCC}
\addplot plot coordinates {
(simple, \clangZgccontentZlength) (twiddle, \clangZgccontentZlengthZtwiddle) (table, \clangZgccontentZtable)
};
\addlegendentry{Clang}
\end{semilogyaxis}
\end{tikzpicture}
\captionsetup{labelformat=empty} % makes sure dummy caption is blank
\caption{}
\label{fig:gccontent}
\end{wrapfigure}
\noindent fewer comparisons than~necessary~at~first~glance. We created two methods, one representing the simple way of counting and the other the explicit twiddling way. As modern compilers optimize the generated code, we do not expect any significant difference between the two methods. As a third method we use a table look up.
Figure~\ref{fig:gccontent} shows the benchmarking results for the three methods.
Using the GNU Compiler all methods are almost equally fast. However, Clang fails to optimize the simple method. It still makes one extra comparison leading to many more branches and thus a high number of branch misses. This leads to a roughly 25 times slower function.
}
\vspace*{2cm}
\newpage{}
\subsection{Hashing}
We compare the two hashing procedures defined in Section~\ref{sec:hash}, with one using a look-up table. One uses a big switch statement to map the nucleotides to an index. The other
\begin{wrapfigure}[16]{r}{65mm}
\begin{tikzpicture}
\begin{axis}[
width =5cm,
height=5cm,
enlarge x limits=.3,
% title={Hashing},
ylabel={Runtime (ns)},
xtick=data,
% xlabel={Method},
symbolic x coords={switch,twiddle,table},
ybar
]
\addplot plot coordinates {
(switch, \gZhashZsimple) (twiddle, \gZhashZtwiddle) (table, \gZhashZtable)
};
\addlegendentry{GCC}
\addplot plot coordinates {
(switch, \clangZhashZsimple) (twiddle, \clangZhashZtwiddle) (table, \clangZhashZtable)
};
\addlegendentry{Clang}
\end{axis}
\end{tikzpicture}
\caption{Runtime of different k-mer hashing procedures.}
\end{wrapfigure}
\noindent exploits~the~ASCII~representation~of~the characters~to~achieve~the~same~result, much~faster.~We~measured~the~performance by~having~both~methods~compute~the hash of~all~$k$-mers~in~a~long~sequence with $k = 16$.
As we would have hoped, the twiddling method is faster than the simple switch, by a factor of 3. This is due to different reasons. The twiddle method utilizes more instruction-level parallelism. On an Intel Core i5-5200U the simple method executes about 1.19 instructions per cycle. However, using twiddling, that number can be ramped up to 3.75. On the contrary, the number of branches and especially the number of branch misses goes down. Whereas for the simple method 25\% of all branches are missed, it is only 0.02\% for twiddle.
Using a look-up table is even faster than twiddling by roughly 20\%. This is mainly due to a greatly reduced number of instructions.
\subsection{Reverse Complement}
{
Here we compare four different ways of computing the reverse complement. As a baseline, again we use our simple switch-statement. twiddle is the method using xors as presented in Section~\ref{sec:revcomp}. A third method used by programs such as \textsc{Blast} and
% A third method uses addition and subtraction instead of
\marginpar{
\vspace{.7cm} % adjust vertical position relative to text with \vspace{} - note that you can enter negative numbers to move margin caption up
\color{Gray} % this gives caption a grey color to set it apart from text body
\textbf{Figure \ref{fig:revcomp}. Reverse Complement.} Runtime of different methods for computing the reverse complement of a sequence with 100{,}000 nucleotides.
}
\begin{wrapfigure}{l}{75mm}
\begin{tikzpicture}
\begin{semilogyaxis}[
width =7cm,
height=7cm,
enlarge x limits=.2,
% title={Reverse Complement},
ylabel={Runtime (ns)},
xtick=data,
% xlabel={Method},
symbolic x coords={switch,twiddle,table, two step},
ybar
]
\addplot plot coordinates {
(switch, \gZrevcompZsimple) (twiddle, \gZrevcompZtwiddle) (table, \gZrevcompZtable) (two step, \gZrevcompZtwostep)
};
\addlegendentry{GCC}
\addplot plot coordinates {
(switch, \clangZrevcompZsimple) (twiddle, \clangZrevcompZtwiddle) (table, \clangZrevcompZtable) (two step, \clangZrevcompZtwostep)
};
\addlegendentry{Clang}
\end{semilogyaxis}
\end{tikzpicture}
\captionsetup{labelformat=empty} % makes sure dummy caption is blank
\caption{}
\label{fig:revcomp}
\end{wrapfigure}
\noindent \textsc{MUMmer}~uses a table lookup. Lastly, one method (two step) splits the process into two parts. First, all nucleotides are complemented, then the string is reversed.
It can be seen in Figure~\ref{fig:revcomp} that the switch method is by far the slowest. All other methods are at least one order of magnitude faster. Among them the table lookup is the slowest, followed by a two-step process. Twiddling leads to the best performance. All results are the same across compilers.
}
\newpage
\subsection{Transversions}
{
To compare the performance of different ways of counting transversions, we simply
\marginpar{
\vspace{.7cm}
\color{Gray}
\textbf{Figure \ref{fig:transversions}. Counting Transversions.} Runtime of different methods for counting mutations and transversions. Shown is the minimum of \gZhashZruns~runs.
}
\begin{wrapfigure}{l}{75mm}
\begin{tikzpicture}
\begin{semilogyaxis}[
width=7cm,
height=6cm,
title={Transversions},
ylabel={Runtime (ns)},
xtick=data,
enlarge x limits=.2,
% xlabel={Method},
symbolic x coords={mutations,trans,twiddle,table},
ybar
]
\addplot plot coordinates {
(mutations, \gZtransversionsZmutationsZlength) (twiddle, \gZtransversionsZtwiddle) (trans, \gZtransversionsZtransversions) (table, \gZtransversionsZtable)
};
\addlegendentry{GCC}
\addplot plot coordinates {
(mutations, \clangZtransversionsZmutationsZlength) (twiddle, \clangZtransversionsZtwiddle) (trans, \clangZtransversionsZtransversions) (table, \clangZtransversionsZtable)
};
\addlegendentry{Clang}
\end{semilogyaxis}
\end{tikzpicture}
\captionsetup{labelformat=empty}
\caption{}
\label{fig:transversions}
\end{wrapfigure}
\noindent choose~counting~mutations as a baseline.
The~simple~way~of~counting transversions~cannot be vectorized and thus is more than ten times slower than counting only mutations. However, the new twiddling method presented in Section~\ref{sec:transversions} can be vectorized, an thus is almost as fast as the first method. Likewise, using a look-up table cannot be vectorized as thus is slower than twiddling, but ten times faster than a simple comparison.
}
% \section*{Acknowledgments}
% Schlegel for the \LaTeX template %\cite{Schlegel2016}.
\vspace*{3cm}
\nolinenumbers
\bibliography{twid}
\bibliographystyle{abbrv}
\end{document}