
\chapter{Conclusions and Discussion}
\label{sec:conclusions}

The goal of this dissertation was to explore the potential of emerging DNA
sequencing technologies to discover, characterize and validate structural
variation. These technologies have brought large improvements to related fields
such as chromosome phasing and \emph{de novo} assembly, and they are
particularly well suited to \sv characterization, too.
Throughout the projects described herein, I presented three
concrete examples of how these techniques advance the detection of difficult
\sv types. I was able to scrutinize \acp{sv} to an extent that had not been
possible beforehand based on standard \mps experiments. Importantly, the
\acp{sv} unraveled by my work led to novel insights into the complexity and
functional role
of \acp{sv}. I further developed new computational approaches to detect and
analyze \acp{sv} based on these techniques, proved their utility, and made
them available to the community.

While each chapter contains its own conclusion, I here summarize the
major findings of my work and put them into the context of recent developments
within the community.


\section{Complex inversions in the human genome}

Inversions are a \sv class of outstanding relevance for human disease \citep{Feuk2010},
yet they are especially difficult to detect, and they also eluded ascertainment
in the 1000 Genomes Project. As I show in \cref{sec:complex_invs}, I was able to
validate hundreds of inversion loci by using targeted long-read sequencing data
from both \pacbio and \ont MinION platforms. This revealed that more than 80\% of the
inversion loci predicted from \mps indeed carried an inversion signature.
Strikingly, this verification had previously not been possible via \pcr
experiments. This fulfilled my first research goal and resolved a principal
challenge of the overall study \citep{Sudmant2015}.

Moreover, I then found that the majority of predicted loci contained not simple
inversions, but complex variants containing inverted sequence. I categorized them into five major
classes, which included inverted duplications as the most frequent event. These
insights were only possible due to the ability of long-read techniques to
span complete loci around predicted inversions. My analyses critically relied on
the visualization tool \maze, which I developed simultaneously and which I made
available to the public (\url{https://github.com/dellytools/maze}).

The unforeseen amount of complex variation resulting from my work and the work
of others was one of the key lessons learned from the 1000 Genomes Project's \sv
study. The function and origin of these complex \sv classes remained uncharted,
though. I thus carefully analyzed the breakpoints of complex \acp{sv} with the
goal of inferring the mechanisms from which they originated. The evidence I found was not
distinctive of any precise mechanism that might have formed these \acp{sv}, but
it suggested that several of the seemingly very different classes might
originate from the same mutagenic process, with slight evidence for replication-based
mechanisms such as \mmbir.

Intrigued by the unforeseen amount of complex variation revealed in the 1000 Genomes
Project, others continued to study this \sv class in human genomes \citep{Chaisson2014,Collins2017}.
Using the emerging 10X Genomics technology and mate pair sequencing, \citet{Collins2017} even extended
the five classes that I reported to a total of 16 different complex \sv classes
(which they call cxSV), more than 80\% of which contained inverted sequence.
This further emphasizes that this phenomenon was previously underappreciated,
as I predicted. They also note that these complex events might have been created
by a replicative mechanism such as \mmbir.

My work and the subsequent finding of \citet{Collins2017} underline the
prevalence of complex inverted rearrangements---leading to the notion of the ``morbid''
human genome. Whereas my work revealed complex \acp{sv} in healthy individuals,
\citet{Collins2017} found them in patients with autism spectrum disorder. The
functional role of these \sv classes is not yet understood, but our results
suggest that inverted and complex variation can and should be detected,
especially in the context of genetic studies around human disease.


\subsubsection{Long-read sequencing on the rise}

In the meantime (especially since 2014, when I started this project), an
increasing number of studies by others have utilized long-read
sequencing technologies for \sv detection and related tasks.
Notably the \pacbio technology, which became commercially
available in 2011, has gained many users. Initially, \pacbio was used to
perform targeted validation experiments such as those I presented here, and computational tools
have since been proposed to facilitate this approach \citep{Wang2015,Rudewicz2016}.
The method by \citet{Wang2015} is even designed specifically for \sv
characterization and uses a breakpoint visualization approach very similar to
the one I developed for \maze.

However, the applications of \pacbio have long gone beyond this level. Due to
increases in throughput, \wgs has become possible in a more cost-effective
fashion. For example, prior to 2014, \citet{Chaisson2014} still needed 243 SMRT
flow cells to achieve a coverage of 40~x in a human genome, whereas the most
recent developments (i.e. the \pacbio Sequel system) promise a 10~x coverage
from only 4 SMRT cells\footnote{\label{footnote:pacbioblog}%
\url{https://www.pacb.com/blog/new-software-polymerase-sequel-system-boost-throughput-affordability/}}.
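
As a rough back-of-the-envelope comparison of these two figures (assuming, for
illustration only, that coverage scales linearly with the number of SMRT cells
and that the promised Sequel yield is actually reached), the coverage obtained
per SMRT cell is
\[
  \frac{40\,\mathrm{x}}{243} \approx 0.16\,\mathrm{x}
  \qquad\mbox{versus}\qquad
  \frac{10\,\mathrm{x}}{4} = 2.5\,\mathrm{x},
\]
which corresponds to a roughly 15-fold increase in per-cell yield.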

The capabilities of the \pacbio technology have especially caused a stir in the
plant genomics community, which had been affected by the limitations of
short-read \mps to a special degree \citep{Bickhart2014}. Notably, the hope is
to perform \textit{de novo} assembly of highly repetitive, or even polyploid
genomes \citep{Li2017}. An accurate assembly would make the discovery of \acp{sv}
trivial---it could simply be done by sequence comparison.
However, the problem of \textit{de novo} assembly from
\pacbio data alone is not yet considered to be solved, despite a number of
available software tools \citep{Chin2013,Chin2016,Koren2017,Koren2018} and the
attention of renowned scientists\footnote{E.g. the efforts of Gene Myers, see
\url{https://dazzlerblog.wordpress.com/}}.

Nevertheless, \pacbio \wgs data was specifically used to study \acp{sv} in the
human genome, notably by \citet{Chaisson2014}, who utilized a local assembly approach to detect
insertions and deletions in a haploid CHM1 cell line. Remarkably, they found
tens of thousands of \acp{sv} that had not been detected beforehand. They further
observed an insertional bias (more insertions than deletions) of short tandem
repeats, \textit{Alu} elements and complex variation. In addition, they could close (or
reduce in size) dozens of gaps in the reference genome
(GRCh37). Obviously, this had not been possible based on standard \mps
approaches. These results highlight shortcomings of the human reference assembly,
which does not represent the human genome in its entirety. Even more so,
they highlight the capabilities of long-read sequencing for \sv detection.

Since then, \sv detection based on \pacbio data has been further improved and
new software tools have been developed using read mapping or assembly approaches
\citep{Pendleton2015,Huddleston2017}. In a recent study that relied to a large
part on \pacbio technology, we combined it with other techniques for an
unprecedentedly deep characterization of \acp{sv} in the human genome
\citep{Chaisson2017}.

\ont technology has seen many improvements, too, and is slowly gaining
popularity. Recently, the \textit{de novo} assembly of the genomes of yeast,
\textit{Caenorhabditis elegans}, and \textit{Drosophila melanogaster} was
demonstrated using \ont sequencing \citep{Istace2017,Tyson2017,Solares2018}.
What is especially interesting to note, though, is the pace of these technological
improvements. In the latest study by \ont, \citet{Jain2018} used a novel protocol
capable of generating sequencing reads with an N50 value of more than 100~kb
(and a maximum of 880~kb). This is a length so far unachieved by \pacbio, which
typically yields a maximum read length below 100~kb\footnoteref{footnote:pacbioblog}.

Together, these technological improvements in long-read sequencing will facilitate
studies on \acp{sv} that have been overlooked in the past---they might even, at some point in the future,
make whole-homologue \emph{de novo} assembly possible, which would directly reveal the full spectrum
of \acp{sv} within an individual's genome.


\section{Effects of SVs on gene expression and chromatin organization}

In \cref{sec:balancer}, I set out to study the functional consequences of
\acp{sv} with respect to gene expression and chromatin conformation. My first goal
within this collaborative project was to characterize the variants present in
highly rearranged balancer chromosomes. I achieved this by utilizing deep \wgs
and \hic data. Among many other aspects, I discovered the exact breakpoints of
large rearrangements of the balancer chromosomes. In the meantime, others had
mapped these breakpoints, too, and reassuringly, our results perfectly matched
their findings \citep{Miller2016,Miller2018}.

However, owing to the technological
advantage of \hic data, I could additionally detect the breakpoints that had been
missed by these studies, precisely in two cases and approximately in one case.
In addition, I utilized haplotype-resolved \hic maps to validate
large rearrangements, including an inversion and a duplication of 258~kb. The
large duplication most likely inserted in reverse orientation next to the
original copy, which I concluded from the differential contact frequencies
around the affected locus. Together, these findings clearly show the benefits
of \hic for the characterization of large \acp{sv}.

Afterwards, I implemented a
test for \acl{ase} that utilizes multiple biological replicates and that
corrects for effects of maternally deposited RNA. I found that changes in
expression occur almost everywhere across the genome and that they appear not to
be caused by enhancer hijacking, as had been observed in previous studies (\cref{sec:balancer_background}).
Instead, \acp{sv} alter expression via alternative mechanisms such as dosage
effects or chimeric expression of transcripts through mobile elements (summarized in \cref{sec:balancer_concl}). Our
findings appear contrary to what has been seen in other scenarios; however, I
argued that this might be a result of natural selection in both the other
studies and in ours. In conclusion, balancer chromosomes show a remarkable
robustness towards the huge rearrangements and other variation that they carry,
and the potential effects of enhancer hijacking mechanisms appear to be buffered.
I speculated that this buffering might be caused by other forms of variation,
such as \acp{snv}, or possibly via changes of the epigenome.

I expect that these results will complement
previous studies and lead to a more holistic view of the role of chromatin
architecture. The manuscript was in preparation at the time of writing this
thesis.


\subsubsection{SV characterization via \hic}
I demonstrated in \cref{sec:balancer} how \hic data can be utilized for \sv
characterization. Naturally---and considering the popularity of \hic and the
amount of publicly available data---this observation was made by others, too.

The prospects of \hic for purposes other than studying chromatin conformation were
noted early on in the field of \textit{de novo} assembly:
\Citet{Kaplan2013}, for instance, predicted that \hic could facilitate assembly
and assigned unplaced contigs to the human genome; \Citet{Burton2013} created
scaffolds of human, mouse, and \textit{Drosophila} genomes based on \hic and
\mps data; \Citet{Selvaraj2013} successfully extended the idea to haplotyping;
and recently, the mosquito \textit{Aedes aegypti}, vector of the Zika virus, was
assembled using \hic data \citep{Dudchenko2017}.
%Interstingly, the biological
%folding of chromatin is not relevant---maybe even impairing---for the pure
%purpose of assembly or \sv detection. \citet{Putnam2016}
%hence developed a protocol that reconstitutes chromatin \textit{in vitro} prior to \hic
%library preparation.

The core idea of \hic-based \sv detection is the identification of characteristic
alterations in contact frequencies. The presumably first \acp{sv} detected using
\hic were translocations in cancer cell lines, which were detected during the
search for \textit{trans} interactions between chromosomes \citep{Rickman2012}.
This idea was then extended to the detection of arbitrary rearrangements.
For example, large rearrangements in scrambled synthetic yeast genomes were
recently studied based on \hic \citep{Mercy2017}. Moreover, \citet{Putnam2016}
explored the potential of \hic to identify inversions, and further translocations
were identified in cancer cell lines \citep{Barutcu2015,Ay2015,Harewood2017}.
Eventually, \hic-based \sv detection was further extended to \acp{cnv}
\citep{Harewood2017,Li2018}.

In summary, the idea of applying \hic to \sv characterization was already
well known. Moreover, recent efforts within the community
have advanced the state of the art far beyond the application I presented
here. Nevertheless, \hic-based \sv detection was a highly important step within
our study, and it allowed us to gain novel insights into the relationship between
chromatin conformation and gene regulation.


\section{Structural variation detection in single cells}

Finally, in \cref{sec:mosaicatcher} I present a novel method for \sv detection on
the single-cell level, which is currently under active development. This method,
termed \mc, allows, for the very first time, the detection of
multiple different \sv classes based on single-cell Strand-seq data. In a first
step, my collaborators and I devised a detailed scheme for how each \sv type
can be detected, genotyped and phased within a set of single-cell data. This
conceptual work is largely based on previous experience with Strand-seq data,
but I presented examples of five \sv classes in a new, as yet
unpublished data set of \acl{rpe} cells. In the next step, I conceived and
implemented a framework to simulate Strand-seq data. This framework models
Strand-seq data with a negative binomial distribution, which, as I
provided evidence for, reflects the properties of real data well. The
framework can then be used to simulate single-cell Strand-seq libraries of
arbitrary sequencing depth that incorporate four
different \sv classes at any desired size and subclonal fraction. Simulations
within this framework enable us to explore the theoretical limitations of
\mc. Finally, I designed and implemented an algorithm for data binning
and segmentation, which covers two of three steps of our conceptual \sv calling
procedure. The segmentation algorithm uses the multivariate strand-specific read
depth to find the boundaries of potential \acp{sv} based on a quadratic error
term, and I showed that it performs well in
simulations. The last goal---implementing and applying this method---has not yet
been reached at the time of writing. Once completed, \mc will greatly
facilitate studies of somatic mosaicism, e.g. in the context of ageing or cancer,
which have so far been severely limited, particularly with respect to \acp{sv}.
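
To make the two computational building blocks mentioned above more concrete,
they can be sketched in generic notation. The following is a textbook-style
illustration; the symbols ($r$, $p$, $x_i$, $C$, $D_k$) are chosen here for
illustration only and do not necessarily match the exact parametrization
implemented in \mc. A negative binomial count $X$ with parameters $r > 0$ and
$0 < p < 1$ (in the parametrization that counts failures before the $r$-th
success, generalized to real $r$) has
\[
  \mathrm{E}[X] = \frac{r(1-p)}{p},
  \qquad
  \mathrm{Var}[X] = \frac{r(1-p)}{p^{2}} = \frac{\mathrm{E}[X]}{p} > \mathrm{E}[X],
\]
so that, unlike a Poisson model, it can accommodate read counts whose variance
exceeds their mean. For the segmentation, let $x_i$ denote the vector of
strand-specific read depths in bin $i$, for $i = 1, \dots, n$. A quadratic
error term for a candidate segment spanning bins $a$ to $b$ is
\[
  C(a,b) = \sum_{i=a}^{b} \left\| x_i - \bar{x}_{a:b} \right\|^{2},
  \qquad
  \bar{x}_{a:b} = \frac{1}{b-a+1} \sum_{i=a}^{b} x_i .
\]
One standard way to minimize such a term (assumed here for illustration) is a
dynamic-programming recurrence over the minimal total error $D_k(b)$ of dividing
bins $1, \dots, b$ into $k$ segments,
\[
  D_1(b) = C(1,b),
  \qquad
  D_k(b) = \min_{k \le a \le b} \left[ D_{k-1}(a-1) + C(a,b) \right],
\]
from which the segment boundaries are recovered by backtracking through the
minimizing values of $a$.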


\section{Concluding remarks}
Copy-number neutral and complex forms of structural variation have often been
neglected in genetic studies compared to \acp{cnv}. Consequently, less is known
about their prevalence and their role in health and disease. This is due to
technical limitations in their detection based on commonly used techniques
such as \mps.

Here, I presented three concrete examples of how emerging technologies improve
the detection and characterization of \acp{sv}. Utilizing these technologies
allowed me to detect an unforeseen amount of complex inversions in the human
genome, and to shed new light on the functional impact of \acp{sv}.
I further developed a computational tool for the characterization of complex
rearrangements from long-read data, including the fine-mapping of their
breakpoints, and a novel approach for \sv detection within single cells.

Together with efforts by others in the community, these new approaches will
enhance our abilities to discern such \acp{sv} both in the germ line and
in the context of somatic mosaicism. This will eventually contribute to a deeper
understanding of genetic variants and their potential functional roles.