\chapter{Related Work and Scope}
In this chapter, we first summarize relevant research presented in three categories:
topic modeling, topic model visualizations, and mental health topic modeling. We
then provide the scope and limitations of this thesis project.
\section{Topic Modeling}
Probabilistic topic models \cite{blei-topicmodel} are algorithms that aim to extract the main themes
from a large collection of documents. These algorithms use statistics to analyze the
words in each document's text and organize them into topics. Topic modeling can be
used to aid summarization and information retrieval for various types of data without
the need for humans to manually annotate a large amount of text.

The simplest topic model is \textit{latent Dirichlet allocation} (LDA) \cite{blei-topicmodel}. LDA uses a statistical process to discover the topics in a corpus of documents. A \textit{topic} is formally
defined as a distribution over a fixed vocabulary. For example, a \textit{genetics} topic
should have the words \textit{genetics} and \textit{genes} with high probability. LDA consists of
reverse-engineering an imaginary generative process. For each document, this process begins
by drawing a random distribution over topics. Each word in the document is then generated by
randomly choosing a topic from the distribution over topics and randomly choosing
a word from that topic's distribution over words. We refer to the topics, the
per-document topic distributions, and the per-document per-word topic assignments as
the topic structure. This generative process must be reverse-engineered because the
words in the documents are observed, while the hidden topic structure that most
likely generated the words must be inferred.
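As a compact sketch of the generative process just described, we can write it in symbols. The notation here is our own assumption, chosen to follow common usage for LDA: $\beta_k$ is the word distribution of topic $k$, $\theta_d$ is the topic distribution of document $d$, $z_{d,n}$ is the topic assignment of the $n$th word of document $d$, $w_{d,n}$ is the observed word itself, and $\eta$ and $\alpha$ are Dirichlet hyperparameters. For $K$ topics and $D$ documents, where document $d$ contains $N_d$ words,
\[
\begin{array}{rcll}
\beta_k & \sim & \mathrm{Dirichlet}(\eta), & k = 1, \ldots, K, \\
\theta_d & \sim & \mathrm{Dirichlet}(\alpha), & d = 1, \ldots, D, \\
z_{d,n} \mid \theta_d & \sim & \mathrm{Multinomial}(\theta_d), & n = 1, \ldots, N_d, \\
w_{d,n} \mid z_{d,n}, \beta & \sim & \mathrm{Multinomial}(\beta_{z_{d,n}}). &
\end{array}
\]
Reverse-engineering the process then corresponds to computing the posterior distribution of the hidden topic structure given the observed words,
\[
p(\beta, \theta, z \mid w) = \frac{p(\beta, \theta, z, w)}{p(w)},
\]
which is approximated in practice because the marginal probability $p(w)$ is intractable to compute exactly.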
We will not go further into the specifics of topic modeling in terms of probability
and statistics because this thesis concentrates on visualization. The purpose of this
overview is to familiarize the reader with the concept of topic modeling, focusing
on how it is used to extract a set of topics from a document corpus and annotate
documents with themes based on the document words.
\section{Visualizing Topic Models}
Topic model visualizations vary in design due to different goals and audiences. Many
projects focus on visualizing relationships between documents instead of summarizing
each document. Some were created for non-technical users to improve understanding,
while others were made for technical users to evaluate a certain model. A few systems
also aim to show topic changes over time.
\subsection{Document Relationships}
Numerous research projects revolve around visualizing documents to show similarities
based on their latent topics. \textit{Probabilistic Latent Semantic Visualization} (PLSV) \cite{plsv}
is a topic model approach to visualizing documents and topics as coordinate points in
a visualization space. The distances between documents and topics are based on the
topic distribution of a document. \textit{Topic maps} \cite{topic-maps} and \textit{Exemplar-based Visualization}
(EV) \cite{ev} provide similar graphs of a large collection of documents, with document
points color-coded by their dominant topic. The Stanford Dissertation Browser \cite{interpretation-trust} is
also a notable visualization developed to evaluate word and topic similarities between
the Ph.D. theses of different departments over time. The general purpose of these
visualizations is to show documents with similar topics in clustered areas for a global
overview of the corpus.
\subsection{Thesis-Relevant Projects}
Now we turn to a few systems that are more relevant to our research in terms of
their goals, end-users, or visual design. We are focused on summarizing individual
documents using topics, revealing topic trends of a document over time, and indexing
topics within document text using simple visualizations for non-technical users. Our
developed visualizations were inspired by different aspects of these projects.

The Wikipedia navigator \cite{wikipedia} was specifically designed to summarize the corpus
and show relationships between textual content and topics for non-technical users.
Three straightforward visualizations were produced: an overview page that lists the
set of topics associated with all documents, a topic page that displays associated
words as well as related document and topic links, and a document page showing the
content in addition to related document links and a pie chart of related topics. These
visuals allow the user to be completely unaware of the underlying LDA topic models.

The interactive visual text analysis tool TIARA \cite{tiara} summarizes a corpus over
time using a stream graph with topic layers and distributed keywords. ThemeRiver
\cite{theme-river} provides the same type of graph without keywords. The heights of the topic layer
areas illustrate the strength of each topic at a certain point in time. Although I
personally find stream graphs difficult to comprehend, these visualizations show that
area or line charts can be useful for expressing topic trends over time.

Finally, Termite \cite{termite} is a visual analysis tool for evaluating the quality of topic
models. The main visualization of this tool is a term-topic matrix that can be described
as a scatter plot of words for each topic, with the size of each point proportional
to the word frequency for that topic. Clicking on a topic in this matrix shows its
representative documents and a one-dimensional plot of where topical terms can be
found within each document. These simple designs seem effective for visually indexing
topics in each document.
\section{Mental Health Topic Modeling}
Very little research has been done related to the application of topic models to the
mental health domain. The Software Agents Group at the MIT Media Lab first began
branching into this area with their previous story-matching research and now our
Crisis Text Line (CTL) project. The topic models for both projects, developed by Karthik
Dinakar, use similar approaches. We will first describe the previous project and then
outline the topic model differences used for our CTL system.
\subsection{Story Matching Project}
The previous research revolved around an ethics website where teenagers share stories
about their mental health issues \cite{mtv-atl}. Researchers aimed to mitigate the effects of
cyberbullying by presenting teens with stories similar to their own. The approach
uses LDA to discover themes within the stories \cite{dinakar-mtv}. First, LDA extracts topics, in the
form of word clusters, and a distribution over the topics for each document. Each
word cluster is then analyzed by a human and interpreted as a theme if possible.
This process iterates with an increasing number of desired topics until a satisfactory
collection of themes has been extracted. Each document has a distribution over the
themes. Using the output of this process, \textit{Reflective Interfaces} \cite{reflective-interfaces} displays stories
with common themes in order to help the teenagers relate to each other.
\subsection{Thesis Topic Model}
The topic model algorithm used for the visualizations in this thesis is very similar to
the story-matching approach with a few main differences. The documents are
conversations between a client and a counselor, so only the words in the client messages
are analyzed. After the algorithm is applied to a large set of sample conversations,
the extracted topics and word distributions are used to analyze each client message
in a conversation. Having the themes at the message level allows us to: 1) provide
indexing information regarding where topics occur within a conversation and 2)
dynamically apply the topic model to new messages. The topic model summary is
produced by normalizing the topic distributions for each client message. This topic
model is referred to as Labeled mixed-initiative Latent Dirichlet Allocation (L-LDA).
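As an illustration only, one plausible reading of the normalization mentioned above can be written out. This reading is an assumption, and the symbols $M$, $K$, $\theta_{m,k}$, and $s_k$ are introduced here purely for the sketch. Suppose a conversation contains $M$ client messages and the model assigns message $m$ a nonnegative weight $\theta_{m,k}$ for each of $K$ topics. The summary weight of topic $k$ would then be
\[
s_k = \frac{\sum_{m=1}^{M} \theta_{m,k}}{\sum_{j=1}^{K} \sum_{m=1}^{M} \theta_{m,j}},
\]
so that the summary weights sum to one. For example, with $M = 2$ messages, $K = 2$ topics, and per-message weights $(0.6, 0.4)$ and $(0.2, 0.8)$, the column sums are $0.8$ and $1.2$, giving the summary $(s_1, s_2) = (0.4, 0.6)$.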
\section{Scope and Limitations}
We will now give an assessment of the scope and limitations of this thesis project.
First, the goal of this research is to provide a prototype for a Crisis Text Line website
that makes use of topic model visualizations. It is designed on a development server
and is not deployment-ready. The CTL developers may use the system design and
implementation as guidelines or inspiration for future work. We do not have the time
or resources to fully test and deploy this system to real users due to thesis deadlines
and lack of additional developers.

We are also focused on crisis hotlines that use texting because we are mainly
limited to conducting contextual inquiries and tests with the Crisis Text Line
organization and the Boston Samaritans, which is a local hotline that uses texting. Some
of the problems we are trying to solve, such as context switching and cognitive recall,
are also unique to texting due to the longer and more frequent gaps in conversation.

In this thesis, the evaluation of the visualizations is emphasized rather than topic
model accuracy. These are two different aspects of the group project, so we are
concentrating our efforts on visualization effectiveness. We realize that the topic
model may be improved with counselor feedback, such as having counselors interpret
themes from the word distributions, merging topics they find to be too similar, or
allowing them to indicate confusing topic assignments.

Finally, we have additional ideas that may improve the quality of counseling but
choose not to implement them at the moment due to constraints on time and external
resources. For example, topic-specific resources may be provided to counselors as they
are having an ongoing conversation. Resources could be specialized hotline numbers
or training documents on how to deal with a specific situation, as determined by
counseling experts. Exploring this avenue would require controlled testing on whether
this addition is distracting or helpful and gathering resources for predetermined topics.