Added results, workload.
Peter Us committed Jun 12, 2016
1 parent b05c7d3 commit 735b1ea
Showing 5 changed files with 29 additions and 30 deletions.
Binary file removed report/img/histogram-main.png
Binary file added report/img/histogram_main.png
Binary file added report/img/histogram_toy.png
51 changes: 23 additions & 28 deletions report/results.tex
@@ -1,34 +1,6 @@
\section{Results}
\label{sec:results}

\todo{todo results}

\begin{figure}[H]
\centering
\begin{subfigure}{0.5\textwidth}
\includegraphics[width=\textwidth]{{img/sample}}
\caption{Subfigure 2(A)}
\label{fig:2a}
\end{subfigure}~
\begin{subfigure}{0.5\textwidth}
\includegraphics[width=\textwidth]{{img/sample}}
\caption{Subfigure 2(B)}
\label{fig:2b}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}
\includegraphics[width=\textwidth]{{img/sample}}
\caption{Subfigure 2(C)}
\label{fig:2c}
\end{subfigure}~
\begin{subfigure}{0.5\textwidth}
\includegraphics[width=\textwidth]{{img/sample}}
\caption{Subfigure 2(D)}
\label{fig:2d}
\end{subfigure}
\caption{Four Subfigures}
\label{fig:fig_2}
\end{figure}

To visualize the persistence of homology groups we used barcode diagrams and
persistence diagrams. The barcode diagram for a certain homology group $H$ is
a two-dimensional plot that shows us the life spans of all the homology
@@ -145,3 +117,26 @@ \section{Results}
\caption{Persistence diagrams for reviews\_test}
\label{fig:s_4}
\end{figure}


To quantify how much persistence diagrams differ between texts from different domains, we calculated bottleneck distances between all pairs of diagrams. The bottleneck distance between two diagrams is the cost of the optimal matching between the points of the two diagrams. From the calculated distances a pairwise distance matrix was constructed, on which hierarchical clustering was performed. To test whether persistence diagrams can separate documents from different domains, we first split the documents from each of the three domains into two groups. This way we obtained six groups of documents, with each domain contributing two of them. The main idea is that if persistence diagrams separate documents from different domains well, the two groups of documents from the same domain would be merged ``sooner'' in the hierarchical clustering than groups of documents from different domains. The results can be seen in figure~\ref{fig:h_1}. We first notice that a group of abstracts and a group of sports texts are merged before either of them is merged with the other group from its own domain. This means that the bottleneck distance between one group of sports texts and one group of abstracts is the smallest distance among all six groups. Consequently, the persistence diagrams of the two groups of sports articles differ more from each other than diagrams from different domains do (at least as measured by the bottleneck distance).
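
For reference, the ``cost of the optimal matching'' mentioned above is usually formalized as follows. The notation ($X$, $Y$ for the two diagrams, $\eta$ for a matching) is introduced here only for illustration, and we assume the common convention that every diagram also contains all points of the diagonal with infinite multiplicity, so that unmatched points can be paired with the diagonal:
\[
  W_\infty(X, Y) \;=\; \inf_{\eta \colon X \to Y} \; \sup_{x \in X} \, \lVert x - \eta(x) \rVert_\infty ,
\]
where $\eta$ ranges over all bijections between $X$ and $Y$. In words, the distance is the largest displacement of any single point under the best possible matching, so it is small only if every feature of one diagram has a counterpart with similar birth and death values in the other.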


\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{img/histogram_main.png}}
\caption{Hierarchical clustering results on the text dataset.}
\label{fig:h_1}
\end{figure}
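
The outcome we were hoping for can be stated roughly in symbols. The labels below are hypothetical and serve only as an illustration: write $G^{(k)}_1, G^{(k)}_2$ for the two groups of domain $k \in \{1, 2, 3\}$ and $d$ for the distance between groups that the clustering works with (how distances between individual diagrams are aggregated into $d$ is left unspecified here). Good separation would then correspond to
\[
  d\bigl(G^{(k)}_1, G^{(k)}_2\bigr) \;<\; d\bigl(G^{(k)}_i, G^{(l)}_j\bigr)
  \qquad \text{for all } k \neq l \text{ and } i, j \in \{1, 2\},
\]
whereas in figure~\ref{fig:h_1} the smallest of all pairwise distances is attained by a pair consisting of one sports group and one abstracts group, that is, by a pair with $k \neq l$.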

The results are not promising. One would expect that diagrams from groups of documents from the same domain would differ significantly less than diagrams from different domains, and that inner-domain bottleneck distances would therefore be much smaller. In addition to the bottleneck distance, we also tested the Wasserstein distance as the metric between diagrams, with no improvement in the results.
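
For reference, the degree-$q$ Wasserstein distance between two persistence diagrams is usually defined as below. The notation ($X$, $Y$, $\eta$, $q$) is introduced here only for illustration, both diagrams are assumed to contain the points of the diagonal with infinite multiplicity, and the formula leaves the degree $q \geq 1$ as a free parameter:
\[
  W_q(X, Y) \;=\; \Bigl[ \, \inf_{\eta \colon X \to Y} \; \sum_{x \in X} \lVert x - \eta(x) \rVert_\infty^{\,q} \Bigr]^{1/q} ,
\]
where $\eta$ ranges over all bijections between $X$ and $Y$. Unlike the bottleneck distance, which only records the largest displacement in the best matching and is recovered in the limit $q \to \infty$, this distance accumulates the displacements of all matched points, so it is also sensitive to many small differences between diagrams.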


We also tested the hierarchical clustering method on a ``toy'' dataset in which one sample of points came from a circle and the other from a straight line. The clustering had no trouble distinguishing between the two domains, and the expected results can be seen in figure~\ref{fig:h_2}. The inner-group bottleneck distances in both groups are much smaller than the distance between groups from different domains, which confirms the intuition behind our test and shows that the expected results would be obtained if the persistence diagrams differed enough.

\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{{img/histogram_toy.png}}
\caption{Hierarchical clustering results on a toy ``circles and lines'' dataset.}
\label{fig:h_2}
\end{figure}
8 changes: 6 additions & 2 deletions report/workload.tex
@@ -1,4 +1,4 @@
\section{Workload}
\label{sec:workload}

The work on this project was even more collaborative than was the case in the
@@ -14,7 +14,11 @@ \section{Workload}
outputs from all three complex building methods and constructed persistence
diagrams from them, and wrote the glue that ran all the steps required for
our analysis of the data;
\item P collected the abstracts dataset, implemented the text preprocessors which
removed stop words and applied stemming and lemmatization, implemented various
methods for extracting text features, wrote code for hierarchical clustering on top
of the bottleneck distances of persistence diagrams, performed clustering tests on
the main dataset and implemented clustering on the toy dataset;
\item R collected the sports dataset, implemented the tf-idf method for
extracting additional features, designed the drawing of barcode plots and
persistence diagrams, performed benchmark clustering methods on the