Added sumamry.
Peter Us committed Jun 12, 2016
1 parent 735b1ea commit 28c1c09
Showing 3 changed files with 9 additions and 6 deletions.
3 changes: 1 addition & 2 deletions report/data.tex
@@ -6,12 +6,11 @@ \subsection{Description of data}

Our data consisted of English text documents from three domains: news
articles about sports (\href{http://mlg.ucd.ie/datasets/bbc.html}{source}),
abstracts from scientific papers
(\href{http://eprints.fri.uni-lj.si/cgi/latest_tool?mode=articles}{source}),
and movie reviews. For each domain we had 20 documents, which we later split
into training and test sets with 10 documents per set per domain. Each document
contained at least 100 words, with sports articles averaging exactly 150
words per document, scientific abstracts $172.55$ words per document, and movie
reviews $381.6$ words per document.

\subsection{Preparation of data}
\label{sub:preparation_of_data}

4 changes: 2 additions & 2 deletions report/results.tex
@@ -119,7 +119,7 @@ \section{Results}
\end{figure}


To get concrete results on how much persistence diagrams differ between texts from different domains, we calculated bottleneck distances between all pairs of diagrams. The bottleneck distance between two diagrams is the cost of the optimal matching between the points of the two diagrams, where the cost of a matching is the largest distance between any two matched points. From these distances a pairwise distance matrix was constructed, on top of which hierarchical clustering was performed. As already mentioned, we performed this on six groups of documents, obtained by splitting the documents of each of the three domains into two groups, so that every group has a partner group from the same domain. The main idea is that if persistence diagrams separate the documents from different domains well, the two groups from the same domain would be merged ``sooner'' in the hierarchical clustering than groups from different domains. The results can be seen in figure~\ref{fig:h_1}. We first notice that a group of abstracts and a group of sports texts get connected sooner than either of them gets connected with its own pair. This means that the bottleneck distance between one group of sports texts and one group of abstracts is the smallest distance among all six groups, and that the persistence diagrams of the two groups of sports articles differ more than diagrams of different domains (at least in terms of bottleneck distance). Although the pair of review groups does get connected, the distance between its two groups is greater than all the distances among the other four groups.
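
As a sketch in notation introduced only for this illustration (the symbols $D_1$, $D_2$, $\gamma$ and $\Delta$ do not appear elsewhere in the report), and assuming the standard definition, the bottleneck distance described above is usually written as
\[
  d_B(D_1, D_2) = \inf_{\gamma} \; \sup_{x \in D_1 \cup \Delta} \| x - \gamma(x) \|_{\infty},
\]
where $D_1$ and $D_2$ are the two persistence diagrams, $\Delta$ denotes the diagonal, and $\gamma$ ranges over all bijections between $D_1 \cup \Delta$ and $D_2 \cup \Delta$. Allowing points to be matched to the diagonal makes it possible to compare diagrams with different numbers of points.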


\begin{figure}[H]
@@ -129,7 +129,7 @@ \section{Results}
\label{fig:h_1}
\end{figure}

The results are not promising. One would expect that diagrams of groups of documents from the same domain would differ significantly less than diagrams of different domains, and that intra-domain bottleneck distances would therefore be much smaller. As an alternative to the bottleneck distance, we also tested the Wasserstein distance as the metric between diagrams, with no improvement in the results.
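
For reference, and assuming the standard definition (the notation is introduced only for this illustration, and the exponent $p$ is left as a free parameter here), the $p$-Wasserstein distance between two diagrams $D_1$ and $D_2$ replaces the largest matched distance by a sum over all matched pairs:
\[
  W_p(D_1, D_2) = \Bigl( \inf_{\gamma} \sum_{x \in D_1 \cup \Delta} \| x - \gamma(x) \|_{\infty}^{p} \Bigr)^{1/p},
\]
where $\Delta$ denotes the diagonal and $\gamma$ ranges over all bijections between $D_1 \cup \Delta$ and $D_2 \cup \Delta$. The bottleneck distance is recovered in the limit $p \to \infty$, so the Wasserstein distance is more sensitive to the many short-lived points of a diagram, while the bottleneck distance depends only on the worst matched pair.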


We also tested the hierarchical clustering method on a ``toy'' dataset where one sample of points came from a circle and the other one from a straight line. The clustering had no trouble distinguishing between the domains, and the expected results can be seen in figure~\ref{fig:h_2}. The inner-group bottleneck distances in both groups are much smaller than the distance between groups from different domains, which confirms the intuition behind our test and shows the results that would be expected if the persistence diagrams differed enough.
8 changes: 6 additions & 2 deletions report/summary.tex
@@ -1,4 +1,8 @@
\section{Summary}
\label{sec:summary}

Persistence diagrams built on top of documents from different domains did not differ enough to correctly classify documents based on bottleneck distances alone. However, it should be noted that the bottleneck distance between diagrams is only one possible way of using persistence diagrams for a predictive task. Various other numerical features can be extracted from the diagrams (e.g.\ the number of homology generators of a specific dimension, the average lifetime of the generators of a specific dimension, etc.), so further exploration in this direction could provide promising results. Also, the number of samples used in this project was relatively small (60), so testing the methods on a bigger corpus could provide different results as well.
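
To make the two example features concrete, suppose (in notation introduced only for this illustration) that the diagram in dimension $k$ consists of the points $(b_i, d_i)$, $i = 1, \dots, n_k$, with finite birth values $b_i$ and death values $d_i$. The number of generators in dimension $k$ is then simply $n_k$, and the average lifetime could be computed as
\[
  \bar{\ell}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} (d_i - b_i).
\]
Generators that never die would need a separate convention, for example truncating $d_i$ at the largest filtration value considered.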

We also showed that this method works well on a toy example where the homology of the points in each group differs significantly. Therefore we believe that there are other possible applications (outside of the text classification domain) that could benefit from the analysis of persistence diagrams.

We see the main benefit of using persistence diagrams for a predictive task in the fact that the method relies on the inner-class structure of the data instead of on finding a linearly separable representation of the data, as is the case in many other clustering or classification models. Therefore we see it as an interesting tool for data analysis in cases where simple linear models would fail. The other important benefit of this method is that it can provide additional numerical features that can be used to further improve an existing model.
