Developed with the software and tools below.
Repository for the project of Information Retrieval @ University of Trieste held by professor Laura Nenzi, DSSC Course, A.Y. 2023/2024.
Clone the repository:
git clone git@github.com:alessimichele/LSA.git
cd LSA
pip install -r requirements.txt
This project aims to implement different models for Latent Semantic Analysis (LSA) using deep neural networks and SVD techniques and compare their performances. The goal of the assessment is to evaluate the quality of the learned latent space. The following papers are taken as starting points to implement the code:
- Neural Variational Inference for Text Processing
- Unsupervised Neural Generative Semantic Hashing
- Semantic Hashing
Two deep architectures are implemented: see here and here for details.
The datasets used can be distinguished into two different categories.
- datasets inside data folder are small datasets, which comprise:
- corpus of documents/articles (
.ALL
file) - a set of provided queries (
.QRY
file) - for each query, the corresponding relevance judgments (
.REL
file)
- corpus of documents/articles (
- 20newsgroup dataset, a large dataset consisting of 20000 articles divided into 20 different topics. The dataset is already divided into training and test sets but does not contain queries and relevance judgments, hence the assessment is performed with different techniques.
-
data folder contains the following dataset:
- time dataset (423 documents, 83 queries)
- cran dataset (1398 documents, 225 queries)
- med dataset (1033 documents, 29 queries)
-
src folder contains the source code:
- autoencoder.py* implementation from scratch of autoencoder architecture
- variational_autoencoder.py* implementation from scratch of variational autoencoder architecture
- IR class* implementation of a python class from scratch useful for the analysis
- import_dataset.py* code for read and import data stored inside data folder.
- pipelines.py* implementation of two pipelines for the analysis of the two different categories of dataset.
- utils.py* code for the implementation of some useful functions.
- _embeddings.py** code for the implementation from scratch of the inverted index, the tfidf, and word-count matrices.
-
out folder contains
.txt
outputs from the analysis done on data. -
data_analysis notebook for the analysis of the data's datasets.
-
20news_analysis notebook for the analysis of the 20newsgroup dataset.
*
: for further details, see the specific documentation provided for each function.
**
: the code is working but not directly employed for efficiency reasons.
Procedure for data analysis
The following pipeline is followed to perform the analysis:
-
Import the data.
-
Build the embedding matrix of the corpus, with both word count and tfidf embeddings.
-
For each model: SVD, AutoEncoder, VariationalAutoeEncoder:
- build the latent space using the embedding matrix of the corpus (if the model is a NN, train the model)
- project each query into the latent space and compute its similarity with all the documents in the latent space
- get the top k similar documents to each query
- compute the precision and recall using the relevance
A simplified step-by-step version of the procedure is shown in the following figures:
Procedure for 20newsgroup analysis
The following pipeline is followed to perform the analysis:
- Import the data
- Use a CountVectorizer to build the bag of words matrix of the training data
- Build the trainlaoder using the embedded training data
- Train an AutoEncoder model using the trainloader
- Build the latent space using the training data
- For each document in the test set:
- compute its latent representation
- compute the cosine similarity between the latent representation and the training documents
- retrieve the top k documents with the highest similarity and the k nearest neighbors
- compare the labels of the retrieved documents with the label of the test document
- compute the precision for both methods
- Average the precision over all the test documents
A simplified step-by-step version of the procedure is shown in the following figures:
In this section are reported the results obtained from the analysis of the different datasets.
Note: the analysis was performed using datasets easily available. 20newsgroup dataset is more suitable than TIME, CRAN and MED dataset to assess the latent space, since it is divided into topics. On the other hand, the other three datasets are not topic-based, hence the model is less likely to learn a meaningful latent space. Moreover, the provided queries are in the form of free queries, meaning that it is more difficult to use the model to process them. However, even in the former case, the model is able to retrieve relevant documents, as shown here.
20newsgroup dataset
The following table reports the results obtained from the analysis of the 20newsgroup dataset. The model used is the AutoEncoder, trained on a word-count embedding matrix of the training data. It was tested with an increasing number of classes, and with different latent dimensions (50 and 200). Then, cosine similarity and nearest neighbors were used to retrieve the documents and compared following steps 6-7 of this procedure.
# classess | latent dimension | Accuracy cosine similarity | Accuracy nearest neighbours |
---|---|---|---|
2 | 50 | 85% | 83% |
2 | 200 | 94% | 93% |
4 | 50 | 54% | 26% |
4 | 200 | 67% | 59% |
6 | 50 | 47% | 17% |
6 | 200 | 54% | 40% |
8 | 50 | 50% | 45% |
8 | 200 | 52% | 40% |
Datasets in data
The following table reports the results obtained from the analysis of the datasets in data, following this. The models used were SVD, the AutoEncoder, and the VariationalAutoEncoder.
Both word count and tfidf embeddings were tested for each model, and the results are reported in the table below.
The percentage % represents the percentage of queries within the .QRY file that has at least one relevant document in the top 15 retrieved documents. The precision P and recall R are computed on the top 15 retrieved documents.
SVD | AE | VAE | |
---|---|---|---|
TIME - WC | 83%, P: 15%, R: 65% | 71%, P: 13%, R: 49%, |
20%, P: 1.2%, R: 7%, |
TIME - TFIDF | 85%, P: 18%, R: 72% | 58%, P: 8%, R: 32%, |
1%, P: 0.6%, R:2.3%, |
CRAN - WC | 66%, P: 9%, R: 19% | 52%, P: 5%, R: 12%, |
7%, P: 0.05%, R: 1%, |
CRAN - TFIDF | 73%, P: 12%, R: 25% | 38%, P: 4%, R: 7%, |
11%, P: 0.07%, R: 1.3%, |
MED - WC | 93%, P: 50%, R: 35% | 93%, P: 42%, R: 29%, |
17%, P: 1%, R: 1%, |
MED - TFIDF | 96%, P: 68%, R: 48% | 96%, P: 42%, R: 30%, |
38%, P: 3%, R: 2%, |
Many architectures were recently developed to learn a latent representation of a document. In this project, I used two simple architectures: the AutoEncoder and the VariationalAutoEncoder.
The idea behind the AutoEncoder is to learn a deterministic latent representation of the document, by encoding it into a lower dimensional space, and then decoding it back to the original space. The encoding is done using two hidden linear layers, with reLU activation function. Note that the latent space is the output of the encoder. The decoding is done using one linear layer, then a Softmax layer: this is useful to obtain a probability distribution over the vocabulary space, based on the latent representation of the document.
The training is done by minimizing the loss function, which is the cross entropy between the original document and the decoded one, as depicted in the following figure.
The VariationalAutoEncoder is a more complex architecture.
In this context, we don't learn a deterministc latent space, but a probabilistic one.
During the training, we learn an inference model parametrized by the encoder neural network which outputs the parameters of the unknown probability distribution
The generative model (or decoder network) is then used to reconstruct the document from its latent representation. The training is done by minimizing the loss function, which is the sum of the cross entropy between the original document and the decoded one, and the KL divergence between the inference model and the prior distribution.
For further reading, see this paper.