Information Retrieval

About

This is a collection of solutions to university assignments from a course on Information Retrieval (IR). Tasks range from building a simple word dictionary to querying with boolean operators, index compression, and TF-IDF.

Solutions were tested on Shakespeare's works, free books from Project Gutenberg and some additional datasets. They are not polished or guaranteed to be correct.

Description

PW1

Builds a simple dictionary that maps each word to its occurrence count.

To improve performance for large datasets it's multithreaded and uses memory mapping for reading files. All further solutions use this as a foundation.
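
The idea can be sketched in Python as follows. This is only an illustration, not the code from this repository: the names `count_words` and `build_dictionary`, the word pattern, and the worker count are all assumptions. In CPython the threads mostly overlap file I/O rather than CPU work.

```python
import mmap
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumed tokenization: runs of ASCII letters and apostrophes.
WORD = re.compile(rb"[A-Za-z']+")

def count_words(path):
    """Count lowercase word occurrences in one file via a memory map."""
    counts = Counter()
    with open(path, "rb") as f:
        # mmap cannot map an empty file, so handle that case up front.
        if f.seek(0, 2) == 0:
            return counts
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # readline() copies one line at a time out of the mapping.
            for line in iter(mm.readline, b""):
                for word in WORD.findall(line):
                    counts[word.lower().decode("ascii")] += 1
    return counts

def build_dictionary(paths, workers=4):
    """Count words in several files concurrently and merge the results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, paths):
            total.update(partial)
    return total
```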

PW2

Uses an inverted index and an incidence matrix to implement queries with the boolean operators &, |, !, and \ (and, or, not, and subtraction, respectively) and parentheses. The execution time of each query is measured; all later solutions keep this measurement.

Example:

heaven & hell
father & (brother | sister)
heaven & !hell
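
A minimal sketch of how such queries could be evaluated over an inverted index, using Python sets as posting lists. It is not taken from the repository: `build_index` and `evaluate` are hypothetical names, and the precedence (`!` binds tightest, then `&` and `\`, then `|`) is an assumption. Malformed queries are not handled.

```python
import re

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for word in re.findall(r"[a-z']+", text.lower()):
            index.setdefault(word, set()).add(doc_id)
    return index

def evaluate(query, index, all_docs):
    """Evaluate a boolean query with a small recursive-descent parser."""
    tokens = re.findall(r"[a-z']+|[&|!\\()]", query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        result = parse_and()
        while peek() == "|":
            take()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_unary()
        while peek() in ("&", "\\"):
            if take() == "&":
                result = result & parse_unary()
            else:
                result = result - parse_unary()  # subtraction
        return result

    def parse_unary():
        token = take()
        if token == "!":
            return all_docs - parse_unary()  # complement
        if token == "(":
            result = parse_or()
            take()  # consume the closing ")"
            return result
        return index.get(token, set())

    return parse_or()
```

Usage: build the index once with `build_index(docs)`, then call `evaluate("heaven & !hell", index, set(range(len(docs))))`.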

PW3

Uses an inverted positional index. On top of the previous implementation, it supports searching for words within a specific radius of each other and for literal phrases. There's also a simpler implementation of phrase search using a word-pair index.

Example:

# words within radius 3 (inclusive)
hello {3} world
# adjacent words (both directions)
hello {1} world

# words in a specific order
hello > world
# operator can be chained
what > is > love

# phrase literal that desugars to the same thing
"what is love"

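A rough Python sketch of how a positional index can answer both kinds of query. The function names are hypothetical, and it assumes that `>` means "immediately followed by", which is consistent with the phrase literal desugaring to a chain of `>`.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> doc id -> list of word positions in that document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        words = re.findall(r"[a-z']+", text.lower())
        for position, word in enumerate(words):
            index[word][doc_id].append(position)
    return index

def within(index, left, right, radius):
    """Documents where the two words are at most `radius` positions apart,
    in either order (the `{n}` operator)."""
    left_docs = index.get(left, {})
    right_docs = index.get(right, {})
    hits = set()
    for doc_id in left_docs.keys() & right_docs.keys():
        if any(0 < abs(p - q) <= radius
               for p in left_docs[doc_id]
               for q in right_docs[doc_id]):
            hits.add(doc_id)
    return hits

def phrase(index, words):
    """Documents containing `words` consecutively and in order."""
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Every following word must sit at start + its offset.
            if all(start + offset in index.get(word, {}).get(doc_id, ())
                   for offset, word in enumerate(words)):
                hits.add(doc_id)
                break
    return hits
```
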
PW5

Improves indexing speed and merges indices from separate files in parallel. This makes it possible to index datasets larger than the available RAM. It also measures indexing time, index size, and the amount of data processed.
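
One common way to merge partial indices without holding them all in memory is a lazy k-way merge over term-sorted runs. The sketch below shows that technique in Python; it is an assumption about the general approach, not a description of this repository's merge, and `merge_partial_indices` is a hypothetical name. It is sequential and leaves out the parallel part.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_partial_indices(partials):
    """Lazily merge iterables of (term, doc_ids) pairs, each sorted by term.

    Only one entry per input needs to be in memory at a time, so the
    inputs can be generators that read from files on disk.
    """
    merged = heapq.merge(*partials, key=itemgetter(0))
    for term, group in groupby(merged, key=itemgetter(0)):
        # Union the posting lists of equal terms from different inputs.
        doc_ids = sorted({doc_id for _, ids in group for doc_id in ids})
        yield term, doc_ids
```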

PW6

Builds on the previous work and implements index compression. The index file contains the word dictionary packed in a fashion similar to a radix tree, followed by a null-byte separator, and then the term positions in variable-byte encoding.
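
For reference, here is a small Python sketch of textbook variable-byte encoding applied to position gaps. The byte layout (7 payload bits per byte, high bit set on the last byte of each number) is one common convention and an assumption here; the repository's format may differ. All function names are illustrative.

```python
from itertools import accumulate

def to_gaps(positions):
    """Turn sorted positions into differences, which are smaller numbers."""
    return [b - a for a, b in zip([0] + positions[:-1], positions)]

def from_gaps(gaps):
    """Inverse of to_gaps: running sums restore the positions."""
    return list(accumulate(gaps))

def vb_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, most significant first."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()
        chunk[-1] |= 0x80  # high bit marks the last byte of a number
        out.extend(chunk)
    return bytes(out)

def vb_decode(data):
    """Decode bytes produced by vb_encode back into integers."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers
```

With this scheme any value below 128 takes a single byte, which is why storing gaps instead of absolute positions pays off.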

PW7

Implements IR in structured documents by splitting each file into segments such as filename, title, authors, and body, and assigning a different weight to each segment. Plain text and .fb2 files are supported.
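
Weighted segment scoring can be illustrated with a few lines of Python. The weights and the `zone_score` function below are made up for the example and do not reflect the values or scoring formula used in this solution.

```python
# Hypothetical weights; a match in the title counts more than one in the body.
ZONE_WEIGHTS = {"title": 0.5, "authors": 0.3, "body": 0.2}

def zone_score(document, term, weights=ZONE_WEIGHTS):
    """Sum the weights of every segment of `document` that contains `term`.

    `document` is a dict mapping segment name to its text.
    """
    term = term.lower()
    return sum(weight
               for zone, weight in weights.items()
               if term in document.get(zone, "").lower().split())
```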

PW8

Implements search in vector space using cosine similarity and TF-IDF. Uses clustering to return similar documents as the query result.
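
A minimal Python sketch of TF-IDF vectors and cosine similarity, assuming raw term frequency and `log(N / df)` as the IDF. The exact weighting used in this solution may differ, the function names are illustrative, and clustering is left out.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    return [
        {term: count * math.log(n / df[term])
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0
```

A term that occurs in every document gets weight zero, so it does not contribute to the similarity.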
