BookQuest is a hybrid search engine for books that combines semantic search with traditional full-text search to deliver accurate and relevant results. It supports harvesting book metadata from various sources using international standards like MARC21 and ONIX 3.

torleifg/bookquest

Hybrid Search Engine for Books

This repository contains code and resources to run a hybrid (semantic and full-text) search engine for books. It utilizes text embeddings and supports harvesting book metadata from various sources, using international standards like MARC21 and ONIX 3.

The application uses Multilingual-E5-small to generate text embeddings and PostgreSQL with pgvector as both the database and the vector store. Together they provide multilingual semantic search.

Technologies

  • Multilingual-E5-small: This pre-trained model is used for generating text embeddings.
  • pgvector: A PostgreSQL extension for storing and querying vectors, used as the vector store in the search engine.
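
To illustrate what a vector store does with embeddings, here is a minimal sketch of nearest-neighbour ranking by cosine distance, which is one of the distance metrics pgvector offers. The three-dimensional vectors and book labels are made up for illustration (real Multilingual-E5-small embeddings have 384 dimensions), and whether BookQuest uses cosine distance rather than another metric is an assumption.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec, book_vecs, limit=2):
    """Return the `limit` keys of book_vecs closest to query_vec."""
    ranked = sorted(
        book_vecs,
        key=lambda title: cosine_distance(query_vec, book_vecs[title]),
    )
    return ranked[:limit]


# Hypothetical toy embeddings, not output from the real model.
books = {
    "crime novel": [0.9, 0.1, 0.0],
    "cookbook": [0.0, 0.2, 0.9],
    "detective story": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

closest = nearest(query, books)
print(closest)
```

In the application itself this ranking would be done inside PostgreSQL by pgvector rather than in application code.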

Getting Started

Follow either "Run as Docker Compose" or "Run as Spring Boot" below to configure and run the application.

Run as Docker Compose

Configure one or more gateways for harvesting metadata by editing compose-app.yml.

Available options:

  • oai-pmh
  • bibbi
  • bokbasen

Example:

SCHEDULER_ENABLED: true
SCHEDULER_INITIAL_DELAY: 5
SCHEDULER_FIXED_DELAY: 3600
HARVESTING_OAI_PMH_GATEWAYS_0_ENABLED: true
HARVESTING_OAI_PMH_GATEWAYS_0_SERVICE_URI: https://oai.aja.bs.no/mlnb
HARVESTING_OAI_PMH_GATEWAYS_0_TTL: 5
HARVESTING_OAI_PMH_GATEWAYS_0_MAPPER: default
HARVESTING_OAI_PMH_GATEWAYS_0_VERB: ListRecords
HARVESTING_OAI_PMH_GATEWAYS_0_METADATA_PREFIX: marc21

Run the following command in the project directory:

docker compose -f compose-app.yml up 

Note: The first run may take some time as it will download the necessary embedding models. Once the models are in place, the application will be ready for use.

Run as Spring Boot

Configure one or more gateways for harvesting metadata by editing application/src/main/resources/application.yaml.

Available options:

  • oai-pmh
  • bibbi
  • bokbasen

Example:

scheduler:
  enabled: true
  initial-delay: 5
  fixed-delay: 3600

harvesting:
  oai-pmh:
    gateways:
      - enabled: true
        service-uri: https://oai.aja.bs.no/mlnb
        ttl: 5
        mapper: default
        verb: ListRecords
        metadata-prefix: marc21
        set:

First, start the services defined in the default Compose file by running the following command in the project directory:

docker compose up

Then open and run the application in your IDE, or run the following command in the project directory:

./gradlew bootRun

Note: The first run may take some time as it will download the necessary embedding models. Once the models are in place, the application will be ready for use.

Use the search engine

The hybrid search is based on Reciprocal Rank Fusion (RRF), an algorithm that merges multiple ranked lists of search results into a single ranking. Here it combines the full-text and vector search results.
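
As a rough illustration of how RRF works, the sketch below scores each document as the sum of 1 / (k + rank) over every list it appears in. This is not BookQuest's actual implementation: the function name, the book ids, and the constant k = 60 (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two search back ends.
full_text_hits = ["book-a", "book-b", "book-c"]
vector_hits = ["book-c", "book-a", "book-d"]

fused = reciprocal_rank_fusion([full_text_hits, vector_hits])
print(fused)  # ['book-a', 'book-c', 'book-b', 'book-d']
```

Books found by both searches ("book-a" and "book-c") rise above books found by only one, even when the single-list hit ranked higher in its own list.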

GUI

Visit http://localhost:8080 in the browser and watch the results change as the metadata harvesting progresses. Enter a query for hybrid search, or leave it blank for semantic similarity search (the first hit is chosen at random and the rest are semantically similar books).

API (build your own frontend)

Visit http://localhost:8080/swagger-ui.html in the browser to read and/or download the OpenAPI specification.

Gateway

The gateway abstracts away the details of the external services and transforms metadata from the external services into a common model. The application supports three gateways: OAI-PMH (MARC21), Bokbasen (ONIX) and Bibbi. Custom mappers can be implemented as needed.

OAI-PMH

The OAI-PMH gateway harvests metadata using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It supports retrieving bibliographic data in MARC21 format.
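
For orientation, the sketch below builds the kind of ListRecords request URL that the OAI-PMH protocol defines, using the service URI and metadata prefix from the configuration example. The helper name is hypothetical and this is not the gateway's actual code. It follows the protocol rule that a resumptionToken, when present, is sent as the only argument besides the verb.

```python
from urllib.parse import urlencode


def build_list_records_url(service_uri, metadata_prefix,
                           set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces
        # metadataPrefix and set when fetching the next page.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{service_uri}?{urlencode(params)}"


first_page = build_list_records_url("https://oai.aja.bs.no/mlnb", "marc21")
print(first_page)
# https://oai.aja.bs.no/mlnb?verb=ListRecords&metadataPrefix=marc21

# "token123" is a made-up token; real tokens come from the previous response.
next_page = build_list_records_url(
    "https://oai.aja.bs.no/mlnb", "marc21", resumption_token="token123"
)
print(next_page)
```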

Additional documentation for OAI-PMH is available from Biblioteksentralen (https://www.bibsent.no/) and Nasjonalbiblioteket (https://www.nb.no/).

Bokbasen

The Bokbasen gateway uses the ONIX format for metadata, commonly employed in the publishing industry. This is particularly useful for harvesting data from large-scale book vendors.

Additional documentation for ONIX is available from Bokbasen (https://www.bokbasen.no/).

Bibbi

The Bibbi gateway is used for integrating with the Bibbi metadata service. The gateway uses a format based on Schema.org.

Additional documentation for Bibbi is available from Biblioteksentralen (https://www.bibsent.no/).

Text classification

The steps below extract a dataset for fine-tuning a BERT-based model for multi-label classification of book reviews. The classifier itself lives at https://github.com/torleifg/multi-label-bert-classifier

Connect to the database:

psql -h localhost -p 5433 -U username -d postgres

Extract an example dataset using genre and form as labels:

create temp table temp_export as
select
  concat(metadata->>'title', '. ', metadata->>'description') as text,
  jsonb_agg(distinct genre_terms->>'term' order by genre_terms->>'term') as labels
from
  book,
  lateral jsonb_array_elements(metadata->'genreAndForm') as genre_terms
where
  metadata->>'description' is not null
  and metadata->>'description' <> ''
  and length(metadata->>'description') > 200
  and metadata->'genreAndForm' is not null
  and jsonb_array_length(metadata->'genreAndForm') > 0
  and genre_terms->>'language' = 'nob'
group by text;
\copy temp_export to '~/dataset.csv' with csv header delimiter ';';
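
The exported file is a semicolon-delimited CSV with a header row, where the labels column holds a JSON array. A minimal sketch of reading it back for training follows. The sample row is invented for illustration, and the function name is hypothetical.

```python
import csv
import io
import json


def read_dataset(handle):
    """Return (text, labels) pairs from the semicolon-delimited export."""
    reader = csv.DictReader(handle, delimiter=";")
    # The labels column is the text form of a JSON array, so decode it.
    return [(row["text"], json.loads(row["labels"])) for row in reader]


# Invented sample in the same shape as the export (CSV doubles inner quotes).
sample = io.StringIO(
    "text;labels\n"
    '"Tittel. En lang beskrivelse";"[""Krim"", ""Roman""]"\n'
)

rows = read_dataset(sample)
print(rows)  # [('Tittel. En lang beskrivelse', ['Krim', 'Roman'])]
```

To read the real export, pass a file handle instead of the sample, for example `open(os.path.expanduser("~/dataset.csv"), newline="", encoding="utf-8")` (after `import os`).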
