This repository contains code and resources to run a hybrid (semantic and full-text) search engine for books. It utilizes text embeddings and supports harvesting book metadata from various sources, using international standards like MARC21 and ONIX 3.
The application uses Multilingual-E5-small to generate text embeddings and PostgreSQL with pgvector as the database and vector store. Together these provide multilingual semantic search capabilities.
- Multilingual-E5-small: This pre-trained model is used for generating text embeddings.
- pgvector: A PostgreSQL extension for storing and querying vectors, used as the vector store in the search engine.
Follow "Run as Docker Compose" or "Run as Spring Boot" to configure and run the application.
Configure one or more gateways for harvesting metadata by editing compose-app.yml.
Available options:
- oai-pmh
- bibbi
- bokbasen
Example:
SCHEDULER_ENABLED: true
SCHEDULER_INITIAL_DELAY: 5
SCHEDULER_FIXED_DELAY: 3600
HARVESTING_OAI_PMH_GATEWAYS_0_ENABLED: true
HARVESTING_OAI_PMH_GATEWAYS_0_SERVICE_URI: https://oai.aja.bs.no/mlnb
HARVESTING_OAI_PMH_GATEWAYS_0_TTL: 5
HARVESTING_OAI_PMH_GATEWAYS_0_MAPPER: default
HARVESTING_OAI_PMH_GATEWAYS_0_VERB: ListRecords
HARVESTING_OAI_PMH_GATEWAYS_0_METADATA_PREFIX: marc21
Run the following command in the project directory:
docker compose -f compose-app.yml up
Note: The first run may take some time as it will download the necessary embedding models. Once the models are in place, the application will be ready for use.
Configure one or more gateways for harvesting metadata by editing application/src/main/resources/application.yaml.
Available options:
- oai-pmh
- bibbi
- bokbasen
Example:
scheduler:
  enabled: true
  initial-delay: 5
  fixed-delay: 3600
harvesting:
  oai-pmh:
    gateways:
      - enabled: true
        service-uri: https://oai.aja.bs.no/mlnb
        ttl: 5
        mapper: default
        verb: ListRecords
        metadata-prefix: marc21
        set:
Run the following command in the project directory:
docker compose up
Open and run the application in your IDE or run the following command in the project directory:
./gradlew bootRun
Note: The first run may take some time as it will download the necessary embedding models. Once the models are in place, the application will be ready for use.
The hybrid search is based on Reciprocal Rank Fusion (RRF), an algorithm used for combining multiple ranked lists of search results to improve the overall ranking quality, in this case to combine full-text and vector search results.
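The fusion step can be illustrated with a minimal Python sketch. It is not the application's implementation: the function name, the book IDs, and the constant `k = 60` (a commonly used default for RRF) are assumptions made for illustration. Each book receives a score of `1 / (k + rank)` from every list it appears in, and the scores are summed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best match first.

    rankings: iterable of lists, each ordered from best to worst hit.
    k: smoothing constant (assumed value; 60 is a common default).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists from full-text search and vector search.
full_text_hits = ["b1", "b2", "b3"]
vector_hits = ["b3", "b1", "b4"]
fused = reciprocal_rank_fusion([full_text_hits, vector_hits])
print(fused)  # ['b1', 'b3', 'b2', 'b4']
```

Books that appear in both lists (`b1`, `b3`) rise above books found by only one method, which is the behaviour hybrid search relies on.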
Visit http://localhost:8080 in the browser and watch the results as the metadata harvesting progresses. Enter a query for hybrid search or leave it blank for semantic similarity search (the first search hit will be a random choice and the rest will be semantically similar books).
Visit http://localhost:8080/swagger-ui.html in the browser to read and/or download the OpenAPI specification.
The gateway abstracts away the details of the external services and transforms metadata from the external services into a common model. The application supports three gateways: OAI-PMH (MARC21), Bokbasen (ONIX) and Bibbi. Custom mappers can be implemented as needed.
The OAI-PMH gateway harvests metadata using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It supports retrieving bibliographic data in MARC21 format.
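As a rough illustration of what a harvesting request looks like, the following Python sketch builds a `ListRecords` URL from the same values used in the configuration examples (service URI, verb, metadata prefix, optional set). The function name and parameters are hypothetical and do not reflect the gateway's actual code. In OAI-PMH, a `resumptionToken` is an exclusive argument, so follow-up requests for further pages carry only the verb and the token.

```python
from urllib.parse import urlencode


def list_records_url(service_uri, metadata_prefix="marc21",
                     set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    if resumption_token:
        # The token replaces all other arguments except the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{service_uri}?{urlencode(params)}"


print(list_records_url("https://oai.aja.bs.no/mlnb"))
# https://oai.aja.bs.no/mlnb?verb=ListRecords&metadataPrefix=marc21
```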
Additional documentation for OAI-PMH from Biblioteksentralen (https://www.bibsent.no/):
- Ája OAI-PMH API (requires no authentication)
Additional documentation for OAI-PMH from Nasjonalbiblioteket (https://www.nb.no/):
- Nasjonalbibliografien og spesialbibliografiene OAI-PMH API (requires no authentication)
The Bokbasen gateway uses the ONIX format for metadata, commonly employed in the publishing industry. This is particularly useful for harvesting data from large-scale book vendors.
Additional documentation for ONIX from Bokbasen (https://www.bokbasen.no/):
- Bokbasen ONIX API (requires authentication)
The Bibbi gateway is used for integrating with the Bibbi metadata service. The gateway uses a format based on Schema.org.
Additional documentation for Bibbi from Biblioteksentralen (https://www.bibsent.no/):
- Bibbi Metadata REST API (requires no authentication)
Instructions for extracting a dataset for fine-tuning a BERT-based model for multi-label classification of book reviews: https://github.com/torleifg/multi-label-bert-classifier
psql -h localhost -p 5433 -U username -d postgres
Extract an example dataset using genre and form as labels.
create temp table temp_export as
select
    concat(metadata->>'title', '. ', metadata->>'description') as text,
    jsonb_agg(distinct genre_terms->>'term' order by genre_terms->>'term') as labels
from
    book,
    lateral jsonb_array_elements(metadata->'genreAndForm') as genre_terms
where
    metadata->>'description' is not null
    and metadata->>'description' <> ''
    and length(metadata->>'description') > 200
    and metadata->'genreAndForm' is not null
    and jsonb_array_length(metadata->'genreAndForm') > 0
    and genre_terms->>'language' = 'nob'
group by text;
\copy temp_export to '~/dataset.csv' with csv header delimiter ';';
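The exported file can then be read back for training. The sketch below is a minimal example that assumes the export above produced a semicolon-delimited CSV with a header row and the columns `text` and `labels`, where `labels` holds a JSON array as written by `jsonb_agg`. The function names are hypothetical.

```python
import csv
import json
import os


def load_dataset(lines):
    """Parse exported rows into (text, labels) pairs.

    lines: any iterable of CSV lines, for example an open file.
    """
    reader = csv.DictReader(lines, delimiter=";")
    # The labels column is a JSON array serialized as text.
    return [(row["text"], json.loads(row["labels"])) for row in reader]


def load_dataset_file(path="~/dataset.csv"):
    """Read the export from disk (the path is assumed from the \\copy command)."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(os.path.expanduser(path), newline="", encoding="utf-8") as f:
        return load_dataset(f)
```

Calling `load_dataset_file()` returns a list of `(text, labels)` tuples that can be passed to a multi-label encoder of your choice.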