This project provides semantic search for LHCb papers. It offers a pipeline to find similar papers or to search with natural-language queries. It uses sentence transformers to encode papers and queries as embeddings, and compares those embeddings to find similar papers.
There is a web interface that makes searching for papers easy and straightforward. I plan to update the papers database every week to keep the search results up to date; many papers by and about the LHCb collaboration appear on arXiv every week.
This is currently more of a proof of concept; I plan to improve the search results and the interface in the future, and to present the results of this project in a more technical manner later.
A demo of the project is available on my website, hosted on a humble server, here.
There are two main parts of this project:
The code inside the `model` folder is responsible for creating the embeddings for the papers. It takes the arXiv papers dataset from Kaggle and uses sentence transformers to encode the papers. The embeddings are saved to a file and used by the search part. I also use Pinecone to store and index the embeddings.
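For illustration, here is a minimal sketch of that encoding step. The model name, record fields, and index name are assumptions for the example, not necessarily the exact choices in the `model` folder:

```python
# Hypothetical sketch of the encoding step; model and index names are assumptions.
import os
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

# A couple of example records in the shape of the Kaggle arXiv metadata dataset
papers = [
    {"id": "2401.00001", "title": "Example LHCb measurement", "abstract": "..."},
    {"id": "2401.00002", "title": "Another LHCb paper", "abstract": "..."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
texts = [p["title"] + ". " + p["abstract"] for p in papers]
embeddings = model.encode(texts)

# Upsert the vectors into a Pinecone index (index name is hypothetical)
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("lhcb-papers")
index.upsert(vectors=[(p["id"], e.tolist()) for p, e in zip(papers, embeddings)])
```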
The code inside the `website` folder is responsible for the web interface. It uses the embeddings created by the pipeline to find similar papers or to answer natural-language queries. The website is built using the Flask Python library.
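At query time, the same model encodes the user's query and Pinecone returns the nearest papers. A minimal sketch of that flow, with a hypothetical `/search` route and response shape that may differ from the actual code:

```python
# Hypothetical sketch of the query path in the Flask app.
import os
from flask import Flask, request, jsonify
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

app = Flask(__name__)
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed to match the pipeline's model
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX_NAME"])

@app.route("/search")
def search():
    # Encode the natural-language query with the same model used for the papers
    query = request.args.get("q", "")
    vector = model.encode(query).tolist()
    # Ask Pinecone for the nearest paper embeddings
    result = index.query(vector=vector, top_k=10, include_metadata=True)
    return jsonify([{"id": m.id, "score": m.score} for m in result.matches])
```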
The website component can be run either locally using Python or via Docker. Both methods are explained below.
- Python 3.10 or higher
- Pinecone API key and index name
- Clone the repository:
```bash
git clone https://github.com/MohamedElashri/lhcbfinder.git
cd lhcbfinder/website
```
- Create and activate a virtual environment (optional but recommended):
```bash
python -m venv venv
source venv/bin/activate  # Linux/Mac; on Windows use venv\Scripts\activate
```
- Install the required packages:
```bash
pip install -r requirements.txt
```
- Create a `.env` file in the website directory with your Pinecone credentials:
```env
PINECONE_API_KEY=your_api_key_here
PINECONE_INDEX_NAME=your_index_name_here
FLASK_ENV=development
```
- Run the development server:
```bash
python run_local.py
```
The website should now be accessible at http://localhost:8000
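For reference, the app is expected to pick these variables up at startup; here is a minimal sketch assuming python-dotenv is used (the actual loading code in the repository may differ):

```python
# Hypothetical sketch of loading the .env credentials with python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["PINECONE_API_KEY"]
index_name = os.environ["PINECONE_INDEX_NAME"]
```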
For production deployment, you can use Docker and docker-compose:
- Clone the repository:
```bash
git clone https://github.com/MohamedElashri/lhcbfinder
cd lhcbfinder/website
```
- Create a `.env` file with your credentials:
```env
PINECONE_API_KEY=your_api_key_here
PINECONE_INDEX_NAME=your_index_name_here
FLASK_ENV=production
FLASK_APP=app.py
```
- Build and start the containers:
```bash
docker compose up -d
```
The website will be available at http://localhost:8000
- `PINECONE_API_KEY`: Your Pinecone API key
- `PINECONE_INDEX_NAME`: Name of your Pinecone index
- `FLASK_ENV`: Set to `development` for local development or `production` for deployment
- `WORKERS`: Number of Gunicorn workers (default: 2)
- `THREADS`: Number of threads per worker (default: 4)
- `TIMEOUT`: Worker timeout in seconds (default: 60)
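The last three variables tune Gunicorn. A `gunicorn.conf.py` is plain Python, so a minimal sketch of how they could be wired up looks like this (the actual configuration in the image may differ):

```python
# gunicorn.conf.py: hypothetical sketch wiring the env vars to Gunicorn settings.
import os

bind = "0.0.0.0:8000"
workers = int(os.environ.get("WORKERS", 2))
threads = int(os.environ.get("THREADS", 4))
timeout = int(os.environ.get("TIMEOUT", 60))
```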
The application includes rate limiting to prevent abuse (see the sketch after this list):
- 1 request per 30 seconds per IP
- 5 requests per 3 minutes per IP
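This kind of setup is commonly built with Flask-Limiter, backed by Redis in Docker and in-memory storage in development. The following sketch assumes that library and the compose service name `redis`, which may not match the actual wiring in `app.py`:

```python
# Hypothetical sketch of the rate-limiting setup with Flask-Limiter.
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(
    get_remote_address,                # limit per client IP
    app=app,
    default_limits=["1 per 30 seconds", "5 per 3 minutes"],
    storage_uri="redis://redis:6379",  # use "memory://" for local development
)
```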
- If you see Redis connection errors in Docker:
  - Ensure the Redis container is running: `docker-compose ps`
  - Check the Redis logs: `docker-compose logs redis`
- If rate limiting is too restrictive for development:
  - Use the development environment to switch to memory-based storage
  - Adjust the limits in `app.py` if needed
- For connection issues with Pinecone:
  - Verify your API key and index name (a quick check script is sketched below)
  - Check your network connection
  - Ensure your Pinecone plan is active
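A quick way to verify the key and index name is a small script like this one (assuming the `pinecone` Python client, v3 or later):

```python
# Hypothetical quick check of the Pinecone credentials.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
names = pc.list_indexes().names()
print("Available indexes:", names)
if os.environ["PINECONE_INDEX_NAME"] not in names:
    print("Index not found; check PINECONE_INDEX_NAME")
```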
There is still a lot of work to do on quality, improving the embeddings and the search results. I also plan to add more features to the website and make it more user friendly. If you would like to contribute to this project, feel free to open an issue or a pull request.
This work is a fork of the searchthearxiv project, which provides semantic search for machine learning arXiv papers. I used its code and idea to create a similar project for LHCb papers.
This project is licensed under the GPLv3 License - see the LICENSE.md file for details.