- Fine-Tuning Embedding Starter Kit
- Key features
- Before you begin
- Using the model
- Contributing
- License
- Acknowledgments
This comprehensive starter kit guides users through fine-tuning embeddings from unstructured data, leveraging Large Language Models (LLMs) and open-source embedding models to enhance NLP task performance. It supports a flexible workflow catering to different stages of the fine-tuning process, from data preparation to evaluation.
- Automated Query Generation: Automatically generate synthetic query-answer pairs from unstructured text.
- Embedding Model Fine-Tuning: Fine-tune open-source embedding models with a synthetic dataset via the Sentence Transformers library.
- Performance Evaluation: Benchmark the fine-tuned embeddings with metrics to quantify improvements.
Some things to note:
- This kit fine-tunes models from Hugging Face, identified by model name through the `model_id` argument described later.
- We hope to add embedding model fine-tuning to SambaStudio very soon, which will allow you to fine-tune embedding models on SambaNova RDUs.
- In this kit, the SambaNova LLM is used in the dataset creation process, namely to create questions and answers from the chunks in order to fine-tune the model.
- This kit allows the use of SambaStudio:
- Clone the ai-starter-kit repo.
git clone https://github.com/sambanova/ai-starter-kit.git
- Install the dependencies:
- With poetry:
poetry install --no-root
- With pip:
pip install -r requirements.txt
- Python 3.11+
- Required libraries: Sentence Transformers, Hugging Face Transformers
The next step is to set up your environment to use one of the models available from SambaNova. If you're a current SambaNova customer, you can deploy your models with SambaStudio. If you are not a SambaNova customer, you can self-service provision API endpoints using SambaNova Fast-API.
- If using SambaStudio, please follow the instructions here for setting up your endpoint and your environment variables. Then, in the config file, set the llm `api` variable to `"sambastudio"`, and set the `CoE` and `select_expert` configs if using a CoE endpoint.
- If using SambaNova Fast-API, please follow the instructions here for setting up your environment variables. Then, in the config file, set the llm `api` variable to `"fastapi"` and set the `select_expert` config depending on the model you want to use.
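The two configuration options above can be summarized in a small validation helper. This is only a sketch: the dictionary layout (an `llm` section with `api`, `coe`, and `select_expert` keys) and the function name are assumptions made for illustration, so check the kit's actual config file for the real key names.

```python
# Hypothetical config layout; the kit's real config file may use different keys.
SUPPORTED_APIS = {"sambastudio", "fastapi"}


def validate_llm_config(config: dict) -> dict:
    """Return the llm section if its settings match the options described above."""
    llm = config.get("llm", {})
    api = llm.get("api")
    if api not in SUPPORTED_APIS:
        raise ValueError(f"llm api must be one of {sorted(SUPPORTED_APIS)}, got {api!r}")
    # Fast-API always needs an expert (model) name; SambaStudio needs one for CoE endpoints.
    needs_expert = api == "fastapi" or bool(llm.get("coe"))
    if needs_expert and not llm.get("select_expert"):
        raise ValueError("select_expert must be set for this endpoint type")
    return llm


example = {"llm": {"api": "sambastudio", "coe": True, "select_expert": "your_expert_name"}}
print(validate_llm_config(example)["api"])
```

Failing early on a misconfigured `api` value tends to be easier to debug than a failed request during dataset generation.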
After you've completed installation, you can fine-tune an embedding model and evaluate it.
The standard workflow consists of data preparation and script execution. Follow these steps:
- Place your data in a directory that you later specify when you run the script. The script defaults to PDFs but supports other file types.
- Run the script with the necessary parameters. For example:
python fine_tune_embed_model.py --input_data_directory ./your_data_directory --output_data_directory ./processed_data
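To illustrate the data preparation step, here is a minimal sketch of how input files might be collected from a directory by extension. The function name and the recursive search are assumptions for illustration; the kit's script may gather files differently.

```python
from pathlib import Path


def collect_input_files(input_dir: str, file_extension: str = "pdf") -> list[Path]:
    """Recursively list files under input_dir that have the given extension."""
    # Accept both "pdf" and ".pdf", and match case-insensitively.
    suffix = "." + file_extension.lower().lstrip(".")
    return sorted(
        p for p in Path(input_dir).rglob("*")
        if p.is_file() and p.suffix.lower() == suffix
    )
```

For example, `collect_input_files("./your_data_directory")` would return the PDF files found under that directory, while passing `file_extension="txt"` would select text files instead.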
The script supports the following arguments to customize the process:
- `--file_extension` for specifying file types.
- `--split_ratio` to adjust the train-validation dataset split.
- `--force_retrain` to force retraining even if a finetuned model exists.
Run `python fine_tune_embed_model.py --help` for a full list of arguments and their descriptions.
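For readers unfamiliar with how such flags are usually wired up, the following sketch declares the arguments named above with `argparse`. The flag names come from this README; the types, the default values (in particular `0.8` for `--split_ratio`), and the help strings are assumptions, so rely on `--help` for the script's real behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags described in this README."""
    parser = argparse.ArgumentParser(description="Illustrative sketch of the script's CLI.")
    parser.add_argument("--input_data_directory", type=str, help="Directory with the source documents.")
    parser.add_argument("--output_data_directory", type=str, help="Directory for the generated datasets.")
    # Default of "pdf" follows the README; the other defaults are assumptions.
    parser.add_argument("--file_extension", type=str, default="pdf", help="File type to load.")
    parser.add_argument("--split_ratio", type=float, default=0.8, help="Fraction of data used for training.")
    parser.add_argument("--force_retrain", action="store_true", help="Retrain even if a finetuned model exists.")
    return parser


args = build_parser().parse_args(["--input_data_directory", "./your_data_directory", "--split_ratio", "0.9"])
print(args.file_extension, args.split_ratio, args.force_retrain)
```

Using `action="store_true"` for `--force_retrain` means the flag takes no value: its presence alone switches retraining on.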
If you've previously generated synthetic data and wish to proceed directly to fine-tuning:
- Run the script with `--train_dataset_path` and `--val_dataset_path` set to the paths of your pre-generated datasets.
- Specify the model and output directory.
For example:
python fine_tune_embed_model.py --train_dataset_path ./processed_data/train_dataset.json --val_dataset_path ./processed_data/val_dataset.json --model_id "your_model_id" --model_output_path ./finetuned_model
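As a rough picture of what a pre-generated dataset might contain, the sketch below loads a JSON file and flattens it into (query, passage) pairs. The file layout shown here (top-level `queries`, `corpus`, and `relevant_docs` keys) is an assumption for illustration and may not match the files the kit writes; inspect your own `train_dataset.json` before relying on it.

```python
import json


def load_query_passage_pairs(dataset_path: str) -> list[tuple[str, str]]:
    """Flatten an assumed queries/corpus/relevant_docs JSON file into text pairs."""
    with open(dataset_path, encoding="utf-8") as f:
        data = json.load(f)
    pairs = []
    for query_id, query_text in data["queries"].items():
        # Each query may point to one or more relevant passages in the corpus.
        for doc_id in data["relevant_docs"].get(query_id, []):
            pairs.append((query_text, data["corpus"][doc_id]))
    return pairs
```

Pairs of this shape, a query alongside a passage that answers it, are a common input format for contrastive training objectives in the Sentence Transformers library.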
To evaluate an existing finetuned model without re-running the entire process, specify the model path and run evaluation only (`--evaluate_only` argument). Follow these steps:
- Use the `--model_output_path` argument to point to your finetuned model directory.
- Ensure the dataset paths are specified if they are not in the default location.
- Specify `--evaluate_only`.
For example:
python fine_tune_embed_model.py --val_dataset_path ./processed_data/val_dataset.json --model_output_path ./finetuned_model --evaluate_only
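This README does not name the metrics the evaluation reports. As one plausible example of how retrieval quality is often quantified, the sketch below computes hit rate and mean reciprocal rank (MRR) at a cutoff `k`. The function, its inputs, and the choice of metrics are assumptions for illustration, not a description of the script's output.

```python
def hit_rate_and_mrr(
    ranked_doc_ids: dict[str, list[str]],
    relevant_docs: dict[str, set[str]],
    k: int = 5,
) -> tuple[float, float]:
    """Return (hit rate@k, MRR@k) over all queries in ranked_doc_ids."""
    hits = 0
    reciprocal_rank_sum = 0.0
    for query_id, ranking in ranked_doc_ids.items():
        relevant = relevant_docs.get(query_id, set())
        # Only the first relevant document within the top k counts.
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                hits += 1
                reciprocal_rank_sum += 1.0 / rank
                break
    n = len(ranked_doc_ids)
    if n == 0:
        return 0.0, 0.0
    return hits / n, reciprocal_rank_sum / n
```

Comparing these numbers for the base model and the finetuned model on the same validation set gives a simple before-and-after view of the improvement.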
The script supports integrating SNS (SambaNova Systems) embeddings for enhancing the retrieval and understanding capabilities within LangChain workflows. Consider these points when using SNS embeddings:
- Configuration: Ensure that you have the `export.env` file set up with your SNS credentials, including `EMBED_BASE_URL`, `EMBED_PROJECT_ID`, `EMBED_ENDPOINT_ID`, and `EMBED_API_KEY`.
- Integration Example: The script demonstrates how to use SNS embeddings for document and query encoding, followed by computing cosine similarity for a given query against a set of documents. Additionally, it integrates these embeddings into a LangChain retrieval workflow, showcasing an end-to-end example of query handling and document retrieval.
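Before running the integration, it can help to confirm that the credentials listed above are actually present in the environment. The following is a minimal sketch: the variable names are taken from this README, while the helper itself is hypothetical and assumes the values from `export.env` have already been exported into the process environment.

```python
import os

REQUIRED_EMBED_VARS = ("EMBED_BASE_URL", "EMBED_PROJECT_ID", "EMBED_ENDPOINT_ID", "EMBED_API_KEY")


def missing_embed_vars(env=None) -> list[str]:
    """Return the names of required SNS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_EMBED_VARS if not env.get(name)]


missing = missing_embed_vars()
if missing:
    print("Missing SNS credentials:", ", ".join(missing))
```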
To run the SNS embeddings integration:
python sns_embedding_script.py
The script executes the embedding process for documents and queries, computes similarities, and demonstrates a LangChain retrieval workflow using the predefined embeddings.
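The similarity step can be illustrated without any embedding service. The sketch below computes cosine similarity between a query vector and a set of document vectors and ranks the documents by score. The vectors are made-up placeholders, and the function names are assumptions rather than names from `sns_embedding_script.py`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: list[float], doc_vecs: list[list[float]]) -> list[tuple[int, float]]:
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors standing in for real query and document embeddings.
query = [1.0, 0.0]
documents = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_documents(query, documents))
```

In a real workflow the vectors would come from the embedding endpoint, and the top-ranked documents would be passed on to the retrieval chain.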
We welcome contributions! Feel free to improve the process or add features by submitting issues or pull requests.
This project is licensed under the Apache 2.0 license. See the `LICENSE.md` file in the parent folder (`AI-STARTER-KIT`) for more details.
- Sentence Transformers and Hugging Face for their resources and pre-trained models.
- Inspired by practices from the original embedding fine-tuning repository.
All the packages/tools are listed in the `requirements.txt` file in the project directory.