Fully reproducible guide w/ online notebooks, model(s), dataset.
Note
Following my previous work on people's engagement with the Ukraine War, I decided to manually annotate approximately 400 comments (out of 180k) and train a classifier to assess the weight of pro-Russian comments. We initially experimented with a few-shot learning model (SetFit), but then we found ourselves going down the rabbit hole. Update 2024/04/22: also check the Docker+FastAPI serving app here
Baseline model - SetFit: very good performance (and latency...) for a few-shot learning approach. After many trials, the final choice was MPNet embeddings + a logistic head, extended to 90 examples per class (3 classes: 0. pro Ukraine, 1. pro Russia, 2. off topic/no opinion) to achieve good accuracy. Anything lower (16, 32 samples, etc.) wasn't enough.
Mistral-7B Fine-Tuning: a well-crafted prompt + 2k synthetically generated samples (with OpenHermes) were enough to fine-tune a Mistral-7B to 81% accuracy, with notably better performance on our class of interest (1: pro-Russia comments). Unsurprisingly, the LLM shows its amazing power to capture (some of) the human subtleties.
Classifier training on augmented dataset: I was eager to know if we could train a more "classic" classifier on a larger portion of our initial dataset. 20k unlabeled comments were labeled using a voting ensemble of SetFit + our fine-tuned Mistral-7B (a rough sketch of this voting step follows the results table below) and used to train a classifier on top of multi-e5-base embeddings (vs. BGE and multi-e5-small). We tried many training-data combinations (train size, and/or the 2k synthetic samples, and/or 5-20k predicted samples added); the best performance was achieved with a weighted loss + only the 20k ensemble-predicted labels, without the 2k synthetic examples. It still performs better than our baseline on class 1.
Model Optimization: the e5-based classifier is converted to ONNX, then optimized + quantized. We retain 98% of the base e5 model's accuracy, while shrinking the model size to 266Mb (instead of 1Gb) and gaining a 1.9x speedup in inference latency (from 180ms down to 90ms on a P100). It performs slightly worse than our fine-tuned LLM, but the latency gain is huge (90ms vs. 800ms)!
Model | Size | Accuracy (%) | F1, class 1 (pro Russia) | Latency (ms) |
---|---|---|---|---|
SetFit (logistic head) | n/a | 78 | 58 | 10 |
Fine-Tuned Mistral-7B | 13-4Gb | 81 | 74 | 800 |
Fine-Tuned Llama3-8B | 13-4Gb | 80 | 77 | 800 |
multi-e5-base | 1Gb | 79 | 70 | 180 |
Quantized ONNX multi-e5-base | 266Mb | 78 | 69 | 90 |
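As a reference point, the voting-ensemble pseudo-labeling mentioned above (SetFit + fine-tuned Mistral-7B, used to label the 20k comments) boils down to a few lines. This is a hedged sketch: the helper names (`predict_setfit`, `predict_llm`) and the unanimity rule are illustrative assumptions, not the exact logic of the lmd_setfit_mistral_inference notebook.

```python
# Hedged sketch of the SetFit + fine-tuned-LLM voting ensemble used to
# pseudo-label ~20k comments. predict_setfit / predict_llm are assumed
# helpers wrapping the two models; the agreement rule is an assumption.
from typing import Callable

def ensemble_label(
    comments: list[str],
    predict_setfit: Callable[[list[str]], list[int]],
    predict_llm: Callable[[list[str]], list[int]],
) -> list[tuple[str, int]]:
    """Keep a pseudo-label only when both models agree (2-model vote)."""
    setfit_preds = predict_setfit(comments)
    llm_preds = predict_llm(comments)
    labeled = []
    for text, a, b in zip(comments, setfit_preds, llm_preds):
        if a == b:              # unanimous -> trust the label
            labeled.append((text, a))
        # disagreements are dropped (or kept aside for manual review)
    return labeled
```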
- Notebooks should work on Google Colab too, maybe with a few adaptations for fine-tuning with Unsloth (libs install).
- Dataset (here), LLM LoRA adapters (here) and the final multi-e5-base ONNX model (here) are available on HuggingFace.
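If you just want to try the final quantized model, a minimal inference loop with onnxruntime could look like the sketch below. The repo id, ONNX file name and the assumption that the first graph output holds the class logits are all placeholders; check the model card on HuggingFace for the real ones.

```python
# Hypothetical usage sketch: repo id, file name and output layout are
# placeholders, not the actual published artifact names.
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

REPO_ID = "user/lmd-e5-classifier-onnx"                        # placeholder repo id
onnx_path = hf_hub_download(REPO_ID, "model_quantized.onnx")   # assumed file name

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
graph_inputs = {i.name for i in session.get_inputs()}

def classify(texts: list[str]) -> list[int]:
    # e5 models expect a "query: " prefix on inputs
    enc = tokenizer([f"query: {t}" for t in texts],
                    padding=True, truncation=True, return_tensors="np")
    feed = {k: v for k, v in enc.items() if k in graph_inputs}
    logits = session.run(None, feed)[0]          # assumed: first output = class logits
    return logits.argmax(axis=-1).tolist()       # 0 pro-UA, 1 pro-RU, 2 off topic

print(classify(["Poutine a raison de défendre son pays."]))
```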
Notebook | Description | Resource |
---|---|---|
lmd_setfit_modeling_logistic_head | Baseline model - few-shot clf using SetFit | notebook |
lmd_mistral_synthetic_gen_testprompt | Synthetic data gen - prepare dataset - prompts tests Mistral-7B-OpenHermes | notebook |
lmd_mistral_synthetic_gen_run | Synthetic data gen - run (output : 2k synthetic samples) | notebook |
lmd_mistral_synthetic_fine_tune | Fine-tuning Mistral-7B-base for classification (output: JSON label) from synthetic data, using Unsloth (QLoRA), Alpaca template | notebook
lmd_setfit_mistral_evaluation | Benchmark SetFit / fine-tuned Mistral (several experiments) | notebook |
lmd_setfit_mistral_inference | Augment original dataset - voting ensemble SetFit + f-tuned LLM to infer 20k unlabeled comments | notebook |
lmd_multi-e5_train | multi-e5/bge embeddings + nn classifier on augmented dataset (several experiments) | notebook |
e5_onnx_optimization | multi-e5 - ONNX conversion & optimization/quantization | notebook |
lmd_e5_evaluation | Benchmark all models - focus on global accuracy, minority class accuracy and inference latency | notebook |
- Three labels: 0. support for Ukraine, 1. (rather) support for Russia, 2. off topic / no opinion.
- Initial seed (annotated data) was labeled with Label Studio: around 400 samples, with 'oversampling' of the minority class to capture enough information.
- Overall, (obvious/direct) support for Russia is rare (~10%), and Le Monde subscribers love to digress (class 2 is the vast majority).
- We used the Faiss index / vector search built in our previous work to retrieve enough "pro-Russian" comments among the 180k we scraped, along with random exploration.
- We tried many optimizations on the few-shot SetFit model (not shared here): number of labels, grid search, different heads.
- Given the sample size, very good performance (78% accuracy) & deployability (5ms latency), but we were not satisfied with the performance on our class of interest (pro-Russian comments); a minimal training sketch follows.
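For reference, the baseline reduces to something like the sketch below, assuming the pre-1.0 `SetFitTrainer` API; the real hyperparameters, the ~90 examples per class and the grid search live in the notebook.

```python
# Minimal SetFit baseline sketch: multilingual MPNet embeddings + the default
# logistic-regression head. The training data below is a placeholder; the real
# seed has ~90 examples per class (0 pro-UA, 1 pro-RU, 2 off topic).
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

train_ds = Dataset.from_dict({
    "text": ["...", "...", "..."],   # placeholder comments
    "label": [0, 1, 2],
})

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    num_iterations=20,               # contrastive pairs generated per example
)
trainer.train()

preds = model(["La Russie ne fait que se défendre contre l'OTAN."])  # example French comment
```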
Notebook | Description | Resource |
---|---|---|
lmd_setfit_modeling_logistic_head | Baseline model - few-shot clf using SetFit | notebook
- Goal: augment our initial, manually annotated seed with synthetic data so we can fine-tune an LLM to perform classification.
- Room for improvement: despite efforts on prompting (context + examples + lots of tests), we still had to discard 40% of the generated comments. But a manual (random) review showed good enough results: credible comments with the right associated label.
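A stripped-down version of the generation loop is sketched below. It assumes OpenHermes-2.5-Mistral-7B served through the transformers `text-generation` pipeline; the actual prompts (context + few-shot examples) and the post-filtering are in the two notebooks listed below.

```python
# Hedged sketch of synthetic comment generation. The prompt is heavily
# simplified and the sampling parameters are illustrative.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="teknium/OpenHermes-2.5-Mistral-7B",
    torch_dtype=torch.float16,
    device_map="auto",
)

def generate_comments(stance: str, n: int = 5) -> list[str]:
    messages = [
        {"role": "system",
         "content": "You write realistic French reader comments for Le Monde articles about the war in Ukraine."},
        {"role": "user",
         "content": f"Write {n} distinct comments expressing a '{stance}' point of view, one per line."},
    ]
    prompt = generator.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    out = generator(prompt, max_new_tokens=512, do_sample=True,
                    temperature=0.8, return_full_text=False)
    return [l.strip() for l in out[0]["generated_text"].splitlines() if l.strip()]

pro_russia_samples = generate_comments("rather pro-Russia")   # label 1 candidates
```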
Notebook | Description | Resource |
---|---|---|
lmd_mistral_synthetic_gen_testprompt | Synthetic data gen - prepare dataset - prompts tests Mistral-7B-OpenHermes | notebook |
lmd_mistral_synthetic_gen_run | Synthetic data gen - run (output : 2k synthetic samples) | notebook |
- We are not using LLM embeddings with a classification layer. Instead, we fine-tune our model on annotated + synthetic data so it predicts a label {label:0}, {label:1} or {label:2} given a prompt (instruction + comment); the template is sketched after this list.
- Our Alpaca-like template showed good performance with Mistral-7B base v0.1 (no improvement with the recently released v0.2).
- The LLM as a predictor shows very good accuracy (81%) and, most importantly, performs well on our minority class.
- Could we use it to extend our dataset? We have nearly 180k unlabeled comments that could be used to train a standard classifier!
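Here is a hedged sketch of the prompt/response contract: the instruction wording below is illustrative (the real template is in lmd_mistral_synthetic_fine_tune), but the idea is an Alpaca-style prompt answered by a tiny JSON object.

```python
# Illustrative Alpaca-style template + tolerant parsing of the {"label": X}
# answer. The wording is an assumption; only the overall shape matches the repo.
import re

ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Classify this Le Monde reader comment about the war in Ukraine.
Answer with a JSON object: 0 = pro Ukraine, 1 = pro Russia, 2 = off topic / no opinion.

### Input:
{comment}

### Response:
"""

def build_prompt(comment: str) -> str:
    return ALPACA_TEMPLATE.format(comment=comment)

def parse_label(generated: str) -> int | None:
    """Extract the label from e.g. '{"label": 1}', tolerating extra text."""
    match = re.search(r'\{\s*"?label"?\s*:\s*([0-2])\s*\}', generated)
    return int(match.group(1)) if match else None

assert parse_label('{"label": 1} </s>') == 1
```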
Notebook | Description | Resource |
---|---|---|
lmd_mistral_synthetic_fine_tune | Fine-tuning Mistral-7B-base for classification (output: JSON label) w/ synth. data, using Unsloth (QLoRA), Alpaca template | notebook
lmd_setfit_mistral_evaluation | Benchmark SetFit / fine-tuned Mistral (several experiments) | notebook |
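Below is a condensed sketch of the QLoRA fine-tune, close in spirit to the Unsloth Colab example linked in the resources. The hyperparameters, dataset path and 4-bit base checkpoint are assumptions; the real run is in lmd_mistral_synthetic_fine_tune.

```python
# Hedged sketch of the Unsloth QLoRA fine-tune on Alpaca-formatted samples.
# Dataset path, LoRA ranks and training hyperparameters are illustrative.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",   # 4-bit Mistral-7B base
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Each row holds the full Alpaca prompt + the expected {"label": X} answer
dataset = load_dataset("json", data_files="synthetic_alpaca.jsonl", split="train")  # placeholder path

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        output_dir="outputs",
    ),
)
trainer.train()
model.save_pretrained("lora_adapters")   # saves the LoRA adapters only
```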
- End goal: retain enough accuracy while lowering inference time (900ms for the fine-tuned LLM).
- Train a classifier on top of multi-e5 embeddings (tested bge-m3 as well). Several dataset compositions were tried (synthetic data and/or LLM-predicted labels and/or the initial seed); a minimal training sketch follows this list.
- Our accuracy is good enough, especially on the minority class, considering we trained on 20k synthetic/LLM-predicted samples.
- The final model is converted to ONNX, quantized & optimized -> 80ms avg latency (sketched after the notebook table below).
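A minimal sketch of that classifier, assuming frozen multilingual-e5-base embeddings with a small torch head and inverse-frequency class weights; the real experiments (including whether the encoder itself is fine-tuned) are in lmd_multi-e5_train.

```python
# Hedged sketch: classification head on multilingual-e5-base embeddings with a
# class-weighted loss. Head size, weighting scheme and the frozen encoder are
# assumptions; see lmd_multi-e5_train for the real experiments.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

texts = ["...", "...", "..."]          # ~20k ensemble-labeled comments (placeholder)
labels = [0, 1, 2]

encoder = SentenceTransformer("intfloat/multilingual-e5-base")
X = torch.tensor(encoder.encode([f"query: {t}" for t in texts],
                                normalize_embeddings=True))
y = torch.tensor(labels)

head = nn.Sequential(
    nn.Linear(X.shape[1], 256), nn.ReLU(), nn.Dropout(0.2), nn.Linear(256, 3),
)

# Inverse-frequency weights so class 1 (pro Russia) is not drowned out
counts = torch.bincount(y, minlength=3).float().clamp(min=1)
criterion = nn.CrossEntropyLoss(weight=counts.sum() / (3 * counts))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

for epoch in range(10):                # full-batch loop, for brevity
    optimizer.zero_grad()
    loss = criterion(head(X), y)
    loss.backward()
    optimizer.step()
```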
Notebook | Description | Resource |
---|---|---|
lmd_setfit_mistral_inference | Augment original dataset - voting ensemble SetFit + f-tuned LLM to infer 20k unlabeled comments | notebook |
lmd_multi-e5_train | multi-e5/bge embeddings + nn classifier on augmented dataset (several experiments) | notebook |
e5_onnx_optimization | multi-e5 - ONNX conversion & optimization/quantization | notebook |
lmd_e5_evaluation | Benchmark all models - focus on global accuracy, minority-class accuracy and inference latency | notebook
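The export path can be sketched with Hugging Face Optimum, largely following the Philschmid tutorial linked in the resources below. Paths, file names and the quantization config are placeholders; the real steps are in e5_onnx_optimization.

```python
# Hedged sketch: ONNX export, graph optimization and dynamic int8 quantization
# with Hugging Face Optimum. Model path, file names and quantization config
# are assumptions.
from optimum.onnxruntime import (
    ORTModelForSequenceClassification,
    ORTOptimizer,
    ORTQuantizer,
)
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

model_dir = "path/to/fine-tuned-e5-classifier"        # placeholder

# 1. Export the PyTorch classifier to ONNX
ort_model = ORTModelForSequenceClassification.from_pretrained(model_dir, export=True)
ort_model.save_pretrained("onnx")

# 2. Graph optimization (operator fusion, etc.)
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(OptimizationConfig(optimization_level=99), save_dir="onnx-optimized")

# 3. Dynamic int8 quantization (roughly 4x smaller, lower latency)
quantizer = ORTQuantizer.from_pretrained("onnx-optimized",
                                         file_name="model_optimized.onnx")  # assumed file name
quantizer.quantize(AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False),
                   save_dir="onnx-quantized")
```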
Resources I found particularly useful re. prompting, templates, fine-tuning, embeddings performance, quantization, etc.
- MLabonne Repo
- Dataset Gen - Kaggle example
- Dataset Gen - blog w/ prompt examples
- Prepare dataset- /r/LocalLLaMA best practice classi
- Prepare dataset - using gpt3.5
- Prepare dataset - Predibase prompts for diverse fine-tuning tasks
- Fine tune OpenHermes-2.5-Mistral-7B - including prompt template gen
- Fine tune - Unsloth colab example
- Fine tune - w/o unsloth or wandb or philschmid
- Fine tune - impact of parameters S. Raschka
- Embeddings - multilingual, latest comparison w/ e5-multi
- Philschmid ONNX optim