LlamaLens is a specialized multilingual LLM designed for analyzing news and social media content. It focuses on 19 NLP tasks, leveraging 52 datasets across Arabic, English, and Hindi.
This repo includes scripts needed to run our full pipeline, including data preprocessing and sampling, instruction dataset creation, model fine-tuning, inference and evaluation.
- Multilingual Support: Arabic, English, and Hindi.
- Comprehensive NLP Tasks: 19 tasks utilizing 52 datasets.
- Domain Optimization: Tailored for news and social media content analysis.
The model was trained on the LlamaLens dataset.
Access the LlamaLens model on Hugging Face.
- Ensure you have Python and pip installed on your system.
- Clone the repository (if applicable):
git clone https://github.com/firojalam/LlamaLens.git
cd LlamaLens
- Install the required packages:
pip install -r requirements.txt
- You may need to update the transformers library:
pip install --upgrade transformers
This repository includes a script to prepare the Llama3.1 instruction dataset. You can customize the dataset preparation process using various parameters, including sample size, dataset split, shuffling strategy, and the output directory for the processed dataset. All datasets are available on Hugging Face.
- Description: Defines the number of samples to be used in the dataset (the --samples parameter).
- Usage:
  - Set to -1 to use the full dataset.
  - Set to any positive integer to specify the maximum number of samples (using stratified sampling).
- Description: Specifies the dataset split to generate (the --split parameter).
- Choices:
  - "train": Training set
  - "test": Testing set
  - "dev": Development set
- Description: Path to the directory containing subdirectories for different languages (e.g., ar, en, hi), passed as the --intermediate_datasets_base parameter.
- Usage: The base directory should contain the following subdirectories:
  - ar: Arabic language dataset
  - en: English language dataset
  - hi: Hindi language dataset
- Example: /path/to/intermediate_datasets/ar, /path/to/intermediate_datasets/en, /path/to/intermediate_datasets/hi.
- Description: Defines how the dataset is shuffled (the --shuffling parameter).
- Choices:
  - "none": No shuffling.
  - "by_task": Shuffle the dataset within each task.
  - "by_language": Shuffle the dataset within each language.
  - "fully": Shuffle the entire dataset.
- Usage: Choose the option that best matches the configuration in the original paper.
- Description: Path to the directory where the prepared dataset will be saved (the --dataset_directory parameter).
- Usage: Specify the directory where you want to save the final dataset after it is processed.
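As a rough illustration of how a sample cap with stratified sampling might behave, the sketch below reduces a list of records to at most n entries while keeping each label's share roughly proportional. The function name stratified_sample, the use of the "output" field as the stratification key, and the rounding scheme are assumptions made for illustration; the repository's script may sample differently.

```python
import random
from collections import defaultdict


def stratified_sample(records, n, label_key="output", seed=42):
    """Return at most n records, keeping label proportions roughly intact.

    n == -1 means "use everything", mirroring the -1 convention for samples.
    (Hypothetical helper; not taken from the repository.)
    """
    if n == -1 or n >= len(records):
        return list(records)

    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    total = len(records)
    sampled = []
    for group in by_label.values():
        # Proportional quota; every label gets at least one slot before trimming.
        quota = max(1, round(n * len(group) / total))
        sampled.extend(rng.sample(group, min(quota, len(group))))

    # Rounding can overshoot the cap, so shuffle and trim to n.
    rng.shuffle(sampled)
    return sampled[:n]
```

For example, capping 80 "claim" and 20 "not_claim" records at 10 would keep 8 and 2 of them respectively under this scheme.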
To run the dataset preparation script, use the following command. Adjust the parameters as needed:
python3 bin/data_preparation/llama3_dataset_preparation.py \
--samples -1 \
--split "train" \
--intermediate_datasets_base "data/instruction_datasets" \
--shuffling "none" \
--dataset_directory "finetuning_datasets/testing_dataset"
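The four shuffling choices ("none", "by_task", "by_language", "fully") can be pictured with a small sketch. It assumes each record carries the "task" and "lang" fields and interprets "within each task" as keeping each group contiguous while randomizing the order inside it; that interpretation and the function name shuffle_records are assumptions, not the script's confirmed behavior.

```python
import random
from collections import defaultdict


def shuffle_records(records, strategy="none", seed=42):
    """Shuffle records by one of: none, by_task, by_language, fully.

    (Hypothetical helper illustrating one reading of the four strategies.)
    """
    rng = random.Random(seed)
    records = list(records)  # work on a copy

    if strategy == "none":
        return records
    if strategy == "fully":
        rng.shuffle(records)
        return records
    if strategy in ("by_task", "by_language"):
        key = "task" if strategy == "by_task" else "lang"
        groups = defaultdict(list)
        for rec in records:
            groups[rec[key]].append(rec)
        shuffled = []
        # Groups stay in first-seen order; only the order inside each group changes.
        for group in groups.values():
            rng.shuffle(group)
            shuffled.extend(group)
        return shuffled
    raise ValueError(f"unknown shuffling strategy: {strategy!r}")
```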
This is an example of how to run the training script in full-precision mode:
accelerate launch bin/model_training/parallel_fine_tuning_llama3_Full_precision.py \
--model_name "base_models/Meta-Llama-3.1-8B-Instruct" \
--max_seq_length 512 \
--quant_bits 4 \
--use_nested_quant False \
--batch_size 16 \
--grad_size 2 \
--epochs 1 \
--out_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/outputs" \
--save_steps 500 \
--train_set_dir "data/finetuning_datasets/shuffled_by_language_20k" \
--dev_set_dir "data/validation_data_500" \
--start_from_last_checkpoint False \
--lora_adapter_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/lora_adapter" \
--merged_model_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/merged_model"
This is an example of how to run the training script in quantized mode:
accelerate launch bin/model_training/parallel_fine_tuning_llama3_quantized.py \
--model_name "base_models/Meta-Llama-3.1-8B-Instruct" \
--max_seq_length 512 \
--quant_bits 4 \
--use_nested_quant False \
--batch_size 16 \
--grad_size 2 \
--epochs 1 \
--out_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/outputs" \
--save_steps 500 \
--train_set_dir "data/finetuning_datasets/shuffled_by_language_20k" \
--dev_set_dir "data/validation_data_500" \
--start_from_last_checkpoint False \
--lora_adapter_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/lora_adapter" \
--merged_model_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/merged_model"
To run inference for a specific language, you have to specify the intermediate folder that contains multiple datasets.
python bin/evaluation/inference.py \
--instructions-path support_data/instructions/instructions_gpt-4o_claude-3-5-sonnet_ar.json \
--intermediate-base-path data/intermediate_datasets_ar \
--results-folder-path "results/Test_results" \
--model-path "base_models/Meta-Llama-3.1-8B-Instruct" \
--samples -1 \
--device 0
To score results, run the following script:
python bin/evaluation/evaluate.py \
--experiment_dir results/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/ar \
--output_dir scores/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/ar
Each JSONL file in the dataset follows a structured format with the following fields:
- id: Unique identifier for each data entry.
- original_id: Identifier from the original dataset, if available.
- input: The original text that needs to be analyzed.
- output: The label assigned to the text after analysis.
- dataset: Name of the dataset the entry belongs to.
- task: The specific task type.
- lang: The language of the input text.
- instruction: A brief set of instructions describing how the text should be labeled.
- text: A formatted structure including instructions and response for the task in a conversation format between the system, user, and assistant, showing the decision process.
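A minimal sketch of loading such a file and checking that every entry carries the fields listed above. The names read_jsonl and REQUIRED_FIELDS are hypothetical and not part of the repository; treating all nine fields as mandatory is also an assumption.

```python
import json

# Field names taken from the list above; treating all of them as mandatory
# is an assumption made for this sketch.
REQUIRED_FIELDS = {
    "id", "original_id", "input", "output",
    "dataset", "task", "lang", "instruction", "text",
}


def read_jsonl(path):
    """Yield one dict per non-empty line, raising if a field is missing."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_FIELDS - set(record)
            if missing:
                raise ValueError(
                    f"{path}:{line_number} is missing fields: {sorted(missing)}"
                )
            yield record
```

Opening the file with an explicit UTF-8 encoding matters here because the input field can contain Arabic or Hindi text.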
Example entry in JSONL file:
{
"id": "d1662e29-11cf-45cb-bf89-fa5cd993bc78",
"original_id": "nan",
"input": "الدفاع الجوي السوري يتصدى لهجوم صاروخي على قاعدة جوية في حمص",
"output": "not_claim",
"dataset": "ans-claim",
"task": "Claim detection",
"lang": "ar",
"instruction": "Analyze the given text and label it as 'claim' if it includes a factual statement that can be verified, or 'not_claim' if it's not a checkable assertion. Return only the label without any explanation, justification or additional text.",
"text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a social media expert providing accurate analysis and insights.<|eot_id|><|start_header_id|>user<|end_header_id|>Analyze the given text and label it as 'claim' if it includes a factual statement that can be verified, or 'not_claim' if it's not a checkable assertion. Return only the label without any explanation, justification or additional text.\ninput: الدفاع الجوي السوري يتصدى لهجوم صاروخي على قاعدة جوية في حمص\nlabel: <|eot_id|><|start_header_id|>assistant<|end_header_id|>not_claim<|eot_id|><|end_of_text|>"
}
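The text field in the example follows a fixed layout of special tokens around the system prompt, the instruction, the input, and the label. The sketch below reassembles that layout from the other fields; the function name build_text is hypothetical, and stopping after the assistant header when no output is given is an assumption about how a prompt would be formed at inference time.

```python
SYSTEM_PROMPT = "You are a social media expert providing accurate analysis and insights."


def build_text(instruction, input_text, output=None, system_prompt=SYSTEM_PROMPT):
    """Assemble the conversation string in the layout of the example entry.

    With output=None the string ends after the assistant header (assumed
    inference-time form); otherwise the label and closing tokens are appended.
    """
    prompt = (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>{system_prompt}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>{instruction}\n"
        f"input: {input_text}\n"
        "label: <|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>"
    )
    if output is None:
        return prompt
    return f"{prompt}{output}<|eot_id|><|end_of_text|>"
```

Calling build_text with the instruction, input, and output values of the example entry should reproduce its text value.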
This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Please cite our paper when using this model:
@article{kmainasi2024llamalensspecializedmultilingualllm,
title={LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content},
author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Maram Hasanain and Sahinur Rahman Laskar and Naeemul Hassan and Firoj Alam},
year={2024},
journal={arXiv preprint arXiv:2410.15308},
volume={},
number={},
pages={},
url={https://arxiv.org/abs/2410.15308},
eprint={2410.15308},
archivePrefix={arXiv},
primaryClass={cs.CL}
}