A GPT-based data generator: feed in your private data, and LLMs will create high-quality question-and-answer pairs for fine-tuning.

haohww/gpt-doc2data

gpt-doc2data

[ English | 中文 ]

Join our WeChat group chat!

When collecting data for LLM fine-tuning, extracting well-formatted training data from existing documents can be costly. gpt-doc2data is a GPT-based data generator that addresses this: you supply your private documents, and a large language model creates high-quality question-and-answer pairs from them. You can then use these pairs to fine-tune or prompt-tune your own model.

Example

Below are 10 QA pairs generated by feeding the paper "LoRA: Low-Rank Adaptation of Large Language Models" into the system:

[
    {
        "question": "What is LoRA?",
        "answer": "LoRA is a method that freezes pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture to reduce the number of trainable parameters."
    },
    {
        "question": "What are the advantages of LoRA?",
        "answer": "The advantages of LoRA include reduced number of trainable parameters, lower GPU memory requirement, higher training throughput, no additional inference latency, and the ability to switch tasks efficiently."
    },
    {
        "question": "How does LoRA compare to fine-tuning in terms of model quality?",
        "answer": "LoRA performs on-par or better than fine-tuning in model quality on various language models, despite having fewer trainable parameters and higher training throughput."
    },
    {
        "question": "Which weight matrices in the Transformer architecture should be adapted with LoRA?",
        "answer": "LoRA should be applied to the weight matrices in the self-attention module, specifically Wq and Wv, for optimal performance."
    },
    {
        "question": "What is the optimal rank for LoRA?",
        "answer": "A low rank, such as 1 or 2, is sufficient for LoRA to achieve competitive performance on downstream tasks."
    },
    {
        "question": "What is the advantage of few-shot learning?",
        "answer": "Few-shot learning is advantageous when we only have a handful of training samples."
    },
    {
        "question": "What is the difference between adapter layers and LoRA?",
        "answer": "Adapter layers are computed in addition to the base model, introducing additional latency, while LoRA is added in a parallel manner."
    },
    {
        "question": "What is the GLUE Benchmark?",
        "answer": "The GLUE Benchmark is a collection of natural language understanding tasks used to evaluate NLU models."
    },
    {
        "question": "What is the purpose of the E2E NLG Challenge dataset?",
        "answer": "The E2E NLG Challenge dataset is used for training end-to-end, data-driven natural language generation systems."
    },
    {
        "question": "What is the amplification factor for task-specific directions in LoRA?",
        "answer": "The amplification factor for task-specific directions in LoRA is around 20."
    }
]
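
Since the output is a plain JSON list of objects with "question" and "answer" keys, it can be reshaped for a fine-tuning pipeline in a few lines. The following is a minimal sketch, not part of gpt-doc2data: the function name `qa_pairs_to_jsonl` and the target keys "prompt" and "completion" are assumptions chosen for illustration, and the sample answer is abbreviated.

```python
import json


def qa_pairs_to_jsonl(pairs):
    """Turn a list of {"question", "answer"} dicts into JSONL text, one record per line."""
    lines = []
    for pair in pairs:
        # "prompt" and "completion" are illustrative key names (an assumption);
        # use whatever keys your fine-tuning framework expects.
        record = {"prompt": pair["question"], "completion": pair["answer"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# A one-item sample in the same shape as the generated output above.
sample = json.loads("""
[
    {"question": "What is LoRA?", "answer": "A method that freezes pre-trained model weights."}
]
""")
print(qa_pairs_to_jsonl(sample))
```

`ensure_ascii=False` keeps non-ASCII text (for example Chinese documents) readable in the resulting file.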

Getting Started

Install requirements

git clone https://github.com/codewangg/gpt-doc2data.git
cd gpt-doc2data
pip install -r requirements.txt

Prepare your documents

Currently supported file formats:

  • PDF
  • Markdown
  • TXT

Put all files in the gpt-doc2data/data directory.
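
Before running the generator, it can help to check which files in the data directory match the supported formats. This is a small sketch under stated assumptions: the extensions `.pdf`, `.md`, and `.txt` are assumed to correspond to the three formats listed above, only the top level of the directory is scanned, and `split_by_support` is a hypothetical helper, not a function from gpt-doc2data.

```python
from pathlib import Path

# Assumed file extensions for the PDF, Markdown, and TXT formats.
SUPPORTED_SUFFIXES = {".pdf", ".md", ".txt"}


def split_by_support(data_dir):
    """Return (supported, skipped) lists of files found directly in data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        return [], []
    files = sorted(p for p in root.iterdir() if p.is_file())
    supported = [p for p in files if p.suffix.lower() in SUPPORTED_SUFFIXES]
    skipped = [p for p in files if p.suffix.lower() not in SUPPORTED_SUFFIXES]
    return supported, skipped


supported, skipped = split_by_support("gpt-doc2data/data")
print("supported:", [p.name for p in supported])
print("skipped:", [p.name for p in skipped])
```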

config.yaml

Rename example_config.yaml to config.yaml, adjust it to suit your requirements, and provide your own OpenAI API key.

Generate QA pairs

python3 gpt-doc2data/gpt-doc2data.py

TODO

Low-hanging Fruits

  • Add an "id" field in the output JSON.
  • Improve the README.md for better understanding and usage.
  • Add a Chinese README page.
  • Clean up the codebase and add comments and type annotations (over 50% of the code is currently LLM-generated).
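
As an illustration of the first item, an "id" field could be added as a post-processing step on the generated list. This is only a sketch of one possible approach, assuming sequential integer IDs are acceptable; `add_ids` is a hypothetical helper and not existing project code.

```python
def add_ids(pairs, start=1):
    """Return new dicts with a sequential integer "id" as the first key.

    The input list is left unmodified. If a pair already has an "id" key,
    its existing value is kept.
    """
    return [{"id": i, **pair} for i, pair in enumerate(pairs, start=start)]


pairs = [
    {"question": "What is LoRA?", "answer": "..."},
    {"question": "What is the GLUE Benchmark?", "answer": "..."},
]
for item in add_ids(pairs):
    print(item["id"], item["question"])
```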

Medium-hanging Fruits

  • Improve the method for estimating the token count of the generated QA pairs, as the current approach may waste tokens on each API call.
  • Add support for configuring the key names in the output JSON.
  • Add a rate limiter to avoid overloading the OpenAI API.
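
The rate-limiter item could be approached in several ways. One simple option is to enforce a minimum interval between consecutive requests. The class below is a sketch of that idea under stated assumptions: `MinIntervalLimiter` and its parameters are hypothetical names, the limit of 600 calls per minute is an arbitrary example value, and the actual API call is left as a comment.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, max_calls_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_calls_per_minute
        self._clock = clock  # injectable for testing
        self._sleep = sleep  # injectable for testing
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Example value only: 600 calls per minute gives a 0.1 s minimum gap.
limiter = MinIntervalLimiter(max_calls_per_minute=600)
for chunk in ["chunk 1", "chunk 2", "chunk 3"]:
    limiter.wait()
    # The real API request for this chunk would go here.
    print("sending", chunk)
```

A minimum-interval limiter is easy to reason about but does not allow bursts; a token-bucket design would be an alternative if short bursts are acceptable.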

High-hanging Fruits

  • Integrate the tool with locally or privately served open-source models to reduce the cost of high-throughput use of the OpenAI API.
  • Extend support to more file types, such as audio and video, as information sources.
  • Broaden the tool's capabilities to generate other output formats for fine-tuning, not just QA pairs.
  • Implement a human judge mechanism to ensure high-quality data generation when needed.
