A Python package that lets you create and train your own large language models (LLMs) from scratch on custom datasets. It provides a simplified approach to building, training, and deploying language models tailored to specific domains or use cases.
createllm allows you to:
- Train custom language models on your specific text data
- Create domain-specific LLMs for specialized applications
- Build and experiment with different model architectures
- Deploy trained models for text generation tasks
- 🔨 Build LLMs from scratch using your own text data
- 🚀 Multi-threaded training for faster model development
- 📊 Real-time training progress tracking
- 🎛️ Configurable model architecture
- 💾 Easy model saving and loading
- 🎯 Custom text generation capabilities
- 📈 Built-in performance monitoring
pip install torch torchvision tqdm dill
Place your training text in a file. The model learns from this text to generate similar content.
my_training_data.txt
├── Your custom text
├── Can be articles
├── Documentation
└── Any text content you want the model to learn from
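If your source material is spread across several documents, a minimal sketch like the one below (file names are placeholders) can merge them, with light cleanup, into the single file the trainer reads:

from pathlib import Path

# Hypothetical source files -- replace with your own documents.
sources = ["articles.txt", "docs.txt", "notes.txt"]

chunks = []
for name in sources:
    text = Path(name).read_text(encoding="utf-8")
    # Light cleanup: normalize line endings and strip trailing whitespace.
    chunks.append("\n".join(line.rstrip() for line in text.splitlines()))

# Write everything into the single file used for training.
Path("my_training_data.txt").write_text("\n\n".join(chunks), encoding="utf-8")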
from createllm import ModelConfig, GPTTrainer, TextFileProcessor
# Initialize model configuration
config = ModelConfig(
    vocab_size=None,   # Will be automatically set based on your data
    n_embd=384,        # Embedding dimension
    block_size=256,    # Context window size
    n_layer=4,         # Number of transformer layers
    n_head=4           # Number of attention heads
)
# Create trainer instance
trainer = GPTTrainer(
    text_file="path/to/my_training_data.txt",
    learning_rate=3e-4,
    batch_size=64,
    max_iters=5000,
    eval_interval=500,
    saved_path="path/to/save/model"
)
# Start training
trainer.trainer() # This will automatically process text and train the model
from createllm import LLMModel
# Load your trained model
model = LLMModel("path/to/saved/model")
# Generate text
generated_text = model.generate("Your prompt text")
print(generated_text)
- Domain-Specific Documentation Generator
# Train on technical documentation
trainer = GPTTrainer(
    text_file="technical_docs.txt",
    saved_path="tech_docs_model"
)
trainer.trainer()
- Custom Writing Style Model
# Train on specific author's works
trainer = GPTTrainer(
    text_file="author_works.txt",
    saved_path="author_style_model"
)
trainer.trainer()
- Specialized Content Generator
# Train on specific content type
trainer = GPTTrainer(
    text_file="specialized_content.txt",
    saved_path="content_model"
)
trainer.trainer()
Customize your model architecture based on your needs:
config = ModelConfig(
    n_embd=384,      # Larger for more complex patterns
    block_size=256,  # Larger for longer context
    n_layer=8,       # More layers for deeper understanding
    n_head=8,        # More heads for better pattern recognition
    dropout=0.2      # Adjust to prevent overfitting
)
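As a rough sizing guide, you can estimate the parameter count implied by a configuration. The sketch below uses the standard GPT block layout (roughly 12·n_embd² weights per layer plus the embeddings) and an assumed character-level vocabulary of about 100 tokens, so treat the result as a ballpark figure rather than an exact count:

def approx_param_count(vocab_size, n_embd, block_size, n_layer):
    # Token and position embeddings.
    embeddings = vocab_size * n_embd + block_size * n_embd
    # Each transformer block: ~4*n_embd^2 for attention, ~8*n_embd^2 for the MLP.
    per_layer = 12 * n_embd * n_embd
    return embeddings + n_layer * per_layer

# The config above with an assumed vocabulary of 100 characters:
print(approx_param_count(vocab_size=100, n_embd=384, block_size=256, n_layer=8))
# ~14.3M parameters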
- Data Quality
  - Clean your training data
  - Remove irrelevant content
  - Ensure consistent formatting
- Resource Management (a GPU memory check sketch follows this list)
  trainer = GPTTrainer(
      batch_size=32,       # Reduce if running out of memory
      max_iters=5000,      # Increase for better learning
      eval_interval=500    # Monitor training progress
  )
- Model Size vs Performance
  - Smaller models (n_layer=4, n_head=4): Faster training, less complex patterns
  - Larger models (n_layer=8+, n_head=8+): Better understanding, more resource intensive
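For the Resource Management point, a quick check of available GPU memory before training can help you choose a batch size. This is plain PyTorch and independent of createllm; the 8 GB threshold is only a heuristic:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, {total_gb:.1f} GB")
    # Heuristic: with less than ~8 GB, start with batch_size=32 or lower.
else:
    print("No GPU detected; training will run on the CPU and be much slower.")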
The training process provides real-time feedback:
step 0: train loss 4.1675, val loss 4.1681
step 500: train loss 2.4721, val loss 2.4759
step 1000: train loss 1.9842, val loss 1.9873
step 1500: train loss 1.1422, val loss 1.1422
...
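If you redirect this console output to a file, a short script can track the gap between training and validation loss (a widening gap is a sign of overfitting). The sketch below assumes the line format shown above and a hypothetical train.log file:

import re

pattern = re.compile(r"step (\d+): train loss ([\d.]+), val loss ([\d.]+)")

with open("train.log") as f:  # hypothetical file capturing the console output
    matches = [pattern.match(line) for line in f]

for m in filter(None, matches):
    step, train_loss, val_loss = int(m[1]), float(m[2]), float(m[3])
    gap = val_loss - train_loss
    flag = "  <- possible overfitting" if gap > 0.5 else ""
    print(f"step {step}: gap {gap:.3f}{flag}")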
saved_model/
├── model.pt # Model weights
├── encoder.pickle # Text encoder
├── decoder.pickle # Text decoder
└── config.json # Model configuration
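LLMModel handles loading for you, but the saved artifacts can also be inspected directly. The configuration is plain JSON; the encoder and decoder are assumed here to be ordinary pickles readable with dill (which is listed as a dependency). Paths follow the layout shown above:

import json
import dill

model_dir = "saved_model"

# Model configuration is stored as JSON.
with open(f"{model_dir}/config.json") as f:
    print(json.load(f))

# Text encoder/decoder mappings, assumed to be standard pickle files.
with open(f"{model_dir}/encoder.pickle", "rb") as f:
    encoder = dill.load(f)
with open(f"{model_dir}/decoder.pickle", "rb") as f:
    decoder = dill.load(f)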
- Training requires significant computational resources
- Model quality depends on training data quality
- Larger models require more training time and resources
Contributions are welcome! Please feel free to submit pull requests.
For issues and questions, please open an issue in the GitHub repository.
This project is licensed under the MIT License.
Based on the GPT architecture with modifications for custom training and ease of use.