LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4 - Predibase #645
Related issues

- #505: LoRAX: Dynamic loading and optimized inference of LoRA adapter models (similarity score: 0.91)
  [LoRAX Docs](https://predibase.github.io/lorax/?h=cpu#features): LoRAX (LoRA eXchange) is a multi-LoRA inference server that scales to thousands of fine-tuned LLMs. It allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.
  Suggested labels: `{ "label-name": "LoRA Framework", "description": "A powerful framework for serving fine-tuned models on a single GPU efficiently.", "repo": "llm-inference-engines", "confidence": 98.7 }`
- #636: S-LoRA: Serving Thousands of Models From One GPU for Fun and Profit - OpenPipe (similarity score: 0.9)
  [S-LoRA blog post](https://openpipe.ai/blog/s-lora): Fine-tuning many task-specific models leaves you with the problem of serving them all efficiently. Spinning up a dedicated GPU for each model is a non-starter because it leads to low GPU utilization, which is an existential issue given how expensive GPU time is ($2+/hr for an A100).
  Suggested labels: `{'label-name': 'GPU-Optimization', 'label-description': 'Optimizing GPU resource utilization for running multiple models efficiently on a single GPU.', 'gh-repo': 'openpipe/openpipe-ai', 'confidence': 54.2}`
- #494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models (similarity score: 0.87)
  [horseee/Awesome-Efficient-LLM](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration): A curated list of work on efficient large language models, including inference acceleration techniques. Papers can be added or updated via pull request.
  Suggested labels: `{ "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }`
- #628: LLaVA/README.md at main · haotian-liu/LLaVA (similarity score: 0.87)
  [LLaVA README](https://github.com/haotian-liu/LLaVA/blob/main/README.md?plain=1): 🌋 LLaVA (Large Language and Vision Assistant), visual instruction tuning towards large language and vision models with GPT-4 level capabilities. The project's datasets and checkpoints are subject to their original licenses (e.g. the OpenAI Terms of Use for the dataset and the Llama community license for LLaMA-2 and Vicuna-v1.5 checkpoints), and the project imposes no additional constraints beyond them.
- #174: SparseLLM/ReluLLaMA-7B · PowerInfer - faster CPU inference (similarity score: 0.86)
  [SparseLLM/ReluLLaMA-7B · Hugging Face](https://huggingface.co/SparseLLM/ReluLLaMA-7B): Sparse computation is increasingly recognized as an important direction for improving the computational efficiency of LLMs. Recent studies (Zhang et al., 2021; Liu et al., 2023; Mirzadeh et al., 2023) show that LLMs with ReLU activations inherently exhibit properties conducive to sparse computation, akin to MoE's selective activation. Existing models are converted to ReLU-activated versions through fine-tuning and open-sourced to promote the development of sparse LLMs.
- #311: Introduction | Mistral AI Large Language Models (similarity score: 0.86)
  [Introduction | Mistral AI Large Language Models](https://docs.mistral.ai/): Mistral AI provides pay-as-you-go API access to its latest models (chat completions and embeddings endpoints) as well as raw model weights for self-deployment, on cloud or on premise, using TensorRT-LLM or vLLM.
  Suggested labels: `{ "key": "llm-api", "value": "Accessing Large Language Models through the Mistral AI API" }`
LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4 - Predibase
DESCRIPTION:
TL;DR: We’re excited to release LoRA Land, a collection of 25 fine-tuned Mistral-7b models that consistently outperform base models by 70% and GPT-4 by 4-15%, depending on the task. LoRA Land’s 25 task-specialized large language models (LLMs) were all fine-tuned with Predibase for less than $8.00 each on average and are all served from a single A100 GPU using LoRAX, our open-source framework that allows users to serve hundreds of adapter-based fine-tuned models on a single GPU. This collection of specialized fine-tuned models, all trained with the same base model, offers a blueprint for teams seeking to efficiently and cost-effectively deploy highly performant AI systems.
Join our webinar on February 29th to learn more!
LLM Benchmarks: 25 fine-tuned Mistral-7b adapters that outperform GPT-4.
The Need for Efficient Fine-Tuning and Serving
With the continuous growth in the number of parameters of transformer-based pretrained language models (PLMs) and the emergence of large language models (LLMs) with billions of parameters, it has become increasingly challenging to adapt them to specific downstream tasks, especially in environments with limited computational resources or budgets. Parameter Efficient Fine-Tuning (PEFT) and Quantized Low Rank Adaptation (QLoRA) offer an effective solution by reducing the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning.
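As a rough illustration of what adapter-based fine-tuning in the QLoRA style involves, the sketch below uses the Hugging Face transformers, bitsandbytes, and peft libraries. The base model ID is the public Mistral-7B checkpoint, and the LoRA hyperparameters are illustrative assumptions rather than the configuration Predibase used for LoRA Land:

```python
# Minimal QLoRA-style setup sketch (assumptions: public Mistral-7B checkpoint,
# illustrative LoRA hyperparameters; not Predibase's actual configuration).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model_id = "mistralai/Mistral-7B-v0.1"

# Load the base model in 4-bit (the "Q" in QLoRA) to cut memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach low-rank adapter matrices to the attention projections; only these
# small matrices are trained, so the fine-tune touches a tiny fraction of weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```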
Predibase has incorporated these best practices into its fine-tuning platform and, to demonstrate the accessibility and affordability of adapter-based fine-tuning of open-source LLMs, has fine-tuned 25 models for less than $8 each on average in terms of GPU costs.
Fine-tuned LLMs have historically also been very expensive to put into production and serve, requiring dedicated GPU resources for each fine-tuned model. For teams that plan on deploying multiple fine-tuned models to address a range of use cases, these GPU expenses can often be a bottleneck for innovation. LoRAX, the open-source platform for serving fine-tuned LLMs developed by Predibase, enables teams to deploy hundreds of fine-tuned LLMs for the cost of one from a single GPU.
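To make the serving side concrete, the sketch below shows how per-request adapter routing against a LoRAX endpoint might look. The endpoint URL and adapter IDs are placeholder assumptions, and the request/response shape follows the LoRAX documentation linked in #505 rather than anything stated in this post:

```python
# Sketch of multi-adapter inference against a LoRAX deployment (assumes a LoRAX
# server is already running at http://127.0.0.1:8080 and that the adapter IDs
# below are placeholders for adapters the server can load).
import requests

LORAX_URL = "http://127.0.0.1:8080/generate"

def generate(prompt: str, adapter_id: str | None = None, max_new_tokens: int = 64) -> str:
    params = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # The adapter is chosen per request; the shared base model stays
        # resident on the GPU while adapters are loaded dynamically.
        params["adapter_id"] = adapter_id
    resp = requests.post(
        LORAX_URL,
        json={"inputs": prompt, "parameters": params},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Two task-specialized adapters served from the same base model and GPU.
print(generate("Summarize: ...", adapter_id="my-org/mistral-7b-summarization-lora"))
print(generate("Classify the sentiment: ...", adapter_id="my-org/mistral-7b-sentiment-lora"))
```

Because the adapter is selected per request, many task-specialized adapters can share one resident base model, which is what keeps the marginal serving cost of each additional fine-tuned model low.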
URL: [LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4](https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4)
Suggested labels
`{'label-name': 'adapter-based-fine-tuning', 'label-description': 'Efficient approach to fine-tuning large language models using adapters', 'gh-repo': 'https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4', 'confidence': 64.54}`