Sharding Large Language Models for loading them efficiently in lesser RAM

SharathHebbar/Model-Sharding

Model Sharding

Hugging Face

Introduction

Large Language Models (LLMs) represent a significant advancement in artificial intelligence and natural language processing.

Models such as OpenAI's GPT (Generative Pre-trained Transformer) series, Google's Gemini, PaLM, T5, and many such open-source models have achieved remarkable capabilities in understanding and generating human-like text. However, as these models grow larger to improve performance, they also pose challenges in terms of scalability, resource requirements, and ethical considerations.

A major practical challenge is simply using such models. Even before running inference in Colab, a Kaggle notebook, or on a local machine with limited RAM, just loading a model of this size requires a large amount of memory, which is often not feasible.
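To make the memory requirement concrete, here is a rough back-of-the-envelope sketch. It assumes a round figure of 7 billion parameters (an approximation, not an exact count for any particular checkpoint) and counts only the weights, ignoring activations and framework overhead.

```python
# Approximate RAM needed just to hold the weights of a model.
# Assumption: ~7e9 parameters (rounded); real checkpoints differ slightly.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_gib(n_params: float, dtype: str) -> float:
    """Return the size of the weights in GiB for the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


n_params = 7_000_000_000
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_gib(n_params, dtype):.1f} GiB")
```

Under these assumptions the weights alone take roughly 13 GiB in float16, and about twice that in float32.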

One solution is model sharding, which splits a large checkpoint into smaller chunks (shards). The shards are loaded one after another, so peak memory use while loading is lower, and in the measurements below loading also takes less time.
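The idea of splitting a checkpoint into chunks can be sketched in a few lines. This is a simplified illustration, not the exact algorithm of any library: it walks the tensors in order and starts a new shard whenever adding the next tensor would exceed a size limit. The tensor names, sizes, and the file-name pattern are hypothetical.

```python
def plan_shards(tensor_sizes: dict, max_shard_bytes: int) -> list:
    """Greedily group tensor names into shards of at most max_shard_bytes.

    A single tensor larger than the limit still gets a shard of its own.
    """
    shards, current, current_size = [], [], 0
    for name, size in tensor_sizes.items():
        # Close the current shard if this tensor would push it over the limit.
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


def build_index(shards: list) -> dict:
    """Map each tensor name to the (hypothetical) shard file that holds it."""
    total = len(shards)
    return {
        name: f"model-{i:05d}-of-{total:05d}.safetensors"
        for i, shard in enumerate(shards, start=1)
        for name in shard
    }


# Toy sizes in arbitrary units, purely for illustration.
sizes = {"embed": 4, "layer0": 3, "layer1": 3, "layer2": 3, "head": 4}
shards = plan_shards(sizes, max_shard_bytes=7)
print(shards)
print(build_index(shards))
```

A loader can then read one shard file at a time, copy its tensors into the model, and release it before opening the next one.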

Here we discuss model sharding using the open-source LLM Mistral 7B, which is freely hosted on the Hugging Face Hub.
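The source does not show how the sharded checkpoint was produced. One common way, sketched here as an assumption rather than the author's actual procedure, is to load the model once and re-save it with the `max_shard_size` argument of `save_pretrained` from the `transformers` library. The helper names `save_sharded` and `shard_mistral`, the output directory, and the `"2GB"` limit are illustrative choices.

```python
from pathlib import Path


def save_sharded(model, tokenizer, out_dir, max_shard_size="2GB"):
    """Save model weights as shards no larger than max_shard_size.

    Returns the sorted names of the weight files written to out_dir.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # save_pretrained splits the weights into several files when they
    # exceed max_shard_size and writes an index file alongside them.
    model.save_pretrained(out, max_shard_size=max_shard_size)
    tokenizer.save_pretrained(out)
    return sorted(
        p.name for p in out.iterdir() if p.suffix in {".safetensors", ".bin"}
    )


def shard_mistral(out_dir="mistral-7b-sharded"):
    """Illustrative driver: needs torch, transformers, and enough RAM
    to load the full model once. Not called automatically."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, low_cpu_mem_usage=True, torch_dtype=torch.float16
    )
    return save_sharded(model, tokenizer, out_dir)
```

Calling `shard_mistral()` on a machine with enough memory would write the shard files locally; the resulting directory can then be loaded with `from_pretrained` like any other checkpoint.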

Lesser RAM

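%%time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Original checkpoint published by mistralai
model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,     # keep peak CPU RAM low while loading
    torch_dtype=torch.float16,  # load the weights in half precision
)

Output:
CPU times: user 36.8 s, sys: 48.5 s, total: 1min 25s
Wall time: 3min 30s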

Before Sharding

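%%time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sharded copy of the same model
model_name = "Sharathhebbar24/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,     # keep peak CPU RAM low while loading
    torch_dtype=torch.float16,  # load the weights in half precision
)

Output:
CPU times: user 23 s, sys: 48.7 s, total: 1min 11s
Wall time: 1min 49s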

After Sharding
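As a quick check on the two wall times reported above, the following snippet works out the relative improvement. It only restates the numbers from the cell outputs. Both are single runs, and the timings may include download or caching effects, so treat the ratio as indicative rather than a benchmark.

```python
# Wall times copied from the two cell outputs, converted to seconds.
before_s = 3 * 60 + 30  # original checkpoint: 3min 30s
after_s = 1 * 60 + 49   # sharded checkpoint: 1min 49s

speedup = before_s / after_s
reduction_pct = 100 * (before_s - after_s) / before_s

print(f"{speedup:.2f}x faster, {reduction_pct:.0f}% less wall time")
# prints "1.93x faster, 48% less wall time"
```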
