I'm interested in finding a way to load model weights onto the GPUs while the weight data is still being streamed/downloaded from a storage service like S3. The goal is to avoid waiting for the download to finish and only then beginning to load the weights (basically, starting llama.cpp), since both steps take considerable time on cloud VMs.
I searched around but haven't really found anything that addresses this particular issue.
What would be the best way to accomplish this? Can I write some code that lobs the data to llama.cpp as it comes in from the stream, or is there a much simpler way?
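
A minimal sketch of the kind of overlap described above, in Python. It only illustrates the general producer/consumer pattern: one thread reads chunks from a stream while the caller processes chunks that have already arrived. The names `pipeline` and `consume` are hypothetical, an in-memory `io.BytesIO` stands in for the S3 download, and no llama.cpp API is used. Whether llama.cpp can accept weights incrementally like this is exactly the open question.

```python
import io
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def pipeline(src, consume, chunk_size=1 << 20, max_buffered=8):
    """Read `src` in chunks on a background thread while `consume`
    handles each chunk on the calling thread.

    `src` is any object with a blocking .read(n) method (hypothetically,
    an HTTP/S3 response body). `consume` is a stand-in for whatever would
    hand the bytes to the loader. Returns the total number of bytes consumed.
    """
    # Bounded queue: the download pauses if the consumer falls behind,
    # so at most max_buffered chunks are held in memory.
    q = queue.Queue(maxsize=max_buffered)

    def producer():
        try:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:  # empty read means end of stream
                    break
                q.put(chunk)
        finally:
            q.put(_DONE)  # always unblock the consumer

    t = threading.Thread(target=producer, daemon=True)
    t.start()

    total = 0
    while True:
        item = q.get()
        if item is _DONE:
            break
        consume(item)
        total += len(item)
    t.join()
    return total


if __name__ == "__main__":
    fake_weights = bytes(range(256)) * 4096  # 1 MiB of stand-in data
    received = bytearray()
    n = pipeline(io.BytesIO(fake_weights), received.extend, chunk_size=64 * 1024)
    print(n, bytes(received) == fake_weights)
```

The bounded queue is the main design choice here: it lets the download and the consumer run concurrently without buffering the whole file in memory. A real version would also need to propagate download errors to the consumer, which this sketch does not do.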