Deployed 99b939a to master with MkDocs 1.6.0 and mike 2.1.3
github-actions[bot] committed Aug 18, 2024
1 parent a85f05a commit d3cde39
Showing 4 changed files with 230 additions and 184 deletions.
@@ -1331,9 +1331,25 @@

# Deploy the Llama3 model for text_generation task with Hugging Face LLM Serving Runtime
<h2 id="serve-the-hugging-face-llm-model-using-vllm-backend">Serve the Hugging Face LLM model using vLLM backend<a class="headerlink" href="#serve-the-hugging-face-llm-model-using-vllm-backend" title="Permanent link"></a></h2>
<p>KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster time-to-first-token(TTFT) and higher token generation throughput than the Hugging Face API. vLLM is implemented with common inference optimization techniques, such as paged attention, continuous batching and an optimized CUDA kernel.
If the model is not supported by vLLM, KServe falls back to HuggingFace backend as a failsafe.</p>
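If you need to pin the backend rather than rely on this default-and-fallback behavior, the runtime's `--backend` argument can be set explicitly. A minimal sketch of the predictor args; treating `--backend=vllm` as a valid value is an assumption that mirrors the `--backend=huggingface` flag shown later on this page:

```yaml
# Sketch: select the serving backend explicitly instead of relying on the
# default vLLM-with-fallback behavior. The vllm value is an assumption that
# mirrors the --backend=huggingface flag used later on this page.
args:
  - --model_name=llama3
  - --backend=vllm
```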
<div class="tabbed-set tabbed-alternate" data-tabs="1:1"><input checked="checked" id="__tabbed_1_1" name="__tabbed_1" type="radio"><div class="tabbed-labels"><label for="__tabbed_1_1">Yaml</label></div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>The Llama3 model requires huggingface hub token to download the model. You can set the token using <code>HF_TOKEN</code>
environment variable.</p>
</div>
<p>Create a secret with the Hugging Face token.</p>
<div class="tabbed-set tabbed-alternate" data-tabs="1:2"><input checked="checked" id="__tabbed_1_1" name="__tabbed_1" type="radio"><input id="__tabbed_1_2" name="__tabbed_1" type="radio"><div class="tabbed-labels"><label for="__tabbed_1_1">Yaml</label><label for="__tabbed_1_2">Yaml</label></div>
<div class="tabbed-content">
<div class="tabbed-block">
<div class="highlight"><pre><span></span><code><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">v1</span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Secret</span>
<span class="nt">metadata</span><span class="p">:</span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">hf-secret</span>
<span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Opaque</span><span class="w"> </span>
<span class="nt">stringData</span><span class="p">:</span>
<span class="w"> </span><span class="nt">HF_TOKEN</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">&lt;token&gt;</span>
</code></pre></div>
</div>
<div class="tabbed-block">
<div class="highlight"><pre><span></span><code><span class="l l-Scalar l-Scalar-Plain">kubectl apply -f - &lt;&lt;EOF</span>
<span class="l l-Scalar l-Scalar-Plain">apiVersion</span><span class="p p-Indicator">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">serving.kserve.io/v1beta1</span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">InferenceService</span>
Expand All @@ -1347,6 +1363,13 @@ <h2 id="serve-the-hugging-face-llm-model-using-vllm-backend">Serve the Hugging F
<span class="w"> </span><span class="nt">args</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--model_name=llama3</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--model_id=meta-llama/meta-llama-3-8b-instruct</span>
<span class="w"> </span><span class="nt">env</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HF_TOKEN</span>
<span class="w"> </span><span class="nt">valueFrom</span><span class="p">:</span>
<span class="w"> </span><span class="nt">secretKeyRef</span><span class="p">:</span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">hf-secret</span>
<span class="w"> </span><span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HF_TOKEN</span>
<span class="w"> </span><span class="nt">optional</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">false</span>
<span class="w"> </span><span class="nt">resources</span><span class="p">:</span>
<span class="w"> </span><span class="nt">limits</span><span class="p">:</span>
<span class="w"> </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="s">"6"</span>
Expand All @@ -1360,7 +1383,7 @@ <h2 id="serve-the-hugging-face-llm-model-using-vllm-backend">Serve the Hugging F
</code></pre></div>
</div>
</div>
</input></div>
</input></input></div>
<h3 id="check-inferenceservice-status">Check <code>InferenceService</code> status.<a class="headerlink" href="#check-inferenceservice-status" title="Permanent link"></a></h3>
<div class="highlight"><pre><span></span><code>kubectl<span class="w"> </span>get<span class="w"> </span>inferenceservices<span class="w"> </span>huggingface-llama3
</code></pre></div>
@@ -1462,9 +1485,25 @@ (in section "Sample OpenAI Chat Completions Streaming Request")
<h2 id="serve-the-hugging-face-llm-model-using-huggingface-backend">Serve the Hugging Face LLM model using HuggingFace Backend<a class="headerlink" href="#serve-the-hugging-face-llm-model-using-huggingface-backend" title="Permanent link"></a></h2>
<p>You can use <code>--backend=huggingface</code> argument to perform the inference using Hugging Face API. KServe Hugging Face backend runtime also
supports the OpenAI <code>/v1/completions</code> and <code>/v1/chat/completions</code> endpoints for inference.</p>
<div class="tabbed-set tabbed-alternate" data-tabs="2:1"><input checked="checked" id="__tabbed_2_1" name="__tabbed_2" type="radio"><div class="tabbed-labels"><label for="__tabbed_2_1">Yaml</label></div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>The Llama3 model requires huggingface hub token to download the model. You can set the token using <code>HF_TOKEN</code>
environment variable.</p>
</div>
<p>Create a secret with the Hugging Face token.</p>
<div class="tabbed-set tabbed-alternate" data-tabs="2:2"><input checked="checked" id="__tabbed_2_1" name="__tabbed_2" type="radio"><input id="__tabbed_2_2" name="__tabbed_2" type="radio"><div class="tabbed-labels"><label for="__tabbed_2_1">Yaml</label><label for="__tabbed_2_2">Yaml</label></div>
<div class="tabbed-content">
<div class="tabbed-block">
<div class="highlight"><pre><span></span><code><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">v1</span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Secret</span>
<span class="nt">metadata</span><span class="p">:</span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">hf-secret</span>
<span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Opaque</span><span class="w"> </span>
<span class="nt">stringData</span><span class="p">:</span>
<span class="w"> </span><span class="nt">HF_TOKEN</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">&lt;token&gt;</span>
</code></pre></div>
</div>
<div class="tabbed-block">
<div class="highlight"><pre><span></span><code><span class="l l-Scalar l-Scalar-Plain">kubectl apply -f - &lt;&lt;EOF</span>
<span class="l l-Scalar l-Scalar-Plain">apiVersion</span><span class="p p-Indicator">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">serving.kserve.io/v1beta1</span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">InferenceService</span>
Expand All @@ -1479,6 +1518,13 @@ <h2 id="serve-the-hugging-face-llm-model-using-huggingface-backend">Serve the Hu
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--model_name=llama3</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--model_id=meta-llama/meta-llama-3-8b-instruct</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--backend=huggingface</span>
<span class="w"> </span><span class="nt">env</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HF_TOKEN</span>
<span class="w"> </span><span class="nt">valueFrom</span><span class="p">:</span>
<span class="w"> </span><span class="nt">secretKeyRef</span><span class="p">:</span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">hf-secret</span>
<span class="w"> </span><span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HF_TOKEN</span>
<span class="w"> </span><span class="nt">optional</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">false</span>
<span class="w"> </span><span class="nt">resources</span><span class="p">:</span>
<span class="w"> </span><span class="nt">limits</span><span class="p">:</span>
<span class="w"> </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="s">"6"</span>
Expand All @@ -1492,7 +1538,7 @@ <h2 id="serve-the-hugging-face-llm-model-using-huggingface-backend">Serve the Hu
</code></pre></div>
</div>
</div>
</input></div>
</input></input></div>
<h3 id="check-inferenceservice-status_1">Check <code>InferenceService</code> status.<a class="headerlink" href="#check-inferenceservice-status_1" title="Permanent link"></a></h3>
<div class="highlight"><pre><span></span><code>kubectl<span class="w"> </span>get<span class="w"> </span>inferenceservices<span class="w"> </span>huggingface-llama3
</code></pre></div>
master/search/search_index.json: 2 changes (1 addition & 1 deletion)

Large diffs are not rendered by default.
