Implement the Phi 3 vision model (#351)
* Initial work on phi3v

* Add the image embedding layer

* Lints

* Implement the loader

* Add infrastructure for phi3 image processor

* Merge

* Merge

* Merge

* Merge

* Partially implement padding

* Implement the hd transform step

* Work on the image processor

* Clippy

* Complete the phi3v inputs processor

* Rename

* Merge

* Merge

* Rename to phi3v and fix deser

* Fix varbuilder

* Fix varbuilder

* Default for do convert rgb

* Some defaults

* Allow no processor config

* Setup debug flag

* Add phi3v

* Implement messages flattening

* Update

* Rewrite the pad, hd transform

* Clippy

* Detect num channels

* Fix reshape

* Fix global image channel dim

* Fix assert

* Fix dtype

* Fix gt

* Fix image id neg

* Fix dim0 of pixel values

* Fix dtype

* Check if model supports gemm

* Fix some shape errors

* Fix some shape errors

* Fix rank of slice_assign

* Fix image toks

* Properly downcase

* Fix response

* Fix response

* Allow no images in prompt

* Output correct hidden state

* Fix nonzero and add test

* Fix n image toks

* Add mistralrs_vision

* Typo

* Fix and add tests

* Fix indexing

* Fix test condition

* Fix unsqueeze

* Fix dtype for norm

* Update clip

* Clippy

* Run clip in f32

* Run in bf16

* Run in bf16 again

* Fix dtype

* Set toks to have correct context lens

* Set toks to have correct context lens

* Support multiple GGUF files (#379)

* Move to gguf module

* Add content abstraction for multiple gguf files

* Fix test

* Allow specifying and loading multiple gguf files

* Update docs and examples

* Print some info

* Merge

* Organize normal loading metadata (#381)

* Organize normal loading metadata

* Fix

* Bump version 0.1.13 -> 0.1.14 (#382)

* Patch incorrect unwrap and bump version (#383)

* Patch incorrect unwrap

* Bump version to 0.1.15

* More verbose logging during loading (#385)

* More verbose logging when loading

* More logging

* Refactor enabling debug logging (#387)

* Refactor enabling debug logging

* Fix reversed order

* Merge

* Merge

* Merge

* Use precise gelu

* Use correct kernel

* Debugging commit

* Add fused bias linear

* Finish merge

* Use fused layer in clip

* Save progress

* Remove debugs

* Update example

* Resize exact

* Update interpolate

* Fix batch dim

* Update test and transform

* It works

* Add some examples

* Allow more than one image

* Add support in python api

* Add to toml selector

* Update python api

* Overhaul readme and docs

* Update

* Export vision arch

* Export vision arch

* Export vision arch

* Fix max img dim

* Fix unwrap
EricLBuehler authored Jun 7, 2024
1 parent 44e8a22 commit 5a7ebb7
Showing 69 changed files with 4,106 additions and 820 deletions.
5 changes: 4 additions & 1 deletion Cargo.toml
@@ -5,6 +5,7 @@ members = [
"mistralrs-pyo3",
"mistralrs",
"mistralrs-bench",
"mistralrs-vision",
]
resolver = "2"

@@ -32,10 +33,12 @@ tracing = "0.1.40"
tracing-subscriber = { version = "0.3.18", features = ["env-filter"] }
futures = "0.3"
clap = { version = "4.5.1", features = ["derive"] }
pyo3 = { version = "0.21.0", features = ["full", "extension-module"] }
pyo3 = { version = "0.21.0", features = ["full", "extension-module", "either"] }
tokio = { version = "1.36.0", features = ["full", "rt-multi-thread"] }
once_cell = "1.19.0"
image = "0.25.1"
reqwest = { version = "0.12.4", features = ["blocking"] }
base64 = "0.22.1"

[profile.profiling]
inherits = "release"
122 changes: 83 additions & 39 deletions README.md
@@ -13,19 +13,47 @@ Blazingly fast LLM inference.

Mistral.rs is a fast LLM inference platform supporting inference on a variety of devices, quantization, and easy-to-use applications with an OpenAI-compatible HTTP server and Python bindings.

## Upcoming features
- More models: please submit requests [here](https://github.com/EricLBuehler/mistral.rs/issues/156).
- X-LoRA: Scalings `topk` and softmax `topk` ([#48](https://github.com/EricLBuehler/mistral.rs/issues/48)).
- Parallel linear layers (sharding) ([#50](https://github.com/EricLBuehler/mistral.rs/issues/50)).
- Vision models: Idefics 2 ([#309](https://github.com/EricLBuehler/mistral.rs/pull/309)).
Please submit requests for new models [here](https://github.com/EricLBuehler/mistral.rs/issues/156).

**Running the new Llama 3 model**
## Get started fast 🚀

`cargo run --release --features ... -- -i plain -m meta-llama/Meta-Llama-3-8B-Instruct -a llama`
1) [Install](#installation-and-build)

**Running the new Phi 3 model with 128K context window**
2) [Get models](#getting-models)

`cargo run --release --features ... -- -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3`
3) Deploy with our easy-to-use APIs
- [Python](examples/python)
- [Rust](mistralrs/examples)
- [OpenAI compatible HTTP server](examples/http.md)

## Quick examples
- 🦙 Run the Llama 3 model

*After following installation instructions*

```
./mistralrs_server -i plain -m meta-llama/Meta-Llama-3-8B-Instruct -a llama
```
- φ³ Run the Phi 3 model with 128K context window
*After following installation instructions*
```
./mistralrs_server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
```
- φ³📷 Run the Phi 3 vision model: [documentation and guide here](docs/PHI3V.md)
<img src="https://static.vecteezy.com/system/resources/previews/012/168/187/large_2x/beautiful-sunset-on-the-beach-with-palm-tree-for-travel-and-vacation-free-photo.JPG" alt="Sunset on a beach" width = "400" height = "267">
*After following installation instructions*
```
./mistralrs_server --port 1234 vision-plain -m microsoft/Phi-3-vision-128k-instruct -a phi3v
```
- Other models: [see supported models](#supported-models) and [how to run them](#run-with-the-cli)
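
For instance, once the Phi 3 vision server from the example above is running on port 1234, it can be queried from Python with the `openai` client. This is a minimal sketch, not the canonical recipe: the model name and the OpenAI-style vision message layout are assumptions; see [docs/PHI3V.md](docs/PHI3V.md) for the authoritative request format.

```python
# Minimal sketch (assumptions: model id "phi3v", OpenAI-style vision messages).
# Start the server first:
#   ./mistralrs_server --port 1234 vision-plain -m microsoft/Phi-3-vision-128k-instruct -a phi3v
import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="phi3v",  # assumed model id; use whatever id the server reports
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://static.vecteezy.com/system/resources/previews/012/168/187/large_2x/beautiful-sunset-on-the-beach-with-palm-tree-for-travel-and-vacation-free-photo.JPG"
                    },
                },
                {"type": "text", "text": "What is shown in this image?"},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
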
## Description
**Fast**:
@@ -69,19 +97,24 @@ https://github.com/EricLBuehler/mistral.rs/assets/65165915/3396abcd-8d44-4bf7-95
Please see [this section](#supported-models) for details on quantization and LoRA support.
## APIs and Integrations
**Rust Library API**
Rust multithreaded API for easy integration into any application.
<details>
<summary><b>Rust Crate</b></summary>
Rust multithreaded/async API for easy integration into any application.
- [Docs](https://ericlbuehler.github.io/mistral.rs/mistralrs/)
- [Examples](mistralrs/examples/)
- To install: Add `mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }`
**Python API**
</details>
<details>
<summary><b>Python API</b></summary>
Python API for mistral.rs.
- [Installation](mistralrs-pyo3/README.md)
- [Installation including PyPI](mistralrs-pyo3/README.md)
- [Docs](mistralrs-pyo3/API.md)
- [Example](examples/python/python_api.py)
- [Cookbook](examples/python/cookbook.ipynb)
@@ -113,18 +146,26 @@ print(res.choices[0].message.content)
print(res.usage)
```
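
A compact end-to-end sketch of the same flow is below. It is a sketch under assumptions: the constructor and parameter names are taken from the published examples and may differ between versions, so treat [mistralrs-pyo3/API.md](mistralrs-pyo3/API.md) and [examples/python/python_api.py](examples/python/python_api.py) as authoritative.

```python
# Sketch only: Runner, Which.GGUF, ChatCompletionRequest, and their keyword
# arguments are assumed from the example scripts; check mistralrs-pyo3/API.md
# if your installed version differs.
from mistralrs import ChatCompletionRequest, Runner, Which

runner = Runner(
    which=Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    )
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="mistral",
        messages=[{"role": "user", "content": "Tell me a story about the Rust type system."}],
        max_tokens=256,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)
```
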

**HTTP Server**
</details>

<details>
<summary><b>HTTP Server</b></summary>

OpenAI API compatible API server

- [API Docs](examples/http.md).
- [Running](README.md#run)
- [Example](examples/server/chat.py)

**Llama Index integration**
</details>

<details>
<summary><b>Llama Index integration</b></summary>

- Docs: https://docs.llamaindex.ai/en/stable/examples/llm/mistral_rs/

</details>

---

## Supported accelerators
@@ -149,13 +190,11 @@ Enabling features is done by passing `--features ...` to the build system. When
|A10 GPU, CUDA|78|78|[mistral-7b](TheBloke/Mistral-7B-Instruct-v0.1-GGUF)|4_K_M|
|Intel Xeon 8358 CPU, AVX|6|19|[mistral-7b](TheBloke/Mistral-7B-Instruct-v0.1-GGUF)|4_K_M|
|Raspberry Pi 5 (8GB), Neon|2|3|[mistral-7b](TheBloke/Mistral-7B-Instruct-v0.1-GGUF)|2_K|
|A100 GPU, CUDA|110|119|[mistral-7b](TheBloke/Mistral-7B-Instruct-v0.1-GGUF)|4_K_M|
|A100 GPU, CUDA|119|119|[mistral-7b](TheBloke/Mistral-7B-Instruct-v0.1-GGUF)|4_K_M|

Please submit more benchmarks by raising an issue!

## Usage
### Installation and Build
To install mistral.rs, one should ensure they have Rust installed by following [this](https://rustup.rs/) link. Additionally, the Hugging Face token should be provided in `~/.cache/huggingface/token` by running `huggingface-cli login` to enable automatic download of gated models.
## Installation and Build

1) Install required packages
- `openssl` (ex., `sudo apt install libssl-dev`)
@@ -224,6 +263,7 @@ To install mistral.rs, one should ensure they have Rust installed by following [
There are 3 ways to run a model with mistral.rs:
- From Hugging Face Hub (easiest)
- From local files
- Running a GGUF model fully locally

### Getting models from Hugging Face Hub

@@ -284,16 +324,14 @@ please consider using the method demonstrated in examples below, where the token
**Supported GGUF tokenizer types**
- `llama`

## Run

To start a server serving Mistral GGUF on `localhost:1234`,
```bash
./mistralrs_server --port 1234 --log output.log gguf -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -t mistralai/Mistral-7B-Instruct-v0.1 -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
```
## Run with the CLI

Mistral.rs uses subcommands to control the model type. They are generally of format `<XLORA/LORA>-<QUANTIZATION>`. Please run `./mistralrs_server --help` to see the subcommands.

Additionally, for models without quantization, the model architecture should be provided as the `--arch` or `-a` argument in contrast to GGUF models which encode the architecture in the file. It should be one of the following:
Additionally, for models without quantization, the model architecture should be provided as the `--arch` or `-a` argument in contrast to GGUF models which encode the architecture in the file.

### Architecture for plain models

- `mistral`
- `gemma`
- `mixtral`
@@ -302,6 +340,10 @@
- `phi3`
- `qwen2`

### Architecture for vision models

- `phi3v`

**Interactive mode:**

You can launch interactive mode, a simple chat application running in the terminal, by passing `-i`:
@@ -310,7 +352,7 @@ You can launch interactive mode, a simple chat application running in the termin
```bash
./mistralrs_server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
```

### Quick examples:
## More quick examples:

- X-LoRA with no quantization

@@ -362,13 +404,14 @@ Example:
```bash
./mistralrs_server --port 1234 toml -f toml-selectors/gguf.toml
```

**Command line docs**

Command line docs [here](docs/CMD_LINE_DOCS.md)

---

## Supported models

Mistral.rs supports several model categories:
- text
- vision (see [the docs](docs/VISION_MODELS.md))

**Quantization support**
|Model|GGUF|GGML|
|--|--|--|
@@ -379,13 +422,15 @@ Command line docs [here](docs/CMD_LINE_DOCS.md)
|Phi 2|✅| |
|Phi 3|✅| |
|Qwen 2| | |
|Phi 3 Vision| | |

**Device mapping support**
|Model|Supported|
|--|--|
|Normal|✅|
|Plain|✅|
|GGUF|✅|
|GGML| |
|Vision Plain| |

**X-LoRA and LoRA support**
|Model|X-LoRA|X-LoRA+GGUF|X-LoRA+GGML|
@@ -397,17 +442,19 @@ Command line docs [here](docs/CMD_LINE_DOCS.md)
|Phi 2|✅| | |
|Phi 3|✅|✅| |
|Qwen 2| | | |
|Phi 3 Vision| | | |

**Using derivative models**
### Using derivative models

To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass `--help` after the subcommand. For example, when using a different model than the default, specify the following for the following types of models:

- **Normal**: Model id
- **Plain**: Model id
- **Quantized**: Quantized model id, quantized filename, and tokenizer id
- **X-LoRA**: Model id, X-LoRA ordering
- **X-LoRA quantized**: Quantized model id, quantized filename, tokenizer id, and X-LoRA ordering
- **LoRA**: Model id, LoRA ordering
- **LoRA quantized**: Quantized model id, quantized filename, tokenizer id, and LoRA ordering
- **Vision Plain**: Model id

See [this](#adapter-ordering-file) section to determine if it is necessary to prepare an X-LoRA/LoRA ordering file; it is always necessary if the target modules or architecture changed, or if the adapter order changed.

@@ -421,16 +468,13 @@ For example, when using a Zephyr model:

An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the `x-lora-*` architecture, and LoRA support by selecting the `lora-*` architecture. Please find docs for adapter models [here](docs/ADAPTER_MODELS.md)

---

### Chat Templates and Tokenizer
Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation [here](docs/CHAT_TOK.md).

## Contributing
If you have any problems or want to contribute something, please raise an issue or pull request!


If you want to add a new model, please see [our guide](docs/ADDING_MODELS.md).
Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request.
If you want to add a new model, please contact us via an issue and we can coordinate how to do this.

## FAQ
- Debugging with the environment variable `MISTRALRS_DEBUG=1` causes the following things