This repository has been archived by the owner on Jan 9, 2025. It is now read-only.

Commit

feat(text): add tokenizer for cohere & new gpt-4o (#276)
Because

- users need to count the tokens in each chunk
- the chunking strategy and token counting should be decoupled
- users need to fetch tokenizers from vendors

This commit

- adds tokenization for the Cohere & GPT-4o model families
- note: there are more use cases for Hugging Face; the Hugging Face Python
  lib raises an error when its settings are incorrect
    - e.g. when the token count exceeds the limit for a specific model
- refactors the text task's chunk-text logic for future extensibility and
  maintainability
chuang8511 authored Aug 9, 2024
1 parent 15fc0d2 commit 5d8cec3
Showing 16 changed files with 969 additions and 216 deletions.
2 changes: 1 addition & 1 deletion go.mod
@@ -43,7 +43,7 @@ require (
github.com/lib/pq v1.10.9
github.com/nakagami/firebirdsql v0.9.10
github.com/pkg/errors v0.9.1
-github.com/pkoukk/tiktoken-go v0.1.6
+github.com/pkoukk/tiktoken-go v0.1.7
github.com/redis/go-redis/v9 v9.5.1
github.com/santhosh-tekuri/jsonschema/v5 v5.3.0
github.com/sijms/go-ora/v2 v2.8.19
4 changes: 2 additions & 2 deletions go.sum
@@ -396,8 +396,8 @@ github.com/pkg/errors v0.8.0/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINE
github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
-github.com/pkoukk/tiktoken-go v0.1.6 h1:JF0TlJzhTbrI30wCvFuiw6FzP2+/bR+FIxUdgEAcUsw=
-github.com/pkoukk/tiktoken-go v0.1.6/go.mod h1:9NiV+i9mJKGj1rYOT+njbv+ZwA/zJxYdewGl6qVatpg=
+github.com/pkoukk/tiktoken-go v0.1.7 h1:qOBHXX4PHtvIvmOtyg1EeKlwFRiMKAcoMp4Q+bLQDmw=
+github.com/pkoukk/tiktoken-go v0.1.7/go.mod h1:9NiV+i9mJKGj1rYOT+njbv+ZwA/zJxYdewGl6qVatpg=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA=
105 changes: 84 additions & 21 deletions operator/text/v0/.compogen/extra-chunk-text.mdx
@@ -7,39 +7,102 @@ There are three strategies available for chunking text in Text Component:
#### Token
Language models have a token limit, which you should not exceed. When you split your text into chunks, it is therefore a good idea to count the number of tokens in each chunk. There are many tokenizers; when counting tokens in your text, use the same tokenizer that the language model uses.

| **Parameter** | **Type** | **Description** |
| -------------------- | ---------------- | ------------------------------------------------------------------------- |
| `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks |
| `model-name` | string | The name of the model used for tokenization |
| `allowed-special` | array of strings | A list of special tokens that are allowed within chunks |
| `disallowed-special` | array of strings | A list of special tokens that should not appear within chunks |

#### Recursive
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

| **Parameter** | **Type** | **Description** |
| ---------------- | ---------------- | -------------------------------------------------------------------------------------------- |
| `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks |
| `separators` | array of strings | A list of strings representing the separators used to split the text |
| `keep-separator` | boolean | A flag indicating whether to keep the separator characters at the beginning or end of chunks |
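The recursive idea can be sketched in a few lines of Go. This is a simplified illustration under stated assumptions: it omits the merge step (re-joining small adjacent pieces back up to `chunk-size`), `chunk-overlap`, and the `keep-separator` handling, and it measures size in bytes:

```go
package main

import (
	"fmt"
	"strings"
)

// recursiveSplit tries each separator in order: it splits the text on the
// first separator, then recursively re-splits any piece that is still
// longer than chunkSize using the remaining separators.
func recursiveSplit(text string, separators []string, chunkSize int) []string {
	if len(text) <= chunkSize {
		return []string{text}
	}
	if len(separators) == 0 || separators[0] == "" {
		// Last resort: hard cut at chunkSize.
		var out []string
		for len(text) > chunkSize {
			out = append(out, text[:chunkSize])
			text = text[chunkSize:]
		}
		return append(out, text)
	}
	var out []string
	for _, piece := range strings.Split(text, separators[0]) {
		out = append(out, recursiveSplit(piece, separators[1:], chunkSize)...)
	}
	return out
}

func main() {
	text := "First paragraph.\n\nSecond paragraph is a bit longer."
	for _, c := range recursiveSplit(text, []string{"\n\n", "\n", " ", ""}, 20) {
		fmt.Printf("%q\n", c)
	}
}
```

The first paragraph fits within the 20-character budget and stays whole; the second is too long, so it is recursively broken on the finer separators.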


#### Markdown
This text splitter is specially designed for Markdown format.

| **Parameter** | **Type** | **Description** |
| --------------- | -------- | ------------------------------------------------------------------------- |
| `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks |
| `code-blocks` | boolean | A flag indicating whether code blocks should be treated as a single unit |
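The structure-aware idea behind the Markdown splitter can be sketched as follows. This illustration only groups lines under their nearest heading; the real splitter also enforces `chunk-size`, `chunk-overlap`, and the `code-blocks` option (the function name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// splitByHeading groups lines under their nearest Markdown heading, so
// each chunk keeps the section it belongs to.
func splitByHeading(doc string) []string {
	var chunks []string
	var current []string
	for _, line := range strings.Split(doc, "\n") {
		// A new heading closes the current chunk and starts the next one.
		if strings.HasPrefix(line, "#") && len(current) > 0 {
			chunks = append(chunks, strings.Join(current, "\n"))
			current = nil
		}
		current = append(current, line)
	}
	if len(current) > 0 {
		chunks = append(chunks, strings.Join(current, "\n"))
	}
	return chunks
}

func main() {
	doc := "# Intro\nSome text.\n# Usage\nMore text."
	for _, c := range splitByHeading(doc) {
		fmt.Printf("%q\n", c)
	}
}
```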

### Tokenization
There are two ways to choose the tokenizer:

1. by model name
2. by encoding name

#### Model Name

| **Model** |
| ----------------------------- |
| gpt-4o |
| gpt-4 |
| gpt-3.5-turbo |
| command-r-plus |
| command-r |
| command |
| command-nightly |
| command-light |
| command-light-nightly |
| embed-english-v3.0 |
| embed-multilingual-v3.0 |
| embed-english-light-v3.0 |
| embed-multilingual-light-v3.0 |
| text-davinci-003 |
| text-davinci-002 |
| text-davinci-001 |
| text-curie-001 |
| text-babbage-001 |
| text-ada-001 |
| davinci |
| curie |
| babbage |
| ada |
| code-davinci-002 |
| code-davinci-001 |
| code-cushman-002 |
| code-cushman-001 |
| davinci-codex |
| cushman-codex |
| text-davinci-edit-001 |
| code-davinci-edit-001 |
| text-embedding-ada-002 |
| text-similarity-davinci-001 |
| text-similarity-curie-001 |
| text-similarity-babbage-001 |
| text-similarity-ada-001 |
| text-search-davinci-doc-001 |
| text-search-curie-doc-001 |
| text-search-babbage-doc-001 |
| text-search-ada-doc-001 |
| code-search-babbage-code-001 |
| code-search-ada-code-001 |
| gpt2 |



#### Encoding Name
| **Encoding** |
| ------------ |
| o200k_base |
| cl100k_base |
| p50k_base |
| r50k_base |
| p50k_edit |
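For the OpenAI models, the two selection methods are related: each model name resolves to one of the encodings above (the Cohere models use Cohere's own vendor tokenizers instead). A small Go sketch of the idea, with the mapping assumed from the tiktoken library and deliberately not exhaustive:

```go
package main

import "fmt"

// modelToEncoding shows how a few of the model names above resolve to
// encodings (mapping assumed from the tiktoken library; not exhaustive).
var modelToEncoding = map[string]string{
	"gpt-4o":                 "o200k_base",
	"gpt-4":                  "cl100k_base",
	"gpt-3.5-turbo":          "cl100k_base",
	"text-embedding-ada-002": "cl100k_base",
	"text-davinci-003":       "p50k_base",
	"text-davinci-edit-001":  "p50k_edit",
	"davinci":                "r50k_base",
}

func main() {
	enc, ok := modelToEncoding["gpt-4o"]
	if !ok {
		fmt.Println("unknown model")
		return
	}
	fmt.Println(enc) // the new gpt-4o family uses the o200k_base encoding
}
```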


### Text Chunks in Output
| **Parameter**    | **Type** | **Description**                                              |
| ---------------- | -------- | ------------------------------------------------------------ |
| `text`           | string   | The text chunk                                               |
| `start-position` | integer  | The starting position of the text chunk in the original text |
| `end-position`   | integer  | The ending position of the text chunk in the original text   |
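The positions can be recovered by locating each chunk in the original text. A minimal sketch; the `TextChunk` struct, its field names, and the inclusive end position are assumptions made for this illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// TextChunk mirrors the output fields described above (names assumed).
type TextChunk struct {
	Text          string
	StartPosition int
	EndPosition   int
}

// locate finds each chunk's position in the original text. Searching from
// the previous chunk's start allows consecutive chunks to overlap.
func locate(original string, chunks []string) []TextChunk {
	out := make([]TextChunk, 0, len(chunks))
	from := 0
	for _, c := range chunks {
		idx := strings.Index(original[from:], c)
		if idx < 0 {
			continue // chunk not found; skipped in this sketch
		}
		start := from + idx
		out = append(out, TextChunk{
			Text:          c,
			StartPosition: start,
			EndPosition:   start + len(c) - 1,
		})
		from = start
	}
	return out
}

func main() {
	orig := "alpha beta gamma"
	for _, tc := range locate(orig, []string{"alpha beta", "beta gamma"}) {
		fmt.Printf("%q starts at %d, ends at %d\n", tc.Text, tc.StartPosition, tc.EndPosition)
	}
}
```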