This repository has been archived by the owner on Jan 9, 2025. It is now read-only.

Commit

feat(text): add tokenizer for cohere & new gpt-4o (#276)
Because

- users need to count the tokens in each chunk
- the chunking strategy and token counting should be decoupled
- users need to fetch tokenizers from vendors

This commit

- adds tokenization for the Cohere & GPT-4o model families
- note: there are more use cases for Hugging Face; the Hugging Face Python
  lib raises an error when its settings are incorrect
    - e.g. when the token count exceeds the limit for a specific model
- refactors the text task's chunk-text logic for future extensibility and
  maintainability
chuang8511 authored Aug 9, 2024
1 parent 15fc0d2 commit 5d8cec3
Showing 16 changed files with 969 additions and 216 deletions.
2 changes: 1 addition & 1 deletion go.mod
@@ -43,7 +43,7 @@ require (
github.com/lib/pq v1.10.9
github.com/nakagami/firebirdsql v0.9.10
github.com/pkg/errors v0.9.1
-github.com/pkoukk/tiktoken-go v0.1.6
+github.com/pkoukk/tiktoken-go v0.1.7
github.com/redis/go-redis/v9 v9.5.1
github.com/santhosh-tekuri/jsonschema/v5 v5.3.0
github.com/sijms/go-ora/v2 v2.8.19
4 changes: 2 additions & 2 deletions go.sum
@@ -396,8 +396,8 @@ github.com/pkg/errors v0.8.0/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINE
github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
-github.com/pkoukk/tiktoken-go v0.1.6 h1:JF0TlJzhTbrI30wCvFuiw6FzP2+/bR+FIxUdgEAcUsw=
-github.com/pkoukk/tiktoken-go v0.1.6/go.mod h1:9NiV+i9mJKGj1rYOT+njbv+ZwA/zJxYdewGl6qVatpg=
+github.com/pkoukk/tiktoken-go v0.1.7 h1:qOBHXX4PHtvIvmOtyg1EeKlwFRiMKAcoMp4Q+bLQDmw=
+github.com/pkoukk/tiktoken-go v0.1.7/go.mod h1:9NiV+i9mJKGj1rYOT+njbv+ZwA/zJxYdewGl6qVatpg=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA=
105 changes: 84 additions & 21 deletions operator/text/v0/.compogen/extra-chunk-text.mdx
@@ -7,39 +7,102 @@ There are three strategies available for chunking text in Text Component:
#### Token
Language models have a token limit, which you should not exceed. When you split your text into chunks, it is therefore a good idea to count the number of tokens in each chunk. There are many tokenizers; when counting tokens in your text, use the same tokenizer that the language model uses.

| **Parameter** | **Type** | **Description** |
| -------------------- | ---------------- | ------------------------------------------------------------------------- |
| `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks |
| `model-name` | string | The name of the model used for tokenization |
| `allowed-special` | array of strings | A list of special tokens that are allowed within chunks |
| `disallowed-special` | array of strings | A list of special tokens that should not appear within chunks |

#### Recursive
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

| **Parameter** | **Type** | **Description** |
| ---------------- | ---------------- | -------------------------------------------------------------------------------------------- |
| `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks |
| `separators` | array of strings | A list of strings representing the separators used to split the text |
| `keep-separator` | boolean | A flag indicating whether to keep the separator characters at the beginning or end of chunks |
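The recursive idea can be sketched in a few lines of Go. This is a simplified illustration under stated assumptions: it omits the merge step (re-joining small adjacent pieces back up to `chunk-size`), `chunk-overlap`, and the `keep-separator` handling, and it measures size in bytes:

```go
package main

import (
	"fmt"
	"strings"
)

// recursiveSplit tries each separator in order: it splits the text on the
// first separator, then recursively re-splits any piece that is still
// longer than chunkSize using the remaining separators.
func recursiveSplit(text string, separators []string, chunkSize int) []string {
	if len(text) <= chunkSize {
		return []string{text}
	}
	if len(separators) == 0 || separators[0] == "" {
		// Last resort: hard cut at chunkSize.
		var out []string
		for len(text) > chunkSize {
			out = append(out, text[:chunkSize])
			text = text[chunkSize:]
		}
		return append(out, text)
	}
	var out []string
	for _, piece := range strings.Split(text, separators[0]) {
		out = append(out, recursiveSplit(piece, separators[1:], chunkSize)...)
	}
	return out
}

func main() {
	text := "First paragraph.\n\nSecond paragraph is a bit longer."
	for _, c := range recursiveSplit(text, []string{"\n\n", "\n", " ", ""}, 20) {
		fmt.Printf("%q\n", c)
	}
}
```

The first paragraph fits within the 20-character budget and stays whole; the second is too long, so it is recursively broken on the finer separators.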


#### Markdown
This text splitter is specially designed for Markdown format.

| **Parameter** | **Type** | **Description** |
| --------------- | -------- | ------------------------------------------------------------------------- |
| `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks |
| `code-blocks` | boolean | A flag indicating whether code blocks should be treated as a single unit |
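The structure-aware idea behind the Markdown splitter can be sketched as follows. This illustration only groups lines under their nearest heading; the real splitter also enforces `chunk-size`, `chunk-overlap`, and the `code-blocks` option (the function name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// splitByHeading groups lines under their nearest Markdown heading, so
// each chunk keeps the section it belongs to.
func splitByHeading(doc string) []string {
	var chunks []string
	var current []string
	for _, line := range strings.Split(doc, "\n") {
		// A new heading closes the current chunk and starts the next one.
		if strings.HasPrefix(line, "#") && len(current) > 0 {
			chunks = append(chunks, strings.Join(current, "\n"))
			current = nil
		}
		current = append(current, line)
	}
	if len(current) > 0 {
		chunks = append(chunks, strings.Join(current, "\n"))
	}
	return chunks
}

func main() {
	doc := "# Intro\nSome text.\n# Usage\nMore text."
	for _, c := range splitByHeading(doc) {
		fmt.Printf("%q\n", c)
	}
}
```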

### Tokenization
There are two ways to choose the tokenizer:

1. by model name
2. by encoding name

#### Model Name

| **Model** |
| ----------------------------- |
| gpt-4o |
| gpt-4 |
| gpt-3.5-turbo |
| command-r-plus |
| command-r |
| command |
| command-nightly |
| command-light |
| command-light-nightly |
| embed-english-v3.0 |
| embed-multilingual-v3.0 |
| embed-english-light-v3.0 |
| embed-multilingual-light-v3.0 |
| text-davinci-003 |
| text-davinci-002 |
| text-davinci-001 |
| text-curie-001 |
| text-babbage-001 |
| text-ada-001 |
| davinci |
| curie |
| babbage |
| ada |
| code-davinci-002 |
| code-davinci-001 |
| code-cushman-002 |
| code-cushman-001 |
| davinci-codex |
| cushman-codex |
| text-davinci-edit-001 |
| code-davinci-edit-001 |
| text-embedding-ada-002 |
| text-similarity-davinci-001 |
| text-similarity-curie-001 |
| text-similarity-babbage-001 |
| text-similarity-ada-001 |
| text-search-davinci-doc-001 |
| text-search-curie-doc-001 |
| text-search-babbage-doc-001 |
| text-search-ada-doc-001 |
| code-search-babbage-code-001 |
| code-search-ada-code-001 |
| gpt2 |



#### Encoding Name
| **Encoding** |
| ------------ |
| o200k_base |
| cl100k_base |
| p50k_base |
| r50k_base |
| p50k_edit |
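For the OpenAI models, the two selection methods are related: each model name resolves to one of the encodings above (the Cohere models use Cohere's own vendor tokenizers instead). A small Go sketch of the idea, with the mapping assumed from the tiktoken library and deliberately not exhaustive:

```go
package main

import "fmt"

// modelToEncoding shows how a few of the model names above resolve to
// encodings (mapping assumed from the tiktoken library; not exhaustive).
var modelToEncoding = map[string]string{
	"gpt-4o":                 "o200k_base",
	"gpt-4":                  "cl100k_base",
	"gpt-3.5-turbo":          "cl100k_base",
	"text-embedding-ada-002": "cl100k_base",
	"text-davinci-003":       "p50k_base",
	"text-davinci-edit-001":  "p50k_edit",
	"davinci":                "r50k_base",
}

func main() {
	enc, ok := modelToEncoding["gpt-4o"]
	if !ok {
		fmt.Println("unknown model")
		return
	}
	fmt.Println(enc) // the new gpt-4o family uses the o200k_base encoding
}
```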


### Text Chunks in Output
| **Parameter**    | **Type** | **Description**                                              |
| ---------------- | -------- | ------------------------------------------------------------ |
| `text`           | string   | The text chunk                                               |
| `start-position` | integer  | The starting position of the text chunk in the original text |
| `end-position`   | integer  | The ending position of the text chunk in the original text   |
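The positions can be recovered by locating each chunk in the original text. A minimal sketch; the `TextChunk` struct, its field names, and the inclusive end position are assumptions made for this illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// TextChunk mirrors the output fields described above (names assumed).
type TextChunk struct {
	Text          string
	StartPosition int
	EndPosition   int
}

// locate finds each chunk's position in the original text. Searching from
// the previous chunk's start allows consecutive chunks to overlap.
func locate(original string, chunks []string) []TextChunk {
	out := make([]TextChunk, 0, len(chunks))
	from := 0
	for _, c := range chunks {
		idx := strings.Index(original[from:], c)
		if idx < 0 {
			continue // chunk not found; skipped in this sketch
		}
		start := from + idx
		out = append(out, TextChunk{
			Text:          c,
			StartPosition: start,
			EndPosition:   start + len(c) - 1,
		})
		from = start
	}
	return out
}

func main() {
	orig := "alpha beta gamma"
	for _, tc := range locate(orig, []string{"alpha beta", "beta gamma"}) {
		fmt.Printf("%q starts at %d, ends at %d\n", tc.Text, tc.StartPosition, tc.EndPosition)
	}
}
```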