update docs
jparkerweb committed Nov 6, 2024
1 parent fcbabfc commit 1fd5ce6
Showing 2 changed files with 26 additions and 24 deletions.
25 changes: 13 additions & 12 deletions README.md
@@ -1,6 +1,6 @@
# 🍱 semantic-chunking

Semantically create chunks from large texts. Useful for workflows involving large language models (LLMs).
NPM package for semantically creating chunks from large texts. Useful for workflows involving large language models (LLMs).

## Features

@@ -12,6 +12,16 @@ Semantically create chunks from large texts. Useful for workflows involving larg
- Chunk prefix support for RAG workflows
- Web UI for experimenting with settings

## Semantic Chunking Workflow
_how it works_

1. **Sentence Splitting**: The input text is split into an array of sentences.
2. **Embedding Generation**: A vector is created for each sentence using the specified ONNX model.
3. **Similarity Calculation**: Cosine similarity scores are calculated for each sentence pair.
4. **Chunk Formation**: Sentences are grouped into chunks based on the similarity threshold and max token size.
5. **Chunk Rebalancing**: Optionally, similar adjacent chunks are combined into larger ones up to the max token size.
6. **Output**: The final chunks are returned as an array of objects, each containing the properties described above.
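
The similarity and chunk-formation steps above can be sketched roughly as follows. This is an illustrative approximation, not the library's actual implementation: the function names (`cosineSimilarity`, `countTokens`, `formChunks`) are hypothetical, the embeddings are hard-coded toy vectors rather than ONNX model output, tokens are approximated by whitespace-separated words, and only adjacent sentences are compared (an assumption about what "sentence pair" means here).

```javascript
// Illustrative sketch only: not the semantic-chunking library's actual code.

// Cosine similarity of two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical token counter: words stand in for real model tokens.
function countTokens(text) {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// Start a new chunk when the similarity to the previous sentence drops below
// the threshold, or when adding the sentence would exceed maxTokenSize.
function formChunks(sentences, embeddings, { similarityThreshold, maxTokenSize }) {
  const chunks = [];
  let current = [];
  let currentTokens = 0;
  for (let i = 0; i < sentences.length; i++) {
    const tokens = countTokens(sentences[i]);
    const similar =
      i === 0 ||
      cosineSimilarity(embeddings[i - 1], embeddings[i]) >= similarityThreshold;
    const fits = currentTokens + tokens <= maxTokenSize;
    if (current.length > 0 && (!similar || !fits)) {
      chunks.push(current.join(" "));
      current = [];
      currentTokens = 0;
    }
    current.push(sentences[i]);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}

// Toy data: the first two sentences point in nearly the same direction.
const sentences = [
  "Cats purr when content.",
  "Kittens purr too.",
  "Stock markets fell today.",
];
const embeddings = [
  [1, 0],
  [0.9, 0.1],
  [0, 1],
];

const chunks = formChunks(sentences, embeddings, {
  similarityThreshold: 0.5,
  maxTokenSize: 50,
});
console.log(chunks);
```

With these toy vectors the first two sentences land in one chunk and the third starts a new one. The optional rebalancing step is omitted from the sketch.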

## Installation

```bash
@@ -83,15 +93,6 @@ The output is an array of chunks, each containing the following properties:
- `embedding`: Array - The embedding vector (if `returnEmbedding` is `true`).
- `token_length`: Integer - The token length (if `returnTokenLength` is `true`).
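
Because `embedding` and `token_length` are only present when their options are enabled, code that consumes the output may want to treat them as optional. A minimal sketch, assuming hand-written chunk objects rather than real library output; the `text` property name and the `summarizeChunk` helper are assumptions made for illustration:

```javascript
// Hypothetical chunk objects. `text` is an assumed property name; `embedding`
// and `token_length` appear only when the corresponding options are enabled.
const chunks = [
  { text: "First chunk.", embedding: [0.1, 0.2, 0.3], token_length: 4 },
  { text: "Second chunk." }, // as if produced without the optional properties
];

// Summarize a chunk, tolerating the optional properties being absent.
function summarizeChunk(chunk) {
  return {
    text: chunk.text,
    dimensions: Array.isArray(chunk.embedding) ? chunk.embedding.length : null,
    tokens: Number.isInteger(chunk.token_length) ? chunk.token_length : null,
  };
}

const summaries = chunks.map(summarizeChunk);
console.log(summaries);
```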

## Semantic Chunking Workflow

1. **Sentence Splitting**: The input text is split into an array of sentences.
2. **Embedding Generation**: A vector is created for each sentence using the specified ONNX model.
3. **Similarity Calculation**: Cosine similarity scores are calculated for each sentence pair.
4. **Chunk Formation**: Sentences are grouped into chunks based on the similarity threshold and max token size.
5. **Chunk Rebalancing**: Optionally, similar adjacent chunks are combined into larger ones up to the max token size.
6. **Output**: The final chunks are returned as an array of objects, each containing the properties described above.

## Examples

Example 1: Basic usage with custom similarity threshold:
@@ -219,6 +220,7 @@ The behavior of the `chunkit` function can be finely tuned using several optiona
| Xenova/all-MiniLM-L6-v2 | true | [https://huggingface.co/Xenova/all-MiniLM-L6-v2](https://huggingface.co/Xenova/all-MiniLM-L6-v2) | 23 MB |
| Xenova/all-MiniLM-L6-v2 | false | [https://huggingface.co/Xenova/all-MiniLM-L6-v2](https://huggingface.co/Xenova/all-MiniLM-L6-v2) | 90.4 MB |
| Xenova/paraphrase-multilingual-MiniLM-L12-v2 | true | [https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2) | 118 MB |
| thenlper/gte-base | false | [https://huggingface.co/thenlper/gte-base](https://huggingface.co/thenlper/gte-base) | 436 MB |
| Xenova/all-distilroberta-v1 | true | [https://huggingface.co/Xenova/all-distilroberta-v1](https://huggingface.co/Xenova/all-distilroberta-v1) | 82.1 MB |
| Xenova/all-distilroberta-v1 | false | [https://huggingface.co/Xenova/all-distilroberta-v1](https://huggingface.co/Xenova/all-distilroberta-v1) | 326 MB |
| BAAI/bge-base-en-v1.5 | false | [https://huggingface.co/BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 436 MB |
@@ -241,8 +243,7 @@ The Semantic Chunking Web UI allows you to experiment with the chunking paramete
- Example texts for testing
- Dark mode interface



![Semantic Chunking Web UI](./img/semantic-chunking_web-ui.gif)

---

25 changes: 13 additions & 12 deletions webui/README.md
@@ -1,6 +1,6 @@
# 🍱 Semantic Chunking Web UI

A web-based interface for experimenting with and tuning Semantic Chunking settings. This tool provides a visual way to test and configure the `semantic-chunking` library's settings to get optimal results for your specific use case.
A web-based interface for experimenting with and tuning Semantic Chunking settings. This tool provides a visual way to test and configure the `semantic-chunking` library's settings to get optimal results for your specific use case. Once you've found the best settings, you can generate code to implement them in your project.

## Features

@@ -13,6 +13,8 @@ A web-based interface for experimenting with and tuning Semantic Chunking settin
- Example texts for testing
- Dark mode interface

![semantic-chunking_web-ui](../img/semantic-chunking_web-ui.gif)

## Getting Started

### Prerequisites
@@ -22,36 +24,30 @@ A web-based interface for experimenting with and tuning Semantic Chunking settin
### Installation

1. Clone the repository:
---bash
```bash
git clone https://github.com/jparkerweb/semantic-chunking.git
```

2. Navigate to the webui directory:
---bash
```bash
cd semantic-chunking/webui
```


3. Install dependencies:
---bash
```bash
npm install
```

4. Start the server:
---bash
```bash
npm start
```


5. Open your browser and visit:
---bash
```bash
http://localhost:3000
```

---
## Usage

### Basic Controls
@@ -104,3 +100,8 @@ The web UI is built with:
## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Appreciation

If you enjoy this package, please consider sending me a tip to support my work 😀
# [🍵 tip me here](https://ko-fi.com/jparkerweb)
