Upload

goldpulpy · Oct 9, 2024 · 8cb7cfa
1 parent 71276ea
commit 8cb7cfa
Showing 6 changed files with 462 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,162 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+.pdm.toml
+.pdm-python
+.pdm-build/
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 .pulpy
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,198 @@
+<h1 align="center">PySentence-Similarity 😊</h1>
+<p align="center">
+    <a href="https://github.com/goldpulpy/pysentence-similarity/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/goldpulpy/pysentence-similarity.svg?color=blue"></a>
+    <img alt="GitHub Actions Workflow Status" src="https://img.shields.io/github/actions/workflow/status/goldpulpy/pysentence-similarity/package.yml">
+</p>
+
+## Information
+
+**pysentence-similarity** is a tool for measuring how similar sentences are to a base sentence, expressed as a percentage 📊. It compares the semantic content of each input sentence with the base sentence and returns a score that reflects how closely they are related. It is useful for natural language processing tasks such as clustering similar texts 📚, paraphrase detection 🔍, and textual entailment measurement 📈.
+
+The models were converted to ONNX format to speed up inference. ONNX also enables cross-platform compatibility and hardware acceleration, making the models more efficient for large-scale or real-world applications 🚀.
+
+- **High accuracy:** Utilizes a robust Transformer-based architecture, providing high accuracy in semantic similarity calculations 🔬.
+- **Cross-platform support:** The ONNX format provides seamless integration across platforms, making it easy to deploy across environments 🌐.
+- **Scalability:** Efficient processing can handle large datasets, making it suitable for enterprise-level applications 📈.
+- **Real-time processing:** Optimized for fast inference, it can be used in real-world applications without significant latency ⏱️.
+- **Flexible:** Easily adaptable to specific use cases through customization or integration with additional models or features 🛠️.
+- **Low resource consumption:** The model is designed to operate efficiently, reducing memory and CPU/GPU requirements, making it ideal for resource-constrained environments ⚡.
+- **Fast and user-friendly:** The library offers high performance and an intuitive interface, allowing users to quickly and easily integrate it into their projects 🚀.
+
+## Installation 📦
+
+- **Requirements:** Python 3.8 or higher.
+
+```bash
+# install from PyPI
+pip install pysentence-similarity
+
+# install from GitHub
+pip install git+https://github.com/goldpulpy/pysentence-similarity.git
+```
+
+## Supported models 🤝
+
+You don't need to download anything manually; the package downloads the model and its tokenizer from a dedicated Hugging Face [repository](https://huggingface.co/goldpulpy/pysentence-similarity).
+
+Below are the models currently available in that repository, with their parameter counts, file sizes per dtype, and a link to the source.
+
+| Model                                 | Parameters | FP32   | FP16  | INT8  | Source link                                                                                 |
+| ------------------------------------- | ---------- | ------ | ----- | ----- | ------------------------------------------------------------------------------------------- |
+| all-MiniLM-L6-v2                      | 22.7M      | 90MB   | 45MB  | 23MB  | [HF](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) 🤗                      |
+| paraphrase-MiniLM-L6-v2               | 22.7M      | 90MB   | 45MB  | 23MB  | [HF](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) 🤗               |
+| all-MiniLM-L12-v2                     | 33.4M      | 127MB  | 65MB  | 32MB  | [HF](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) 🤗                     |
+| gte-small                             | 33.4M      | 127MB  | 65MB  | 32MB  | [HF](https://huggingface.co/thenlper/gte-small) 🤗                                          |
+| all-mpnet-base-v2                     | 109M       | 418MB  | 209MB | 105MB | [HF](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) 🤗                     |
+| paraphrase-multilingual-MiniLM-L12-v2 | 118M       | 449MB  | 225MB | 113MB | [HF](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) 🤗 |
+| text2vec-base-multilingual            | 118M       | 449MB  | 225MB | 113MB | [HF](https://huggingface.co/shibing624/text2vec-base-multilingual) 🤗                       |
+| gte-multilingual-base                 | 305M       | 1.17GB | 599MB | 324MB | [HF](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) 🤗                           |
+| gte-large                             | 335M       | 1.25GB | 640MB | 321MB | [HF](https://huggingface.co/thenlper/gte-large) 🤗                                          |
+| LaBSE                                 | 470M       | 1.75GB | 898MB | 450MB | [HF](https://huggingface.co/sentence-transformers/LaBSE) 🤗                                 |
+
+**pysentence-similarity** supports `FP32`, `FP16`, and `INT8` dtypes.
+
+- **FP32:** 32-bit floating-point format that provides high precision and a wide range of values.
+- **FP16:** 16-bit floating-point format, reducing memory consumption and computation time, with minimal loss of precision (typically less than 1%).
+- **INT8:** 8-bit integer quantized format that greatly reduces model size and speeds up inference, ideal for resource-constrained environments, with little loss of precision.
+
+## Usage examples 📖
+
+### Sentence similarity score 📊
+
+The similarity score expresses how similar each sentence is to the source sentence, as a fraction (0.75 = 75%). The default compute function is `cosine`.
+
+You can use CUDA 12.X by passing the `device='cuda'` parameter to the Model object; the default is `cpu`. If the device is not available, it will automatically be set to `cpu`.
+
+```python
+from pysentence_similarity import Model, compute_score
+
+# Create an instance of the all-MiniLM-L6-v2 model; the default dtype is `fp32`, here `fp16` is requested
+model = Model("all-MiniLM-L6-v2", dtype="fp16")
+
+sentences = [
+    "This is another test.",
+    "This is yet another test.",
+    "We are testing sentence similarity."
+]
+
+# Convert sentences to embeddings
+# The default is to use mean_pooling as a pooling function
+source_embedding = model.encode("This is a test.")
+embeddings = model.encode(sentences, progress_bar=True)
+
+# Compute similarity scores
+# The rounding parameter allows us to round our float values
+# with a default of 2, which means 2 decimal places.
+compute_score(source_embedding, embeddings)
+# Return: [0.86, 0.77, 0.48]
+```
+
+`compute_score` returns the scores in the same order in which the sentences were encoded.
+
+To see each sentence alongside its score:
+
+```python
+# Compute similarity scores
+scores = compute_score(source_embedding, embeddings)
+
+for sentence, score in zip(sentences, scores):
+    print(f"{sentence} ({score})")
+
+# Output prints:
+# This is another test. (0.86)
+# This is yet another test. (0.77)
+# We are testing sentence similarity. (0.48)
+```
+
+You can use any of the built-in compute functions (`cosine`, `euclidean`, `manhattan`, `jaccard`, `pearson`, `minkowski`, `hamming`, `kl_divergence`, `chebyshev`, `bregman`) or your own custom function.
+
+```python
+from pysentence_similarity.compute import euclidean
+
+compute_score(source_embedding, embeddings, compute_function=euclidean)
+# Return: [2.52, 3.28, 5.62]
+```
+
+For pooling, you can use `max_pooling`, `mean_pooling`, `min_pooling`, or your own custom function.
+
+```python
+from pysentence_similarity.pooling import max_pooling
+
+source_embedding = model.encode("This is a test.", pooling_function=max_pooling)
+embeddings = model.encode(sentences, pooling_function=max_pooling)
+...
+```
+
+### Splitting ✂️
+
+```python
+from pysentence_similarity import Splitter
+
+# Default split markers: '\n'
+splitter = Splitter()
+
+# To split on specific characters and keep them in the output:
+splitter = Splitter(markers_to_split=["!", "?", "."], preserve_markers=True)
+
+# Test text
+text = "Hello world! How are you? I'm fine."
+
+# Split from text
+splitter.split_from_text(text)
+# Return: ['Hello world!', 'How are you?', "I'm fine."]
+```
+
+Currently, the splitter supports the following sources: text, file, URL, CSV, and JSON.
+
+### Storage 💾
+
+The storage allows you to save and link sentences and their embeddings for easy access, so you don't need to encode a large corpus of text every time. The storage also enables similarity searching.
+
+The storage must store the **sentences** themselves and their **embeddings**.
+
+```python
+from pysentence_similarity import Model, Storage
+
+# Create an instance of the model
+model = Model("all-MiniLM-L6-v2", dtype="fp16")
+
+# Create an instance of the storage
+storage = Storage()
+sentences = [
+    "This is another test.",
+    "This is yet another test.",
+    "We are testing sentence similarity."
+]
+
+# Convert sentences to embeddings
+embeddings = model.encode(sentences)
+
+# Add sentences and their embeddings
+storage.add(sentences, embeddings)
+
+# Save the storage
+storage.save("my_storage.h5")
+```
+
+To load the storage and score against it:
+
+```python
+from pysentence_similarity import Model, Storage, compute_score
+
+# Create an instance of the model and storage
+model = Model("all-MiniLM-L6-v2", dtype="fp16")
+storage = Storage.load("my_storage.h5")
+
+# Convert sentence to embedding
+source_embedding = model.encode("This is a test.")
+
+# Compute similarity scores with the storage
+compute_score(source_embedding, storage)
+# Return: [0.86, 0.77, 0.48]
+```
+
+## License 📜
+
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+<h6 align="center">Created by goldpulpy with ❤️</h6>
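
The README added in this commit says that sentences are pooled with `mean_pooling` and scored with `cosine` by default, with results rounded to 2 decimal places. The following is a minimal pure-Python sketch of that idea, not the package's implementation: `mean_pool`, `cosine`, and `score_all` are hypothetical names introduced for illustration, and the toy two-dimensional vectors stand in for real model outputs.

```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average token vectors element-wise into a single sentence vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_all(source: Sequence[float],
              candidates: Sequence[Sequence[float]],
              rounding: int = 2) -> List[float]:
    """Score every candidate against the source, preserving input order."""
    return [round(cosine(source, c), rounding) for c in candidates]


# Toy 2-dimensional "token vectors" (assumption: real models emit far more dimensions).
source = mean_pool([[1.0, 0.0], [1.0, 0.0]])   # pooled to [1.0, 0.0]
candidates = [
    mean_pool([[2.0, 0.0], [4.0, 0.0]]),       # same direction as the source
    mean_pool([[1.0, 1.0], [1.0, 1.0]]),       # 45 degrees from the source
    mean_pool([[0.0, 3.0], [0.0, 1.0]]),       # orthogonal to the source
]
scores = score_all(source, candidates)
print(scores)  # [1.0, 0.71, 0.0]
```

Cosine similarity depends only on the direction of the vectors, which is why the first candidate scores 1.0 even though it is three times longer than the source.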