NOAH (Negative Outlier Analysis Hub) is a Python toolkit for sampling negative instances using both syntactic and semantic similarity measures. It is intended for researchers and developers in machine learning, particularly natural language processing and recommendation systems, who need to identify dissimilar or outlier examples in their datasets.
- Syntactic Similarity Sampling: Uses spaCy to generate syntactic embeddings and sample negatives.
- Semantic Similarity Sampling: Uses Sentence Transformers to generate semantic embeddings and sample negatives.
- Easy Integration: Designed to fit into existing machine learning pipelines.
- Customizable: Similarity thresholds and sampling strategies can be adjusted.
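To illustrate what a customizable threshold and sampling strategy can look like, here is a minimal NumPy sketch. It is not NOAH's actual API: the function name `sample_negatives`, its parameters, and the two strategies (`"random"` and `"hardest"`) are assumptions made for this example. It takes similarity scores between one anchor and a set of candidates and returns the indices of candidates that fall below the threshold.

```python
import numpy as np


def sample_negatives(similarities, k, threshold=0.3, strategy="random", seed=None):
    """Hypothetical helper: pick up to k candidates with similarity below `threshold`.

    strategy="random"  -> uniform sample from the eligible candidates
    strategy="hardest" -> the eligible candidates with the highest similarity
    """
    similarities = np.asarray(similarities, dtype=float)
    eligible = np.flatnonzero(similarities < threshold)
    if eligible.size == 0:
        return eligible  # no candidate is dissimilar enough
    k = min(k, eligible.size)
    if strategy == "random":
        rng = np.random.default_rng(seed)
        return rng.choice(eligible, size=k, replace=False)
    if strategy == "hardest":
        # Sort eligible candidates by similarity, highest first.
        order = np.argsort(similarities[eligible])[::-1]
        return eligible[order[:k]]
    raise ValueError(f"unknown strategy: {strategy!r}")


# Similarity of five candidates to one anchor sentence (made-up values).
scores = [0.92, 0.15, 0.40, 0.05, 0.28]
print(sample_negatives(scores, k=2, strategy="hardest"))  # [4 1]
```

Raising the threshold admits negatives that are closer to the anchor, which are usually harder for a model to separate; lowering it yields only clearly dissimilar examples.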
To run NOAH, you will need:
- Python 3.6+
- spaCy
- Sentence Transformers
- NumPy
- TensorFlow (optional, for handling large datasets efficiently)
Install NOAH by cloning this repository and installing the required packages:
git clone https://github.com/LeonS-creator/NOAH.git
NOAH uses NLP models to encode sentences and calculate their similarity. You can choose between a syntactic and a semantic model depending on your needs:
- Syntactic Model: Uses spaCy for faster, rule-based vectorization.
- Semantic Model: Uses Sentence Transformers for deeper, context-aware embeddings.
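The choice between the two models can be sketched as a pluggable encoder followed by a cosine similarity computation. This is an illustration under assumptions, not NOAH's own interface: `load_encoder` and `cosine_similarity_matrix` are hypothetical names, the model names `en_core_web_md` and `all-MiniLM-L6-v2` are example choices that must be installed or downloaded separately, and spaCy's `Doc.vector` (an average of token vectors) stands in for whatever syntactic features NOAH actually computes.

```python
import numpy as np


def load_encoder(kind):
    """Hypothetical helper: return a function mapping sentences to a 2-D embedding array."""
    if kind == "syntactic":
        import spacy  # imported lazily so only the chosen backend is required

        nlp = spacy.load("en_core_web_md")  # assumed model name
        return lambda sentences: np.stack([nlp(s).vector for s in sentences])
    if kind == "semantic":
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name
        return lambda sentences: np.asarray(model.encode(sentences))
    raise ValueError(f"unknown encoder kind: {kind!r}")


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Example usage (requires the chosen backend and model to be installed):
# encode = load_encoder("semantic")
# sims = cosine_similarity_matrix(encode(["A cat sat on the mat.", "Stocks fell today."]))
```

Entry `[i, j]` of the resulting matrix is the similarity between sentence `i` and sentence `j`; pairs with low values are candidates for negative sampling.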
We welcome contributions to NOAH! If you have suggestions for improvements or want to contribute code, please:
- Fork the repository.
- Create a new branch for your changes.
- Submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.