
Shekar

Simplifying Persian NLP for Everyone


Introduction

Shekar (meaning 'sugar' in Persian) is a Python library for Persian natural language processing, named after the influential satirical story "فارسی شکر است" (Persian is Sugar) published in 1921 by Mohammad Ali Jamalzadeh. The story became a cornerstone of Iran's literary renaissance, advocating for accessible yet eloquent expression.

Inspired by the story’s role in making Persian literature more relatable and expressive, Shekar aspires to democratize Persian natural language processing by offering a user-friendly yet powerful toolkit that captures the richness and elegance of the Persian language. Just as Jamalzadeh’s story bridged tradition and modernity, Shekar bridges the gap between technical complexity and linguistic accessibility, empowering developers and researchers to explore and innovate in Persian NLP with ease.

Installation

Install the package with pip:

pip install shekar

Usage

Normalization

from shekar.normalizers import Normalizer

normalizer = Normalizer()

text = "ۿدف ما ػمګ بۀ ێڪډيڱڕ أښټ"
text = normalizer.normalize(text)
print(text)  # Output: هدف ما کمک به یکدیگر است
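To illustrate the idea behind this step, here is a minimal, self-contained sketch of the kind of character mapping a Persian normalizer performs. This is not Shekar's implementation, and the mapping table below covers only a few well-known Arabic-to-Persian substitutions:

```python
# Toy character-normalization sketch (not Shekar's actual mapping table):
# map visually similar Arabic code points to their standard Persian forms.
PERSIAN_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic Yeh (ي) -> Persian Yeh (ی)
    "\u0643": "\u06A9",  # Arabic Kaf (ك) -> Persian Keheh (ک)
    "\u0629": "\u0647",  # Teh Marbuta (ة) -> Heh (ه)
})

def normalize_chars(text: str) -> str:
    """Replace common Arabic character variants with Persian equivalents."""
    return text.translate(PERSIAN_MAP)

print(normalize_chars("كتابي"))  # Output: کتابی
```

A full normalizer also handles diacritics, spacing, and digits; this sketch only shows the character-unification layer.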

Sentence Tokenization

Here is a simple example of sentence tokenization with the shekar package:

from shekar.tokenizers import SentenceTokenizer

text = "هدف ما کمک به یکدیگر است! ما می‌توانیم با هم کار کنیم."
tokenizer = SentenceTokenizer()
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)

# Output:
# هدف ما کمک به یکدیگر است!
# ما می‌توانیم با هم کار کنیم.
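Conceptually, sentence tokenization splits on sentence-final punctuation while keeping each mark attached to its sentence. The following is a minimal, self-contained sketch of that idea, not Shekar's implementation (which handles more edge cases such as abbreviations and quotations):

```python
import re

# Toy sentence splitter: break after Latin/Persian sentence-final punctuation
# (., !, ?, and the Arabic question mark ؟) followed by whitespace.
def split_sentences(text: str) -> list[str]:
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [p for p in parts if p]

text = "هدف ما کمک به یکدیگر است! ما می‌توانیم با هم کار کنیم."
for sentence in split_sentences(text):
    print(sentence)
# Output:
# هدف ما کمک به یکدیگر است!
# ما می‌توانیم با هم کار کنیم.
```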

Word Embeddings

A complete example of using embeddings:

from shekar.embeddings import Embedding

# Load pre-trained embeddings
embedding = Embedding(model_name="fasttext-d100-w10-cbow-blogs")

# Retrieve word vector
word = "کتاب"
vector = embedding[word]
print(f"Vector for {word}: {vector}")

# Find similar words
similar_words = embedding.most_similar(word, topn=5)
print(f"Words similar to {word}: {similar_words}")
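Under the hood, a `most_similar` query of this kind typically ranks vocabulary words by cosine similarity to the query vector. The sketch below illustrates that mechanism with made-up 3-dimensional vectors (real models such as the fastText embeddings above use 100 dimensions); it is an independent toy example, not Shekar's implementation:

```python
import math

# Made-up toy vectors for illustration only (real embeddings are learned).
vectors = {
    "کتاب": [0.9, 0.1, 0.2],  # "book"
    "دفتر": [0.8, 0.2, 0.1],  # "notebook"
    "درخت": [0.1, 0.9, 0.3],  # "tree"
}

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word: str, topn: int = 2) -> list[tuple[str, float]]:
    """Rank all other words by cosine similarity to `word`."""
    scores = [(w, cosine(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:topn]

print(most_similar("کتاب", topn=1))  # "دفتر" ranks closest to "کتاب"
```

With these toy vectors, "دفتر" (notebook) is nearest to "کتاب" (book), mirroring the semantic-neighborhood behavior of the real `most_similar` call.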