Simplifying Persian NLP for Everyone
Shekar (meaning 'sugar' in Persian) is a Python library for Persian natural language processing, named after the influential satirical story "فارسی شکر است" (Persian is Sugar) published in 1921 by Mohammad Ali Jamalzadeh. The story became a cornerstone of Iran's literary renaissance, advocating for accessible yet eloquent expression.
Inspired by the story’s role in making Persian literature more relatable and expressive, Shekar aspires to democratize Persian natural language processing by offering a user-friendly yet powerful toolkit that captures the richness and elegance of the Persian language. Just as Jamalzadeh’s story bridged tradition and modernity, Shekar bridges the gap between technical complexity and linguistic accessibility, empowering developers and researchers to explore and innovate in Persian NLP with ease.
To install the package, use pip. Run the following command:
pip install shekar
The Normalizer converts non-standard character variants into standard Persian text:
from shekar.normalizers import Normalizer
normalizer = Normalizer()
text = "ۿدف ما ػمګ بۀ ێڪډيڱڕ أښټ"
text = normalizer.normalize(text) # Output: هدف ما کمک به یکدیگر است
print(text)
هدف ما کمک به یکدیگر است
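To give a feel for what character normalization involves, here is a minimal sketch in plain Python. It is not Shekar's implementation: the two-entry mapping and the `normalize_chars` helper are hypothetical and only illustrate the idea of replacing Arabic-script variants with their standard Persian counterparts. A real normalizer such as the one above covers far more characters.

```python
# Illustrative only: a tiny character map, NOT Shekar's actual rule set.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh (ي) -> Persian yeh (ی)
    "\u0643": "\u06A9",  # Arabic kaf (ك) -> Persian kaf (ک)
})


def normalize_chars(text: str) -> str:
    """Replace the mapped Arabic-script variants with standard Persian forms."""
    return text.translate(ARABIC_TO_PERSIAN)


# "كتاب علي" typed with Arabic kaf and Arabic yeh
sample = "\u0643\u062A\u0627\u0628 \u0639\u0644\u064A"
print(normalize_chars(sample))
```

Because `str.translate` works one code point at a time, this approach only handles one-to-one character substitutions; anything beyond that (spacing, diacritics, multi-character rules) needs additional steps.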
Here is a simple example of how to use the shekar package to split text into sentences:
from shekar.tokenizers import SentenceTokenizer
text = "هدف ما کمک به یکدیگر است! ما میتوانیم با هم کار کنیم."
tokenizer = SentenceTokenizer()
sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
هدف ما کمک به یکدیگر است!
ما میتوانیم با هم کار کنیم.
A complete example of using embeddings:
from shekar.embeddings import Embedding
# Load pre-trained embeddings
embedding = Embedding(model_name="fasttext-d100-w10-cbow-blogs")
# Retrieve word vector
word = "کتاب"
vector = embedding[word]
print(f"Vector for {word}: {vector}")
# Find similar words
similar_words = embedding.most_similar(word, topn=5)
print(f"Words similar to {word}: {similar_words}")
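Similar-word lookups over word embeddings are commonly based on cosine similarity between vectors. The sketch below shows that general technique with toy two-dimensional vectors. It assumes nothing about how Shekar's `most_similar` is implemented: `cosine_similarity`, `rank_similar`, and `toy_vectors` are hypothetical names and made-up data used only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def rank_similar(word, vectors, topn=2):
    """Return the topn other words, ordered by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors; real embeddings have many more dimensions.
toy_vectors = {
    "book": [1.0, 0.0],
    "notebook": [0.9, 0.1],
    "river": [0.0, 1.0],
}

print(rank_similar("book", toy_vectors))
```

With real embeddings the same ranking is usually computed with vectorized operations over the whole vocabulary rather than a Python loop.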