
Shekar

Simplifying Persian NLP for Everyone


Introduction

Shekar (meaning 'sugar' in Persian) is a Python library for Persian natural language processing, named after the influential satirical story "فارسی شکر است" (Persian is Sugar) published in 1921 by Mohammad Ali Jamalzadeh. The story became a cornerstone of Iran's literary renaissance, advocating for accessible yet eloquent expression.

Inspired by the story’s role in making Persian literature more relatable and expressive, Shekar aspires to democratize Persian natural language processing by offering a user-friendly yet powerful toolkit that captures the richness and elegance of the Persian language. Just as Jamalzadeh’s story bridged tradition and modernity, Shekar bridges the gap between technical complexity and linguistic accessibility, empowering developers and researchers to explore and innovate in Persian NLP with ease.

Installation

Install the package with pip:

pip install shekar

Usage

Normalization

from shekar.normalizers import Normalizer

normalizer = Normalizer()

text = "ۿدف ما ػمګ بۀ ێڪډيڱڕ أښټ"
text = normalizer.normalize(text)
print(text)  # Output: هدف ما کمک به یکدیگر است
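To illustrate the idea behind this step, here is a minimal, self-contained sketch of the kind of character mapping a Persian normalizer performs. This is not Shekar's implementation, and the mapping table below covers only a few well-known Arabic-to-Persian substitutions:

```python
# Toy character-normalization sketch (not Shekar's actual mapping table):
# map visually similar Arabic code points to their standard Persian forms.
PERSIAN_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic Yeh (ي) -> Persian Yeh (ی)
    "\u0643": "\u06A9",  # Arabic Kaf (ك) -> Persian Keheh (ک)
    "\u0629": "\u0647",  # Teh Marbuta (ة) -> Heh (ه)
})

def normalize_chars(text: str) -> str:
    """Replace common Arabic character variants with Persian equivalents."""
    return text.translate(PERSIAN_MAP)

print(normalize_chars("كتابي"))  # Output: کتابی
```

A full normalizer also handles diacritics, spacing, and digits; this sketch only shows the character-unification layer.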

Sentence Tokenization

Here is a simple example of sentence tokenization with the shekar package:

from shekar.tokenizers import SentenceTokenizer

text = "هدف ما کمک به یکدیگر است! ما می‌توانیم با هم کار کنیم."
tokenizer = SentenceTokenizer()
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)

# Output:
# هدف ما کمک به یکدیگر است!
# ما می‌توانیم با هم کار کنیم.
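Conceptually, sentence tokenization splits on sentence-final punctuation while keeping each mark attached to its sentence. The following is a minimal, self-contained sketch of that idea, not Shekar's implementation (which handles more edge cases such as abbreviations and quotations):

```python
import re

# Toy sentence splitter: break after Latin/Persian sentence-final punctuation
# (., !, ?, and the Arabic question mark ؟) followed by whitespace.
def split_sentences(text: str) -> list[str]:
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [p for p in parts if p]

text = "هدف ما کمک به یکدیگر است! ما می‌توانیم با هم کار کنیم."
for sentence in split_sentences(text):
    print(sentence)
# Output:
# هدف ما کمک به یکدیگر است!
# ما می‌توانیم با هم کار کنیم.
```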

Word Embeddings

A complete example of using embeddings:

from shekar.embeddings import Embedding

# Load pre-trained embeddings
embedding = Embedding(model_name="fasttext-d100-w10-cbow-blogs")

# Retrieve word vector
word = "کتاب"
vector = embedding[word]
print(f"Vector for {word}: {vector}")

# Find similar words
similar_words = embedding.most_similar(word, topn=5)
print(f"Words similar to {word}: {similar_words}")
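Under the hood, a `most_similar` query of this kind typically ranks vocabulary words by cosine similarity to the query vector. The sketch below illustrates that mechanism with made-up 3-dimensional vectors (real models such as the fastText embeddings above use 100 dimensions); it is an independent toy example, not Shekar's implementation:

```python
import math

# Made-up toy vectors for illustration only (real embeddings are learned).
vectors = {
    "کتاب": [0.9, 0.1, 0.2],  # "book"
    "دفتر": [0.8, 0.2, 0.1],  # "notebook"
    "درخت": [0.1, 0.9, 0.3],  # "tree"
}

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word: str, topn: int = 2) -> list[tuple[str, float]]:
    """Rank all other words by cosine similarity to `word`."""
    scores = [(w, cosine(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:topn]

print(most_similar("کتاب", topn=1))  # "دفتر" ranks closest to "کتاب"
```

With these toy vectors, "دفتر" (notebook) is nearest to "کتاب" (book), mirroring the semantic-neighborhood behavior of the real `most_similar` call.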