This is a rule-based morphological analyzer for Albanian (sqi). It is based on a formalized description of literary Albanian morphology, which also includes a number of dialectal (Gheg) elements, and uses uniparser-morph for parsing. It performs full morphological analysis of Albanian words (lemmatization, POS tagging, grammatical tagging).
The analyzer is available as a Python package. If you want to analyze Albanian texts in Python, install the module:
pip3 install uniparser-albanian
Import the module and create an instance of the AlbanianAnalyzer class. Set mode='strict' if you are going to process text in standard orthography, or mode='nodiacritics' if you expect some words to lack diacritics (c instead of ç and e instead of ë). After that, you can either parse tokens or lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:
from uniparser_albanian import AlbanianAnalyzer
a = AlbanianAnalyzer(mode='strict')
analyses = a.analyze_words('Morfologjinë')
# The parser is initialized before first use, so expect
# some delay here (usually several seconds)
# You will get a list of Wordform objects
# The analysis attributes are stored in their properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm)
# You can also pass lists (even nested lists) and specify
# output format ('xml', 'json' or 'conll')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['i'], ['Të', 'dua', '.']],
                           format='xml')
analyses = a.analyze_words([['i'], ['Të', 'dua', '.']],
                           format='conll')
analyses = a.analyze_words(['Morfologjinë', [['i'], ['Të', 'dua', '.']]],
                           format='json')
Refer to the uniparser-morph documentation for the full list of options.
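To make the difference between the two modes more concrete, here is a minimal sketch of the spelling correspondence that mode='nodiacritics' is meant to cover (c for ç, e for ë). It is only an illustration and says nothing about how the analyzer handles this internally; the helper names fold_diacritics and index_by_folded_form are hypothetical and not part of the package.

```python
from collections import defaultdict

# Map the two Albanian diacritic letters to their plain counterparts.
FOLD = str.maketrans({"ç": "c", "ë": "e", "Ç": "C", "Ë": "E"})


def fold_diacritics(word):
    """Return the spelling a word would have if its diacritics were dropped."""
    return word.translate(FOLD)


def index_by_folded_form(words):
    """Group standard spellings by their diacritics-free spelling."""
    index = defaultdict(set)
    for word in words:
        index[fold_diacritics(word)].add(word)
    return dict(index)


# A tiny illustrative word list in standard orthography.
words = ["është", "çaj", "dua"]
index = index_by_folded_form(words)

for folded, originals in sorted(index.items()):
    print(folded, "<-", sorted(originals))
```

Several standard spellings can collapse to the same diacritics-free form, so an input word without diacritics may match more candidates than it would in strict mode. If your texts use standard orthography throughout, mode='strict' is probably the safer choice.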
Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from a 31-million-word Albanian corpus (wordlist.csv) with 456,000 unique tokens, a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses for one word in an XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer on the corpus texts is about 93% and the corpus is sufficiently large, so if you just use the analyzed word list, the recall on your texts will probably exceed 90%.
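If you go the word list route, a lookup table can be built from wordlist_analyzed.txt with the standard library alone. The sketch below assumes that each line is a single w element containing one ana element per analysis, with the lemma in a lex attribute and the grammatical tags in a gr attribute, followed by the word form itself. This layout is an assumption, so check a few lines of the actual file before relying on it. The sample line and its tags are made up for illustration, and parse_analyzed_line is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def parse_analyzed_line(line):
    """Parse one XML line into (wordform, [(lemma, gramm), ...]).

    Assumed layout (verify against the real file):
    <w><ana lex="..." gr="..."></ana>...wordform</w>
    """
    root = ET.fromstring(line.strip())
    # The word form is the text content of <w> outside the <ana> elements.
    wordform = "".join(root.itertext()).strip()
    analyses = [(ana.get("lex", ""), ana.get("gr", ""))
                for ana in root.iter("ana")]
    return wordform, analyses


# Made-up sample line with placeholder tags, not copied from the real file.
sample = ('<w><ana lex="dua" gr="TAG1,TAG2"></ana>'
          '<ana lex="dua" gr="TAG3"></ana>dua</w>')

wordform, analyses = parse_analyzed_line(sample)
print(wordform, analyses)
```

To cover a whole file, you could call parse_analyzed_line on every non-empty line and store the results in a dictionary keyed by word form. Tokens missing from that dictionary are the ones the analyzer did not cover.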
The description is written in the uniparser-morph format and consists of a description of the inflection (paradigms.txt), a grammatical dictionary (sqi_lexemes_XXX.txt files), a list of productive lemma-changing derivations (derivations.txt), and a short list of analyses that should be avoided (bad_analyses.txt). The dictionary contains descriptions of individual lexemes, each of which is accompanied by information about its stem, its part-of-speech tag and some other grammatical/dialectal information, its inflectional type (paradigm), and an English translation. See more about the format in the uniparser-morph documentation.