- Generate word frequency lists from a corpus of old books on Wikisource (in progress).
- Understand the dic/aff format: chromium developers | Ubuntu manpages | Source documentation.
- Find a way to test word coverage, preferably in Firefox or LibreOffice Writer.
- Use Wikisource to classify words into parts of speech (helps with suffixes).
- Generate word frequency lists from the books proofread by bn.wikisource.
- Download the EPUB files by hand from Wikisource to here (machine downloads are not permitted).
- Convert them to txt using epub_to_txt.sh.
- Generate the most frequent words using word_frequency.py.
- Test word coverage using analyze.
- A post has been made on Wikisource requesting help with transcribing dictionaries.
- To view Bangla with joint glyphs (jukthakhor) in a terminal, use Konsole. Use a suitable font (I use MesloLGS NF) and enable Brahmic script characters as follows: Menu > Settings > Configure Konsole > Profiles > New Profile > Edit > Appearance > Complex Text Layout, then check Brahmic Script Characters.
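
The frequency and coverage steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the repository's word_frequency.py or the analyze tool: the function names `word_frequencies` and `coverage` are hypothetical, tokens are assumed to be runs of code points from the Bengali Unicode block, and coverage is measured against a plain set of known words without any affix expansion.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of code points in the Bengali Unicode
# block (U+0980 to U+09FF). ZWJ/ZWNJ handling is left out for brevity.
BANGLA_WORD = re.compile(r"[\u0980-\u09FF]+")


def word_frequencies(text):
    """Count every Bangla token found in the text."""
    return Counter(BANGLA_WORD.findall(text))


def coverage(freqs, known_words):
    """Fraction of running tokens whose surface form is in known_words."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    covered = sum(count for word, count in freqs.items() if word in known_words)
    return covered / total


# Tiny made-up sample; a real run would read the converted txt files.
sample = "আমি বই কিনি। আমি গান শুনি।"
freqs = word_frequencies(sample)
print(freqs.most_common(3))
print(coverage(freqs, {"আমি", "বই"}))
```

Weighting coverage by token count rather than by distinct word means that a dictionary covering the most frequent words scores well even if it misses many rare ones, which is probably the more useful measure for a spell checker.
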

Most of the .dic and .aff files have been extracted and placed in the resources folder. To open any such plugin for Firefox, Thunderbird or LibreOffice, use any archive manager. The Bangla Akademi word list published by SNLTR is in .doc format; it has been converted to .csv for better utility. Other than that, their dictionaries mainly use only the .dic file, so they do not take advantage of the .aff file for compression and hence have very low coverage. However, I am not well versed enough in Java to understand what they are doing with that plugin. Anyhow, the most important resource of all is the .dic and .aff files from Bangla Type Foundry. They have done a tremendous job of embedding the grammar rules of the Bangla language into the dic-aff format. The idea would be to create a bn-in dictionary following those methods, taking into account the old words (suddo).
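
To make the compression point concrete, here is a rough sketch of how suffix rules in an .aff file let one .dic entry stand for several surface forms. It assumes the Hunspell-style layout (`SFX flag strip add condition` rule lines, and `stem/flags` entries after a count line in the .dic). The flag `A`, the two suffix rules and the two stems are made up for illustration and are not taken from the Bangla Type Foundry files. Conditions other than `.` and multi-character flags are ignored here.

```python
def parse_suffix_rules(aff_text):
    """Collect SFX rule lines as {flag: [(strip, add), ...]}."""
    rules = {}
    for line in aff_text.splitlines():
        parts = line.split()
        # Rule lines have five fields: SFX flag strip add condition.
        # Header lines (SFX flag Y count) have only four and are skipped.
        if len(parts) >= 5 and parts[0] == "SFX":
            _, flag, strip, add, _condition = parts[:5]
            strip = "" if strip == "0" else strip
            add = "" if add == "0" else add
            rules.setdefault(flag, []).append((strip, add))
    return rules


def expand(dic_text, rules):
    """Return every surface form generated by the .dic entries."""
    forms = set()
    # The first line of a .dic file holds the approximate entry count.
    for entry in dic_text.splitlines()[1:]:
        entry = entry.strip()
        if not entry:
            continue
        stem, _, flags = entry.partition("/")
        forms.add(stem)
        for flag in flags:  # simplification: single-character flags only
            for strip, add in rules.get(flag, []):
                if stem.endswith(strip):
                    forms.add(stem[: len(stem) - len(strip)] + add)
    return forms


# Hypothetical rules and entries, for illustration only.
aff_text = "SFX A Y 2\nSFX A 0 ের .\nSFX A 0 কে ."
dic_text = "2\nঘর/A\nজল"

forms = expand(dic_text, parse_suffix_rules(aff_text))
print(sorted(forms))
```

With these made-up rules, the flagged stem yields three forms while the unflagged one yields only itself, which is why a .dic file used without affix flags needs far more entries to reach the same coverage.
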