Debian 10
Python 3.7.3
sudo apt update
sudo apt install default-jre
java -version
sudo apt install default-jdk
javac -version
Append the following line to the ~/.bashrc file, replacing {N} (braces included) with the maximum heap size in GB
java -Xmx{N}g # e.g. -Xmx54g allows up to 54 GB of heap space
After loading nlp from spaCy, set max_length in the script so that long documents are not rejected
nlp.max_length = 10000000 # any number > 1000000 (spaCy's default limit, counted in characters)
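Instead of hard-coding a limit, the value can be derived from the text that is about to be processed. The following is a minimal sketch, not part of this repo: `raise_max_length` is a hypothetical helper name, and the stand-in object only mimics the `max_length` attribute of a loaded spaCy pipeline so the snippet runs without spaCy installed.

```python
from types import SimpleNamespace


def raise_max_length(nlp, text, margin=1):
    """Raise nlp.max_length just enough for `text` to fit; never lower it.

    `nlp` is any object with a `max_length` attribute, such as a loaded
    spaCy pipeline. `margin` is an arbitrary safety buffer (assumption).
    """
    needed = len(text) + margin
    if needed > nlp.max_length:
        nlp.max_length = needed
    return nlp.max_length


if __name__ == "__main__":
    # Stand-in for a spaCy pipeline, starting at spaCy's default limit.
    fake_nlp = SimpleNamespace(max_length=1_000_000)
    long_text = "a" * 2_500_000
    print(raise_max_length(fake_nlp, long_text))
```

In the real script the same call would take the loaded `nlp` object in place of the stand-in. Note that raising the limit also raises memory use for the parser and NER components.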
pip3 install numpy
pip3 install pandas
pip3 install nltk
pip3 install gensim
pip3 install pyLDAvis
pip3 install spacy
pip3 install matplotlib
pip3 install tika
pip3 install ipdb
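To confirm that the packages above are importable before running the pipeline, a quick check along these lines may help. It is a sketch only: `missing_packages` is a hypothetical helper, and it assumes each import name matches its pip name, which should hold for the nine packages listed.

```python
import importlib.util

# Import names, assumed identical to the pip package names above.
REQUIRED = [
    "numpy", "pandas", "nltk", "gensim", "pyLDAvis",
    "spacy", "matplotlib", "tika", "ipdb",
]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are installed.")
```

The check only locates the modules without importing them, so it stays fast even for heavy packages.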
Home dir:
cd LDA_Finance/
mv utils
bash unpack_data.sh # creates data dir and unzips the dataset there
bash mallet_installation.sh # downloads mallet and unzips it
Preprocess Data:
python3 preprocess_data.py --data_dir=/path/to/dir/ --lemmatized_data_dir=/path/to/lemmatized_data --book_path=/path/to/book.pdf
Pass absolute paths, starting from the home directory. Example:
# python3 preprocess_data.py --data_dir=/home/patelamal_01/LDA_repo/LDA_Finance/data --lemmatized_data_dir=/home/patelamal_01/LDA_repo/LDA_Finance/lemmatized_data --book_path=/home/patelamal_01/LDA_repo/LDA_Finance/Tidd_Innovation.pdf
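Since the scripts expect absolute paths, relative or `~`-prefixed paths can be expanded before the command is built. The sketch below is illustrative only: `absolute` and `preprocess_command` are hypothetical helpers, and the flag names are copied from the example command above rather than checked against the script's argument parser.

```python
import os
import shlex


def absolute(path):
    """Expand a leading ~ and return a normalized absolute path."""
    return os.path.abspath(os.path.expanduser(path))


def preprocess_command(data_dir, lemmatized_data_dir, book_path):
    """Build the preprocess_data.py argument list with absolute paths."""
    return [
        "python3",
        "preprocess_data.py",
        f"--data_dir={absolute(data_dir)}",
        f"--lemmatized_data_dir={absolute(lemmatized_data_dir)}",
        f"--book_path={absolute(book_path)}",
    ]


if __name__ == "__main__":
    cmd = preprocess_command("data", "lemmatized_data", "Tidd_Innovation.pdf")
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The same approach applies to the path arguments of the other two scripts.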
LDA Model:
python3 LDAMulticoreModel.py --model_name=ModelName --save_model_path=/path/to/save/model/ --lemmatized_data_path=/path/to/processed/lemmatized/data
Pass absolute paths, starting from the home directory. Example:
# python3 LDAMulticoreModel.py --model_name=LDA_MC_1 --save_model_path=/home/patelamal_01/LDA_repo/LDA_Finance/model/ --lemmatized_data_path=/home/patelamal_01/LDA_repo/LDA_Finance/lemmatized_data/
Analysis & Divergence:
python3 Analysis_Divergence.py --model_path=/path/to/model/modelname --lemmatized_data_path=/path/to/lemmatized/data/ --prob_file_path=/path/to/save/prob/csv/
Pass absolute paths, starting from the home directory. Example:
# python3 Analysis_Divergence.py --model_path=/home/patelamal_01/LDA_repo/LDA_Finance/model/LDA_MC_1/LDA_MC_1 --lemmatized_data_path=/home/patelamal_01/LDA_repo/LDA_Finance/lemmatized_data/ --prob_file_path=/home/patelamal_01/LDA_repo/LDA_Finance/probability/
Gensim Tutorials
Topic Modeling with Gensim
Topic Modeling with mallet