To install the active learning framework for natural language processing (ALNLP) of pathology reports:
-
Log in to Biowulf.
-
Go to the /data partition of Biowulf. For example, run the following command:
cd /data/$USER/export
-
Export the current working directory to the $alnlp_INSTALL variable. For example:
export alnlp_INSTALL=$(pwd)
Do this on Biowulf. (That is, not from a Biowulf compute node, where GitHub access is limited.)
-
Clone this repository:
cd $alnlp_INSTALL
git clone https://github.com/CBIIT/NCI-DOE-Collab-Pilot3-Active_learning_NLP.git
-
Allocate a compute node for the installation process:
sinteractive --mem=2g
-
Install the Miniconda package manager. Create and activate an alnlp environment:
conda env create -f environment.yml -n alnlp
conda activate alnlp
-
In your conda environment, download the required NLTK data from a Python session:
python
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('punkt')
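Before testing the installation, it can help to confirm that the directory layout matches the steps above. The following is a minimal sketch, not part of the repository: the function name check_install is hypothetical, and the only paths it assumes are the ones these instructions already mention (the cloned repository directory under $alnlp_INSTALL and experiments/experiment_001.py inside it).

```python
import os
from pathlib import Path

# Directory name created by "git clone", taken from the clone URL above.
REPO_NAME = "NCI-DOE-Collab-Pilot3-Active_learning_NLP"


def check_install(install_dir):
    """Return a list of problems found; an empty list means the layout looks right.

    Hypothetical helper for illustration only. It checks paths, not package versions.
    """
    problems = []
    repo = Path(install_dir) / REPO_NAME
    if not repo.is_dir():
        problems.append(f"repository not found at {repo}")
        return problems
    script = repo / "experiments" / "experiment_001.py"
    if not script.is_file():
        problems.append(f"example script not found at {script}")
    return problems


if __name__ == "__main__":
    install_dir = os.environ.get("alnlp_INSTALL")
    if install_dir is None:
        print("alnlp_INSTALL is not set; export it as described above")
    else:
        for message in check_install(install_dir) or ["layout looks OK"]:
            print(message)
```

Run it from the same shell in which you exported alnlp_INSTALL, so that the variable is visible to Python.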
To test the installation, run:
cd $alnlp_INSTALL/NCI-DOE-Collab-Pilot3-Active_learning_NLP/experiments
python experiment_001.py
The example script above runs the active learning loop for four logistic regression models, each using a different acquisition function, on the 20 Newsgroups dataset. In the loop's execute method, you can specify the percentage of the data to use for initial training, the size of the test set, and the number of new samples that each iteration of the loop selects for labeling. After execution, the script creates a report with all the results and plots in the given output folder. It also creates a sub-folder named after the script (experiment_001 in this case) to store the plots in PDF format.
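To illustrate what an acquisition function does in such a loop, here is a minimal, standard-library-only sketch. It is not the repository's code: the function names, the select_batch helper, and the sample probabilities are hypothetical, and the three strategies shown (least confidence, margin, entropy) are common uncertainty-based choices that may differ from the four used in experiment_001.py.

```python
import math


def least_confidence(probs):
    """One minus the top class probability; higher means more uncertain."""
    return 1.0 - max(probs)


def margin(probs):
    """Negated gap between the two most likely classes; higher means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return -(top - second)


def entropy(probs):
    """Shannon entropy of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_batch(pool_probs, acquisition, batch_size):
    """Return the indices of the batch_size pool samples with the highest score."""
    scores = [acquisition(p) for p in pool_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Hypothetical predicted class probabilities for four unlabeled samples,
# as a trained classifier might produce them for a three-class problem.
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
    [0.34, 0.33, 0.33],
]

for fn in (least_confidence, margin, entropy):
    print(fn.__name__, select_batch(pool_probs, fn, batch_size=2))
```

In each iteration of an active learning loop, the selected samples would be labeled, moved from the unlabeled pool into the training set, and the model retrained. Here batch_size plays the role of the number of new samples selected per iteration.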