This package contains two scripts: one trains a text classification model with scikit-learn, and the other uses that model to power a lightweight web API that predicts the classification of text strings.
The first part of the package is generate_model.py, which takes as input a directory containing text files. Each text file consists of a series of lines, and each line should be a string that belongs to the category. The name of the category is taken from the filename (excluding .txt). The text files can be gzipped, in which case their names should end with .txt.gz.
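The input layout described above can be read with the standard library alone. The following is a minimal sketch, not the code in generate_model.py: the function name load_category_files is hypothetical, and it assumes UTF-8 files and that blank lines should be skipped.

```python
import gzip
from pathlib import Path


def load_category_files(directory):
    """Return (texts, labels) read from the .txt and .txt.gz files in a directory.

    Hypothetical helper: UTF-8 encoding and skipping blank lines are assumptions.
    """
    texts, labels = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        name = path.name
        # The category is the filename without its .txt or .txt.gz suffix.
        if name.endswith(".txt.gz"):
            category, opener = name[: -len(".txt.gz")], gzip.open
        elif name.endswith(".txt"):
            category, opener = name[: -len(".txt")], open
        else:
            continue  # ignore anything that is not a category file
        with opener(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                value = line.strip()
                if value:
                    texts.append(value)
                    labels.append(category)
    return texts, labels
```

The two lists line up one-to-one, which is the shape scikit-learn estimators expect for training samples and their targets.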
All the text files in the directory are read by the script, which then trains a text classifier and saves the model to model.pkl.gz (by default). If the filename for the model (given by the --output-file flag) ends in .gz then a gzipped version of the model is created (smaller for transport).
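The extension-based choice between a plain and a gzipped model file can be sketched as follows. This assumes the model is serialised with pickle, as the .pkl extension suggests; the actual script may use a different serialiser, and the names save_model and load_model are hypothetical.

```python
import gzip
import pickle


def save_model(model, output_file="model.pkl.gz"):
    """Pickle a model, gzipping it when the filename ends in .gz (assumed behaviour)."""
    opener = gzip.open if str(output_file).endswith(".gz") else open
    with opener(output_file, "wb") as handle:
        pickle.dump(model, handle)


def load_model(model_file):
    """Load a model written by save_model, using the same extension rule."""
    opener = gzip.open if str(model_file).endswith(".gz") else open
    with opener(model_file, "rb") as handle:
        return pickle.load(handle)
```

Because unpickling can execute arbitrary code, a model file should only be loaded if it comes from a trusted source.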
The second part takes the gzipped model from the classifier stage and loads it into a simple web API where queries can be passed to it. To run the server, the command is:
python server.py model.pkl.gz
where model.pkl.gz is the model created in the first part. This will run a server accessible on http://localhost:8080/ where you can produce predictions. The server has two endpoints:
- /predict_proba?q=XXXXXX gives a JSON object containing the predicted probabilities for each classification category for the string XXXXXX. You can do multiple predictions at once by giving multiple q= parameters, e.g. /predict_proba?q=XXXXXX&q=YYYYYYY.
- /predict?q=XXXXXX gives the most likely category for the string XXXXXX. If just one q= is given then the server returns the most likely category as a string. If more than one q= is provided then a JSON object with name:class pairs is returned instead.
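A small client for the two endpoints might look like the sketch below, written with the standard library only. The helper names are hypothetical, and it assumes that a single-query /predict response is a plain-text body rather than a JSON-quoted string.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # default address from the section above


def build_url(endpoint, queries, base_url=BASE_URL):
    """Build an endpoint URL with one q= parameter per query string."""
    query_string = urlencode([("q", q) for q in queries])
    return f"{base_url}/{endpoint}?{query_string}"


def predict(queries, base_url=BASE_URL):
    """Return a category string for one query, or a dict for several."""
    with urlopen(build_url("predict", queries, base_url)) as response:
        body = response.read().decode("utf-8")
    if len(queries) == 1:
        return body.strip()  # assumed to be plain text
    return json.loads(body)


def predict_proba(queries, base_url=BASE_URL):
    """Return the decoded JSON object of predicted probabilities."""
    with urlopen(build_url("predict_proba", queries, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, predict(["Oxfam GB"]) would request /predict?q=Oxfam+GB, and passing a longer list adds one q= parameter per string.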
To deploy the server with dokku, create and configure the app:
dokku apps:create org-type
dokku domains:enable org-type
dokku domains:add org-type orgtype.findthatcharity.uk
dokku config:set --no-restart org-type DOKKU_LETSENCRYPT_EMAIL=your@email.tld
dokku config:set org-type PREDICT_MODEL=model.pkl.gz