Note: This document describes the latest release version of `jiant`. Additional unreleased changes may be available on the GitHub master branch.
To find the setup instructions for using `jiant` and to run a simple example demo experiment using data from GLUE, follow this getting started tutorial!
We currently support the tasks below, plus several more documented only in the code:
- All GLUE tasks (downloadable here)
- All SuperGLUE tasks (downloadable here)
- DisSent: Details for preparing the data are in `scripts/dissent/README`.
- CCG: Details for preparing the data are in `scripts/ccg/README`.
- SWAG: The data can be downloaded from the SWAG website.
- QA-SRL: The data can be downloaded using the script provided here. The resulting data folder `qasrl-v2` should be renamed to `QASRL`.
Data files should be placed in the directory specified by `data_dir`, in a subdirectory corresponding to the task, as specified in the task definition (see `jiant/tasks`). The GLUE and SuperGLUE download scripts should create acceptable directories automatically.
To add a new task, refer to this tutorial!
All model configuration is handled through the config file system and the `--overrides` flag, but there are also a few command-line arguments that control the behavior of `main.py`. In particular (a combined usage sketch follows this list):
- `--tensorboard` (or `-t`): use this to run a TensorBoard server while the trainer is running, serving on the port specified by `--tensorboard_port` (default: `6006`). The trainer will write event data even if this flag is not used, and you can run TensorBoard separately with `tensorboard --logdir <exp_dir>/<run_name>/tensorboard`.
- `--notify <email_address>`: use this to enable notification emails via SendGrid. You'll need to make an account and set the `SENDGRID_API_KEY` environment variable to contain (the text of) the client secret key.
- `--remote_log` (or `-r`): use this to enable remote logging via Google Stackdriver. You can set up credentials and set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable; see Stackdriver Logging Client Libraries.
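For example, a combined invocation might look like the following sketch. It assumes the `--config_file` flag and `config/defaults.conf` as a base; the experiment and run names are placeholders.

```bash
# Sketch: launch a run from a base config, override a couple of options,
# and serve TensorBoard on a non-default port while training.
python main.py \
    --config_file config/defaults.conf \
    --overrides "exp_name = my_exp, run_name = run1" \
    --tensorboard --tensorboard_port 6007
```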
Models in `jiant` generally have three components: a shared input component (typically a word embedding layer, or a pretrained ELMo, GPT, BERT, or XLNet model), an optional shared general-purpose encoder component that sits on top of the input component (typically a BiLSTM trained from scratch within `jiant`, specified by `sent_enc`), and task-specific components for each task.
We use the ELMo implementation provided by AllenNLP. To use ELMo, set `input_module = elmo`. By default, AllenNLP will download and cache the pretrained ELMo weights. If you want to use a particular file containing ELMo weights, set `elmo_weight_file_path = path/to/file`. To use only the character-level CNN word encoder from ELMo, set `elmo_chars_only = 1`.
We use the CoVe implementation provided here. To use CoVe, clone the repo, set the option `path_to_cove = "/path/to/cove/repo"`, and set `cove = 1`.
Download the pretrained fastText vectors located here, preferably the 300-dimensional Common Crawl vectors. Set `word_embs_file` to point to the .vec file. Lastly, set `input_module = fastText`.
To use GloVe pretrained word embeddings, download and extract the relevant files and set `word_embs_file` to the GloVe file. Lastly, set `input_module = glove`.
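For example, a sketch of selecting GloVe from the command line via `--overrides`; the embedding file path is a placeholder:

```bash
# Sketch: point word_embs_file at the extracted GloVe vectors and select
# the glove input module. The path below is a placeholder.
python main.py \
    --config_file config/defaults.conf \
    --overrides "input_module = glove, word_embs_file = /path/to/glove/vectors.txt"
```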
To use BERT, XLNet, GPT, GPT-2, Transformer-XL, XLM, or RoBERTa, set `input_module` to one of the relevant model names summarized in defaults.conf and listed in full here, e.g. `bert-base-cased`. We generally follow the procedures set out in the original works as closely as possible: for single-sentence tasks, we add special boundary tokens to the sentence. For sentence-pair tasks, we concatenate the sentences, with the special boundary and separator tokens specified in the original work (e.g., for BERT, `[CLS]` and `[SEP]`). If you choose `pool_type = auto`, we take the representation from the designated location (e.g., for BERT, the first token, where `[CLS]` resides) as the representation of the entire sequence. We also support the version of Adam that was used in training BERT (`optimizer = bert_adam`). When using these models, it is preferable to set `tokenizer = auto`.
`copa_bert.conf` shows an example setup using BERT on a single task, and can serve as a reference.
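A minimal command-line sketch along these lines, assuming the `--config_file` flag; the target task name here is an illustrative placeholder, and `copa_bert.conf` remains the better reference:

```bash
# Sketch: skip pretraining and train bert-base-cased on a single target
# task, with the matching tokenizer and the BERT variant of Adam.
python main.py \
    --config_file config/defaults.conf \
    --overrides "input_module = bert-base-cased, tokenizer = auto, pool_type = auto, optimizer = bert_adam, do_pretrain = 0, target_tasks = copa"
```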
To use the ON-LSTM sentence encoder from Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks, set `sent_enc = onlstm`. To re-run experiments from the paper on WSJ language modeling, use the configuration file config/onlstm.conf. Specific ON-LSTM modules use code from the GitHub implementation of the paper.
To use the PRPN sentence encoder from Neural language modeling by jointly learning syntax and lexicon, set `sent_enc = prpn`. To re-run experiments from the paper on WSJ language modeling, use the configuration file config/prpn.conf. Specific PRPN modules use code from the GitHub implementation of the paper.
Task-specific components include logistic regression and multi-layer perceptron for classification and regression tasks, and an RNN decoder with attention for sequence transduction tasks. To see the full set of available params, see config/defaults.conf. For a list of options affecting the execution pipeline (which configuration file to use, whether to enable remote logging or Tensorboard, etc.), see the arguments section in main.py.
The standard trainer is designed around sampling-based multi-task training. At each step, a task is sampled and one step of training is run on that task.
The trainer evaluates the model on the validation data after a fixed number of gradient steps, set by `val_interval`. The learning rate is scheduled to decay by `lr_decay_factor` (default: 0.5) whenever the validation score doesn't improve after `lr_patience` (default: 1) validation checks.
If you're training only on one task, you don't need to worry about this. You'll still see macro-average and micro-average performance statistics, but these will simply be repetitions of the results for your single task.
If you are training on multiple tasks, you can vary the sampling weights with `weighting_method`, e.g. `weighting_method = uniform` or `weighting_method = proportional` (proportional to the amount of training data). You can also scale the losses of each minibatch via `scaling_method` if you want to weight tasks with different amounts of training data equally throughout training.
We use a shared global optimizer and LR scheduler for all tasks. In the global case, we use the macro average of each task's validation metric to do LR scheduling and early stopping. When doing multi-task training, if at least one task's validation metric should decrease (e.g. perplexity), we invert those tasks' metrics by instead averaging `1 - (val_metric / dec_val_scale)`, so that the macro average is well-behaved.
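As a sketch, the sampling and scheduling options above can be combined in a single override string; the values here are illustrative, not recommendations:

```bash
# Sketch: sample tasks proportionally to training-set size, validate every
# 1000 gradient steps, and spell out the default LR-decay settings.
python main.py \
    --config_file config/defaults.conf \
    --overrides "weighting_method = proportional, val_interval = 1000, lr_patience = 1, lr_decay_factor = 0.5"
```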
Within a run, training is divided into a pretraining phase and a target-training phase. In the pretraining phase, the `pretrain_tasks` are trained in a multi-task fashion. In the target-training phase, each task is trained one at a time, and there is no shared training of the encoder. Specify pretraining tasks with `pretrain_tasks = $pretrain_tasks`, where `$pretrain_tasks` is a comma-separated list of task names; similarly, use `target_tasks` to specify the tasks to target-train on.
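For example, a sketch with placeholder task names (any supported task names can be substituted):

```bash
# Sketch: multi-task pretrain on two tasks, then target-train on two others.
python main.py \
    --config_file config/defaults.conf \
    --overrides "pretrain_tasks = \"mnli,qqp\", target_tasks = \"rte,sst\""
```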
The most extensive way to use `jiant` is to pretrain on a set of tasks before training on target tasks. In this setup, the best model from the pretraining stage is loaded and used to train each of the target tasks. The shared sentence encoder can either be frozen or fine-tuned, controlled by `transfer_paradigm`: `transfer_paradigm = finetune` will train the shared encoder alongside the task-specific parts of the model, whereas `transfer_paradigm = frozen` will only train the target-task-specific components while training for a target task.
You can control which steps are performed or skipped by setting the flags `do_pretrain`, `do_target_task_training`, and `do_full_eval`.
More specifically:
- If you would like to simply multi-task train on a set of tasks, set `do_pretrain = 1` and `do_target_task_training = 0`.
- If you would like to train an encoder on a set of tasks sequentially without sharing the encoder, set `do_pretrain = 0` and `do_target_task_training = 1`.
- If you would like to evaluate on some tasks based on a previously trained `jiant` model, set `load_eval_checkpoint` to the path of that model checkpoint, and then set `do_pretrain = 0`, `do_target_task_training = 0`, and `do_full_eval = 1` (see the sketch after this list).
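A sketch of the evaluation-only case from the last bullet; the checkpoint path is a placeholder:

```bash
# Sketch: evaluate a previously trained jiant model without any further
# pretraining or target-task training.
python main.py \
    --config_file config/defaults.conf \
    --overrides "load_eval_checkpoint = /path/to/saved/model_checkpoint.th, do_pretrain = 0, do_target_task_training = 0, do_full_eval = 1"
```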
If using ELMo and `sep_embs_for_skip = 1`, we will also learn a task-specific set of ELMo's layer-mixing weights.
`copa_bert.conf` shows an example setup using a single task without pretraining, and can serve as a reference.
Because preprocessing is expensive (e.g. building vocab and indexing for very large tasks like WMT or BWB), we often want to run multiple experiments using the same preprocessing. So, we group runs that use the same preprocessing in a single experiment directory (set using the `exp_dir` flag), in which we store all shared preprocessing objects. Later runs will load the stored preprocessing. We write run-specific information (logs, saved models, etc.) to a run-specific directory (set using the `run_dir` flag), usually nested in the experiment directory. Experiment directories are written in `project_dir`. Overall, the directory structure looks like this:
project_dir # directory for all experiments using jiant
|-- exp1/ # directory for a set of runs training and evaluating on FooTask and BarTask
| |-- preproc/ # shared indexed data of FooTask and BarTask
| |-- vocab/ # shared vocabulary built from examples from FooTask and BarTask
| |-- FooTask/ # shared FooTask class object
| |-- BarTask/ # shared BarTask class object
| |-- run1/ # run directory with some hyperparameter settings
| |-- run2/ # run directory with some different hyperparameter settings
| |
| [...]
|
|-- exp2/ # directory for a different set of runs, potentially using a different branch of the code
| |-- preproc/
| |-- vocab/
| |-- FooTask/
| |-- BazTask/
| |-- run1/
| |
| [...]
|
[...]
You should also set the `data_dir` and `word_embs_file` options to point to the directories containing the data (e.g. the output of the `scripts/download_glue_data` script) and word embeddings (optional, not needed when using ELMo; see later sections), respectively.
To force rereading and reloading of the tasks, perhaps because you changed the format or preprocessing of a task, delete the objects in the directories named for the tasks (e.g. `QQP/`) or use the option `reload_tasks = 1`.
To force rebuilding of the vocabulary, perhaps because you want to include vocabulary for more tasks, delete the objects in `vocab/` or use the option `reload_vocab = 1`.
To force reindexing of a task's data, delete some or all of the objects in `preproc/`, or use the option `reload_index = 1` and set `reindex_tasks` to the names of the tasks to be reindexed, e.g. `reindex_tasks = "sst,mnli"`. You should do this whenever you rebuild the task objects or vocabularies.
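For example, a sketch of forcing two tasks to be reindexed on the next run:

```bash
# Sketch: rebuild the cached index for the sst and mnli tasks only.
python main.py \
    --config_file config/defaults.conf \
    --overrides "reload_index = 1, reindex_tasks = \"sst,mnli\""
```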
If you use `jiant` in academic work, please cite it directly:
@misc{wang2019jiant,
author = {Alex Wang and Ian F. Tenney and Yada Pruksachatkun and Katherin Yu and Jan Hula and Patrick Xia and Raghu Pappagari and Shuning Jin and R. Thomas McCoy and Roma Patel and Yinghui Huang and Jason Phang and Edouard Grave and Haokun Liu and Najoung Kim and Phu Mon Htut and Thibault F\'{e}vry and Berlin Chen and Nikita Nangia and Anhad Mohananey and Katharina Kann and Shikha Bordia and Nicolas Patry and David Benton and Ellie Pavlick and Samuel R. Bowman},
title = {\texttt{jiant} 1.2: A software toolkit for research on general-purpose text understanding models},
howpublished = {\url{http://jiant.info/}},
year = {2019}
}
`jiant` has been used in these three papers so far:
- Looking for ELMo's Friends: Sentence-Level Pretraining Beyond Language Modeling
- What do you learn from context? Probing for sentence structure in contextualized word representations ("Edge Probing")
- Probing What Different NLP Tasks Teach Machines about Function Word Comprehension
To exactly reproduce experiments from the ELMo's Friends paper, use the `jsalt-experiments` branch. That contains a snapshot of the code as of early August, potentially with updated documentation.
For the edge probing paper, see the probing/ directory.
Releases are identified using git tags and distributed via PyPI for pip installation. After passing CI tests and creating a new git tag for a release, it can be uploaded to PyPI by running:
# create distribution
python setup.py sdist bdist_wheel
# upload to PyPI
python -m twine upload dist/*
More details can be found in setup.py.
This package is released under the MIT License. The material in the allennlp_mods directory is based on AllenNLP, which was originally released under the Apache 2.0 license.
I'm seeing gcc++ errors when using conda to set up my environment.
Installing AllenNLP, which we build on, requires a working C++ compiler. See advice here (MacOS Mojave only) or here.
I'm seeing `ModuleNotFoundError: No module named 'src'` or `ImportError: bad magic number` when starting a run.
This will occur if you try to reuse preprocessed files from `jiant` 0.9 after upgrading to a newer version. Delete your experiment directories and try again, or see the question immediately below for more information.
I changed/updated the code, and my experiment broke with errors that don't seem related to the change. What should I do?
Our preprocessing pipeline relies on Python pickles to store some intermediate data, and the format of these pickles is closely tied to the internal structure of some of our code. Because of this, you may see a variety of strange errors if you try to use preprocessed data from an old experiment that was created with a previous version of the code.
To work around this, try your experiment again without the old preprocessed data. If you don't need your old logs or checkpoints, simply delete your experiment directory (`$JIANT_PROJECT_PREFIX/your_experiment_name_here`) or move to a new one (by changing or overriding the `exp_name` config option). If you'd like to save as much old data as possible, try deleting the `tasks` subdirectory of your experiment directory; if that doesn't work, try deleting `preproc` and `vocab` as well.
It seems like my `preproc/{task}__{split}.data` file has nothing in it!
This probably means that you ran the script before downloading the data for that task. Delete the file from `preproc/` and then run `main.py` again to build the data splits from scratch.
How can I pass BERT embeddings straight to the classifier without a sentence encoder?
Right now, you need to set `skip_embs = 1` and `sep_embs_for_skip = 1`, simply because of the way our current logic works. We're currently streamlining the logic around `sep_embs_for_skip`.
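As a sketch, those two options can be set together in an override (the input module choice here is just an example):

```bash
# Sketch: pass BERT embeddings straight to the task-specific classifier.
python main.py \
    --config_file config/defaults.conf \
    --overrides "input_module = bert-base-cased, skip_embs = 1, sep_embs_for_skip = 1"
```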
How can I do STILTS-style training?
For a typical STILTs experiment on top of BERT, GPT, ELMo, or some other supported pretrained encoder, you can simply start from an effective configuration like `config/superglue_bert.conf` and set `pretrain_tasks` to your intermediate task and `target_tasks` to your target task.
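For example, a sketch with placeholder intermediate and target tasks:

```bash
# Sketch: STILTs-style training on top of the superglue_bert configuration,
# using MNLI as the intermediate task and RTE as the target task.
python main.py \
    --config_file config/superglue_bert.conf \
    --overrides "pretrain_tasks = mnli, target_tasks = rte"
```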
Right now, we only support training in two stages, so if you'd like to do the initial pretraining stage from scratch, things get more complicated. Training in more than two stages is possible, but will require you to divide your training up into multiple runs. For instance, assume you want to run multi-task training on task set A, then train on task set B, and finally fine-tune on task set C. You would perform the following (a sketch follows the list):
- First run: pretrain on task set A:
  - `pretrain_tasks = "task_a1,task_a2"`, `target_tasks = ""`
- Second run: load the checkpoint, and train on task set B and then C:
  - `load_model = 1`
  - `load_target_train_checkpoint_arg = /path/to/saved/run`
  - `pretrain_tasks = "task_b1,task_b2"`, `target_tasks = "task_c1,task_c2"`
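A command-line sketch of those two runs; the paths, run names, and task names are placeholders, and the option names follow the list above:

```bash
# Run 1 (sketch): multi-task pretrain on task set A only.
python main.py \
    --config_file config/defaults.conf \
    --overrides "run_name = stage_a, pretrain_tasks = \"task_a1,task_a2\", target_tasks = \"\""

# Run 2 (sketch): load the run-1 checkpoint, pretrain on task set B,
# then target-train on task set C.
python main.py \
    --config_file config/defaults.conf \
    --overrides "run_name = stage_bc, load_model = 1, load_target_train_checkpoint_arg = /path/to/stage_a/run, pretrain_tasks = \"task_b1,task_b2\", target_tasks = \"task_c1,task_c2\""
```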
Post an issue here on GitHub if you have any problems, and create a pull request if you make any improvements (substantial or cosmetic) to the code that you're willing to share.