TextWiser (AAAI'21) is a research library that provides a unified framework for text featurization based on a rich set of methods while taking advantage of pretrained models provided by the state-of-the-art libraries.
The main contributions include:
- Rich Set of Embeddings: A wide range of available embeddings and transformations to choose from.
- Fine-Tuning: Designed to support a PyTorch backend, and hence, retains the ability to fine-tune featurizations for downstream tasks. That means, if you pass the resulting fine-tunable embeddings to a training method, the features will be optimized automatically for your application.
- Parameter Optimization: Interoperable with the standard scikit-learn pipeline for hyper-parameter tuning and rapid experimentation. All underlying parameters are exposed to the user.
- Grammar of Embeddings: Introduces a novel approach to design embeddings from components. The compound embedding allows forming arbitrarily complex embeddings in accordance with a context-free grammar that defines a formal language for valid text featurization.
- GPU Native: Built with GPUs in mind. If it detects available hardware, the relevant models are automatically placed on the GPU.
TextWiser is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments. Documentation is available at fidelity.github.io/textwiser. A video of the paper presentation at AAAI 2021 is also available.
```python
# Conceptually, TextWiser is composed of an Embedding, potentially with a pretrained model,
# that can be chained into zero or more Transformations
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Data
documents = ["Some document", "More documents. Including multi-sentence documents."]

# Model: TFIDF `min_df` parameter gets passed to sklearn automatically
emb = TextWiser(Embedding.TfIdf(min_df=1))

# Model: TFIDF followed with an NMF + SVD
emb = TextWiser(Embedding.TfIdf(min_df=1), [Transformation.NMF(n_components=30), Transformation.SVD(n_components=10)])

# Model: Word2Vec with no pretraining that learns from the input data
emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained=None), Transformation.Pool(pool_option=PoolOptions.min))

# Model: BERT with the pretrained bert-base-uncased embedding
emb = TextWiser(Embedding.Word(word_option=WordOptions.bert), Transformation.Pool(pool_option=PoolOptions.first))

# Features
vecs = emb.fit_transform(documents)
```
Embeddings | Notes |
---|---|
Bag of Words (BoW) | Supported by `scikit-learn`. Defaults to training from scratch. |
Term Frequency Inverse Document Frequency (TfIdf) | Supported by `scikit-learn`. Defaults to training from scratch. |
Document Embeddings (Doc2Vec) | Supported by `gensim`. Defaults to training from scratch. |
Universal Sentence Encoder (USE) | Supported by `tensorflow` (see requirements). Defaults to large v5. |
Compound Embedding | Supported by a context-free grammar. |
Word Embedding: Word2Vec | Supported by these pretrained embeddings. Common pretrained options include `crawl`, `glove`, `extvec`, `twitter`, and `en-news`. When the pretrained option is `None`, trains a new model from the given data. Defaults to `en`, FastText embeddings trained on news. |
Word Embedding: Character | Initialized randomly and not pretrained. Useful when trained for a downstream task. Enable fine-tuning to get good embeddings. |
Word Embedding: BytePair | Supported by these pretrained embeddings. Pretrained options can be specified with the string `<lang>_<dim>_<vocab_size>`. Default options can be omitted, as in `en`, `en_100`, or `en__10000`. Defaults to `en`, which is equal to `en_100_10000`. |
Word Embedding: ELMo | Supported by these pretrained embeddings from TensorflowHub. Defaults to `original`. |
Word Embedding: Flair | Supported by these pretrained embeddings. Defaults to `news-forward-fast`. |
Word Embedding: BERT | Supported by these pretrained embeddings. Defaults to `bert-base-uncased`. |
Word Embedding: OpenAI GPT | Supported by these pretrained embeddings. Defaults to `openai-gpt`. |
Word Embedding: OpenAI GPT2 | Supported by these pretrained embeddings. Defaults to `gpt2-medium`. |
Word Embedding: TransformerXL | Supported by these pretrained embeddings. Defaults to `transfo-xl-wt103`. |
Word Embedding: XLNet | Supported by these pretrained embeddings. Defaults to `xlnet-large-cased`. |
Word Embedding: XLM | Supported by these pretrained embeddings. Defaults to `xlm-mlm-en-2048`. |
Word Embedding: RoBERTa | Supported by these pretrained embeddings. Defaults to `roberta-base`. |
Word Embedding: DistilBERT | Supported by these pretrained embeddings. Defaults to `distilbert-base-uncased`. |
Word Embedding: CTRL | Supported by these pretrained embeddings. Defaults to `ctrl`. |
Word Embedding: ALBERT | Supported by these pretrained embeddings. Defaults to `albert-base-v2`. |
Word Embedding: T5 | Supported by these pretrained embeddings. Defaults to `t5-base`. |
Word Embedding: XLM-RoBERTa | Supported by these pretrained embeddings. Defaults to `xlm-roberta-base`. |
Word Embedding: BART | Supported by these pretrained embeddings. Defaults to `facebook/bart-base`. |
Word Embedding: ELECTRA | Supported by these pretrained embeddings. Defaults to `google/electra-base-generator`. |
Word Embedding: DialoGPT | Supported by these pretrained embeddings. Defaults to `microsoft/DialoGPT-small`. |
Word Embedding: Longformer | Supported by these pretrained embeddings. Defaults to `allenai/longformer-base-4096`. |
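
The BytePair row above describes an option string of the form `<lang>_<dim>_<vocab_size>` in which default fields may be omitted. As a rough illustration of how such a string expands, here is a minimal sketch. The helper `expand_bytepair_option` is hypothetical and is not part of TextWiser; the defaults (`en`, `100`, `10000`) are taken from the table.

```python
def expand_bytepair_option(option: str = "en") -> str:
    """Hypothetical helper (not a TextWiser API): fill omitted fields
    of a '<lang>_<dim>_<vocab_size>' option string with the defaults."""
    defaults = ["en", "100", "10000"]  # lang, dim, vocab_size, per the table
    parts = option.split("_")
    if len(parts) > 3:
        raise ValueError(f"Expected at most 3 fields, got {len(parts)}: {option!r}")
    # Pad missing trailing fields, then replace every empty field with its default
    parts += [""] * (3 - len(parts))
    return "_".join(part or default for part, default in zip(parts, defaults))


print(expand_bytepair_option("en"))
print(expand_bytepair_option("en__10000"))
```

Under these assumptions, `en`, `en_100`, and `en__10000` all expand to the same fully specified string, which matches the equivalence stated in the table.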
Transformations | Notes |
---|---|
Singular Value Decomposition (SVD) | Differentiable |
Latent Dirichlet Allocation (LDA) | Not differentiable |
Non-negative Matrix Factorization (NMF) | Not differentiable |
Uniform Manifold Approximation and Projection (UMAP) | Not differentiable |
Pooling Word Vectors | Applies to word embeddings only. Reduces word-level vectors to document-level. Pool options include `max`, `min`, `mean`, `first`, and `last`. Defaults to `max`. |
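
To make the pooling options concrete, the following is a minimal pure-Python sketch of how word-level vectors can be reduced to a single document-level vector. It only illustrates the idea; the function `pool_word_vectors` and its string-keyed options are assumptions for this example, not TextWiser's implementation, which operates on tensors.

```python
from statistics import fmean
from typing import Callable, Dict, List

Vector = List[float]

# Each option reduces the values of one dimension across all words in a document
_REDUCERS: Dict[str, Callable[[List[float]], float]] = {
    "max": max,
    "min": min,
    "mean": fmean,
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
}


def pool_word_vectors(word_vectors: List[Vector], pool_option: str = "max") -> Vector:
    """Illustrative helper: pool a (num_words x dim) list into one dim-sized vector."""
    if not word_vectors:
        raise ValueError("Cannot pool an empty document")
    reducer = _REDUCERS[pool_option]
    # zip(*word_vectors) groups the values dimension by dimension
    return [reducer(list(dimension)) for dimension in zip(*word_vectors)]


# Three words, each with a 2-dimensional vector
words = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
print(pool_word_vectors(words, "max"))
print(pool_word_vectors(words, "mean"))
```

Note that `first` and `last` simply select the vector of the first or last word, which is why `first` pairs naturally with models such as BERT that place a summary token at the start of the sequence.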
Examples can be found under the notebooks folder.
TextWiser requires Python 3.8+ and can be installed from PyPI using `pip install textwiser`, using `pip install textwiser[full]` to install from PyPI with all optional dependencies, or by building from source by following the instructions in our documentation.
A unique research contribution of TextWiser lies in its novel approach in creating embeddings from components, called the Compound Embedding.
This method allows forming arbitrarily complex embeddings, thanks to a context-free grammar that defines a formal language for valid text featurization. You can see the details in our documentation and in the usage example.
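
To give a feel for what "valid according to a grammar" means here, below is a minimal sketch of a recursive validator for a deliberately simplified toy grammar. Everything in it is an assumption made for illustration: the nested dict/list schema layout, the name sets, and the function `is_valid_schema` are hypothetical and do not reproduce TextWiser's actual grammar or API, which are described in the documentation.

```python
# Toy grammar (illustrative only):
#   schema -> embedding
#           | {"concat": [schema, ...]}
#           | {"transform": [schema, transformation, ...]}
#   embedding / transformation -> name | (name, options_dict)
EMBEDDINGS = {"bow", "tfidf", "doc2vec", "word2vec"}      # assumed names
TRANSFORMATIONS = {"svd", "nmf", "lda", "umap", "pool"}   # assumed names


def _is_unit(node, names) -> bool:
    """A unit is a bare name or a (name, options) pair."""
    if isinstance(node, str):
        return node in names
    return (
        isinstance(node, (tuple, list))
        and len(node) == 2
        and isinstance(node[0], str)
        and node[0] in names
        and isinstance(node[1], dict)
    )


def is_valid_schema(node) -> bool:
    """Return True if `node` is a sentence of the toy grammar above."""
    if _is_unit(node, EMBEDDINGS):
        return True
    if isinstance(node, dict) and len(node) == 1:
        key, children = next(iter(node.items()))
        if not isinstance(children, list) or not children:
            return False
        if key == "concat":
            return all(is_valid_schema(child) for child in children)
        if key == "transform":
            head, *tail = children
            # A transform needs one schema followed by at least one transformation
            return (
                is_valid_schema(head)
                and bool(tail)
                and all(_is_unit(t, TRANSFORMATIONS) for t in tail)
            )
    return False


example_schema = {
    "transform": [
        {
            "concat": [
                {"transform": ["word2vec", ("pool", {"pool_option": "max"})]},
                {"transform": ["tfidf", ("nmf", {"n_components": 30})]},
            ]
        },
        "svd",
    ]
}
print(is_valid_schema(example_schema))
print(is_valid_schema({"transform": ["svd", "tfidf"]}))
```

Because the rules are recursive, a concatenation can contain transformed embeddings, which can themselves be concatenated and transformed again; this is what allows arbitrarily deep compositions while still rejecting malformed ones, such as a transformation used where an embedding is expected.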
All Word2Vec and transformer-based embeddings, and any embedding followed with an `svd` transformation, are fine-tunable for downstream tasks. In other words, if you pass the resulting fine-tunable embedding to a PyTorch training method, the features will automatically be trained for your application. You can see the details in our documentation and in the usage example.
In general, text data should be whitespace-tokenized before being fed into TextWiser. Customized tokenization is also supported, as described in more detail in our documentation.
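
As one possible way to prepare input, the sketch below rewrites raw text so that its tokens are separated by single spaces. The helper `whitespace_tokenize` and the choice to split punctuation into separate tokens are assumptions for this example; this is not TextWiser's tokenizer, and the right tokenization depends on the embedding you use.

```python
import re

# A token is either a word (optionally with internal apostrophes)
# or a single punctuation character
_TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def whitespace_tokenize(text: str) -> str:
    """Hypothetical preprocessing step: return `text` with tokens joined by spaces."""
    return " ".join(_TOKEN.findall(text))


documents = ["Some document", "More documents. Including multi-sentence documents."]
tokenized = [whitespace_tokenize(doc) for doc in documents]
print(tokenized[1])
```

After this step, calling `str.split()` on each document recovers the token list, which is the property that whitespace tokenization is meant to provide.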
Please submit bug reports, questions and feature requests as Issues.
If you use TextWiser in a publication, please cite it as:
@article{textwiser2021,
author={Kilitcioglu, Doruk and Kadioglu, Serdar},
title={Representing the Unification of Text Featurization using a Context-Free Grammar},
url={https://github.com/fidelity/textwiser},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={35},
number={17},
year={2021},
month={May},
pages={15439-15445}
}
TextWiser is licensed under the Apache License 2.0.