
ML6 NLP Quick Tips

Current content:

  • Multilingual Sentence Embeddings (21/01/2021): Gives an overview of various current multilingual sentence embedding techniques and tools, and how they compare across various sequence lengths.
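The core idea behind comparing multilingual sentence embeddings is cosine similarity: semantically equivalent sentences in different languages should map to nearby vectors. A minimal sketch, assuming the vectors come from a multilingual encoder such as a sentence-transformers model (the toy 4-dimensional vectors below are made up for illustration; real models output hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": a multilingual model should place translations close together.
emb_en = [0.9, 0.1, 0.0, 0.4]     # "The cat sits on the mat."
emb_de = [0.85, 0.15, 0.05, 0.4]  # "Die Katze sitzt auf der Matte."
emb_other = [0.0, 0.9, 0.4, 0.1]  # an unrelated sentence

print(cosine_similarity(emb_en, emb_de))     # high: same meaning
print(cosine_similarity(emb_en, emb_other))  # low: different meaning
```

The same similarity function works regardless of which embedding technique produced the vectors, which is what makes the techniques in the tip directly comparable.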

  • spaCy 3.0 (01/02/2021): spaCy 3.0 has just been released, and in this tip we'll have a look at some of its new features. We'll train a German NER model and streamline the end-to-end pipeline using the brand-new spaCy projects!
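spaCy 3.0's headline change is declarative, config-driven training: the pipeline, components, and hyperparameters live in a single config file instead of code. An illustrative fragment (the specific paths and values here are assumptions, not taken from the notebook):

```ini
[paths]
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"

[nlp]
lang = "de"
pipeline = ["tok2vec","ner"]

[training]
max_epochs = 20
dropout = 0.1
```

spaCy projects then wrap steps like downloading data, training against this config, and packaging the model into a single reproducible workflow file.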

  • Compact transformers (26/02/2021): Bigger isn't always better. In this tip we look at some compact BERT-based models that provide a nice balance between computational resources and model accuracy.

  • Keyword Extraction with pke (18/03/2021): The KEYNG (read: king) is dead, long live the KEYNG! In this tip we look at pke, an alternative to Gensim for keyword extraction.
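At its simplest, unsupervised keyword extraction means selecting candidate terms and scoring them. The toy sketch below uses frequency after stopword filtering; it is not pke's algorithm, just a stand-in for the candidate-selection-then-weighting pattern that pke's models (e.g. TfIdf, TopicRank) implement with POS-based candidates and graph-based weights:

```python
import re
from collections import Counter

STOPWORDS = {"the", "is", "of", "and", "a", "to", "in", "for", "we", "with"}

def extract_keywords(text, n=3):
    # Toy extractor: candidates are non-stopword tokens, scored by frequency.
    words = re.findall(r"[a-z]+", text.lower())
    candidates = [w for w in words if w not in STOPWORDS and len(w) > 2]
    return [w for w, _ in Counter(candidates).most_common(n)]

doc = ("Keyword extraction identifies the most relevant terms in a document. "
       "Graph-based keyword extraction builds a graph of candidate terms.")
print(extract_keywords(doc))
```

Libraries like pke expose exactly these two stages (`candidate_selection`, `candidate_weighting`) so you can swap scoring strategies without changing the rest of the pipeline.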

  • Explainable transformers using SHAP (22/04/2021): BERT, explain yourself! 📖 Up until recently, language model predictions have lacked transparency. In this tip we look at SHAP, a way to explain your latest transformer-based models.
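SHAP attributions are Shapley values: each feature's average marginal contribution over all orderings, with the guarantee that the attributions plus a baseline prediction sum exactly to the model's output. For two features the computation is small enough to write out by hand; the toy model below is a made-up example, while SHAP approximates the same quantities for transformers with thousands of input tokens:

```python
def shapley_two_features(f, x, baseline):
    # Exact Shapley values for a two-feature model f: average each
    # feature's marginal contribution over both possible orderings.
    x1, x2 = x
    b1, b2 = baseline
    base = f(b1, b2)
    phi1 = 0.5 * ((f(x1, b2) - f(b1, b2)) + (f(x1, x2) - f(b1, x2)))
    phi2 = 0.5 * ((f(b1, x2) - f(b1, b2)) + (f(x1, x2) - f(x1, b2)))
    return base, phi1, phi2

# Toy "model" with an interaction term between its two features.
f = lambda a, b: 2 * a + 3 * b + a * b
base, phi1, phi2 = shapley_two_features(f, x=(1.0, 2.0), baseline=(0.0, 0.0))
print(base + phi1 + phi2, f(1.0, 2.0))  # attributions sum to the prediction
```

This additivity is what makes SHAP plots readable: every token's contribution is in the same units as the model output.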

  • Transformer-based Data Augmentation (18/06/2021): Ever struggled with a limited non-English NLP dataset for a project? Fear not, data augmentation to the rescue ⛑️ In this week's tip, we look at backtranslation 🔀 and contextual word embedding insertions as data augmentation techniques for multilingual NLP.
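Backtranslation is a simple round trip: translate a sentence into a pivot language and back, and the imperfections of translation yield a paraphrase you can add as a new training sample. The sketch below fakes the two translators with lookup tables purely to show the mechanics; a real pipeline would use MT models (e.g. translation checkpoints from the Hugging Face hub):

```python
# Mock round-trip "translators" standing in for real MT models.
EN_TO_DE = {"the movie was great": "der Film war großartig"}
DE_TO_EN = {"der Film war großartig": "the film was great"}

def backtranslate(sentence):
    pivot = EN_TO_DE[sentence]  # en -> de
    return DE_TO_EN[pivot]      # de -> en: a slightly rephrased sentence

original = "the movie was great"
augmented = backtranslate(original)
print(augmented)  # "the film was great": a new sample with the same label
```

Because the label is preserved while the surface form changes, each original sentence can yield several label-consistent variants via different pivot languages.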

  • Long range transformers (14/07/2021): Beyond and above the 512! 🏅 In this week's tip, we look at novel long range transformer architectures and compare them against the well-known RoBERTa model.
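The motivation for long-range architectures is arithmetic: full self-attention scores every token pair, so cost grows quadratically with sequence length, while sparse patterns such as a sliding window (used, e.g., by Longformer) grow only linearly. A small back-of-the-envelope sketch (the window size of 256 is an illustrative choice, not taken from the tip):

```python
def full_attention_pairs(n):
    # Full self-attention: every token attends to every token -> O(n^2).
    return n * n

def window_attention_pairs(n, w):
    # Sliding-window attention: each token attends to at most w
    # neighbours on each side -> O(n * w).
    return sum(min(n, i + w + 1) - max(0, i - w) for i in range(n))

n = 4096  # far beyond BERT/RoBERTa's 512-token limit
print(full_attention_pairs(n))         # 16777216 scored pairs
print(window_attention_pairs(n, 256))  # 2035456 pairs: ~8x cheaper
```

At 512 tokens the gap is modest, which is why long-range models only pay off once documents genuinely exceed standard context lengths.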

  • Neural Keyword Extraction (10/09/2021): Neural Keyword Extraction 🧠 In this week's tip, we look at neural keyword extraction methods and how they compare to classical methods.

  • HuggingFace Optimum (12/10/2021): HuggingFace Optimum Quantization ✂️ In this week's tip, we take a look at the new HuggingFace Optimum package to check out some model quantization techniques.
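Quantization shrinks a model by storing weights as 8-bit integers plus a per-tensor scale and zero-point instead of 32-bit floats. The sketch below shows the affine quantize/dequantize arithmetic on a handful of made-up weights; tools like Optimum and ONNX Runtime apply this (with calibrated ranges and per-channel scales) across whole networks:

```python
def quantize(values, num_bits=8):
    # Affine quantization: map floats onto [0, 2^bits - 1] integers.
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.2, -0.3, 0.0, 0.7, 1.5]
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
print(restored)  # close to the originals, at a quarter of the memory
```

The small reconstruction error is the trade-off the tip measures: int8 inference is faster and lighter, at the cost of a usually negligible accuracy drop.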

  • Text Augmentation using large-scale LMs and prompt engineering (25/11/2021): Typically, the more data we have, the better performance we can achieve 🤙. However, it is sometimes difficult and/or expensive to annotate a large amount of training data 😞. In this tip, we leverage three large-scale LMs (GPT-3, GPT-J and GPT-Neo) to generate very realistic samples from a very small dataset.
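The prompt-engineering part amounts to assembling a few-shot prompt: a task description, a handful of real labelled examples, and an open slot for the LM to complete with a fresh synthetic sample. A minimal sketch of the string assembly (the template wording and the seed reviews are invented for illustration; the completion step would call a GPT-3/GPT-J/GPT-Neo endpoint):

```python
def build_prompt(examples, label):
    # Few-shot prompt: task description + seed examples + an open slot
    # that the language model completes with a new synthetic sample.
    header = f"Write a short {label} product review.\n\n"
    shots = "".join(f"Review: {text}\n" for text in examples)
    return header + shots + "Review:"

seed = ["Absolutely loved it, works perfectly.",
        "Great value for money, highly recommended."]
prompt = build_prompt(seed, label="positive")
print(prompt)
```

Generating per label and then filtering implausible completions yields a much larger training set from only a few annotated seeds.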

  • Gender debiasing of datasets using CDA (25/01/2022): A lot of large language models are trained on web text. However, this means that unintended biases can sneak into your model's behaviour 😞. In this tip, we'll look at how to alleviate this bias using Counterfactual Data Augmentation ⚖️.
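Counterfactual Data Augmentation duplicates each training sentence with gendered terms swapped, so the model sees both variants equally often. A minimal sketch with a tiny hand-written swap list (real CDA uses much larger curated word lists and handles harder cases like names and coreference):

```python
import re

# Bidirectional swap list: applying it twice returns the original sentence.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "man": "woman", "woman": "man"}

def counterfactual(sentence):
    def swap(match):
        word = match.group(0)
        repl = SWAPS.get(word.lower(), word)
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"\b\w+\b", swap, sentence)

print(counterfactual("She said his idea was brilliant."))
# -> "He said her idea was brilliant."
```

Training on the union of originals and counterfactuals balances the gender statistics of the corpus without discarding any data.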

  • GPT-2 Quantization using ONNXRuntime (19/04/2022): Large language models are costly to run. In this notebook, we leverage ONNXRuntime to quantize and run our Dutch GPT-2 model in a more efficient way 💰.