John Snow Labs Spark-NLP 1.4.1: Model Downloader and easier to use External Resource API
Overview
This is an exciting release: for the first time, the library includes the base code for a model and pipeline downloader. We will use it to provide quality pre-trained models and pipelines, so that users can quickly predict on or tag a dataset with NLP annotators out of the box, as long as they know what the pipeline or model was trained for.
The next important enhancement is how we deal with external sources for training annotators. This was unified in 1.4.0 and is now further improved. It is easier to provide reading properties, such as how the source should be read: line by line or as a Spark Dataset, a choice that can have a significant impact on performance depending on the size of the target. Paths may also carry a protocol, such as hdfs://, or file:// for local files, following Spark's native HadoopConfiguration setting.
The rest of the release is about improving and fixing issues in the new 1.4.0 Tokenizer and a few critical bugs in CRF NER. Many users contributed by reporting these bugs, and we are thankful. There were also improvements to the PySpark API to make annotators easier to extend and maintain.
New features
- Model and Pipeline Downloader
We are glad to announce our first experimental model downloader, working both in Python and Scala.
This makes it possible to download pre-trained models from our public storage. The release does not include any pre-trained models yet, only the logic needed to download them.
Enhancements
- Improved ExternalResource API (introduced in 1.4.0) to make it easier to provide external corpus and resource information on annotators, such as readAs (which sets how SparkNLP should read your source), delimiters and parse settings, among other options that may be passed to the Spark reader directly. Annotators using external sources now all share this functionality. WordEmbeddings are not yet supported in this format.
- All Python annotators now properly have getter functions to retrieve param values
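To make the reading properties above more concrete, here is a minimal, self-contained Python sketch of what such a resource description might carry. It is not the library's ExternalResource class: `ResourceSpec`, `READ_MODES` and the mode names `LINE_BY_LINE` and `SPARK_DATASET` are hypothetical stand-ins chosen for illustration, and the code only models a path, a read mode given as a string, and a free-form options map.

```python
from dataclasses import dataclass, field
from typing import Dict
from urllib.parse import urlparse

# Hypothetical mode names for illustration; not taken from the library.
READ_MODES = ("LINE_BY_LINE", "SPARK_DATASET")


@dataclass
class ResourceSpec:
    """Illustrative stand-in for an external resource description."""
    path: str
    read_as: str = "LINE_BY_LINE"
    options: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # Accept the read mode as a case-insensitive string and validate it.
        self.read_as = self.read_as.upper()
        if self.read_as not in READ_MODES:
            raise ValueError(f"read_as must be one of {READ_MODES}")

    @property
    def protocol(self) -> str:
        # A path without a scheme is treated as a local file in this sketch.
        return urlparse(self.path).scheme or "file"


# Usage: a large dictionary on HDFS, read as a dataset with reader options.
spec = ResourceSpec(
    "hdfs://namenode/data/dict.txt",
    read_as="spark_dataset",
    options={"format": "text"},
)
```

Keeping the read mode, delimiters and reader options in one object is what would let every annotator that uses external sources share the same handling.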
Bugfixes
- Fixed some Python annotators not being deserializable on their own outside a Pipeline
- Fixed CRF NER not working when not using word embeddings (thanks @Crisliu for reporting)
- Fixed Tokenizer not properly recognizing some stop words (thanks @easimadi)
- Fixed Tokenizer not properly recognizing composite tokens when changing target pattern param (thanks @easimadi)
- ReadAs parameter is now properly read from a string in all ExternalResource setters
Developer API
- Further PySpark API improvements within AnnotatorApproach, AnnotatorModel and a now-private internal _AnnotatorModel used to represent the fit() result
- Automated getters have been written so that getter functions do not have to be written manually in every annotator
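As an illustration of the automated-getter idea, the following is a minimal plain-Python sketch, not the library's actual implementation. The `Params` base class, the `_set` helper and `ToyTokenizer` are hypothetical names. The sketch resolves any `getXyz()` call against stored param values through `__getattr__`, so individual annotators only need to define setters.

```python
class Params:
    """Hypothetical base class that stores param values in a dict."""

    def __init__(self):
        self._values = {}

    def _set(self, **kwargs):
        # Store one or more param values and return self for chaining.
        self._values.update(kwargs)
        return self

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        # Maps getFooBar() to the stored param "fooBar".
        if name.startswith("get") and len(name) > 3:
            param = name[3].lower() + name[4:]
            values = self.__dict__.get("_values", {})
            if param in values:
                return lambda: values[param]
        raise AttributeError(name)


class ToyTokenizer(Params):
    """Toy annotator: defines a setter only, the getter is derived."""

    def setTargetPattern(self, value):
        return self._set(targetPattern=value)


tokenizer = ToyTokenizer().setTargetPattern(r"\S+")
pattern = tokenizer.getTargetPattern()  # no hand-written getter needed
```

Other designs, such as generating the getter methods once at class-creation time, would serve the same goal. The dynamic lookup is used here only because it keeps the sketch short.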
Other
- RocksDB dependency rolled back to 5.2.1 for better universal compatibility, particularly to support the Databricks platform
- Tests jar is now available in Maven Central (thanks @lorenz-nlp for the idea)
Documentation
- Updated website components page to match 1.4.x
- Replaced the notebooks site with a placeholder linking to the current Python notebooks, for lower maintenance