Commit 5cbaa02

Merge pull request #571 from JohnSnowLabs/210-release-candidate-4

2.1.0 Release Candidate #4

saif-ellafi authored Jul 13, 2019
2 parents e37ec99 + 7f51807 commit 5cbaa02
Showing 603 changed files with 349,146 additions and 1,035 deletions.
4 changes: 2 additions & 2 deletions .sbtrc
@@ -1,6 +1,6 @@
alias assemblyAndCopy=;assembly;copyAssembledJar
alias assemblyOcrAndCopy=;ocr/assembly;copyAssembledOcrJar
alias assemblyEvalAndCopy=;evaluation/assembly;copyAssembledEvalJar
-alias assemblyAllAndCopy=;assemblyAndCopy;assemblyOcrAndCopy;assemblyEvalAndCopy;copyAssembledEvalJar
+alias assemblyAllAndCopy=;assemblyEvalAndCopy;assemblyOcrAndCopy
alias assemblyAndCopyForPyPi=;assembly;copyAssembledJarForPyPi
-alias publishSignedOcr=;ocr/assembly;ocr/publishSigned
+alias publishSignedOcr=;ocr/assembly;ocr/publishSigned
53 changes: 53 additions & 0 deletions CHANGELOG
@@ -1,3 +1,56 @@
========
2.1.0
========
---------------
Overview
---------------
Thank you for following up with the release candidates. This release is backwards breaking because two basic annotators have been redesigned.
The Tokenizer now has easier-to-customize params and simplified exception management.
The DocumentAssembler param `trimAndClearNewLines` was redesigned into `cleanupMode` for finer control over the cleanup process.
The Tokenizer now supports pretrained models, meaning you'll be able to access any of our language-based Tokenizers.
Another big introduction is the `eval` module, an optional Spark NLP sub-module that provides evaluation scripts to
make it easier to measure your own models against a validation dataset, now using MLFlow.
Some work also began on reporting metrics during training, starting with the `NerDLApproach`.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.
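
As a quick orientation, here is a minimal Scala sketch of the two redesigned annotators. The specific values and setters shown (`"shrink"` for `cleanupMode`, `setSplitChars`, `setExceptions`) are illustrative assumptions drawn from these notes rather than a definitive reference; see the Scaladocs for the authoritative parameter list.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer

// cleanupMode replaces trimAndClearNewLines; "shrink" is an assumed value.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

// The redesigned Tokenizer aims for easier customization; both setters
// below are assumptions based on the notes above.
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
  .setSplitChars(Array("-"))          // additional characters to split on
  .setExceptions(Array("New York"))   // compound tokens to keep together
```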

---------------
New Features
---------------
* Spark NLP Eval module: includes functions to evaluate NER and Spell Checkers with MLFlow (Python support and more annotators to come); see the sketch below
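
A hedged sketch of how the `eval` module might be called from Scala. The class name, constructor, and method below (`NerDLEvaluation`, `computeAccuracyAnnotator`) are assumptions about the module layout, not a confirmed API; consult the Scaladocs before relying on them.

```scala
import org.apache.spark.sql.SparkSession
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach
// Assumed package and class name for the eval sub-module:
import com.johnsnowlabs.nlp.eval.NerDLEvaluation

val spark = SparkSession.builder()
  .appName("ner-eval")
  .master("local[*]")
  .getOrCreate()

// The annotator whose quality we want to measure.
val nerApproach = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")

// Hypothetical call: train on a CoNLL file, evaluate against a test file,
// and log precision/recall/F1 to MLflow as described above.
val nerEvaluation = new NerDLEvaluation(spark, testFile = "eng.testa", tagLevel = "IOB")
nerEvaluation.computeAccuracyAnnotator("eng.train", nerApproach)
```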

---------------
Enhancements
---------------
* DocumentAssembler's new param `cleanupMode` allows the user to decide what kind of cleanup to apply to the source text
* Tokenizer has been significantly enhanced to allow easier and more intuitive customization
* Norvig and Symmetric spell checkers now report confidence scores in their metadata
* NerDLApproach now reports metrics and F1 scores during training, with automated dataset splitting through `setTrainValidationProp` (see the sketch below)
* Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
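
For the `NerDLApproach` item above, a minimal sketch of enabling the new validation metrics, assuming the standard setters; the 10% split value is just an example:

```scala
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach

// Hold out 10% of the training data so that metrics and F1 scores
// are reported against a validation split during training.
val nerTagger = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")
  .setMaxEpochs(10)
  .setTrainValidationProp(0.1f)
```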

---------------
Bugfixes
---------------
* Fixed Dependency Parser not reporting offsets correctly
* Dependency Parser now only shows the head token as part of the result, instead of pairs
* Fixed NerDLModel not allowing users to pick non-contrib versions on Linux
* Fixed a bug in embeddingsRef validation that allowed the user to override the ref when not permitted
* Removed unintentional gc calls that were causing performance issues

---------------
Framework
---------------
* ResourceDownloader is now capable of using credentials from standard AWS sources (environment variables, credentials folder)

---------------
Documentation
---------------
* Scaladocs for Spark NLP reference
* Added Google Colab walkthrough guide
* Added Approach and Model class names in reference documentation
* Fixed various typos and outdated pieces in documentation

========
2.0.8
========
28 changes: 14 additions & 14 deletions README.md
@@ -40,7 +40,7 @@ Take a look at our official Spark NLP page: [http://nlp.johnsnowlabs.com/](http:

## Apache Spark Support

-Spark NLP *2.0.8* has been built on top of Apache Spark 2.4.3
+Spark NLP *2.1.0* has been built on top of Apache Spark 2.4.3

Note that Spark NLP is not backwards compatible with Spark 2.3.x, so models and environments might not work.

@@ -65,18 +65,18 @@ This library has been uploaded to the [spark-packages repository](https://spark-

The benefit of spark-packages is that it makes the library available for both Scala/Java and Python

-To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:2.0.8` to your spark command
+To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:2.1.0` to your spark command

```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:2.0.8
+spark-shell --packages JohnSnowLabs:spark-nlp:2.1.0
```

```sh
-pyspark --packages JohnSnowLabs:spark-nlp:2.0.8
+pyspark --packages JohnSnowLabs:spark-nlp:2.1.0
```

```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:2.0.8
+spark-submit --packages JohnSnowLabs:spark-nlp:2.1.0
```

This can also be used to create a SparkSession manually by using the `spark.jars.packages` option in both Python and Scala
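
For example, a Scala session can be created as follows (mirroring the Python example later in this README; the memory settings are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Pull the spark-nlp package at session startup via spark.jars.packages.
val spark = SparkSession.builder()
  .appName("Spark NLP")
  .master("local[4]")
  .config("spark.driver.memory", "4G")
  .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.1.0")
  .config("spark.kryoserializer.buffer.max", "500m")
  .getOrCreate()
```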
@@ -144,7 +144,7 @@ Our package is deployed to maven central. In order to add this package as a depe
```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
-   <version>2.0.8</version>
+   <version>2.1.0</version>
</dependency>
```

@@ -155,22 +155,22 @@ and

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-ocr_2.11</artifactId>
-   <version>2.0.8</version>
+   <version>2.1.0</version>
</dependency>
```

### SBT

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.0.8"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.1.0"
```

and

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-ocr
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-ocr" % "2.0.8"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-ocr" % "2.1.0"
```

Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp)
@@ -185,7 +185,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through

Pip:
```bash
-pip install spark-nlp==2.0.8
+pip install spark-nlp==2.1.0
```
Conda:
```bash
conda install -c johnsnowlabs spark-nlp
```

@@ -202,7 +202,7 @@

```python
spark = SparkSession.builder \
    .master("local[4]")\
    .config("spark.driver.memory","4G")\
    .config("spark.driver.maxResultSize", "2G") \
-   .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.8")\
+   .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.1.0")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()
```
@@ -216,7 +216,7 @@ Use either one of the following options
* Add the following Maven Coordinates to the interpreter's library list

```bash
-com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.8
+com.johnsnowlabs.nlp:spark-nlp_2.11:2.1.0
```

* Add the path to the pre-built jar from [here](#pre-compiled-spark-nlp-and-spark-nlp-ocr) in the interpreter's library list, making sure the jar is available on the driver path
@@ -226,7 +226,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.8
Apart from the previous step, install the Python module through pip

```bash
-pip install spark-nlp==2.0.8
+pip install spark-nlp==2.1.0
```

Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -251,7 +251,7 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

-pyspark --packages JohnSnowLabs:spark-nlp:2.0.8
+pyspark --packages JohnSnowLabs:spark-nlp:2.1.0
```

Alternatively, you can mix in the `--jars` option for pyspark together with `pip install spark-nlp`
50 changes: 36 additions & 14 deletions build.sbt
@@ -16,7 +16,7 @@ if(is_gpu.equals("false")){

organization:= "com.johnsnowlabs.nlp"

-version := "2.0.8"
+version := "2.1.0"

scalaVersion in ThisBuild := scalaVer

@@ -86,6 +86,7 @@ developers in ThisBuild:= List(
Developer(id="showy", name="Eduardo Muñoz", email="eduardo@johnsnowlabs.com", url=url("https://github.com/showy"))
)

+target in Compile in doc := baseDirectory.value / "docs/api"

lazy val ocrDependencies = Seq(
"net.sourceforge.tess4j" % "tess4j" % "4.2.1"
@@ -108,19 +109,19 @@ lazy val testDependencies = Seq(
lazy val utilDependencies = Seq(
"com.typesafe" % "config" % "1.3.0",
"org.rocksdb" % "rocksdbjni" % "5.17.2",
"org.apache.hadoop" % "hadoop-aws" % "2.7.3"
"org.apache.hadoop" % "hadoop-aws" % "3.2.0"
exclude("com.fasterxml.jackson.core", "jackson-annotations")
exclude("com.fasterxml.jackson.core", "jackson-databind")
exclude("com.fasterxml.jackson.core", "jackson-core")
exclude("commons-configuration","commons-configuration")
exclude("com.amazonaws","aws-java-sdk-bundle")
exclude("org.apache.hadoop" ,"hadoop-common"),
"com.amazonaws" % "aws-java-sdk" % "1.11.568"
exclude("commons-codec", "commons-codec")
exclude("com.fasterxml.jackson.core", "jackson-core")
"com.amazonaws" % "aws-java-sdk-core" % "1.11.375"
exclude("com.fasterxml.jackson.core", "jackson-annotations")
exclude("com.fasterxml.jackson.core", "jackson-databind")
exclude("com.fasterxml.jackson.dataformat", "jackson-dataformat-smile")
exclude("com.fasterxml.jackson.datatype", "jackson-datatype-joda"),

exclude("com.fasterxml.jackson.core", "jackson-core")
exclude("commons-configuration","commons-configuration"),
"com.amazonaws" % "aws-java-sdk-s3" % "1.11.375",
"org.rocksdb" % "rocksdbjni" % "5.17.2",
"com.github.universal-automata" % "liblevenshtein" % "3.0.0"
exclude("com.google.guava", "guava")
@@ -158,7 +159,6 @@ lazy val root = (project in file("."))


val ocrMergeRules: String => MergeStrategy = {
  case "versionchanges.txt" => MergeStrategy.discard
  case "StaticLoggerBinder" => MergeStrategy.discard
  case PathList("META-INF", fileName)

@@ -171,10 +171,23 @@ val ocrMergeRules: String => MergeStrategy = {
  case _ => MergeStrategy.deduplicate
}

+val evalMergeRules: String => MergeStrategy = {
+  case "versionchanges.txt" => MergeStrategy.discard
+  case "StaticLoggerBinder" => MergeStrategy.discard
+  case PathList("META-INF", fileName)
+    if List("NOTICE", "MANIFEST.MF", "DEPENDENCIES", "INDEX.LIST").contains(fileName) || fileName.endsWith(".txt")
+      => MergeStrategy.discard
+  case PathList("META-INF", "services", _ @ _*) => MergeStrategy.first
+  case PathList("META-INF", xs @ _*) => MergeStrategy.first
+  case PathList("org", "apache", "spark", _ @ _*) => MergeStrategy.discard
+  case PathList("apache", "commons", "logging", "impl", xs @ _*) => MergeStrategy.discard
+  case _ => MergeStrategy.deduplicate
+}

assemblyMergeStrategy in assembly := {
  case PathList("apache.commons.lang3", _ @ _*) => MergeStrategy.discard
-  case PathList("org.apache.hadoop", _ @ _*) => MergeStrategy.last
-  case PathList("com.amazonaws", _ @ _*) => MergeStrategy.last
+  case PathList("org.apache.hadoop", xs @ _*) => MergeStrategy.first
+  case PathList("com.amazonaws", xs @ _*) => MergeStrategy.last
  case PathList("com.fasterxml.jackson") => MergeStrategy.first
  case PathList("META-INF", "io.netty.versions.properties") => MergeStrategy.first
  case PathList("org", "tensorflow", _ @ _*) => MergeStrategy.first
@@ -187,7 +200,15 @@ assemblyMergeStrategy in assembly := {
lazy val evaluation = (project in file("eval"))
  .settings(
    name := "spark-nlp-eval",
-   version := "2.0.8",
+   version := "2.1.0",
+
+   assemblyMergeStrategy in assembly := evalMergeRules,
+
+   libraryDependencies ++= testDependencies ++ Seq(
+     "org.mlflow" % "mlflow-client" % "1.0.0"
+   ),
+
+   test in assembly := {},

publishTo := Some(
if (isSnapshot.value)
@@ -220,7 +241,7 @@ lazy val evaluation = (project in file("eval"))
lazy val ocr = (project in file("ocr"))
  .settings(
    name := "spark-nlp-ocr",
-   version := "2.0.8",
+   version := "2.1.0",

    test in assembly := {},

@@ -294,9 +315,10 @@ copyAssembledOcrJar := {
  println(s"[info] $jarFilePath copied to $newJarFilePath ")
}

+// Includes spark-nlp, so use sparknlp.jar
copyAssembledEvalJar := {
  val jarFilePath = (assemblyOutputPath in assembly in "evaluation").value
-  val newJarFilePath = baseDirectory( _ / "python" / "lib" / "sparknlp-eval.jar").value
+  val newJarFilePath = baseDirectory( _ / "python" / "lib" / "sparknlp.jar").value
  IO.copyFile(jarFilePath, newJarFilePath)
  println(s"[info] $jarFilePath copied to $newJarFilePath ")
}
10 changes: 5 additions & 5 deletions docs/_layouts/landing.html
@@ -49,22 +49,22 @@ <h1>{{ _section.title }}</h1>
<div class="cell cell--12 cell--lg-12" style="text-align: left; background-color: #2d2d2d; padding: 10px">
{% highlight bash %}
# Install Spark NLP from PyPI
-$ pip install spark-nlp==2.0.8
+$ pip install spark-nlp==2.1.0

# Install Spark NLP from Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp

# Load Spark NLP with Spark Shell
-$ spark-shell --packages JohnSnowLabs:spark-nlp:2.0.8
+$ spark-shell --packages JohnSnowLabs:spark-nlp:2.1.0

# Load Spark NLP with PySpark
-$ pyspark --packages JohnSnowLabs:spark-nlp:2.0.8
+$ pyspark --packages JohnSnowLabs:spark-nlp:2.1.0

# Load Spark NLP with Spark Submit
-$ spark-submit --packages JohnSnowLabs:spark-nlp:2.0.8
+$ spark-submit --packages JohnSnowLabs:spark-nlp:2.1.0

# Load Spark NLP as an external JAR after compiling and building Spark NLP by `sbt assembly`
-$ spark-shell --jar spark-nlp-assembly-2.0.8
+$ spark-shell --jar spark-nlp-assembly-2.1.0
{% endhighlight %}
</div>
</div>