Commit 5cbaa02

Merge pull request #571 from JohnSnowLabs/210-release-candidate-4

2.1.0 Release Candidate #4

saif-ellafi authored Jul 13, 2019
2 parents e37ec99 + 7f51807 commit 5cbaa02
Showing 603 changed files with 349,146 additions and 1,035 deletions.
4 changes: 2 additions & 2 deletions .sbtrc
@@ -1,6 +1,6 @@
alias assemblyAndCopy=;assembly;copyAssembledJar
alias assemblyOcrAndCopy=;ocr/assembly;copyAssembledOcrJar
alias assemblyEvalAndCopy=;evaluation/assembly;copyAssembledEvalJar
-alias assemblyAllAndCopy=;assemblyAndCopy;assemblyOcrAndCopy;assemblyEvalAndCopy;copyAssembledEvalJar
+alias assemblyAllAndCopy=;assemblyEvalAndCopy;assemblyOcrAndCopy
alias assemblyAndCopyForPyPi=;assembly;copyAssembledJarForPyPi
-alias publishSignedOcr=;ocr/assembly;ocr/publishSigned
+alias publishSignedOcr=;ocr/assembly;ocr/publishSigned
53 changes: 53 additions & 0 deletions CHANGELOG
@@ -1,3 +1,56 @@
========
2.1.0
========
---------------
Overview
---------------
Thank you for following up with the release candidates. This release is backwards breaking because two basic annotators have been redesigned.
The Tokenizer now has easier-to-customize params and simplified exception management.
The DocumentAssembler param `trimAndClearNewLines` was redesigned into `cleanupMode` for finer control over the cleanup process.
The Tokenizer now supports pretrained models, meaning you'll be able to access any of our language-based Tokenizers.
Another big introduction is the `eval` module, an optional Spark NLP sub-module that provides evaluation scripts to
make it easier to measure your own models against a validation dataset, now using MLFlow.
Some work also began on reporting metrics during training, starting with the `NerDLApproach`.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.
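
As a quick orientation, here is a minimal Scala sketch of the two redesigned annotators. The specific values and setters shown (`"shrink"` for `cleanupMode`, `setSplitChars`, `setExceptions`) are illustrative assumptions drawn from these notes rather than a definitive reference; see the Scaladocs for the authoritative parameter list.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer

// cleanupMode replaces trimAndClearNewLines; "shrink" is an assumed value.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

// The redesigned Tokenizer aims for easier customization; both setters
// below are assumptions based on the notes above.
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
  .setSplitChars(Array("-"))          // additional characters to split on
  .setExceptions(Array("New York"))   // compound tokens to keep together
```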

---------------
New Features
---------------
* Spark NLP Eval module: includes functions to evaluate NER and Spell Checkers with MLFlow (Python support and more annotators to come); see the sketch below
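
A hedged sketch of how the `eval` module might be called from Scala. The class name, constructor, and method below (`NerDLEvaluation`, `computeAccuracyAnnotator`) are assumptions about the module layout, not a confirmed API; consult the Scaladocs before relying on them.

```scala
import org.apache.spark.sql.SparkSession
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach
// Assumed package and class name for the eval sub-module:
import com.johnsnowlabs.nlp.eval.NerDLEvaluation

val spark = SparkSession.builder()
  .appName("ner-eval")
  .master("local[*]")
  .getOrCreate()

// The annotator whose quality we want to measure.
val nerApproach = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")

// Hypothetical call: train on a CoNLL file, evaluate against a test file,
// and log precision/recall/F1 to MLflow as described above.
val nerEvaluation = new NerDLEvaluation(spark, testFile = "eng.testa", tagLevel = "IOB")
nerEvaluation.computeAccuracyAnnotator("eng.train", nerApproach)
```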

---------------
Enhancements
---------------
* DocumentAssembler's new param `cleanupMode` allows the user to decide what kind of cleanup to apply to the source text
* Tokenizer has been significantly enhanced to allow easier and more intuitive customization
* Norvig and Symmetric spell checkers now report confidence scores in their metadata
* NerDLApproach now reports metrics and F1 scores during training, with automated dataset splitting through `setTrainValidationProp` (see the sketch below)
* Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
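
For the `NerDLApproach` item above, a minimal sketch of enabling the new validation metrics, assuming the standard setters; the 10% split value is just an example:

```scala
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach

// Hold out 10% of the training data so that metrics and F1 scores
// are reported against a validation split during training.
val nerTagger = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")
  .setMaxEpochs(10)
  .setTrainValidationProp(0.1f)
```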

---------------
Bugfixes
---------------
* Fixed Dependency Parser not reporting offsets correctly
* Dependency Parser now only shows the head token as part of the result, instead of pairs
* Fixed NerDLModel not allowing users to pick non-contrib versions on Linux
* Fixed a bug in embeddingsRef validation that allowed the user to override the ref when not permitted
* Removed unintentional gc calls that were causing performance issues

---------------
Framework
---------------
* ResourceDownloader is now capable of using credentials from standard AWS sources (environment variables, credentials folder)

---------------
Documentation
---------------
* Scaladocs for Spark NLP reference
* Added Google Colab walkthrough guide
* Added Approach and Model class names in reference documentation
* Fixed various typos and outdated pieces in documentation

========
2.0.8
========
28 changes: 14 additions & 14 deletions README.md
@@ -40,7 +40,7 @@ Take a look at our official Spark NLP page: [http://nlp.johnsnowlabs.com/](http:

## Apache Spark Support

-Spark NLP *2.0.8* has been built on top of Apache Spark 2.4.3
+Spark NLP *2.1.0* has been built on top of Apache Spark 2.4.3

Note that Spark NLP is not backwards compatible with Spark 2.3.x, so models and environments might not work.

@@ -65,18 +65,18 @@ This library has been uploaded to the [spark-packages repository](https://spark-

The benefit of spark-packages is that it makes the library available for both Scala/Java and Python

-To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:2.0.8` to your spark command
+To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:2.1.0` to your spark command

```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:2.0.8
+spark-shell --packages JohnSnowLabs:spark-nlp:2.1.0
```

```sh
-pyspark --packages JohnSnowLabs:spark-nlp:2.0.8
+pyspark --packages JohnSnowLabs:spark-nlp:2.1.0
```

```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:2.0.8
+spark-submit --packages JohnSnowLabs:spark-nlp:2.1.0
```

This can also be used to create a SparkSession manually by using the `spark.jars.packages` option in both Python and Scala
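
For example, a Scala session can be created as follows (mirroring the Python example later in this README; the memory settings are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Pull the spark-nlp package at session startup via spark.jars.packages.
val spark = SparkSession.builder()
  .appName("Spark NLP")
  .master("local[4]")
  .config("spark.driver.memory", "4G")
  .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.1.0")
  .config("spark.kryoserializer.buffer.max", "500m")
  .getOrCreate()
```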
@@ -144,7 +144,7 @@ Our package is deployed to maven central. In order to add this package as a depe
```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
-   <version>2.0.8</version>
+   <version>2.1.0</version>
</dependency>
```

@@ -155,22 +155,22 @@ and

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-ocr_2.11</artifactId>
-   <version>2.0.8</version>
+   <version>2.1.0</version>
</dependency>
```

### SBT

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.0.8"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.1.0"
```

and

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-ocr
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-ocr" % "2.0.8"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-ocr" % "2.1.0"
```

Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp)
@@ -185,7 +185,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through

Pip:
```bash
-pip install spark-nlp==2.0.8
+pip install spark-nlp==2.1.0
```
Conda:
```bash
conda install -c johnsnowlabs spark-nlp
```

@@ -202,7 +202,7 @@

```python
spark = SparkSession.builder \
    .master("local[4]")\
    .config("spark.driver.memory","4G")\
    .config("spark.driver.maxResultSize", "2G") \
-   .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.8")\
+   .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.1.0")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()
```
@@ -216,7 +216,7 @@ Use either one of the following options
* Add the following Maven Coordinates to the interpreter's library list

```bash
-com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.8
+com.johnsnowlabs.nlp:spark-nlp_2.11:2.1.0
```

* Add the path to the pre-built jar from [here](#pre-compiled-spark-nlp-and-spark-nlp-ocr) in the interpreter's library list, making sure the jar is available on the driver path
@@ -226,7 +226,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.8
Apart from the previous step, install the Python module through pip

```bash
-pip install spark-nlp==2.0.8
+pip install spark-nlp==2.1.0
```

Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -251,7 +251,7 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

-pyspark --packages JohnSnowLabs:spark-nlp:2.0.8
+pyspark --packages JohnSnowLabs:spark-nlp:2.1.0
```

Alternatively, you can mix in the `--jars` option for pyspark together with `pip install spark-nlp`
50 changes: 36 additions & 14 deletions build.sbt
@@ -16,7 +16,7 @@ if(is_gpu.equals("false")){

organization:= "com.johnsnowlabs.nlp"

-version := "2.0.8"
+version := "2.1.0"

scalaVersion in ThisBuild := scalaVer

@@ -86,6 +86,7 @@ developers in ThisBuild:= List(
Developer(id="showy", name="Eduardo Muñoz", email="eduardo@johnsnowlabs.com", url=url("https://github.com/showy"))
)

+target in Compile in doc := baseDirectory.value / "docs/api"

lazy val ocrDependencies = Seq(
"net.sourceforge.tess4j" % "tess4j" % "4.2.1"
@@ -108,19 +109,19 @@ lazy val testDependencies = Seq(
lazy val utilDependencies = Seq(
"com.typesafe" % "config" % "1.3.0",
"org.rocksdb" % "rocksdbjni" % "5.17.2",
"org.apache.hadoop" % "hadoop-aws" % "2.7.3"
"org.apache.hadoop" % "hadoop-aws" % "3.2.0"
exclude("com.fasterxml.jackson.core", "jackson-annotations")
exclude("com.fasterxml.jackson.core", "jackson-databind")
exclude("com.fasterxml.jackson.core", "jackson-core")
exclude("commons-configuration","commons-configuration")
exclude("com.amazonaws","aws-java-sdk-bundle")
exclude("org.apache.hadoop" ,"hadoop-common"),
"com.amazonaws" % "aws-java-sdk" % "1.11.568"
exclude("commons-codec", "commons-codec")
exclude("com.fasterxml.jackson.core", "jackson-core")
"com.amazonaws" % "aws-java-sdk-core" % "1.11.375"
exclude("com.fasterxml.jackson.core", "jackson-annotations")
exclude("com.fasterxml.jackson.core", "jackson-databind")
exclude("com.fasterxml.jackson.dataformat", "jackson-dataformat-smile")
exclude("com.fasterxml.jackson.datatype", "jackson-datatype-joda"),

exclude("com.fasterxml.jackson.core", "jackson-core")
exclude("commons-configuration","commons-configuration"),
"com.amazonaws" % "aws-java-sdk-s3" % "1.11.375",
"org.rocksdb" % "rocksdbjni" % "5.17.2",
"com.github.universal-automata" % "liblevenshtein" % "3.0.0"
exclude("com.google.guava", "guava")
@@ -158,7 +159,6 @@ lazy val root = (project in file("."))


val ocrMergeRules: String => MergeStrategy = {
  case "versionchanges.txt" => MergeStrategy.discard
  case "StaticLoggerBinder" => MergeStrategy.discard
  case PathList("META-INF", fileName)

@@ -171,10 +171,23 @@ val ocrMergeRules: String => MergeStrategy = {
  case _ => MergeStrategy.deduplicate
}

+val evalMergeRules: String => MergeStrategy = {
+  case "versionchanges.txt" => MergeStrategy.discard
+  case "StaticLoggerBinder" => MergeStrategy.discard
+  case PathList("META-INF", fileName)
+    if List("NOTICE", "MANIFEST.MF", "DEPENDENCIES", "INDEX.LIST").contains(fileName) || fileName.endsWith(".txt")
+      => MergeStrategy.discard
+  case PathList("META-INF", "services", _ @ _*) => MergeStrategy.first
+  case PathList("META-INF", xs @ _*) => MergeStrategy.first
+  case PathList("org", "apache", "spark", _ @ _*) => MergeStrategy.discard
+  case PathList("apache", "commons", "logging", "impl", xs @ _*) => MergeStrategy.discard
+  case _ => MergeStrategy.deduplicate
+}

assemblyMergeStrategy in assembly := {
  case PathList("apache.commons.lang3", _ @ _*) => MergeStrategy.discard
-  case PathList("org.apache.hadoop", _ @ _*) => MergeStrategy.last
-  case PathList("com.amazonaws", _ @ _*) => MergeStrategy.last
+  case PathList("org.apache.hadoop", xs @ _*) => MergeStrategy.first
+  case PathList("com.amazonaws", xs @ _*) => MergeStrategy.last
  case PathList("com.fasterxml.jackson") => MergeStrategy.first
  case PathList("META-INF", "io.netty.versions.properties") => MergeStrategy.first
  case PathList("org", "tensorflow", _ @ _*) => MergeStrategy.first
@@ -187,7 +200,15 @@ assemblyMergeStrategy in assembly := {
lazy val evaluation = (project in file("eval"))
  .settings(
    name := "spark-nlp-eval",
-   version := "2.0.8",
+   version := "2.1.0",
+
+   assemblyMergeStrategy in assembly := evalMergeRules,
+
+   libraryDependencies ++= testDependencies ++ Seq(
+     "org.mlflow" % "mlflow-client" % "1.0.0"
+   ),
+
+   test in assembly := {},

publishTo := Some(
if (isSnapshot.value)
@@ -220,7 +241,7 @@ lazy val evaluation = (project in file("eval"))
lazy val ocr = (project in file("ocr"))
  .settings(
    name := "spark-nlp-ocr",
-   version := "2.0.8",
+   version := "2.1.0",

    test in assembly := {},

@@ -294,9 +315,10 @@ copyAssembledOcrJar := {
  println(s"[info] $jarFilePath copied to $newJarFilePath ")
}

+// Includes spark-nlp, so use sparknlp.jar
copyAssembledEvalJar := {
  val jarFilePath = (assemblyOutputPath in assembly in "evaluation").value
-  val newJarFilePath = baseDirectory( _ / "python" / "lib" / "sparknlp-eval.jar").value
+  val newJarFilePath = baseDirectory( _ / "python" / "lib" / "sparknlp.jar").value
  IO.copyFile(jarFilePath, newJarFilePath)
  println(s"[info] $jarFilePath copied to $newJarFilePath ")
}
10 changes: 5 additions & 5 deletions docs/_layouts/landing.html
@@ -49,22 +49,22 @@ <h1>{{ _section.title }}</h1>
<div class="cell cell--12 cell--lg-12" style="text-align: left; background-color: #2d2d2d; padding: 10px">
{% highlight bash %}
# Install Spark NLP from PyPI
-$ pip install spark-nlp==2.0.8
+$ pip install spark-nlp==2.1.0

# Install Spark NLP from Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp

# Load Spark NLP with Spark Shell
-$ spark-shell --packages JohnSnowLabs:spark-nlp:2.0.8
+$ spark-shell --packages JohnSnowLabs:spark-nlp:2.1.0

# Load Spark NLP with PySpark
-$ pyspark --packages JohnSnowLabs:spark-nlp:2.0.8
+$ pyspark --packages JohnSnowLabs:spark-nlp:2.1.0

# Load Spark NLP with Spark Submit
-$ spark-submit --packages JohnSnowLabs:spark-nlp:2.0.8
+$ spark-submit --packages JohnSnowLabs:spark-nlp:2.1.0

# Load Spark NLP as an external JAR after compiling and building Spark NLP by `sbt assembly`
-$ spark-shell --jar spark-nlp-assembly-2.0.8
+$ spark-shell --jar spark-nlp-assembly-2.1.0
{% endhighlight %}
</div>
</div>