Merge pull request #258 from JohnSnowLabs/161-release-candidate
1.6.1 release candidate
saif-ellafi authored Aug 9, 2018
2 parents 594342b + 4f6bd68 commit 2e9808a
Showing 16 changed files with 144 additions and 244 deletions.
51 changes: 51 additions & 0 deletions CHANGELOG
@@ -1,3 +1,54 @@
========
1.6.1
========
---------------
Overview
---------------
Hi! We're glad to announce hotfix release 1.6.1. Although the changes may seem modest or very specific, there is a lot going on under the hood. First of all, we've worked hard with the community to understand S3-based clusters,
which don't have a common fs.defaultFS configuration, the setting we use to determine where the cluster temp folder is located in order to distribute word embeddings. We fixed two things here:
first, a bug that pointed to the wrong filesystem; second, we added a custom override setting in application.conf that allows manually setting where temp folders are placed in a cluster. This should help S3 users.
Please share your feedback in this regard.
On the other hand, we created a new annotator type internally. The CHUNK type allows better modularity in the communication between different annotators. Its impact will be noticed implicitly and over time.

---------------
New features
---------------
* New Scala-only functions that make it easier to work with Annotations in DataFrames. They may be imported through com.johnsnowlabs.nlp.functions._ and allow mapping and filtering within and outside Annotations.
filterByAnnotations, mapAnnotations and explodeAnnotations work by providing a column and a function (see the sketch below). Check out the documentation. Python support may come later.
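
As an illustration, here is a minimal Scala sketch of how these helpers might be used. The exact signatures in the released API may differ; the DataFrame `df` and its "token" annotation column are assumptions:

```scala
import org.apache.spark.sql.DataFrame
import com.johnsnowlabs.nlp.Annotation
import com.johnsnowlabs.nlp.functions._

// df is assumed to already hold a "token" annotation column
def demo(df: DataFrame): (DataFrame, DataFrame) = {
  // keep only rows whose token annotations contain the word "cluster"
  val filtered = df.filterByAnnotations("token",
    (annotations: Seq[Annotation]) => annotations.exists(_.result == "cluster"))

  // upper-case the result of every token annotation
  val mapped = df.mapAnnotations("token",
    (annotations: Seq[Annotation]) => annotations.map(a => a.copy(result = a.result.toUpperCase)))

  (filtered, mapped)
}
```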

---------------
Bug fixes
---------------
* Fixed incorrect filesystem readings in some S3 environments for word embeddings
* Fixed NerCRF not training correctly from CoNLL, labeling everything as -O- (thanks @arnound from the Slack channel)

---------------
Enhancements
---------------
* Added overridable config sparknlp.settings.cluster_tmp_dir, which allows setting the cluster location for the temporary embeddings file. May help S3-based clusters with no fs.defaultFS pointing to a proper distributed storage (see the configuration sketch after this list).
* New annotator type: CHUNK. Represents a SUBSTRING of a DOCUMENT and is used as the output of NerConverter, TextMatcher, RegexMatcher and other annotators that retrieve a substring from the original document.
This will allow better modularity and integration across annotators, such as between NER and AssertionStatus.
* New annotation transformer: ChunkAssembler. Takes a string or array(string) column from a dataset and creates a CHUNK type annotation. The content must also belong to the current DOCUMENT annotation's content.
* SentenceDetector's new param explodeSentences allows exploding the sentences within a single row into different rows, to increase parallelism and performance in some scenarios, particularly OCR-based ones (see the Scala sketch after this list).
* AssertionDLApproach may now be used within LightPipelines
* AssertionDLApproach and AssertionLogRegApproach now work from the CHUNK type instead of start/end bounds. They may still be trained with start/end, though. This means the assertion target may now be the output of any CHUNK annotator (e.g. RegexMatcher)
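
As a sketch, the new override could be set in application.conf like this (the key name comes from the entry above; the value shown is a hypothetical S3 path):

```
# application.conf -- sketch; the path below is hypothetical
sparknlp.settings.cluster_tmp_dir = "s3a://my-bucket/spark-nlp/tmp"
```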
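
And a short Scala sketch of explodeSentences in a pipeline stage (the setter name is assumed from the param name; the wiring is illustrative):

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector

// raw text to DOCUMENT annotations
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// with explodeSentences enabled, each detected sentence lands in its own row,
// which can increase parallelism in some scenarios (particularly OCR-based ones)
val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")
  .setExplodeSentences(true)
```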

---------------
Other
---------------
* PerceptronApproachLegacy moved back to being the default PerceptronApproach. The distributed PerceptronApproach moved to PerceptronApproachDistributed, as it does not meet accuracy expectations yet.
* Some configuration parameters in application.conf have been moved to proper annotator Params (affecting the NorvigSweeting Spell Checker, Vivekn Approach and Sentiment Detector)
* Renamed configuration values in application.conf for better consistency

---------------
Developer API
---------------
* Added beforeAnnotate() and afterAnnotate() to manipulate dataframes before or after calling the annotate() UDF
* Added extraValidate() and extraValidateMsg() to all annotators so developers can add additional SCHEMA checks in the transformSchema() stage
* Removed the validation() step from the fit() stage, allowing more flexible training when some of the columns are not yet required
* WrapColumnMetadata() wraps an Annotation column with its appropriate Metadata, making it harder to forget about Metadata in the Schema
* The RawAnnotator trait now has all the basics needed to start a new Annotator without the annotate() function. It is a complete stage prior to AnnotatorModel, which inherits from RawAnnotator

========
1.6.0
========
28 changes: 14 additions & 14 deletions README.md
@@ -14,18 +14,18 @@ Questions? Feedback? Request access by sending an email to nlp@johnsnowlabs.com

This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .

To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.6.0` to your spark command
To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.6.1` to your spark command

```sh
spark-shell --packages JohnSnowLabs:spark-nlp:1.6.0
spark-shell --packages JohnSnowLabs:spark-nlp:1.6.1
```

```sh
pyspark --packages JohnSnowLabs:spark-nlp:1.6.0
pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
```

```sh
spark-submit --packages JohnSnowLabs:spark-nlp:1.6.0
spark-submit --packages JohnSnowLabs:spark-nlp:1.6.1
```

## Jupyter Notebook
@@ -35,23 +35,23 @@ export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages JohnSnowLabs:spark-nlp:1.6.0
pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
```

## Apache Zeppelin
This approach works for both Scala and Python
```
export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.0"
export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.1"
```
Alternatively, add the following Maven Coordinates to the interpreter's library list
```
com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.0
com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.1
```

## Python without explicit Spark installation
If you installed pyspark through pip, you can now install sparknlp through pip
```
pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.0
pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.1
```
Then you'll have to create a SparkSession manually, for example:
```
@@ -67,11 +67,11 @@ spark = SparkSession.builder \

## Pre-compiled Spark-NLP and Spark-NLP-OCR
You may download the fat JAR from here:
[Spark-NLP 1.6.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-assembly-1.6.0.jar)
[Spark-NLP 1.6.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-assembly-1.6.1.jar)
or the non-fat JAR from here:
[Spark-NLP 1.6.0 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.6.0/spark-nlp_2.11-1.6.0.jar)
[Spark-NLP 1.6.1 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.6.1/spark-nlp_2.11-1.6.1.jar)
Spark-NLP-OCR Module (requires native Tesseract 4.x+ for image-based OCR; it does not require Spark-NLP to work, but it is highly suggested)
[Spark-NLP-OCR 1.6.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-ocr-assembly-1.6.0.jar)
[Spark-NLP-OCR 1.6.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-ocr-assembly-1.6.1.jar)

## Maven Central

@@ -83,19 +83,19 @@ Our package is deployed to Maven Central. In order to add this package as a dependency
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.11</artifactId>
<version>1.6.0</version>
<version>1.6.1</version>
</dependency>
```

#### SBT
```sbtshell
libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.6.0"
libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.6.1"
```

If you are using Scala 2.11, you may let SBT resolve the artifact suffix with `%%`:

```sbtshell
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.0"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.1"
```

## Using the jar manually
Expand Down
4 changes: 2 additions & 2 deletions build.sbt
@@ -9,7 +9,7 @@ name := "spark-nlp"

organization := "com.johnsnowlabs.nlp"

version := "1.6.0"
version := "1.6.1"

scalaVersion in ThisBuild := scalaVer

@@ -137,7 +137,7 @@ assemblyMergeStrategy in assembly := {
lazy val ocr = (project in file("ocr"))
.settings(
name := "spark-nlp-ocr",
version := "1.6.0",
version := "1.6.1",
libraryDependencies ++= ocrDependencies ++
analyticsDependencies ++
testDependencies,
4 changes: 2 additions & 2 deletions docs/index.html
@@ -78,8 +78,8 @@ <h2 class="title">High Performance NLP with Apache Spark </h2>
</p>
<a class="btn btn-info btn-cta" style="float: center;margin-top: 10px;" href="mailto:nlp@johnsnowlabs.com?subject=SparkNLP%20Slack%20access" target="_blank"> Questions? Join our Slack</a>
<b/><p/><p/>
<p><span class="label label-warning">2018 Jul 7th - Update!</span> 1.6.0 Released! OCR PDF to Spark-NLP capabilities, new Chunker annotator, fixed AWS compatibility, better performance and much more.
Learn about the changes <a href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/CHANGELOG">HERE</a> and check out the updated documentation below</p>
<p><span class="label label-warning">2018 Aug 9th - Update!</span> 1.6.1 Released! Fixed support for S3-based clusters, new CHUNK annotation type and more!
Learn about the changes <a href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/CHANGELOG">HERE</a> and check out the updated documentation below</p>
</div>
<div id="cards-wrapper" class="cards-wrapper row">
<div class="item item-green col-md-4 col-sm-6 col-xs-6">
18 changes: 9 additions & 9 deletions docs/notebooks.html
@@ -103,7 +103,7 @@ <h4 id="scala-vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis</h4>
Since we are dealing with small amounts of data, we put LightPipelines into practice.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
</p>
</div>
</section>
@@ -135,7 +135,7 @@ <h4 id="vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis</h4>
better Sentiment Analysis accuracy
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -157,7 +157,7 @@ <h4 id="sentiment-notebook" class="section-block"> Rule-based Sentiment Analysis</h4>
Each of these sentences will be used for giving a score to text
</p>
</p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -177,7 +177,7 @@ <h4 id="crfner-notebook" class="section-block"> CRF Named Entity Recognition</h4>
approach to use the same pipeline for tagging external resources.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -196,7 +196,7 @@ <h4 id="dlner-notebook" class="section-block"> CNN Deep Learning NER</h4>
and it will leverage batch-based distributed calls to native TensorFlow libraries during prediction.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -211,7 +211,7 @@ <h4 id="text-notebook" class="section-block"> Simple Text Matching</h4>
This annotator is an AnnotatorModel and does not require training.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -226,7 +226,7 @@ <h4 id="assertion-notebook" class="section-block"> Assertion Status with LogReg</h4>
dataset will return the appropriate result.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/logreg-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/logreg-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -241,7 +241,7 @@ <h4 id="dlassertion-notebook" class="section-block"> Deep Learning Assertion Status</h4>
graphs may be redesigned if needed.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dl-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/dl-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -260,7 +260,7 @@ <h4 id="downloader-notebook" class="section-block"> Retrieving Pretrained models</h4>
Such components may then be injected seamlessly into further pipelines, and so on.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
</section>