Commit
- Updated release number
- Updated documentation for release 1.2.4
- Updated missing format params
saif-ellafi committed Dec 23, 2017
1 parent 984ccf8 commit 022f98b
Showing 14 changed files with 316 additions and 89 deletions.
53 changes: 53 additions & 0 deletions CHANGELOG
@@ -1,3 +1,56 @@
========
1.2.4
========
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/c17ddac7a5a9e775cddc18d672e80e60f0040e38
ResourceHelper now allows input files to be read as a Spark Dataset, implicitly enabling HDFS paths and allowing larger annotator input files. Set 'TXTDS' as the input format param to make annotators read this way (a hedged sketch follows below). Allowed in: Lemmatizer, EntityExtractor, RegexMatcher, Sentiment Analysis models, Spell Checker and Dependency Parser.
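A minimal sketch of what this looks like for the Lemmatizer, assuming the `setLemmaFormat` setter documented in components.html below; the dictionary path and column names are illustrative, not taken from the release:

```scala
import com.johnsnowlabs.nlp.annotators.Lemmatizer

// Read the lemma dictionary through Spark's Dataset reader so HDFS paths work
val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary("hdfs:///data/lemmas.txt") // illustrative path
  .setLemmaFormat("TXTDS")                  // plain "TXT" remains the default
```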

---------------
Enhancements and progress
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/4920e5ce394b25937969cc4cab1d81172be722a3
CRF NER Benchmarking progress
* https://github.com/JohnSnowLabs/spark-nlp/pull/64
EntityExtractor refactored. This annotator uses an input file containing a list of entities to look for inside the target text. It has been refactored for better usability and, specifically, better speed, by using a Trie search algorithm; a short usage sketch follows below. Proper examples are included in the Python notebooks.
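A hedged usage sketch, assuming the `setEntitiesPath` setter introduced by the refactor and the 1.x package path; the input column and file path are illustrative:

```scala
import com.johnsnowlabs.nlp.annotators.EntityExtractor

// Matches phrases from a dictionary file against the target text via a Trie
val entityExtractor = new EntityExtractor()
  .setInputCols(Array("normalized"))     // illustrative upstream column
  .setOutputCol("entities")
  .setEntitiesPath("/data/entities.txt") // one phrase per line (assumed format)
```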

---------------
Bug fixes
---------------
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/41 <> https://github.com/JohnSnowLabs/spark-nlp/commit/d3b9086e834233f3281621d7c82e32195479fc82
Fixed default resources not being loaded properly when using the library through --spark-packages. Improved input reading from resources and folder resources, falling back to disk, with better error handling.
* https://github.com/JohnSnowLabs/spark-nlp/commit/08405858c6186e6c3e8b668233e30df12fa50374
Corrected param names in DocumentAssembler
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/58 <> https://github.com/JohnSnowLabs/spark-nlp/commit/5a533952cdacf67970c5a8042340c8a4c9416b13
Deleted a left-over deprecated function which was misleading.
* https://github.com/JohnSnowLabs/spark-nlp/commit/c02591bd683db3f615150d7b1d121ffe5d9e4535
Added filtering to ensure no empty sentences reach the unnormalized Vivekn Sentiment Analysis

---------------
Documentation and examples
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/b81e95ce37ed3c4bd7b05e9f9c7b63b31d57e660
Added additional resources to the FAQ page.
* https://github.com/JohnSnowLabs/spark-nlp/commit/0c3f43c0d3e210f3940f7266fe84426900a6294e
Added Spark Summit example notebook with full Pipeline use case
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/53 <> https://github.com/JohnSnowLabs/spark-nlp/commit/20efe4a3a5ffbceedac7bf775466b7a8cde5044f
Fixed Scala and Python documentation mistakes
* https://github.com/JohnSnowLabs/spark-nlp/commit/782eb8dce171b69a615887b3defaf8b729b735f2
Fixed typos

---------------
Other
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/91d8acb1f0f4840dad86db3319d0b062bd63b8c6
Removed Regex NER due to slowness and little use. CRF NER will replace it.

========
1.2.3
========
12 changes: 6 additions & 6 deletions README.md
@@ -13,15 +13,15 @@ This library has been uploaded to the spark-packages repository https://spark-pa
To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.2.4` to your spark command

```sh
spark-shell --packages JohnSnowLabs:spark-nlp:1.2.3
spark-shell --packages JohnSnowLabs:spark-nlp:1.2.4
```

```sh
pyspark --packages JohnSnowLabs:spark-nlp:1.2.3
pyspark --packages JohnSnowLabs:spark-nlp:1.2.4
```

```sh
spark-submit --packages JohnSnowLabs:spark-nlp:1.2.3
spark-submit --packages JohnSnowLabs:spark-nlp:1.2.4
```

If you want to use an old version, check the spark-packages website to see all the releases.
@@ -36,19 +36,19 @@ Our package is deployed to maven central. In order to add this package as a depe
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.11</artifactId>
<version>1.2.3</version>
<version>1.2.4</version>
</dependency>
```

#### SBT
```sbtshell
libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.2.3"
libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.2.4"
```

If you are using `scala 2.11`

```sbtshell
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.2.3"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.2.4"
```

## Using the jar manually
2 changes: 1 addition & 1 deletion build.sbt
@@ -7,7 +7,7 @@ name := "spark-nlp"

organization := "com.johnsnowlabs.nlp"

version := "1.2.3"
version := "1.2.4"

scalaVersion := scalaVer

82 changes: 77 additions & 5 deletions docs/components.html
@@ -336,6 +336,21 @@ <h4 id="Lemmatizer" class="section-block"> 5. Lemmatizer: Lemmas</h4>
setDictionary(path): Path to file containing multiple key to value
dictionary, or key,value lemma dictionary. Default: Not provided
</li>
<li>
setLemmaFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs)
Default:
Looks up path in configuration
</li>
<li>
setLemmaKeySep(sep): Separator for keys and multiple values
Default:
"->" or Looks up path in configuration
</li>
<li>
setLemmaValSep(sep): Separator among values
Default:
"\t" or Looks up path in configuration
</li>
</ul>
<b>Example:</b><br>
</p>
@@ -361,6 +376,21 @@ <h4 id="Lemmatizer" class="section-block"> 5. Lemmatizer: Lemmas</h4>
setDictionary(path): Path to file containing multiple key to value
dictionary, or key,value lemma dictionary. Default: Not provided
</li>
<li>
setLemmaFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs)
Default:
Looks up path in configuration
</li>
<li>
setLemmaKeySep(sep): Separator for keys and multiple values
Default:
"->" or Looks up path in configuration
</li>
<li>
setLemmaValSep(sep): Separator among values
Default:
"\t" or Looks up path in configuration
</li>
</ul>
<b>Example:</b><br>
</p>
@@ -396,10 +426,20 @@ <h4 id="RegexMatcher" class="section-block"> 6. RegexMatcher: Rule matching</h4>
MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE
</li>
<li>
setRules(path): Path to file containing a set of regex,key pair.
setRulesPath(path): Path to file containing a set of regex,key pair.
Default:
Looks up path in configuration
</li>
<li>
setRulesFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs)
Default:
TXT or looks up path in configuration
</li>
<li>
setRulesSeparator(sep): Separator for rules file
Default:
"," or looks up path in configuration
</li>
</ul>
<b>Example:</b><br>
</p>
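For illustration, a sketch combining the parameters above; treat the rules file layout (one regex,identifier pair per line) and the paths as assumptions rather than documented behavior:

```scala
import com.johnsnowlabs.nlp.annotators.RegexMatcher

// Each line of rules.txt is assumed to hold a "regex,identifier" pair,
// split on the value given to setRulesSeparator
val regexMatcher = new RegexMatcher()
  .setInputCols(Array("document"))
  .setOutputCol("regex_matches")
  .setStrategy("MATCH_ALL")        // or MATCH_FIRST | MATCH_COMPLETE
  .setRulesPath("/data/rules.txt") // illustrative path
  .setRulesFormat("TXT")           // TXTDS would read it as a Spark Dataset
  .setRulesSeparator(",")
```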
@@ -424,10 +464,20 @@ <h4 id="RegexMatcher" class="section-block"> 6. RegexMatcher: Rule matching</h4>
MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE
</li>
<li>
setRules(path): Path to file containing a set of regex,key pair.
setRulesPath(path): Path to file containing a set of regex,key pair.
Default:
Looks up path in configuration
</li>
<li>
setRulesFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs)
Default:
TXT or looks up path in configuration
</li>
<li>
setRulesSeparator(sep): Separator for rules file
Default:
"," or looks up path in configuration
</li>
</ul>
<b>Example:</b><br>
</p>
@@ -467,10 +517,15 @@ <h4 id="EntityExtractor" class="section-block"> 7. EntityExtractor: Phrase match
boundaries for better precision
</li>
<li>
setEntities(path): Provides a file with phrases to match. Default:
setEntitiesPath(path): Provides a file with phrases to match. Default:
Looks up
path in configuration
</li>
<li>
setEntitiesFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs)
Default:
TXT or looks up path in configuration
</li>
</ul>
<b>Example:</b><br>
</p>
@@ -498,10 +553,15 @@ <h4 id="EntityExtractor" class="section-block"> 7. EntityExtractor: Phrase match
boundaries for better precision
</li>
<li>
setEntities(path): Provides a file with phrases to match. Default:
setEntitiesPath(path): Provides a file with phrases to match. Default:
Looks up
path in configuration
</li>
<li>
setEntitiesFormat(format): TXT for txt files or TXTDS for text files read as dataset (allows hdfs)
Default:
TXT or looks up path in configuration
</li>
</ul>
<b>Example:</b><br>
</p>
@@ -710,6 +770,12 @@ <h4 id="SentimentDetector" class="section-block"> 11. SentimentDetector: Sentime
<li>
setDictPath(path)
</li>
<li>
setDictFormat(format)
</li>
<li>
setDictSeparator(separator)
</li>
</ul>
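A sketch of the three dictionary parameters used together; the package path (taken from later spark-nlp releases), the dictionary layout, and the separator value are all assumptions:

```scala
import com.johnsnowlabs.nlp.annotators.sda.pragmatic.SentimentDetector

// The dictionary file is assumed to map words to sentiment tags,
// one "word,sentiment" pair per line
val sentimentDetector = new SentimentDetector()
  .setInputCols(Array("lemma", "sentence"))
  .setOutputCol("sentiment_score")
  .setDictPath("/data/sentiment-dict.txt") // illustrative path
  .setDictFormat("TXT")                    // or TXTDS for Dataset-based reading
  .setDictSeparator(",")
```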
<br>
<b>Input:</b>
@@ -739,6 +805,12 @@ <h4 id="SentimentDetector" class="section-block"> 11. SentimentDetector: Sentime
<li>
setDictPath(path)
</li>
<li>
setDictFormat(format)
</li>
<li>
setDictSeparator(separator)
</li>
</ul>
<br>
<b>Input:</b>
@@ -884,7 +956,7 @@ <h4 id="SpellChecker" class="section-block"> 13. SpellChecker: Token spell
setCorpusPath: path to training corpus. Can be any good text.
</li>
<li>
setCorpusFormat(format): Allowed “txt” or “txtds”. The latter uses spark dataframes from text
setCorpusFormat(format): Allowed “txt” or “txtds”. The latter uses spark dataframes from text
</li>
<li>
setSlangPath: path to custom dictionaries, separated by comma
18 changes: 5 additions & 13 deletions python/example/vivekn-sentiment/sentiment.ipynb
@@ -38,9 +38,7 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"metadata": {},
"outputs": [],
"source": [
"#Load the input data to be annotated\n",
@@ -160,9 +158,7 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"metadata": {},
"outputs": [],
"source": [
"pipeline = Pipeline(stages=[\n",
@@ -182,9 +178,7 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"metadata": {},
"outputs": [],
"source": [
"for r in sentiment_data.take(5):\n",
@@ -217,9 +211,7 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"metadata": {},
"outputs": [],
"source": [
"Pipeline.read().load(\"./ps\")\n",
@@ -239,7 +231,7 @@
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
