✨ Update to v3.3.2 (#51)

## v3.3.2, 2021-05-01

### 🔴 Bug fixes

#### ENG🇬🇧

- fixed the `nlp.max_length limit exceeded` error:
  Text of length 1195652 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).
  The limit is now set to `NLP_EN.max_length = 5000000`;
- fixed the `ImportError: cannot import name 'escape' from 'jinja2'` error:
  This happens because Jinja removed those functions in version 3.1.0, [released on March 24th, 2022](https://jinja.palletsprojects.com/en/3.1.x/changes/#version-3-1-0).

  > ``Markup`` and ``escape`` should be imported from MarkupSafe.

  You have two options from here:
  1. either this error comes from one of your dependencies.
    The first thing to consider is upgrading those dependencies. If that is not possible, you can downgrade Jinja to a version that still includes `escape`, for example by pinning it explicitly in your _requirements.txt_:

    ```
    jinja2<3.1.0
    ```
  2. or the error comes from code you wrote, so you can fix it by importing `escape` from MarkupSafe, as suggested in the Jinja release notes.

    So, you should use
    ```python
    from markupsafe import escape
    ```
    instead of
    ```python
    from jinja2 import escape
    ```
  When using `Flask==1.1.2`, pin the following dependencies: `jinja2<3.1.0`, `itsdangerous==2.0.1`, `Werkzeug<2.0.0`;
- minor code fixes.

### ⚠️ Notes

#### ENG🇬🇧

- updated the spaCy library to version `3.0.6`;
- set `keepalive_timeout 1050` for nginx.

-----

* fix fetch of parce.xml
* update to v3.3.1
* update spaCy up to `3.0.6`
* V3/feature/fix text max length and version update 3 0 8 (#50)
* update spaCy up to `3.0.8`
* Fix maximum text length of nlp

  Fix "Text of length nlp exceeds maximum of 1000000". Set `NLP_EN.max_length = 2000000`
* cleanup
* update and fix Flask

  fix `ImportError: cannot import name 'escape' from 'jinja2'`: Jinja is a dependency of Flask, and Flask V1.X.X uses the escape module from Jinja; newer versions of Jinja have dropped support for it. To fix this issue, update to Flask V2.X.X in your requirements.txt, where Flask no longer uses the escape module from Jinja. Flask==2.1.2
* set spacy to 3.0.6
* revert spacy to 3.0.8
* update spaCy up to `3.1.6`
* Set Flask to 2.0.3
* Set Flask==1.1.2
* Set jinja2<3.1.0
* Set itsdangerous==2.0.1
* Cleanup
* Set Werkzeug < 2.0.0
* Set spacy to 3.0.8
* Set spacy to 3.0.6
* Set keepalive_timeout 1050
* Set NLP_EN.max_length = 5000000
* Set keepalive_timeout 1550
* add big test file for en
* Set keepalive_timeout 1050
* Update to v3.3.2
malakhovks · May 1, 2022 · 84270f0
1 parent 29ed112
commit 84270f0
Showing 11 changed files with 313 additions and 187 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,44 @@
+## v3.3.2, 2021-05-01
+
+### 🔴 Bug fixes
+
+#### ENG🇬🇧
+
+- fixed the `nlp.max_length limit exceeded` error:
+  Text of length 1195652 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).
+  The limit is now set to `NLP_EN.max_length = 5000000`;
+- fixed the `ImportError: cannot import name 'escape' from 'jinja2'` error:
+  This happens because Jinja removed those functions in version 3.1.0, [released on March 24th, 2022](https://jinja.palletsprojects.com/en/3.1.x/changes/#version-3-1-0).
+
+  > ``Markup`` and ``escape`` should be imported from MarkupSafe.
+
+  You have two options from here:
+  1. either this error comes from one of your dependencies.
+    The first thing to consider is upgrading those dependencies. If that is not possible, you can downgrade Jinja to a version that still includes `escape`, for example by pinning it explicitly in your _requirements.txt_:
+
+    ```
+    jinja2<3.1.0
+    ```
+  2. or, your error is from code you wrote, so you can fix it by importing it from MarkupSafe, as suggested in the Jinja release notes.
+    
+    So, you should use
+    ```python
+    from markupsafe import escape
+    ```
+    instead of 
+    ```python
+    from jinja2 import escape
+    ```
+  When using `Flask==1.1.2`, pin the following dependencies: `jinja2<3.1.0`, `itsdangerous==2.0.1`, `Werkzeug<2.0.0`;
+- minor code fixes.
+
+### ⚠️ Notes
+
+#### ENG🇬🇧
+
+- updated the spaCy library to version `3.0.6`;
+- set `keepalive_timeout 1050` for nginx.
+
 ## v3.3.1, 2021-04-21
 
 ### 🔴 Bug fixes
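
The second option described in the changelog entry above can be sketched as follows. This is a minimal illustration, not code from this commit: `render_greeting` is a hypothetical helper, and the fallback to the standard library's `html.escape` is an assumption for environments where MarkupSafe is not installed (it returns a plain `str` rather than a `Markup` object).

```python
# Prefer MarkupSafe, which is where the Jinja 3.1.0 release notes point callers.
try:
    from markupsafe import escape
except ImportError:
    # Assumption: MarkupSafe is unavailable and plain-str output is acceptable.
    from html import escape


def render_greeting(name):
    """Hypothetical helper: escape user input before embedding it in HTML."""
    return "<p>Hello, " + str(escape(name)) + "</p>"


greeting = render_greeting("<b>x</b>")
```

Code that still does `from jinja2 import escape` would only need its import line changed; the call sites can stay as they are.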

diff --git a/KNOWN-ISSUES.md b/KNOWN-ISSUES.md
@@ -0,0 +1,19 @@
+# Known issues
+
+## v3.3.2, 2022-05-01
+
+#### ENG🇬🇧
+
+- the `Flask` library needs to be updated to the current version, with the corresponding changes made to the source code;
+- the `spaCy` library and its statistical models need to be updated to the current version, with the corresponding changes made to the source code;
+- `nlp.max_length` should be set dynamically according to the length of the document. This simplifies handling documents/text of unknown length;
+- the English-language documentation needs to be corrected and brought up to date;
+- validate noun phrases (**NP**, the so-called `base noun phrases` or `noun chunks`: phrases in which a noun is the head, that is, the main word that determines the properties of the whole constituent) to confirm whether they are **terms** (a **term**, from Latin *terminus*, "boundary, limit", is a word or phrase used to denote some **concept**).
+
+  > `base noun phrases`, `noun chunks` - a noun phrase or nominal phrase is a phrase that has a noun (or indefinite pronoun) as its head or performs the same grammatical function as such a phrase. Noun phrases are very common cross-linguistically, and they may be the most frequently occurring phrase type.
+
+  > A noun phrase (NP) is a phrase in which a noun is the head, that is, the main word that determines the properties of the whole constituent. Phrases headed by a pronoun are sometimes also counted as NPs, but more often they are labeled PRNP or PrNP (pronoun phrase). Modern syntactic theories generally hold that even a noun with no dependents is still a noun phrase (consisting of a single word).
+  Noun phrases usually function as objects and subjects of verbs, as predicative expressions, and as complements of prepositions and postpositions. Noun phrases can be nested inside one another: for example, the NP *замок с привидениями* ("castle with ghosts") contains a prepositional phrase (PP), *с привидениями* ("with ghosts"), whose complement is another NP, *привидениями* ("ghosts"), in the instrumental case.
+  A noun phrase that contains a determiner is a determiner phrase (DP). The determiner may be unpronounced (a silent determiner), in which case the NP is still a DP.
+
+- implement proper exception handling.
diff --git a/README.md b/README.md
@@ -6,6 +6,8 @@
 
 -------
 
+The current version of **KEn** (Konspekt English & Ukrainian) is freely available for research and educational purposes at: [https://konspekt.ai-service.ml](https://konspekt.ai-service.ml/)
+
 <a name="toc-ua"></a>
 ## **KEn** (Konspekt English) - a web tool for extracting terms from natural-language texts in English
 

diff --git a/deploy/nginx.conf b/deploy/nginx.conf
@@ -38,7 +38,7 @@ http {
     tcp_nodelay         on;
 
     # server will close connection after this time -- default 75 
-    keepalive_timeout   750;
+    keepalive_timeout   1050;
 
     # internal parameter to speed up hashtable lookups
     types_hash_max_size 2048;

diff --git a/deploy/requirements.txt b/deploy/requirements.txt
@@ -1,14 +1,24 @@
+#---------------------
 Flask==1.1.2
-# flask-cors==3.0.8
+jinja2<3.1.0
+itsdangerous==2.0.1
+Werkzeug<2.0.0
+#---------------------
+# Flask==2.0.3
 flask-cors==3.0.10
+#---------------------
 # spacy>=2.2.0,<3.0.0
-spacy==3.0.5
+spacy==3.0.6
+# spacy==3.1.6
+# https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0.tar.gz#egg=en_core_web_sm
 https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz#egg=en_core_web_sm
 # https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.0.0/en_core_web_trf-3.0.0.tar.gz#egg=en_core_web_trf
 # https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.0.0/en_core_web_lg-3.0.0.tar.gz#egg=en_core_web_lg
+#---------------------
 # pdfminer.six==20200402
 # pdfminer.six==20200517
 pdfminer.six==20201018
+#---------------------
 uWSGI==2.0.18
 nltk==3.5
 chardet==3.0.4

diff --git a/srvr.py b/srvr.py
@@ -55,7 +55,7 @@
 from pdfminer.pdfparser import PDFParser
 
 # load libraries for API processing
-from flask import Flask, jsonify, flash, request, Response, redirect, url_for, abort, render_template, send_file, safe_join
+from flask import Flask, jsonify, flash, request, Response, abort, render_template, send_file, safe_join
 # A Flask extension for handling Cross Origin Resource Sharing (CORS), making cross-origin AJAX possible.
 from flask_cors import CORS
 from werkzeug.utils import secure_filename
@@ -64,6 +64,16 @@
 
 # Load globally spaCy model via package name
 NLP_EN = spacy.load('en_core_web_sm')
+# TODO: set nlp.max_length dynamically according to the length of the document. This simplifies handling documents/text of unknown length.
+"""
+# Example
+for file in folder_text_files:
+    # the with-block closes the file, so no explicit f.close() is needed
+    with open(file, 'r', errors="ignore") as f:
+        text = f.read()
+    nlp.max_length = len(text) + 100 """
+# Text of length 1195652 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
+NLP_EN.max_length = 5000000
 # NLP_EN_TRF = spacy.load('en_core_web_trf')
 
 __author__ = "Kyrylo Malakhov <malakhovks@nas.gov.ua> and Vitalii Velychko <aduisukr@gmail.com>"
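
The TODO in this hunk (raising `nlp.max_length` to fit each document) could look roughly like the sketch below. It is an illustration under assumptions, not part of this commit: `ensure_max_length` and its `margin` parameter are hypothetical names, and `nlp` stands for any object with a `max_length` attribute, such as the `NLP_EN` pipeline loaded above. The limit is only ever raised, never lowered, and the memory caveat from the spaCy error message still applies to very long texts.

```python
from types import SimpleNamespace


def ensure_max_length(nlp, text, margin=100):
    """Raise nlp.max_length so that `text` fits; never lower it.

    Hypothetical helper: `nlp` is any object exposing a `max_length`
    attribute (for example a loaded spaCy pipeline such as NLP_EN).
    """
    needed = len(text) + margin
    if needed > nlp.max_length:
        nlp.max_length = needed
    return nlp.max_length


# Stand-in for a spaCy pipeline so the sketch runs without spaCy installed.
fake_nlp = SimpleNamespace(max_length=1_000_000)
long_text = "a" * 1_195_652  # same length as in the error message above
new_limit = ensure_max_length(fake_nlp, long_text)
```

In `srvr.py` this would presumably be called as `ensure_max_length(NLP_EN, text)` just before `NLP_EN(text)`, in place of the fixed `NLP_EN.max_length = 5000000`.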