Merge pull request #83 from Knox-AAU/75-train-danish-model-to-find-literals
75 train danish model to find literals
Showing 12 changed files with 915 additions and 37 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
# Pypi Repository | ||
|
||
For the purpose of our pipeline, we have created a pip repository that is hosted in a Docker container on the _knox-web01.srv.aau.dk_ server. It runs on [pypiserver](https://github.com/pypiserver/pypiserver). | ||
|
||
At the time of writing, a ticket has been sent to ITS to create a domain name for the server at the internal domain [http://pypi.knox.cs.aau.dk](http://pypi.knox.cs.aau.dk). Internal means it is only accessible on campus or through VPN. | ||
|
||
## Compose File Location | ||
|
||
The _compose.yml_ file is located in _/srv/data/pip-repo/_ in the _knox-web01.srv.aau.dk_ server. | ||
|
||
## Uploading | ||
|
||
Uploading can be done using [twine](https://github.com/pypa/twine). If the domain [http://pypi.knox.cs.aau.dk](http://pypi.knox.cs.aau.dk) is not yet set up, the repository should still be running on the web01 server. There are therefore two ways to upload to the pip repository: | ||
|
||
### By SSH | ||
|
||
First you need to SSH into the server using the following command: | ||
|
||
```BASH | ||
ssh USERNAME@student.aau.dk@knox-web01.srv.aau.dk -L 8081:localhost:8081 | ||
``` | ||
|
||
The above command opens an SSH session on the server and forwards port 8081 on the server to your local machine. You should now be able to open <http://localhost:8081/simple> in your browser and see the repository. | ||
|
||
To upload using twine, run the following command: | ||
|
||
```BASH | ||
twine upload --repository-url http://localhost:8081 --sign PACKAGENAME.whl | ||
``` | ||
|
||
Uploading requires no authentication, since the repository is only available on campus (or through VPN). | ||
|
||
### By Domain (if domain is up) | ||
|
||
Once the domain is up, the following twine command can be used instead: | ||
|
||
```BASH | ||
twine upload --repository-url http://pypi.knox.cs.aau.dk --sign PACKAGENAME.whl | ||
``` | ||
|
||
## Installing through the repository | ||
|
||
To install packages from the repository, simply use pip. | ||
Again, because we do not yet know when the domain will be available, two methods are possible. | ||
|
||
You can either connect to the web01 server with the command: | ||
|
||
```BASH | ||
ssh USERNAME@student.aau.dk@knox-web01.srv.aau.dk -L 8081:localhost:8081 | ||
``` | ||
|
||
And afterwards in another terminal run the pip command: | ||
|
||
```BASH | ||
pip3 install --index-url http://localhost:8081/simple PACKAGE-NAME | ||
``` | ||
|
||
If the domain is available, simply replace `localhost:8081` with the domain: | ||
|
||
```BASH | ||
pip3 install --index-url http://pypi.knox.cs.aau.dk/simple PACKAGE-NAME | ||
``` | ||
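
The two install commands differ only in the index URL. As a rough sketch, the helper below assembles the pip arguments for either case. `build_pip_install_args` is a hypothetical name that does not exist in the repository, and the `--trusted-host` handling is an assumption: pip normally refuses plain-HTTP indexes on hosts other than localhost unless the host is marked as trusted, but this may be unnecessary depending on your pip configuration.

```python
import sys
from typing import List
from urllib.parse import urlparse


def build_pip_install_args(index_url: str, package: str) -> List[str]:
    """Return an argv list that installs `package` from `index_url` (hypothetical helper)."""
    args = [sys.executable, "-m", "pip", "install", "--index-url", index_url]
    parsed = urlparse(index_url)
    # Assumption: plain HTTP on a non-local host has to be trusted explicitly.
    if parsed.scheme == "http" and parsed.hostname not in ("localhost", "127.0.0.1"):
        args += ["--trusted-host", parsed.hostname]
    args.append(package)
    return args


if __name__ == "__main__":
    # Print the two command lines instead of running them.
    for url in ("http://localhost:8081/simple", "http://pypi.knox.cs.aau.dk/simple"):
        print(" ".join(build_pip_install_args(url, "PACKAGE-NAME")))
```

Passing the returned list to `subprocess.run(args, check=True)` would perform the actual install.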
|
||
## Creating a whl package from spaCy | ||
|
||
If you have trained a model, you can use spaCy to create a whl package for the repository. | ||
|
||
This is done with the command: | ||
|
||
```BASH | ||
spacy package MODEL-FOLDER OUTPUT-LOCATION --name package-name --build wheel | ||
``` | ||
|
||
Example command: | ||
|
||
```BASH | ||
spacy package trainedmodel/updated_da_model model_packages --name core_news_knox_lg --build wheel | ||
``` | ||
|
||
Note that we have left out the "da" before "core" in `--name`. The language prefix is added automatically from the `lang` field in the model's meta.json file. |
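
To make the naming rule concrete, here is a minimal sketch of how the final package name is put together from the language code and the `--name` value. The function `packaged_model_name` and the example `meta` dictionary are illustrative assumptions; spaCy performs this step itself when running `spacy package`.

```python
from typing import Optional


def packaged_model_name(meta: dict, name: Optional[str] = None) -> str:
    """Combine the language code with the model name, as in `<lang>_<name>`."""
    # A --name value on the command line takes the place of the name in meta.json.
    model_name = name if name is not None else meta["name"]
    return f"{meta['lang']}_{model_name}"


# Hypothetical meta.json content for the trained Danish model.
meta = {"lang": "da", "name": "core_news_lg", "version": "3.7.0"}
print(packaged_model_name(meta, "core_news_knox_lg"))  # da_core_news_knox_lg
```

Once the wheel is installed, the model should be loadable under that full name, for example `spacy.load("da_core_news_knox_lg")`.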
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
|
||
en_core_web_lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.0/en_core_web_lg-3.7.0-py3-none-any.whl | ||
da_core_news_lg @ https://github.com/explosion/spacy-models/releases/download/da_core_news_lg-3.7.0/da_core_news_lg-3.7.0-py3-none-any.whl |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
[tool.pytest.ini_options] | ||
minversion = "6.0" | ||
-addopts = "-W ignore::DeprecationWarning --cov ." | ||
+addopts = "-W ignore::DeprecationWarning --cov" | ||
+testpaths = ["tests/unit", "tests/integration"] | ||
|
||
[tool.black] | ||
line-length = 79 | ||
line-length = 79 |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,142 @@ | ||
import spacy | ||
|
||
# Load your trained model | ||
nlp = spacy.load("trainedmodel/updated_da_model") | ||
|
||
class bcolors: | ||
    HEADER = '\033[95m' | ||
    OKBLUE = '\033[94m' | ||
    OKCYAN = '\033[96m' | ||
    OKGREEN = '\033[92m' | ||
    WARNING = '\033[93m' | ||
    FAIL = '\033[91m' | ||
    ENDC = '\033[0m' | ||
    BOLD = '\033[1m' | ||
    UNDERLINE = '\033[4m' | ||
|
||
|
||
# Evaluation data | ||
eval_data = [ | ||
    ( | ||
        "I 1976 blev Apple opfundet", | ||
        {"entities": [(0, 6, "LITERAL"), (12, 17, "ORG")]}, | ||
    ), | ||
    ( | ||
        "iPhone 12 blev udgivet i 2020", | ||
        {"entities": [(0, 9, "MISC"), (23, 29, "LITERAL")]}, | ||
    ), | ||
    ( | ||
        "Det koster 1000 kr. at købe denne ting.", | ||
        {"entities": [(11, 19, "LITERAL")]}, | ||
    ), | ||
    ( | ||
        "I morgen skal jeg i skole.", | ||
        { | ||
            "entities": [ | ||
                (0, 8, "LITERAL"), | ||
            ] | ||
        }, | ||
    ), | ||
    ( | ||
        "I dag skulle vi møde kl. 08:15.", | ||
        {"entities": [(0, 5, "LITERAL"), (21, 30, "LITERAL")]}, | ||
    ), | ||
    ( | ||
        "Bussen kommer 13:00.", | ||
        {"entities": [(14, 19, "LITERAL")]}, | ||
    ), | ||
    ( | ||
        "Vi skal aflevere d. 21/12/2023.", | ||
        {"entities": [(17, 30, "LITERAL")]}, | ||
    ), | ||
    ( | ||
        "Vestjyllands finansminister Jørgen Kofoed og hans børn, blev mandag d. 3. December opkøbt af storkoncernen Apple for 20 kr.", | ||
        { | ||
            "entities": [ | ||
                (0, 12, "LOCATION"), | ||
                (28, 41, "PERSON"), | ||
                (45, 54, "PERSON"), | ||
                (61, 82, "LITERAL"), | ||
                (107, 112, "ORG"), | ||
                (117, 123, "LITERAL"), | ||
            ] | ||
        }, | ||
    ), | ||
    ( | ||
        "George Bush var skyld i 9/11, og jeg skal til Struer d. 28/11/2023.", | ||
        { | ||
            "entities": [ | ||
                (0, 11, "PERSON"), | ||
                (24, 28, "LITERAL"), | ||
                (46, 52, "LOCATION"), | ||
                (53, 66, "LITERAL"), | ||
            ] | ||
        }, | ||
    ), | ||
    ( | ||
        "Peter gik over vejen og købte mælk og Epstein dræbte ikke sig selv for 2 dage siden.", | ||
        { | ||
            "entities": [ | ||
                (0, 5, "PERSON"), | ||
                (38, 45, "PERSON"), | ||
                (67, 83, "LITERAL"), | ||
            ] | ||
        }, | ||
    ), | ||
] | ||
|
||
# Initialize evaluation metrics | ||
eval_metrics = { | ||
    "correct": 0, | ||
    "incorrect": 0, | ||
    "missed": 0, | ||
    "partial": 0, | ||
    "spurious": 0, | ||
} | ||
|
||
# Evaluate the model | ||
for text, annotations in eval_data: | ||
    gold_entities = [ | ||
        text[start:end] for start, end, _ in annotations.get("entities", []) | ||
    ] | ||
    gold_labels = [ | ||
        label for start, end, label in annotations.get("entities", []) | ||
    ] | ||
    doc = nlp(text) | ||
|
||
    print(f"Text: {text}") | ||
    print("Gold Entities:", gold_entities) | ||
    print("Gold Labels", gold_labels) | ||
|
||
    recognized_entities = [ent.text for ent in doc.ents] | ||
    recognized_labels = [ent.label_ for ent in doc.ents] | ||
    print("Recognized Entities:", recognized_entities) | ||
    print("Recognized Labels", recognized_labels) | ||
|
||
    for ent in doc.ents: | ||
        if ent.text in gold_entities: | ||
            eval_metrics["correct"] += 1 | ||
        else: | ||
            eval_metrics["spurious"] += 1 | ||
|
||
    for gold_entity in gold_entities: | ||
        if gold_entity not in recognized_entities: | ||
            eval_metrics["missed"] += 1 | ||
    if recognized_entities == gold_entities: | ||
        print(f"{bcolors.OKGREEN}PASSED!{bcolors.ENDC}") | ||
    else: | ||
        print(f"{bcolors.FAIL}FAILED{bcolors.ENDC}") | ||
    print("\n---\n") | ||
|
||
# Calculate precision, recall, and F1 score | ||
precision = eval_metrics["correct"] / ( | ||
    eval_metrics["correct"] + eval_metrics["spurious"] + 1e-8 | ||
) | ||
recall = eval_metrics["correct"] / ( | ||
    eval_metrics["correct"] + eval_metrics["missed"] + 1e-8 | ||
) | ||
f1_score = 2 * (precision * recall) / (precision + recall + 1e-8) | ||
|
||
print( | ||
    f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1_score:.2f}" | ||
) |
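
The script above counts an entity as correct when its text matches a gold entity, so a span with the right text but the wrong label still counts as a hit. A stricter variant, sketched here as an assumption and not as part of the commit, compares full `(start, end, label)` tuples. The helper name `score_spans` is hypothetical.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def score_spans(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact span-and-label matching."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)   # same offsets and same label
    spurious = len(pred_set - gold_set)  # predicted but not in gold
    missed = len(gold_set - pred_set)    # in gold but not predicted
    precision = correct / (correct + spurious) if pred_set else 0.0
    recall = correct / (correct + missed) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Gold spans from the first evaluation sentence; the prediction mislabels "Apple".
gold = [(0, 6, "LITERAL"), (12, 17, "ORG")]
predicted = [(0, 6, "LITERAL"), (12, 17, "MISC")]
print(score_spans(gold, predicted))  # (0.5, 0.5, 0.5)
```

With a spaCy `doc`, the predicted spans could be collected as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.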