Commit
- add our component diagram update api.md update readme
1 parent 07cb9ba, commit 5b3d721
Showing 6 changed files with 154 additions and 3 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,19 @@ | ||
# ProcessingLayer_EntityRecognitionAndLinking | ||
# PreProcessingLayer_EntityRecognitionAndLinking | ||
Pipeline B's Python implementation of Entity Recognition and Entity Linking. | ||
|
||
## [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md) | ||
|
||
## [This pipeline explained](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline.md) | ||
|
||
### Full documentation available at: [http://wiki.knox.aau.dk](http://wiki.knox.aau.dk/) | ||
### The 2023 report is available at: [https://www.overleaf.com/project/64feed8bda5b70b36afb6597](https://www.overleaf.com/project/64feed8bda5b70b36afb6597) | ||
|
||
### 2023 Authors | ||
```txt | ||
Alija Cerimagic | ||
Frederik Ødgaard Hammer | ||
Mathias Frihauge | ||
Nichlas Blak Rønberg | ||
Peter Bækgaard | ||
Åsmundur Alexander Kjærbæk Thorsen | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# Our part of the pipeline | ||
### (also available at: [http://wiki.knox.aau.dk/en/entity-extraction](http://wiki.knox.aau.dk/en/entity-extraction)) | ||
|
||
Our part of the pipeline handles Entity Recognition and Entity Linking. The solution uses the [spaCy](https://spacy.io/) library for Entity Recognition and the [FuzzyWuzzy](https://pypi.org/project/fuzzywuzzy/) library for Entity Linking. | ||
|
||
> The sections below describe the pipeline *in order*, starting with a visual overview. | ||
## Overview | ||
[Image: visual overview of the pipeline] | ||
|
||
## How to get started | ||
See the [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md) guide. | ||
|
||
## The input that the solution takes | ||
See the [input](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-input.md) explanation. | ||
|
||
## Entity Recognition | ||
Check out the [Entity Recognition documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md). | ||
|
||
## Entity Linking | ||
Check out the [Entity Linker documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md). | ||
|
||
## The output it produces | ||
See the [output](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-output.md) explanation. | ||
|
||
|
||
## Other components | ||
- The [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/DirectoryWatcher.md) | ||
- The [Database](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/database.md) | ||
- The [APIs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/api.md) | ||
- The [Language Detector](https://pypi.org/project/langdetect/) | ||
|
||
## Future work |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# Pipeline input | ||
The pipeline starts when the [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py) detects a new file (article) in a watched directory. This new file is produced by **pipeline A**. | ||
|
||
## Example input data | ||
```txt | ||
Since the sudden exit of the controversial CEO Martin Kjær last week, | ||
both he and the executive board in Region North Jutland | ||
have been in hiding. | ||
``` | ||
> some/article.txt | ||
## Preprocessing the input | ||
Before the Entity Recognizer can use the input, it must be preprocessed. This entails removing newlines and adding punctuation where needed. | ||
|
||
### Example preprocessed input data | ||
```txt | ||
Since the sudden exit of the controversial CEO Martin Kjær last week, both he and the executive board in Region North Jutland. have been in hiding. | ||
``` | ||
|
||
----------- | ||
<div style="text-align: right"> | ||
Up next: | ||
<br> | ||
<a href="https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md">Entity Recognition</a> | ||
<span class="pagination_icon__3ocd0"><svg class="with-icon_icon__MHUeb" data-testid="geist-icon" fill="none" height="24" shape-rendering="geometricPrecision" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" viewBox="0 0 24 24" width="24" style="color:currentColor;width:11px;height:11px"><path d="M9 18l6-6-6-6"></path></svg></span> | ||
</div> |
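The preprocessing step described under "Preprocessing the input" can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation: the function name `preprocess` is hypothetical, and the rule used here (append a period to any line that does not already end in a punctuation character, then join the lines with single spaces) is an assumption inferred from the example data, where a period appears after "Jutland".

```python
import string


def preprocess(text: str) -> str:
    """Join the lines of an article into one string.

    Assumed rule: a line that does not end in punctuation gets a
    trailing period before the lines are joined with single spaces.
    """
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line[-1] not in string.punctuation:
            line += "."  # add punctuation where it is missing
        parts.append(line)
    return " ".join(parts)


article = (
    "Since the sudden exit of the controversial CEO Martin Kjær last week,\n"
    "both he and the executive board in Region North Jutland\n"
    "have been in hiding.\n"
)
print(preprocess(article))
```

A line-based rule like this is simple, but it inserts a period wherever a sentence happens to wrap across lines without punctuation, which is what the example input shows.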
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# Pipeline output | ||
The pipeline output is a [JSON](https://en.wikipedia.org/wiki/JSON) structure containing the entity mentions and links for a given article. | ||
|
||
## The [JSON](https://en.wikipedia.org/wiki/JSON) output | ||
```JSON | ||
{ | ||
"fileName": STRING, | ||
"language": STRING, | ||
"metadataId": UUID (STRING), | ||
"sentences": [ | ||
{ | ||
"sentence": STRING, | ||
"sentenceStartIndex": INT, | ||
"sentenceEndIndex": INT, | ||
"entityMentions": [ | ||
{ | ||
"name": STRING, | ||
"type": STRING, | ||
"label": STRING, | ||
"startIndex": INT, | ||
"endIndex": INT, | ||
"iri": STRING? | ||
} | ||
] | ||
} | ||
] | ||
} | ||
``` | ||
Each file (article) has a language (detected by the [Language Detector](https://pypi.org/project/langdetect/)), a `metadataId` (forwarded by **pipeline A**), and a list of sentences, each of which contains a list of entity mentions. | ||
> _**NOTE**_: The `iri` property can be null | ||
## Example [JSON](https://en.wikipedia.org/wiki/JSON) output | ||
```JSON | ||
{ | ||
"language": "en", | ||
"metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d", | ||
"sentences": [ | ||
{ | ||
"sentence": "Barrack Obama was married to Michelle Obama two days ago.", | ||
"sentenceStartIndex": 20, | ||
"sentenceEndIndex": 62, | ||
"entityMentions": | ||
[ | ||
{ "name": "Barrack Obama", "type": "Entity", "label": "PERSON", "startIndex": 0, "endIndex": 12, "iri": "knox-kb01.srv.aau.dk/Barack_Obama" }, | ||
{ "name": "Michelle Obama", "type": "Entity", "label": "PERSON", "startIndex": 59, "endIndex": 73, "iri": "knox-kb01.srv.aau.dk/Michele_Obama" }, | ||
{ "name": "two days ago", "type": "Literal", "label": "DATE", "startIndex": 74, "endIndex": 86, "iri": null } | ||
] | ||
} | ||
] | ||
} | ||
``` | ||
|
||
## Sending the [JSON](https://en.wikipedia.org/wiki/JSON) output to pipeline C | ||
Lastly, the [JSON](https://en.wikipedia.org/wiki/JSON) output is sent to **pipeline C** using a `POST` request. See [the code](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/e442dc496002b788d30f996cdfc87d36f5bcaa35/main.py#L32) for implementation details. | ||
|
||
----------- | ||
<div style="text-align: left"> | ||
Go back to: | ||
<br> | ||
<span class="pagination_icon__3ocd0"><svg class="with-icon_icon__MHUeb" data-testid="geist-icon" fill="none" height="24" shape-rendering="geometricPrecision" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" viewBox="0 0 24 24" width="24" style="color: currentcolor; width: 11px; height: 11px;"><path d="M15 18l-6-6 6-6"></path></svg></span> | ||
<a href="https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md">Entity Linker</a> | ||
|
||
</div> |
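To make the output structure and the final `POST` step more concrete, here is a small sketch using only the Python standard library. It is not the code from `main.py`: the dataclass names, the `build_post_request` helper, and the URL are all hypothetical, and the character offsets assume half-open, sentence-relative indices, which the documentation above does not specify. Only the field names are taken from the JSON structure shown earlier.

```python
import json
import urllib.request
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EntityMention:
    name: str
    type: str
    label: str
    startIndex: int
    endIndex: int
    iri: Optional[str] = None  # serialized as null when no link was found


@dataclass
class Sentence:
    sentence: str
    sentenceStartIndex: int
    sentenceEndIndex: int
    entityMentions: List[EntityMention] = field(default_factory=list)


@dataclass
class PipelineOutput:
    fileName: str
    language: str
    metadataId: str
    sentences: List[Sentence] = field(default_factory=list)


def build_post_request(output: PipelineOutput, url: str) -> urllib.request.Request:
    """Serialize the output to JSON and wrap it in a POST request (not sent here)."""
    body = json.dumps(asdict(output), ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


text = "Barrack Obama was married to Michelle Obama two days ago."
start = text.find("two days ago")  # offset convention is an assumption
mention = EntityMention("two days ago", "Literal", "DATE", start, start + len("two days ago"))

output = PipelineOutput(
    fileName="some/article.txt",
    language="en",
    metadataId="790261e8-b8ec-4801-9cbd-00263bcc666d",
    sentences=[Sentence(text, 0, len(text), [mention])],
)

# Hypothetical endpoint; the real pipeline C address is configured elsewhere.
request = build_post_request(output, "http://localhost:8000/pipeline-c")
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would actually send the request.
```

Keeping request construction separate from sending makes the payload easy to inspect or test without a running pipeline C.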