diff --git a/README.md b/README.md
index 0fc45d7..328075b 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,19 @@
-# ProcessingLayer_EntityRecognitionAndLinking
\ No newline at end of file
+# PreProcessingLayer_EntityRecognitionAndLinking
+Pipeline B's Python implementation of Entity Recognition and Entity Linking
+
+## [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md)
+
+## [This pipeline explained](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline.md)
+
+### Full documentation available at: [http://wiki.knox.aau.dk](http://wiki.knox.aau.dk/)
+### The 2023 report is available at: [https://www.overleaf.com/project/64feed8bda5b70b36afb6597](https://www.overleaf.com/project/64feed8bda5b70b36afb6597)
+
+### 2023 Authors
+```txt
+Alija Cerimagic
+Frederik Ødgaard Hammer
+Mathias Frihauge
+Nichlas Blak Rønberg
+Peter Bækgaard
+Åsmundur Alexander Kjærbæk Thorsen
+```
diff --git a/docs/api.md b/docs/api.md
index 59b9607..cae1e72 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -12,6 +12,7 @@ The `/entitymentions` endpoint is a **GET****GET****GET****GET**
+
+
+
[Figure residue: text labels from the draw.io component diagram "Entity Extraction and Linking". Components: /entitymentions/all, /detectlanguage, /entitymentions, DirectoryWatcher, Database, Functionality. DirectoryWatcher class: directory: str, async_callback: func; start_watching(), stop_watching(), on_created(event), run_once(). Functions: async modifyTxt(), async processInput(). Other labels: API, DB, EntityIndex, Sentence, entitymention, entity_mentions.json.]
\ No newline at end of file
diff --git a/docs/our-part-of-the-pipeline.md b/docs/our-part-of-the-pipeline.md
new file mode 100644
index 0000000..fcf5e94
--- /dev/null
+++ b/docs/our-part-of-the-pipeline.md
@@ -0,0 +1,33 @@
+# Our part of the pipeline
+### (also available at: [http://wiki.knox.aau.dk/en/entity-extraction](http://wiki.knox.aau.dk/en/entity-extraction))
+
+Our part of the pipeline is concerned with Entity Recognition and Entity Linking. This solution uses the [spaCy](https://spacy.io/) library for Entity Recognition and the [FuzzyWuzzy](https://pypi.org/project/fuzzywuzzy/) library for Entity Linking.
+
+> The following sections describe this pipeline *in order*, starting with a visual overview.
+
+## Overview
+![](img/KNOX_component_diagram-B.drawio.svg)
+
+## How to get started
+See the [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md) guide.
+
+## The input that the solution takes
+See the [input](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-input.md) explanation.
+
+## Entity Recognition
+Check out the [Entity Recognition documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md).
+
+## Entity Linking
+Check out the [Entity Linker documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md).
+
+## The output it produces
+See the [output](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-output.md) explanation.
+
+
+## Other components
+- The [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/DirectoryWatcher.md)
+- The [Database](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/database.md)
+- The [APIs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/api.md)
+- The [Language Detector](https://pypi.org/project/langdetect/)
+
+## Future work
diff --git a/docs/our-part-of-the-pipeline/pipeline-input.md b/docs/our-part-of-the-pipeline/pipeline-input.md
new file mode 100644
index 0000000..a0bed08
--- /dev/null
+++ b/docs/our-part-of-the-pipeline/pipeline-input.md
@@ -0,0 +1,28 @@
+# Pipeline input
+The pipeline starts when a new file (article) is detected in a watched directory by the [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py). This new file is produced by **pipeline A**.
+
+## Example input data
+```txt
+Since the sudden exit of the controversial CEO Martin Kjær last week,
+both he and the executive board in Region North Jutland
+
+have been in hiding.
+```
+> some/article.txt
+
+## Preprocessing the input
+Before the Entity Recognizer can use the input, it must be preprocessed. This entails removing newlines and adding punctuation where needed.
+
+### Example preprocessed input data
+```txt
+Since the sudden exit of the controversial CEO Martin Kjær last week,
+both he and the executive board in Region North Jutland. have been in hiding.
+```
+
+-----------
+
+ Up next: +
+ Entity Recognition
+
+
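The preprocessing step described in pipeline-input.md (removing newlines and adding punctuation where needed) might look roughly like the Python sketch below. It is not taken from the repository: the function name `preprocess`, the choice to treat blank lines as paragraph boundaries, and the rule of appending a period when a paragraph lacks closing punctuation are all assumptions made for illustration.

```python
import re

# Characters that count as "already punctuated" (an assumption for this sketch).
TERMINATORS = (".", "!", "?")


def preprocess(text: str) -> str:
    """Flatten an article into one line, closing each paragraph with punctuation."""
    # Blank lines are treated as paragraph boundaries (assumption).
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for paragraph in paragraphs:
        # Collapse newlines and repeated whitespace inside the paragraph.
        flat = " ".join(paragraph.split())
        if not flat:
            continue
        # Add a period only where the paragraph does not already end a sentence.
        if not flat.endswith(TERMINATORS):
            flat += "."
        cleaned.append(flat)
    return " ".join(cleaned)


article = (
    "Since the sudden exit of the controversial CEO Martin Kjær last week,\n"
    "both he and the executive board in Region North Jutland\n"
    "\n"
    "have been in hiding.\n"
)
print(preprocess(article))
```

Applied to the example article, this sketch produces a single line in which the blank line becomes `Region North Jutland. have been in hiding.`, mirroring the documented preprocessed text. Unlike that example, it also joins the line break after `last week,`.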
\ No newline at end of file
diff --git a/docs/our-part-of-the-pipeline/pipeline-output.md b/docs/our-part-of-the-pipeline/pipeline-output.md
new file mode 100644
index 0000000..2de1c9c
--- /dev/null
+++ b/docs/our-part-of-the-pipeline/pipeline-output.md
@@ -0,0 +1,63 @@
+# Pipeline output
+The pipeline output is a [JSON](https://en.wikipedia.org/wiki/JSON) structure containing the entity mentions and links for a given article.
+
+## The [JSON](https://en.wikipedia.org/wiki/JSON) output
+```JSON
+{
+    "fileName": STRING,
+    "language": STRING,
+    "metadataId": UUID (STRING),
+    "sentences": [
+        {
+            "sentence": STRING,
+            "sentenceStartIndex": INT,
+            "sentenceEndIndex": INT,
+            "entityMentions": [
+                {
+                    "name": STRING,
+                    "type": STRING,
+                    "label": STRING,
+                    "startIndex": INT,
+                    "endIndex": INT,
+                    "iri": STRING?
+                }
+            ]
+        }
+    ]
+}
+```
+Here we see that a file (article) has a language (detected by the [Language Detector](https://pypi.org/project/langdetect/)), a metadataId (forwarded by **pipeline A**), and a list of sentences, each of which contains a list of entity mentions.
+> _**NOTE**_: The `iri` property can be `null`.
+
+## Example [JSON](https://en.wikipedia.org/wiki/JSON) output
+```JSON
+{
+    "language": "en",
+    "metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d",
+    "sentences": [
+        {
+            "sentence": "Barrack Obama was married to Michelle Obama two days ago.",
+            "sentenceStartIndex": 20,
+            "sentenceEndIndex": 62,
+            "entityMentions":
+            [
+                { "name": "Barrack Obama", "type": "Entity", "label": "PERSON", "startIndex": 0, "endIndex": 12, "iri": "knox-kb01.srv.aau.dk/Barack_Obama" },
+                { "name": "Michelle Obama", "type": "Entity", "label": "PERSON", "startIndex": 59, "endIndex": 73, "iri": "knox-kb01.srv.aau.dk/Michele_Obama" },
+                { "name": "two days ago", "type": "Literal", "label": "DATE", "startIndex": 74, "endIndex": 86, "iri": null }
+            ]
+        }
+    ]
+}
+```
+
+## Sending the [JSON](https://en.wikipedia.org/wiki/JSON) output to pipeline C
+Lastly, the [JSON](https://en.wikipedia.org/wiki/JSON) output is sent to **pipeline C** using a `POST` request. See [the code](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/e442dc496002b788d30f996cdfc87d36f5bcaa35/main.py#L32) for implementation details.
+
+-----------
+
+ Go back to: +
+
+ Entity Linker
+
+
\ No newline at end of file
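
The last step described in pipeline-output.md, sending the JSON output to **pipeline C** with a `POST` request, could be sketched as follows. The linked `main.py` remains the authoritative implementation; this sketch uses only the Python standard library, and the endpoint URL, the helper names `build_post_request` and `send`, and all payload values are hypothetical.

```python
import json
import urllib.request

# Hypothetical endpoint: the real pipeline C URL is not given in the docs.
PIPELINE_C_URL = "http://localhost:8000/entity-mentions"


def build_post_request(payload: dict, url: str = PIPELINE_C_URL) -> urllib.request.Request:
    """Wrap the pipeline output in a JSON POST request without sending it."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code (raises on HTTP errors)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Payload shaped like the documented output structure; all values are made up.
example_payload = {
    "fileName": "some/article.txt",
    "language": "en",
    "metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d",
    "sentences": [
        {
            "sentence": "Martin Kjær left last week.",
            "sentenceStartIndex": 0,
            "sentenceEndIndex": 27,
            "entityMentions": [
                {
                    "name": "Martin Kjær",
                    "type": "Entity",
                    "label": "PERSON",
                    "startIndex": 0,
                    "endIndex": 11,
                    "iri": None,  # serialized as null, which the docs allow
                },
            ],
        }
    ],
}

request = build_post_request(example_payload)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the payload easy to inspect in tests; calling `send(request)` is only meaningful once a real pipeline C endpoint is configured.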