diff --git a/README.md b/README.md index 0fc45d7..328075b 100644 --- a/README.md +++ b/README.md @@ -1 +1,19 @@ -# ProcessingLayer_EntityRecognitionAndLinking \ No newline at end of file +# PreProcessingLayer_EntityRecognitionAndLinking +Pipeline B's Python implementation of Entity Recognition and Entity Linking + +## [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md) + +## [This pipeline explained](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline.md) + +### Full documentation available at: [http://wiki.knox.aau.dk](http://wiki.knox.aau.dk/) +### The 2023 report is available at: [https://www.overleaf.com/project/64feed8bda5b70b36afb6597](https://www.overleaf.com/project/64feed8bda5b70b36afb6597) + +### 2023 Authors +```txt +Alija Cerimagic +Frederik Ødgaard Hammer +Mathias Frihauge +Nichlas Blak Rønberg +Peter Bækgaard +Åsmundur Alexander Kjærbæk Thorsen +``` diff --git a/components/GetSpacyData.py b/components/GetSpacyData.py index baa2939..78a31ec 100644 --- a/components/GetSpacyData.py +++ b/components/GetSpacyData.py @@ -88,7 +88,7 @@ def BuildJSONFromEntities(entities: List[EntityLinked], doc, fileName: str) -> J # Create the final JSON structure final_json = { - "fileName": fileName, + "fileName": fileName.split("/")[-1], "language": DetectLang(doc), "metadataId":"7467628c-ad77-4bd7-9810-5f3930796fb5", "sentences": sentences_json, diff --git a/docs/api.md b/docs/api.md index 59b9607..cae1e72 100644 --- a/docs/api.md +++ b/docs/api.md @@ -12,6 +12,7 @@ The `/entitymentions` endpoint is a **GET** +## Overview +The database contains the following tables: + +![](img/database-visualized.png) + +### sentence +Contains each sentence from all input text. Has a unique `sid`. + +### entitymention +Represents each entity mention from all input text. 
Has `sid` as a foreign key (a sentence must exist for the entitymention to exist). + +### EntityIndex +Used by the [Entity Linker](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/components/EntityLinker.py) to find potential matches for a given entity mention. See [Entity Linker Docs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md) for more information. + +## Methods +```python +async def InitializeIndexDB(dbPath): +``` +### Parameters: +- **dbPath** (str): The path where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`. + +```python +async def Insert(dbPath, tableName, queryInformation): +``` +### Parameters: +- **dbPath** (str): The path where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`. +- **tableName** (str): The name of the table you want to insert into; the available tables are listed in [Overview](##Overview). +- **queryInformation** (JSON): A JSON object containing the key-value pairs you want to insert, for example: +```JSON +{ + "fileName": "article.txt", + "string": "A duck walked across the road", + "startindex": 20, + "endIndex": 29 +} +``` +This would be a valid insert into the `sentence` table. + +> **_NOTE:_** The `sid` is autogenerated using `AUTOINCREMENT`. + +```python +async def Read(dbPath, tableName, searchPred=""): +``` +### Parameters: +- **dbPath** (str): The path where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`. +- **tableName** (str): The name of the table you want to read from; the available tables are listed in [Overview](##Overview). +- **searchPred** (str, optional): The search predicate to query the table with. For example, if `searchPred` = `Jones` and `tableName` = `entitymention`, the entitymention table will be searched for `Jones`. 
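The `sentence` and `entitymention` tables and the `Insert`/`Read` behaviour described above can be sketched with Python's built-in `sqlite3` module. This is a minimal synchronous sketch under stated assumptions, not the repository's implementation: the helper names `create_tables`, `insert` and `read`, the exact column lists, and the substring (`LIKE`) matching used for the search predicate are all hypothetical.

```python
import sqlite3

# Assumed column layout, inferred from the documentation; the real schema may differ.
COLUMNS = {
    "sentence": ("fileName", "string", "startIndex", "endIndex"),
    "entitymention": ("sid", "mention"),
}
# Assumed text column that a search predicate is matched against.
SEARCH_COLUMN = {"sentence": "string", "entitymention": "mention"}


def create_tables(conn: sqlite3.Connection) -> None:
    """Create both tables; `sid` links each entity mention to its sentence."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentence ("
        "sid INTEGER PRIMARY KEY AUTOINCREMENT, "
        "fileName TEXT, string TEXT, startIndex INTEGER, endIndex INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entitymention ("
        "eid INTEGER PRIMARY KEY AUTOINCREMENT, "
        "sid INTEGER NOT NULL REFERENCES sentence(sid), "
        "mention TEXT)"
    )


def insert(conn: sqlite3.Connection, table_name: str, row: dict) -> int:
    """Insert one row given as key-value pairs and return the generated id."""
    unknown = set(row) - set(COLUMNS[table_name])  # KeyError for an unknown table
    if unknown:
        raise ValueError(f"unknown columns for {table_name}: {sorted(unknown)}")
    names = ", ".join(row)
    marks = ", ".join("?" for _ in row)
    cur = conn.execute(
        f"INSERT INTO {table_name} ({names}) VALUES ({marks})", tuple(row.values())
    )
    return cur.lastrowid


def read(conn: sqlite3.Connection, table_name: str, search_pred: str = "") -> list:
    """Return all rows whose text column contains `search_pred` (assumed semantics)."""
    column = SEARCH_COLUMN[table_name]
    return conn.execute(
        f"SELECT * FROM {table_name} WHERE {column} LIKE ?", (f"%{search_pred}%",)
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_tables(conn)
sid = insert(conn, "sentence", {
    "fileName": "article.txt",
    "string": "A duck walked across the road",
    "startIndex": 20,
    "endIndex": 29,
})
insert(conn, "entitymention", {"sid": sid, "mention": "duck"})
rows = read(conn, "entitymention", "duck")
```

Table and column names are checked against a whitelist because they cannot be passed as SQL parameters; only the values go through `?` placeholders.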
+ +```python +async def Update(dbPath, tableName, indexID, updatedName): +``` +### Parameters: +- **dbPath** (str): The path where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`. +- **tableName** (str): The name of the table you want to update; the available tables are listed in [Overview](##Overview). +- **indexID** (str): The `sid`, `eid` or `id` (EntityIndex) to update. +- **updatedName** (str): What the `string`, `mention` or `name` should be updated to. + + +```python +async def Delete(dbPath, tableName, indexID): +``` +### Parameters: +- **dbPath** (str): The path where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`. +- **tableName** (str): The name of the table you want to delete from; the available tables are listed in [Overview](##Overview). +- **indexID** (str): The `sid`, `eid` or `id` (EntityIndex) to delete. \ No newline at end of file diff --git a/docs/directorywatcher.md b/docs/directorywatcher.md new file mode 100644 index 0000000..ef1cc1f --- /dev/null +++ b/docs/directorywatcher.md @@ -0,0 +1,60 @@ +# [Directory Watcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py) +The pipeline starts when a new file is placed in a watched folder by pipeline part A. The Directory Watcher's responsibility is to call a callback function when a new file is created in the watched folder. + +## Features +- [watchdog](https://pypi.org/project/watchdog/) for file events +- Async callback support +- [Threading](https://docs.python.org/3/library/threading.html) + +## Overview + +The `DirectoryWatcher` provides a simple way to monitor a specified directory for file creation events and execute asynchronous callbacks in response. It utilizes the [watchdog](https://pypi.org/project/watchdog/) library for filesystem monitoring and integrates with [asyncio](https://docs.python.org/3/library/asyncio.html) for handling asynchronous tasks. 
Furthermore, the `DirectoryWatcher` uses [threading](https://docs.python.org/3/library/threading.html). + +> **_NOTE:_** [Threading](https://docs.python.org/3/library/threading.html) is used so that watching the directory does not block the main thread. + + +## Example usage +```python +# Importing +from lib.DirectoryWatcher import DirectoryWatcher + +dirPath = "some/path/to/a/directory" + +# Setup +async def newFileCreated(file_path: str): + print("New file created: " + file_path) + + +dirWatcher = DirectoryWatcher( + directory=dirPath, async_callback=newFileCreated +) + +# A FastAPI event function running on startup (`app` is the FastAPI instance) +@app.on_event("startup") +async def startEvent(): + dirWatcher.start_watching() + +# A FastAPI event function running on shutdown +@app.on_event("shutdown") +def shutdown_event(): + dirWatcher.stop_watching() +``` + +> **_NOTE:_** The FastAPI event functions are not needed to use the `DirectoryWatcher`. + + +## Methods +```python +def __init__(self, directory, async_callback): +``` +### Parameters: +- **directory** (str): A path to the directory you want to watch, e.g. `some/path/to/a/directory`. +- **async_callback** (function): An async callback function to be called when a new file is created in the **directory**. This function should accept a single parameter, which is the path of the created file. + +```python +def start_watching(self) -> threading.Thread: +``` + +```python +def stop_watching(self): +``` diff --git a/docs/img/KNOX_component_diagram-B.drawio.svg b/docs/img/KNOX_component_diagram-B.drawio.svg new file mode 100644 index 0000000..1cd135a --- /dev/null +++ b/docs/img/KNOX_component_diagram-B.drawio.svg @@ -0,0 +1,4 @@ + + + +
[Figure: component diagram "Entity Extraction and Linking". Labels: API with components /entitymentions/all, /detectlanguage and /entitymentions; component DirectoryWatcher (directory: str, async_callback: func; start_watching(), stop_watching(), on_created(event), run_once()); component Database (DB: EntityIndex, Sentence, entitymention); component Functionality with functions async modifyTxt() and async processInput(); entity_mentions.json]
\ No newline at end of file diff --git a/docs/img/database-visualized.png b/docs/img/database-visualized.png new file mode 100644 index 0000000..7f25a80 Binary files /dev/null and b/docs/img/database-visualized.png differ diff --git a/docs/our-part-of-the-pipeline.md b/docs/our-part-of-the-pipeline.md new file mode 100644 index 0000000..fcf5e94 --- /dev/null +++ b/docs/our-part-of-the-pipeline.md @@ -0,0 +1,33 @@ +# Our part of the pipeline +### (also available at: [http://wiki.knox.aau.dk/en/entity-extraction](http://wiki.knox.aau.dk/en/entity-extraction)) + +Our part of the pipeline is concerned with Entity Recognition and Entity Linking. This solution utilizes the [SpaCy](https://spacy.io/) library to perform Entity Recognition, and the [FuzzyWuzzy](https://pypi.org/project/fuzzywuzzy/) library for the entity linking. + +> Every following section describes this pipeline *in order*, but first a visual overview. + +## Overview +![](img/KNOX_component_diagram-B.drawio.svg) + +## How to get started +See the [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md) guide. 
+ +## The input that the solution takes +See the [input](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-input.md) explanation. + +## Entity Recognition +Check out the [Entity Recognition documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md). + +## Entity Linking +Check out the [Entity Linker documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md). + +## The output it produces +See the [output](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-output.md) explanation. + + +## Other components +- The [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/directorywatcher.md) +- The [Database](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/database.md) +- The [APIs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/api.md) +- The [Language Detector](https://pypi.org/project/langdetect/) + +## Future work diff --git a/docs/our-part-of-the-pipeline/pipeline-input.md b/docs/our-part-of-the-pipeline/pipeline-input.md new file mode 100644 index 0000000..a0bed08 --- /dev/null +++ b/docs/our-part-of-the-pipeline/pipeline-input.md @@ -0,0 +1,28 @@ +# Pipeline input +The pipeline starts when a new file (article) is detected in a watched directory by the [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py). This new file is produced by **pipeline A**. + +## Example input data +```txt +Since the sudden exit of the controversial CEO Martin Kjær last week, +both he and the executive board in Region North Jutland + +have been in hiding. 
+``` +> some/article.txt + +## Preprocessing the input +Before the Entity Recognizer can use the input, it must be preprocessed. This entails removing newlines and adding punctuation where needed. + +### Example preprocessed input data +```txt +Since the sudden exit of the controversial CEO Martin Kjær last week, +both he and the executive board in Region North Jutland. have been in hiding. +``` + +----------- +
+ Up next: +
+ Entity Recognition + +
\ No newline at end of file diff --git a/docs/our-part-of-the-pipeline/pipeline-output.md b/docs/our-part-of-the-pipeline/pipeline-output.md new file mode 100644 index 0000000..2de1c9c --- /dev/null +++ b/docs/our-part-of-the-pipeline/pipeline-output.md @@ -0,0 +1,63 @@ +# Pipeline output +The pipeline output is a [JSON](https://en.wikipedia.org/wiki/JSON) structure containing the entity mentions and links for a given article. + +## The [JSON](https://en.wikipedia.org/wiki/JSON) output +```JSON + { + "fileName": STRING, + "language": STRING, + "metadataId": UUID (STRING), + "sentences": [ + { + "sentence": STRING, + "sentenceStartIndex": INT, + "sentenceEndIndex": INT, + "entityMentions": [ + { + "name": STRING, + "type": STRING, + "label": STRING, + "startIndex": INT, + "endIndex": INT, + "iri": STRING? + } + ] + } + ] + } +``` +Here we see that a file (article) has a file name, a language (detected by the [Language Detector](https://pypi.org/project/langdetect/)), a metadataId (forwarded by **pipeline A**), and a list of sentences, each containing a list of entity mentions. 
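The structure above can be mirrored with typed Python dataclasses. This is only an illustrative sketch: the class names `EntityMention`, `Sentence` and `PipelineOutput` and the helper `to_json` are hypothetical and are not taken from the repository, which builds the JSON in `BuildJSONFromEntities`. The field names follow the schema, and `fileName` is reduced to its last path segment as in the `GetSpacyData.py` change in this diff.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EntityMention:
    name: str
    type: str
    label: str
    startIndex: int
    endIndex: int
    iri: Optional[str] = None  # serialized as JSON null when no link exists


@dataclass
class Sentence:
    sentence: str
    sentenceStartIndex: int
    sentenceEndIndex: int
    entityMentions: List[EntityMention] = field(default_factory=list)


@dataclass
class PipelineOutput:
    fileName: str
    language: str
    metadataId: str
    sentences: List[Sentence] = field(default_factory=list)


def to_json(output: PipelineOutput) -> str:
    """Serialize the nested dataclasses into the JSON structure shown above."""
    return json.dumps(asdict(output), ensure_ascii=False)


# Example values only; the indices and the id are placeholders.
example = PipelineOutput(
    fileName="some/article.txt".split("/")[-1],  # keep only the file name
    language="en",
    metadataId="790261e8-b8ec-4801-9cbd-00263bcc666d",
    sentences=[
        Sentence(
            sentence="It happened two days ago.",
            sentenceStartIndex=0,
            sentenceEndIndex=25,
            entityMentions=[
                EntityMention("two days ago", "Literal", "DATE", 12, 24),
            ],
        )
    ],
)
payload = to_json(example)
```

Declaring `iri` as `Optional[str]` makes the nullable case explicit to consumers of the output.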
+> _**NOTE**_: The `iri` property can be null + +## Example [JSON](https://en.wikipedia.org/wiki/JSON) output +```JSON +{ + "language": "en", + "metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d", + "sentences": [ + { + "sentence": "Barrack Obama was married to Michelle Obama two days ago.", + "sentenceStartIndex": 20, + "sentenceEndIndex": 62, + "entityMentions": + [ + { "name": "Barrack Obama", "type": "Entity", "label": "PERSON", "startIndex": 0, "endIndex": 12, "iri": "knox-kb01.srv.aau.dk/Barack_Obama" }, + { "name": "Michelle Obama", "type": "Entity", "label": "PERSON", "startIndex": 59, "endIndex": 73, "iri": "knox-kb01.srv.aau.dk/Michele_Obama" }, + { "name": "two days ago", "type": "Literal", "label": "DATE", "startIndex": 74, "endIndex": 86, "iri": null } + ] + } + ] + } +``` + +## Sending the [JSON](https://en.wikipedia.org/wiki/JSON) output to pipeline C +Lastly the [JSON](https://en.wikipedia.org/wiki/JSON) output is sent to **pipeline C** using a `POST` request. See [the code](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/e442dc496002b788d30f996cdfc87d36f5bcaa35/main.py#L32) for implementation details. + +----------- +
+ Go back to: +
+ + Entity Linker + +
\ No newline at end of file