Merge branch 'main' into 86-send-output-to-group-cs-endpoint
Showing 10 changed files with 288 additions and 4 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,19 @@ | ||
# ProcessingLayer_EntityRecognitionAndLinking | ||
# PreProcessingLayer_EntityRecognitionAndLinking | ||
Pipeline B's Python implementation of Entity Recognition and Entity Linking | ||
|
||
## [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md) | ||
|
||
## [This pipeline explained](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline.md) | ||
|
||
### Full documentation available at: [http://wiki.knox.aau.dk](http://wiki.knox.aau.dk/) | ||
### The 2023 report is available at: [https://www.overleaf.com/project/64feed8bda5b70b36afb6597](https://www.overleaf.com/project/64feed8bda5b70b36afb6597) | ||
|
||
### 2023 Authors | ||
```txt | ||
Alija Cerimagic | ||
Frederik Ødgaard Hammer | ||
Mathias Frihauge | ||
Nichlas Blak Rønberg | ||
Peter Bækgaard | ||
Åsmundur Alexander Kjærbæk Thorsen | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
# [Database](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/components/Db.py) | ||
The database is responsible for keeping track of sentences, entity mentions, and entity indices. | ||
|
||
## Features | ||
- Supports CRUD (Create, Read, Update, Delete) operations. | ||
- Uses [SQLite](https://www.sqlite.org/index.html). | ||
- Seeds the database with required tables if they do not exist. | ||
|
||
## <a name="overview"></a>Overview | ||
The database contains the following tables: | ||
|
||
*(Image: diagram of the database tables)* | ||
|
||
### sentence | ||
Contains each sentence from all input text. Has a unique `sid`. | ||
|
||
### entitymention | ||
Represents each entity mention from all input text. Has `sid` as foreign key (a sentence must exist for the entitymention to exist). | ||
|
||
### EntityIndex | ||
Used by the [Entity Linker](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/components/EntityLinker.py) to find potential matches for a given entity mention. See [Entity Linker Docs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md) for more information. | ||
|
||
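The three tables and the seeding behaviour described above could look roughly like the following sketch. It uses Python's built-in `sqlite3` module and is synchronous for brevity. The exact column types, the `seed_database` helper, and any column not named in this document are assumptions, not the schema in `Db.py`.

```python
import sqlite3

# Hypothetical schema. Only the table names, `sid`, `eid`, `id`, `string`,
# `mention`, `name` and the foreign key on `sid` come from this document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sentence (
    sid INTEGER PRIMARY KEY AUTOINCREMENT,
    fileName TEXT,
    string TEXT,
    startIndex INTEGER,
    endIndex INTEGER
);
CREATE TABLE IF NOT EXISTS entitymention (
    eid INTEGER PRIMARY KEY AUTOINCREMENT,
    sid INTEGER NOT NULL,
    mention TEXT,
    FOREIGN KEY (sid) REFERENCES sentence (sid)
);
CREATE TABLE IF NOT EXISTS EntityIndex (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT
);
"""


def seed_database(db_file: str) -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(db_file)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


conn = seed_database(":memory:")
tables = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )
)
```

Because every statement uses `CREATE TABLE IF NOT EXISTS`, calling `seed_database` again on an existing database leaves the stored rows untouched.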
## Methods | ||
```python | ||
async def InitializeIndexDB(dbPath): | ||
``` | ||
### Parameters: | ||
- **dbPath** (str): The path where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`. | ||
|
||
```python | ||
async def Insert(dbPath, tableName, queryInformation): | ||
``` | ||
### Parameters: | ||
- **dbPath** (str): The path where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`. | ||
- **tableName** (str): The name of the table you want to insert into; the available tables are listed in the [Overview](#overview). | ||
- **queryInformation** (JSON): A JSON object containing the key-value pairs you want to insert, for example: | ||
```JSON | ||
{ | ||
"fileName": "article.txt", | ||
"string": "A duck walked across the road", | ||
"startindex": 20, | ||
"endIndex": 29 | ||
} | ||
``` | ||
This would be a valid insert into the `sentence` table. | ||
|
||
> **_NOTE:_** The `sid` is autogenerated using `AUTOINCREMENT`. | ||
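As a rough illustration of how a key-value object can be turned into an insert, the sketch below builds a parameterized `INSERT` with the built-in `sqlite3` module. The `insert_row` helper and the table definition are hypothetical and synchronous, so they may differ from the actual `Insert` in `Db.py`.

```python
import sqlite3


def insert_row(conn: sqlite3.Connection, table_name: str, query_information: dict) -> int:
    """Insert one row and return the autogenerated row id."""
    # Identifiers cannot be bound as parameters, so they are quoted into the
    # statement; the values themselves are bound with "?" placeholders.
    columns = ", ".join(f'"{key}"' for key in query_information)
    placeholders = ", ".join("?" for _ in query_information)
    sql = f'INSERT INTO "{table_name}" ({columns}) VALUES ({placeholders})'
    cursor = conn.execute(sql, tuple(query_information.values()))
    conn.commit()
    return cursor.lastrowid


# Assumed table layout, used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentence ("
    "sid INTEGER PRIMARY KEY AUTOINCREMENT, "
    "fileName TEXT, string TEXT, startindex INTEGER, endIndex INTEGER)"
)

new_sid = insert_row(
    conn,
    "sentence",
    {
        "fileName": "article.txt",
        "string": "A duck walked across the road",
        "startindex": 20,
        "endIndex": 29,
    },
)
```

The caller never supplies `sid`; SQLite assigns it, and `cursor.lastrowid` reports the value that was generated.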
```python | ||
async def Read(dbPath, tableName, searchPred=""): | ||
``` | ||
### Parameters: | ||
- **dbPath** (str): The path where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`. | ||
- **tableName** (str): The name of the table you want to read from; the available tables are listed in the [Overview](#overview). | ||
- **searchPred** (str, optional): The search predicate to query the table with. For example, if `searchPred` = `Jones` and `tableName` = `entitymention`, the entitymention table is searched for `Jones`. Defaults to `""`. | ||
|
||
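One way a search predicate could be applied is sketched below with the built-in `sqlite3` module. Whether the real `Read` matches exactly or by substring is not stated here, so the `LIKE` match, the `read_mentions` helper, and the table layout are all assumptions.

```python
import sqlite3


def read_mentions(conn: sqlite3.Connection, search_pred: str = "") -> list:
    """Return all entity mentions, or only those containing search_pred."""
    if search_pred:
        # Assumption: substring match; the value is bound, not interpolated.
        return conn.execute(
            "SELECT eid, sid, mention FROM entitymention "
            "WHERE mention LIKE ? ORDER BY eid",
            (f"%{search_pred}%",),
        ).fetchall()
    return conn.execute(
        "SELECT eid, sid, mention FROM entitymention ORDER BY eid"
    ).fetchall()


# Assumed table layout and sample rows, used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entitymention ("
    "eid INTEGER PRIMARY KEY AUTOINCREMENT, sid INTEGER, mention TEXT)"
)
conn.executemany(
    "INSERT INTO entitymention (sid, mention) VALUES (?, ?)",
    [(1, "Tom Jones"), (1, "Region North Jutland")],
)

matches = read_mentions(conn, "Jones")
everything = read_mentions(conn)
```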
```python | ||
async def Update(dbPath, tableName, indexID, updatedName): | ||
``` | ||
### Parameters: | ||
- **dbPath** (str): The path where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`. | ||
- **tableName** (str): The name of the table you want to update; the available tables are listed in the [Overview](#overview). | ||
- **indexID** (str): The `sid`, `eid` or `id` (EntityIndex) to update. | ||
- **updatedName** (str): What the `string`, `mention` or `name` should be updated to. | ||
|
||
|
||
```python | ||
async def Delete(dbPath, tableName, indexID): | ||
``` | ||
### Parameters: | ||
- **dbPath** (str): The path where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`. | ||
- **tableName** (str): The name of the table you want to delete from; the available tables are listed in the [Overview](#overview). | ||
- **indexID** (str): The `sid`, `eid` or `id` (EntityIndex) to delete. |
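
The `Update` and `Delete` parameters imply a per-table choice of id column (`sid`, `eid` or `id`) and text column (`string`, `mention` or `name`). The sketch below shows one possible way to express that mapping with the built-in `sqlite3` module. The `COLUMNS` table and both helper functions are hypothetical and synchronous, not the code in `Db.py`.

```python
import sqlite3

# table name -> (id column, text column), as listed in the parameter docs
COLUMNS = {
    "sentence": ("sid", "string"),
    "entitymention": ("eid", "mention"),
    "EntityIndex": ("id", "name"),
}


def update_row(conn, table_name, index_id, updated_name) -> int:
    """Set the table's text column for one id; returns the affected row count."""
    id_col, text_col = COLUMNS[table_name]
    cursor = conn.execute(
        f'UPDATE "{table_name}" SET "{text_col}" = ? WHERE "{id_col}" = ?',
        (updated_name, index_id),
    )
    conn.commit()
    return cursor.rowcount


def delete_row(conn, table_name, index_id) -> int:
    """Delete one row by id; returns the affected row count."""
    id_col, _ = COLUMNS[table_name]
    cursor = conn.execute(
        f'DELETE FROM "{table_name}" WHERE "{id_col}" = ?', (index_id,)
    )
    conn.commit()
    return cursor.rowcount


# Assumed table layout and sample row, used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EntityIndex (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)
conn.execute("INSERT INTO EntityIndex (name) VALUES (?)", ("Martin Kjaer",))

updated = update_row(conn, "EntityIndex", 1, "Martin Kjær")
name_after_update = conn.execute("SELECT name FROM EntityIndex").fetchone()[0]
deleted = delete_row(conn, "EntityIndex", 1)
```

Looking the column names up in a fixed dictionary also means an unknown `tableName` raises a `KeyError` instead of reaching the SQL statement.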
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
# [Directory Watcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py) | ||
The pipeline starts when a new file is placed in a watched folder by pipeline part A. The Directory Watcher's responsibility is to call a callback function when a new file is created in the watched folder. | ||
|
||
## Features | ||
- [watchdog](https://pypi.org/project/watchdog/) for file events | ||
- Async callback support | ||
- [Threading](https://docs.python.org/3/library/threading.html) | ||
|
||
## Overview | ||
|
||
The `DirectoryWatcher` provides a simple way to monitor a specified directory for file creation events and execute asynchronous callbacks in response. It uses the [watchdog](https://pypi.org/project/watchdog/) library for filesystem monitoring and integrates with [asyncio](https://docs.python.org/3/library/asyncio.html) for handling asynchronous tasks. Furthermore, the `DirectoryWatcher` uses [threading](https://docs.python.org/3/library/threading.html). | ||
|
||
> **_NOTE:_** [Threading](https://docs.python.org/3/library/threading.html) is used to avoid blocking the main thread's code from executing. | ||
|
||
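To illustrate how a synchronous file event can trigger an async callback without blocking the main thread, here is a minimal standard-library sketch. It leaves out watchdog entirely: the hypothetical `on_created` method stands in for a watchdog event handler, and the `AsyncCallbackBridge` class is an assumption for illustration, not the `DirectoryWatcher` implementation.

```python
import asyncio
import threading


class AsyncCallbackBridge:
    """Runs an event loop in a background thread and schedules an async callback on it."""

    def __init__(self, async_callback):
        self.async_callback = async_callback
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)

    def start(self) -> threading.Thread:
        self.thread.start()
        return self.thread

    def on_created(self, file_path: str):
        # Called from a non-async thread (for example a file event handler);
        # hands the coroutine over to the loop running in the background thread.
        return asyncio.run_coroutine_threadsafe(
            self.async_callback(file_path), self.loop
        )

    def stop(self) -> None:
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()
        self.loop.close()


seen = []


async def new_file_created(file_path: str) -> None:
    seen.append(file_path)


bridge = AsyncCallbackBridge(new_file_created)
bridge.start()
# Simulate a "file created" event and wait until the callback has run.
bridge.on_created("some/path/to/a/directory/article.txt").result(timeout=5)
bridge.stop()
```

`asyncio.run_coroutine_threadsafe` is the piece that makes the hand-over safe: it may be called from any thread and returns a future the caller can wait on.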
## Example usage | ||
```python | ||
# Importing | ||
from fastapi import FastAPI | ||
from lib.DirectoryWatcher import DirectoryWatcher | ||
|
||
app = FastAPI() | ||
dirPath = "some/path/to/a/directory" | ||
|
||
# Setup: the async callback receives the path of the newly created file | ||
async def newFileCreated(file_path: str): | ||
    print("New file created: " + file_path) | ||
|
||
|
||
dirWatcher = DirectoryWatcher( | ||
    directory=dirPath, async_callback=newFileCreated | ||
) | ||
|
||
# A FastAPI event function running on startup | ||
@app.on_event("startup") | ||
async def startEvent(): | ||
    dirWatcher.start_watching() | ||
|
||
# A FastAPI event function running on shutdown | ||
@app.on_event("shutdown") | ||
def shutdown_event(): | ||
    dirWatcher.stop_watching() | ||
``` | ||
|
||
> **_NOTE:_** The FastAPI event functions are not needed to use the `DirectoryWatcher`. | ||
|
||
## Methods | ||
```python | ||
def __init__(self, directory, async_callback): | ||
``` | ||
### Parameters: | ||
- **directory** (str): A path to the directory you want to watch, e.g. `some/path/to/a/directory`. | ||
- **async_callback** (function): An async callback function to be called when a new file is created in the **directory**. This function should accept a single parameter, which is the path of the created file. | ||
|
||
```python | ||
def start_watching(self) -> threading.Thread: | ||
``` | ||
|
||
```python | ||
def stop_watching(self): | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# Our part of the pipeline | ||
### (also available at: [http://wiki.knox.aau.dk/en/entity-extraction](http://wiki.knox.aau.dk/en/entity-extraction)) | ||
|
||
Our part of the pipeline is concerned with Entity Recognition and Entity Linking. This solution uses the [spaCy](https://spacy.io/) library to perform Entity Recognition, and the [FuzzyWuzzy](https://pypi.org/project/fuzzywuzzy/) library for Entity Linking. | ||
|
||
> The following sections describe this pipeline *in order*, starting with a visual overview. | ||
## Overview | ||
*(Image: visual overview of the pipeline)* | ||
|
||
## How to get started | ||
See the [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md) guide. | ||
|
||
## The input that the solution takes | ||
See the [input](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-input.md) explanation. | ||
|
||
## Entity Recognition | ||
Check out the [Entity Recognition documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md). | ||
|
||
## Entity Linking | ||
Check out the [Entity Linker documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md). | ||
|
||
## The output it produces | ||
See the [output](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-output.md) explanation. | ||
|
||
|
||
## Other components | ||
- The [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/DirectoryWatcher.md) | ||
- The [Database](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/database.md) | ||
- The [APIs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/api.md) | ||
- The [Language Detector](https://pypi.org/project/langdetect/) | ||
|
||
## Future work |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# Pipeline input | ||
The pipeline starts when a new file (article) is detected in a watched directory by the [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py). This new file is produced by **pipeline A**. | ||
|
||
## Example input data | ||
```txt | ||
Since the sudden exit of the controversial CEO Martin Kjær last week, | ||
both he and the executive board in Region North Jutland | ||
have been in hiding. | ||
``` | ||
> some/article.txt | ||
## Preprocessing the input | ||
Before the Entity Recognizer can use the input, it must be preprocessed. This entails removing newlines and adding punctuation where needed. | ||
|
||
### Example preprocessed input data | ||
```txt | ||
Since the sudden exit of the controversial CEO Martin Kjær last week, | ||
both he and the executive board in Region North Jutland. have been in hiding. | ||
``` | ||
|
||
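The preprocessing step described above could be sketched as follows. This is a guess at the rule from the example (a line that does not end in punctuation gets a full stop, then the lines are joined with spaces); the `preprocess` function and the set of punctuation characters are assumptions, not the project's code.

```python
def preprocess(text: str) -> str:
    """Join the lines of an article, adding a full stop to unpunctuated lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: a line already ending in one of these characters is left alone.
    fixed = [line if line[-1] in ".,!?:;" else line + "." for line in lines]
    return " ".join(fixed)


article = (
    "Since the sudden exit of the controversial CEO Martin Kjær last week,\n"
    "both he and the executive board in Region North Jutland\n"
    "have been in hiding.\n"
)

preprocessed = preprocess(article)
```

Note that a purely line-based rule like this one places a full stop after "Jutland", in the middle of a sentence, just as the example above does.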
----------- | ||
<div style="text-align: right"> | ||
Up next: | ||
<br> | ||
<a href="https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md">Entity Recognition</a> | ||
<span class="pagination_icon__3ocd0"><svg class="with-icon_icon__MHUeb" data-testid="geist-icon" fill="none" height="24" shape-rendering="geometricPrecision" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" viewBox="0 0 24 24" width="24" style="color:currentColor;width:11px;height:11px"><path d="M9 18l6-6-6-6"></path></svg></span> | ||
</div> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# Pipeline output | ||
The pipeline output is a [JSON](https://en.wikipedia.org/wiki/JSON) structure containing the entity mentions and links for a given article. | ||
|
||
## The [JSON](https://en.wikipedia.org/wiki/JSON) output | ||
```JSON | ||
{ | ||
"fileName": STRING, | ||
"language": STRING, | ||
"metadataId": UUID (STRING), | ||
"sentences": [ | ||
{ | ||
"sentence": STRING, | ||
"sentenceStartIndex": INT, | ||
"sentenceEndIndex": INT, | ||
"entityMentions": [ | ||
{ | ||
"name": STRING, | ||
"type": STRING, | ||
"label": STRING, | ||
"startIndex": INT, | ||
"endIndex": INT, | ||
"iri": STRING? | ||
} | ||
] | ||
} | ||
] | ||
} | ||
``` | ||
Each file (article) has a language (detected by the [Language Detector](https://pypi.org/project/langdetect/)), a metadataId (forwarded by **pipeline A**), and a list of sentences, each of which contains a list of entity mentions. | ||
> _**NOTE**_: The `iri` property can be `null`. | ||
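
For readers who prefer typed structures, the JSON layout above can be mirrored with standard-library dataclasses. The class names below are hypothetical and only the field names come from the structure shown; this is a sketch, not how the project builds its output.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EntityMention:
    name: str
    type: str
    label: str
    startIndex: int
    endIndex: int
    iri: Optional[str] = None  # serialized as null when no link was found


@dataclass
class Sentence:
    sentence: str
    sentenceStartIndex: int
    sentenceEndIndex: int
    entityMentions: List[EntityMention] = field(default_factory=list)


@dataclass
class PipelineOutput:
    fileName: str
    language: str
    metadataId: str
    sentences: List[Sentence] = field(default_factory=list)


output = PipelineOutput(
    fileName="article.txt",
    language="en",
    metadataId="790261e8-b8ec-4801-9cbd-00263bcc666d",
    sentences=[
        Sentence(
            sentence="Something happened two days ago.",
            sentenceStartIndex=0,
            sentenceEndIndex=32,
            entityMentions=[
                EntityMention("two days ago", "Literal", "DATE", 19, 31),
            ],
        )
    ],
)

# asdict() converts nested dataclasses and lists recursively.
payload = json.dumps(asdict(output))
```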
## Example [JSON](https://en.wikipedia.org/wiki/JSON) output | ||
```JSON | ||
{ | ||
"language": "en", | ||
"metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d", | ||
"sentences": [ | ||
{ | ||
"sentence": "Barrack Obama was married to Michelle Obama two days ago.", | ||
"sentenceStartIndex": 20, | ||
"sentenceEndIndex": 62, | ||
"entityMentions": | ||
[ | ||
{ "name": "Barrack Obama", "type": "Entity", "label": "PERSON", "startIndex": 0, "endIndex": 12, "iri": "knox-kb01.srv.aau.dk/Barack_Obama" }, | ||
{ "name": "Michelle Obama", "type": "Entity", "label": "PERSON", "startIndex": 59, "endIndex": 73, "iri": "knox-kb01.srv.aau.dk/Michele_Obama" }, | ||
{ "name": "two days ago", "type": "Literal", "label": "DATE", "startIndex": 74, "endIndex": 86, "iri": null } | ||
] | ||
} | ||
] | ||
} | ||
``` | ||
|
||
## Sending the [JSON](https://en.wikipedia.org/wiki/JSON) output to pipeline C | ||
Lastly, the [JSON](https://en.wikipedia.org/wiki/JSON) output is sent to **pipeline C** using a `POST` request. See [the code](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/e442dc496002b788d30f996cdfc87d36f5bcaa35/main.py#L32) for implementation details. | ||
|
||
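As a rough idea of what such a `POST` request involves, the sketch below builds one with the standard library's `urllib.request`. The endpoint URL and the `build_post_request` helper are made up for illustration, and the linked `main.py` may well use a different HTTP client.

```python
import json
import urllib.request


def build_post_request(url: str, output: dict) -> urllib.request.Request:
    """Wrap the pipeline output in a JSON POST request."""
    body = json.dumps(output).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_post_request(
    "http://localhost:8000/hypothetical-pipeline-c-endpoint",  # placeholder URL
    {"fileName": "article.txt", "language": "en", "sentences": []},
)

# Sending is left out so the sketch runs offline:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```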
----------- | ||
<div style="text-align: left"> | ||
Go back to: | ||
<br> | ||
<span class="pagination_icon__3ocd0"><svg class="with-icon_icon__MHUeb" data-testid="geist-icon" fill="none" height="24" shape-rendering="geometricPrecision" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" viewBox="0 0 24 24" width="24" style="color: currentcolor; width: 11px; height: 11px;"><path d="M15 18l-6-6 6-6"></path></svg></span> | ||
<a href="https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md">Entity Linker</a> | ||
|
||
</div> |