Commit

Merge branch 'main' into 86-send-output-to-group-cs-endpoint
FredTheNoob authored Dec 11, 2023
2 parents e442dc4 + 5fcd59b commit 9ac335e
Showing 10 changed files with 288 additions and 4 deletions.
20 changes: 19 additions & 1 deletion README.md
@@ -1 +1,19 @@
# ProcessingLayer_EntityRecognitionAndLinking
# PreProcessingLayer_EntityRecognitionAndLinking
Pipeline B's Python implementation of Entity Recognition and Entity Linking.

## [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md)

## [This pipeline explained](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline.md)

### Full documentation available at: [http://wiki.knox.aau.dk](http://wiki.knox.aau.dk/)
### The 2023 report is available at: [https://www.overleaf.com/project/64feed8bda5b70b36afb6597](https://www.overleaf.com/project/64feed8bda5b70b36afb6597)

### 2023 Authors
```txt
Alija Cerimagic
Frederik Ødgaard Hammer
Mathias Frihauge
Nichlas Blak Rønberg
Peter Bækgaard
Åsmundur Alexander Kjærbæk Thorsen
```
2 changes: 1 addition & 1 deletion components/GetSpacyData.py
@@ -88,7 +88,7 @@ def BuildJSONFromEntities(entities: List[EntityLinked], doc, fileName: str) -> J

# Create the final JSON structure
final_json = {
"fileName": fileName,
"fileName": fileName.split("/")[-1],
"language": DetectLang(doc),
"metadataId":"7467628c-ad77-4bd7-9810-5f3930796fb5",
"sentences": sentences_json,
9 changes: 7 additions & 2 deletions docs/api.md
@@ -12,6 +12,7 @@ The `/entitymentions` endpoint is a <span style="color:lightgreen">**GET**</span
{
"fileName": STRING,
"language": STRING,
"metadataId": UUID (STRING),
"sentences": [
{
"sentence": STRING,
@@ -24,7 +25,7 @@ The `/entitymentions` endpoint is a <span style="color:lightgreen">**GET**</span
"label": STRING,
"startIndex": INT,
"endIndex": INT,
"iri": STRING
"iri": STRING?
}
]
}
@@ -40,6 +41,7 @@ Here is an example of an output from the endpoint `/entitymentions?article=test.
{
"fileName": "test.txt",
"language": "en",
"metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666a",
"sentences": [
{
"sentence": "Hi my name is marc",
@@ -70,6 +72,7 @@ The `/entitymentions/all` endpoint is a <span style="color:lightgreen">**GET**</
{
"fileName": STRING,
"language": STRING,
"metadataId": UUID (STRING),
"sentences": [
{
"sentence": STRING,
@@ -82,7 +85,7 @@ The `/entitymentions/all` endpoint is a <span style="color:lightgreen">**GET**</
"label": STRING,
"startIndex": INT,
"endIndex": INT,
"iri": STRING
"iri": STRING?
}
]
}
@@ -100,6 +103,7 @@ Here is an example of an output from the endpoint when getting all articles. For
{
"fileName": "test.txt",
"language": "en",
"metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d",
"sentences": [
{
"sentence": "Hi my name is marc",
@@ -121,6 +125,7 @@ Here is an example of an output from the endpoint when getting all articles. For
{
"fileName": "test2.txt",
"language": "en",
"metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666c",
"sentences": [
{
"sentence": "Hi my name is joe",
73 changes: 73 additions & 0 deletions docs/database.md
@@ -0,0 +1,73 @@
# [Database](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/components/Db.py)
The database is responsible for keeping track of sentences, entity mentions, and entity indices.

## Features
- CRUD (Create, Read, Update, Delete) Operations supported.
- Uses [SQLite](https://www.sqlite.org/index.html).
- Seeds the database with required tables if they do not exist.

## <a name="overview"></a>Overview
The database contains the following tables:

![](img/database-visualized.png)

### sentence
Contains each sentence from all input text. Has a unique `sid`.

### entitymention
Represents each entity mention from all input text. Has `sid` as foreign key (a sentence must exist for the entitymention to exist).

### EntityIndex
Used by the [Entity Linker](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/components/EntityLinker.py) to find potential matches for a given entity mention. See [Entity Linker Docs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md) for more information.
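
The relationship between the tables can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the schema from `Db.py`: only the column names `sid`, `eid`, `id`, `string`, `mention`, `name`, `fileName`, `startindex` and `endIndex` appear in this documentation, while the column types, the constraints and the `seed` helper are assumptions.

```python
import sqlite3

# Hypothetical schema: column names come from this documentation,
# types and constraints are assumptions made for the sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sentence (
    sid INTEGER PRIMARY KEY AUTOINCREMENT,
    fileName TEXT,
    string TEXT,
    startindex INTEGER,
    endIndex INTEGER
);
CREATE TABLE IF NOT EXISTS entitymention (
    eid INTEGER PRIMARY KEY AUTOINCREMENT,
    sid INTEGER NOT NULL REFERENCES sentence(sid),
    mention TEXT
);
CREATE TABLE IF NOT EXISTS EntityIndex (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT
);
"""


def seed(conn: sqlite3.Connection) -> None:
    """Create the tables if they do not exist yet (hypothetical helper)."""
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
seed(conn)
seed(conn)  # seeding twice is harmless because of IF NOT EXISTS

sid = conn.execute(
    "INSERT INTO sentence (fileName, string, startindex, endIndex) VALUES (?, ?, ?, ?)",
    ("article.txt", "A duck walked across the road", 20, 29),
).lastrowid
conn.execute("INSERT INTO entitymention (sid, mention) VALUES (?, ?)", (sid, "duck"))

# An entity mention that points at a missing sentence is rejected.
try:
    conn.execute("INSERT INTO entitymention (sid, mention) VALUES (?, ?)", (999, "orphan"))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The last insert fails because no sentence with `sid` 999 exists, which is the foreign key rule described for `entitymention`.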

## Methods
```python
async def InitializeIndexDB(dbPath):
```
### Parameters:
- **dbPath** (str): The path where the database is (or will be) stored, i.e. `some/path/to/a/Database/directory`.

```python
async def Insert(dbPath, tableName, queryInformation):
```
### Parameters:
- **dbPath** (str): The path where the database is (or will be) stored, i.e. `some/path/to/a/Database/directory`.
- **tableName** (str): The name of the table you want to insert into. The available tables are listed in the [Overview](#overview).
- **queryInformation** (JSON): A JSON object containing the key-value pairs you want to insert, for example:
```JSON
{
"fileName": "article.txt",
"string": "A duck walked across the road",
"startindex": 20,
"endIndex": 29
}
```
This would be a valid insert into the `sentence` table.
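
As an illustration of how such key-value pairs could map onto SQL, the sketch below builds a parameterized `INSERT` from a dictionary. The helper `build_insert`, the `ALLOWED_TABLES` set and the table definition are hypothetical and are not taken from `Db.py`; the actual `Insert` function may work differently.

```python
import sqlite3

# Table names cannot be bound as SQL parameters, so they are checked
# against a fixed set instead (assumed from the Overview section).
ALLOWED_TABLES = {"sentence", "entitymention", "EntityIndex"}


def build_insert(table_name: str, query_information: dict) -> tuple[str, list]:
    """Hypothetical helper: turn key-value pairs into a parameterized INSERT."""
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table_name}")
    columns = list(query_information)
    if not all(column.isidentifier() for column in columns):
        raise ValueError("column names must be plain identifiers")
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table_name} ({col_list}) VALUES ({placeholders})"
    return sql, [query_information[column] for column in columns]


conn = sqlite3.connect(":memory:")
# Assumed table definition, only for this example.
conn.execute(
    "CREATE TABLE sentence ("
    "sid INTEGER PRIMARY KEY AUTOINCREMENT, fileName TEXT, string TEXT, "
    "startindex INTEGER, endIndex INTEGER)"
)

sql, values = build_insert(
    "sentence",
    {
        "fileName": "article.txt",
        "string": "A duck walked across the road",
        "startindex": 20,
        "endIndex": 29,
    },
)
conn.execute(sql, values)
row = conn.execute("SELECT sid, fileName, string FROM sentence").fetchone()
```

The values are bound through `?` placeholders rather than formatted into the statement, which avoids quoting problems in the inserted text.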

> **_NOTE:_** The `sid` is autogenerated using `AUTOINCREMENT`.

```python
async def Read(dbPath, tableName, searchPred=""):
```
### Parameters:
- **dbPath** (str): The path where the database is (or will be) stored, i.e. `some/path/to/a/Database/directory`.
- **tableName** (str): The name of the table you want to read from. The available tables are listed in the [Overview](#overview).
- **searchPred** (str, optional): The search predicate to query the table with. For example, if `searchPred` = `Jones` and `tableName` = `entitymention`, the `entitymention` table is searched for `Jones`.
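
A read with an optional search predicate could look like the sketch below. It assumes an exact match on the text column of each table (`string`, `mention` or `name`) and that an empty predicate returns every row; the helper `read_rows` and the table definition are hypothetical, and the real `Read` may match differently.

```python
import sqlite3

# Assumed text column per table, based on the names used in this document.
SEARCH_COLUMN = {"sentence": "string", "entitymention": "mention", "EntityIndex": "name"}


def read_rows(conn: sqlite3.Connection, table_name: str, search_pred: str = "") -> list:
    """Hypothetical helper: return all rows, or only those matching search_pred."""
    column = SEARCH_COLUMN[table_name]  # raises KeyError for unknown tables
    if search_pred == "":
        return conn.execute(f"SELECT * FROM {table_name}").fetchall()
    return conn.execute(
        f"SELECT * FROM {table_name} WHERE {column} = ?", (search_pred,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Simplified table definition, only for this example.
conn.execute("CREATE TABLE entitymention (eid INTEGER PRIMARY KEY AUTOINCREMENT, mention TEXT)")
conn.executemany("INSERT INTO entitymention (mention) VALUES (?)", [("Jones",), ("Smith",)])

all_rows = read_rows(conn, "entitymention")
jones_rows = read_rows(conn, "entitymention", "Jones")
```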

```python
async def Update(dbPath, tableName, indexID, updatedName):
```
### Parameters:
- **dbPath** (str): The path where the database is (or will be) stored, i.e. `some/path/to/a/Database/directory`.
- **tableName** (str): The name of the table you want to update. The available tables are listed in the [Overview](#overview).
- **indexID** (str): The `sid`, `eid` or `id` (EntityIndex) to update.
- **updatedName** (str): What the `string`, `mention` or `name` should be updated to.


```python
async def Delete(dbPath, tableName, indexID):
```
### Parameters:
- **dbPath** (str): The path where the database is (or will be) stored, i.e. `some/path/to/a/Database/directory`.
- **tableName** (str): The name of the table you want to delete from. The available tables are listed in the [Overview](#overview).
- **indexID** (str): The `sid`, `eid` or `id` (EntityIndex) to delete.
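
The update and delete operations both address a row by its id column (`sid`, `eid` or `id`). The sketch below shows one way that lookup could be expressed; the helpers `update_row` and `delete_row`, the two mapping dictionaries and the table definition are assumptions for illustration, not the code in `Db.py`.

```python
import sqlite3

# Assumed id and text columns per table, based on the parameter descriptions above.
ID_COLUMN = {"sentence": "sid", "entitymention": "eid", "EntityIndex": "id"}
TEXT_COLUMN = {"sentence": "string", "entitymention": "mention", "EntityIndex": "name"}


def update_row(conn: sqlite3.Connection, table_name: str, index_id: int, updated_name: str) -> int:
    """Hypothetical helper: set the text column of one row, return rows changed."""
    cursor = conn.execute(
        f"UPDATE {table_name} SET {TEXT_COLUMN[table_name]} = ? "
        f"WHERE {ID_COLUMN[table_name]} = ?",
        (updated_name, index_id),
    )
    return cursor.rowcount


def delete_row(conn: sqlite3.Connection, table_name: str, index_id: int) -> int:
    """Hypothetical helper: delete one row by id, return rows removed."""
    cursor = conn.execute(
        f"DELETE FROM {table_name} WHERE {ID_COLUMN[table_name]} = ?", (index_id,)
    )
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
# Simplified table definition, only for this example.
conn.execute("CREATE TABLE EntityIndex (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
conn.execute("INSERT INTO EntityIndex (name) VALUES (?)", ("Barrack Obama",))

updated = update_row(conn, "EntityIndex", 1, "Barack Obama")
name_after_update = conn.execute("SELECT name FROM EntityIndex WHERE id = 1").fetchone()[0]
deleted = delete_row(conn, "EntityIndex", 1)
remaining = conn.execute("SELECT * FROM EntityIndex").fetchall()
```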
60 changes: 60 additions & 0 deletions docs/directorywatcher.md
@@ -0,0 +1,60 @@
# [Directory Watcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py)
The pipeline starts when a new file is placed in a watched folder by pipeline part A. The Directory Watcher's responsibility is to call a callback function when a new file is created in the watched folder.

## Features
- [watchdog](https://pypi.org/project/watchdog/) for file events
- Async callback support
- [Threading](https://docs.python.org/3/library/threading.html)

## Overview

The `DirectoryWatcher` provides a simple way to monitor a specified directory for file creation events and execute asynchronous callbacks in response. It uses the [watchdog](https://pypi.org/project/watchdog/) library for filesystem monitoring and integrates with [asyncio](https://docs.python.org/3/library/asyncio.html) for handling asynchronous tasks. Furthermore, the `DirectoryWatcher` runs the monitoring in a separate thread using [threading](https://docs.python.org/3/library/threading.html).

> **_NOTE:_** [Threading](https://docs.python.org/3/library/threading.html) is used to avoid blocking the main thread's code from executing.
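
How a background thread can hand work to an async callback is sketched below using only the standard library. The sketch replaces the watchdog event handler with a plain worker thread that reports a single path, so it shows the thread-to-asyncio hand-off and not the actual `DirectoryWatcher` implementation; `notify_from_thread` and `new_file_created` are hypothetical names.

```python
import asyncio
import threading


def notify_from_thread(loop, async_callback, file_path: str) -> None:
    """Runs in a worker thread: schedule the coroutine on the event loop and wait."""
    future = asyncio.run_coroutine_threadsafe(async_callback(file_path), loop)
    future.result(timeout=5)


async def main() -> list:
    seen = []

    async def new_file_created(file_path: str) -> None:
        seen.append(file_path)

    loop = asyncio.get_running_loop()
    # Stand-in for the watcher thread: it reports one "created" file.
    worker = threading.Thread(
        target=notify_from_thread,
        args=(loop, new_file_created, "some/article.txt"),
    )
    worker.start()
    # Join the worker in an executor so the event loop stays free to run the callback.
    await loop.run_in_executor(None, worker.join)
    return seen


seen_paths = asyncio.run(main())
```

`asyncio.run_coroutine_threadsafe` is the standard way to submit a coroutine to a loop that runs in another thread, which is why the main thread is never blocked by the watcher.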

## Example usage
```python
# Importing
from fastapi import FastAPI

from lib.DirectoryWatcher import DirectoryWatcher

app = FastAPI()
dirPath = "some/path/to/a/directory"


# Setup: async callback invoked with the path of each newly created file
async def newFileCreated(file_path: str):
    print("New file created in " + file_path)


dirWatcher = DirectoryWatcher(
    directory=dirPath, async_callback=newFileCreated
)


# A FastAPI event function running on startup
@app.on_event("startup")
async def startEvent():
    dirWatcher.start_watching()


# A FastAPI event function running on shutdown
@app.on_event("shutdown")
def shutdown_event():
    dirWatcher.stop_watching()
```

> **_NOTE:_** The FastAPI event functions are not required in order to use the `DirectoryWatcher`.

## Methods
```python
def __init__(self, directory, async_callback):
```
### Parameters:
- **directory** (str): A path to the directory you want to watch, i.e. `some/path/to/a/directory`.
- **async_callback** (function): An async callback function to be called when a new file is created in the **directory**. This function should accept a single parameter, which is the path of the created file.

```python
def start_watching(self) -> threading.Thread:
```

```python
def stop_watching(self):
```
4 changes: 4 additions & 0 deletions docs/img/KNOX_component_diagram-B.drawio.svg
Binary file added docs/img/database-visualized.png
33 changes: 33 additions & 0 deletions docs/our-part-of-the-pipeline.md
@@ -0,0 +1,33 @@
# Our part of the pipeline
### (also available at: [http://wiki.knox.aau.dk/en/entity-extraction](http://wiki.knox.aau.dk/en/entity-extraction))

Our part of the pipeline is concerned with Entity Recognition and Entity Linking. This solution uses the [spaCy](https://spacy.io/) library to perform Entity Recognition, and the [FuzzyWuzzy](https://pypi.org/project/fuzzywuzzy/) library for Entity Linking.

> The following sections describe this pipeline *in order*, starting with a visual overview.

## Overview
![](img/KNOX_component_diagram-B.drawio.svg)

## How to get started
See the [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md) guide.

## The input that the solution takes
See the [input](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-input.md) explanation.

## Entity Recognition
Check out the [Entity Recognition documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md).

## Entity Linking
Check out the [Entity Linker documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md).

## The output it produces
See the [output](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-output.md) explanation.


## Other components
- The [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/directorywatcher.md)
- The [Database](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/database.md)
- The [APIs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/api.md)
- The [Language Detector](https://pypi.org/project/langdetect/)

## Future work
28 changes: 28 additions & 0 deletions docs/our-part-of-the-pipeline/pipeline-input.md
@@ -0,0 +1,28 @@
# Pipeline input
The pipeline starts when a new file (article) is detected in a watched directory by the [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py). This new file is produced by **pipeline A**.

## Example input data
```txt
Since the sudden exit of the controversial CEO Martin Kjær last week,
both he and the executive board in Region North Jutland
have been in hiding.
```
> some/article.txt

## Preprocessing the input
Before the Entity Recognizer can use the input, it must be preprocessed. This entails removing newlines and adding punctuation where needed.

### Example preprocessed input data
```txt
Since the sudden exit of the controversial CEO Martin Kjær last week,
both he and the executive board in Region North Jutland. have been in hiding.
```
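
The preprocessing step can be approximated with the sketch below. The rules are assumptions inferred from the example, not the project's actual code: every line is stripped, a period is appended when the line does not already end in punctuation, and the lines are joined with single spaces. The function name `preprocess` is hypothetical.

```python
def preprocess(text: str) -> str:
    """Hypothetical sketch: drop newlines and close unpunctuated lines with a period."""
    closing = ".,;:!?"
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line[-1] not in closing:
            line += "."
        parts.append(line)
    return " ".join(parts)


raw = (
    "Since the sudden exit of the controversial CEO Martin Kjær last week,\n"
    "both he and the executive board in Region North Jutland\n"
    "have been in hiding.\n"
)
cleaned = preprocess(raw)
```

Under these assumed rules the second line receives a period because it ends without punctuation, which yields the same wording as the preprocessed example above, joined into one line.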

-----------
<div style="text-align: right">
Up next:
<br>
<a href="https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md">Entity Recognition</a>
<span class="pagination_icon__3ocd0"><svg class="with-icon_icon__MHUeb" data-testid="geist-icon" fill="none" height="24" shape-rendering="geometricPrecision" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" viewBox="0 0 24 24" width="24" style="color:currentColor;width:11px;height:11px"><path d="M9 18l6-6-6-6"></path></svg></span>
</div>
63 changes: 63 additions & 0 deletions docs/our-part-of-the-pipeline/pipeline-output.md
@@ -0,0 +1,63 @@
# Pipeline output
The pipeline output is a [JSON](https://en.wikipedia.org/wiki/JSON) structure containing the entity mentions and links for a given article.

## The [JSON](https://en.wikipedia.org/wiki/JSON) output
```JSON
{
"fileName": STRING,
"language": STRING,
"metadataId": UUID (STRING),
"sentences": [
{
"sentence": STRING,
"sentenceStartIndex": INT,
"sentenceEndIndex": INT,
"entityMentions": [
{
"name": STRING,
"type": STRING,
"label": STRING,
"startIndex": INT,
"endIndex": INT,
"iri": STRING?
}
]
}
]
}
```
The output describes a file (article) with its language (detected by the [Language Detector](https://pypi.org/project/langdetect/)), a `metadataId` (forwarded by **pipeline A**), and a list of sentences, each of which contains a list of entity mentions.

> _**NOTE**_: The `iri` property can be `null`.

## Example [JSON](https://en.wikipedia.org/wiki/JSON) output
```JSON
{
"language": "en",
"metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d",
"sentences": [
{
"sentence": "Barrack Obama was married to Michelle Obama two days ago.",
"sentenceStartIndex": 20,
"sentenceEndIndex": 62,
"entityMentions":
[
{ "name": "Barrack Obama", "type": "Entity", "label": "PERSON", "startIndex": 0, "endIndex": 12, "iri": "knox-kb01.srv.aau.dk/Barack_Obama" },
{ "name": "Michelle Obama", "type": "Entity", "label": "PERSON", "startIndex": 59, "endIndex": 73, "iri": "knox-kb01.srv.aau.dk/Michele_Obama" },
{ "name": "two days ago", "type": "Literal", "label": "DATE", "startIndex": 74, "endIndex": 86, "iri": null }
]
}
]
}
```
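
A consumer of this output has to account for mentions whose `iri` is `null`. The sketch below walks the structure and collects only the linked entities; the helper `linked_iris` is hypothetical, and the sample is a shortened version of the example above.

```python
import json


def linked_iris(document: dict) -> list:
    """Hypothetical helper: collect the IRIs of all mentions that were linked."""
    iris = []
    for sentence in document.get("sentences", []):
        for mention in sentence.get("entityMentions", []):
            iri = mention.get("iri")
            if iri is not None:  # JSON null becomes None in Python
                iris.append(iri)
    return iris


sample = json.loads(
    """
    {
      "language": "en",
      "sentences": [
        {
          "sentence": "Barrack Obama was married to Michelle Obama two days ago.",
          "entityMentions": [
            {"name": "Barrack Obama", "type": "Entity", "label": "PERSON",
             "iri": "knox-kb01.srv.aau.dk/Barack_Obama"},
            {"name": "two days ago", "type": "Literal", "label": "DATE",
             "iri": null}
          ]
        }
      ]
    }
    """
)
iris = linked_iris(sample)
```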

## Sending the [JSON](https://en.wikipedia.org/wiki/JSON) output to pipeline C
Finally, the [JSON](https://en.wikipedia.org/wiki/JSON) output is sent to **pipeline C** using a `POST` request. See [the code](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/e442dc496002b788d30f996cdfc87d36f5bcaa35/main.py#L32) for implementation details.
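
For readers who only want the general shape of such a request, the sketch below builds a JSON `POST` with the standard library. It is not the code from `main.py`: the URL is a placeholder, the helper `build_post_request` is hypothetical, and the project may use a different HTTP client.

```python
import json
import urllib.request


def build_post_request(url: str, payload: dict) -> urllib.request.Request:
    """Hypothetical helper: wrap a JSON payload in a POST request object."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint; the real pipeline C address is not given in this document.
request = build_post_request(
    "http://localhost:8000/hypothetical-endpoint",
    {"fileName": "test.txt", "language": "en", "sentences": []},
)
# urllib.request.urlopen(request) would send it; it is not called in this sketch.
```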

-----------
<div style="text-align: left">
Go back to:
<br>
<span class="pagination_icon__3ocd0"><svg class="with-icon_icon__MHUeb" data-testid="geist-icon" fill="none" height="24" shape-rendering="geometricPrecision" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" viewBox="0 0 24 24" width="24" style="color: currentcolor; width: 11px; height: 11px;"><path d="M15 18l-6-6 6-6"></path></svg></span>
<a href="https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md">Entity Linker</a>

</div>
