diff --git a/README.md b/README.md
index 0fc45d7..328075b 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,19 @@
-# ProcessingLayer_EntityRecognitionAndLinking
\ No newline at end of file
+# PreProcessingLayer_EntityRecognitionAndLinking
+Pipeline B's Python implementation of Entity Recognition and Entity Linking.
+
+## [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md)
+
+## [This pipeline explained](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline.md)
+
+### Full documentation available at: [http://wiki.knox.aau.dk](http://wiki.knox.aau.dk/)
+### The 2023 report is available at: [https://www.overleaf.com/project/64feed8bda5b70b36afb6597](https://www.overleaf.com/project/64feed8bda5b70b36afb6597)
+
+### 2023 Authors
+```txt
+Alija Cerimagic
+Frederik Ødgaard Hammer
+Mathias Frihauge
+Nichlas Blak Rønberg
+Peter Bækgaard
+Åsmundur Alexander Kjærbæk Thorsen
+```
diff --git a/components/GetSpacyData.py b/components/GetSpacyData.py
index baa2939..78a31ec 100644
--- a/components/GetSpacyData.py
+++ b/components/GetSpacyData.py
@@ -88,7 +88,7 @@ def BuildJSONFromEntities(entities: List[EntityLinked], doc, fileName: str) -> J
# Create the final JSON structure
final_json = {
- "fileName": fileName,
+ "fileName": fileName.split("/")[-1],
"language": DetectLang(doc),
"metadataId":"7467628c-ad77-4bd7-9810-5f3930796fb5",
"sentences": sentences_json,
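The change above keeps only the last `/`-separated component of `fileName`, so a watched path such as `some/dir/article.txt` is reported as `article.txt`. A minimal sketch of that behaviour follows; the helper name `base_file_name` is hypothetical and not part of the repository, and `posixpath.basename` is shown only as a standard-library equivalent for `/`-separated paths.

```python
import posixpath


def base_file_name(file_name: str) -> str:
    """Return the last '/'-separated component, as the changed line does."""
    return file_name.split("/")[-1]


# Both approaches agree for '/'-separated paths.
print(base_file_name("some/dir/article.txt"))
print(posixpath.basename("some/dir/article.txt"))
```

Neither approach splits on backslashes, so a Windows-style path would be returned unchanged.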
diff --git a/docs/api.md b/docs/api.md
index 59b9607..cae1e72 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -12,6 +12,7 @@ The `/entitymentions` endpoint is a **GET**
{
"fileName": STRING,
"language": STRING,
+ "metadataId": UUID (STRING),
"sentences": [
{
"sentence": STRING,
@@ -82,7 +85,7 @@ The `/entitymentions/all` endpoint is a **GET**
"label": STRING,
"startIndex": INT,
"endIndex": INT,
- "iri": STRING
+ "iri": STRING?
}
]
}
@@ -100,6 +103,7 @@ Here is an example of an output from the endpoint when getting all articles. For
{
"fileName": "test.txt",
"language": "en",
+ "metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d",
"sentences": [
{
"sentence": "Hi my name is marc",
@@ -121,6 +125,7 @@ Here is an example of an output from the endpoint when getting all articles. For
{
"fileName": "test2.txt",
"language": "en",
+ "metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666c",
"sentences": [
{
"sentence": "Hi my name is joe",
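With `metadataId` added and `iri` now marked as optional (`STRING?`), a consumer of the endpoint may want to validate the UUID and skip mentions that were not linked. The sketch below illustrates one way to do that under stated assumptions: the keys `entityMentions` and `name`, the label value, and the helper `linked_iris` are hypothetical, since the excerpt above does not show the name of the mention list.

```python
import json
import uuid
from typing import List, Optional

# Hypothetical response body. "entityMentions" and "name" are assumed key names
# that are not visible in the documented excerpt.
raw = """
{
  "fileName": "test.txt",
  "language": "en",
  "metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d",
  "sentences": [
    {
      "sentence": "Hi my name is marc",
      "entityMentions": [
        {"name": "marc", "label": "PERSON", "startIndex": 14, "endIndex": 18, "iri": null}
      ]
    }
  ]
}
"""


def linked_iris(article: dict) -> List[str]:
    """Collect the IRIs of linked mentions, skipping mentions whose iri is null."""
    uuid.UUID(article["metadataId"])  # raises ValueError if this is not a UUID string
    iris: List[str] = []
    for sentence in article["sentences"]:
        for mention in sentence.get("entityMentions", []):
            iri: Optional[str] = mention.get("iri")
            if iri is not None:
                iris.append(iri)
    return iris


article = json.loads(raw)
print(linked_iris(article))
```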
diff --git a/docs/database.md b/docs/database.md
new file mode 100644
index 0000000..99c32f8
--- /dev/null
+++ b/docs/database.md
@@ -0,0 +1,73 @@
+# [Database](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/components/Db.py)
+The database is responsible for keeping track of sentences, entity mentions, and entity indices.
+
+## Features
+- CRUD (Create, Read, Update, Delete) Operations supported.
+- Uses [SQLite](https://www.sqlite.org/index.html).
+- Seeds the database with required tables if they do not exist.
+
+## Overview
+The database contains the following tables:
+
+![Database tables visualized](img/database-visualized.png)
+
+### sentence
+Contains each sentence from all input text. Has a unique `sid`.
+
+### entitymention
+Represents each entity mention from all input text. Has `sid` as foreign key (a sentence must exist for the entitymention to exist).
+
+### EntityIndex
+Used by the [Entity Linker](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/components/EntityLinker.py) to find potential matches for a given entity mention. See [Entity Linker Docs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md) for more information.
+
+## Methods
+```python
+async def InitializeIndexDB(dbPath):
+```
+### Parameters:
+- **dbPath** (str): Path to the directory where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`.
+
+```python
+async def Insert(dbPath, tableName, queryInformation):
+```
+### Parameters:
+- **dbPath** (str): Path to the directory where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`.
+- **tableName** (str): The name of the table to insert into. The available tables are listed in [Overview](#overview).
+- **queryInformation** (JSON): A JSON object containing the key-value pairs to insert, for example:
+```JSON
+{
+    "fileName": "article.txt",
+    "string": "A duck walked across the road",
+    "startindex": 20,
+    "endIndex": 29
+}
+```
+This would be a valid insert into the `sentence` table.
+
+> **_NOTE:_** The `sid` is autogenerated using `AUTOINCREMENT`.
+
+```python
+async def Read(dbPath, tableName, searchPred=""):
+```
+### Parameters:
+- **dbPath** (str): Path to the directory where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`.
+- **tableName** (str): The name of the table to read from. The available tables are listed in [Overview](#overview).
+- **searchPred** (str, optional): The search predicate to query the table with. For example, if `searchPred` is `Jones` and `tableName` is `entitymention`, the `entitymention` table is searched for `Jones`. Defaults to `""`.
+
+```python
+async def Update(dbPath, tableName, indexID, updatedName):
+```
+### Parameters:
+- **dbPath** (str): Path to the directory where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`.
+- **tableName** (str): The name of the table to update. The available tables are listed in [Overview](#overview).
+- **indexID** (str): The `sid`, `eid` or `id` (EntityIndex) of the row to update.
+- **updatedName** (str): The new value for the `string`, `mention` or `name` column.
+
+
+```python
+async def Delete(dbPath, tableName, indexID):
+```
+### Parameters:
+- **dbPath** (str): Path to the directory where the database is, or will be, stored, e.g. `some/path/to/a/Database/directory`.
+- **tableName** (str): The name of the table to delete from. The available tables are listed in [Overview](#overview).
+- **indexID** (str): The `sid`, `eid` or `id` (EntityIndex) of the row to delete.
\ No newline at end of file
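The table relationships described in `docs/database.md` can be sketched with the standard-library `sqlite3` module. This is not the code in `components/Db.py`: the schema below is an assumption reconstructed from the JSON example and the `sid`/`eid` names in the doc, and it runs against an in-memory database.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Assumed schema: column names come from the documented JSON example where
# possible; the rest is a guess for illustration only.
conn.executescript(
    """
    CREATE TABLE IF NOT EXISTS sentence (
        sid INTEGER PRIMARY KEY AUTOINCREMENT,
        fileName TEXT,
        string TEXT,
        startindex INTEGER,
        endIndex INTEGER
    );
    CREATE TABLE IF NOT EXISTS entitymention (
        eid INTEGER PRIMARY KEY AUTOINCREMENT,
        sid INTEGER NOT NULL REFERENCES sentence(sid),
        mention TEXT
    );
    """
)

row = {
    "fileName": "article.txt",
    "string": "A duck walked across the road",
    "startindex": 20,
    "endIndex": 29,
}
# Values are bound as named parameters; the column names come from the trusted dict above.
columns = ", ".join(row)
placeholders = ", ".join(f":{key}" for key in row)
cursor = conn.execute(f"INSERT INTO sentence ({columns}) VALUES ({placeholders})", row)
sid = cursor.lastrowid  # generated by AUTOINCREMENT

# An entity mention must point at an existing sentence.
conn.execute("INSERT INTO entitymention (sid, mention) VALUES (?, ?)", (sid, "duck"))

try:
    conn.execute("INSERT INTO entitymention (sid, mention) VALUES (?, ?)", (999, "road"))
except sqlite3.IntegrityError as error:
    print("rejected:", error)
```

The failed insert shows why a sentence has to exist before an entity mention can reference it.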
diff --git a/docs/directorywatcher.md b/docs/directorywatcher.md
new file mode 100644
index 0000000..ef1cc1f
--- /dev/null
+++ b/docs/directorywatcher.md
@@ -0,0 +1,60 @@
+# [Directory Watcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py)
+The pipeline starts when a new file is placed in a watched folder by pipeline part A. The Directory Watcher's responsibility is to call a callback function when a new file is created in the watched folder.
+
+## Features
+- [watchdog](https://pypi.org/project/watchdog/) for file events
+- Async callback support
+- [Threading](https://docs.python.org/3/library/threading.html)
+
+## Overview
+
+The `DirectoryWatcher` provides a simple way to monitor a specified directory for file creation events and execute an asynchronous callback in response. It uses the [watchdog](https://pypi.org/project/watchdog/) library for filesystem monitoring and integrates with [asyncio](https://docs.python.org/3/library/asyncio.html) to run the asynchronous callback. Furthermore, the `DirectoryWatcher` uses [threading](https://docs.python.org/3/library/threading.html).
+
+> **_NOTE:_** [Threading](https://docs.python.org/3/library/threading.html) is used to avoid blocking the main thread's code from executing.
+
+
+## Example usage
+```python
+# Importing
+from fastapi import FastAPI
+
+from lib.DirectoryWatcher import DirectoryWatcher
+
+app = FastAPI()
+dirPath = "some/path/to/a/directory"
+
+# Setup: async callback invoked with the path of each newly created file
+async def newFileCreated(file_path: str):
+    print("New file created in " + file_path)
+
+
+dirWatcher = DirectoryWatcher(
+    directory=dirPath, async_callback=newFileCreated
+)
+
+# A FastAPI event function running on startup
+@app.on_event("startup")
+async def startEvent():
+    dirWatcher.start_watching()
+
+# A FastAPI event function running on shutdown
+@app.on_event("shutdown")
+def shutdown_event():
+    dirWatcher.stop_watching()
+```
+
+> **_NOTE:_** The FastAPI event functions are not required to use the `DirectoryWatcher`.
+
+
+## Methods
+```python
+def __init__(self, directory, async_callback):
+```
+### Parameters:
+- **directory** (str): A path to the directory you want to watch, e.g. `some/path/to/a/directory`.
+- **async_callback** (function): An async callback function to be called when a new file is created in the **directory**. This function should accept a single parameter, which is the path of the created file.
+
+```python
+def start_watching(self) -> threading.Thread:
+```
+
+```python
+def stop_watching(self):
+```
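The interplay of a background thread and an async callback described in `docs/directorywatcher.md` can be illustrated with the standard library alone. The sketch below is not the `DirectoryWatcher` implementation: the class `PollingWatcher` is hypothetical and polls with `os.listdir` instead of using watchdog events, purely to show how a worker thread can run an async callback without blocking the main thread.

```python
import asyncio
import os
import tempfile
import threading


class PollingWatcher:
    """Hypothetical stand-in for DirectoryWatcher that polls instead of using watchdog."""

    def __init__(self, directory, async_callback, interval=0.05):
        self.directory = directory
        self.async_callback = async_callback
        self.interval = interval
        self._stop_event = threading.Event()
        self._thread = None
        self._seen = set()

    def _run(self):
        while not self._stop_event.is_set():
            current = set(os.listdir(self.directory))
            for name in sorted(current - self._seen):
                # Run the coroutine to completion on this worker thread.
                asyncio.run(self.async_callback(os.path.join(self.directory, name)))
            self._seen = current
            self._stop_event.wait(self.interval)

    def start_watching(self) -> threading.Thread:
        # Snapshot existing files first so only files created later trigger the callback.
        self._seen = set(os.listdir(self.directory))
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self._thread

    def stop_watching(self):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()


created = []
done = threading.Event()


async def on_new_file(file_path: str):
    created.append(os.path.basename(file_path))
    done.set()


with tempfile.TemporaryDirectory() as directory:
    watcher = PollingWatcher(directory, on_new_file)
    watcher.start_watching()
    # The main thread stays free to do other work, here creating a file.
    with open(os.path.join(directory, "article.txt"), "w") as handle:
        handle.write("hello")
    done.wait(timeout=5)
    watcher.stop_watching()

print(created)
```

An event-based watcher avoids the polling delay, which is presumably why the real component relies on watchdog.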
diff --git a/docs/img/KNOX_component_diagram-B.drawio.svg b/docs/img/KNOX_component_diagram-B.drawio.svg
new file mode 100644
index 0000000..1cd135a
--- /dev/null
+++ b/docs/img/KNOX_component_diagram-B.drawio.svg
@@ -0,0 +1,4 @@
+
+
+
+
\ No newline at end of file
diff --git a/docs/img/database-visualized.png b/docs/img/database-visualized.png
new file mode 100644
index 0000000..7f25a80
Binary files /dev/null and b/docs/img/database-visualized.png differ
diff --git a/docs/our-part-of-the-pipeline.md b/docs/our-part-of-the-pipeline.md
new file mode 100644
index 0000000..fcf5e94
--- /dev/null
+++ b/docs/our-part-of-the-pipeline.md
@@ -0,0 +1,33 @@
+# Our part of the pipeline
+### (also available at: [http://wiki.knox.aau.dk/en/entity-extraction](http://wiki.knox.aau.dk/en/entity-extraction))
+
+Our part of the pipeline is concerned with Entity Recognition and Entity Linking. This solution uses the [spaCy](https://spacy.io/) library to perform Entity Recognition, and the [FuzzyWuzzy](https://pypi.org/project/fuzzywuzzy/) library for Entity Linking.
+
+> Every following section describes this pipeline *in order*, but first a visual overview.
+
+## Overview
+![KNOX component diagram for pipeline B](img/KNOX_component_diagram-B.drawio.svg)
+
+## How to get started
+See the [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md) guide.
+
+## The input that the solution takes
+See the [input](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-input.md) explanation
+
+## Entity Recognition
+Check out the [Entity Recognition documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md)
+
+## Entity Linking
+Check out the [Entity Linker documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md)
+
+## The output it produces
+See the [output](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-output.md) explanation
+
+
+## Other components
+- The [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/directorywatcher.md)
+- The [Database](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/database.md)
+- The [APIs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/api.md)
+- The [Language Detector](https://pypi.org/project/langdetect/)
+
+## Future work
diff --git a/docs/our-part-of-the-pipeline/pipeline-input.md b/docs/our-part-of-the-pipeline/pipeline-input.md
new file mode 100644
index 0000000..a0bed08
--- /dev/null
+++ b/docs/our-part-of-the-pipeline/pipeline-input.md
@@ -0,0 +1,28 @@
+# Pipeline input
+The pipeline starts when a new file (article) is detected in a watched directory by the [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py). This new file is produced by **pipeline A**.
+
+## Example input data
+```txt
+Since the sudden exit of the controversial CEO Martin Kjær last week,
+both he and the executive board in Region North Jutland
+
+have been in hiding.
+```
+> some/article.txt
+
+## Preprocessing the input
+Before the Entity Recognizer can use the input, it must be preprocessed. This entails removing newlines and adding punctuation where needed.
+
+### Example preprocessed input data
+```txt
+Since the sudden exit of the controversial CEO Martin Kjær last week,
+both he and the executive board in Region North Jutland. have been in hiding.
+```
+
+-----------
+
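The preprocessing step in `pipeline-input.md` (removing newlines and adding punctuation where needed) could look roughly like the sketch below. The function `preprocess` is hypothetical and not taken from the repository. It assumes that a blank line marks a break where a period is inserted if none is present, and that the line break remaining in the documented example output is only display wrapping.

```python
import re

TERMINATORS = (".", "!", "?")


def preprocess(text: str) -> str:
    """Join lines into one string, ending each blank-line-separated block with punctuation."""
    blocks = [block for block in re.split(r"\n\s*\n", text.strip()) if block.strip()]
    cleaned = []
    for block in blocks:
        # Collapse newlines and repeated whitespace inside the block.
        block = " ".join(block.split())
        if not block.endswith(TERMINATORS):
            block += "."
        cleaned.append(block)
    return " ".join(cleaned)


article = (
    "Since the sudden exit of the controversial CEO Martin Kjær last week,\n"
    "both he and the executive board in Region North Jutland\n"
    "\n"
    "have been in hiding.\n"
)
print(preprocess(article))
```

As in the documented example, the inserted period lands mid-sentence after "Jutland", because the blank line is treated as a sentence boundary.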