add "our part of the pipeline"
- add our component diagram
- update api.md
- update readme
FredTheNoob committed Dec 8, 2023
1 parent 07cb9ba commit 5b3d721
Showing 6 changed files with 154 additions and 3 deletions.
20 changes: 19 additions & 1 deletion README.md
@@ -1 +1,19 @@
# ProcessingLayer_EntityRecognitionAndLinking
# PreProcessingLayer_EntityRecognitionAndLinking
Pipeline B's Python implementation of Entity Recognition and Entity Linking

## [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md)

## [This pipeline explained](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline.md)

### Full documentation available at: [http://wiki.knox.aau.dk](http://wiki.knox.aau.dk/)
### The 2023 report is available at: [https://www.overleaf.com/project/64feed8bda5b70b36afb6597](https://www.overleaf.com/project/64feed8bda5b70b36afb6597)

### 2023 Authors
```txt
Alija Cerimagic
Frederik Ødgaard Hammer
Mathias Frihauge
Nichlas Blak Rønberg
Peter Bækgaard
Åsmundur Alexander Kjærbæk Thorsen
```
9 changes: 7 additions & 2 deletions docs/api.md
@@ -12,6 +12,7 @@ The `/entitymentions` endpoint is a <span style="color:lightgreen">**GET**</span
{
"fileName": STRING,
"language": STRING,
"metadataId": UUID (STRING),
"sentences": [
{
"sentence": STRING,
@@ -24,7 +25,7 @@ The `/entitymentions` endpoint is a <span style="color:lightgreen">**GET**</span
"label": STRING,
"startIndex": INT,
"endIndex": INT,
"iri": STRING
"iri": STRING?
}
]
}
@@ -40,6 +41,7 @@ Here is an example of an output from the endpoint `/entitymentions?article=test.
{
"fileName": "test.txt",
"language": "en",
"metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666a",
"sentences": [
{
"sentence": "Hi my name is marc",
@@ -70,6 +72,7 @@ The `/entitymentions/all` endpoint is a <span style="color:lightgreen">**GET**</
{
"fileName": STRING,
"language": STRING,
"metadataId": UUID (STRING),
"sentences": [
{
"sentence": STRING,
@@ -82,7 +85,7 @@ The `/entitymentions/all` endpoint is a <span style="color:lightgreen">**GET**</
"label": STRING,
"startIndex": INT,
"endIndex": INT,
"iri": STRING
"iri": STRING?
}
]
}
@@ -100,6 +103,7 @@ Here is an example of an output from the endpoint when getting all articles. For
{
"fileName": "test.txt",
"language": "en",
"metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d",
"sentences": [
{
"sentence": "Hi my name is marc",
@@ -121,6 +125,7 @@ Here is an example of an output from the endpoint when getting all articles. For
{
"fileName": "test2.txt",
"language": "en",
"metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666c",
"sentences": [
{
"sentence": "Hi my name is joe",
4 changes: 4 additions & 0 deletions docs/img/KNOX_component_diagram-B.drawio.svg
33 changes: 33 additions & 0 deletions docs/our-part-of-the-pipeline.md
@@ -0,0 +1,33 @@
# Our part of the pipeline
### (also available at: [http://wiki.knox.aau.dk/en/entity-extraction](http://wiki.knox.aau.dk/en/entity-extraction))

Our part of the pipeline is concerned with Entity Recognition and Entity Linking. This solution uses the [spaCy](https://spacy.io/) library for Entity Recognition and the [FuzzyWuzzy](https://pypi.org/project/fuzzywuzzy/) library for Entity Linking.
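
As a rough illustration of the linking idea, the sketch below matches a recognized mention against a small list of known entities by string similarity. It uses Python's standard-library `difflib` as a stand-in for FuzzyWuzzy so that it runs without extra dependencies. `KNOWN_ENTITIES`, `similarity`, `link_entity`, and the threshold of 80 are hypothetical and not taken from this repository.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical knowledge-base entries; the real pipeline gets candidates elsewhere.
KNOWN_ENTITIES = ["Barack_Obama", "Michelle_Obama", "Region_North_Jutland"]


def normalise(text: str) -> str:
    """Lower-case the text and treat underscores as spaces."""
    return text.lower().replace("_", " ")


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score between two names."""
    return round(100 * SequenceMatcher(None, normalise(a), normalise(b)).ratio())


def link_entity(mention: str, threshold: int = 80) -> Optional[str]:
    """Return the best-matching known entity, or None if no score reaches the threshold."""
    best_score, best_candidate = max(
        (similarity(mention, candidate), candidate) for candidate in KNOWN_ENTITIES
    )
    return best_candidate if best_score >= threshold else None
```

With this list, `link_entity("Barrack Obama")` picks `Barack_Obama` despite the misspelling, while an unrelated mention such as `"Copenhagen"` returns `None`.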

> The following sections describe this pipeline *in order*, starting with a visual overview.
## Overview
![](img/KNOX_component_diagram-B.drawio.svg)

## How to get started
See the [Getting started](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/gettingstarted.md) guide.

## The input that the solution takes
See the [input](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-input.md) explanation.

## Entity Recognition
Check out the [Entity Recognition documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md).

## Entity Linking
Check out the [Entity Linker documentation](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md).

## The output it produces
See the [output](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-output.md) explanation.


## Other components
- The [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/DirectoryWatcher.md)
- The [Database](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/database.md)
- The [APIs](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/api.md)
- The [Language Detector](https://pypi.org/project/langdetect/)

## Future work
28 changes: 28 additions & 0 deletions docs/our-part-of-the-pipeline/pipeline-input.md
@@ -0,0 +1,28 @@
# Pipeline input
The pipeline starts when the [DirectoryWatcher](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/lib/DirectoryWatcher.py) detects a new file (article) in a watched directory. This new file is produced by **pipeline A**.

## Example input data
```txt
Since the sudden exit of the controversial CEO Martin Kjær last week,
both he and the executive board in Region North Jutland
have been in hiding.
```
> some/article.txt
## Preprocessing the input
Before the Entity Recognizer can use the input, it must be preprocessed. This entails removing newlines and adding punctuation where needed.

### Example preprocessed input data
```txt
Since the sudden exit of the controversial CEO Martin Kjær last week,
both he and the executive board in Region North Jutland. have been in hiding.
```
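
The preprocessing step described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `preprocess` and the set of characters treated as punctuation are hypothetical, and the actual implementation in this repository may apply different rules.

```python
def preprocess(text: str) -> str:
    """Join the lines of an article, ending each unpunctuated line with a period."""
    # Assumed set of characters that count as "already punctuated".
    punctuation = (".", ",", "!", "?", ":", ";")
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if not line.endswith(punctuation):
            line += "."  # add punctuation where the line had none
        parts.append(line)
    # Replace the removed newlines with single spaces.
    return " ".join(parts)
```

Applied to the example input, this keeps the comma after "last week" and adds a period after "Region North Jutland", because that line ends without punctuation.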

-----------
<div style="text-align: right">
Up next:
<br>
<a href="https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md">Entity Recognition</a>
<span class="pagination_icon__3ocd0"><svg class="with-icon_icon__MHUeb" data-testid="geist-icon" fill="none" height="24" shape-rendering="geometricPrecision" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" viewBox="0 0 24 24" width="24" style="color:currentColor;width:11px;height:11px"><path d="M9 18l6-6-6-6"></path></svg></span>
</div>
63 changes: 63 additions & 0 deletions docs/our-part-of-the-pipeline/pipeline-output.md
@@ -0,0 +1,63 @@
# Pipeline output
The pipeline output is a [JSON](https://en.wikipedia.org/wiki/JSON) structure containing the entity mentions and links for a given article.

## The [JSON](https://en.wikipedia.org/wiki/JSON) output
```JSON
{
"fileName": STRING,
"language": STRING,
"metadataId": UUID (STRING),
"sentences": [
{
"sentence": STRING,
"sentenceStartIndex": INT,
"sentenceEndIndex": INT,
"entityMentions": [
{
"name": STRING,
"type": STRING,
"label": STRING,
"startIndex": INT,
"endIndex": INT,
"iri": STRING?
}
]
}
]
}
```
Here we see that a file (article) has a language (detected by the [Language Detector](https://pypi.org/project/langdetect/)), a metadataId (forwarded by **pipeline A**), and a list of sentences, each of which contains a list of entity mentions.
> _**NOTE**_: The `iri` property can be null
## Example [JSON](https://en.wikipedia.org/wiki/JSON) output
```JSON
{
"language": "en",
"metadataId": "790261e8-b8ec-4801-9cbd-00263bcc666d",
"sentences": [
{
"sentence": "Barrack Obama was married to Michelle Obama two days ago.",
"sentenceStartIndex": 20,
"sentenceEndIndex": 62,
"entityMentions":
[
{ "name": "Barrack Obama", "type": "Entity", "label": "PERSON", "startIndex": 0, "endIndex": 12, "iri": "knox-kb01.srv.aau.dk/Barack_Obama" },
{ "name": "Michelle Obama", "type": "Entity", "label": "PERSON", "startIndex": 59, "endIndex": 73, "iri": "knox-kb01.srv.aau.dk/Michele_Obama" },
{ "name": "two days ago", "type": "Literal", "label": "DATE", "startIndex": 74, "endIndex": 86, "iri": null }
]
}
]
}
```
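
A consumer of this output might walk the structure as in the sketch below. The helper name `linked_mentions` is hypothetical; the field names (`sentences`, `entityMentions`, `name`, `iri`) are the ones from the structure above.

```python
import json
from typing import List, Optional, Tuple


def linked_mentions(document_json: str) -> List[Tuple[str, Optional[str]]]:
    """Return (name, iri) pairs for every entity mention in one output document."""
    document = json.loads(document_json)
    return [
        # "iri" may be null in the JSON, which json.loads turns into None.
        (mention["name"], mention.get("iri"))
        for sentence in document["sentences"]
        for mention in sentence["entityMentions"]
    ]
```

Mentions whose second element is `None` are the ones that have no link, such as the `DATE` literal in the example.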

## Sending the [JSON](https://en.wikipedia.org/wiki/JSON) output to pipeline C
Lastly, the [JSON](https://en.wikipedia.org/wiki/JSON) output is sent to **pipeline C** using a `POST` request. See [the code](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/e442dc496002b788d30f996cdfc87d36f5bcaa35/main.py#L32) for implementation details.
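
For readers who only want the general shape of such a request, here is a minimal sketch using Python's standard-library `urllib`. It is not the code linked above: the URL `PIPELINE_C_URL` and the helper `build_post_request` are hypothetical, and the project may use a different HTTP client.

```python
import json
import urllib.request

# Hypothetical endpoint; the real pipeline C address is defined by the project.
PIPELINE_C_URL = "http://localhost:8000/pipeline-c"


def build_post_request(output: dict, url: str = PIPELINE_C_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the pipeline output as JSON."""
    body = json.dumps(output).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending is a separate step, for example:
# with urllib.request.urlopen(build_post_request(output)) as response:
#     print(response.status)
```

Keeping request construction separate from sending makes the payload easy to inspect or test without a running pipeline C.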

-----------
<div style="text-align: left">
Go back to:
<br>
<span class="pagination_icon__3ocd0"><svg class="with-icon_icon__MHUeb" data-testid="geist-icon" fill="none" height="24" shape-rendering="geometricPrecision" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" viewBox="0 0 24 24" width="24" style="color: currentcolor; width: 11px; height: 11px;"><path d="M15 18l-6-6 6-6"></path></svg></span>
<a href="https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md">Entity Linker</a>

</div>
