104 entity linker #105

Merged 2 commits on Dec 14, 2023
35 changes: 26 additions & 9 deletions components/EntityLinker.py
@@ -3,29 +3,46 @@
from components import Db
from lib.EntityLinked import EntityLinked
from lib.Entity import Entity
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

async def entitylinkerFunc(entities: List[Entity], db_path: str, threshold:int=80):

async def entitylinkerFunc(
entities: List[Entity], db_path: str, threshold: int = 80
):
iri_dict = {}
linked_entities = []

for entity in entities:
if entity.type == "Literal":
linked_entities.append(EntityLinked(entity, ""))
linked_entities.append(EntityLinked(entity, ""))
continue

# Use the Read function to get all entities starting with the same name
potential_matches = await Db.Read(
db_path, "EntityIndex", searchPred=entity.name
)

if potential_matches:
names_only = [match[1] for match in potential_matches]
# Sort the potential matches by length difference and select the first one
best_candidate_name = min(
names_only,
key=lambda x: abs(len(x[0]) - len(entity.name)),

# Use fuzzy matching to find the best candidate
best_candidate_name, similarity = process.extractOne(
entity.name, names_only
)
iri = best_candidate_name.replace(" ", "_")
iri_dict[entity] = EntityLinked(entity, iri)

# Check if the similarity is above the threshold
if similarity >= threshold:
iri = best_candidate_name.replace(" ", "_")
iri_dict[entity] = EntityLinked(entity, iri)
else:
# If no match above the threshold, add to the result and update the database
iri = entity.name.replace(" ", "_")
iri_dict[entity] = EntityLinked(entity, iri)
await Db.Insert(
db_path,
"EntityIndex",
queryInformation={"entity": entity.name},
)
else:
# If not found in the database, add to the result and update the database
iri = entity.name.replace(" ", "_")
81 changes: 81 additions & 0 deletions docs/entitylinker.md
@@ -0,0 +1,81 @@
# Entity Linking

Entity linking in the KNOX project uses a string comparison algorithm to find the closest matching entity.

## How it is linked

Entities are linked to each other through IRIs. Each entity is given a unique IRI. Whenever an entity is identified as being the same as another entity, it is linked to that entity's IRI.

## Comparison Algorithm

Currently, KNOX uses the FuzzyWuzzy library for Python to determine candidates to link an entity to. FuzzyWuzzy is built upon the Levenshtein distance, which counts how many single-character modifications (insertions, deletions, or substitutions) are needed to change one string into another. The fewer modifications needed to make the two strings equal, the closer they are. Using FuzzyWuzzy, we naively determine which entities to link to. It is therefore not the optimal solution, and it should be changed later on.
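
To make the idea of edit distance concrete, here is a minimal sketch of the Levenshtein distance in plain Python. It is for illustration only and is not the code FuzzyWuzzy runs internally; the function name `levenshtein` is hypothetical and not part of the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep if equal)
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("Entity1", "Entity 1"))  # 1 (one inserted space)
print(levenshtein("kitten", "sitting"))    # 3
```

A small distance relative to the string length corresponds to a high similarity score, which is what the `threshold` parameter is compared against.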

## Performing entity linking on an input

```PYTHON
async def entitylinkerFunc(
    entities: List[Entity], db_path: str, threshold: int = 80
):
    iri_dict = {}
    linked_entities = []

    for entity in entities:
        if entity.type == "Literal":
            linked_entities.append(EntityLinked(entity, ""))
            continue

        # Use the Read function to get all entities starting with the same name
        potential_matches = await Db.Read(
            db_path, "EntityIndex", searchPred=entity.name
        )

        if potential_matches:
            names_only = [match[1] for match in potential_matches]

            # Use fuzzy matching to find the best candidate
            best_candidate_name, similarity = process.extractOne(
                entity.name, names_only
            )

            # Check if the similarity is at or above the threshold
            if similarity >= threshold:
                iri = best_candidate_name.replace(" ", "_")
                iri_dict[entity] = EntityLinked(entity, iri)
            else:
                # If no match above the threshold, add to the result and update the database
                iri = entity.name.replace(" ", "_")
                iri_dict[entity] = EntityLinked(entity, iri)
                await Db.Insert(
                    db_path,
                    "EntityIndex",
                    queryInformation={"entity": entity.name},
                )
        else:
            # If not found in the database, add to the result and update the database
            iri = entity.name.replace(" ", "_")
            iri_dict[entity] = EntityLinked(entity, iri)
            await Db.Insert(
                db_path,
                "EntityIndex",
                queryInformation={"entity": entity.name},
            )

    # Convert the result to an array of EntityLinked
    for linked_entity in iri_dict.values():
        linked_entities.append(linked_entity)

    return linked_entities
```

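Entity linking is performed using the above function. The function takes a list of entities found in a new article processed in the KNOX pipeline. It iterates through all found entities and sets aside those of type `Literal`: these are not linked and are added to the result with an empty IRI.

For every other entity, a list of potential matches is read from the database. These are all stored entities whose names start with the name of the entity to be linked.

FuzzyWuzzy is then used to find the best candidate. If the candidate's similarity score is at least `threshold` (80 by default), the entity is linked to the candidate's IRI. Otherwise, or if the database returned no potential matches, the entity gets a new IRI built from its own name and is inserted into the database.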
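
The decision step can be sketched without a database or FuzzyWuzzy. The example below is an illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for `process.extractOne`, so its scores will differ from FuzzyWuzzy's, and `similarity` and `choose_iri` are hypothetical helpers that do not exist in the project.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> int:
    """Score two strings from 0 to 100 (assumed stand-in for a FuzzyWuzzy score)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def choose_iri(name: str, candidates: List[str], threshold: int = 80) -> str:
    """Return the IRI of the best candidate, or a new IRI built from name."""
    if candidates:
        best = max(candidates, key=lambda c: similarity(name, c))
        if similarity(name, best) >= threshold:
            return best.replace(" ", "_")
    # No candidates, or none similar enough: the entity gets its own IRI.
    return name.replace(" ", "_")


print(choose_iri("Entity 1", ["Entity 1", "Entity 12"]))  # Entity_1
print(choose_iri("newEntity3", ["Entity1", "Entity2"]))   # newEntity3
print(choose_iri("Aalborg", []))                          # Aalborg
```

In `entitylinkerFunc` itself, the two fallback cases also insert the new name into the `EntityIndex` table; that side effect is left out of this sketch.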

<div style="text-align: right">
Up next:
<br>
<a href="https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/our-part-of-the-pipeline/pipeline-output.md">Pipeline Output</a>
<span class="pagination_icon__3ocd0"><svg class="with-icon_icon__MHUeb" data-testid="geist-icon" fill="none" height="24" shape-rendering="geometricPrecision" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" viewBox="0 0 24 24" width="24" style="color:currentColor;width:11px;height:11px"><path d="M9 18l6-6-6-6"></path></svg></span>
</div>
16 changes: 11 additions & 5 deletions docs/entityrecognition.md
@@ -1,25 +1,30 @@
# Entity Recognition

Entity recognition is performed using Danish and English pre-trained models published by SpaCy.

## Model Links

- Danish model: [https://spacy.io/models/da#da_core_news_lg](https://spacy.io/models/da#da_core_news_lg)
- English model: [https://spacy.io/models/en#en_core_web_lg](https://spacy.io/models/en#en_core_web_lg)

## Custom Danish Model

The Danish model has been trained on top of the Danish pre-trained SpaCy model to improve its accuracy and to let it recognize literals. See [PyPI Repository](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/pypi.md) for more information on where to find the custom model.

## Loading a SpaCy Model

```python
import en_core_web_lg
import da_core_news_lg
import da_core_news_knox_lg

nlp_en = en_core_web_lg.load()
nlp_da = da_core_news_lg.load()
nlp_da = da_core_news_knox_lg.load()
```

> Full code available [here](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/5fcd59bac0fbd91b2543d7d78a893f16da49f25f/components/GetSpacyData.py#L17#L18).

## Performing Entity Recognition on Input

The entity recognition is performed using either the `nlp_en` or `nlp_da` variable defined in [Loading a SpaCy Model](https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entityrecognition.md#loading-a-spacy-model).

```python
@@ -37,10 +42,11 @@ def GetTokens(text: str):

The return type of this function is a [Doc](https://spacy.io/api/doc) containing information such as each entity's start and end index, the sentence the entity belongs to, and so on.

-----------
---

<div style="text-align: right">
Up next:
<br>
<a href="https://github.com/Knox-AAU/PreProcessingLayer_EntityRecognitionAndLinking/blob/main/docs/entitylinker.md">Entity Linker</a>
<span class="pagination_icon__3ocd0"><svg class="with-icon_icon__MHUeb" data-testid="geist-icon" fill="none" height="24" shape-rendering="geometricPrecision" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" viewBox="0 0 24 24" width="24" style="color:currentColor;width:11px;height:11px"><path d="M9 18l6-6-6-6"></path></svg></span>
</div>
</div>
16 changes: 10 additions & 6 deletions tests/unit/test_EntityLinker.py
@@ -17,7 +17,7 @@ async def mock_read(db_path, table, searchPred):
return [("1", "Entity1"), ("2", "Entity2")]
return []

async def mock_insert(db_path, table, entity_name):
async def mock_insert(db_path, table, queryInformation):
return None

# Patch the Db.Read and Db.Insert functions with the mock functions
@@ -28,7 +28,9 @@ async def mock_insert(db_path, table, entity_name):
# Create some Entity instances
entMentions = [
Entity("Entity1", 0, 6, "Sentence1", 0, 9, "PERSON", "Entity"),
Entity("newEntity3", 0, 6, "Sentence2", 0, 9, "PERSON", "Entity"),
Entity(
"newEntity3", 0, 6, "Sentence2", 0, 9, "PERSON", "Entity"
),
]

# Call the entitylinkerFunc
@@ -40,7 +42,7 @@ async def mock_insert(db_path, table, entity_name):
assert entLinks[0].iri == "Entity1"

# Ensure the second mention creates a new entity
assert entLinks[1].iri == "Entity1"
assert entLinks[1].iri == "newEntity3"


# Define a test case with a mock database and Entity instances
@@ -52,7 +54,7 @@ async def mock_read(db_path, table, searchPred):
return [("1", "Entity1")]
return []

async def mock_insert(db_path, table, entity_name):
async def mock_insert(db_path, table, queryInformation):
return None

# Patch the Db.Read and Db.Insert functions with the mock functions
@@ -84,7 +86,7 @@ async def mock_read(db_path, table, searchPred):
return [("1", "Entity 1")]
return []

async def mock_insert(db_path, table, entity_name):
async def mock_insert(db_path, table, queryInformation):
return None

# Patch the Db.Read and Db.Insert functions with the mock functions
@@ -167,7 +169,9 @@ async def mock_insert(db_path, table, queryInformation):
}

# Call the entitylinkerFunc
entLinks = await entitylinkerFunc(TestingDataset["test"], db_path="DB_PATH")
entLinks = await entitylinkerFunc(
TestingDataset["test"], db_path="DB_PATH"
)
for index, link in enumerate(entLinks):
assert link.name == TestingDataset["GoldStandardNames"][index]
assert link.iri == TestingDataset["GoldStandardIRIs"][index]