presidio-structured misidentifies email as URL #1316

Open
ardband opened this issue Feb 28, 2024 · 2 comments

ardband commented Feb 28, 2024

Presidio-structured incorrectly identifies an email address as a URL within the extracted entities. This can be observed in the following example output:

StructuredAnalysis(entity_mapping={'name': 'PERSON', 'email': 'URL', 'city': 'LOCATION', 'state': 'LOCATION'})

The value in the "email" column ("john.doe@example.com") is labeled "URL" instead of "EMAIL_ADDRESS" during entity extraction.

@miltonsim
Contributor

I've also encountered this issue.

The issue mainly stems from the _find_most_common_entity() method, which picks the entity with the highest count. For the email addresses in test_structured.csv, URL detections outnumber EMAIL_ADDRESS detections, even though the URL detections have low confidence.

Observed behavior:

  • Entity Count: {'URL': 6, 'EMAIL_ADDRESS': 3}
  • Confidence Scores: {'EMAIL_ADDRESS': [1.0, 1.0, 1.0], 'URL': [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]}

The emails are recognised correctly, but the URL detections outnumber them and win the count despite their lower confidence.

I would like to suggest two potential improvements:

  1. Adapting _find_most_common_entity() to Consider Confidence Scores: It might be beneficial to adjust the method to account for the actual confidence scores provided by the recognizer results.
  2. Enhancing the URL Recognizer: Improving the recognizer's ability to differentiate between URLs and email addresses could help reduce this type of misidentification.
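
To illustrate the first suggestion, here is a minimal sketch that contrasts count-based selection with selection by mean confidence, using the counts and scores observed above. The function names are hypothetical and this is not the actual body of _find_most_common_entity(); the input is assumed to be a flat list of (entity_type, score) pairs.

```python
from collections import Counter
from statistics import mean

# (entity_type, score) pairs mirroring the observed counts and confidence scores.
results = [("URL", 0.5)] * 6 + [("EMAIL_ADDRESS", 1.0)] * 3


def most_common_entity(results):
    """Pick the entity type that appears most often (count only)."""
    counts = Counter(entity for entity, _ in results)
    return counts.most_common(1)[0][0]


def highest_mean_confidence_entity(results):
    """Pick the entity type whose detections have the highest mean score."""
    scores = {}
    for entity, score in results:
        scores.setdefault(entity, []).append(score)
    return max(scores, key=lambda entity: mean(scores[entity]))


by_count = most_common_entity(results)
by_confidence = highest_mean_confidence_entity(results)
```

With this data, by_count is "URL" and by_confidence is "EMAIL_ADDRESS". Note that simply summing the scores would not separate the two here (6 × 0.5 and 3 × 1.0 are both 3.0), which is why the sketch uses the mean.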

I'm keen to contribute to making these improvements and would love to work on refining the logic. Any thoughts or feedback on these suggestions would be greatly appreciated!

@omri374
Contributor

omri374 commented Feb 28, 2024

Thanks for the feedback! The URL recognizer detects parts of emails as well (e.g. microsoft.com is a URL inside john.doe@microsoft.com), which makes it detect more URLs than emails.
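
One possible way to handle this overlap, sketched under assumptions: drop any URL detection whose span lies entirely inside an EMAIL_ADDRESS span before counting. The tuple layout (entity_type, start, end, score) and the helper name are hypothetical and only stand in for recognizer results.

```python
def drop_urls_inside_emails(spans):
    """Remove URL spans that are fully contained in an EMAIL_ADDRESS span.

    Each span is a hypothetical (entity_type, start, end, score) tuple.
    """
    email_ranges = [(start, end) for kind, start, end, _ in spans
                    if kind == "EMAIL_ADDRESS"]
    kept = []
    for span in spans:
        kind, start, end, _ = span
        inside_email = kind == "URL" and any(
            e_start <= start and end <= e_end for e_start, e_end in email_ranges
        )
        if not inside_email:
            kept.append(span)
    return kept


text = "john.doe@microsoft.com"
spans = [
    ("EMAIL_ADDRESS", 0, len(text), 1.0),
    # The domain part, detected separately as a URL.
    ("URL", text.index("microsoft.com"), len(text), 0.5),
]
filtered = drop_urls_inside_emails(spans)
```

After filtering, only the EMAIL_ADDRESS span remains, so the nested URL no longer contributes to the count.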

I think that a good way forward here would be to allow the user to decide on a strategy for the entity selected. In some cases, we would want the entity with the majority of cases, in others we'd like the one that has the highest confidence, and in others we might want a mix of the two (e.g. most common entity, if confidence > 0.5).
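
A rough sketch of what such a user-selectable strategy could look like. The function name, the strategy names, and the threshold parameter are all assumptions for illustration, not an existing presidio-structured API; the input is assumed to be (entity_type, score) pairs.

```python
from collections import defaultdict
from statistics import mean

STRATEGIES = ("most_common", "highest_confidence", "mixed")


def select_entity(results, strategy="most_common", threshold=0.5):
    """Choose one entity type for a column from (entity_type, score) pairs."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    if strategy == "mixed":
        # Keep only detections strictly above the confidence threshold,
        # then fall through to the most-common vote.
        results = [(entity, score) for entity, score in results if score > threshold]

    grouped = defaultdict(list)
    for entity, score in results:
        grouped[entity].append(score)
    if not grouped:
        return None  # nothing detected, or nothing passed the threshold

    if strategy == "highest_confidence":
        return max(grouped, key=lambda entity: mean(grouped[entity]))
    return max(grouped, key=lambda entity: len(grouped[entity]))


sample = [("URL", 0.5)] * 6 + [("EMAIL_ADDRESS", 1.0)] * 3
picks = {name: select_entity(sample, strategy=name) for name in STRATEGIES}
```

On the scores reported earlier in this thread, "most_common" yields "URL", while "highest_confidence" and "mixed" both yield "EMAIL_ADDRESS", because the URL detections at 0.5 do not exceed the 0.5 threshold.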

A quick fix could be to update the structured analysis once finalized, in case the column's name is "email" but the detection is actually "URL".
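
As a minimal sketch of that quick fix, assuming the entity mapping is available as a plain dict like the one in the original report (the helper name is hypothetical):

```python
def patch_email_columns(entity_mapping):
    """Return a copy in which email-named columns detected as URL become EMAIL_ADDRESS."""
    patched = dict(entity_mapping)
    for column, entity in entity_mapping.items():
        if "email" in column.lower() and entity == "URL":
            patched[column] = "EMAIL_ADDRESS"
    return patched


# Mapping taken from the StructuredAnalysis output in the report.
mapping = {"name": "PERSON", "email": "URL", "city": "LOCATION", "state": "LOCATION"}
patched = patch_email_columns(mapping)
```

This only rewrites the mapping after analysis and leaves the original dict untouched; it relies on the column name, so it would miss email columns with unrelated names.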

If you're interested in creating a PR, I'd be happy to review it and discuss.
