
Add more information to outputs (add instance indices for error types) #72

Merged 12 commits into MantisAI:main on Mar 2, 2024

Conversation

jackboyla
Contributor

Hi 👋 I would like nervaluate's results to show me the examples where my NER model makes mistakes, i.e. false positives/false negatives. This functionality has already been mentioned in issue #68.

I have made some simple additions to evaluate.py, but they will break a lot of tests. I'm happy to adapt them, but I first want to check whether we are happy with this form of additional information before I continue 😄

Here's an example of how the changes affect the output:

from nervaluate import Evaluator


def test_evaluator_simple_case():
    true = [
        [{"label": "PER", "start": 2, "end": 4}],
        [
            {"label": "LOC", "start": 1, "end": 2},
        ],
        [
            {"label": "LOC", "start": 1, "end": 2},
            {"label": "LOC", "start": 3, "end": 4}, # missed
        ],
        [
            {"label": "PER", "start": 27, "end": 29},
        ],
        [
            {"label": "LOC", "start": 4, "end": 7},
        ],
    ]
    pred = [
        [{"label": "PER", "start": 2, "end": 4}],
        [
            {"label": "LOC", "start": 1, "end": 2},
            {"label": "PER", "start": 13, "end": 14}, # false positive (spurious)
        ],
        [
            {"label": "LOC", "start": 1, "end": 2},
        ],
        [
            {"label": "PER", "start": 28, "end": 31}, # partial
        ],
        [
            {"label": "LOC", "start": 4, "end": 7},
            {"label": "LOC", "start": 24, "end": 26}, # another false positive (spurious)
        ],
    ]
    evaluator = Evaluator(true, pred, tags=["LOC", "PER"])
    results, results_agg = evaluator.evaluate()

    return results, results_agg

Our overall results will look like:

{'ent_type': {'correct': 5,
  'incorrect': 0,
  'incorrect_indices': [],
  'partial': 0,
  'partial_indices': [],
  'missed': 1,
  'missed_indices': [2],
  'spurious': 2,
  'spurious_indices': [1, 4],
  'possible': 6,
  'actual': 7,
  'precision': 0.7142857142857143,
  'recall': 0.8333333333333334,
  'f1': 0.7692307692307692},
 'partial': {'correct': 4,
  'incorrect': 0,
  'incorrect_indices': [],
  'partial': 1,
  'partial_indices': [3],
  'missed': 1,
  'missed_indices': [2],
  'spurious': 2,
  'spurious_indices': [1, 4],
  'possible': 6,
  'actual': 7,
  'precision': 0.6428571428571429,
  'recall': 0.75,
  'f1': 0.6923076923076924},
 'strict': {'correct': 4,
  'incorrect': 1,
  'incorrect_indices': [3],
  'partial': 0,
  'partial_indices': [],
  'missed': 1,
  'missed_indices': [2],
  'spurious': 2,
  'spurious_indices': [1, 4],
  'possible': 6,
  'actual': 7,
  'precision': 0.5714285714285714,
  'recall': 0.6666666666666666,
  'f1': 0.6153846153846153},
 'exact': {'correct': 4,
  'incorrect': 1,
  'incorrect_indices': [3],
  'partial': 0,
  'partial_indices': [],
  'missed': 1,
  'missed_indices': [2],
  'spurious': 2,
  'spurious_indices': [1, 4],
  'possible': 6,
  'actual': 7,
  'precision': 0.5714285714285714,
  'recall': 0.6666666666666666,
  'f1': 0.6153846153846153}}

The relevant indices will also be added to the per-tag output.
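
For example, continuing the test above (a rough sketch; this assumes the per-tag dictionary keeps its current layout, keyed by tag and then by evaluation schema):

results, results_agg = test_evaluator_simple_case()

print(results_agg["LOC"]["strict"]["missed_indices"])     # expected: [2], the missed LOC in instance 2
print(results_agg["PER"]["strict"]["incorrect_indices"])  # expected: [3], the boundary error in instance 3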

@davidsbatista
Collaborator

Hi @jackboyla, and thanks a lot for your collaboration.

Please don't use self.results to store the indices; define another structure. Keeping this in a separate structure keeps things clean, and it might also reduce the number of failing tests. Maybe you can follow the same philosophy as self.results, i.e., define a base dictionary and then use deepcopy.
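
Something like this, just as a sketch of the pattern (placeholder names, not the actual evaluate.py code):

from copy import deepcopy

# One empty template for the index bookkeeping, kept separate from self.results.
base_indices = {
    "correct_indices": [],
    "incorrect_indices": [],
    "partial_indices": [],
    "missed_indices": [],
    "spurious_indices": [],
}

# Deep-copied once per evaluation schema so the lists stay independent.
evaluation_indices = {
    schema: deepcopy(base_indices)
    for schema in ("strict", "ent_type", "partial", "exact")
}

evaluation_indices["strict"]["missed_indices"].append(2)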

Going back to the original issue:

Is there a way to find out for which instance during evaluation was marked under 'correct' or 'incorrect' or 'spurious', etc for a particular evaluation schema?

I see that the question was to have a way to find which instances were marked under correct, incorrect, spurious, etc.

Maybe it's also useful to have a function that prints a nice output to the console with those entities: just the indices by default, but optionally the surface strings themselves if the text is given.
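
Something along these lines, purely as a sketch (hypothetical names, not actual code):

def print_error_report(schema_indices, docs=None):
    # schema_indices: e.g. {"missed_indices": [2], "spurious_indices": [1, 4]}
    # docs: optional list of the original texts, used to also print the text of each instance.
    for category, indices in schema_indices.items():
        print(category)
        if not indices:
            print("  - None")
        for i in indices:
            suffix = f": {docs[i]}" if docs is not None else ""
            print(f"  - instance {i}{suffix}")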

What do you think @ivyleavedtoadflax ?

By the way, a belated happy New Year to you all :)

@davidsbatista
Collaborator

davidsbatista commented Jan 16, 2024

Hi @jackboyla, thanks for your updates - I've left a few comments, mostly regarding variable names. I fear that this code is starting to become a bit spaghetti-like; that's the reason I'm so strict.

By the way, did you run the code quality checks locally?

@jackboyla
Contributor Author

Hey @davidsbatista, thanks for the feedback! I appreciate the strictness; it's in danger of getting messy 😅 I will run the code quality checks now. I just had one more question:

The current implementation will provide output like:

{'strict': {'correct_indices': [0, 1, 1, 2, 4],
  'incorrect_indices': [3],
  'partial_indices': [],
  'missed_indices': [2],
  'spurious_indices': [1, 4]},
 'ent_type': {'correct_indices': [0, 1, 1, 2, 4],
  'incorrect_indices': [],
  'partial_indices': [],
  'missed_indices': [2],
  'spurious_indices': [1, 4]},
 'partial': {'correct_indices': [0, 1, 1, 2, 4],
  'incorrect_indices': [],
  'partial_indices': [3],
  'missed_indices': [2],
  'spurious_indices': [1, 4]},
 'exact': {'correct_indices': [0, 1, 1, 2, 4],
  'incorrect_indices': [3],
  'partial_indices': [],
  'missed_indices': [2],
  'spurious_indices': [1, 4]}}

Each value in an index list is the index of the instance in which an entity was predicted correctly or erroneously.

You can see that for 'correct_indices': [0, 1, 1, 2, 4], the instance at index 1 had two correct predictions (hence 1 appears twice), but the output does not specify which predictions within the instance were correct.

I can add this information so each element is a tuple (instance_index, entity_within_instance_index), but I'm afraid this will make the output too verbose:

{'strict': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [(3, 0)],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'ent_type': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'partial': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [],
  'partial_indices': [(3, 0)],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'exact': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [(3, 0)],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]}}

Do you think this addition would be valuable?

@davidsbatista
Collaborator

You can see for 'correct_indices': [0, 1, 1, 2, 4], the instance at index 1 had two correct predictions (hence 1 appears twice), but it will not specify which predictions within the instance were correct.

Sorry, I am a bit confused about how this can happen. How can a predicted entity be correct twice under the same evaluation scenario? Maybe you can show me an example? As mentioned earlier, I haven't looked at or touched the code in a while.

I thought that the index represents each instance in the prediction list, as shown below:

pred = [
        [{"label": "PER", "start": 2, "end": 4}],
        [
            {"label": "LOC", "start": 1, "end": 2},
            {"label": "PER", "start": 13, "end": 14}, # false positive (spurious)
        ],
        [
            {"label": "LOC", "start": 1, "end": 2},
        ],
        [
            {"label": "PER", "start": 28, "end": 31}, # partial
        ],
        [
            {"label": "LOC", "start": 4, "end": 7},
            {"label": "LOC", "start": 24, "end": 26}, # another false positive (spurious)
        ],
    ]

I can add this information so each element is a tuple (instance_index, entity_within_instance_index), but I'm afraid this will make the output too verbose:

Regarding this suggestion, I think it starts to become really convoluted. The end user just wants a nice report: in this case, the entities themselves, or the offsets of the entities in the document that were incorrect under each scenario.

Please see the two functions below:

  • summary_report_ent()
  • summary_report_overall()

If you run those functions, you will see a nice report; see the examples folder for an example that calls these functions.

We can use the format that you proposed as an intermediate representation, but I don't think it should ever be exposed to the user. You can rely on that intermediate representation to output a nice report to the console or to a file.

Sorry if this seems like more work than you had initially thought, but I think it's better if we tackle this in a structured and clean way to avoid adding more and more spaghetti code.

@jackboyla
Contributor Author

jackboyla commented Jan 17, 2024

Thanks for taking the time to give this feedback 😄 I think I didn't make it clear in my last comment. Here's an example:

We have predictions for 3 separate instances:

pred = [
        [{"label": "PER", "start": 2, "end": 4}], # correct
        [
            {"label": "LOC", "start": 1, "end": 2}, # correct
            {"label": "LOC", "start": 4, "end": 5}, # false positive (spurious)
            {"label": "PER", "start": 13, "end": 14}, # false positive (spurious)
        ],
       [{"label": "PER", "start": 7, "end": 9}], # false positive (spurious)
    ]

We see that for the instance at index 1, there are two spurious predictions for two separate entities -- at indices 1 and 2 inside that instance. So the current implementation records 2 errors under spurious. Additionally, instance 2 contains one prediction that is spurious:

{'strict': {...,
  ...,
  'spurious_indices': [1, 1, 2]},
...
}

This tells the user that 2 spurious errors have been recorded in instance 1 (under the strict eval schema). Alternatively, we can just return a set() of these indices, but then the user doesn't know how many spurious errors are present in instance 1.
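
(As a side note, keeping the duplicates still lets a user recover per-instance error counts on their side; a minimal sketch:)

from collections import Counter

spurious_indices = [1, 1, 2]  # as returned under the 'strict' schema above
print(Counter(spurious_indices))  # Counter({1: 2, 2: 1}) -> two spurious errors in instance 1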

I then proposed:

I can add this information so each element is a tuple (instance_index, entity_within_instance_index), but I'm afraid this will make the output too verbose:

This would include the position of the predictions within the instance where the error occurred:

{'strict': {...,
  ...,
  'spurious_indices': [(1, 1), (1, 2), (2, 0)]},
...
}

With regard to the above, I do agree that it becomes too convoluted, and it would be impossible to print this nicely when dealing with many instances and entities.

Looking at it from the end user's point of view, they just want an output that shows them in which instances the NER model failed. If there is an error in an instance, the user can then look at the entire instance.

Regarding the printing, I can do it in the style of summary_report_ent and summary_report_overall, but it may not be very pretty: if there are many instances, the output will not fit on one line.

I hope this makes it a bit clearer what I'm trying to say 😃

@jackboyla
Contributor Author

Hi @davidsbatista, I've added print functions for both overall evaluation indices and per-entity indices -- summary_report_overall_indices and summary_report_ents_indices respectively. They differ from the other summary functions in that they only print the results for one given evaluation schema (exact, ent_type, etc.), as printing all schemas is very verbose.

Here's an example:

evaluation_indices = {'strict': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [(3, 0)],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'ent_type': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'partial': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [],
  'partial_indices': [(3, 0)],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'exact': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [(3, 0)],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]}}

You have the option to provide preds as an argument:

print(summary_report_overall_indices(evaluation_indices, 'partial', preds))

which will return:

Indices for error schema 'partial':

Correct indices:
  - Instance 0, Entity 0: Label=PER, Start=2, End=4
  - Instance 1, Entity 0: Label=LOC, Start=1, End=2
  - Instance 1, Entity 1: Label=LOC, Start=5, End=6
  - Instance 2, Entity 0: Label=LOC, Start=1, End=2
  - Instance 4, Entity 0: Label=LOC, Start=4, End=7

Incorrect indices:
  - None

Partial indices:
  - Instance 3, Entity 0: Label=PER, Start=28, End=31

Missed indices:
  - Instance 2, Entity 0: Label=LOC, Start=1, End=2

Spurious indices:
  - Instance 1, Entity 2: Label=PER, Start=13, End=14
  - Instance 4, Entity 1: Label=LOC, Start=24, End=26

or do not add preds, in which case the function returns:

Indices for error schema 'partial':

Correct indices:
  - Instance 0, Entity 0
  - Instance 1, Entity 0
  - Instance 1, Entity 1
  - Instance 2, Entity 0
  - Instance 4, Entity 0

Incorrect indices:
  - None

Partial indices:
  - Instance 3, Entity 0

Missed indices:
  - Instance 2, Entity 0

Spurious indices:
  - Instance 1, Entity 2
  - Instance 4, Entity 1

At the per-entity level, we can use:

print(summary_report_ents_indices(evaluation_agg_indices, 'partial', preds))

Again, the preds argument is optional. This will return:


Entity Type: LOC
  Error Schema: 'partial'
    (LOC) Correct indices:
      - Instance 1, Entity 0: Label=LOC, Start=1, End=2
      - Instance 1, Entity 1: Label=LOC, Start=5, End=6
      - Instance 2, Entity 0: Label=LOC, Start=1, End=2
      - Instance 4, Entity 0: Label=LOC, Start=4, End=7
    (LOC) Incorrect indices:
      - None
    (LOC) Partial indices:
      - None
    (LOC) Missed indices:
      - Instance 2, Entity 0: Label=LOC, Start=1, End=2
    (LOC) Spurious indices:
      - Instance 4, Entity 1: Label=LOC, Start=24, End=26

Entity Type: PER
  Error Schema: 'partial'
    (PER) Correct indices:
      - Instance 0, Entity 0: Label=PER, Start=2, End=4
    (PER) Incorrect indices:
      - None
    (PER) Partial indices:
      - Instance 3, Entity 0: Label=PER, Start=28, End=31
    (PER) Missed indices:
      - None
    (PER) Spurious indices:
      - Instance 1, Entity 2: Label=PER, Start=13, End=14

@davidsbatista
Collaborator

Hello @jackboyla and thanks once again for your efforts.

This seems to be in line with what I envision as a report summary. I will approve the workflow so that the code quality checks can run and we can start reviewing the code together.

@davidsbatista
Collaborator

@ivyleavedtoadflax it seems the coverage badge is giving us issues again. Do you have any idea why? Maybe we can disable it or try to find a replacement; what do you think?

@davidsbatista
Collaborator

@ivyleavedtoadflax shall I just open a PR to disable the coverage badge?

@davidsbatista
Collaborator

@ivyleavedtoadflax bump

@ivyleavedtoadflax
Collaborator

Hey, sorry for the slow reply. Let's go with the easy option of disabling it for now to unblock this PR. Are you ok to put in a PR? 🙏

@davidsbatista
Collaborator

@jackboyla merge the main branch into yours - I've removed the workflow that was causing this issue.

davidsbatista merged commit db0b257 into MantisAI:main on Mar 2, 2024
1 check passed
@davidsbatista
Collaborator

Done! :) @ivyleavedtoadflax, do you know if I have permissions to make a new release?

@jackboyla
Contributor Author

jackboyla commented Mar 2, 2024

Great stuff, thank you! 😄 I just realised I forgot to document in the README how this changes things. Evaluator.evaluate() now returns 4 variables instead of 2:

from nervaluate import Evaluator
true = [
    [{"label": "PER", "start": 2, "end": 4}],
    [{"label": "LOC", "start": 1, "end": 2},
     {"label": "LOC", "start": 3, "end": 4}]
]

pred = [
    [{"label": "PER", "start": 2, "end": 4}],
    [{"label": "LOC", "start": 1, "end": 2},
     {"label": "LOC", "start": 3, "end": 4},
     {"label": "LOC", "start": 12, "end": 14}]
]

evaluator = Evaluator(true, pred, tags=['LOC', 'PER'])

results, results_by_tag, result_indices, result_indices_by_tag = evaluator.evaluate()
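
The new index structures can also be inspected directly; for the example above (assuming the intermediate representation keeps the (instance_index, entity_within_instance_index) tuple format shown earlier in the thread):

print(result_indices["strict"]["spurious_indices"])  # e.g. [(1, 2)] -> the extra LOC at instance 1, entity 2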

Additionally, I wanted to include how users can pretty print this new info:

from nervaluate import summary_report_ents_indices
print(summary_report_ents_indices(result_indices_by_tag, error_schema='partial', preds=pred))

Entity Type: LOC
  Error Schema: 'partial'
    (LOC) Correct indices:
      - Instance 1, Entity 0: Label=LOC, Start=1, End=2
      - Instance 1, Entity 1: Label=LOC, Start=3, End=4
    (LOC) Incorrect indices:
      - None
    (LOC) Partial indices:
      - None
    (LOC) Missed indices:
      - None
    (LOC) Spurious indices:
      - Instance 1, Entity 2: Label=LOC, Start=12, End=14

Entity Type: PER
  Error Schema: 'partial'
    (PER) Correct indices:
      - Instance 0, Entity 0: Label=PER, Start=2, End=4
    (PER) Incorrect indices:
      - None
    (PER) Partial indices:
      - None
    (PER) Missed indices:
      - None
    (PER) Spurious indices:
      - None

@davidsbatista would it be possible for you to add this to the README so we don't have to revert the whole PR?

@davidsbatista
Collaborator

feel free to open another PR

@jackboyla
Contributor Author

ok cool #74
