
Add more information to outputs (add instance indices for error types) #72

Merged 12 commits into MantisAI:main on Mar 2, 2024

Conversation

jackboyla
Contributor

Hi 👋 I would like nervaluate's results to show me the examples where my NER model makes mistakes, i.e. false positives/false negatives. This functionality has already been mentioned in issue #68.

I have made some simple additions to evaluate.py, but they will break a lot of tests. I'm happy to adapt them, but I first want to check whether we are happy with this form of additional information before I continue 😄

Here's an example of how the changes affect the output:

from nervaluate import Evaluator


def test_evaluator_simple_case():
    true = [
        [{"label": "PER", "start": 2, "end": 4}],
        [
            {"label": "LOC", "start": 1, "end": 2},
        ],
        [
            {"label": "LOC", "start": 1, "end": 2},
            {"label": "LOC", "start": 3, "end": 4}, # missed
        ],
        [
            {"label": "PER", "start": 27, "end": 29},
        ],
        [
            {"label": "LOC", "start": 4, "end": 7},
        ],
    ]
    pred = [
        [{"label": "PER", "start": 2, "end": 4}],
        [
            {"label": "LOC", "start": 1, "end": 2},
            {"label": "PER", "start": 13, "end": 14}, # false positive (spurious)
        ],
        [
            {"label": "LOC", "start": 1, "end": 2},
        ],
        [
            {"label": "PER", "start": 28, "end": 31}, # partial
        ],
        [
            {"label": "LOC", "start": 4, "end": 7},
            {"label": "LOC", "start": 24, "end": 26}, # another false positive (spurious)
        ],
    ]
    evaluator = Evaluator(true, pred, tags=["LOC", "PER"])
    results, results_agg = evaluator.evaluate()

    return results, results_agg

Our overall results will look like:

{'ent_type': {'correct': 5,
  'incorrect': 0,
  'incorrect_indices': [],
  'partial': 0,
  'partial_indices': [],
  'missed': 1,
  'missed_indices': [2],
  'spurious': 2,
  'spurious_indices': [1, 4],
  'possible': 6,
  'actual': 7,
  'precision': 0.7142857142857143,
  'recall': 0.8333333333333334,
  'f1': 0.7692307692307692},
 'partial': {'correct': 4,
  'incorrect': 0,
  'incorrect_indices': [],
  'partial': 1,
  'partial_indices': [3],
  'missed': 1,
  'missed_indices': [2],
  'spurious': 2,
  'spurious_indices': [1, 4],
  'possible': 6,
  'actual': 7,
  'precision': 0.6428571428571429,
  'recall': 0.75,
  'f1': 0.6923076923076924},
 'strict': {'correct': 4,
  'incorrect': 1,
  'incorrect_indices': [3],
  'partial': 0,
  'partial_indices': [],
  'missed': 1,
  'missed_indices': [2],
  'spurious': 2,
  'spurious_indices': [1, 4],
  'possible': 6,
  'actual': 7,
  'precision': 0.5714285714285714,
  'recall': 0.6666666666666666,
  'f1': 0.6153846153846153},
 'exact': {'correct': 4,
  'incorrect': 1,
  'incorrect_indices': [3],
  'partial': 0,
  'partial_indices': [],
  'missed': 1,
  'missed_indices': [2],
  'spurious': 2,
  'spurious_indices': [1, 4],
  'possible': 6,
  'actual': 7,
  'precision': 0.5714285714285714,
  'recall': 0.6666666666666666,
  'f1': 0.6153846153846153}}

The relevant indices will also be added to the per-tag output.
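
For example, continuing the test above (a rough sketch; this assumes the per-tag dictionary keeps its current layout, keyed by tag and then by evaluation schema):

results, results_agg = test_evaluator_simple_case()

print(results_agg["LOC"]["strict"]["missed_indices"])     # expected: [2], the missed LOC in instance 2
print(results_agg["PER"]["strict"]["incorrect_indices"])  # expected: [3], the boundary error in instance 3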

@davidsbatista
Collaborator

Hi @jackboyla, and thanks a lot for your collaboration.

Please don't use self.results to store the indices; define another structure. Keeping this in a separate structure keeps things clean, and it might also reduce the number of failing tests. Maybe you can follow the same philosophy as self.results, i.e., define a base dictionary and then use deepcopy.
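
Something like this, just as a sketch of the pattern (placeholder names, not the actual evaluate.py code):

from copy import deepcopy

# One empty template for the index bookkeeping, kept separate from self.results.
base_indices = {
    "correct_indices": [],
    "incorrect_indices": [],
    "partial_indices": [],
    "missed_indices": [],
    "spurious_indices": [],
}

# Deep-copied once per evaluation schema so the lists stay independent.
evaluation_indices = {
    schema: deepcopy(base_indices)
    for schema in ("strict", "ent_type", "partial", "exact")
}

evaluation_indices["strict"]["missed_indices"].append(2)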

Going back to the original issue:

Is there a way to find out for which instance during evaluation was marked under 'correct' or 'incorrect' or 'spurious', etc for a particular evaluation schema?

I see that the question was to have a way to find which instances were marked under correct, incorrect, spurious, etc.

Maybe it's also useful to have a function that prints a nice output to the console with those entities: just the indices by default, but optionally the surface strings themselves if the text is given.
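
Something along these lines, purely as a sketch (hypothetical names, not actual code):

def print_error_report(schema_indices, docs=None):
    # schema_indices: e.g. {"missed_indices": [2], "spurious_indices": [1, 4]}
    # docs: optional list of the original texts, used to also print the text of each instance.
    for category, indices in schema_indices.items():
        print(category)
        if not indices:
            print("  - None")
        for i in indices:
            suffix = f": {docs[i]}" if docs is not None else ""
            print(f"  - instance {i}{suffix}")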

What do you think @ivyleavedtoadflax ?

By the way, a belated happy New Year to you all :)

@davidsbatista
Collaborator

davidsbatista commented Jan 16, 2024

Hi @jackboyla, thanks for your updates - I've left a few comments, mostly regarding variable names. I fear that this code is starting to become a bit spaghetti-like; that's the reason I'm so strict.

By the way, did you run the code quality checks locally?

@jackboyla
Contributor Author

Hey @davidsbatista, thanks for the feedback! I appreciate the strictness; it's in danger of getting messy 😅 I will run the code quality checks now. I just had one more question:

The current implementation will provide output like:

{'strict': {'correct_indices': [0, 1, 1, 2, 4],
  'incorrect_indices': [3],
  'partial_indices': [],
  'missed_indices': [2],
  'spurious_indices': [1, 4]},
 'ent_type': {'correct_indices': [0, 1, 1, 2, 4],
  'incorrect_indices': [],
  'partial_indices': [],
  'missed_indices': [2],
  'spurious_indices': [1, 4]},
 'partial': {'correct_indices': [0, 1, 1, 2, 4],
  'incorrect_indices': [],
  'partial_indices': [3],
  'missed_indices': [2],
  'spurious_indices': [1, 4]},
 'exact': {'correct_indices': [0, 1, 1, 2, 4],
  'incorrect_indices': [3],
  'partial_indices': [],
  'missed_indices': [2],
  'spurious_indices': [1, 4]}}

Each value in an index list is the index of the instance in which an entity was predicted correctly or erroneously.

You can see that for 'correct_indices': [0, 1, 1, 2, 4], the instance at index 1 had two correct predictions (hence 1 appears twice), but the output does not specify which predictions within the instance were correct.

I can add this information so each element is a tuple (instance_index, entity_within_instance_index), but I'm afraid this will make the output too verbose:

{'strict': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [(3, 0)],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'ent_type': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'partial': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [],
  'partial_indices': [(3, 0)],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'exact': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [(3, 0)],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]}}

Do you think this addition would be valuable?

@davidsbatista
Collaborator

You can see for 'correct_indices': [0, 1, 1, 2, 4], the instance at index 1 had two correct predictions (hence 1 appears twice), but it will not specify which predictions within the instance were correct.

Sorry, I am a bit confused about how this can happen. How can a predicted entity be correct twice under the same evaluation scenario? Maybe you can show me an example? As mentioned earlier, I haven't looked at or touched the code in a while.

I thought that the index represents each instance in the prediction list, as shown below:

pred = [
        [{"label": "PER", "start": 2, "end": 4}],
        [
            {"label": "LOC", "start": 1, "end": 2},
            {"label": "PER", "start": 13, "end": 14}, # false positive (spurious)
        ],
        [
            {"label": "LOC", "start": 1, "end": 2},
        ],
        [
            {"label": "PER", "start": 28, "end": 31}, # partial
        ],
        [
            {"label": "LOC", "start": 4, "end": 7},
            {"label": "LOC", "start": 24, "end": 26}, # another false positive (spurious)
        ],
    ]

I can add this information so each element is a tuple (instance_index, entity_within_instance_index), but I'm afraid this will make the output too verbose:

Regarding this suggestion, I think it starts to become really convoluted. The end user just wants a nice report: in this case, the entities themselves, or the offsets of the entities in the document that were incorrect under each scenario.

Please see the two functions below:

  • summary_report_ent()
  • summary_report_overall()

If you run those functions, you will see a nice report; see the examples folder for an example that calls these functions.

We can use the format that you proposed as an intermediate representation, but I don't think it should ever be exposed to the user. You can rely on that intermediate representation to output a nice report to the console or to a file.

Sorry if this seems like more work than you had initially thought, but I think it's better if we tackle this in a structured and clean way to avoid adding more and more spaghetti code.

@jackboyla
Contributor Author

jackboyla commented Jan 17, 2024

Thanks for taking the time to give this feedback 😄 I think I didn't make it clear in my last comment. Here's an example:

We have predictions for 3 separate instances:

pred = [
        [{"label": "PER", "start": 2, "end": 4}], # correct
        [
            {"label": "LOC", "start": 1, "end": 2}, # correct
            {"label": "LOC", "start": 4, "end": 5}, # false positive (spurious)
            {"label": "PER", "start": 13, "end": 14}, # false positive (spurious)
        ],
       [{"label": "PER", "start": 7, "end": 9}], # false positive (spurious)
    ]

We see that for the instance at index 1, there are two spurious predictions for two separate entities -- at indices 1 and 2 inside that instance. So the current implementation records 2 errors under spurious. Additionally, instance 2 contains one prediction that is spurious:

{'strict': {...,
  ...,
  'spurious_indices': [1, 1, 2]},
...
}

This tells the user that 2 spurious errors have been recorded in instance 1 (under the strict eval schema). Alternatively, we can just return a set() of these indices, but then the user doesn't know how many spurious errors are present in instance 1.
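
(As a side note, keeping the duplicates still lets a user recover per-instance error counts on their side; a minimal sketch:)

from collections import Counter

spurious_indices = [1, 1, 2]  # as returned under the 'strict' schema above
print(Counter(spurious_indices))  # Counter({1: 2, 2: 1}) -> two spurious errors in instance 1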

I then proposed:

I can add this information so each element is a tuple (instance_index, entity_within_instance_index), but I'm afraid this will make the output too verbose:

This would include the position of the predictions within the instance where the error occurred:

{'strict': {...,
  ...,
  'spurious_indices': [(1, 1), (1, 2), (2, 0)]},
...
}

With regard to the above, I do agree that it becomes too convoluted, and it would be impossible to print this nicely when dealing with many instances and entities.

Looking at it from the end user's point of view, they just want an output that shows them in which instances the NER model failed. If there is an error in an instance, the user can then look at the entire instance.

Regarding the printing, I can do it in the style of summary_report_ent and summary_report_overall, but it may not be very pretty: if there are many instances, the output will not fit on one line.

I hope this makes it a bit clearer what I'm trying to say 😃

@jackboyla
Contributor Author

Hi @davidsbatista, I've added print functions for both overall evaluation indices and per-entity indices -- summary_report_overall_indices and summary_report_ents_indices respectively. They differ from the other summary functions in that they only print the results for one given evaluation schema (exact, ent_type, etc.), as printing all schemas is very verbose.

Here's an example:

evaluation_indices = {'strict': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [(3, 0)],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'ent_type': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'partial': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [],
  'partial_indices': [(3, 0)],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]},
 'exact': {'correct_indices': [(0, 0), (1, 0), (1, 1), (2, 0), (4, 0)],
  'incorrect_indices': [(3, 0)],
  'partial_indices': [],
  'missed_indices': [(2, 0)],
  'spurious_indices': [(1, 2), (4, 1)]}}

You have the option to provide preds as an argument:

print(summary_report_overall_indices(evaluation_indices, 'partial', preds))

which will return:

Indices for error schema 'partial':

Correct indices:
  - Instance 0, Entity 0: Label=PER, Start=2, End=4
  - Instance 1, Entity 0: Label=LOC, Start=1, End=2
  - Instance 1, Entity 1: Label=LOC, Start=5, End=6
  - Instance 2, Entity 0: Label=LOC, Start=1, End=2
  - Instance 4, Entity 0: Label=LOC, Start=4, End=7

Incorrect indices:
  - None

Partial indices:
  - Instance 3, Entity 0: Label=PER, Start=28, End=31

Missed indices:
  - Instance 2, Entity 0: Label=LOC, Start=1, End=2

Spurious indices:
  - Instance 1, Entity 2: Label=PER, Start=13, End=14
  - Instance 4, Entity 1: Label=LOC, Start=24, End=26

or do not add preds, in which case the function returns:

Indices for error schema 'partial':

Correct indices:
  - Instance 0, Entity 0
  - Instance 1, Entity 0
  - Instance 1, Entity 1
  - Instance 2, Entity 0
  - Instance 4, Entity 0

Incorrect indices:
  - None

Partial indices:
  - Instance 3, Entity 0

Missed indices:
  - Instance 2, Entity 0

Spurious indices:
  - Instance 1, Entity 2
  - Instance 4, Entity 1

At the per-entity level, we can use:

print(summary_report_ents_indices(evaluation_agg_indices, 'partial', preds))

Again, the preds argument is optional. This will return:


Entity Type: LOC
  Error Schema: 'partial'
    (LOC) Correct indices:
      - Instance 1, Entity 0: Label=LOC, Start=1, End=2
      - Instance 1, Entity 1: Label=LOC, Start=5, End=6
      - Instance 2, Entity 0: Label=LOC, Start=1, End=2
      - Instance 4, Entity 0: Label=LOC, Start=4, End=7
    (LOC) Incorrect indices:
      - None
    (LOC) Partial indices:
      - None
    (LOC) Missed indices:
      - Instance 2, Entity 0: Label=LOC, Start=1, End=2
    (LOC) Spurious indices:
      - Instance 4, Entity 1: Label=LOC, Start=24, End=26

Entity Type: PER
  Error Schema: 'partial'
    (PER) Correct indices:
      - Instance 0, Entity 0: Label=PER, Start=2, End=4
    (PER) Incorrect indices:
      - None
    (PER) Partial indices:
      - Instance 3, Entity 0: Label=PER, Start=28, End=31
    (PER) Missed indices:
      - None
    (PER) Spurious indices:
      - Instance 1, Entity 2: Label=PER, Start=13, End=14

@davidsbatista
Collaborator

Hello @jackboyla and thanks once again for your efforts.

This seems to be in line with what I envision as a report summary. I will approve the workflow so that the code quality checks can run and we can start reviewing the code together.

@davidsbatista
Collaborator

@ivyleavedtoadflax it seems the coverage badge is giving us issues again. Do you have any idea why? Maybe we can disable it or try to find a replacement; what do you think?

@davidsbatista
Collaborator

@ivyleavedtoadflax shall I just open a PR to disable the coverage badge?

@davidsbatista
Collaborator

@ivyleavedtoadflax bump

@ivyleavedtoadflax
Collaborator

Hey, sorry for the slow reply. Let's go with the easy option of disabling it for now to unblock this PR. Are you ok to put in a PR? 🙏

@davidsbatista
Collaborator

@jackboyla merge the main branch into yours - I've removed the workflow that was causing this issue.

davidsbatista merged commit db0b257 into MantisAI:main on Mar 2, 2024
1 check passed
@davidsbatista
Collaborator

Done! :) @ivyleavedtoadflax, do you know if I have permissions to make a new release?

@jackboyla
Contributor Author

jackboyla commented Mar 2, 2024

Great stuff, thank you! 😄 I just realised I forgot to document in the README how this changes things. Evaluator.evaluate() now returns 4 variables instead of 2:

from nervaluate import Evaluator
true = [
    [{"label": "PER", "start": 2, "end": 4}],
    [{"label": "LOC", "start": 1, "end": 2},
     {"label": "LOC", "start": 3, "end": 4}]
]

pred = [
    [{"label": "PER", "start": 2, "end": 4}],
    [{"label": "LOC", "start": 1, "end": 2},
     {"label": "LOC", "start": 3, "end": 4},
     {"label": "LOC", "start": 12, "end": 14}]
]

evaluator = Evaluator(true, pred, tags=['LOC', 'PER'])

results, results_by_tag, result_indices, result_indices_by_tag = evaluator.evaluate()
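
The new index structures can also be inspected directly; for the example above (assuming the intermediate representation keeps the (instance_index, entity_within_instance_index) tuple format shown earlier in the thread):

print(result_indices["strict"]["spurious_indices"])  # e.g. [(1, 2)] -> the extra LOC at instance 1, entity 2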

Additionally, I wanted to include how users can pretty print this new info:

from nervaluate import summary_report_ents_indices
print(summary_report_ents_indices(result_indices_by_tag, error_schema='partial', preds=pred))

Entity Type: LOC
  Error Schema: 'partial'
    (LOC) Correct indices:
      - Instance 1, Entity 0: Label=LOC, Start=1, End=2
      - Instance 1, Entity 1: Label=LOC, Start=3, End=4
    (LOC) Incorrect indices:
      - None
    (LOC) Partial indices:
      - None
    (LOC) Missed indices:
      - None
    (LOC) Spurious indices:
      - Instance 1, Entity 2: Label=LOC, Start=12, End=14

Entity Type: PER
  Error Schema: 'partial'
    (PER) Correct indices:
      - Instance 0, Entity 0: Label=PER, Start=2, End=4
    (PER) Incorrect indices:
      - None
    (PER) Partial indices:
      - None
    (PER) Missed indices:
      - None
    (PER) Spurious indices:
      - None

@davidsbatista would it be possible for you to add this to the README so we don't have to revert the whole PR?

@davidsbatista
Collaborator

feel free to open another PR

@jackboyla
Contributor Author

ok cool #74
