
Per-group standard error to consensus sampling #210

Closed
jamesbraza wants to merge 1 commit from the consensus-per-group-std-err branch

Conversation

jamesbraza
Collaborator

I wanted to quantify uncertainty in our consensus reporting. The metric used is the binomial standard error, computed on a per-group basis.
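For reference, here is a minimal sketch of a per-group binomial standard error, treating a group's agreement rate over its k sampled answers as a Bernoulli proportion. The helper name and the exact proportion used are assumptions for illustration, not necessarily what this PR implements:

```python
import math

def binomial_std_err(num_agree: int, num_sampled: int) -> float:
    """Standard error of the Bernoulli proportion p_hat = num_agree / num_sampled.

    Hypothetical helper; the proportion the PR actually uses (agreement with the
    consensus vs. with the ideal answer) may differ.
    """
    p_hat = num_agree / num_sampled
    return math.sqrt(p_hat * (1 - p_hat) / num_sampled)

# e.g. 3 of 4 sampled answers agree -> p_hat = 0.75, std err ~= 0.217
```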

@jamesbraza jamesbraza added the enhancement New feature or request label Jan 15, 2025
@jamesbraza jamesbraza requested review from whitead, sidnarayanan, Ryan-Rhys and a team January 15, 2025 22:22
@jamesbraza jamesbraza self-assigned this Jan 15, 2025
Docstring excerpt under review:

    Two-tuple of consensus list generated by collections.Counter.most_common and
    the proportion of groups for which the consensus matches the ideal.

    Two-tuple of:
      - Three-tuple of consensus list generated by collections.Counter.most_common
Collaborator
Can you clarify this? I'm not sure what "consensus list generated by ..." means. Also instead of nested tuples, what about returning a flat 4-tuple?

Collaborator
Sorry, I should've reviewed this more thoroughly in the last PR, but on second read I have more questions for this docstring:

  1. What is data? Is it a set of answers for a single problem?
  2. What is a group key?
  3. When you select "number of samples to choose from the N total", what are those samples used for? If consensus is computed over these samples, why default to 1?

Collaborator Author
@jamesbraza jamesbraza Jan 15, 2025

Okay, this comment made me realize I need to vastly expand the docs here.

In general, the workflow is:

  1. Given a TaskDataset of T tasks
  2. Evaluate it n>=1 times, storing a Frame for each rollout that is (1) done and (2) not failed
  3. Now we have a huge bucket of Frames called data, where 0 <= len(data) <= n * T
  4. Pass this data directly into evaluate_consensus and get your consensus 🥳

Now come the logistics:

  1. In the 0 <= len(data) <= n * T Frames, we have T tasks, so the group_fn/TGroupKey splits them into groups
    • For LitQA2, this would be by question
  2. Now we have T groups of Frames, each of size 0 <= x <= n; let's get consensus on each group by sampling k of its Frames.
  3. To compute consensus, we form counted bins using collections.Counter. The counted bins are two-tuples of (answer: TAnswer, count: int).
    • (screenshot of example counted bins omitted)
  4. Let's say T=49 and 32 groups' consensus matched the respective ideal answer. Then the return's second term would be 32/49.

A minimal sketch of this whole flow is in the code below.
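The sketch assumes illustrative names (evaluate_consensus_sketch, an answer attribute on Frame, an ideal_fn that returns the ideal answer for a group key); it is not the actual implementation:

```python
import random
from collections import Counter

def evaluate_consensus_sketch(data, group_fn, ideal_fn, k, seed=0):
    """Illustrative only: group frames, sample k per group, take a majority vote.

    Assumes each frame exposes an `answer` attribute and that group_fn maps a
    frame to its task (e.g. the LitQA2 question); these names are hypothetical.
    """
    rng = random.Random(seed)
    groups = {}
    for frame in data:  # split the 0 <= len(data) <= n * T frames into T groups
        groups.setdefault(group_fn(frame), []).append(frame)

    per_group, num_matching = {}, 0
    for key, frames in groups.items():
        sampled = rng.sample(frames, min(k, len(frames)))  # sample k per group
        bins = Counter(f.answer for f in sampled).most_common()  # (answer, count) bins
        consensus = bins[0][0] if bins else None
        per_group[key] = bins
        num_matching += consensus == ideal_fn(key)  # does consensus match the ideal?

    # Per-group counted bins, plus the proportion of groups matching the ideal
    accuracy = num_matching / len(groups) if groups else 0.0
    return per_group, accuracy
```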

I'm not sure what "consensus list generated by ..." means.

It's the counted bins that were made; it gives you insight into not only the consensus, but all of the bins.
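For example (standard collections.Counter behavior):

```python
from collections import Counter

answers = ["A", "A", "B", "A", "C"]
bins = Counter(answers).most_common()
# bins == [("A", 3), ("B", 1), ("C", 1)]
# The consensus is bins[0][0] == "A", but the full list also shows
# how contested the vote was.
```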

Also instead of nested tuples, what about returning a flat 4-tuple?

Currently we return a two-tuple:

  • Part 1: per-group metrics
  • Part 2: group aggregate metrics

We don't do a flat 4-tuple because the first half is per-group while the second half is aggregate.
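To illustrate the shape (group keys, values, and the exact per-group fields here are made up and may differ from the PR):

```python
per_group_metrics = {  # part 1: keyed per group
    "question-1": [("A", 3), ("B", 1)],  # counted bins for this group
    "question-2": [("C", 4)],
}
aggregate_accuracy = 32 / 49  # part 2: fraction of groups whose consensus matched the ideal
result = (per_group_metrics, aggregate_accuracy)
```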

What is data? Is it a set of answers for a single problem?

Yes, it's all answers for >=1 invocations of a TaskDataset

What is a group key?

It's a way of discerning tasks within a TaskDataset. We use groups to discern task-level accuracy of the consensus.

When you select "number of samples to choose from the N total", what are those samples used for?

The samples are used for computing consensus while also measuring uncertainty in the consensus

If consensus is computed over these samples, why default to 1?

Yeah 1 is a bad default, you're right.


Originally I had crafted this to match Andrew's work, where he had group-aggregate metrics.

However, I think we should now deviate so evaluate_consensus does not need to take grouping into account. We can just have an upstream entity (1) compute groups and (2) compute task-level accuracy.

Collaborator Author

I added lots of comments to the code reflecting this as part of #211

@jamesbraza
Collaborator Author

Closing in favor of #211

@jamesbraza jamesbraza closed this Jan 16, 2025
@jamesbraza jamesbraza deleted the consensus-per-group-std-err branch January 16, 2025 00:57