Per-group standard error to consensus sampling #210
Conversation
Two-tuple of consensus list generated by collections.Counter.most_common and
the proportion of groups for which the consensus matches the ideal.
Two-tuple of:
- Three-tuple of consensus list generated by collections.Counter.most_common
Can you clarify this? I'm not sure what "consensus list generated by ..." means. Also instead of nested tuples, what about returning a flat 4-tuple?
Sorry, I should've reviewed this more thoroughly in the last PR, but on second read I have more questions about this docstring:
- What is `data`? Is it a set of answers for a single problem?
- What is a group key?
- When you select "number of samples to choose from the N total", what are those samples used for? If consensus is computed over these samples, why default to 1?
Okay this comment made me realize I need to vastly expand the docs here.
In general, the workflow is:
- Given a `TaskDataset` of size `T` tasks
- Evaluate it `n >= 1` times, storing a `Frame` for each rollout that is (1) done and (2) not failed
- Now we have a huge bucket of `Frame`s called `data`, where `0 <= len(data) <= n * T`
- Pass this `data` directly into `evaluate_consensus` and get your consensus 🥳 (see the sketch after this list)
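For concreteness, here's a minimal sketch of that bucket-then-evaluate flow. The `(group key, answer)` stand-in for `Frame` and the `evaluate_consensus` keyword arguments are assumptions for illustration; the real signatures aren't quoted in this thread.

```python
# Minimal sketch of the workflow above. A real Frame is richer; here it's
# just a (group key, answer) pair, and the evaluate_consensus call below
# is an assumed signature, not the actual one.
data = [
    ("q1", "A"), ("q1", "A"), ("q1", "B"),  # three done, non-failed rollouts of task q1
    ("q2", "C"), ("q2", "D"),               # two done, non-failed rollouts of task q2
]

# Step 4 (assumed call shape):
# per_group, matched_fraction = evaluate_consensus(data, group_fn=lambda f: f[0], k=2)
```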
Now come the logistics:
- In the `0 <= len(data) <= n * T` `Frame`s, we have `T` tasks, so the `group_fn`/`TGroupKey` splits them into groups
  - For LitQA2, this would be by question
- Now we have `T` groups of `Frame`s of size `0 <= x <= n`; let's get consensus on each by sampling `k`.
- To compute consensus, we form counted bins using `collections.Counter`. The counted bins are two-tuples of `(answer: TAnswer, count: int)`.
- Let's say `T = 49` and 32 groups' consensus matched the respective ideal answer. Then the return's second term would be `32/49` (see the sketch after this list).
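Putting those logistics together, here's a hedged end-to-end sketch with the same `(group key, answer)` stand-in for `Frame`. The `ideals` mapping and sampling loop are assumptions about the approach, not quoted code.

```python
import random
from collections import Counter

data = [("q1", "A"), ("q1", "A"), ("q1", "B"), ("q2", "C"), ("q2", "D")]
ideals = {"q1": "A", "q2": "C"}  # assumed ideal answer per group
k = 2  # number of samples to draw per group

# Split the bucket of Frames into T groups via the group key.
groups: dict[str, list[str]] = {}
for key, answer in data:
    groups.setdefault(key, []).append(answer)

matches = 0
for key, answers in groups.items():
    sampled = random.sample(answers, min(k, len(answers)))
    # Counted bins: two-tuples of (answer, count), most common first.
    bins = Counter(sampled).most_common()
    consensus = bins[0][0]
    matches += consensus == ideals[key]

matched_fraction = matches / len(groups)  # e.g., 32/49 in the T=49 example
```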
> I'm not sure what "consensus list generated by ..." means.

It's the counted bins that were made; it gives you insight into not only the consensus, but all the bins.
> Also instead of nested tuples, what about returning a flat 4-tuple?

Currently we return a two-tuple:
- Part 1: per-group metrics
- Part 2: group aggregate metrics

This is why we don't do a flat 4-tuple: the first half is per-group while the second half is aggregate.
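In type-annotation form, that shape is roughly the sketch below; the concrete `TAnswer` and metric types are assumptions, since they aren't quoted here.

```python
from typing import TypeAlias

TAnswer: TypeAlias = str  # assumed; the real TAnswer is presumably generic
CountedBins: TypeAlias = list[tuple[TAnswer, int]]  # Counter.most_common output

# Part 1: per-group counted bins; part 2: aggregate proportion of groups
# whose consensus matched the ideal.
ConsensusResult: TypeAlias = tuple[list[CountedBins], float]
```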
> What is data? Is it a set of answers for a single problem?

Yes, it's all answers for `>= 1` invocations of a `TaskDataset`.
> What is a group key?

It's a way of discerning tasks within a `TaskDataset`. We use groups to discern the task-level accuracy of consensus.
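As a hypothetical example, grouping LitQA2-style frames by their question might look like the sketch below; the `question` attribute is an assumed `Frame` field.

```python
# Assumed shape: group_fn maps a Frame to its TGroupKey, so all rollouts
# of the same task land in one group.
def group_fn(frame) -> str:
    return frame.question  # hypothetical Frame attribute
```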
> When you select "number of samples to choose from the N total", what are those samples used for?

The samples are used for computing consensus while also measuring uncertainty in the consensus.
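One way to read that, sketched below: repeated size-`k` draws each yield a majority answer, and disagreement across draws measures how shaky the consensus is. The resampling loop is an assumption about the approach, not quoted code.

```python
import random
from collections import Counter

answers = ["A", "A", "B", "A", "C"]  # one group's answers
k = 3

# Majority answer of each of 100 independent size-k draws.
draws = [Counter(random.sample(answers, k)).most_common(1)[0][0] for _ in range(100)]
consensus, count = Counter(draws).most_common(1)[0]
stability = count / len(draws)  # 1.0 means every draw agreed on the consensus
```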
> If consensus is computed over these samples, why default to 1?

Yeah, 1 is a bad default, you're right.
Originally I had crafted this to match Andrew's work, where he had group-aggregate metrics.
However, I think we should now deviate so that `evaluate_consensus` does not need to take grouping into account. We can just have an upstream entity (1) compute groups and (2) compute task-level accuracy.
I added lots of comments to the code reflecting this as part of #211
Closing in favor of #211
I wanted to quantify uncertainty in our consensus reporting. The metric used is binomial standard error on a per-group basis.
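A minimal sketch of that metric: the formula is the standard binomial standard error, and treating each group's "consensus matches ideal" outcome over repeated draws as a Bernoulli trial is an assumption about the intended setup.

```python
import math

def binomial_standard_error(successes: int, trials: int) -> float:
    """Standard error of a binomial proportion: sqrt(p * (1 - p) / n)."""
    p = successes / trials
    return math.sqrt(p * (1 - p) / trials)

# E.g., if the consensus of 72 out of 100 size-k draws from one group
# matched the ideal answer:
se = binomial_standard_error(72, 100)  # ~0.045
```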