
Per-group standard error to consensus sampling #210

Closed
jamesbraza wants to merge 1 commit from the consensus-per-group-std-err branch

Conversation

jamesbraza
Collaborator

I wanted to quantify uncertainty in our consensus reporting. The metric used is the binomial standard error, computed on a per-group basis.
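For reference, here is a minimal sketch of a per-group binomial standard error, treating a group's agreement rate over its k sampled answers as a Bernoulli proportion. The helper name and the exact proportion used are assumptions for illustration, not necessarily what this PR implements:

```python
import math

def binomial_std_err(num_agree: int, num_sampled: int) -> float:
    """Standard error of the Bernoulli proportion p_hat = num_agree / num_sampled.

    Hypothetical helper; the proportion the PR actually uses (agreement with the
    consensus vs. with the ideal answer) may differ.
    """
    p_hat = num_agree / num_sampled
    return math.sqrt(p_hat * (1 - p_hat) / num_sampled)

# e.g. 3 of 4 sampled answers agree -> p_hat = 0.75, std err ~= 0.217
```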

@jamesbraza jamesbraza added the enhancement New feature or request label Jan 15, 2025
@jamesbraza jamesbraza requested review from whitead, sidnarayanan, Ryan-Rhys and a team January 15, 2025 22:22
@jamesbraza jamesbraza self-assigned this Jan 15, 2025
Docstring excerpt under review:

    Two-tuple of consensus list generated by collections.Counter.most_common and
    the proportion of groups for which the consensus matches the ideal.

    Two-tuple of:
      - Three-tuple of consensus list generated by collections.Counter.most_common
Collaborator
Can you clarify this? I'm not sure what "consensus list generated by ..." means. Also instead of nested tuples, what about returning a flat 4-tuple?

Collaborator
Sorry, I should've reviewed this more thoroughly in the last PR, but on second read I have more questions for this docstring:

  1. What is data? Is it a set of answers for a single problem?
  2. What is a group key?
  3. When you select "number of samples to choose from the N total", what are those samples used for? If consensus is computed over these samples, why default to 1?

Collaborator Author
@jamesbraza jamesbraza Jan 15, 2025

Okay, this comment made me realize I need to vastly expand the docs here.

In general, the workflow is:

  1. Given a TaskDataset of T tasks
  2. Evaluate it n>=1 times, storing a Frame for each rollout that is (1) done and (2) not failed
  3. Now we have a huge bucket of Frames called data, where 0 <= len(data) <= n * T
  4. Pass this data directly into evaluate_consensus and get your consensus 🥳

Now come the logistics:

  1. In the 0 <= len(data) <= n * T Frames, we have T tasks, so the group_fn/TGroupKey splits them into groups
    • For LitQA2, this would be by question
  2. Now we have T groups of Frames, each of size 0 <= x <= n; let's get consensus on each group by sampling k of its Frames.
  3. To compute consensus, we form counted bins using collections.Counter. The counted bins are two-tuples of (answer: TAnswer, count: int).
    • (screenshot of example counted bins omitted)
  4. Let's say T=49 and 32 groups' consensus matched the respective ideal answer. Then the return's second term would be 32/49.

A minimal sketch of this whole flow is in the code below.
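The sketch assumes illustrative names (evaluate_consensus_sketch, an answer attribute on Frame, an ideal_fn that returns the ideal answer for a group key); it is not the actual implementation:

```python
import random
from collections import Counter

def evaluate_consensus_sketch(data, group_fn, ideal_fn, k, seed=0):
    """Illustrative only: group frames, sample k per group, take a majority vote.

    Assumes each frame exposes an `answer` attribute and that group_fn maps a
    frame to its task (e.g. the LitQA2 question); these names are hypothetical.
    """
    rng = random.Random(seed)
    groups = {}
    for frame in data:  # split the 0 <= len(data) <= n * T frames into T groups
        groups.setdefault(group_fn(frame), []).append(frame)

    per_group, num_matching = {}, 0
    for key, frames in groups.items():
        sampled = rng.sample(frames, min(k, len(frames)))  # sample k per group
        bins = Counter(f.answer for f in sampled).most_common()  # (answer, count) bins
        consensus = bins[0][0] if bins else None
        per_group[key] = bins
        num_matching += consensus == ideal_fn(key)  # does consensus match the ideal?

    # Per-group counted bins, plus the proportion of groups matching the ideal
    accuracy = num_matching / len(groups) if groups else 0.0
    return per_group, accuracy
```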

I'm not sure what "consensus list generated by ..." means.

It's the counted bins that were made; it gives you insight into not only the consensus, but all of the bins.
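For example (standard collections.Counter behavior):

```python
from collections import Counter

answers = ["A", "A", "B", "A", "C"]
bins = Counter(answers).most_common()
# bins == [("A", 3), ("B", 1), ("C", 1)]
# The consensus is bins[0][0] == "A", but the full list also shows
# how contested the vote was.
```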

Also instead of nested tuples, what about returning a flat 4-tuple?

Currently we return a two-tuple:

  • Part 1: per-group metrics
  • Part 2: group aggregate metrics

We don't do a flat 4-tuple because the first half is per-group while the second half is aggregate.
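To illustrate the shape (group keys, values, and the exact per-group fields here are made up and may differ from the PR):

```python
per_group_metrics = {  # part 1: keyed per group
    "question-1": [("A", 3), ("B", 1)],  # counted bins for this group
    "question-2": [("C", 4)],
}
aggregate_accuracy = 32 / 49  # part 2: fraction of groups whose consensus matched the ideal
result = (per_group_metrics, aggregate_accuracy)
```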

What is data? Is it a set of answers for a single problem?

Yes, it's all answers for >=1 invocations of a TaskDataset

What is a group key?

It's a way of discerning tasks within a TaskDataset. We use groups to discern task-level accuracy of the consensus.

When you select "number of samples to choose from the N total", what are those samples used for?

The samples are used for computing consensus while also measuring uncertainty in the consensus

If consensus is computed over these samples, why default to 1?

Yeah 1 is a bad default, you're right.


Originally I had crafted this to match Andrew's work, where he had group-aggregate metrics.

However, I think we should now deviate so evaluate_consensus does not need to take grouping into account. We can just have an upstream entity (1) compute groups and (2) compute task-level accuracy.

Collaborator Author

I added lots of comments to the code reflecting this as part of #211

@jamesbraza
Collaborator Author

Closing in favor of #211

@jamesbraza jamesbraza closed this Jan 16, 2025
@jamesbraza jamesbraza deleted the consensus-per-group-std-err branch January 16, 2025 00:57