Allowing `random.Random` seed and documenting `TaskDataset` shuffling risk #177

jamesbraza · 2025-01-17T21:07:41Z

@ludomitch and I realized that if you specify the same int seed for an entire `TaskDataset` (which we don't currently do, but could), then, assuming every question has the same number of multiple choice options, the ideal answer would show up at the same index in every question.

This PR:

- Documents this risk
- Adds support for `random.Random` to `MultipleChoiceQuestion`
- Adds a (not in-place) `shuffle` utility
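
To make the risk concrete, here is a minimal sketch using only the standard library. The helper `shuffled_index_of_ideal` and the sample questions are hypothetical (they are not part of aviary), and the sketch assumes the ideal answer starts at the same position in each unshuffled option list.

```python
import random
from typing import Union


def shuffled_index_of_ideal(
    options: list[str], ideal: str, seed: Union[int, random.Random]
) -> int:
    """Shuffle a copy of the options and report where the ideal answer lands."""
    # An int seed builds a fresh RNG each call; a random.Random is reused as-is
    rng = seed if isinstance(seed, random.Random) else random.Random(seed)
    shuffled = rng.sample(options, k=len(options))
    return shuffled.index(ideal)


# Hypothetical questions: four options each, ideal answer listed first
questions = [
    (["Paris", "Rome", "Oslo", "Bern"], "Paris"),
    (["4", "3", "5", "22"], "4"),
    (["blue", "red", "green", "pink"], "blue"),
]

# Same int seed for every question: the identical permutation is applied each
# time, so the ideal answer always ends up at one index
same_seed_indices = {
    shuffled_index_of_ideal(opts, ideal, seed=42) for opts, ideal in questions
}
assert len(same_seed_indices) == 1

# One shared random.Random: its state advances between questions, so the
# permutations are no longer forced to match
shared_rng = random.Random(42)
shared_indices = [
    shuffled_index_of_ideal(opts, ideal, seed=shared_rng) for opts, ideal in questions
]
```

Passing a single `random.Random` keeps the run reproducible while drawing a separate permutation for each question.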

To optimize the `shuffle` function, we can implement several improvements that will make it run faster without changing its functionality or behavior.

1. Reduce type checking by consolidating checks.
2. Use separate logic paths that do not recheck the type once confirmed.

Here is the optimized version of the `shuffle` function.

Improvements and explanations:

1. **Type Consolidation**: Instead of checking the type of `seed` multiple times in different `if` statements, we determine the type once and proceed accordingly. This reduces overhead for type checking.
2. **Direct Use of RNG**: For instances where the seed is a `random.Random` or integer, we create/randomize the RNG once and use it directly, simplifying the process.
3. **Exception Handling**: Instead of allowing for silent errors when an invalid type is passed for `seed`, this implementation explicitly raises a `TypeError` for an invalid type. This change enhances code robustness and helps in debugging.
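
The code that comment refers to did not survive on this page. As a rough illustration of the three points only, and not the bot's or the PR's actual implementation, a standard-library sketch might look like the following. The name `shuffle_sketch` is hypothetical, and the NumPy branch from the real helper is left out to keep the example dependency-free.

```python
import random
from collections.abc import Sequence
from typing import TypeVar, Union

T = TypeVar("T")


def shuffle_sketch(
    value: Sequence[T], seed: Union[int, random.Random, None] = None
) -> list[T]:
    """Return a shuffled copy of value, leaving the input untouched."""
    if seed is None:
        # No seed: fall back to the module-level (global) RNG
        return random.sample(value, k=len(value))
    # Resolve the seed to an RNG exactly once, then use it directly
    if isinstance(seed, random.Random):
        rng = seed
    elif isinstance(seed, int):
        rng = random.Random(seed)
    else:
        # Fail loudly on unsupported seed types instead of erroring later
        raise TypeError(f"Unsupported seed type: {type(seed).__name__}")
    return rng.sample(value, k=len(value))
```

Note that raising `TypeError` is a behavior change for invalid seeds, so it is worth weighing against the stated goal of leaving behavior untouched.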

ludomitch · 2025-01-17T21:41:58Z

src/aviary/utils.py

+    if seed is None:
+        return random.sample(value, k=len(value))
+    # Numpy RNG. Note this will have a type error for sequences like str, but oh well
+    return seed.choice(value, size=len(value), replace=False)  # type: ignore[arg-type,return-value]


I'm not sure we have an obvious use for sampling with replacement anyway, do we?

Are you saying this helper function should expose replace as an argument?

My answer to that is no, because this is a shuffle helper function (and its internals are an implementation detail). I had to use sampling without replacement inside because the shuffle functions are all in-place, and I don't want our code to come with the associated risks of in-place edits.
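
For reference, a small standard-library sketch of the distinction being drawn here: `random.Random.shuffle` mutates its argument, while sampling every item without replacement returns a new list. The `options` list is made up for illustration.

```python
import random

options = ["A", "B", "C", "D"]

# random.Random.shuffle reorders the list it is given and returns None
in_place = list(options)
result = random.Random(0).shuffle(in_place)
assert result is None

# Sampling all items without replacement builds a new list instead,
# so the caller's original sequence is left untouched
copied = random.Random(0).sample(options, k=len(options))
assert options == ["A", "B", "C", "D"]
assert sorted(copied) == options
```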

jamesbraza added 3 commits January 17, 2025 12:57

Added numpy as an explicit dev dependency

e95d98a

Created shuffle utility function that works with numpy and immutability

62c7c20

Documented the risk of same-seeding when doing a TaskDataset, and added Random support

9563cd9

jamesbraza added the enhancement label Jan 17, 2025

jamesbraza requested review from sidnarayanan and a team January 17, 2025 21:07

jamesbraza self-assigned this Jan 17, 2025

sidnarayanan reviewed Jan 17, 2025

ludomitch approved these changes Jan 17, 2025

Expanded wording on int seed risk in shuffle_seed's description

53b19c8

jamesbraza requested a review from sidnarayanan January 17, 2025 21:58

dosubot bot added the size:M and documentation labels Jan 17, 2025

sidnarayanan approved these changes Jan 18, 2025

jamesbraza merged commit 27915cb into main Jan 18, 2025
6 checks passed

jamesbraza deleted the better-randomness branch January 18, 2025 20:17

jamesbraza mentioned this pull request Jan 21, 2025

Removed arbitrary_types_allowed=True from MultipleChoiceQuestion via custom Pydantic serializer/validator #178

Merged
