Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowing random.Random seed and documenting TaskDataset shuffling risk #177

Merged
merged 4 commits into from
Jan 18, 2025

Conversation

jamesbraza
Copy link
Collaborator

@ludomitch and I realized that if you specify the same int seed for an entire TaskDataset (which we don't currently do, but could do), assuming the multiple choice options are the same length, the ideal answer would show up in the same index.

This PR:

  1. Documents this risk
  2. Adds support for random.Random to MultipleChoiceQuestion
  3. Adds a (not in-place) shuffle utility

@jamesbraza jamesbraza added the enhancement New feature or request label Jan 17, 2025
@jamesbraza jamesbraza requested review from sidnarayanan and a team January 17, 2025 21:07
@jamesbraza jamesbraza self-assigned this Jan 17, 2025
codeflash-ai bot added a commit that referenced this pull request Jan 17, 2025
To optimize the `shuffle` function, we can implement several improvements that will make it run faster without changing its functionality or behavior.

1. Reduce type checking by consolidating checks.
2. Use separate logic paths that do not recheck the type once confirmed.

Here is the optimized version of the `shuffle` function.

 

Improvements and explanations.
1. **Type Consolidation**: Instead of checking the type of `seed` multiple times in different `if` statements, we determine the type once and proceed accordingly. This reduces overhead for type checking.
2. **Direct Use of RNG**: For instances where the seed is a `random.Random` or integer, we create/randomize the RNG once and use it directly, simplifying the process.
3. **Exception Handling**: Instead of allowing for silent errors when an invalid type is passed for `seed`, this implementation explicitly raises a `TypeError` for an invalid type. This change enhances code robustness and helps in debugging.
if seed is None:
return random.sample(value, k=len(value))
# Numpy RNG. Note this will have a type error for sequences like str, but oh well
return seed.choice(value, size=len(value), replace=False) # type: ignore[arg-type,return-value]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we have an obvious use for sampling with replacement anyway do we?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying this helper function should expose replace as an argument?

My answer to that is no, because this is a shuffle helper function (and its internals are an implementation detail). I had to use sampling without replacement inside because the shuffle functions are all in-place.... but I don't want our code to come with associated risks of in-place edits.

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. documentation Improvements or additions to documentation labels Jan 17, 2025
@jamesbraza jamesbraza merged commit 27915cb into main Jan 18, 2025
6 checks passed
@jamesbraza jamesbraza deleted the better-randomness branch January 18, 2025 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request size:M This PR changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants