-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allowing random.Random
seed and documenting TaskDataset
shuffling risk
#177
Conversation
To optimize the `shuffle` function, we can implement several improvements that will make it run faster without changing its functionality or behavior. 1. Reduce type checking by consolidating checks. 2. Use separate logic paths that do not recheck the type once confirmed. Here is the optimized version of the `shuffle` function. Improvements and explanations. 1. **Type Consolidation**: Instead of checking the type of `seed` multiple times in different `if` statements, we determine the type once and proceed accordingly. This reduces overhead for type checking. 2. **Direct Use of RNG**: For instances where the seed is a `random.Random` or integer, we create/randomize the RNG once and use it directly, simplifying the process. 3. **Exception Handling**: Instead of allowing for silent errors when an invalid type is passed for `seed`, this implementation explicitly raises a `TypeError` for an invalid type. This change enhances code robustness and helps in debugging.
if seed is None: | ||
return random.sample(value, k=len(value)) | ||
# Numpy RNG. Note this will have a type error for sequences like str, but oh well | ||
return seed.choice(value, size=len(value), replace=False) # type: ignore[arg-type,return-value] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we have an obvious use for sampling with replacement anyway do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying this helper function should expose replace
as an argument?
My answer to that is no, because this is a shuffle helper function (and its internals are an implementation detail). I had to use sampling without replacement inside because the shuffle functions are all in-place.... but I don't want our code to come with associated risks of in-place edits.
@ludomitch and I realized that if you specify the same
int
seed for an entireTaskDataset
(which we don't currently do, but could do), assuming the multiple choice options are the same length, the ideal answer would show up in the same index.This PR:
random.Random
toMultipleChoiceQuestion
shuffle
utility