fix(chore): cleanup s2s offsets #784
Conversation
"""
Look at the last offset of the tokenized target to find the position of the last
character of the target string that was used by the model.
Note that typically the model does not use the entire target during teacher forcing
and there is a cut-off point (for example 128 tokens, or 512 tokens, etc).
"""
df_copy = df.copy()
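The docstring above can be illustrated with a small sketch. This is a hypothetical helper, not the PR's actual implementation: it assumes a HuggingFace-style `offset_mapping` (a list of `(start, end)` character spans, one per token) for a target that was truncated at the tokenizer's cut-off.

```python
def last_used_char(offset_mapping):
    """Return the last character position of the target that the model saw.

    offset_mapping: list of (start, end) char spans, one per token.
    Special tokens are typically (0, 0), so spans with end == 0 are skipped.
    """
    ends = [end for start, end in offset_mapping if end > 0]
    return max(ends) if ends else 0

# Example: after truncation, three real tokens survive, covering chars [0, 18).
offsets = [(0, 0), (0, 5), (5, 11), (11, 18), (0, 0)]
print(last_used_char(offsets))  # -> 18
```

Everything past the returned position was never used during teacher forcing.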
What is the copy for?
I think it's just good practice in a helper fn that updates a vaex DataFrame. We would probably be fine without it, but I add it as a safety precaution: it can help avoid unexpected side effects, and sometimes you might want to preserve the original df without the updates. It can be helpful for testing too.
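The copy-first pattern being discussed can be sketched as follows. This is an illustrative example, not code from the PR, and it uses pandas for portability; the same reasoning applies to a vaex DataFrame. The helper name and columns are made up.

```python
import pandas as pd

def add_token_count(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: copy first, then mutate, so the caller's
    frame is left untouched (no side effects, easier to test)."""
    df_copy = df.copy()
    df_copy["token_count"] = df_copy["text"].str.split().str.len()
    return df_copy

original = pd.DataFrame({"text": ["hello world", "one two three"]})
result = add_token_count(original)
assert "token_count" not in original.columns  # original is unchanged
```

If the helper mutated `df` in place instead, every caller holding a reference to the same frame would silently see the new column.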
Codecov Report
@@ Coverage Diff @@
## main #784 +/- ##
==========================================
- Coverage 87.15% 87.12% -0.03%
==========================================
Files 186 186
Lines 15238 15257 +19
==========================================
+ Hits 13280 13293 +13
- Misses 1958 1964 +6
... and 2 files with indirect coverage changes
LGTM
raise GalileoException(
    msg.format(col="Input", val=input_col, col_name="input_col")
)
if target_col not in ds.column_names:
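The suggestion below (check both columns and report all missing ones at once) could look roughly like this. This is a hedged sketch, not the PR's code: the names `input_col` / `target_col` follow the snippet above, but the message format is illustrative, and a plain `ValueError` stands in for `GalileoException`.

```python
def validate_columns(column_names, input_col, target_col):
    """Collect all missing columns, then raise once listing every one,
    rather than failing on the first missing column."""
    missing = [
        (name, val)
        for name, val in [("input_col", input_col), ("target_col", target_col)]
        if val not in column_names
    ]
    if missing:
        details = ", ".join(f"{name}={val!r}" for name, val in missing)
        # The real code raises GalileoException; ValueError is a stand-in.
        raise ValueError(f"Column(s) not found in dataset: {details}")
```

Reporting both at once saves the user a second round trip when both names are wrong.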
Maybe also check for target_col, and if both are missing, raise a single exception reporting both.
good call, updated!
How long is the for loop on line 29? Can it be parallelized or combined with a JIT (JAX or Numba)?
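As a micro-illustration of the kind of speedup being asked about, a per-row Python loop over offsets can often be replaced by a vectorized numpy operation (a Numba `@njit` on the loop would be the other option mentioned; it is not shown here). The data below is made up for the example.

```python
import numpy as np

def last_offsets_loop(offsets_per_row):
    """Pure-Python loop: max token end-offset per row."""
    return [max(end for _, end in row) for row in offsets_per_row]

def last_offsets_vectorized(ends_padded):
    """Same result from a 2D array of end-offsets, zero-padded per row."""
    return ends_padded.max(axis=1)

rows = [[(0, 5), (5, 9)], [(0, 3), (3, 7), (7, 12)]]
padded = np.array([[5, 9, 0], [3, 7, 12]])
assert last_offsets_loop(rows) == list(last_offsets_vectorized(padded))
```

Whether this pays off depends on how long the loop actually runs, which is the reviewer's question.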
I'd say that's outside the scope of this PR, but we can (and will) look into speed improvements for seq2seq when we do a robustification sprint.
last token we use the offset_mapping returned by the tokenizer.
"""
df_copy = df.copy()
Can't the df be edited by reference? I assume it's not too expensive on RAM.
It could be, but I'm not sure that's best vaex practice, and returning a new df is helpful for testing.
Looks good to me. How much slower is seq2seq with dq vs without dq?

As in, what's the overhead of logging with Galileo? Great question. I think it depends on a few things, like whether you do generation or not. We should do some testing without generation on large runs to know the exact overhead.
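One simple way to answer an overhead question like this is to time the same workload with and without the logging call. The sketch below is generic and hypothetical (a list `append` stands in for the real logger); it only shows the measurement pattern, not Galileo's actual API.

```python
import time

def run(n, log=None):
    """Run a toy workload n times, optionally invoking a logging callback,
    and return (elapsed seconds, result) so the two runs can be compared."""
    start = time.perf_counter()
    total = 0
    for i in range(n):
        total += i
        if log is not None:
            log(i)  # stand-in for the real per-step logging call
    return time.perf_counter() - start, total

records = []
base_time, total_a = run(100_000)
logged_time, total_b = run(100_000, log=records.append)
assert total_a == total_b  # logging must not change results
overhead = logged_time - base_time  # timings are noisy; average several runs
```

For a real answer one would measure full training runs, with and without generation, as the reply suggests.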
Use the cleaner vaex pattern of adding to the df and returning the full DF. We also materialize input_cutoff to speed up larger runs.
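The speedup from materializing comes from computing a derived column once instead of re-evaluating it on every access (in vaex, a virtual column is a lazily evaluated expression, and materializing stores its values in memory). The pure-Python sketch below illustrates that trade-off with a made-up `input_cutoff` computation; it is not vaex code.

```python
compute_calls = 0

def input_cutoff(text):
    """Hypothetical per-row computation, counting how often it runs."""
    global compute_calls
    compute_calls += 1
    return min(len(text), 128)

texts = ["short", "x" * 500]

# Virtual-column style: the expression is recomputed on every access.
for _ in range(3):
    _ = [input_cutoff(t) for t in texts]
assert compute_calls == 6  # 3 passes x 2 rows

# Materialized style: compute once, then reuse the stored values.
materialized = [input_cutoff(t) for t in texts]
for _ in range(3):
    _ = list(materialized)  # reads only; no recomputation
assert compute_calls == 8  # just the one extra pass
```

On larger runs that difference multiplies by the row count and the number of accesses, which is why materializing the column helps.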