
fix(chore): cleanup s2s offsets #784

Merged
merged 3 commits into main from fix/chore/cleanup-offsets on Oct 31, 2023

Conversation

elboy3 (Contributor) commented Oct 31, 2023

Use the cleaner vaex pattern of adding columns to the df and returning the full DF.

We also materialize input_cutoff to speed up larger runs.
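
A minimal sketch of what materializing buys, assuming illustrative column names (not the actual dataquality schema): a virtual vaex column stores only the expression and re-evaluates it on every read, while a materialized column is computed once and kept in memory, which matters on large runs.

import numpy as np
import vaex

# Toy frame; in the real run these columns come from the tokenized seq2seq inputs.
df = vaex.from_arrays(
    last_offset=np.array([118, 512, 301]),
    prompt_len=np.array([20, 40, 10]),
)

# Virtual (lazy) column: only the expression is stored, not the values.
df["input_cutoff"] = df.last_offset - df.prompt_len

# Materialize: evaluate once and keep the result as an in-memory column,
# so repeated downstream passes don't recompute the expression.
df = df.materialize("input_cutoff")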

elboy3 requested review from dcaustin33 and a team as code owners on October 31, 2023 14:26
"""
Look at the last offset of the tokenized target to find the position of the last
character of the target string that was used by the model.
Note that typically the model does not use the entire target during teacher forcing
and there is a cut-off point (for example 128 tokens, or 512 tokens, etc).
"""
df_copy = df.copy()
Contributor

What is the copy for?

Contributor (Author)

I think it's just good practice in a helper fn with a vaex dataframe when you're updating the DF. We'd probably be fine without it, but I added it as a safety precaution: it helps avoid unexpected side effects, and sometimes you want to preserve the original df without the updates. It can be helpful for testing too.
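
A minimal sketch of the pattern being described, with a hypothetical helper name and assumed column names (not the repo's actual ones): copy the incoming df, add the new column to the copy, and hand back the full DataFrame instead of mutating the caller's object.

import vaex

def add_last_target_position(df: vaex.dataframe.DataFrame) -> vaex.dataframe.DataFrame:
    # Shallow copy: vaex copies column references, not the underlying data.
    df_copy = df.copy()
    # Assumed column name, for illustration only: position of the last
    # target character the model actually saw.
    df_copy["last_target_position"] = df_copy.target_offsets_end - 1
    return df_copy

The caller then reassigns df = add_last_target_position(df), and tests can compare the input and output frames independently, which is the side-effect safety discussed above.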

codecov-commenter commented Oct 31, 2023

Codecov Report

Merging #784 (089974a) into main (ef4b69d) will decrease coverage by 0.03%.
The diff coverage is 62.06%.

@@            Coverage Diff             @@
##             main     #784      +/-   ##
==========================================
- Coverage   87.15%   87.12%   -0.03%     
==========================================
  Files         186      186              
  Lines       15238    15257      +19     
==========================================
+ Hits        13280    13293      +13     
- Misses       1958     1964       +6     
Files                                            Coverage Δ
dataquality/loggers/data_logger/seq2seq.py        70.71% <100.00%> (ø)
dataquality/utils/seq2seq/offsets.py             100.00% <100.00%> (ø)
tests/utils/test_seq2seq_offset.py               100.00% <100.00%> (ø)
dataquality/integrations/seq2seq/s2s_trainer.py    0.00% <0.00%>   (ø)

... and 2 files with indirect coverage changes

bogdan-galileo (Contributor) left a comment

LGTM

    raise GalileoException(
        msg.format(col="Input", val=input_col, col_name="input_col")
    )
if target_col not in ds.column_names:
Contributor

Maybe also check for target, and if both are missing, throw that.

Contributor (Author)

Good call, updated!
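
One possible shape of that combined check, reusing GalileoException, ds, input_col, and target_col from the snippet above; the aggregation logic here is just a sketch, not necessarily what landed in the PR.

missing = [
    (name, col)
    for name, col in (("input_col", input_col), ("target_col", target_col))
    if col not in ds.column_names
]
if missing:
    details = ", ".join(f"{name}={col!r}" for name, col in missing)
    raise GalileoException(
        f"Dataset is missing required column(s): {details}. "
        f"Available columns: {ds.column_names}"
    )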

franz101 (Contributor) commented Oct 31, 2023

How long is the for loop on line 29? Can it be parallelized or combined with a JIT (JAX or numba)?

Contributor (Author)

I'd say that's outside the scope of this PR, but we can (and will) look into speed improvements for seq2seq when we do a robustification sprint.
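
For reference, if that loop ever does become a bottleneck, one option would be a numba-compiled kernel along these lines; the function name and flattened array layout are purely illustrative, not the repo's actual helpers.

import numpy as np
from numba import njit

@njit
def last_used_positions(offset_ends: np.ndarray, token_counts: np.ndarray) -> np.ndarray:
    # offset_ends: end offsets of every target token, flattened across samples.
    # token_counts: number of (possibly cut-off) target tokens per sample.
    out = np.zeros(token_counts.shape[0], dtype=np.int64)
    start = 0
    for i in range(token_counts.shape[0]):
        n = token_counts[i]
        if n > 0:
            # Position of the last target character the model actually used.
            out[i] = offset_ends[start + n - 1]
        start += n
    return out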

last token we use the offset_mapping returned by the tokenizer.
"""
df_copy = df.copy()
Contributor

The df can't be edited by reference? I assume it's not too expensive on RAM.

Contributor (Author)

It could be, but I'm not sure that's best vaex practice, and returning a new df is helpful for testing.

franz101 (Contributor) left a comment

Looks good to me. How much slower is seq2seq with dq vs without dq?

elboy3 (Contributor, Author) commented Oct 31, 2023

> Looks good to me. How much slower is seq2seq with dq vs without dq?

As in, what's the overhead of logging with Galileo? Great question. I think it depends on a few things, like whether you do generation or not. We should do some testing without generation on large runs to know the exact overhead.

elboy3 merged commit f8c6a3a into main on Oct 31, 2023
elboy3 deleted the fix/chore/cleanup-offsets branch on October 31, 2023 15:39