
fix(chore): cleanup s2s offsets #784

Merged
merged 3 commits into main from fix/chore/cleanup-offsets on Oct 31, 2023

Conversation

elboy3 (Contributor) commented Oct 31, 2023

Use the cleaner vaex pattern of adding columns to the df and returning the full DF.

We also materialize input_cutoff to speed up larger runs.
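
A minimal sketch of what materializing buys, assuming illustrative column names (not the actual dataquality schema): a virtual vaex column stores only the expression and re-evaluates it on every read, while a materialized column is computed once and kept in memory, which matters on large runs.

import numpy as np
import vaex

# Toy frame; in the real run these columns come from the tokenized seq2seq inputs.
df = vaex.from_arrays(
    last_offset=np.array([118, 512, 301]),
    prompt_len=np.array([20, 40, 10]),
)

# Virtual (lazy) column: only the expression is stored, not the values.
df["input_cutoff"] = df.last_offset - df.prompt_len

# Materialize: evaluate once and keep the result as an in-memory column,
# so repeated downstream passes don't recompute the expression.
df = df.materialize("input_cutoff")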

elboy3 requested review from dcaustin33 and a team as code owners on October 31, 2023 14:26
"""
Look at the last offset of the tokenized target to find the position of the last
character of the target string that was used by the model.
Note that typically the model does not use the entire target during teacher forcing
and there is a cut-off point (for example 128 tokens, or 512 tokens, etc).
"""
df_copy = df.copy()
Contributor

What is the copy for?

Contributor (Author)

I think it's just good practice in a helper fn with a vaex dataframe when you're updating the DF. We'd probably be fine without it, but I added it as a safety precaution: it helps avoid unexpected side effects, and sometimes you want to preserve the original df without the updates. It can be helpful for testing too.
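
A minimal sketch of the pattern being described, with a hypothetical helper name and assumed column names (not the repo's actual ones): copy the incoming df, add the new column to the copy, and hand back the full DataFrame instead of mutating the caller's object.

import vaex

def add_last_target_position(df: vaex.dataframe.DataFrame) -> vaex.dataframe.DataFrame:
    # Shallow copy: vaex copies column references, not the underlying data.
    df_copy = df.copy()
    # Assumed column name, for illustration only: position of the last
    # target character the model actually saw.
    df_copy["last_target_position"] = df_copy.target_offsets_end - 1
    return df_copy

The caller then reassigns df = add_last_target_position(df), and tests can compare the input and output frames independently, which is the side-effect safety discussed above.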

codecov-commenter commented Oct 31, 2023

Codecov Report

Merging #784 (089974a) into main (ef4b69d) will decrease coverage by 0.03%.
The diff coverage is 62.06%.

@@            Coverage Diff             @@
##             main     #784      +/-   ##
==========================================
- Coverage   87.15%   87.12%   -0.03%     
==========================================
  Files         186      186              
  Lines       15238    15257      +19     
==========================================
+ Hits        13280    13293      +13     
- Misses       1958     1964       +6     
Files                                            Coverage Δ
dataquality/loggers/data_logger/seq2seq.py        70.71% <100.00%> (ø)
dataquality/utils/seq2seq/offsets.py             100.00% <100.00%> (ø)
tests/utils/test_seq2seq_offset.py               100.00% <100.00%> (ø)
dataquality/integrations/seq2seq/s2s_trainer.py    0.00% <0.00%>   (ø)

... and 2 files with indirect coverage changes

bogdan-galileo (Contributor) left a comment

LGTM

    raise GalileoException(
        msg.format(col="Input", val=input_col, col_name="input_col")
    )
if target_col not in ds.column_names:
Contributor

Maybe also check for target, and if both are missing, throw that.

Contributor (Author)

Good call, updated!
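
One possible shape of that combined check, reusing GalileoException, ds, input_col, and target_col from the snippet above; the aggregation logic here is just a sketch, not necessarily what landed in the PR.

missing = [
    (name, col)
    for name, col in (("input_col", input_col), ("target_col", target_col))
    if col not in ds.column_names
]
if missing:
    details = ", ".join(f"{name}={col!r}" for name, col in missing)
    raise GalileoException(
        f"Dataset is missing required column(s): {details}. "
        f"Available columns: {ds.column_names}"
    )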

franz101 (Contributor) commented Oct 31, 2023

How long is the for loop on line 29? Can it be parallelized or combined with a JIT (JAX or numba)?

Contributor (Author)

I'd say that's outside the scope of this PR, but we can (and will) look into speed improvements for seq2seq when we do a robustification sprint.
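
For reference, if that loop ever does become a bottleneck, one option would be a numba-compiled kernel along these lines; the function name and flattened array layout are purely illustrative, not the repo's actual helpers.

import numpy as np
from numba import njit

@njit
def last_used_positions(offset_ends: np.ndarray, token_counts: np.ndarray) -> np.ndarray:
    # offset_ends: end offsets of every target token, flattened across samples.
    # token_counts: number of (possibly cut-off) target tokens per sample.
    out = np.zeros(token_counts.shape[0], dtype=np.int64)
    start = 0
    for i in range(token_counts.shape[0]):
        n = token_counts[i]
        if n > 0:
            # Position of the last target character the model actually used.
            out[i] = offset_ends[start + n - 1]
        start += n
    return out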

last token we use the offset_mapping returned by the tokenizer.
"""
df_copy = df.copy()
Contributor

The df can't be edited by reference? I assume it's not too expensive on RAM.

Contributor (Author)

It could be, but I'm not sure that's best vaex practice, and returning a new df is helpful for testing.

franz101 (Contributor) left a comment

Looks good to me. How much slower is seq2seq with dq vs without dq?

elboy3 (Contributor, Author) commented Oct 31, 2023

> Looks good to me. How much slower is seq2seq with dq vs without dq?

As in, what's the overhead of logging with Galileo? Great question. I think it depends on a few things, like whether you do generation or not. We should do some testing without generation on large runs to know the exact overhead.

elboy3 merged commit f8c6a3a into main on Oct 31, 2023
elboy3 deleted the fix/chore/cleanup-offsets branch on October 31, 2023 15:39