-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix/seq2seq finish #716
Fix/seq2seq finish #716
Conversation
Codecov Report
@@ Coverage Diff @@
## main #716 +/- ##
==========================================
- Coverage 89.69% 89.58% -0.11%
==========================================
Files 166 166
Lines 13323 13359 +36
==========================================
+ Hits 11950 11968 +18
- Misses 1373 1391 +18
... and 3 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good update. Left a few questions to clarify a few things.
@@ -633,3 +633,5 @@ def set_tokenizer(tokenizer: PreTrainedTokenizerFast) -> None: | |||
for attr in ["encode", "decode", "encode_plus", "padding_side"]: | |||
assert hasattr(tokenizer, attr), f"Tokenizer must support `{attr}`" | |||
seq2seq_logger_config.tokenizer = tokenizer | |||
# Seq2Seq doesn't have labels but we need to set this to avoid validation errors | |||
seq2seq_logger_config.labels = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have labels, though, right? Are we maybe saying it's not expected to be set at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, not dataset ground truth labels like we have in TC or NER. we have a target
column but that's different than this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, we call that target
here, makes sense.
if len(extensions) > 1: | ||
raise GalileoException( | ||
f"Multiple file extensions found in {dir_}: {extensions}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels a bit fragile, honestly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestions? the flow currently expects all files to be hdf5 so this at least now allows for hdf5 and arrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's fair. It just feels a bit risky since we could populate files with internal code that maybe doesn't conform to this. But it's possible we will avoid that and likely never hit that from dq
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's a very specific helper we need in a very specific use case, which is that we've saved a bunch of batches to a folder (as hdf5 or arrow files) and now need to combine the batches.
so for this specific helper i do want it to alert if there are multiple types of files in the folder.
would it help if i update the fn name or the docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, I think it's good for now. I don't expect us to hit this pathway much, so it feels fine.
@@ -67,6 +68,7 @@ class Seq2SeqDataLogger(BaseGalileoDataLogger): | |||
|
|||
__logger_name__ = "seq2seq" | |||
logger_config = seq2seq_logger_config | |||
DATA_FOLDER_EXTENSION = {"emb": "hdf5", "prob": "hdf5", "data": "arrow"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does this get used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the file format we save for the files uploaded to teh root bucket in minio
def separate_dataframe( | ||
cls, df: DataFrame, prob_only: bool = True, split: Optional[str] = None | ||
) -> BaseLoggerDataFrames: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I like that we split these up.
if len(extensions) > 1: | ||
raise GalileoException( | ||
f"Multiple file extensions found in {dir_}: {extensions}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's fair. It just feels a bit risky since we could populate files with internal code that maybe doesn't conform to this. But it's possible we will avoid that and likely never hit that from dq
.
https://app.shortcut.com/galileo/story/6733/dq-seq2seq-finish-and-upload-support Update the `.finish()` flow to combine and upload the data.arrow file for Seq2Seq We pin `transformers==4.30.0` due to a new error introduced with 4.31.0 here: https://github.com/rungalileo/dataquality/actions/runs/5603321961/jobs/10249775309?pr=716 Franz is looking into a fix and then we will unpin transformers
https://app.shortcut.com/galileo/story/6733/dq-seq2seq-finish-and-upload-support
Update the
.finish()
flow to combine and upload the data.arrow file for Seq2SeqWe pin
transformers==4.30.0
due to a new error introduced with 4.31.0 here: https://github.com/rungalileo/dataquality/actions/runs/5603321961/jobs/10249775309?pr=716Franz is looking into a fix and then we will unpin transformers