Fix/seq2seq finish #716

Merged
merged 6 commits into from
Jul 20, 2023

elboy3
Contributor

@elboy3 elboy3 commented Jul 19, 2023

https://app.shortcut.com/galileo/story/6733/dq-seq2seq-finish-and-upload-support

Update the .finish() flow to combine and upload the data.arrow file for Seq2Seq

We pin transformers==4.30.0 due to a new error introduced with 4.31.0 here: https://github.com/rungalileo/dataquality/actions/runs/5603321961/jobs/10249775309?pr=716

Franz is looking into a fix and then we will unpin transformers

@codecov-commenter
codecov-commenter commented Jul 19, 2023

Codecov Report

Merging #716 (dbebb83) into main (7db3ef4) will decrease coverage by 0.11%.
The diff coverage is 55.55%.

@@            Coverage Diff             @@
##             main     #716      +/-   ##
==========================================
- Coverage   89.69%   89.58%   -0.11%     
==========================================
  Files         166      166              
  Lines       13323    13359      +36     
==========================================
+ Hits        11950    11968      +18     
- Misses       1373     1391      +18     
| Files Changed | Coverage Δ |
|---|---|
| dataquality/loggers/data_logger/seq2seq.py | 81.37% <28.57%> (-13.69%) ⬇️ |
| dataquality/utils/file.py | 44.82% <77.77%> (+6.82%) ⬆️ |
| dataquality/utils/vaex.py | 73.71% <78.57%> (-0.29%) ⬇️ |
| dataquality/core/log.py | 95.55% <100.00%> (+0.03%) ⬆️ |

... and 3 files with indirect coverage changes

@elboy3 elboy3 marked this pull request as ready for review July 19, 2023 21:37
@elboy3 elboy3 requested review from a team and dcaustin33 as code owners July 19, 2023 21:37
Member

@setu4993 setu4993 left a comment

Good update. Left a few questions to clarify a few things.

@@ -633,3 +633,5 @@ def set_tokenizer(tokenizer: PreTrainedTokenizerFast) -> None:
    for attr in ["encode", "decode", "encode_plus", "padding_side"]:
        assert hasattr(tokenizer, attr), f"Tokenizer must support `{attr}`"
    seq2seq_logger_config.tokenizer = tokenizer
    # Seq2Seq doesn't have labels but we need to set this to avoid validation errors
    seq2seq_logger_config.labels = []
Member

We do have labels, though, right? Are we maybe saying it's not expected to be set at this point?

Contributor Author

nope, not dataset ground truth labels like we have in TC or NER. we have a target column but that's different than this

Member

Ah okay, we call that target here, makes sense.
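
For anyone reading this thread without the diff open, the hunk above can be tried as a standalone sketch. The body of `set_tokenizer` is copied from the diff; `seq2seq_logger_config` here is only a `SimpleNamespace` stand-in (the real config object is not shown in this PR page), and `DummyTokenizer` is a hypothetical class that exists just to satisfy the attribute checks.

```python
from types import SimpleNamespace

# Stand-in for the real module-level config object (assumption, not the dq class).
seq2seq_logger_config = SimpleNamespace(tokenizer=None, labels=None)


def set_tokenizer(tokenizer) -> None:
    # Duck-typing check: the tokenizer only needs these four attributes.
    for attr in ["encode", "decode", "encode_plus", "padding_side"]:
        assert hasattr(tokenizer, attr), f"Tokenizer must support `{attr}`"
    seq2seq_logger_config.tokenizer = tokenizer
    # Seq2Seq has a target column rather than ground-truth labels,
    # so labels is set to an empty list to avoid validation errors.
    seq2seq_logger_config.labels = []


class DummyTokenizer:
    """Hypothetical tokenizer exposing only the attributes that are checked."""

    padding_side = "right"

    def encode(self, text):
        return [ord(ch) for ch in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)

    def encode_plus(self, text):
        return {"input_ids": self.encode(text)}


if __name__ == "__main__":
    set_tokenizer(DummyTokenizer())
    print(seq2seq_logger_config.labels)
```

An object missing any of the four attributes fails the `assert` before the config is touched.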

pyproject.toml (resolved)
Comment on lines +22 to +25
    if len(extensions) > 1:
        raise GalileoException(
            f"Multiple file extensions found in {dir_}: {extensions}"
        )
Member

Feels a bit fragile, honestly.

Contributor Author

Any suggestions? the flow currently expects all files to be hdf5 so this at least now allows for hdf5 and arrow

Member

Yeah, I think that's fair. It just feels a bit risky since we could populate files with internal code that maybe doesn't conform to this. But it's possible we will avoid that and likely never hit that from dq.

Contributor Author

i think it's a very specific helper we need in a very specific use case, which is that we've saved a bunch of batches to a folder (as hdf5 or arrow files) and now need to combine the batches.

so for this specific helper i do want it to alert if there are multiple types of files in the folder.

would it help if i update the fn name or the docstring?

Member

Nah, I think it's good for now. I don't expect us to hit this pathway much, so it feels fine.
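
To make the behavior discussed in this thread concrete, here is a minimal sketch of a helper that reports the single file extension used in a folder of saved batches and raises when the folder mixes types. Only the `len(extensions) > 1` check and its message come from the diff. The function name `get_single_extension`, the way extensions are collected, and the local `GalileoException` stand-in are assumptions for illustration, not the actual code in `dataquality/utils/file.py`.

```python
import os
import tempfile
from typing import Optional


class GalileoException(Exception):
    """Local stand-in for the project's exception type so the sketch runs alone."""


def get_single_extension(dir_: str) -> Optional[str]:
    """Return the one extension used by files in `dir_`, or None if it has no files.

    Hypothetical helper: the name and the collection logic are assumptions.
    """
    extensions = {
        os.path.splitext(name)[1].lstrip(".")
        for name in os.listdir(dir_)
        if os.path.isfile(os.path.join(dir_, name))
    }
    # This check mirrors the diff: mixed batch formats are treated as an error.
    if len(extensions) > 1:
        raise GalileoException(
            f"Multiple file extensions found in {dir_}: {extensions}"
        )
    return next(iter(extensions), None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("0.arrow", "1.arrow"):
            open(os.path.join(tmp, name), "w").close()
        print(get_single_extension(tmp))
```

Failing loudly on mixed extensions matches the intent stated above: the helper serves one use case (combining batches saved as either hdf5 or arrow), so a mixed folder signals a bug upstream rather than something to handle silently.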

@@ -67,6 +68,7 @@ class Seq2SeqDataLogger(BaseGalileoDataLogger):

    __logger_name__ = "seq2seq"
    logger_config = seq2seq_logger_config
    DATA_FOLDER_EXTENSION = {"emb": "hdf5", "prob": "hdf5", "data": "arrow"}
Member

Where does this get used?

Contributor Author

this is the file format we save for the files uploaded to the root bucket in minio
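
As a rough illustration of the answer above, the mapping can be read as "data folder name to upload file extension". The dictionary is copied from the diff; the helper `upload_file_name` and the `<folder>.<ext>` naming are assumptions, chosen only because the PR description refers to a `data.arrow` file.

```python
# Copied from the diff: one extension per data folder.
DATA_FOLDER_EXTENSION = {"emb": "hdf5", "prob": "hdf5", "data": "arrow"}


def upload_file_name(data_folder: str) -> str:
    """Hypothetical helper: build the name of the combined file for one folder."""
    # A KeyError here means the folder is not one the logger knows about.
    return f"{data_folder}.{DATA_FOLDER_EXTENSION[data_folder]}"


if __name__ == "__main__":
    for folder in DATA_FOLDER_EXTENSION:
        print(upload_file_name(folder))
```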

Comment on lines +197 to +199
    def separate_dataframe(
        cls, df: DataFrame, prob_only: bool = True, split: Optional[str] = None
    ) -> BaseLoggerDataFrames:
Member

Nice. I like that we split these up.

@elboy3 elboy3 merged commit a726fe8 into main Jul 20, 2023
@elboy3 elboy3 deleted the fix/seq2seq-finish branch July 20, 2023 00:02
bogdan-galileo pushed a commit that referenced this pull request Jul 21, 2023
https://app.shortcut.com/galileo/story/6733/dq-seq2seq-finish-and-upload-support

Update the `.finish()` flow to combine and upload the data.arrow file
for Seq2Seq

We pin `transformers==4.30.0` due to a new error introduced with 4.31.0
here:
https://github.com/rungalileo/dataquality/actions/runs/5603321961/jobs/10249775309?pr=716

Franz is looking into a fix and then we will unpin transformers