Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: auto alpaca #778

Merged
merged 3 commits into from
Oct 20, 2023
Merged

fix: auto alpaca #778

merged 3 commits into from
Oct 20, 2023

Conversation

elboy3
Copy link
Contributor

@elboy3 elboy3 commented Oct 19, 2023

We were running into some column renaming issues that I fix in this PR

I also made some edits to allow the user to pass in an upper limit to the dataset size that is configurable in DatasetConfig

A HF dataset has a default size limit that can be upped by the user

@codecov-commenter
Copy link

codecov-commenter commented Oct 19, 2023

Codecov Report

Merging #778 (526c32d) into main (672c905) will increase coverage by 0.01%.
The diff coverage is 27.27%.

@@            Coverage Diff             @@
##             main     #778      +/-   ##
==========================================
+ Coverage   87.68%   87.70%   +0.01%     
==========================================
  Files         184      184              
  Lines       15092    15127      +35     
==========================================
+ Hits        13234    13267      +33     
- Misses       1858     1860       +2     
Files Coverage Δ
dataquality/dq_auto/base_data_manager.py 100.00% <ø> (ø)
dataquality/dq_auto/schema.py 93.10% <100.00%> (+93.10%) ⬆️
dataquality/integrations/seq2seq/formatter.py 81.25% <100.00%> (-5.42%) ⬇️
dataquality/integrations/seq2seq/auto.py 0.00% <0.00%> (ø)
dataquality/utils/auto.py 81.19% <12.50%> (-10.89%) ⬇️

... and 9 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@elboy3 elboy3 marked this pull request as ready for review October 19, 2023 23:04
@elboy3 elboy3 requested review from dcaustin33 and a team as code owners October 19, 2023 23:04
Copy link
Member

@setu4993 setu4993 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creds in notebook again...

Copy link
Member

@setu4993 setu4993 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Left a suggestion.

Let's clean up the notebook and creds before merging.

@@ -46,6 +48,9 @@ class BaseAutoDatasetConfig:
# Column names
input_col: str = "text"
target_col: str = "label"
# Dataset input / output formatter
max_train_size: Optional[int] = None
formatter: BaseFormatter = DefaultFormatter()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a more idiomatic way of initializing for a dataclass:

Suggested change
formatter: BaseFormatter = DefaultFormatter()
from dataclasses import field
...
formatter: BaseFormatter = field(default_factory=DefaultFormatter)

bump version

docstring

make max train size default to none

remove notebook
@elboy3 elboy3 merged commit 6c77a8c into main Oct 20, 2023
@elboy3 elboy3 deleted the fix/auto-alpaca branch October 20, 2023 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants