fix: auto alpaca #778

elboy3 · 2023-10-19T22:07:43Z

We were running into some column renaming issues, which this PR fixes.

I also made some edits so the user can pass an upper limit on the dataset size, configurable through `max_train_size` in DatasetConfig.

An HF dataset has a default size limit that the user can raise.
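
As a rough illustration of the size cap described above, here is a minimal sketch. `DatasetConfigSketch` and `cap_train_rows` are hypothetical names made up for this example, not the library's API; only the `max_train_size: Optional[int] = None` field mirrors the diff in this PR. The default limit for HF datasets is not modeled, since its value is not stated here, so `None` simply leaves the rows untouched in this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetConfigSketch:
    # Hypothetical stand-in for the real config class; it carries only
    # the field added in this PR.
    max_train_size: Optional[int] = None


def cap_train_rows(rows: List[dict], config: DatasetConfigSketch) -> List[dict]:
    """Return at most config.max_train_size rows (hypothetical helper)."""
    if config.max_train_size is None:
        # Assumption for this sketch: no user-supplied limit means no truncation.
        return rows
    return rows[: config.max_train_size]


rows = [{"text": f"example {i}", "label": str(i % 2)} for i in range(10)]
capped = cap_train_rows(rows, DatasetConfigSketch(max_train_size=3))
uncapped = cap_train_rows(rows, DatasetConfigSketch())
print(len(capped), len(uncapped))
```

With a real HF `Dataset`, the same idea would likely use the dataset's own row selection rather than list slicing.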

codecov-commenter · 2023-10-19T22:16:19Z

Codecov Report

Merging #778 (526c32d) into main (672c905) will increase coverage by 0.01%.
The diff coverage is 27.27%.

@@            Coverage Diff             @@
##             main     #778      +/-   ##
==========================================
+ Coverage   87.68%   87.70%   +0.01%     
==========================================
  Files         184      184              
  Lines       15092    15127      +35     
==========================================
+ Hits        13234    13267      +33     
- Misses       1858     1860       +2

Files	Coverage Δ
dataquality/dq_auto/base_data_manager.py	`100.00% <ø> (ø)`
dataquality/dq_auto/schema.py	`93.10% <100.00%> (+93.10%)`	⬆️
dataquality/integrations/seq2seq/formatter.py	`81.25% <100.00%> (-5.42%)`	⬇️
dataquality/integrations/seq2seq/auto.py	`0.00% <0.00%> (ø)`
dataquality/utils/auto.py	`81.19% <12.50%> (-10.89%)`	⬇️

... and 9 files with indirect coverage changes


setu4993

Creds in notebook again...

setu4993

This looks good. Left a suggestion.

Let's clean up the notebook and creds before merging.

setu4993 · 2023-10-19T23:35:07Z

dataquality/dq_auto/schema.py

@@ -46,6 +48,9 @@ class BaseAutoDatasetConfig:
    # Column names
    input_col: str = "text"
    target_col: str = "label"
+    # Dataset input / output formatter
+    max_train_size: Optional[int] = None
+    formatter: BaseFormatter = DefaultFormatter()


This is a more idiomatic way of initializing a default for a dataclass field:

Suggested change

-    formatter: BaseFormatter = DefaultFormatter()
+from dataclasses import field
+...
+    formatter: BaseFormatter = field(default_factory=DefaultFormatter)
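
To make the difference concrete, here is a small self-contained sketch. The `BaseFormatter` and `DefaultFormatter` classes below are simplified, hypothetical stand-ins, not the library's real classes, and the two config classes exist only for the comparison. It shows that a default written as `DefaultFormatter()` is evaluated once and shared by every instance, while `default_factory` builds a fresh formatter per instance.

```python
from dataclasses import dataclass, field
from typing import Optional


class BaseFormatter:
    """Hypothetical stand-in for the library's BaseFormatter."""

    def __init__(self) -> None:
        self.seen: list = []


class DefaultFormatter(BaseFormatter):
    """Hypothetical stand-in for the library's DefaultFormatter."""


@dataclass
class SharedDefaultConfig:
    # The default expression runs once, when the class body is evaluated,
    # so every instance holds the same formatter object.
    formatter: BaseFormatter = DefaultFormatter()


@dataclass
class FactoryDefaultConfig:
    max_train_size: Optional[int] = None
    # default_factory is called on each instantiation.
    formatter: BaseFormatter = field(default_factory=DefaultFormatter)


shared_a, shared_b = SharedDefaultConfig(), SharedDefaultConfig()
factory_a, factory_b = FactoryDefaultConfig(), FactoryDefaultConfig()

same_shared = shared_a.formatter is shared_b.formatter
same_factory = factory_a.formatter is factory_b.formatter
print(same_shared, same_factory)  # True False
```

If the real formatter is itself a dataclass with the default `eq=True`, its instances are unhashable, and Python 3.11+ rejects such an instance as a field default with a `ValueError`, which is another reason to prefer `default_factory`.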


elboy3 marked this pull request as ready for review October 19, 2023 23:04

elboy3 requested review from dcaustin33 and a team as code owners October 19, 2023 23:04

setu4993 requested changes Oct 19, 2023


setu4993 reviewed Oct 19, 2023


elboy3 force-pushed the fix/auto-alpaca branch from 1d37f19 to 9695326 Compare October 20, 2023 04:32

fix: auto alpaca

50e0a87

bump version docstring make max train size default to none remove notebook

elboy3 force-pushed the fix/auto-alpaca branch from 9695326 to 50e0a87 Compare October 20, 2023 04:32

setu4993 approved these changes Oct 20, 2023


elboy3 added 2 commits October 20, 2023 09:40

default formatter

526c32d

fix splits

03cf7d2

elboy3 merged commit 6c77a8c into main Oct 20, 2023

elboy3 deleted the fix/auto-alpaca branch October 20, 2023 14:57
