feat: s2s auto chat support #779
Conversation
# Add validation data if missing, add 'id' column
dd, dataset_config = self._validate_dataset_dict(dd, []), dataset_config
# Apply the datasets custom formatter on load dataset dict
col_names = (
    dd[Split.train].column_names
    if dataset_config.formatter.remove_columns
    else []
)
dd = dd.map(
    dataset_config.formatter.format_batch,
    batched=True,
    remove_columns=col_names,
    with_indices=True,
)
dd, dataset_config = self._validate_dataset_dict(dd, []), dataset_config
will clean this up before we merge
Nice nice! I think this looks pretty clean to me though. Is the assumption that you could also be formatting e.g. with the Alpaca formatter?
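For context, a minimal sketch of a batched formatter that would plug into the dd.map call quoted above (batched=True, with_indices=True); the AlpacaFormatter name and the instruction/input/output column names are illustrative assumptions, not the actual implementation:

from typing import Dict, List

class AlpacaFormatter:
    remove_columns = True

    def format_batch(self, batch: Dict[str, List], idxs: List[int]) -> Dict[str, List]:
        # Collapse instruction + optional input into a single prompt per row
        prompts = [
            f"{inst}\n\n{inp}" if inp else inst
            for inst, inp in zip(batch["instruction"], batch["input"])
        ]
        return {"id": idxs, "text": prompts, "output": batch["output"]}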
metadata = sample.get(self.metadata_col, {})
sample_cols = [
    col
    for col in sample.keys()
    if col not in [self.metadata_col, self.turns_col]
]
for col in sample_cols:
    metadata[col] = sample[col]
unraveled_turns = unraveled_turns | metadata
will also clean this up
Codecov Report
@@            Coverage Diff             @@
##             main     #779      +/-   ##
==========================================
- Coverage   87.72%   87.41%   -0.32%
==========================================
  Files         184      186       +2
  Lines       15127    15195      +68
==========================================
+ Hits        13270    13282      +12
- Misses       1857     1913      +56

... and 3 files with indirect coverage changes
@@ -0,0 +1,19 @@
from typing import Dict, Type
I never quite understand when to put things in the __init__?
things that are common to all the formatters, like this map!
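For reference, a minimal sketch of the kind of map this refers to, e.g. in a formatters/__init__.py; the BaseFormatter, AlpacaFormatter, and ChatFormatter names and module paths are assumptions for illustration only:

from typing import Dict, Type

from .base import BaseFormatter      # hypothetical module layout
from .alpaca import AlpacaFormatter  # hypothetical
from .chat import ChatFormatter      # hypothetical

# Registry shared by all formatters, keyed by config name
FORMATTER_MAP: Dict[str, Type[BaseFormatter]] = {
    "alpaca": AlpacaFormatter,
    "chat": ChatFormatter,
}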
target_col: str = "output"
max_train_size: int = 1000

def format_sample(self, sample: Dict[str, str], idx: int) -> Dict[str, str]:
What is idx used for?
Should this be made optional?
max_train_size: Optional[int] = None
remove_columns: bool = False

def format_batch(self, batch: Dict, idxs: List[int]) -> Dict[str, List]:
This approach doesn't seem bad! There may be a bit of extra data copying, but for a working solution this seems fine.
My only thought is you could have a wrapper data class something like:
@dataclass
class DataBatch:
    batch: Dict
    active_row: int = 0
    ...

    def get(self, key: str):
        return self.batch[key][self.active_row]
    ...
This could at least help avoid some of the copying involved in creating the sample. You would still need to return a Dict and add to the result I guess.
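A rough usage sketch of that wrapper inside a batched formatter, assuming the DataBatch dataclass above, the typing imports shown earlier, and illustrative "input"/"target" column names:

def format_batch(self, batch: Dict, idxs: List[int]) -> Dict[str, List]:
    result: Dict[str, List] = {"text": [], "output": []}
    data = DataBatch(batch=batch)
    for i, _ in enumerate(idxs):
        data.active_row = i
        # Read columns for the current row without copying the whole sample
        result["text"].append(data.get("input"))
        result["output"].append(data.get("target"))
    return result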
Awesome 🚢 !! I left just some small comments. On to the chat with history now :)
)
# We must re-add the id column if it's been dropped
dd = self._validate_dataset_dict(dd, [])
return dd, dataset_config
Beautiful!
@dataclass
class BatchData:
Nice 👍
# Add sample level metadata
turn_data.update(metadata)
for k, v in turn_data.items():
    # NOTE: When we drop p3.8 we can use 'turn_data |= turn_meta'
This comment maybe is meant to be above
Left a few questions / notes. Feel free to take it or leave it!
max_train_sz = (
    dataset_config.max_train_size or dataset_config.formatter.max_train_size
)
max_train_sz = max_train_size or dataset_config.formatter.max_train_size
Seeing that max_train_size is optional even for formatters, isn't it possible we'd end up with None here?
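If both can be None, one possible guard would be a final fallback along these lines (DEFAULT_MAX_TRAIN_SIZE is a made-up name, purely to illustrate):

DEFAULT_MAX_TRAIN_SIZE = 1000  # hypothetical fallback

max_train_sz = (
    max_train_size
    or dataset_config.formatter.max_train_size
    or DEFAULT_MAX_TRAIN_SIZE
)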
# Add metadata to each turn
turn_meta = {
    f"{role}_{col}": turn[col]
    for col in turn.keys()
    if col not in turn_default_cols
    and isinstance(turn[col], valid_meta_types)
}
# Add turn level metadata to turn
# NOTE: When we drop p3.8 we can use 'turn_data |= turn_meta'
turn_data.update(turn_meta)
Not seeing this in the docstring example above?
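For anyone following along, a small illustration of what the turn-level metadata extraction above produces, assuming a hypothetical turn dict (field names are illustrative only):

turn = {"role": "user", "content": "hi", "rating": 5, "tags": ["a"]}
turn_default_cols = {"role", "content"}
valid_meta_types = (str, int, float, bool)

role = turn["role"]
turn_meta = {
    f"{role}_{col}": turn[col]
    for col in turn.keys()
    if col not in turn_default_cols and isinstance(turn[col], valid_meta_types)
}
# turn_meta == {"user_rating": 5}; "tags" is skipped since a list is not a valid meta type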
# Reset turn data
turn_data = {}
Why only under the elif?
# Add validation data if missing, add 'id' column
dd = self._validate_dataset_dict(dd, [])
formatter = dataset_config.formatter
if formatter.process_batch:
When would this be false?
    # We must re-add the id column if it's been dropped
    dd = self._validate_dataset_dict(dd, [])
else:
    dd = dd.map(formatter.format_sample, remove_columns=formatter.remove_cols)
It'd be nice to just have a wrapper method format_batch for all formatters that operates on the batch level, so we don't have to if-else it this way.
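Something like the following base-class default is the idea; this is only a sketch, assuming a BaseFormatter class and the typing imports shown earlier, not the actual code:

class BaseFormatter:
    def format_sample(self, sample: Dict[str, str], idx: int) -> Dict[str, str]:
        raise NotImplementedError

    def format_batch(self, batch: Dict, idxs: List[int]) -> Dict[str, List]:
        # Default batch wrapper: apply format_sample row by row, so callers can
        # always map at the batch level regardless of the concrete formatter
        result: Dict[str, List] = {}
        for i, idx in enumerate(idxs):
            sample = {col: vals[i] for col, vals in batch.items()}
            for col, val in self.format_sample(sample, idx).items():
                result.setdefault(col, []).append(val)
        return result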
V1 of auto chat support. This just unravels the turns, with the input as the user message and the target as the assistant message.
V2 will include some of the chat history.
https://app.shortcut.com/galileo/story/8388/dq-support-basic-chat-models
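As a rough illustration of the V1 unraveling (the column names and turn structure here are assumptions, not the exact schema):

# One chat sample with a list of turns ...
sample = {
    "turns": [
        {"role": "user", "content": "What's 2+2?"},
        {"role": "assistant", "content": "4"},
    ]
}

# ... becomes one completion-style row per user/assistant exchange:
unraveled = [
    {"input": "What's 2+2?", "target": "4"},
]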
Did e2e tests for:
✅ chat data (link)
✅ auto with alpaca (link)
✅ completion dataset