[Chatllama] Add multiple sources for generating synthetic data #221

diegofiori · 2023-03-08T13:37:34Z

Description

Currently, chatllama supports synthetic data generation only from OpenAI’s davinci-003, both for conversations and for scores.

To avoid huge costs while generating data, we should support other API models (such as the cheaper gpt-3.5-turbo), other API providers, and local models (Flan T5 seems a good candidate).
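
One way this could be structured is a small registry that hides each source (API model, other provider, local model) behind the same prompt-to-text signature. This is only a sketch: `register_backend`, `generate`, and the `echo` stub are hypothetical names for illustration, not existing chatllama code, and no real API client is called.

```python
from typing import Callable, Dict

# Hypothetical type: every backend maps a prompt string to a completion string.
GeneratorFn = Callable[[str], str]

# Hypothetical registry of available generation backends, keyed by name.
_BACKENDS: Dict[str, GeneratorFn] = {}


def register_backend(name: str, fn: GeneratorFn) -> None:
    """Register a text-generation backend under a name."""
    _BACKENDS[name] = fn


def generate(backend: str, prompt: str) -> str:
    """Dispatch a prompt to the selected backend."""
    if backend not in _BACKENDS:
        raise ValueError(
            f"Unknown backend {backend!r}; available: {sorted(_BACKENDS)}"
        )
    return _BACKENDS[backend](prompt)


# Stub backend used only to demonstrate the dispatch. A real backend would
# wrap an API client or a local model behind the same signature.
register_backend("echo", lambda prompt: f"[echo] {prompt}")

if __name__ == "__main__":
    print(generate("echo", "Hello"))
```

With this shape, the generation scripts would only depend on `generate`, and adding a provider would mean registering one more function.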

Furthermore, to generate more diverse data, it would help to support multiple prompt templates during generation.
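
A minimal sketch of what multi-template generation could look like, assuming each template has a single `$topic` placeholder. The template texts, `TEMPLATES`, and `build_prompts` are hypothetical examples, not the templates chatllama ships.

```python
import random
from string import Template
from typing import List, Sequence

# Hypothetical example templates; in practice these would be user-provided.
TEMPLATES = [
    Template("Write a conversation between a user and an assistant about $topic."),
    Template("Simulate a dialogue in which a user asks an assistant for help with $topic."),
    Template("Generate a question-and-answer exchange about $topic."),
]


def build_prompts(
    topics: Sequence[str], templates: Sequence[Template], seed: int = 0
) -> List[str]:
    """Pick one template at random per topic and fill in the topic."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return [rng.choice(templates).substitute(topic=topic) for topic in topics]


if __name__ == "__main__":
    for prompt in build_prompts(["unit testing", "sorting algorithms"], TEMPLATES):
        print(prompt)
```

Sampling a template per example, rather than using one template per run, is a design choice that spreads the variation across the whole dataset.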

TODO

Add support for gpt-3.5-turbo, separately from the LangChain models.
Add a preview of the costs associated with the API models (i.e. n_words / 0.75 * API_cost_per_token) before proceeding with the labelling.
Modify the LangChain-based script to support multiple API models and providers.
Add support for HF models to perform the generation task.
Allow users to specify multiple templates, customisable to their needs, when generating synthetic data.
Provide multiple template examples for dataset generation.
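
The cost preview in the list above could be sketched as follows, using the same heuristic (n_words / 0.75 tokens). The function name `estimate_cost` and the price used in the example are placeholders for illustration, not real API prices.

```python
from typing import Iterable

# Heuristic from the TODO item above: roughly 0.75 words per token.
WORDS_PER_TOKEN = 0.75


def estimate_cost(texts: Iterable[str], cost_per_token: float) -> float:
    """Estimate the API cost as n_words / 0.75 * cost_per_token."""
    n_words = sum(len(text.split()) for text in texts)
    n_tokens = n_words / WORDS_PER_TOKEN
    return n_tokens * cost_per_token


if __name__ == "__main__":
    prompts = ["Rate this answer.", "Write a short conversation about testing."]
    # Placeholder price, NOT a real one: look up the provider's current pricing.
    hypothetical_cost_per_token = 0.00001
    cost = estimate_cost(prompts, hypothetical_cost_per_token)
    print(f"Estimated cost: ${cost:.5f}")
```

This only counts the words passed in. Completion tokens are also billed, so a real preview would need an assumption about the expected output length as well.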


pengwei-iie · 2023-03-31T08:51:37Z

Hi, did you add support for HF models in dataset generation? It seems only OpenAI’s davinci-003 is used, at line 21 of generate_rewards.py.

diegofiori added the chatllama Issue related to the ChatLLaMA module label Mar 8, 2023

nebuly-ai moved this to Requested Features in ChatLLaMA Roadmap Mar 8, 2023

nebuly-ai added this to ChatLLaMA Roadmap Mar 8, 2023

diegofiori added the good first issue Good for newcomers label Mar 9, 2023

PierpaoloSorbellini changed the title ~~Add multiple sources for generating synthetic data~~ [Chatllama] Add multiple sources for generating synthetic data Mar 31, 2023

Linus-J mentioned this issue May 29, 2023

[ChatLLaMA] Add flan-t5-xl support for local and API model to generate synthetic reward_training_data scores #344

Open

