[Chatllama] Add multiple sources for generating synthetic data #221

Open
6 tasks
diegofiori opened this issue Mar 8, 2023 · 1 comment
Labels
chatllama (Issue related to the ChatLLaMA module), good first issue (Good for newcomers)

Comments

@diegofiori
Collaborator

diegofiori commented Mar 8, 2023

Description

Currently, chatllama supports synthetic data generation only from OpenAI’s davinci-003, both for conversations and for scores.

To avoid huge costs while generating data, we should support other API models (such as the cheaper gpt-3.5-turbo), other API providers, and local models (Flan-T5 seems a good candidate).

Furthermore, to generate more diverse data, it could be beneficial to use multiple prompt templates during generation.
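As a rough illustration of the multiple-template idea, here is a minimal sketch that pairs each topic with a randomly chosen prompt template. Everything in it is an assumption made for illustration: the template texts, the `{topic}` placeholder, and the `build_prompts` helper are hypothetical and are not part of the current chatllama code.

```python
import random

# Hypothetical templates; the {topic} placeholder name is an assumption.
TEMPLATES = [
    "Write a conversation between a user and an assistant about {topic}.",
    "Simulate a customer asking a chatbot for help with {topic}.",
    "Generate a short Q&A dialogue in which an expert explains {topic}.",
]


def build_prompts(topics, templates=TEMPLATES, seed=0):
    """Pair each topic with a randomly chosen template.

    A seeded random.Random instance keeps the choice reproducible
    across runs without touching the global random state.
    """
    rng = random.Random(seed)
    return [rng.choice(templates).format(topic=topic) for topic in topics]


if __name__ == "__main__":
    for prompt in build_prompts(["travel planning", "unit testing"]):
        print(prompt)
```

User-supplied templates could be passed through the `templates` argument, so the defaults only serve as examples.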

TODO

  • Add support for gpt-3.5-turbo, implemented outside the LangChain models.
  • Add a preview of the costs associated with the API models (i.e. n_words / 0.75 * API_cost_per_token) before proceeding with the labelling.
  • Modify the LangChain-based script to support multiple API models and providers.
  • Add support for HF models to perform the generation task.
  • Allow the user to specify multiple templates, customisable to their needs, when generating synthetic data.
  • Provide multiple template examples for dataset generation.
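
The cost preview from the TODO list could be sketched as follows. This is only a rough estimate under stated assumptions: it uses the issue's formula (n_words / 0.75 * API_cost_per_token), counts words by splitting on whitespace, and ignores the tokens of the model's completions. The function names and the price used in the example are hypothetical placeholders, not real API prices.

```python
def estimate_tokens(texts, words_per_token=0.75):
    """Approximate the token count as n_words / 0.75 (the issue's heuristic)."""
    n_words = sum(len(text.split()) for text in texts)
    return n_words / words_per_token


def estimate_cost(texts, cost_per_token, words_per_token=0.75):
    """Estimated cost = estimated tokens * cost per token."""
    return estimate_tokens(texts, words_per_token) * cost_per_token


if __name__ == "__main__":
    prompts = ["Write a short dialogue about unit testing."]
    # Placeholder price for illustration only; check the provider's pricing page.
    hypothetical_cost_per_token = 0.000002
    cost = estimate_cost(prompts, hypothetical_cost_per_token)
    print(f"Estimated cost for {len(prompts)} prompt(s): ${cost:.6f}")
```

A script could print this estimate and ask for confirmation before sending any request to the API.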
@diegofiori diegofiori added the chatllama Issue related to the ChatLLaMA module label Mar 8, 2023
@nebuly-ai nebuly-ai moved this to Requested Features in ChatLLaMA Roadmap Mar 8, 2023
@diegofiori diegofiori added the good first issue Good for newcomers label Mar 9, 2023
@pengwei-iie

Hi, did you add support for HF models in dataset generation? It seems only OpenAI’s davinci-003 is used, at line 21 of generate_rewards.py.

@PierpaoloSorbellini PierpaoloSorbellini changed the title Add multiple sources for generating synthetic data [Chatllama] Add multiple sources for generating synthetic data Mar 31, 2023
Projects
Status: Requested Features
Development

No branches or pull requests

2 participants