Data Format Compatibility in Ianvs Datasets #183

Open
IcyFeather233 opened this issue Feb 6, 2025 · 2 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@IcyFeather233
Contributor

Background

Ianvs datasets currently fall into two main categories:

  1. Pre-October 2024 Datasets: These datasets primarily consist of image data, organized in folders with an accompanying index file (like index.txt) that stores indexing information.

  2. Post-October 2024 Datasets: Following this date, the Ianvs community has expanded its research and adaptation to include LLM-related datasets. These datasets are predominantly in JSON or JSONL format, storing data information directly without relying on an index file to track data paths.
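To make the difference concrete, here is a minimal Python sketch of how the two layouts might be read. The index line layout (a sample path followed by whitespace-separated label paths) and both function names are assumptions for illustration, not the actual Ianvs readers.

```python
import json
from pathlib import Path


def read_index_file(index_path):
    """Read an index file.

    Assumption: each non-empty line is '<sample_path> <label_path> ...'.
    Returns a list of (sample_path, [label_paths]) tuples.
    """
    samples = []
    for line in Path(index_path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # First column is the sample path; remaining columns are labels.
        samples.append((parts[0], parts[1:]))
    return samples


def read_jsonl_file(jsonl_path):
    """Read a JSONL file: each non-empty line is one self-contained JSON record."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

The index reader only returns paths that still have to be resolved and loaded, while the JSONL reader returns the data itself, which is why a single reader cannot serve both layouts.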

The code paths for reading these two types of datasets differ significantly and are not interchangeable. We need a mechanism that identifies which data format is in use and dispatches to the matching reader.

Proposed Solution

I propose that we differentiate between these two formats using the dataset path fields in the configuration file. Specifically:

  • If the configuration uses train_index/test_index, it corresponds to the first category (image datasets).
  • If it uses train_url/test_url, it corresponds to the second category (LLM datasets).
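The rule above could be sketched roughly as follows. This is only an illustration: it assumes the dataset section of the configuration has already been loaded into a plain dict, and the function name and the returned strings are hypothetical, not the names used in the existing implementation.

```python
INDEX_KEYS = ("train_index", "test_index")
URL_KEYS = ("train_url", "test_url")


def detect_dataset_format(dataset_config):
    """Return 'index' for index-file datasets or 'json' for JSON/JSONL datasets.

    dataset_config is assumed to be a dict holding the dataset path fields.
    """
    has_index = any(key in dataset_config for key in INDEX_KEYS)
    has_url = any(key in dataset_config for key in URL_KEYS)

    if has_index and has_url:
        # Mixing both field families is ambiguous, so fail loudly.
        raise ValueError("use either *_index or *_url fields, not both")
    if has_index:
        return "index"
    if has_url:
        return "json"
    raise ValueError("no dataset path field found in configuration")
```

Rejecting a configuration that mixes both field families is a design choice made here for the sketch; the issue itself does not say how that case should be handled.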

This functionality has already been implemented. For more details, please refer to the proposal here: LLM Benchmarks Proposal.

Impact

  1. All existing projects utilizing image datasets will need to update their configuration files from train_url/test_url to train_index/test_index.
  2. Documentation will require corresponding updates to reflect these changes.
  3. The sedna.datasources library will need to introduce a new dataset type in the DatasetFormat code, and we should assess how to ensure compatibility with Sedna.
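For the first item, the field rename in existing image-dataset configurations could be scripted. The sketch below assumes the dataset section has been loaded into a plain dict; the helper name is hypothetical and it is not part of Ianvs.

```python
RENAMES = {"train_url": "train_index", "test_url": "test_index"}


def migrate_image_dataset_config(dataset_config):
    """Return a copy of dataset_config with *_url fields renamed to *_index.

    Only meant for image (index-file) datasets; other keys are kept as-is.
    """
    migrated = {}
    for key, value in dataset_config.items():
        new_key = RENAMES.get(key, key)
        if new_key in migrated or (new_key != key and new_key in dataset_config):
            # Avoid silently overwriting a field that already uses the new name.
            raise ValueError(f"conflicting fields for {new_key!r}")
        migrated[new_key] = value
    return migrated
```

The function returns a new dict rather than editing in place, so the original configuration stays available for comparison before anything is written back to disk.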
@IcyFeather233 IcyFeather233 added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 6, 2025
@IcyFeather233
Contributor Author

@MooreZheng cc

@MooreZheng
Collaborator

It is related to the cooperation between Ianvs and Sedna in KubeEdge SIG AI. The remaining issues include:

  1. What would be the best implementation? Is there any way to avoid using different fields for training data? @hsj576 @IcyFeather233
  2. Clarify whether the dataset source field is consistent with Sedna, and decide whether it is necessary to open an issue and PR in the Sedna repository. @tangming1996

I suggest discussing the above issues in a routine meeting, e.g., on 13th Feb. Does the time work for you? @tangming1996 @hsj576 @IcyFeather233
