Data Format Compatibility in Ianvs Datasets #183

IcyFeather233 · 2025-02-06T11:19:11Z

Background

Ianvs datasets currently fall into two main categories:

Pre-October 2024 Datasets: These datasets primarily consist of image data, organized in folders with an accompanying index file (like index.txt) that stores indexing information.
Post-October 2024 Datasets: Following this date, the Ianvs community has expanded its research and adaptation to include LLM-related datasets. These datasets are predominantly in JSON or JSONL format, storing data information directly without relying on an index file to track data paths.

The algorithms for reading these two types of datasets differ significantly and are not compatible with each other. We need a mechanism to actively identify which data format is being used and handle them accordingly.
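To make the incompatibility concrete, here is a minimal sketch of the two reading paths. It is not the Ianvs or Sedna implementation: the function names are hypothetical, and the index file layout (two whitespace-separated paths per line, sample then label) is an assumption for illustration only.

```python
import json
from pathlib import Path


def read_index_file(index_path):
    """Return (sample_path, label_path) pairs from an index file.

    Assumption: each non-empty line holds two whitespace-separated paths.
    The data itself still lives in the files those paths point to.
    """
    pairs = []
    for line in Path(index_path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def read_jsonl_file(jsonl_path):
    """Return one dict per non-empty line of a JSONL file.

    Here the records are stored directly in the file, with no path lookup.
    """
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

The first reader returns paths that still have to be resolved and loaded, while the second returns the records themselves, which is why a single loader cannot serve both without knowing the format up front.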

Proposed Solution

I propose that we differentiate between these two formats using the dataset path fields in the configuration file. Specifically:

If the configuration uses train_index/test_index, it corresponds to the first category (image datasets).
If it uses train_url/test_url, it corresponds to the second category (LLM datasets).

This functionality has already been implemented. For more details, please refer to the proposal here: LLM Benchmarks Proposal.
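As a rough illustration of the rule above (a sketch only, not the code from the linked proposal), the format could be inferred from which path fields appear in the dataset configuration. The function name, the returned labels, and the error handling are assumptions; only the four field names come from this issue.

```python
INDEX_KEYS = ("train_index", "test_index")  # first category: image datasets
URL_KEYS = ("train_url", "test_url")        # second category: LLM datasets


def detect_dataset_format(dataset_config):
    """Return "index" or "json" depending on the path fields present.

    The labels "index" and "json" are hypothetical names for this sketch.
    """
    has_index = any(key in dataset_config for key in INDEX_KEYS)
    has_url = any(key in dataset_config for key in URL_KEYS)

    if has_index and has_url:
        raise ValueError("config mixes *_index and *_url fields")
    if has_index:
        return "index"
    if has_url:
        return "json"
    raise ValueError("config has no train/test path fields")
```

For example, `detect_dataset_format({"train_index": "a.txt", "test_index": "b.txt"})` takes the index-file branch, while a configuration with `train_url`/`test_url` takes the JSON/JSONL branch. Rejecting mixed configurations is a design choice of this sketch, not something the issue specifies.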

Impact

All existing projects utilizing image datasets will need to update their configuration files from train_url/test_url to train_index/test_index.
Documentation will require corresponding updates to reflect these changes.
The sedna.datasources library will need to introduce a new dataset type in the DatasetFormat code, and we should assess how to ensure compatibility with Sedna.
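The configuration update in the first point amounts to a key rename. A minimal sketch, assuming the dataset section has already been loaded into a plain dict (the helper name is hypothetical, and it should only be applied to image dataset configurations):

```python
# Old field name -> new field name for index-file (image) datasets.
RENAMES = {"train_url": "train_index", "test_url": "test_index"}


def migrate_image_dataset_config(dataset_config):
    """Return a copy of the config with *_url path fields renamed to *_index.

    All other keys and all values are left unchanged.
    """
    return {RENAMES.get(key, key): value for key, value in dataset_config.items()}
```

LLM dataset configurations keep `train_url`/`test_url` and must not be passed through this helper.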


IcyFeather233 · 2025-02-06T11:19:23Z

@MooreZheng cc

MooreZheng · 2025-02-07T07:27:51Z

It is related to the cooperation between Ianvs and Sedna in KubeEdge SIG AI. The remaining issues include:

What would be the best implementation? Is there any way to avoid using different fields for training data? @hsj576 @IcyFeather233
Clarify whether the dataset source field is consistent with Sedna, and whether it is necessary to open an issue and PR against the Sedna codebase. @tangming1996

I suggest discussing the above issues in a routine meeting, e.g., on 13th Feb. Does the time work for you? @tangming1996 @hsj576 @IcyFeather233

IcyFeather233 added the kind/feature label Feb 6, 2025

AryanNanda17 mentioned this issue Feb 6, 2025

Documentation involving pcb-aoi singletask and incremental learning corrected #182

Merged

MooreZheng assigned MooreZheng, tangming1996 and hsj576 Feb 7, 2025
