Background
The Ianvs dataset is currently divided into two main categories:
Pre-October 2024 Datasets: These datasets primarily consist of image data, organized in folders with an accompanying index file (like index.txt) that stores indexing information.
Post-October 2024 Datasets: Following this date, the Ianvs community has expanded its research and adaptation to include LLM-related datasets. These datasets are predominantly in JSON or JSONL format, storing data information directly without relying on an index file to track data paths.
The algorithms for reading these two types of datasets differ significantly and are not compatible with each other. We need a mechanism to actively identify which data format is being used and handle them accordingly.
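To make the incompatibility concrete, here is a minimal sketch of the two reading paths. It is not the Ianvs implementation. The function names are hypothetical, and the index layout (whitespace-separated relative paths on each line) is an assumption, since the exact format of index.txt is not spelled out here.

```python
import json
import os


def read_index_entries(index_text, data_root):
    """Resolve each index line into full sample paths (assumed layout:
    whitespace-separated relative paths, one sample per line)."""
    entries = []
    for line in index_text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        entries.append([os.path.join(data_root, p) for p in parts])
    return entries


def read_jsonl_records(jsonl_text):
    """Parse one JSON object per non-empty line; the data is stored inline,
    so no path resolution is needed."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```

The first reader returns file paths that still have to be loaded, while the second returns the records themselves, which is why a single reading routine cannot serve both categories.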
Proposed Solution
I propose that we differentiate between these two formats using the dataset path fields in the configuration file. Specifically:
If the configuration uses train_index/test_index, it corresponds to the first category (image datasets).
If it uses train_url/test_url, it corresponds to the second category (LLM datasets).
This functionality has already been implemented. For more details, please refer to the proposal here: LLM Benchmarks Proposal.
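The field-based rule above could be sketched roughly as follows. This is an illustration only, not the code that was merged. The function name, the returned labels, and the choice to reject configurations that mix both field families are all assumptions.

```python
def detect_dataset_kind(dataset_config):
    """Classify a dataset config by which path fields it uses.

    Returns "index" for train_index/test_index (image datasets) and
    "inline" for train_url/test_url (LLM datasets). Both labels are
    hypothetical names chosen for this sketch.
    """
    index_keys = {"train_index", "test_index"}
    url_keys = {"train_url", "test_url"}
    has_index = bool(dataset_config.keys() & index_keys)
    has_url = bool(dataset_config.keys() & url_keys)
    if has_index and has_url:
        # Assumed policy: mixing the two field families is ambiguous.
        raise ValueError("config mixes *_index and *_url fields")
    if has_index:
        return "index"
    if has_url:
        return "inline"
    raise ValueError("config has no dataset path fields")


# Example with placeholder paths.
kind = detect_dataset_kind({"train_index": "./train/index.txt",
                            "test_index": "./test/index.txt"})
```

Whether mixed configurations should raise an error or fall back to one format is a design question the proposal leaves open.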
Impact
All existing projects utilizing image datasets will need to update their configuration files from train_url/test_url to train_index/test_index.
Documentation will require corresponding updates to reflect these changes.
The sedna.datasources library will need to introduce a new dataset type in the DatasetFormat code, and we should assess how to ensure compatibility with Sedna.
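For the first impact item, the configuration change amounts to a key rename. A rough sketch, assuming the dataset section has already been loaded into a dict; the helper name is hypothetical, and it should only be applied to image dataset configurations, since LLM datasets keep train_url/test_url:

```python
def migrate_image_dataset_config(dataset_config):
    """Return a copy with train_url/test_url renamed to
    train_index/test_index; all other fields are left unchanged."""
    renames = {"train_url": "train_index", "test_url": "test_index"}
    for old, new in renames.items():
        if old in dataset_config and new in dataset_config:
            # Assumed policy: refuse to guess which value should win.
            raise ValueError(f"config already contains both {old} and {new}")
    return {renames.get(key, key): value for key, value in dataset_config.items()}


# Example with placeholder paths.
old_config = {"train_url": "./train/index.txt", "test_url": "./test/index.txt"}
new_config = migrate_image_dataset_config(old_config)
```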
This is related to the cooperation between Ianvs and Sedna in KubeEdge SIG AI. The remaining issues include:
What would be the best implementation? Is there any way to avoid using different fields for training data? @hsj576 @IcyFeather233
Clarify whether the dataset source field is consistent with Sedna, and justify whether it is necessary to open an issue and a PR against the Sedna codebase. @tangming1996
I suggest discussing the above issues in a routine meeting, e.g., on 13th Feb. Does the time work for you? @tangming1996 @hsj576 @IcyFeather233