Skip to content

strwberry-smggls/format_transform

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 

Repository files navigation

Guide to creating the NYT, RCV1 and Yelp corpora in: Investigating Active Learning Sampling Strategies for Extreme Multi Label Text Classification (Fromme et. al. 2022).

Build NYT, RCV1 and Yelp datasets from source

We use train/test/val splits as given by the source dataset.

The required target data format is .csv with the following columns: text,label_1,label_2,...,label_n

"text" contains the input text, "label_i" is either 0 or 1.

In order to create the label columns, we first crawl all labels from the dataset and organize them in a hierarchy. Hierarchies are stored in a .csv with the following columns: parent_class,child_class

Please check https://drive.google.com/file/d/11ikCpcdnvGuZsnigTZkT3BC8Ab4Z7cYk/view?usp=sharing (arXiv dataset) or https://drive.google.com/file/d/1CM6ybPQk2ZKZWAiHen2gogoO3wDeUzjf/view?usp=sharing (EurLex dataset) for examples of file format.

We use only leaf classes - that means we omit all labels that occur on the left side as parents and remove them from the dataset. If a sentence will have 0 labels as a result of this process we remove it from the dataset entirely.

Build NYT, RCV1 and Yelp datasets from available splits

If you have the following file structures available or are willing to create them, you can also use the script format_transform_from_hierarchical.py to create the datasets. In any case, the script is included for your reference on how to build the datasets.

Required are the following:

train.csv test.csv and val.csv files in a folder, seperate for each dataset. Example file structure: NYT --train.csv --test.csv --val.csv

Files need to be in .csv format with the following fields: labels,text OR labels,title,abstract

labels in the "labels" field need to be separated by the "|" character: label1|label2|label3....

Splits are used as given in the datasets.

The Label hierarchy in a .csv file. This file is created by the corresponding script in https://github.com/morningmoni/HiLAP.

run the python script: python format_transform_from_hierarchical.py path/to/data-folder path/to/hierarchy-file --outname desired-output-name

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages