
2022 Statistical Data AI Utilization Competition hosted by Statistics Korea, GIST SCI LAB


donggunseo/SCI_Kostat2022


SCI_Kostat2022

2022 Statistical Data AI Utilization Competition, GIST SCI LAB

Team Members

| 서동건 (Team Leader) | 김주영 | 김주연 |
| --- | --- | --- |
| GIST EECS undergraduate student | GIST IIT combined MS/PhD student | GIST IIT MS student |
| Modeling, tuning | EDA, data cleaning | EDA, ML-based approach |

😀Directory

kostat
├── SCI_Kostat2022
│   ├── README.md
│   ├── create_kfold.py
│   ├── dataset.py
│   ├── inference.py
│   ├── model.py
│   ├── preprocess.py
│   ├── requirements.txt
│   ├── train_MD.py
│   ├── train_WC.py
│   └── utils.py
└── input
    ├── 1. 실습용자료.txt
    ├── 2. 모델개발용자료.txt
    ├── 답안 작성용 파일.csv
    └── 한국표준산업분류(10차)_국문.xlsx

😀Environment setting

Default Python version: 3.9.10

pip install -r requirements.txt

😀Train model(WC)

Uses the default AutoModelForSequenceClassification from Hugging Face transformers.

python train_WC.py --kfold 5
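The `--kfold 5` flag presumably sets the number of cross-validation folds, which matches fold0 to fold4 in the results. How create_kfold.py actually builds the folds is not shown in this README, so the following is only a minimal sketch of one common approach (a stratified split, which is an assumption). The function name `assign_folds` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def assign_folds(labels, k=5, seed=42):
    """Return a fold index (0..k-1) for every sample.

    Hypothetical helper: samples of each label are shuffled and dealt
    round-robin across folds, so every fold sees a similar label mix.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [None] * len(labels)
    for indices in by_label.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[idx] = position % k
    return folds


# Toy labels only; the real data lives in the input/ directory.
toy_labels = ["A"] * 10 + ["B"] * 5
toy_folds = assign_folds(toy_labels, k=5)
```

With this layout, fold `i` serves as the validation set for the model named `fold{i}` while the remaining folds are used for training.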

😀Train model(MD)

Uses a custom model with Multi-Dropout applied (implemented in model.py).

python train_MD.py --kfold 5
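model.py is not reproduced in this README. Assuming "Multi-Dropout" refers to multi-sample dropout (several dropout masks applied to the same pooled representation, with the resulting logits averaged), a classification head could be sketched as follows. The class name `MultiDropoutHead`, the number of dropout branches, and the dropout rate are all hypothetical and may differ from the repository's implementation.

```python
import torch
import torch.nn as nn


class MultiDropoutHead(nn.Module):
    """Hypothetical multi-sample dropout head (not the repo's model.py)."""

    def __init__(self, hidden_size, num_labels, num_dropouts=5, p=0.5):
        super().__init__()
        # Independent dropout layers; each draws its own mask in training.
        self.dropouts = nn.ModuleList(
            [nn.Dropout(p) for _ in range(num_dropouts)]
        )
        # A single classifier shared by all dropout branches.
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled):
        # pooled: (batch, hidden_size), e.g. the encoder's first-token output.
        logits = [self.classifier(dropout(pooled)) for dropout in self.dropouts]
        # Average the logits over the dropout branches.
        return torch.stack(logits, dim=0).mean(dim=0)
```

In evaluation mode the dropout layers are inactive, so the head reduces to a single linear classifier; the averaging only affects training.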

😀Inference

For each fold, choose the checkpoint with the best CV accuracy.
All hyperparameters used to obtain the results below are specified in the code.

CV accuracy (%) for each model

| Fold | WC | MD |
| --- | --- | --- |
| fold0 | 93.032 | 92.963 |
| fold1 | 92.965 | 92.954 |
| fold2 | 93.015 | 93.033 |
| fold3 | 92.954 | 92.991 |
| fold4 | 92.918 | 92.954 |
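The per-fold choice can also be derived from the CV table instead of being written by hand. The sketch below copies the accuracies from the table and reuses the checkpoint path patterns that appear in this README; the helper name `best_checkpoints` and the dictionary layout are hypothetical.

```python
# CV accuracy (%) per fold, copied from the table above.
cv_accuracy = {
    "WC": [93.032, 92.965, 93.015, 92.954, 92.918],
    "MD": [92.963, 92.954, 93.033, 92.991, 92.954],
}

# Checkpoint path patterns as they appear in this README.
checkpoint_pattern = {
    "WC": "../best_model/roberta_large_WC_fold{fold}",
    "MD": "../best_model/roberta_large_WC_MD_fold{fold}",
}


def best_checkpoints(cv_accuracy, checkpoint_pattern):
    """Pick, for every fold, the model with the highest CV accuracy."""
    n_folds = len(next(iter(cv_accuracy.values())))
    chosen = []
    for fold in range(n_folds):
        best_model = max(cv_accuracy, key=lambda name: cv_accuracy[name][fold])
        chosen.append(checkpoint_pattern[best_model].format(fold=fold))
    return chosen


model_checkpoint = best_checkpoints(cv_accuracy, checkpoint_pattern)
```

For the numbers in the table this selects WC for fold0 and fold1 and MD for fold2 to fold4.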
# Edit this checkpoint list according to your own CV results.
model_checkpoint = [f'../best_model/roberta_large_WC_fold{fold}' for fold in range(0, 2)]
model_checkpoint1 = [f'../best_model/roberta_large_WC_MD_fold{fold}' for fold in range(2, 5)]
model_checkpoint.extend(model_checkpoint1)
inference(model_checkpoint)

python inference.py
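This README does not state how inference.py combines the five checkpoints. One common option, shown here purely as an assumption, is soft voting: convert each checkpoint's logits to probabilities, average them, and predict the class with the highest mean probability. The function names `softmax` and `soft_vote` and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def soft_vote(per_model_logits):
    """Average class probabilities over models for a single sample.

    per_model_logits: one list of logits per checkpoint.
    Returns (predicted_class_index, averaged_probabilities).
    """
    n_models = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    averaged = [0.0] * n_classes
    for logits in per_model_logits:
        for class_index, prob in enumerate(softmax(logits)):
            averaged[class_index] += prob / n_models
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged


# Toy logits from three hypothetical checkpoints for one sample.
toy_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [0.2, 2.0, 0.1]]
predicted_class, mean_probs = soft_vote(toy_logits)
```

Averaging probabilities rather than raw logits keeps a single overconfident checkpoint from dominating the vote.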
