Huawei-project

This repository contains code for pre-training and fine-tuning BART for headline generation and summary generation. NLP course template is a summary of what I have done.

Usage

Notebook API

Check example notebook notebooks/Headline_generation.ipynb.

Config API

Here is a example for config API. All yo need to do is to change your bart config and parameters config, and then run train.py script.

#configs/bart/bart.yaml
bart:
  vocab_size: tokenizer.get_vocab_size() #you can leave these steps unchanged
  pad_token_id: tokenizer.token_to_id("<pad>")
  bos_token_id: tokenizer.token_to_id("<s>")
  eos_token_id: tokenizer.token_to_id("</s>")
  encoder_layers: 6
  decoder_layers: 6
  activation_function: gelu 
  dropout: 0.1
  #additionally, you can pass parameters like dim_size, embedding_size and etc...

#configs/parameters/parameters.yaml
checkpoints:
  save_path: model/checkpoints #path to checkpoint
  monitor: bleu #metric to monitor
  mode: max
  min_delta: 0.05
  patience: 2
  verbose: true

dataset:
  dataset_name: ria #ria, gazeta
  col_article: text #column name of the article
  col_summary: title #column name of the summary
  n_rows: 500000 #number of rows to take
  chunk_size: 10000
  test_size: 0.02

tokenizer:
  train: false #whether to train optimizer
  max_length: 750

model:
  finetune: # finetune or pretrain
    #additionally you can pass pretrained_path
    #pretrained_path: path of the pretrained model
    train_args:
      #parameters of the optimizer and trainer
      n_gpus: 1
      lr: 1e-5
      max_lr: 1e-4
      pct_start: 0.06
      num_epoch: 5
      batch_size: 8
      acc_step: 16
      max_length: ${tokenizer.max_length}

And then run it :

python src/train.py

Inference

You can download finetuned model using gdown or use your own.

gdown https://drive.google.com/uc?id=1-pl-7i9a7QZrTNybZvaicIVmlcUbagMH

Then you can generate summaries for the specific text using API.

from src.model.bart.finetune_model import generate_summary

#config - bart config
#path - path to the finetuned model
#text - text of the article
summary = generate_summary(config, path, text, max_length = 750, num_beams = 5)

Code structure

├── NLP_Course_Template.pdf
├── README.md
├── configs
│   ├── bart
│   │   └── bart.yaml
│   ├── config.yaml
│   └── parameters
│       └── parameters.yaml
├── data
│   ├── gazeta_test.jsonl
│   ├── gazeta_train.jsonl
│   ├── gazeta_val.jsonl
│   └── ria.json.gz
├── model
│   └── tokenizer
│       ├── merges.txt
│       └── vocab.json
├── notebooks
│   └── Headline_generation.ipynb
├── requirements.txt
└── src
   ├── __init__.py
   ├── inference.py
   ├── loaders
   │   ├── __init__.py
   │   ├── finetune_loader.py
   │   └── pretrain_loader.py
   ├── model
   │   ├── __init__.py
   │   └── bart
   │       ├── __init__.py
   │       ├── finetune_model.py
   │       └── pretrain_model.py
   ├── train.py
   └── utils
       ├── __init__.py
       ├── load_data.py
       └── tokenizer.py

Results

GAZETA dataset:

	BLEU	ROUGE-2	ROUGE-L
n lead rows	0.4341	0.1122	0.2264
SummaRunner	0.4145	0.1093	0.2229
BART	0.3851	0.0944	0.2085

RIA dataset:

	BLEU	ROUGE-2	ROUGE-L
n lead rows	0.2025	0.1013	0.1618
BART	0.5043	0.1999	0.3921

Svpressa dataset:

	BLEU	ROUGE-2	ROUGE-L
n lead rows	0.1563	0.0761	0.1488
BART	0.1872	0.0663	0.1692

Samples

predicted: большая часть детей, которых граждане сша пытались вывезти из гаити в доминиканской приют, не являются сиротами
reference: большинство детей, которых пытались увезти в сша из гаити, не сироты

predicted: тимошенко надеется, что в случае победы на выборах президента будет работать в ее команде премьер-министр украины
reference: луценко будет работать в команде тимошенко, если она победит в выборах

predicted: леверкузенский "байер" вернулся на первое место в чемпионате германии по футболу "фрайбург" 3:1"
reference: "байер" вернулся в лидеры чемпионата германии по футболу

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Huawei-project

Usage

Notebook API

Config API

Inference

Code structure

Results

Samples

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
configs		configs
data		data
model/tokenizer		model/tokenizer
notebooks		notebooks
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
NLP_Course_Template.pdf		NLP_Course_Template.pdf
README.md		README.md
requirements.txt		requirements.txt

License

alexvishnevskiy/Huawei-project

Folders and files

Latest commit

History

Repository files navigation

Huawei-project

Usage

Notebook API

Config API

Inference

Code structure

Results

Samples

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages