t2t transformer model #446

Closed · zll0000 opened this issue Nov 28, 2017 · 5 comments
zll0000 commented Nov 28, 2017

@schani Now I want to train a Transformer model on a Chinese-Japanese corpus of about 10 million sentence pairs. Here is what I did:

  1. Generate the training and dev data. I added my data in word2def.py, as follows:

"""Problem definition for word to dictionary definition."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import tarfile

from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import translate
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

import tensorflow as tf

EOS = text_encoder.EOS_ID

# English Word2def datasets (here reused for the Japanese-Chinese corpus).
# LOCATION_OF_DATA='/dnn4/dnn4_added/zhangxiaolei/env63zxlpy36tf14/'
_WORD2DEF_TRAIN_DATASETS = [[
    "/training-parallel-jp-ch.tgz",
    ("training/translate_train_jpch.jp", "training/translate_train_jpch.ch")
]]

_WORD2DEF_TEST_DATASETS = [[
    "/dev-parallel-jp-ch.tgz",
    ("dev/translate_dev_jpch.jp", "dev/translate_dev_jpch.ch")
]]


@registry.register_problem()
class word2def(translate.TranslateProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def targeted_vocab_size(self):
    return 2**16  # 65536

  @property
  def vocab_name(self):
    return "vocab.jpch"

  def generator(self, data_dir, tmp_dir, train):
    symbolizer_vocab = generator_utils.get_or_generate_vocab(
        data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size,
        _WORD2DEF_TRAIN_DATASETS)
    datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
    tag = "train" if train else "dev"
    data_path = translate.compile_data(tmp_dir, datasets, "wmt_jpch_tok_%s" % tag)
    return translate.token_generator(data_path + ".lang1", data_path + ".lang2",
                                     symbolizer_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR


@registry.register_hparams
def word2def_hparams():
  hparams = transformer.transformer_base_single_gpu()  # Or whatever you'd like to build off.
  hparams.batch_size = 1024
  return hparams
""" Problem definition for word to dictionary definition.
"""
from future import absolute_import
from future import division
from future import print_function

import os
import tarfile

from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators.translate import character_generator

from tensor2tensor.utils import registry
from tensor2tensor.models import transformer
from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import translate
import tensorflow as tf

EOS = text_encoder.EOS_ID

English Word2def datasets

#LOCATION_OF_DATA='/dnn4/dnn4_added/zhangxiaolei/env63zxlpy36tf14/'
_WORD2DEF_TRAIN_DATASETS = [["/training-parallel-jp-ch.tgz",("training/translate_train_jpch.jp","training/translate_train_jpch.ch")]]

_WORD2DEF_TEST_DATASETS = [["/dev-parallel-jp-ch.tgz",("dev/translate_dev_jpch.jp","dev/translate_dev_jpch.ch")]]

@registry.register_problem()
class word2def(translate.TranslateProblem):
"""Problem spec for English word to dictionary definition."""

@Property
def targeted_vocab_size(self):
return 2**16

@Property
def vocab_name(self):
return "vocab.jpch"

def generator(self, data_dir, tmp_dir, train):
symbolizer_vocab = generator_utils.get_or_generate_vocab(data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size,_WORD2DEF_TRAIN_DATASETS)
datasets = _WORD2DEF_TRAIN_DATASETS if train else WORD2DEF_TEST_DATASETS
tag = "train" if train else "dev"
data_path = translate.compile_data(tmp_dir, datasets,"wmt_jpch_tok
%s" % tag)
return translate.token_generator(data_path + ".lang1", data_path + ".lang2",symbolizer_vocab, EOS)

@Property
def input_space_id(self):
return problem.SpaceID.EN_CHR

@Property
def target_space_id(self):
return problem.SpaceID.EN_CHR

@registry.register_hparams
def word2def_hparams(self):
hparams = transformer.transformer_base_single_gpu() # Or whatever you'd like to build off.
hparams.batch_size = 1024
return hparams

  2. Create __init__.py in the same directory, as follows:
    # encoding: utf-8
    from . import word2def

  3. Generate the data:

PROBLEM=word2def
MODEL=transformer
HPARAMS=transformer_base_single_gpu

DATA_DIR=t2t_data
TMP_DIR=t2t_datagen
TRAIN_DIR=t2t_train/$PROBLEM/$MODEL-$HPARAMS
t2t-datagen --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR --problem=$PROBLEM --t2t_usr_dir=/dnn4/dnn4_added/zhangxiaolei/env150zxlpy36-980
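
If the usr dir is imported correctly, the problem ends up registered under the name word2def. A quick way to check this before running t2t-datagen is a short Python session (a sketch, not from the thread; it assumes you run it from the directory containing word2def.py, and relies on the lookup helpers in tensor2tensor's registry module):

# Sanity check (sketch): confirm the problem registered under the expected name.
from tensor2tensor.utils import registry

import word2def  # importing the module triggers @registry.register_problem

print("word2def" in registry.list_problems())  # expect True
prob = registry.problem("word2def")
print(prob.targeted_vocab_size)                # expect 65536 (2**16)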

Now I have a question: why is targeted_vocab_size 2**16 in the code, while the generated vocab file ends up with only about 47,000 subwords? How can I expand the vocabulary size?

martinpopel commented Nov 28, 2017

You need to increase the file_byte_budget.
Obviously, 1e6 bytes of your training data is not enough to get more than 47k subwords.
That said, I am not sure such a big vocabulary pays off: training and decoding are slower and it needs more memory (so you must use a smaller batch_size, which also seems to affect quality, see #444).
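
For context, here is a rough sketch (not the actual tensor2tensor code; the helper name is made up) of what a file byte budget means when the subword vocabulary is built: only about that many bytes are sampled from each training file, so with a budget of roughly 1e6 bytes the vocab builder never sees enough text to support a 65k-subword vocabulary, no matter how large the corpus is.

# Illustrative sketch only: a byte budget caps how much of each corpus file
# is read when collecting token counts for the subword vocabulary.
def sample_lines_for_vocab(filepath, file_byte_budget=1e6):
  budget = file_byte_budget
  with open(filepath, encoding="utf-8") as f:
    for line in f:
      if budget <= 0:
        break
      budget -= len(line)
      yield line.strip()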

zll0000 commented Nov 29, 2017

@martinpopel At first I did not change any parameters, I only added my dataset. However, after training the model, when I calculated the BLEU score the performance was very bad: the BLEU score is lower than a Groundhog translation model trained on the same dataset. I do not know what to do to improve the performance.
Can you give me some advice?
Thank you.

martinpopel commented

First, I see you use SpaceID.EN_CHR for both input and output, but you don't actually want character-based translation, you want subwords. I think for two-language translation (non-multitask) the SpaceID does not matter, but I am not sure.
My T2T know-how: start with a 32k vocabulary, and make sure the final min_count is not too low when building the subword vocab (otherwise increase file_byte_budget). Set the batch_size as high as possible (without hitting OOM, but keep some reserve), use transformer_big_single_gpu, and store checkpoints every hour (instead of every 10 minutes), which improves both the training speed and the final averaging. Check the training loss and test metric (approx_bleu or real BLEU) in TensorBoard. If the training diverges, increase learning_rate_warmup_steps and start again from scratch. Increase train_steps, e.g. to 1M - you can always kill the training when you see the BLEU curve is flat or even decreasing.
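
Applied to this thread's setup, that advice could be captured in a custom hparams set registered next to word2def_hparams in word2def.py above (a sketch; the concrete values here are illustrative assumptions to tune, not recommendations from this thread):

@registry.register_hparams
def word2def_big_hparams():
  # Start from the big single-GPU Transformer instead of the base one.
  hparams = transformer.transformer_big_single_gpu()
  hparams.batch_size = 4096                   # as large as fits without OOM
  hparams.learning_rate_warmup_steps = 16000  # raise this if training diverges
  return hparams

Training would then be launched with --hparams_set=word2def_big_hparams and a large --train_steps, watching approx_bleu in TensorBoard as described above.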

Finally, I am not sure this is the best place to discuss such general know-how. GitHub issues should be for reporting bugs, feature requests, or very specific questions.

zll0000 commented Nov 30, 2017

@martinpopel Thanks. If I want to increase the file_byte_budget, what should I do?

t2t-datagen --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR --problem=$PROBLEM --t2t_usr_dir=/dnn4/dnn4_added/zhangxiaolei/env63zxlpy36tf14 --hparams='file_byte_budget=10000000'

Is this right?

Do you have an account on https://gitter.im/tensor2tensor/Lobby?

twairball commented

No. You'll need to modify the file_size_budget yourself when making the vocab. See https://github.com/twairball/t2t_wmt_zhen/blob/master/data_generators/utils.py#L130 for an example.
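
At the time of this comment that meant editing the budget inside generator_utils (or keeping a local copy, as in the linked file). After the commit referenced below added the budget as an argument to get_or_generate_vocab, the problem's generator can pass it directly; a sketch of the call inside word2def.generator above (the keyword name here is an assumption - the commit message calls it file_size_budget, so check the signature in your tensor2tensor version):

    # Pass a ~10x larger byte budget when building the subword vocab:
    symbolizer_vocab = generator_utils.get_or_generate_vocab(
        data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size,
        _WORD2DEF_TRAIN_DATASETS, file_byte_budget=int(1e7))  # ~10 MB instead of ~1 MB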

twairball added a commit to twairball/tensor2tensor that referenced this issue Dec 5, 2017
Fix tensorflow#446 Added `file_size_budget` as argument to `get_or_generate_vocab`.
rsepassi pushed a commit that referenced this issue Jan 9, 2018
* Enhance WMT17 En-Zh task with full dataset.
Fix #446 Added `file_size_budget` as argument to `get_or_generate_vocab`.

* Made requested Fixes:
- Added TranslateEnzhWmt8k problem.
- Renamed to TranslateEnzhWmt32k, to reflect target vocab in problem name
- Added instructions for manually downloading full dataset.