
Difference between the PyTorch-converted pre-trained BERT parameters released on Google Drive and the ones obtained using the HuggingFace conversion script #1

Closed
todpole3 opened this issue Feb 14, 2019 · 7 comments



todpole3 commented Feb 14, 2019

I tried to get the PyTorch pre-trained BERT checkpoint using the conversion script provided by HuggingFace. The script executed without any problem and I was able to obtain a converted binary file.

However, I noticed a few differences between this file and the PyTorch-converted pre-trained BERT parameters released on Google Drive.

First, the two files use different variable naming. The HuggingFace-converted file prefixes every variable with bert., so SQLova cannot load it directly:

RuntimeError: Error(s) in loading state_dict for BertModel:

Missing key(s) in state_dict: "embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "embeddings.token_type_embeddings.weight", "embeddings.LayerNorm.gamma", "embeddings.LayerNorm.beta", "encoder.layer.0.attention.self.query.weight", "encoder.layer.0.attention.self.query.bias", "encoder.layer.0.attention.self.key.weight", "encoder.layer.0.attention.self.key.bias", "encoder.layer.0.attention.self.value.weight", "encoder.layer.0.attention.self.value.bias", "encoder.layer.0.attention.output.dense.weight", "encoder.layer.0.attention.output.dense.bias", "encoder.layer.0.attention.output.LayerNorm.gamma", "encoder.layer.0.attention.output.LayerNorm.beta", "encoder.layer.0.intermediate.dense.weight", "encoder.layer.0.intermediate.dense.bias", "encoder.layer.0.output.dense.weight", "encoder.layer.0.output.dense.bias", "encoder.layer.0.output.LayerNorm.gamma", "encoder.layer.0.output.LayerNorm.beta", "encoder.layer.1.attention.self.query.weight"...

Unexpected key(s) in state_dict: "bert.embeddings.word_embeddings.weight", "bert.embeddings.position_embeddings.weight", "bert.embeddings.token_type_embeddings.weight", "bert.embeddings.LayerNorm.weight", "bert.embeddings.LayerNorm.bias", "bert.encoder.layer.0.attention.self.query.weight", "bert.encoder.layer.0.attention.self.query.bias", "bert.encoder.layer.0.attention.self.key.weight", "bert.encoder.layer.0.attention.self.key.bias", "bert.encoder.layer.0.attention.self.value.weight", "bert.encoder.layer.0.attention.self.value.bias", "bert.encoder.layer.0.attention.output.dense.weight", "bert.encoder.layer.0.attention.output.dense.bias", "bert.encoder.layer.0.attention.output.LayerNorm.weight", "bert.encoder.layer.0.attention.output.LayerNorm.bias", "bert.encoder.layer.0.intermediate.dense.weight", "bert.encoder.layer.0.intermediate.dense.bias", "bert.encoder.layer.0.output.dense.weight", "bert.encoder.layer.0.output.dense.bias", "bert.encoder.layer.0.output.LayerNorm.weight", "bert.encoder.layer.0.output.LayerNorm.bias"...

I was able to map most variables between the two files by manipulating the names and verify their equivalence, but I could not find a mapping to the Google Drive release for the following tensors from the HuggingFace conversion, most of which relate to layer normalization:

bert.embeddings.LayerNorm.weight
bert.embeddings.LayerNorm.bias
bert.encoder.layer.0.attention.output.LayerNorm.weight
bert.encoder.layer.0.attention.output.LayerNorm.bias
bert.encoder.layer.0.output.LayerNorm.weight
bert.encoder.layer.0.output.LayerNorm.bias
bert.encoder.layer.1.attention.output.LayerNorm.weight
bert.encoder.layer.1.attention.output.LayerNorm.bias
bert.encoder.layer.1.output.LayerNorm.weight
bert.encoder.layer.1.output.LayerNorm.bias
bert.encoder.layer.2.attention.output.LayerNorm.weight
bert.encoder.layer.2.attention.output.LayerNorm.bias
bert.encoder.layer.2.output.LayerNorm.weight
bert.encoder.layer.2.output.LayerNorm.bias
bert.encoder.layer.3.attention.output.LayerNorm.weight
bert.encoder.layer.3.attention.output.LayerNorm.bias
bert.encoder.layer.3.output.LayerNorm.weight
bert.encoder.layer.3.output.LayerNorm.bias
bert.encoder.layer.4.attention.output.LayerNorm.weight
bert.encoder.layer.4.attention.output.LayerNorm.bias
bert.encoder.layer.4.output.LayerNorm.weight
bert.encoder.layer.4.output.LayerNorm.bias
bert.encoder.layer.5.attention.output.LayerNorm.weight
bert.encoder.layer.5.attention.output.LayerNorm.bias
bert.encoder.layer.5.output.LayerNorm.weight
bert.encoder.layer.5.output.LayerNorm.bias
bert.encoder.layer.6.attention.output.LayerNorm.weight
bert.encoder.layer.6.attention.output.LayerNorm.bias
bert.encoder.layer.6.output.LayerNorm.weight
bert.encoder.layer.6.output.LayerNorm.bias
bert.encoder.layer.7.attention.output.LayerNorm.weight
bert.encoder.layer.7.attention.output.LayerNorm.bias
bert.encoder.layer.7.output.LayerNorm.weight
bert.encoder.layer.7.output.LayerNorm.bias
bert.encoder.layer.8.attention.output.LayerNorm.weight
bert.encoder.layer.8.attention.output.LayerNorm.bias
bert.encoder.layer.8.output.LayerNorm.weight
bert.encoder.layer.8.output.LayerNorm.bias
bert.encoder.layer.9.attention.output.LayerNorm.weight
bert.encoder.layer.9.attention.output.LayerNorm.bias
bert.encoder.layer.9.output.LayerNorm.weight
bert.encoder.layer.9.output.LayerNorm.bias
bert.encoder.layer.10.attention.output.LayerNorm.weight
bert.encoder.layer.10.attention.output.LayerNorm.bias
bert.encoder.layer.10.output.LayerNorm.weight
bert.encoder.layer.10.output.LayerNorm.bias
bert.encoder.layer.11.attention.output.LayerNorm.weight
bert.encoder.layer.11.attention.output.LayerNorm.bias
bert.encoder.layer.11.output.LayerNorm.weight
bert.encoder.layer.11.output.LayerNorm.bias
cls.predictions.bias
cls.predictions.transform.dense.weight
cls.predictions.transform.dense.bias
cls.predictions.transform.LayerNorm.weight
cls.predictions.transform.LayerNorm.bias
cls.predictions.decoder.weight
cls.seq_relationship.weight
cls.seq_relationship.bias

Could you explain what causes the above differences? Was layer normalization removed from the BERT architecture on purpose? Thanks.
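
For anyone who wants to reproduce the comparison, a minimal sketch (the paths are placeholders, and it assumes both files are state dicts saved with torch.save):

import torch

# Placeholder paths for the two checkpoints being compared.
hf_sd = torch.load("pytorch_model_hf.bin", map_location="cpu")
gd_sd = torch.load("pytorch_model_gdrive.bin", map_location="cpu")

def strip_prefix(key):
    # Drop the leading "bert." so names line up with the Google Drive release.
    return key[len("bert."):] if key.startswith("bert.") else key

unmatched = sorted(k for k in hf_sd if strip_prefix(k) not in gd_sd)
print("HuggingFace tensors with no counterpart in the Google Drive file:")
print("\n".join(unmatched))

# Tensors whose names do line up can be checked for equality.
for k in hf_sd:
    name = strip_prefix(k)
    if name in gd_sd:
        assert torch.equal(hf_sd[k], gd_sd[name]), k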


whwang299 commented Feb 14, 2019

Hi @todpole3

I found that the HuggingFace team has changed the variable names in their "updated" code.

https://github.com/huggingface/pytorch-pretrained-BERT/blob/c35d9d48d91282f7b7776751fc5630b1af1d3b97/pytorch_pretrained_bert/modeling.py#L220-L233

Previously, self.weight in BERTLayerNorm was self.gamma and self.bias was self.beta.

See below.

sqlova/bert/modeling.py

Lines 131 to 138 in b7ce9ad

class BERTLayerNorm(nn.Module):
    def __init__(self, config, variance_epsilon=1e-12):
        """Construct a layernorm module in the TF style (epsilon inside the square root)."""
        super(BERTLayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(config.hidden_size))
        self.beta = nn.Parameter(torch.zeros(config.hidden_size))
        self.variance_epsilon = variance_epsilon
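
For context, the rest of that class computes the TF-style normalization with those parameters; a sketch of the accompanying forward pass:

    def forward(self, x):
        # Normalize over the hidden dimension, epsilon inside the square root.
        u = x.mean(-1, keepdim=True)
        s = (x - u).pow(2).mean(-1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.gamma * x + self.beta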

For compatibility, please use the old convert_tf_checkpoint_to_pytorch.py, which is what I used.
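
Alternatively, the renaming can be undone directly on the new-style checkpoint; a rough sketch with placeholder paths, which strips the bert. prefix, drops the unused pre-training heads, and restores the gamma/beta names:

import torch

sd = torch.load("pytorch_model_hf.bin", map_location="cpu")  # placeholder path
old_sd = {}
for key, tensor in sd.items():
    if key.startswith("cls."):
        continue  # pre-training heads; not part of BertModel
    if key.startswith("bert."):
        key = key[len("bert."):]
    key = (key.replace("LayerNorm.weight", "LayerNorm.gamma")
              .replace("LayerNorm.bias", "LayerNorm.beta"))
    old_sd[key] = tensor

torch.save(old_sd, "pytorch_model_old_style.bin")  # placeholder path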

Thanks!

Wonseok

@todpole3
Author

That makes a lot of sense. Thanks for the quick response!

whwang299 pushed a commit that referenced this issue Jun 6, 2019
@luckysunda

I am using everything from the repo itself but am still getting this error:

RuntimeError: Error(s) in loading state_dict for BertModel:
	Unexpected key(s) in state_dict: "encoder.layer.12.attention.self.query.weight", "encoder.layer.12.attention.self.query.bias", "encoder.layer.12.attention.self.key.weight", "encoder.layer.12.attention.self.key.bias", "encoder.layer.12.attention.self.value.weight", "encoder.layer.12.attention.self.value.bias", "encoder.layer.12.attention.output.dense.weight", "encoder.layer.12.attention.output.dense.bias", "encoder.layer.12.attention.output.LayerNorm.gamma", "encoder.layer.12.attention.output.LayerNorm.beta", "encoder.layer.12.intermediate.dense.weight", "encoder.layer.12.intermediate.dense.bias", "encoder.layer.12.output.dense.weight", "encoder.layer.12.output.dense.bias", "encoder.layer.12.output.LayerNorm.gamma", "encoder.layer.12.output.LayerNorm.beta", ... [the same sixteen keys repeat for encoder layers 13 through 23].
	size mismatch for embeddings.word_embeddings.weight: copying a param with shape torch.Size([30522, 1024]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
	size mismatch for embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 768]).
	size mismatch for embeddings.token_type_embeddings.weight: copying a param with shape torch.Size([2, 1024]) from checkpoint, the shape in current model is torch.Size([2, 768]).
	size mismatch for embeddings.LayerNorm.gamma: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for embeddings.LayerNorm.beta: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.self.query.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for encoder.layer.0.attention.self.query.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.self.key.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for encoder.layer.0.attention.self.key.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.self.value.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for encoder.layer.0.attention.self.value.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.output.dense.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for encoder.layer.0.attention.output.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.output.LayerNorm.gamma: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.output.LayerNorm.beta: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
	size mismatch for encoder.layer.0.intermediate.dense.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3072]).
	size mismatch for encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
	size mismatch for encoder.layer.0.output.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.output.LayerNorm.gamma: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.output.LayerNorm.beta: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	[identical size mismatches, 1024 vs. 768 and 4096 vs. 3072, repeat for encoder layers 1 through 11]
	size mismatch for pooler.dense.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for pooler.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).



What can be done?
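
The mismatched shapes (hidden size 1024 vs. 768, intermediate size 4096 vs. 3072, encoder layers running up to 23) indicate that a BERT-large checkpoint is being loaded into a BERT-base model. A quick way to check which variant a checkpoint file holds, as a rough sketch with a placeholder path:

import torch

sd = torch.load("pytorch_model.bin", map_location="cpu")  # placeholder path

# Keys may or may not carry a "bert." prefix depending on which
# script produced the file, so match on the key suffix.
emb_key = next(k for k in sd if k.endswith("embeddings.word_embeddings.weight"))
hidden_size = sd[emb_key].shape[1]

num_layers = 1 + max(int(k.split("layer.")[1].split(".")[0])
                     for k in sd if ".layer." in k)

# BERT-base: hidden_size=768, num_layers=12; BERT-large: 1024 and 24.
print(hidden_size, num_layers)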

@luckysunda

It got solved; I used a different BERT model.

@techwithshadab

> It got solved; I used a different BERT model.

Can you share that model? I'm facing the same issue as well.

@techwithshadab

> I found that the HuggingFace team has changed the variable names in their "updated" code. [...] For compatibility, please use the old convert_tf_checkpoint_to_pytorch.py, which is what I used.

Can we update those changes in this repo also?


Adherer commented Nov 6, 2019

mark
