
Difference between the PyTorch-converted pre-trained BERT parameters released on Google Drive and the ones obtained using the HuggingFace conversion script #1

Closed
todpole3 opened this issue Feb 14, 2019 · 7 comments



todpole3 commented Feb 14, 2019

I tried to get the PyTorch pre-trained BERT checkpoint using the conversion script provided by HuggingFace. The script executed without any problem and I was able to obtain a converted binary file.

However, I noticed a few differences between this file and the PyTorch-converted pre-trained BERT parameters released on Google Drive.

First, the two files use different variable naming. The HuggingFace-converted file prefixes every variable with bert., so SQLova cannot load it directly:

RuntimeError: Error(s) in loading state_dict for BertModel:

Missing key(s) in state_dict: "embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "embeddings.token_type_embeddings.weight", "embeddings.LayerNorm.gamma", "embeddings.LayerNorm.beta", "encoder.layer.0.attention.self.query.weight", "encoder.layer.0.attention.self.query.bias", "encoder.layer.0.attention.self.key.weight", "encoder.layer.0.attention.self.key.bias", "encoder.layer.0.attention.self.value.weight", "encoder.layer.0.attention.self.value.bias", "encoder.layer.0.attention.output.dense.weight", "encoder.layer.0.attention.output.dense.bias", "encoder.layer.0.attention.output.LayerNorm.gamma", "encoder.layer.0.attention.output.LayerNorm.beta", "encoder.layer.0.intermediate.dense.weight", "encoder.layer.0.intermediate.dense.bias", "encoder.layer.0.output.dense.weight", "encoder.layer.0.output.dense.bias", "encoder.layer.0.output.LayerNorm.gamma", "encoder.layer.0.output.LayerNorm.beta", "encoder.layer.1.attention.self.query.weight"...

Unexpected key(s) in state_dict: "bert.embeddings.word_embeddings.weight", "bert.embeddings.position_embeddings.weight", "bert.embeddings.token_type_embeddings.weight", "bert.embeddings.LayerNorm.weight", "bert.embeddings.LayerNorm.bias", "bert.encoder.layer.0.attention.self.query.weight", "bert.encoder.layer.0.attention.self.query.bias", "bert.encoder.layer.0.attention.self.key.weight", "bert.encoder.layer.0.attention.self.key.bias", "bert.encoder.layer.0.attention.self.value.weight", "bert.encoder.layer.0.attention.self.value.bias", "bert.encoder.layer.0.attention.output.dense.weight", "bert.encoder.layer.0.attention.output.dense.bias", "bert.encoder.layer.0.attention.output.LayerNorm.weight", "bert.encoder.layer.0.attention.output.LayerNorm.bias", "bert.encoder.layer.0.intermediate.dense.weight", "bert.encoder.layer.0.intermediate.dense.bias", "bert.encoder.layer.0.output.dense.weight", "bert.encoder.layer.0.output.dense.bias", "bert.encoder.layer.0.output.LayerNorm.weight", "bert.encoder.layer.0.output.LayerNorm.bias"...

I was able to map most variables between the two files by manipulating the names and verify their equivalence, but I could not find a mapping to the Google Drive release for the following tensors from the HuggingFace conversion, most of which relate to layer normalization:

bert.embeddings.LayerNorm.weight
bert.embeddings.LayerNorm.bias
bert.encoder.layer.0.attention.output.LayerNorm.weight
bert.encoder.layer.0.attention.output.LayerNorm.bias
bert.encoder.layer.0.output.LayerNorm.weight
bert.encoder.layer.0.output.LayerNorm.bias
bert.encoder.layer.1.attention.output.LayerNorm.weight
bert.encoder.layer.1.attention.output.LayerNorm.bias
bert.encoder.layer.1.output.LayerNorm.weight
bert.encoder.layer.1.output.LayerNorm.bias
bert.encoder.layer.2.attention.output.LayerNorm.weight
bert.encoder.layer.2.attention.output.LayerNorm.bias
bert.encoder.layer.2.output.LayerNorm.weight
bert.encoder.layer.2.output.LayerNorm.bias
bert.encoder.layer.3.attention.output.LayerNorm.weight
bert.encoder.layer.3.attention.output.LayerNorm.bias
bert.encoder.layer.3.output.LayerNorm.weight
bert.encoder.layer.3.output.LayerNorm.bias
bert.encoder.layer.4.attention.output.LayerNorm.weight
bert.encoder.layer.4.attention.output.LayerNorm.bias
bert.encoder.layer.4.output.LayerNorm.weight
bert.encoder.layer.4.output.LayerNorm.bias
bert.encoder.layer.5.attention.output.LayerNorm.weight
bert.encoder.layer.5.attention.output.LayerNorm.bias
bert.encoder.layer.5.output.LayerNorm.weight
bert.encoder.layer.5.output.LayerNorm.bias
bert.encoder.layer.6.attention.output.LayerNorm.weight
bert.encoder.layer.6.attention.output.LayerNorm.bias
bert.encoder.layer.6.output.LayerNorm.weight
bert.encoder.layer.6.output.LayerNorm.bias
bert.encoder.layer.7.attention.output.LayerNorm.weight
bert.encoder.layer.7.attention.output.LayerNorm.bias
bert.encoder.layer.7.output.LayerNorm.weight
bert.encoder.layer.7.output.LayerNorm.bias
bert.encoder.layer.8.attention.output.LayerNorm.weight
bert.encoder.layer.8.attention.output.LayerNorm.bias
bert.encoder.layer.8.output.LayerNorm.weight
bert.encoder.layer.8.output.LayerNorm.bias
bert.encoder.layer.9.attention.output.LayerNorm.weight
bert.encoder.layer.9.attention.output.LayerNorm.bias
bert.encoder.layer.9.output.LayerNorm.weight
bert.encoder.layer.9.output.LayerNorm.bias
bert.encoder.layer.10.attention.output.LayerNorm.weight
bert.encoder.layer.10.attention.output.LayerNorm.bias
bert.encoder.layer.10.output.LayerNorm.weight
bert.encoder.layer.10.output.LayerNorm.bias
bert.encoder.layer.11.attention.output.LayerNorm.weight
bert.encoder.layer.11.attention.output.LayerNorm.bias
bert.encoder.layer.11.output.LayerNorm.weight
bert.encoder.layer.11.output.LayerNorm.bias
cls.predictions.bias
cls.predictions.transform.dense.weight
cls.predictions.transform.dense.bias
cls.predictions.transform.LayerNorm.weight
cls.predictions.transform.LayerNorm.bias
cls.predictions.decoder.weight
cls.seq_relationship.weight
cls.seq_relationship.bias

Could you explain what causes the above differences? Was layer normalization removed from the BERT architecture on purpose? Thanks.
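
For anyone who wants to reproduce the comparison, a minimal sketch (the paths are placeholders, and it assumes both files are state dicts saved with torch.save):

import torch

# Placeholder paths for the two checkpoints being compared.
hf_sd = torch.load("pytorch_model_hf.bin", map_location="cpu")
gd_sd = torch.load("pytorch_model_gdrive.bin", map_location="cpu")

def strip_prefix(key):
    # Drop the leading "bert." so names line up with the Google Drive release.
    return key[len("bert."):] if key.startswith("bert.") else key

unmatched = sorted(k for k in hf_sd if strip_prefix(k) not in gd_sd)
print("HuggingFace tensors with no counterpart in the Google Drive file:")
print("\n".join(unmatched))

# Tensors whose names do line up can be checked for equality.
for k in hf_sd:
    name = strip_prefix(k)
    if name in gd_sd:
        assert torch.equal(hf_sd[k], gd_sd[name]), k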


whwang299 commented Feb 14, 2019

Hi @todpole3

I found that the HuggingFace team has changed the variable names in their "updated" code.

https://github.com/huggingface/pytorch-pretrained-BERT/blob/c35d9d48d91282f7b7776751fc5630b1af1d3b97/pytorch_pretrained_bert/modeling.py#L220-L233

Previously, self.weight in BERTLayerNorm was self.gamma and self.bias was self.beta.

See below.

sqlova/bert/modeling.py

Lines 131 to 138 in b7ce9ad

class BERTLayerNorm(nn.Module):
    def __init__(self, config, variance_epsilon=1e-12):
        """Construct a layernorm module in the TF style (epsilon inside the square root)."""
        super(BERTLayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(config.hidden_size))
        self.beta = nn.Parameter(torch.zeros(config.hidden_size))
        self.variance_epsilon = variance_epsilon
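
For context, the rest of that class computes the TF-style normalization with those parameters; a sketch of the accompanying forward pass:

    def forward(self, x):
        # Normalize over the hidden dimension, epsilon inside the square root.
        u = x.mean(-1, keepdim=True)
        s = (x - u).pow(2).mean(-1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.gamma * x + self.beta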

For compatibility, please use the old convert_tf_checkpoint_to_pytorch.py, which is what I used.
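
Alternatively, the renaming can be undone directly on the new-style checkpoint; a rough sketch with placeholder paths, which strips the bert. prefix, drops the unused pre-training heads, and restores the gamma/beta names:

import torch

sd = torch.load("pytorch_model_hf.bin", map_location="cpu")  # placeholder path
old_sd = {}
for key, tensor in sd.items():
    if key.startswith("cls."):
        continue  # pre-training heads; not part of BertModel
    if key.startswith("bert."):
        key = key[len("bert."):]
    key = (key.replace("LayerNorm.weight", "LayerNorm.gamma")
              .replace("LayerNorm.bias", "LayerNorm.beta"))
    old_sd[key] = tensor

torch.save(old_sd, "pytorch_model_old_style.bin")  # placeholder path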

Thanks!

Wonseok

@todpole3
Author

That makes a lot of sense. Thanks for the quick response!

whwang299 pushed a commit that referenced this issue Jun 6, 2019
@luckysunda

I am using everything from the repo itself but am still getting this error:

RuntimeError: Error(s) in loading state_dict for BertModel:
	Unexpected key(s) in state_dict: "encoder.layer.12.attention.self.query.weight", "encoder.layer.12.attention.self.query.bias", "encoder.layer.12.attention.self.key.weight", "encoder.layer.12.attention.self.key.bias", "encoder.layer.12.attention.self.value.weight", "encoder.layer.12.attention.self.value.bias", "encoder.layer.12.attention.output.dense.weight", "encoder.layer.12.attention.output.dense.bias", "encoder.layer.12.attention.output.LayerNorm.gamma", "encoder.layer.12.attention.output.LayerNorm.beta", "encoder.layer.12.intermediate.dense.weight", "encoder.layer.12.intermediate.dense.bias", "encoder.layer.12.output.dense.weight", "encoder.layer.12.output.dense.bias", "encoder.layer.12.output.LayerNorm.gamma", "encoder.layer.12.output.LayerNorm.beta", ... [the same sixteen keys repeat for encoder layers 13 through 23].
	size mismatch for embeddings.word_embeddings.weight: copying a param with shape torch.Size([30522, 1024]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
	size mismatch for embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 768]).
	size mismatch for embeddings.token_type_embeddings.weight: copying a param with shape torch.Size([2, 1024]) from checkpoint, the shape in current model is torch.Size([2, 768]).
	size mismatch for embeddings.LayerNorm.gamma: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for embeddings.LayerNorm.beta: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.self.query.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for encoder.layer.0.attention.self.query.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.self.key.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for encoder.layer.0.attention.self.key.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.self.value.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for encoder.layer.0.attention.self.value.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.output.dense.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for encoder.layer.0.attention.output.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.output.LayerNorm.gamma: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.attention.output.LayerNorm.beta: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
	size mismatch for encoder.layer.0.intermediate.dense.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3072]).
	size mismatch for encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
	size mismatch for encoder.layer.0.output.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.output.LayerNorm.gamma: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for encoder.layer.0.output.LayerNorm.beta: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
	[identical size mismatches, 1024 vs. 768 and 4096 vs. 3072, repeat for encoder layers 1 through 11]
	size mismatch for pooler.dense.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for pooler.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).



What can be done?
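
The mismatched shapes (hidden size 1024 vs. 768, intermediate size 4096 vs. 3072, encoder layers running up to 23) indicate that a BERT-large checkpoint is being loaded into a BERT-base model. A quick way to check which variant a checkpoint file holds, as a rough sketch with a placeholder path:

import torch

sd = torch.load("pytorch_model.bin", map_location="cpu")  # placeholder path

# Keys may or may not carry a "bert." prefix depending on which
# script produced the file, so match on the key suffix.
emb_key = next(k for k in sd if k.endswith("embeddings.word_embeddings.weight"))
hidden_size = sd[emb_key].shape[1]

num_layers = 1 + max(int(k.split("layer.")[1].split(".")[0])
                     for k in sd if ".layer." in k)

# BERT-base: hidden_size=768, num_layers=12; BERT-large: 1024 and 24.
print(hidden_size, num_layers)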

@luckysunda

It got solved; I used a different BERT model.

@techwithshadab

> It got solved; I used a different BERT model.

Can you share that model? I'm facing the same issue as well.

@techwithshadab

> I found that the HuggingFace team has changed the variable names in their "updated" code. [...] For compatibility, please use the old convert_tf_checkpoint_to_pytorch.py, which is what I used.

Can we update those changes in this repo also?


Adherer commented Nov 6, 2019

mark
