This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

A possible solution to the inability to quantize the output with QAT #3204

Closed
Lycan1003 opened this issue Dec 17, 2020 · 1 comment

@Lycan1003

In issue #3084, I encountered a situation where quantizing the output caused the loss to become NaN.

Your team's reply:

I have tested MobileNetV2 with QAT. If only the weights are quantized, the training process is stable and the model converges normally. If the activations are quantized, the loss becomes NaN. I think this happens because folding of batch norm is not supported yet, and the problem will always exist in a QAT method without folding.

Then I implemented batch-norm fusing myself, but the problem still existed.
So I checked the code again and found a possible cause.
Near line 233 of quantizers.py, in the quantize_output function:

    def quantize_output(self, output, wrapper, **kwargs):
        ...
        # Track the observed output range with an exponential moving average.
        current_min, current_max = torch.min(output), torch.max(output)
        module.tracked_min_biased, module.tracked_min = update_ema(module.tracked_min_biased, current_min, module.ema_decay, self.steps)
        module.tracked_max_biased, module.tracked_max = update_ema(module.tracked_max_biased, current_max, module.ema_decay, self.steps)
        # Derive scale and zero_point from the bias-corrected tracked range.
        module.scale, module.zero_point = update_quantization_param(output_bits, module.tracked_min, module.tracked_max)

In my opinion, using tracked_min and tracked_max to update the quantization parameters is meant to smooth out drastic changes in the data (just my understanding). The code is shown below:

def update_ema(biased_ema, value, decay, step):
    biased_ema = biased_ema * decay + (1 - decay) * value
    unbiased_ema = biased_ema / (1 - decay ** step)  # Bias correction
    return biased_ema, unbiased_ema

However, when quantizing for the first time, the default value of tracked_max_biased (i.e., biased_ema in the update_ema function) is zero, which makes tracked_max (i.e., unbiased_ema) much smaller than current_max; the same holds for tracked_min. So the QAT result of the activation layer is incorrect, which causes the weights of the Conv layer to become extremely large. After several epochs, the weights keep growing until they become NaN.
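
To make the failure mode concrete, here is a minimal sketch of how asymmetric quantization parameters are typically derived from a tracked range, and what happens when that range underestimates the real activations. The helper names (asymmetric_qparams, fake_quantize) and the [0, 2^bits - 1] integer range are assumptions for illustration only; NNI's update_quantization_param may differ in its details.

    import torch

    def asymmetric_qparams(low, high, bits=8):
        # Map the float range [low, high] onto the integer range [qmin, qmax].
        qmin, qmax = 0, 2 ** bits - 1
        # Include zero in the range, as is common for activation quantization.
        low, high = min(low, 0.0), max(high, 0.0)
        scale = (high - low) / (qmax - qmin)
        zero_point = int(qmin - round(low / scale))
        return scale, zero_point

    def fake_quantize(x, scale, zero_point, bits=8):
        # Quantize then dequantize, clamping to the representable integer range.
        qmin, qmax = 0, 2 ** bits - 1
        q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
        return (q - zero_point) * scale

    # If the tracked max is far below the real activation max, large activations
    # saturate at qmax and are badly distorted after dequantization.
    x = torch.tensor([0.0, 1.0, 5.0, 10.0])
    scale, zp = asymmetric_qparams(0.0, 1.0)   # tracked range far too small
    print(fake_quantize(x, scale, zp))         # values above 1.0 are clipped to ~1.0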

This problem does not appear in the example (QAT_torch_quantizer.py); I guess this may be because the network structure and dataset are simple?

So I changed the code as follows, using current_min and current_max in place of tracked_min and tracked_max, and the problem was fixed.

    module.scale, module.zero_point = update_quantization_param(output_bits, current_min, current_max)
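
For context, this is roughly how the modified quantize_output looks with that change; the surrounding lines are reconstructed from the snippet above and may not match the NNI source exactly:

    def quantize_output(self, output, wrapper, **kwargs):
        ...
        current_min, current_max = torch.min(output), torch.max(output)
        # The EMA statistics are still tracked as before.
        module.tracked_min_biased, module.tracked_min = update_ema(module.tracked_min_biased, current_min, module.ema_decay, self.steps)
        module.tracked_max_biased, module.tracked_max = update_ema(module.tracked_max_biased, current_max, module.ema_decay, self.steps)
        # Compute scale and zero_point from the current batch statistics instead
        # of the (zero-initialized) tracked range.
        module.scale, module.zero_point = update_quantization_param(output_bits, current_min, current_max)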

The accuracy of the models is acceptable:
ShuffleNetV2 (net_size=0.5) on CIFAR-10 with BN fusing: accuracy = 74.6%
ShuffleNetV2 (net_size=0.5) on CIFAR-10 without BN fusing: accuracy = 56.0%

I hope this can be helpful! ^_^

@linbinskn
Contributor

linbinskn commented Dec 19, 2020

Great! Thank you for your issue! I have submitted a PR to fix this problem.
