
Combine tensorrt tool with NNI quantization algorithms. #3488

Merged
21 commits merged into microsoft:master on Apr 9, 2021

Conversation

linbinskn
Contributor

NNI TensorRT support
Targets:
1. Support real quantized-model speedup in NNI on different hardware (currently only TensorRT is supported).
2. Support mixed-precision search, especially the interface design for mixed quantization.
3. Combine quantized inference with NNI's existing simulated quantization interface, mainly supporting QAT.
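For reference, a minimal sketch of the intended end-to-end flow, pieced together from the example snippets in this PR. The import paths, the QAT_Quantizer/export_model calls, and the file names are assumptions based on NNI's API at the time, not the final interface:

import torch
import torch.nn as nn
# Import paths below are assumptions based on NNI's layout around this PR.
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT

model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 26 * 26, 10))

# 1. Simulated quantization (QAT) to collect quantization parameters.
config_list = [{'quant_types': ['weight', 'output'],
                'quant_bits': {'weight': 8, 'output': 8},
                'op_types': ['Conv2d', 'Linear']}]
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
quantizer = QAT_Quantizer(model, config_list, optimizer)
quantizer.compress()
# ... finetune the wrapped model here so QAT can track dynamic ranges ...
calibration_config = quantizer.export_model('model.pth', 'calibration.pth')

# 2. Real speedup with TensorRT, reusing the exported calibration config
#    (constructor and calls as they appear in the example code of this PR).
engine = ModelSpeedupTensorRT(model, (32, 1, 28, 28), config=calibration_config, batchsize=32)
engine.compress()
# output, time = engine.inference(test_set)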



def resnet18(**kwargs):
    return _resnet(BasicBlock, [2, 2, 2, 2], **kwargs)
Contributor

Would it be better to create a models folder?

Contributor Author

Good point. I will create this model by importing it from a folder once the location of the model compression models folder in examples is confirmed.

engine.compress()
output, time = engine.inference(test_set)

check_accuracy(output, test_labels)
Contributor

Why not run inference on the full test dataset and compare the accuracy of the quantized model with that of the original model?

Contributor Author

The evaluation dataset is already the full test dataset. For the current scenario in this example, training a QAT model from scratch, we should compare the accuracy of the QAT-quantized model with the accuracy of the sped-up model, both of which are printed.

'layer4.1.conv1':{'weight_bit':8, 'activation_bit':8},
'layer4.1.conv2':{'weight_bit':8, 'activation_bit':8},
'fc':{'weight_bit':8, 'activation_bit':8},
}
Contributor

Can we specify one bit width for all layers?

Contributor Author

We haven't supported that yet, because not all op types are supported in quantization. However, we can specify one bit width for a specific supported op type, and all layers of that op type will be quantized.
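For reference, a hedged sketch of what specifying one bit width for a whole supported op type could look like with NNI's simulated-quantization config format (the field names follow NNI's config_list convention and are an assumption here, not code from this PR):

# Quantize every Conv2d layer with 8-bit weights and activations;
# op types that are not yet supported are simply left out of the config.
config_list = [{
    'quant_types': ['weight', 'output'],
    'quant_bits': {'weight': 8, 'output': 8},
    'op_types': ['Conv2d']
}]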

self.algorithm = algorithm
self.cache_file = cache_file

# Every time get_batch is called, the next batch of size batch_size will be copied to the device and returned.
Contributor

This comment looks strange here; moving it to get_batch would be better.

Contributor Author

This comment is for self.batch_size. I have modified it to make that clear.
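For context, TensorRT's Python API expects an INT8 calibrator to expose get_batch_size and get_batch. Below is a simplified, self-contained sketch of such a calibrator (not the PR's exact implementation; the class and attribute names are illustrative):

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401, initializes the CUDA context
import tensorrt as trt

class SimpleCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data, batch_size, cache_file='calib.cache'):
        super().__init__()
        self.data = data                    # numpy array of calibration samples
        self.batch_size = batch_size        # batch size reported to TensorRT
        self.cache_file = cache_file
        self.index = 0
        # device buffer large enough to hold one batch
        self.device_input = cuda.mem_alloc(self.data[0].nbytes * batch_size)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        # Every time get_batch is called, the next batch of size batch_size
        # is copied to the device and its device pointer is returned.
        if self.index + self.batch_size > len(self.data):
            return None
        batch = np.ascontiguousarray(self.data[self.index:self.index + self.batch_size])
        cuda.memcpy_htod(self.device_input, batch)
        self.index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):
        return None                         # always recalibrate in this sketch

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)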

return None

input_tensor = network.get_input(0)
input_tensor.dynamic_range = (-100, 100)
Contributor

Why is the input dynamic range set to (-100, 100)?

Contributor Author

The range (-100, 100) was just for testing. It has been deleted in the latest commit.

engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=batch_size)
engine.compress()

test_trt(engine, test_loader)
Contributor

It seems this example includes the above example, so why are two examples needed?

Contributor Author

In the example mixed_precision_speedup_mnist.py, the model is quantized in TensorRT directly by providing a calibration dataset, and TensorRT obtains the quantization parameters through the calibration process. We can consider this post-training quantization.
However, in the example mixed_precision_speedup_mnist_QAT.py, we first finetune the model and obtain the quantization parameters with the QAT algorithm. Then the model is quantized in TensorRT without a calibration dataset. We can consider this quantization-aware training.

Contributor

I suggest putting them into one example file.

Contributor Author

I have put them into one file.

@@ -31,7 +31,7 @@ def validate_config(self, model, config_list):

schema.validate(config_list)

def quantize_weight(self, wrapper, **kwargs):
def quantize_weight(self, input, wrapper, **kwargs):
Contributor

What is the meaning of input? And is it used in this function?

Contributor Author

In this new version, the input tensor's dynamic range is also recorded to meet the requirement of TensorRT's tensor range setting.

Contributor

So what is the meaning of input?

Contributor Author
@linbinskn linbinskn Apr 5, 2021

input is the input tensor of this op. It is used to calibrate the input tensor's dynamic range. It won't be used by all quantizers, so I have passed it via kwargs.
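A rough sketch of how quantize_weight could track the input tensor's range for the later TensorRT dynamic-range setting. The tracked_min_input/tracked_max_input names mirror the variables visible elsewhere in this diff; the rest of the body is an assumption, not the merged code:

import torch

def quantize_weight(self, input, wrapper, **kwargs):
    module = wrapper.module
    # Track the observed range of the op's input so the TensorRT backend can
    # later call in_tensor.dynamic_range = (tracked_min_input, tracked_max_input).
    batch_min = torch.min(input).item()
    batch_max = torch.max(input).item()
    module.tracked_min_input = min(getattr(module, 'tracked_min_input', batch_min), batch_min)
    module.tracked_max_input = max(getattr(module, 'tracked_max_input', batch_max), batch_max)
    # ... existing weight fake-quantization logic continues here ...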

@@ -720,11 +720,11 @@ def quant_backward(tensor, grad_output, quant_type, scale, zero_point, qmin, qma
return grad_output

@staticmethod
def forward(ctx, tensor, quant_type, wrapper, **kwargs):
def forward(ctx, tensor, quant_type, wrapper, tensor_alt=None, **kwargs):
Contributor

What is the meaning of tensor_alt?

Contributor Author

It is used to pass a second tensor during the forward process. In the previous implementation, we only passed one tensor, such as weight, input, or output. But in some situations, we need to pass two tensors and calibrate both of them, as in the function quantize_weight. So I added it here.

Contributor

Is tensor_alt commonly used across different quantizers? If it is specific to some quantizers, I suggest putting it in kwargs; that is what kwargs is for.

Contributor Author

The argument tensor_alt may not be used by most quantizers, but I don't think it is a bad thing to provide an alternative here. What's more, forward in class QuantGrad is called through apply(), which only supports positional arguments, so kwargs may be empty here. If we forced it into kwargs, an error would be raised. For these reasons, I think it can be kept.
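A simplified sketch of the pattern being discussed (illustrative, not the PR's code): since apply() forwards positional arguments to forward, the optional second tensor sits more naturally as a defaulted positional parameter than inside kwargs.

import torch

class QuantGradSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, tensor, quant_type, wrapper, tensor_alt=None, **kwargs):
        # tensor_alt carries an optional second tensor (e.g. the op input when
        # quantizing weights) so both ranges can be calibrated in one pass.
        ctx.save_for_backward(tensor)
        return tensor  # identity here; a real quantizer would fake-quantize it

    @staticmethod
    def backward(ctx, grad_output):
        # one gradient per forward input: tensor, quant_type, wrapper, tensor_alt
        return grad_output, None, None, None

x = torch.randn(4, requires_grad=True)
out = QuantGradSketch.apply(x, 'weight', None, torch.randn(4))  # positional arguments only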

return self.__str__()

# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine):
Contributor

It is a little strange to put these TensorRT-specific functions in common.py.

Contributor Author

All the functions in it are about CUDA memory operations. I think it is better to keep them separate to make the code easier to understand.

Contributor
@QuanluZhang QuanluZhang Apr 5, 2021

I agree, so you can rename this file, for example to "trt_cuda.py": we are supposed to support different backends, and CUDA memory operations are still specific to NVIDIA GPUs.

Contributor Author

I have renamed common.py to 'trt_pycuda.py'.

engine = build_engine(onnx_path, calib, self.onnx_config, self.extra_layer_bit, self.strict_datatype)
return engine.create_execution_context()

def tensorrt_build_withoutcalib(self, onnx_path):
Contributor

If a member function is not supposed to be exposed to users, it would be better to add _ before the function name.

Contributor Author

Agreed. I have added it.

@QuanluZhang
Contributor

@linbinskn please update the doc accordingly. Also prepare a unit test; we will set up the environment for this unit test.

linbinskn added a commit to linbinskn/nni that referenced this pull request Apr 5, 2021
@linbinskn
Contributor Author

@linbinskn please update the doc accordingly. Also prepare a unit test; we will set up the environment for this unit test.

I have updated the doc in #3512. The unit test is also prepared and will be pushed after the environment is set up.

engine = builder.build_cuda_engine(network)
return engine

def build_engine_without_calib(model_file, config=None, extra_layer_bit=32, strict_datatype=False):
Contributor

I suggest combining this function with "build_engine".

Contributor Author

I have combined them into one function.


    if extra_layer_bit == 32 and config is None:
        pass
    elif extra_layer_bit == 8 and config is None:
Contributor
@QuanluZhang QuanluZhang Apr 6, 2021

What if extra_layer_bit is 16 and config is None?

Contributor Author

We should turn on fp16 mode in that case. I have modified it.
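A condensed sketch of how the merged build_engine can pick the precision mode from extra_layer_bit and the presence of a calibrator, based on the legacy TensorRT builder flags visible in this diff (an outline, not the merged implementation):

def _set_precision(builder, extra_layer_bit, config, calib):
    """Configure builder precision for layers not covered by the per-layer config."""
    if config is None and extra_layer_bit == 32:
        pass                                 # pure FP32, nothing to set
    elif config is None and extra_layer_bit == 16:
        builder.fp16_mode = True             # whole model in FP16
    elif config is None and extra_layer_bit == 8:
        builder.int8_mode = True             # whole model in INT8
        builder.int8_calibrator = calib      # TensorRT calibration is INT8-only
    else:
        # mixed precision: per-layer bits come from config, the remaining
        # layers fall back to int8/fp16, with calibration when calib is given
        builder.int8_mode = True
        builder.fp16_mode = True
        if calib is not None:
            builder.int8_calibrator = calib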

    else:
        builder.int8_mode = True
        builder.fp16_mode = True
        builder.int8_calibrator = calib
Contributor

So calib is only for int8? What about 2 bits? 4 bits?

Contributor Author

Only int8 is supported by TensorRT, so 2-bit or 4-bit calibration is not possible; the int8_calibrator parameter is fixed in the TensorRT builder.

    # Parse onnx model
    with open(model_file, 'rb') as model:
        if not parser.parse(model.read()):
            print ('ERROR: Fail to parse the ONNX file.')
Contributor

print -> logging

Contributor Author

I have substituted it.
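For instance, the parse failure can be reported through the standard logging module instead of print (a small sketch; parser and model_file come from the surrounding build_engine code):

import logging

logger = logging.getLogger(__name__)

with open(model_file, 'rb') as model:
    if not parser.parse(model.read()):
        logger.error('Failed to parse the ONNX file.')
        for i in range(parser.num_errors):
            logger.error(parser.get_error(i))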

        if layer.name in config:
            w_bit = config[layer.name]['weight_bit']
            a_bit = config[layer.name]['activation_bit']
            layer.precision = Precision_Dict[w_bit]
Contributor

Is it possible that w_bit is a value other than 8, 16, or 32? It would be better to add a config validation function to the TRT backend.

Contributor Author

I have added a validation function.
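A possible shape for that validation, assuming the supported precisions are exactly the keys of Precision_Dict (8, 16 and 32 bits); the merged function may differ:

def valid_config(config=None):
    """Check that every layer entry requests a supported bit width."""
    if config is None:
        return
    support_bits = [8, 16, 32]               # keys of Precision_Dict
    for name, layer_config in config.items():
        for key in ('weight_bit', 'activation_bit'):
            if key in layer_config and layer_config[key] not in support_bits:
                raise ValueError("%s of layer %s must be one of %s" % (key, name, support_bits))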

        # entire model in 8bit mode
        builder.int8_mode = True
    else:
        pass
Contributor

this is too hacky

Contributor Author

Sorry, that was a mistake. I have fixed it.

import pycuda.autoinit
import tensorrt as trt

pycuda.autoinit
Contributor

maybe remove?

Contributor Author

This has to be kept: import pycuda.autoinit is necessary here, otherwise pycuda would not be initialized and an error would be raised. But the import itself is not used in the following code, which the Python test pipeline does not allow. So I chose to put this statement here.

Contributor

It would be better to use a comment to suppress pylint for that line.
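For instance, the unused-import warning can be silenced on that single line with a standard pylint pragma (a generic linter directive, not something specific to this PR):

import pycuda.driver as cuda
import pycuda.autoinit  # pylint: disable=unused-import
# pycuda.autoinit must stay imported: it creates the CUDA context that pycuda needs.
import tensorrt as trt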

# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference_v2(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
Contributor

Is this a code style choice, or is there some other reason for it?

Contributor Author

This is a code style learned from the NVIDIA TensorRT examples.

"""
# Attention that, builder should be set to 1 because of the implementation of allocate_buffer
builder.max_batch_size = 1
builder.max_workspace_size = common.GiB(1)
Contributor

Is fixing the size at 1 GiB enough for all scenarios?

Contributor Author

Good question! I think 1 GiB is enough for a single model. To avoid memory limitations in some special cases, I have extended it to 4 GiB.

@linbinskn linbinskn requested review from QuanluZhang and J-shang April 7, 2021 02:50
    for i in range(network.num_layers):
        if config is None:
            break
        valid_config(config)
Contributor

Does valid_config need to be called this many times?

Contributor Author

I have modified it.
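That is, the validation can be hoisted out of the per-layer loop so it runs once (a sketch of the restructured control flow, using the names from the excerpt above):

if config is not None:
    valid_config(config)                 # validate the whole config once

for i in range(network.num_layers):
    if config is None:
        break
    layer = network.get_layer(i)
    # ... per-layer precision and dynamic-range handling continues here ...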

Input name of onnx model providing for torch.onnx.export to generate onnx model
output_name : list
Output name of onnx model providing for torch.onnx.export to generate onnx model
Returns
Contributor

add a blank line

Contributor Author

I have added it.

in_tensor = layer.get_input(0)
in_tensor.dynamic_range = (tracked_min_input, tracked_max_input)
# Gemm will generate two shuffle layers before and after itself, need specific setting
if layer.name[0:4] == "Gemm":
Contributor

why "Gemm" is handled only when calib is None?

Contributor Author

When calib is not None, the quantization speedup module performs post-training quantization. In the current implementation, we do not apply any extra modification in the post-training quantization path.
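For reference, a simplified sketch of the QAT path (calib is None) that sets per-tensor dynamic ranges from the tracked values and special-cases Gemm, as discussed above. The config key names are assumptions taken from the variable names in the excerpt; the real per-layer bookkeeping in the PR is more involved:

if calib is None and config is not None:
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name not in config:
            continue
        tracked_min_input = config[layer.name]['tracked_min_input']
        tracked_max_input = config[layer.name]['tracked_max_input']
        in_tensor = layer.get_input(0)
        in_tensor.dynamic_range = (tracked_min_input, tracked_max_input)
        if layer.name[0:4] == "Gemm":
            # Per the comment in this diff, Gemm generates two shuffle layers before
            # and after itself, and their tensors need specific range settings too.
            pass  # handled with extra range settings in the real implementation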

@linbinskn linbinskn requested a review from QuanluZhang April 8, 2021 08:13
@QuanluZhang QuanluZhang merged commit f0e3c58 into microsoft:master Apr 9, 2021