[QNN] Optimize requantize for power of 2 and fix dequantize for per-channel quantized input #6675
Changes from 3 commits
@@ -155,17 +155,22 @@ Expr RequantizeLower(const Expr& input_tensor, const Expr& input_scale,
     if (!IsEqualScalar(input_scale, output_scale)) {
       int32_t fixed_point_multiplier, shift;
       std::tie(fixed_point_multiplier, shift) = GetFixedPointMultiplierShift(double_multiplier);

       const bool is_upward_rounding = (param->rounding == "UPWARD");

-      // When using upward rounding (i.e., x.5 rounded to x+1), leverage
-      // the FixedPointMultiply operator
-      scaled_int32_t =
-          (is_upward_rounding
-               ? FixedPointMultiply(scaled_int32_t, fixed_point_multiplier, shift)
-               : FixedPointMultiplyToNearest(scaled_int32_t, double_multiplier, input_shape));
+      if (is_upward_rounding && fixed_point_multiplier == (1 << 30)) {
+        // Power of 2 is determined by the fixed_point_multiplier == 1 << 30. In case of power of 2,
+        // fixed point multiplier will represent a float value of 0.5. In fixed point, this is
+        // represented by 1 << 30.
+        scaled_int32_t = PowerOfTwoMultiply(scaled_int32_t, shift - 1);
Does it make sense for this to go in
+      } else {
+        // When using upward rounding (i.e., x.5 rounded to x+1), leverage
+        // the FixedPointMultiply operator
+        scaled_int32_t =
+            (is_upward_rounding
+                 ? FixedPointMultiply(scaled_int32_t, fixed_point_multiplier, shift)
+                 : FixedPointMultiplyToNearest(scaled_int32_t, double_multiplier, input_shape));
+      }
     }

   } else {
     // This is per-channel (per=axis) quantization.
     std::vector<double> double_multipliers;
@@ -56,6 +56,21 @@ std::pair<int32_t, int32_t> GetFixedPointMultiplierShift(double double_multiplie
   return std::make_pair(significand, exponent);
 }

+Expr PowerOfTwoMultiply(Expr tensor, int32_t exp) {
+  Expr out;
+  if (exp > 0) {
+    // power of 2 is greater than 0, apply left shift.
+    out = LeftShift(tensor, MakeConstantScalar(DataType::Int(32), exp));
+  } else {
+    // power of 2 is less than 0, round and then apply right shift.
+    exp = -exp;
+    auto rounding_factor = 1 << (exp - 1);
+    auto rounded_t = Add(tensor, MakeConstantScalar(DataType::Int(32), rounding_factor));
+    out = RightShift(rounded_t, MakeConstantScalar(DataType::Int(32), exp));
Are you sure you don't need to convert to
+  }
+  return out;
+}
+
 Expr FixedPointMultiplyToNearest(Expr tensor, double multiplier,
                                  const Array<IndexExpr>& input_shape) {
   // Choose high precision datatype to be int64. This is for avoiding overflow
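The equivalence the diff relies on can be checked on plain integers. The following is a minimal sketch, not TVM code: `PowerOfTwoMultiplyScalar` and `FixedPointMultiplyScalar` are hypothetical scalar stand-ins, and the fixed-point model (a Q0.31 multiply with upward rounding) is an assumption about what the tensor-level operator computes. It also assumes the exponent is never 0, which appears to hold in the diff because equal input and output scales are filtered out before this path.

```cpp
#include <cassert>
#include <cstdint>

// Scalar stand-in for the shift-based path: returns x * 2^exp, rounding
// halves upward when exp is negative. Assumes exp != 0 and no overflow.
int32_t PowerOfTwoMultiplyScalar(int32_t x, int32_t exp) {
  assert(exp != 0 && exp > -31 && exp < 31);
  const int64_t wide = x;
  if (exp > 0) {
    // Multiply instead of shifting so negative inputs stay well defined.
    return static_cast<int32_t>(wide * (int64_t{1} << exp));
  }
  const int32_t n = -exp;
  const int64_t rounding = int64_t{1} << (n - 1);
  // Relies on arithmetic right shift for negative values (guaranteed since
  // C++20, and the behavior of mainstream compilers before that).
  return static_cast<int32_t>((wide + rounding) >> n);
}

// Assumed scalar model of a Q0.31 fixed-point multiply with upward rounding:
// computes round(x * multiplier * 2^shift / 2^31). Not TVM's actual code.
int32_t FixedPointMultiplyScalar(int32_t x, int32_t multiplier, int32_t shift) {
  assert(shift >= -30 && shift <= 30);
  const int32_t total_right_shift = 31 - shift;
  const int64_t rounding = int64_t{1} << (total_right_shift - 1);
  const int64_t product = static_cast<int64_t>(x) * multiplier;
  return static_cast<int32_t>((product + rounding) >> total_right_shift);
}
```

Under this model, calling `FixedPointMultiplyScalar(x, 1 << 30, shift)` and `PowerOfTwoMultiplyScalar(x, shift - 1)` gives the same result for every `shift` other than 1, which is the case the new branch in `RequantizeLower` targets.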
Why must the fixed_point_multiplier be (1 << 30)?
We use frexp to decompose the floating-point multiplier. It returns a float significand in the range [0.5, 1), and for a power of 2 that significand is always exactly 0.5. We then convert the float significand into a 32-bit fixed-point integer in which the binary point sits between the first and second bits. In this representation, 0.5 is 1 << 30.
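As a rough illustration of the explanation above, here is a standalone sketch of the decomposition. `ToFixedPoint` is a hypothetical helper written for this example, not the function from the PR; it assumes a positive multiplier and a Q0.31 layout (significand scaled by 2^31).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical helper for illustration: split a positive multiplier into a
// Q0.31 significand and a power-of-two exponent using std::frexp.
std::pair<int32_t, int> ToFixedPoint(double multiplier) {
  assert(multiplier > 0.0);
  int exponent = 0;
  // frexp returns a significand in [0.5, 1) such that
  // multiplier == significand * 2^exponent.
  const double significand = std::frexp(multiplier, &exponent);
  // Scale by 2^31 so that the binary point sits right after the sign bit.
  int64_t q = static_cast<int64_t>(std::llround(significand * (1LL << 31)));
  // Rounding can push a significand close to 1.0 up to 2^31; renormalize.
  if (q == (1LL << 31)) {
    q /= 2;
    ++exponent;
  }
  return {static_cast<int32_t>(q), exponent};
}
```

For example, `ToFixedPoint(0.25)` yields a significand of `1 << 30` and an exponent of -1. Since the significand stands for 0.5, the multiplier equals 2^(exponent - 1) = 2^-2, which is consistent with the diff passing `shift - 1` to `PowerOfTwoMultiply`.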
@anijain2305, can you add a small one-line comment regarding (1 << 30)? These days, aside from float32, many other types of float are floating around.
@cbalint13 Added a comment; can you PTAL?
@anijain2305, thank you!