[TOPI] Winograd #899
So far we didn't have Winograd support; #898 brings the first implementation. We want to push it to other backends, so this issue tracks the progress. Ideally, the implementation should also work for larger batch sizes.

cc @ZihengJiang @masahi @adityaatluri @Laurawly
I'm getting into implementing Winograd kernels; I will let you know the progress.
I have a very basic Winograd working for CUDA and AMDGPU here. My code is modified from the Mali Winograd implementation, which is a very good reference. I will try to optimize the batched GEMM, which takes 96% of the compute time. For an AOT compiler like TVM, the filter transform can be precomputed at compile time. The TVM Mali implementation doesn't do this, but it should.
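For readers new to the algorithm, here is a minimal NumPy sketch (my illustration, not TVM's compute definition) of the Winograd stages for the small F(2, 3) case. The filter transform `U = G @ g` depends only on the weights, which is why it can be precomputed at compile time; the input transform, elementwise product (which becomes the batched GEMM in the full 2-D kernel), and inverse transform run at inference time. The 6x6_3x3 variant discussed here uses larger 8x8 tiles but has the same structure.

```python
import numpy as np

# Standard Winograd F(2, 3) matrices (1-D: tile of 4 inputs -> 2 outputs).
G  = np.array([[1,  0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1.]])
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1.]])
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1.]])

g = np.random.randn(3)   # 3-tap filter
d = np.random.randn(4)   # one input tile

U = G @ g                # filter transform: weights only -> precompute at compile time
V = Bt @ d               # input transform: per tile, at run time
m = U * V                # elementwise product (a batched GEMM in the full 2-D conv)
y = At @ m               # inverse transform -> 2 outputs

# Sanity check against direct (correlation-style) convolution.
ref = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(y, ref)
```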
I implemented some experimental CUDA Winograd with the filter transform precomputed. The code is not very clean, so I only keep it in my local branch. Ref (only the op/compute definition; the schedule is in a private repo). To support precomputing the filter transform, we need something like:
```python
# Sketch: the imports and attr-helper names are assumed; adapt to your NNVM version.
from nnvm import symbol as sym
from nnvm.top import registry as reg

@reg.register_alter_op_layout("conv2d")
def alter_conv2d_layout(attrs, inputs, tinfos):
    groups = attrs.get_int("groups")
    kernel_size = attrs.get_int_tuple("kernel_size")
    strides = attrs.get_int_tuple("strides")
    new_attrs = {k: attrs[k] for k in attrs.keys()}
    copy_inputs = [s for s in inputs]
    if groups == 1 and kernel_size == (3, 3) and strides == (1, 1):
        # Split the weight transform into its own op so it can be precomputed.
        copy_inputs[1] = sym.contrib.conv2d_winograd_6x6_3x3_weight_transform(copy_inputs[1])
        return sym.contrib.conv2d_winograd_6x6_3x3_without_weight_transform(*copy_inputs, **new_attrs)
    return sym.conv2d(*copy_inputs, **new_attrs)
```
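(Splitting the weight transform into its own op is what makes compile-time precompute possible: since the conv weights are constant parameters, a constant-folding/precompute pass can evaluate `conv2d_winograd_6x6_3x3_weight_transform` once ahead of time. Which pass does this depends on the NNVM version, so treat that detail as an assumption.)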
Nice. I am hoping to push my Winograd code to the AMDGPU backend, with the necessary NNVM changes, soon. Later, when you add CUDA Winograd, I can update the AMDGPU version to stay in sync with the CUDA one. For the AMDGPU backend, my basic Winograd is already faster than the existing direct convolution, which uses the CUDA schedules as-is.
@merrymercy For the input transform, my IR dump is something like this.
But ideally I want the minimum number of adds and subs, and to remove additions of 0.0f. The desired code is something like this. Is achieving minimal math possible with TVM? The same goes for the inverse transform.
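To make "minimal math" concrete, here is a hand-written Python sketch (my illustration for the smaller F(2x2, 3x3) case, not the dump referenced above). There, B^T contains only 0 and ±1, so the input transform `V = B^T @ d @ B` reduces entirely to adds and subs:

```python
def bt_vec(d0, d1, d2, d3):
    # One application of B^T for F(2x2, 3x3): 4 adds/subs,
    # no multiplies by 0 or 1 and no "x + 0.0f" terms.
    return (d0 - d2, d1 + d2, d2 - d1, d1 - d3)

def input_transform(d):
    """V = B^T @ d @ B on a 4x4 tile, fully unrolled to adds/subs."""
    t = list(zip(*[bt_vec(*col) for col in zip(*d)]))  # rows of B^T @ d
    return [bt_vec(*row) for row in t]                 # multiply by B on the right
```

That is 8 calls to `bt_vec`, i.e. 32 adds/subs for the whole tile and zero multiplies. The 6x6 variant's B^T has non-trivial constants, so it still needs some multiplies, but the same unrolling eliminates the 0.0f terms.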
Great work, @masahi. Can you share your performance numbers?
It is still preliminary, but I put some numbers here. I'm sure there are many opportunities for improvement.
#1487 provides Winograd for the CPU.
Closing, as most of the Winograd work is checked in; let us open new threads for specific work items.