This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Constraint-aware one-shot pruners #2657

Merged: 111 commits, Sep 21, 2020
Changes from 20 commits

Commits (111)
3515b29
constraint-aware pruner
Jul 5, 2020
1c84925
Constrained Structure pruner.
Jul 6, 2020
3ed53cb
Constrained pruner.
Jul 6, 2020
a415206
Constrained one-shot pruner.
Jul 6, 2020
5fa19fc
Constraint aware pruner.
Jul 7, 2020
bed63fe
Constrained one-shot pruner.
Jul 7, 2020
12c24e5
Constrained one shot pruner.
Jul 7, 2020
45e62c2
Constrained-aware one-shot pruner.
Jul 7, 2020
aeb8aaf
Update the doc.
Jul 7, 2020
70dac7c
reformat the unit test.
Jul 7, 2020
5362328
Add test case for constrained-aware pruners.
Jul 8, 2020
bd375a4
Remove the unnecessary log function.
Jul 8, 2020
9d0fb79
fix pylint errors.
Jul 8, 2020
211a047
Add the docs for the constrained pruners.
Jul 13, 2020
a11cf48
empty commit
Jul 13, 2020
899b6f9
Merge branch 'master' of https://github.com/microsoft/nni into constr…
Jul 13, 2020
439426f
Add an accuracy comparsion benchmark for Constrained Pruner.
Jul 20, 2020
a08dac6
update
Jul 20, 2020
39eadcc
Merge branch 'master' of https://github.com/microsoft/nni into constr…
Jul 23, 2020
fab3315
update the benchmark
Jul 23, 2020
4bc3c48
Update constrained pruner benchmark.
Jul 24, 2020
6c32a7c
update
Jul 25, 2020
9923ea1
update
Jul 27, 2020
5a35c8e
update
Jul 28, 2020
00eb006
fix a bug.
Jul 28, 2020
65676ad
update.
Jul 28, 2020
50d3468
update
Jul 28, 2020
d358cb2
update
Aug 2, 2020
174233c
tmp branch
Aug 3, 2020
b36589c
update
Aug 3, 2020
482e500
update
Aug 4, 2020
e5b262e
support imagenet for auto_pruners_torch
Aug 4, 2020
2591abc
update
Aug 4, 2020
0ee36b8
and a switch for the constrained pruner
Aug 4, 2020
19df319
update
Aug 4, 2020
b2f03b7
update
Aug 4, 2020
e263349
update
Aug 4, 2020
fca51bf
update
Aug 5, 2020
0fabf61
update
Aug 5, 2020
1ac7050
update
Aug 6, 2020
e55fe10
bug in the sm pruner
Aug 6, 2020
c1f9a45
update
Aug 6, 2020
9aa1558
add one more mile stone
Aug 6, 2020
e9b39fb
fix a bug caused by the expand and clone
Aug 7, 2020
d7cc452
add a constrained switch for the auto compress pruner
Aug 7, 2020
9183b93
add support for imagenet
Aug 11, 2020
e3226ee
unfinish
Aug 21, 2020
eca8577
attention pruner unfinished
Aug 24, 2020
ad58382
update
Aug 25, 2020
90c1c47
merge from master
Aug 25, 2020
12c289e
update
Aug 26, 2020
4029b1c
update
Aug 26, 2020
f3098fb
update
Aug 27, 2020
1fd32f5
update
Aug 27, 2020
755ce8b
update
Aug 27, 2020
1639867
updata
Aug 27, 2020
fe78c59
update
Aug 27, 2020
5c54faf
update
Aug 27, 2020
f5d4060
update
Aug 28, 2020
e5bcd6a
add no dependency
Aug 29, 2020
f625f81
use softmax in the attention pruner
Aug 31, 2020
e5f3e01
update
Aug 31, 2020
fcc984c
update
Aug 31, 2020
13d4f38
update
Sep 1, 2020
b7bac26
update
Sep 2, 2020
b3d1ac9
update
Sep 3, 2020
90d1e45
add the unit test.
Sep 3, 2020
cf7c936
update
Sep 3, 2020
85fc79f
update
Sep 3, 2020
b593ba3
update
Sep 3, 2020
6682cb3
update
Sep 3, 2020
25beb8f
update doc string
Sep 4, 2020
649ecfd
update the documentation
Sep 4, 2020
fb09b3f
Remove the attention pruner.
Sep 4, 2020
da5525a
remove the mobilenet_v2 for cifar10
Sep 4, 2020
0920efe
reset the auto_pruners_torch.py
Sep 4, 2020
74f4ec4
update the example to the new interface.
Sep 4, 2020
45966c9
fix pylint errors
Sep 4, 2020
e02fb90
update the example
Sep 4, 2020
cf626f8
fix a bug when counting flops
Sep 7, 2020
f444ebc
add several new one-shot pruners
Sep 8, 2020
a0d1e97
support more one_shot prunersw
Sep 8, 2020
59f0fe1
test
Sep 8, 2020
68e4563
fix a bug in the original apoz pruner
Sep 8, 2020
bddb70f
update
Sep 9, 2020
646324a
update
Sep 9, 2020
c7ba084
update
Sep 9, 2020
7a54cd6
update
Sep 9, 2020
c42de2c
update
Sep 9, 2020
9b9bb09
update
Sep 10, 2020
82f4fdb
update the unit test
Sep 10, 2020
8438ff5
update the examples
Sep 10, 2020
f9028f5
rm the test_dependency_aware
Sep 10, 2020
8afa53a
update
Sep 10, 2020
5c6d60e
update
Sep 10, 2020
78f3fc6
update the doc
Sep 10, 2020
2125e98
update rst
Sep 10, 2020
ae62671
Merge branch 'master' of https://github.com/microsoft/nni into constr…
Sep 11, 2020
8963a01
update
Sep 11, 2020
3691f23
Merge branch 'master' of https://github.com/microsoft/nni into constr…
Sep 13, 2020
29029ee
update doc
Sep 14, 2020
a8f3f74
update the doc
Sep 14, 2020
9bf7667
update doc
Sep 14, 2020
b7b7150
update the doc
Sep 14, 2020
e68cec0
update
Sep 16, 2020
c9c5329
update
Sep 16, 2020
ef51a10
update
Sep 16, 2020
4acabaa
update
Sep 16, 2020
80aec67
add some evaluation results
Sep 16, 2020
3a73e30
update
Sep 21, 2020
d5bbe48
update the doc
Sep 21, 2020
51 changes: 51 additions & 0 deletions docs/en_US/Compressor/Pruner.md
@@ -10,9 +10,12 @@ We provide several pruning algorithms that support fine-grained weight pruning a
* [Slim Pruner](#slim-pruner)
* [FPGM Pruner](#fpgm-pruner)
* [L1Filter Pruner](#l1filter-pruner)
* [Constrained L1Filter Pruner](#constrained-l1filter-pruner)
* [L2Filter Pruner](#l2filter-pruner)
* [Constrained L2Filter Pruner](#constrained-l2filter-pruner)
* [APoZ Rank Pruner](#activationapozrankfilterpruner)
* [Activation Mean Rank Pruner](#activationmeanrankfilterpruner)
* [Constrained Activation Mean Rank Filter Pruner](#constrained-activationmeanrankfilter-pruner)
* [Taylor FO On Weight Pruner](#taylorfoweightfilterpruner)

**Pruning Schedule**
@@ -177,6 +180,27 @@ The experiments code can be found at [examples/model_compress]( https://github.c

***

## Constrained L1Filter Pruner
This is a topology constraint-aware one-shot pruner. Compared to the [original L1 Filter Pruner](#l1filter-pruner), this pruner prunes the model based not only on the L1 norm of each filter, but also on the topology of the target model's network architecture. For example, if the output channels of two convolutional layers (conv1, conv2) are added together, then these two conv layers have a channel dependency on each other (see [Compression Utils](./CompressionUtils.md) for details). Suppose we prune the first 50% of the output channels (filters) of conv1 and the last 50% of the output channels of conv2. Although both layers have 50% of their filters pruned, the speedup module still needs to pad zeros to align the output channels, so we cannot harvest the speed benefit from the model pruning. To better gain the speed benefit of model pruning, we developed this constraint (topology)-aware one-shot pruner.
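
For illustration, a minimal model that contains exactly this kind of dependency could look like the sketch below (a hypothetical example, not NNI code): the outputs of `conv1` and `conv2` are added element-wise, so the two layers form one channel-dependency set.

```python
import torch.nn as nn

class AddDependency(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(3, 16, kernel_size=3, padding=1)

    def forward(self, x):
        # the element-wise add forces conv1 and conv2 to keep the same output channels after pruning
        return self.conv1(x) + self.conv2(x)
```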

The `Constrained L1Filter Pruner` tries to prune the same output channels for layers that have channel dependencies with each other. For each channel, the pruner first computes the sum of the L1 norms of that channel across all the layers in the dependency set. The maximum sparsity that can be shared by the channels of a dependency set is determined by the minimum configured sparsity among its layers (denoted by `min_sparsity`). According to the summed L1 norms, `Constrained L1Filter Pruner` prunes the same `min_sparsity` fraction of channels for all the layers in the set. Next, the pruner additionally prunes `sparsity` - `min_sparsity` of the channels for each convolutional layer based on that layer's own per-channel L1 norms. For example, suppose the output channels of `conv1` and `conv2` are added together and the configured sparsities of `conv1` and `conv2` are 0.3 and 0.2 respectively. In this case, `Constrained L1Filter Pruner` prunes the same 20% of channels for `conv1` and `conv2` according to the summed L1 norms of `conv1` and `conv2`, and then prunes an additional 10% of the channels of `conv1` according to the L1 norm of each channel of `conv1`.
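
The selection logic can be sketched as follows (a simplified, hypothetical helper, not the actual NNI masker implementation; `weights` maps each layer name in one dependency set to its Conv2d weight tensor, and `sparsities` maps the same names to their configured sparsities):

```python
import torch

def constrained_l1_channel_selection(weights, sparsities):
    num_channels = next(iter(weights.values())).shape[0]
    # per-layer, per-channel L1 norms of the filters
    per_layer_norm = {name: w.abs().sum(dim=(1, 2, 3)) for name, w in weights.items()}
    # stage 1: prune the same `min_sparsity` fraction of channels for the whole
    # dependency set, ranked by the sum of the L1 norms across all layers
    min_sparsity = min(sparsities.values())
    num_common = int(num_channels * min_sparsity)
    summed = torch.stack(list(per_layer_norm.values())).sum(dim=0)
    common_pruned = set(torch.argsort(summed)[:num_common].tolist())
    # stage 2: each layer additionally prunes (sparsity - min_sparsity) of its
    # channels, ranked by its own per-channel L1 norms
    pruned = {}
    for name, norm in per_layer_norm.items():
        extra = int(num_channels * (sparsities[name] - min_sparsity))
        own_order = [c for c in torch.argsort(norm).tolist() if c not in common_pruned]
        pruned[name] = common_pruned | set(own_order[:extra])
    return pruned
```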

In addition, for convolutional layers that have more than one filter group, `Constrained L1Filter Pruner` also tries to prune the same number of channels in each filter group. Overall, this pruner prunes the model according to the L1 norm of each filter while trying to satisfy the topological constraints (channel dependency, etc.) so as to improve the final speed gain after the speedup process.

In short, compared to `L1FilterPruner`, `Constrained_L1FilterPruner` provides a better speed gain from model pruning.


### Usage
PyTorch code
```python
from nni.compression.torch import Constrained_L1FilterPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
dummy_input = torch.rand(1, 3, 32, 32)
pruner = Constrained_L1FilterPruner(model, config_list, dummy_input)
pruner.compress()
```
Compared to `L1FilterPruner`, `Constrained_L1FilterPruner` needs an additional input parameter called `dummy_input` to analyze the topology of the input model. The other input parameters are the same as for `L1FilterPruner`.

## L2Filter Pruner

This is a structured pruning algorithm that prunes the filters with the smallest L2 norm of the weights. It is implemented as a one-shot pruner.
@@ -199,6 +223,19 @@ pruner.compress()

***

## Constrained L2Filter Pruner
Similar to the Constrained L1Filter Pruner, this pruner prunes the model based on the L2 norm of each filter and the topology of the model.

### Usage
PyTorch code
```python
from nni.compression.torch import Constrained_L2FilterPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
dummy_input = torch.rand(1, 3, 32, 32)
pruner = Constrained_L2FilterPruner(model, config_list, dummy_input)
pruner.compress()
```

## ActivationAPoZRankFilterPruner

ActivationAPoZRankFilterPruner is a pruner which prunes the filters with the smallest importance criterion `APoZ` calculated from the output activations of convolution layers to achieve a preset level of network sparsity. The pruning criterion `APoZ` is explained in the paper [Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures](https://arxiv.org/abs/1607.03250).
@@ -261,6 +298,20 @@ You can view example for more information

***


## Constrained ActivationMeanRankFilter Pruner
Similar to the Constrained L1Filter Pruner, this pruner prunes the model based on the mean activation rank of the filters and the topology of the model.

### Usage
PyTorch code
```python
from nni.compression.torch import ConstrainedActivationMeanRankFilterPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
dummy_input = torch.rand(1, 3, 32, 32)
pruner = ConstrainedActivationMeanRankFilterPruner(model, config_list, dummy_input)
pruner.compress()
```
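
As with the original activation-based pruners, the masker ranks filters using statistics collected from the layers' output activations, so a few forward passes are needed before calling `compress()`. The snippet below is a minimal sketch following `examples/model_compress/constrained_pruner.py` from this PR, assuming `model`, `config_list`, `dummy_input`, `optimizer`, `train_loader` and `device` are already defined:

```python
pruner = ConstrainedActivationMeanRankFilterPruner(
    model, config_list, dummy_input, optimizer, statistics_batch_num=10)
# feed a few batches so the masker can collect activation statistics
for batch_idx, (data, _) in enumerate(train_loader):
    model(data.to(device))
    if batch_idx > 10:
        break
pruner.compress()
```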

## TaylorFOWeightFilterPruner

TaylorFOWeightFilterPruner is a pruner which prunes convolutional layers based on estimated importance calculated from the first-order Taylor expansion on weights to achieve a preset level of network sparsity. The estimated importance of filters is defined in the paper [Importance Estimation for Neural Network Pruning](http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf). Other pruning criteria mentioned in this paper will be supported in a future release.
219 changes: 219 additions & 0 deletions examples/model_compress/constrained_pruner.py
@@ -0,0 +1,219 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
'''
Example comparing the constraint-aware one-shot pruners with their original counterparts.
'''

import argparse
import os
import json
import torch
from torch.optim.lr_scheduler import StepLR, MultiStepLR
from torchvision import datasets, transforms, models

from models.mnist.lenet import LeNet
from models.cifar10.vgg import VGG
from nni.compression.torch import L1FilterPruner, Constrained_L1FilterPruner
from nni.compression.torch import L2FilterPruner, Constrained_L2FilterPruner
from nni.compression.torch import ActivationMeanRankFilterPruner, ConstrainedActivationMeanRankFilterPruner
from nni.compression.torch import ModelSpeedup
from nni.compression.torch.utils.counter import count_flops_params

def cifar10_dataset(args):
"""
return the train & test dataloader for the cifar10 dataset.
"""
kwargs = {'num_workers': 10, 'pin_memory': True} if torch.cuda.is_available() else {
}


normalize = transforms.Normalize(
(0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
train_loader = torch.utils.data.DataLoader(
datasets.CIFAR10(args.data_dir, train=True, transform=transforms.Compose([
transforms.RandomHorizontalFlip(),
transforms.RandomCrop(32, 4),
transforms.ToTensor(),
normalize,
]), download=True),
batch_size=args.batch_size, shuffle=True, **kwargs)

val_loader = torch.utils.data.DataLoader(
datasets.CIFAR10(args.data_dir, train=False, transform=transforms.Compose([
transforms.ToTensor(),
normalize,
])),
batch_size=args.batch_size, shuffle=False, **kwargs)
dummy_input = torch.ones(1, 3, 32, 32)
return train_loader, val_loader, dummy_input

def imagenet_dataset(args):
kwargs = {'num_workers': 10, 'pin_memory': True} if torch.cuda.is_available() else {}
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
train_loader = torch.utils.data.DataLoader(
datasets.ImageFolder(os.path.join(args.data_dir, 'train'),
transform=transforms.Compose([
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
normalize,
])),
batch_size=args.batch_size, shuffle=True, **kwargs)

val_loader = torch.utils.data.DataLoader(
datasets.ImageFolder(os.path.join(args.data_dir, 'val'),
transform=transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
normalize,
])),
batch_size=args.batch_size, shuffle=True, **kwargs)
dummy_input = torch.ones(1, 3, 224, 224)
return train_loader, val_loader, dummy_input

def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--dataset', type=str, default='imagenet',
help='dataset to use, cifar10 or imagenet (default: imagenet)')
parser.add_argument('--model', type=str, default='resnet18',
help='torchvision model to use, e.g. vgg16, resnet18 or mobilenet_v2 (default: resnet18)')
parser.add_argument('--data-dir', type=str, default='/mnt/imagenet/raw_jpeg/2012/',
help='dataset directory')
parser.add_argument('--batch-size', type=int, default=64,
help='input batch size for training (default: 64)')
parser.add_argument('--sparsity', type=float, default=0.1,
help='overall target sparsity')
parser.add_argument('--log-interval', type=int, default=200,
help='how many batches to wait before logging training status')
parser.add_argument('--finetune_epochs', type=int, default=15,
help='the number of finetune epochs after pruning')
parser.add_argument('--lr', type=float, default=0.001, help='the learning rate of model')
return parser.parse_args()


def train(args, model, device, train_loader, criterion, optimizer, epoch, callback=None):
model.train()
loss_sum = 0
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss_sum += loss.item()
loss.backward()
# callback should be inserted between loss.backward() and optimizer.step()
if callback:
callback()
optimizer.step()
if batch_idx % args.log_interval == 0:
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
epoch, batch_idx * len(data), len(train_loader.dataset),
100. * batch_idx / len(train_loader), loss_sum/(batch_idx+1)))


def test(model, device, criterion, val_loader):
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in val_loader:
data, target = data.to(device), target.to(device)
output = model(data)
# sum up batch loss
test_loss += criterion(output, target).item()
# get the index of the max log-probability
pred = output.argmax(dim=1, keepdim=True)
correct += pred.eq(target.view_as(pred)).sum().item()

test_loss /= len(val_loader.dataset)
accuracy = correct / len(val_loader.dataset)

print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
test_loss, correct, len(val_loader.dataset), 100. * accuracy))

return accuracy

def get_data(args):
if args.dataset == 'cifar10':
return cifar10_dataset(args)
elif args.dataset == 'imagenet':
return imagenet_dataset(args)

if __name__ == '__main__':
args = parse_args()
torch.manual_seed(0)
Model = getattr(models, args.model)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_loader, val_loader, dummy_input = get_data(args)
net1 = Model(pretrained=True).to(device)
net2 = Model(pretrained=True).to(device)

optimizer1 = torch.optim.SGD(net1.parameters(), lr=args.lr,
momentum=0.9,
weight_decay=5e-4)
scheduler1 = MultiStepLR(
optimizer1, milestones=[int(args.finetune_epochs*0.5), int(args.finetune_epochs*0.75)], gamma=0.1)
criterion1 = torch.nn.CrossEntropyLoss()
optimizer2 = torch.optim.SGD(net2.parameters(), lr=args.lr,
momentum=0.9,
weight_decay=5e-4)
scheduler2 = MultiStepLR(
optimizer2, milestones=[int(args.finetune_epochs*0.5), int(args.finetune_epochs*0.75)], gamma=0.1)
criterion2 = torch.nn.CrossEntropyLoss()

cfglist = [{'op_types':['Conv2d'], 'sparsity':args.sparsity}]
#pruner1 = L1FilterPruner(net1, cfglist, optimizer1)
#pruner2 = Constrained_L1FilterPruner(net2, cfglist, dummy_input.to(device), optimizer2)
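# net1 uses the original ActivationMeanRankFilterPruner; net2 uses the constraint-aware
# variant, which additionally takes dummy_input to analyze the model topology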

pruner1 = ActivationMeanRankFilterPruner(net1, cfglist, optimizer1, statistics_batch_num=10)
pruner2 = ConstrainedActivationMeanRankFilterPruner(net2, cfglist, dummy_input.to(device), optimizer2, statistics_batch_num=10)
for batch_idx, (data, target) in enumerate(train_loader):
data = data.to(device)
net1(data)
net2(data)
if batch_idx > 10:
# enough data to calculate the activation
break

pruner1.compress()
pruner2.compress()
pruner1.export_model('./ori_%f.pth' % args.sparsity, './ori_mask_%f' % args.sparsity)
pruner2.export_model('./cons_%f.pth' % args.sparsity, './cons_mask_%f' % args.sparsity)
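# unwrap the pruner wrappers so that ModelSpeedup can rewrite the original modules
# according to the exported masks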
pruner1._unwrap_model()
pruner2._unwrap_model()
ms1 = ModelSpeedup(net1, dummy_input.to(device), './ori_mask_%f' % args.sparsity)
ms2 = ModelSpeedup(net2, dummy_input.to(device), './cons_mask_%f' % args.sparsity)
ms1.speedup_model()
ms2.speedup_model()
print('Model speedup finished')

acc1 = test(net1, device, criterion1, val_loader)
acc2 = test(net2, device, criterion2, val_loader)
print('After pruning: Acc of Original Pruner %f, Acc of Constrained Pruner %f' % (acc1, acc2))

for epoch in range(args.finetune_epochs):
train(args, net2, device, train_loader,
criterion2, optimizer2, epoch)
scheduler2.step()
acc2 = test(net2, device, criterion2, val_loader)
print('Finetune Epoch %d, acc of constrained pruner %f'%(epoch, acc2))

for epoch in range(args.finetune_epochs):
train(args, net1, device, train_loader,
criterion1, optimizer1, epoch)
scheduler1.step()
acc1 = test(net1, device, criterion1, val_loader)
print('Finetune Epoch %d, acc of original pruner %f'%(epoch, acc1))



acc1 = test(net1, device, criterion1, val_loader)
acc2 = test(net2, device, criterion2, val_loader)
print('After finetuning: Acc of Original Pruner %f, Acc of Constrained Pruner %f' % (acc1, acc2))

flops1, weights1 = count_flops_params(net1, dummy_input.size())
flops2, weights2 = count_flops_params(net2, dummy_input.size())
print('Original pruner flops:{} weight:{}'.format(flops1, weights1))
print('Constrained pruner flops:{} weight:{}'.format(flops2, weights2))
9 changes: 7 additions & 2 deletions src/sdk/pynni/nni/compression/torch/pruning/constants.py
@@ -4,15 +4,20 @@

from ..pruning import LevelPrunerMasker, SlimPrunerMasker, L1FilterPrunerMasker, \
L2FilterPrunerMasker, FPGMPrunerMasker, TaylorFOWeightFilterPrunerMasker, \
ActivationAPoZRankFilterPrunerMasker, ActivationMeanRankFilterPrunerMasker
ActivationAPoZRankFilterPrunerMasker, ActivationMeanRankFilterPrunerMasker, \
L1ConstrainedFilterPrunerMasker, L2ConstrainedFilterPrunerMasker, \
ConstrainedActivationMeanRankFilterPrunerMasker

MASKER_DICT = {
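    # keys ending in '_constrained' map the new constraint-aware pruners to their maskers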
'level': LevelPrunerMasker,
'slim': SlimPrunerMasker,
'l1': L1FilterPrunerMasker,
'l1_constrained': L1ConstrainedFilterPrunerMasker,
'l2': L2FilterPrunerMasker,
'l2_constrained': L2ConstrainedFilterPrunerMasker,
'fpgm': FPGMPrunerMasker,
'taylorfo': TaylorFOWeightFilterPrunerMasker,
'apoz': ActivationAPoZRankFilterPrunerMasker,
'mean_activation': ActivationMeanRankFilterPrunerMasker
'mean_activation': ActivationMeanRankFilterPrunerMasker,
'mean_activation_constrained': ConstrainedActivationMeanRankFilterPrunerMasker
}