% !TEX root = thesis.tex
\documentclass[thesis]{subfiles}
\begin{document}
\chapter{Spatial Connectivity}\label{lowrankfilters}
%\chapter{Learning a Basis for the Spatial Extents of Filters}
\begin{chapquote}{Yann LeCun, \textit{Backprop.\ Applied to Handwritten Zip Code Recognition}, 1989}
``Classical work in visual pattern recognition has demonstrated the advantage of extracting local features and combining them to form higher-order features. Such knowledge can be easily built into the network by forcing the hidden units to combine only local sources of information. Distinctive features of an object can appear at various location on the input image. Therefore it seems judicious to have a set of feature detectors that can detect a particular instance of a feature anywhere on the input plane.''
%Since the precise location of a feature is not relevant to the classification, we can afford to lose some position information in the process. Nevertheless, approximate position information must be preserved, to allow the next levels to detect higher order, more complex features.''
\end{chapquote}
\glspl{cnn}\index{CNN} (see \cref{cnns}) are a highly specialized form of neural network\index{neural network} for learning image representations. Their use of \emph{convolutional filters} allows \glspl{cnn}\index{CNN} to learn much more efficient representations, in terms of both memory and computation, than a fully connected network. Such filters usually have limited spatial extents (\ie width and height, as opposed to channels) and their learned weights are shared across the image's spatial domain to provide translation invariance~\citep{Fuk80,Lecun1998}.
Thus, as illustrated in \cref{fig:sparseconn}, in comparison with fully-connected network layers (\cref{fig:sparseconn}(a)), convolutional layers have a much sparser connection structure and use fewer parameters (\cref{fig:sparseconn}(b)).
This leads to faster training and inference, better generalization, and higher accuracy.
This chapter focuses on reducing the computational complexity of the convolutional layers of \glspl{cnn}\index{CNN} by further sparsifying their spatial connection structures. Specifically, we show that by representing convolutional filters using a basis space comprising groups of filters of different spatial dimensions (examples shown in \cref{fig:sparseconn}(c, d)), we can significantly reduce the computational complexity of existing state-of-the-art \glspl{cnn}\index{CNN} without compromising classification accuracy.
%%% Figure
\afterpage{
\begin{landscape}
\begin{figure}[p]
\centering
\includegraphics[height=0.8\textheight]{sparseconn3}
\caption[Image access map visualizing sparsity of convolutional filters]{
{\bf Network connection structure for convolutional layers.} For a single-layer neural network, the sparsity of a convolutional filter as compared to a fully connected network is illustrated. Connection weight maps (centre) show the pairwise dependencies between input and output pixels. In a fully-connected network (a), each output node is connected to all input pixels. For a \gls{cnn} (b,c,d), the output pixels depend only on a sparse subset of input pixels, where shared weights are represented by repeated unique colours, and white pixels represent pixels with no connection. Note that sparsity increases from (a) to (d), opening up the potential for more efficient implementations.
}
\label{fig:sparseconn}
\end{figure}
\end{landscape}
}
%%%
Our contributions include a novel method of learning a set of small basis filters that are combined to represent larger filters efficiently. Rather than approximating previously trained networks, we train networks \emph{from scratch} and show that our convolutional layer representation can improve both efficiency and classification accuracy. Unlike methods that approximate previously-trained models (as listed in \cref{approxmethods,factorized}), this allows us to reduce training time, and even increase accuracy over the original model. We further describe how to initialize connection weights effectively for training networks with composite convolutional layers containing groups of differently-shaped filters, which we found to be of critical importance to our training method\footnote{Note that much of this work was done before the widespread use of batch normalization; however, initialization still plays an important role.}.
\section{Related Work}
\label{relatedwork}
There has been much previous work on increasing the test-time efficiency of \glspl{cnn}\index{CNN}. Some promising approaches make use of more hardware-efficient representations. For example, \citet{1502.02551v1} and \citet{vanhoucke2011improving} achieve training- and test-time compute savings by further quantization of network weights that were originally represented as 32-bit floating point numbers. However, more relevant to our work are approaches that depend on new network connection structures, efficient approximations of previously trained networks, and learning low-rank filters.
\paragraph{Efficient Network Connection Structures}
The trained weights of \glspl{cnn}\index{CNN} have been shown to contain significant redundancy~\citep{Denil2013predicting}. \citet{lecun1989optimal} suggest a method of pruning unimportant connections within networks. However, this requires repeated network re-training and may be infeasible for modern, state-of-the-art \glspl{cnn}\index{CNN} requiring weeks of training time. \citet{Lin2013NiN} show that the geometric increase in the number and dimensions of filters with deeper networks can be managed using low-dimensional embeddings. The same authors show that global average-pooling may be used to decrease model size in networks with fully-connected layers. \citet{Simonyan2014verydeep} show that stacked filters with small spatial dimensions (\eg $3$$\times$$3$) can operate on the effective receptive field of larger filters (\eg $5$$\times$$5$) with less computational complexity.
\paragraph{Low-Rank Filter Approximations}
\label{approxmethods}
\citet{conf/cvpr/RigamontiSLF13} approximate {\em previously trained} \glspl{cnn}\index{CNN} with low-rank filters for the semantic segmentation of curvilinear structures within volumetric medical imagery. They discuss two approaches: enforcing an $\ell_1$-based regularization to learn approximately low rank filters, which are later truncated to enforce a strict rank, and approximating a set of pre-learned filters with a tensor decomposition into many rank-1 filters. Neither approach learns low rank filters directly, and indeed the second approach proved the more successful.
The work of \citet{journals/corr/JaderbergVZ14} also approximates the existing filters of previously trained networks. They find separable 1D filters through an optimization that minimizes the reconstruction error of the already-learned full-rank filters. They achieve a 4.5$\times$ speed-up with a loss of accuracy of 1\% in a text recognition problem. However, since the method is demonstrated only on text recognition, it is not clear how well it would scale to larger datasets or more challenging problems. A key insight of the paper is that filters can be represented by low-rank approximations not only in the spatial domain but also in the channel domain.
Both of these methods show that, at least for their respective applications, low rank approximations of full-rank filters learned in convolutional networks can increase test-time efficiency significantly. However, being approximations of pre-trained networks, they are unlikely to improve test accuracy, and can only increase the computational requirements during training.
\paragraph{Learning Separable (Factorized) Filters}
\label{factorized}
\citet{mamalet2012simplifying} propose training networks with separable filters on the task of digit recognition with the \gls{mnist} dataset. They train networks with \emph{sequential} convolutional layers of horizontal and vertical 1D filters, achieving a speed-up factor of 1.6$\times$, but with a relative increase in test error of 13\% (1.45\% \vs 1.28\%). Our approach differs in that it allows both horizontal and vertical 1D filters (and other shapes too) within the same layer, avoiding issues with ordering. We also demonstrate a decrease in error, and validate on more challenging datasets.
\section{Using Low-Rank Filters in CNNs}
%%% Figure
\begin{figure}[tbp]
\begin{subfigure}[b]{0.98\textwidth}
\centering
\includegraphics[height=0.165\textheight, page=1]{sparsification}
\caption{A standard convolutional layer.}\label{fig:fullrank}
\end{subfigure}\\
\begin{subfigure}[b]{0.98\textwidth}
\centering
\includegraphics[height=0.16\textheight, page=2]{sparsification}
\caption{Sequential separable filters~\citep{journals/corr/JaderbergVZ14}.}\label{fig:separableseq}
\end{subfigure}\\
\begin{subfigure}[b]{0.98\textwidth}
\centering
\includegraphics[height=0.19\textheight, page=3]{sparsification}
\caption{Our method, a learned basis space of filters that are rectangular in the spatial domain and oriented horizontally and vertically.}\label{fig:ourmethod}
\end{subfigure}\\
\begin{subfigure}[b]{0.98\textwidth}
\centering
\includegraphics[height=0.23\textheight, page=4]{sparsification}
\caption{Our method, a learned basis space of vertical/horizontal rectangular filters and square filters. Filters of other shapes are also possible.}\label{fig:ourmethodfullrank}
\end{subfigure}
\caption[Overview of methods of using low-rank filters]{\textbf{Methods of using low-rank filters in \glsfmtplural{cnn}\index{CNN}}. Methods from literature and our proposed methods for learning low rank filters. The activation function is not shown, coming after the last layer in each configuration.}\label{fig:separablemethods}
\end{figure}
%%%
\subsection{Convolutional Filters}
The convolutional layers of a \gls{cnn} produce output `images' (usually called \emph{feature maps}\index{feature map}) by convolving input images with one or more learned filters. %The output images of convolutional layers are to distinguish them from raw input images.
In a typical convolutional layer, as illustrated in \cref{fig:fullrank}, a $c$-channel input image of size $H$$\times$$W$ pixels is convolved with $d$ filters of size $h$$\times$$w$$\times$$c$ to create a $d$-channel output image. Each filter is represented by $h w c$ independent weights. Therefore the computational complexity for the convolution of the filter with a $c$-channel input image is $\gls{bigoh}(d w h c)$ (per pixel in the output \gls{featuremap}\index{feature map}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Here we will use our existing mathematical description of convolution from \cref{sec:convolutioninpractice}, where the incoming \gls{featuremap} is denoted by $\gls{fmX}$, the outgoing \gls{featuremap} by $\gls{fmY}$, and the convolutional filter, or kernel, by $\gls{fmK}$. The scalar elements of each \gls{featuremap} are $\gls{fmX}_{i,j,k}$, $\gls{fmY}_{i,j,k}$, where $i \in \{0,\ldots,\gls{c}\}$ is the \gls{featuremap} channel (\ie colour for an input image), and $j \in \{0,\ldots,\gls{filterh}\}$, $k \in \{0,\ldots,\gls{filterw}\}$ are the spatial coordinates, rows and columns respectively, of the channel $i$ image. The filter's scalar elements are $\gls{fmK}_{i,j,k,l}$, where $i$ is the filter's index in the convolutional layer's filter bank and the output channel in $\gls{fmY}$ to which the filter's result is written, $j$ is the input channel in $\gls{fmX}$ over which the filter's spatial elements are convolved, and $(k, l)$ are the row and column offsets between the output and input images.
A convolutional layer, as illustrated in \cref{fig:fullrank}, convolves each filter across the input \gls{featuremap} such that,
\begin{equation}
\begin{aligned}
\gls{fmY}_{i,j,k} &= \sum_{l,m,n} \gls{fmX}_{l,j+m,k+n}\; \gls{fmK}_{i,l,m,n},
\end{aligned}
\end{equation}
for all valid indices $l,m,n$, depending on the \gls{padding}\index{padding} of the input image. We will express this more compactly using the convolution operator $\gls{convolution}$, considering only a single output pixel (\ie fixed $j,\,k$) for simplicity,
\begin{equation}
\begin{aligned}
\gls{fmY}_i &= \sum_{l} \gls{fmX}_l \gls{convolution} \gls{fmK}_{il}.
\end{aligned}
\end{equation}
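For concreteness, the convolution above can be evaluated directly in NumPy. The following sketch is purely illustrative (it is not the implementation used for the experiments in this chapter) and assumes `valid' \gls{padding}\index{padding}; the arrays \texttt{X}, \texttt{K} and \texttt{Y} mirror $\gls{fmX}$, $\gls{fmK}$ and $\gls{fmY}$, and the innermost operation makes the per-pixel cost of $\gls{bigoh}(d w h c)$ explicit.
\begin{verbatim}
import numpy as np

def conv_layer(X, K):
    """Direct evaluation of Y[i,j,k] = sum_{l,m,n} X[l,j+m,k+n] * K[i,l,m,n].

    X: input feature map of shape (c, H, W)
    K: filter bank of shape (d, c, h, w)
    Returns Y of shape (d, H-h+1, W-w+1), i.e. 'valid' padding.
    Each output pixel of each of the d channels costs h*w*c
    multiply-accumulates, giving O(d w h c) per output pixel.
    """
    c, H, W = X.shape
    d, c_k, h, w = K.shape
    assert c == c_k
    Y = np.zeros((d, H - h + 1, W - w + 1))
    for i in range(d):                     # output channel
        for j in range(H - h + 1):         # output row
            for k in range(W - w + 1):     # output column
                Y[i, j, k] = np.sum(X[:, j:j + h, k:k + w] * K[i])
    return Y
\end{verbatim}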
%%%%%%%%%%%%%%%%%%%%%%%%%55
In what follows, we describe schemes for modifying the architecture of the convolutional layers so as to reduce computational complexity. The idea is to replace expensive, full-rank spatial convolutional filters with modified versions that represent the same number of effective filters by linear combinations of smaller basis filters.
\subsection{Sequential Separable Filters}\label{seqsep}
An existing scheme for reducing the computational complexity of convolutional layers~\citep{journals/corr/JaderbergVZ14} is to replace each one with a sequence of two regular convolutional layers but with filters that are rectangular in the spatial domain, as shown in \cref{fig:separableseq}.
The first convolutional layer has $m$ horizontal filters $\gls{fmK}_{i,\,l=0,\ldots,m}$ of size $w$$\times$$1$$\times$$c$, producing an output \gls{featuremap}\index{feature map} with $m$ channels. The second convolutional layer has $d$ vertical filters $\gls{fmK}_{i,\,l=0,\ldots,d}$ of size $1$$\times$$h$$\times$$m$, producing an output \gls{featuremap}\index{feature map} with $d$ channels.
Mathematically, the separable convolution illustrated in \cref{fig:separableseq} can be expressed as,
\begin{equation}
\begin{aligned}
\gls{fmY}_i &= \sum_{l} \gls{fmY}^\textrm{h}_l \gls{convolution} \gls{fmK}^\textrm{v}_{il}\\
&= \sum_{l} \left(\sum_{j}\gls{fmX}_j \gls{convolution} \gls{fmK}^\textrm{h}_{lj}\right) \gls{convolution} \gls{fmK}^\textrm{v}_{il},
\end{aligned}
\end{equation}
where $\gls{fmY}^\textrm{h}_l$ is the $l^\text{th}$ channel of the intermediate \gls{featuremap}\index{feature map} produced by convolving the input \gls{featuremap} with the horizontal filters $\gls{fmK}^\textrm{h}$, the inner sum over $j$ runs over the $\gls{c}$ input channels, and $\gls{fmK}^\textrm{v}$ are the vertical filters.
By these means the full rank original convolutional filter bank is represented by a low rank approximation formed from a linear combination of a set of separable $w$$\times$$h$ basis filters. The computational complexity of this scheme is $\gls{bigoh}(m c w)$ for the first layer of horizontal filters and $\gls{bigoh}(d m h)$ for the second layer of vertical filters, with a total of $\gls{bigoh}(m(c w + d h))$.
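As a concrete check of these complexity expressions, the short Python sketch below (illustrative only; the layer sizes are placeholders rather than values taken from our experiments) counts per-pixel multiply-accumulate operations for a standard full-rank layer and for the sequential separable scheme.
\begin{verbatim}
def full_rank_macs(c, d, w, h):
    # d filters of size w x h x c: O(d w h c) per output pixel
    return d * w * h * c

def separable_macs(c, d, m, w, h):
    # m horizontal w x 1 x c filters, then d vertical 1 x h x m filters:
    # O(m (c w + d h)) per output pixel
    return m * c * w + d * m * h

# placeholder example: a 3x3 layer with 256 input/output channels,
# replaced by m = 256 separable basis filters
print(full_rank_macs(c=256, d=256, w=3, h=3))         # 589824
print(separable_macs(c=256, d=256, m=256, w=3, h=3))  # 393216
\end{verbatim}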
Note that \citet{journals/corr/JaderbergVZ14} use this scheme to approximate existing full rank filters belonging to previously trained networks using a retrospective fitting step. In this work, by contrast, we {\em train} networks containing convolutional layers with this architecture from scratch. In effect, we learn the separable basis filters and their combination weights simultaneously during network training.
\subsection{Filters as Linear Combinations of a Basis}
We introduce a novel method for reducing convolutional layer complexity by training with low-rank filters. This works by representing convolutional filters as linear combinations of basis filters as illustrated in \cref{fig:ourmethod}. This scheme uses \emph{\glspl{compositelayer}\index{composite layer|textbf}} comprising several sets of filters where the filters in each set have different spatial dimensions (see \cref{fig:compositelayers}). The outputs of these basis filters may be combined in a subsequent layer containing filters with spatial dimensions $1$$\times$$1$.
This configuration is illustrated in \cref{fig:ourmethod}, and can be expressed as,
\begin{equation}
\begin{aligned}
\gls{fmY}_i &= \sum_{l} \gls{fmY}^\textrm{basis}_l \gls{convolution} \gls{fmK}^\textrm{weights}_{il}\\
&= \sum_{l} f^\textrm{weights}_{il} \gls{fmY}^\textrm{basis}_l \quad\textrm{(since $\gls{fmK}^\textrm{weights}_{il}$ is a scalar)}\\
&= \sum_{l=0}^{m/2-1} f^\textrm{weights}_{il} \gls{fmX}_l \gls{convolution} \gls{fmK}^\textrm{h}_{il} + \sum_{l=m/2}^{m-1} f^\textrm{weights}_{il} \gls{fmX}_l \gls{convolution} \gls{fmK}^\textrm{v}_{il},
\end{aligned}
\end{equation}
where $\gls{fmK}^\textrm{h}$ and $\gls{fmK}^\textrm{v}$ are the horizontal and vertical filters respectively, and $f^\textrm{weights}_{il}$ denotes the single scalar weight of the $1$$\times$$1$ filter $\gls{fmK}^\textrm{weights}_{il}$.
Here, our \gls{compositelayer}\index{composite layer} contains horizontal $w$$\times$$1$ and vertical $1$$\times$$h$ filters, the outputs of which are concatenated in the channel dimension, resulting in an intermediate $m$-channel \gls{featuremap}\index{feature map}. These filter responses are then linearly combined by the next layer of $d$ $1$$\times$$1$ filters to give a $d$-channel output \gls{featuremap}\index{feature map}. In this case, the basis filters are applied to the input \gls{featuremap}\index{feature map} with $c$ channels, and are followed by a set of $d$ $1$$\times$$1$ filters operating over the $m$ output channels of the basis filters. If the number of horizontal and vertical filters is the same, the computational complexity is $\gls{bigoh}( m(wc/2 +hc/2 + d))$.
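Purely for illustration, a \gls{compositelayer}\index{composite layer} of this kind could be sketched as follows in a modern framework such as PyTorch (the networks in this chapter were not implemented this way, and the layer sizes below are placeholders): half of the $m$ basis filters are horizontal, half are vertical, their outputs are concatenated along the channel dimension, and $d$ $1$$\times$$1$ filters learn the linear combination.
\begin{verbatim}
import torch
import torch.nn as nn

class CompositeLowRankLayer(nn.Module):
    """Sketch of a composite layer: horizontal and vertical basis filters,
    concatenated and then linearly combined by 1x1 filters."""

    def __init__(self, c_in, m, d_out):
        super().__init__()
        # PyTorch kernel_size is (height, width): (1, 3) spans 3 pixels
        # horizontally, (3, 1) spans 3 pixels vertically.
        self.horizontal = nn.Conv2d(c_in, m // 2, kernel_size=(1, 3),
                                    padding=(0, 1))
        self.vertical = nn.Conv2d(c_in, m // 2, kernel_size=(3, 1),
                                  padding=(1, 0))
        # d_out 1x1 filters learn the combination of the m basis responses
        self.combine = nn.Conv2d(m, d_out, kernel_size=1)

    def forward(self, x):
        basis = torch.cat([self.horizontal(x), self.vertical(x)], dim=1)
        return self.combine(basis)  # the non-linearity follows this layer
\end{verbatim}
Omitting the final $1$$\times$$1$ layer corresponds to letting the following layer learn the combination implicitly, a variant explored in \cref{spatialbasisresults}.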
The effective filters learned in our models are low-rank in the sense that, although most of the learned basis filters are much smaller than the filters of the original networks (\eg $1$$\times$$h$ and $w$$\times$$1$), when even a few full $w$$\times$$h$ basis filters are also used the effective filter size is still a full $w$$\times$$h$ (\eg $3$$\times$$3$). Necessarily, some of the parameters in our effective filters are linear combinations of others, since each effective filter is a learned linear combination of the low-rank basis filters.
Interestingly, the configuration of \cref{fig:ourmethod}, where we only use horizontal and vertical basis filters, gives rise to linear combinations of horizontal and vertical filters that are cross-shaped in the spatial domain. This is illustrated in \cref{fig:conv1filters} for filters learned in the first convolutional layer of the `vgg-gmp-lr-join' model that is described in \cref{spatialbasisresults}, where it is trained using the \gls{ilsvrc} dataset.
\begin{figure}[tbp]
\centering
\begin{tabular}[c]{rl}
&
\subcaptionbox{$3\times1$ filters.\label{fig:horizontalfilters}}
{
\includegraphics[width=0.25\textheight]{conv1_x}
}\\
\subcaptionbox{$1\times3$ filters.\label{fig:verticalfilters}}[0.1\textheight]
{
\includegraphics[height=0.25\textheight]{conv1_y}
}&
\subcaptionbox{Learned linear combinations.\label{fig:linearcomb}}
{
\includegraphics[width=0.25\textheight]{linearcombinations}
}\\
\end{tabular}
\caption[Learned cross-shaped filters]{\textbf{Learned Cross-Shaped Filters}. The cross-shaped filters (c) learned as weighted linear combinations of the (a) $3$$\times$$1$ and (b) $1$$\times$$3$ basis filters in the first convolutional layer of the `vgg-gmp-lr-join' model trained using the \gls{ilsvrc} dataset.}\label{fig:conv1filters}
\end{figure}
Note that, in general, more than two different sizes of basis filter might be used in the \gls{compositelayer}\index{composite layer}. In the more general case, for a set of heterogeneous filter groups $\gls{fmK}^{g=0,\ldots,G}$, we can express this as
\begin{equation}
\begin{aligned}
\gls{fmY}_i &= \sum_{l} f^\textrm{weights}_{il} \sum_g \gls{fmX}_l \gls{convolution} \gls{fmK}^g_{il}.
\end{aligned}
\end{equation}
For example, \cref{fig:ourmethodfullrank} shows a combination of three sets of filters with spatial dimensions $w$$\times$$1$, $1$$\times$$h$, and $w$$\times$$h$. Also note that an interesting option is to omit the $1$$\times$$1$ linear combination layer and instead allow the connection weights in a subsequent network layer to learn to combine the basis filters of the preceding layer (despite any intermediate non-linearity, \eg \glspl{relu}). This possibility is explored empirically in \cref{spatialbasisresults}.
In its use of a combination of filters in a \gls{compositelayer}\index{composite layer}, our method is similar to the `GoogLeNet' of \citet{Szegedy2014going}, which uses \gls{inception}\index{inception} modules comprising several (square) filters of different sizes ranging from 1$\times$1 to 5$\times$5. In our case, however, we are implicitly learning linear combinations of less computationally expensive filters with different orientations (\eg 3$\times$1 and 1$\times$3 filters), rather than combinations of filters of different sizes. Amongst networks with similar computational requirements, GoogLeNet is one of the most accurate for large-scale image classification tasks (see \cref{fig:vggplots}), partly due to the use of heterogeneous filters in the \gls{inception}\index{inception} modules, but also to the use of low-dimensional embeddings and global pooling.
\section[Training CNNs with Mixed-Shape Low-Rank Filters]{Training CNNs with Mixed-Shape\texorpdfstring{\\}{ }Low-Rank Filters}\label{initialization}
To determine the standard deviations to be used for weight initialization, we use an approach similar to that described by \citet{glorot2010understanding} (with the adaptation described by \citet{He2015b} for layers followed by a \gls{relu}). In \cref{initializationderivation}, we show the details of our derivation, generalizing the approach of \citet{He2015b} to the initialization of \glspl{compositelayer}\index{composite layer} comprising several groups of filters of different spatial dimensions (see \cref{initializationderivation}, \cref{fig:compositelayers}).
At the start of training, network weights are initialized at random using samples drawn from a Gaussian distribution with a standard deviation parameter specified separately for each layer. We found that the setting of these parameters was critical to the success of network training and difficult to get right, particularly because published parameter settings used elsewhere were not suitable for our new network architectures. With unsuitable weight initialization, training may fail due to {\em exploding gradients}, where back-propagated gradients grow so large as to cause numeric overflow, or {\em vanishing gradients}, where back-propagated gradients diminish to the point that their effect is dwarfed by that of weight decay\index{weight decay} and the loss does not decrease during training~\citep{Hochreiter01gradientflow}.
The approach of \citet{glorot2010understanding} works by ensuring that the magnitudes of back-propagated gradients remain approximately the same throughout the network. Otherwise, if the gradients were inappropriately scaled by some factor (\eg $\gls{beta}$) then the final back-propagated signal would be scaled by a potentially much larger factor ($\gls{beta}^L$ after $L$ layers) (see \cref{ssec:init}).
\subsection{Derivation of the Initialization for \Glsfmtplural{compositelayer}}\label{initializationderivation}
In what follows, we adopt notation similar to that of \citet{He2015b}, and follow their derivation of the appropriate standard deviation for weight initialization. However, we also generalize their approach to the initialization of \glspl{compositelayer}\index{composite layer} comprising several groups of filters of different spatial dimensions (see \cref{fig:compositelayers}).
\paragraph{Forward Propagation}
The response of the $l^\text{th}$ convolutional layer can be represented as,
\begin{equation}
\gls{vectory}_l =\gls{wmatrix}_l \gls{vectorx}_l + \gls{vectorb}_l,
\end{equation}
where $\gls{vectory}_l$ is a $\gls{d}$$\times$$1$ vector representing a pixel in the output \gls{featuremap}\index{feature map}, and $\gls{vectorx}_l$ is a $\gls{filterw} \gls{filterh} \gls{c} \times 1$ vector that represents a $\gls{filterw}$$\times$$\gls{filterh}$ sub-region of the $\gls{c}$-channel input \gls{featuremap}\index{feature map}. $\gls{wmatrix}_l$ is the $\gls{d}$$\times$$n$ weight matrix, where $\gls{d}$ is the number of filters and $n$ is the size of a filter, \ie $n = \gls{filterw} \gls{filterh} \gls{c}$ for a filter with spatial dimensions $\gls{filterw}$$\times$$\gls{filterh}$ operating on an input \gls{featuremap}\index{feature map} of $\gls{c}$ channels, and $\gls{vectorb}_l$ is the bias. Finally $\gls{vectorx}_l = f(\gls{vectory}_{l-1})$ is the output of the previous layer passed through an activation function $\gls{f}$ (\eg the application of a \gls{relu} to each element of $\gls{vectory}_{l-1}$).
\paragraph{Backward Propagation}
During backpropagation\index{backpropagation}, the gradient of a convolutional layer is computed as,
\begin{equation}
\Delta \gls{vectorx}_l = \hat{\gls{wmatrix}}_l \Delta \gls{vectory}_l,
\label{eq:back_prop_gradient}
\end{equation}
where $\Delta \gls{vectorx}_l$ and $\Delta \gls{vectory}_l$ denote the derivatives of loss $\gls{L}$ with respect to input and output pixels. $\Delta \gls{vectorx}_l$ is a $\gls{c}$$\times$$1$ vector of gradients with respect to the channels of a single pixel in the input \gls{featuremap}\index{feature map} and $\Delta \gls{vectory}$ represents $\gls{filterh}$$\times$$\gls{w}$ pixels in $d$ channels of the output \gls{featuremap}\index{feature map}. $\hat{\gls{wmatrix}}_l$ is a $\gls{c}$$\times$$\hat{n}$ matrix, and $\hat{n} = \gls{filterw}\gls{filterh}\gls{d}$. Note that $\hat{\gls{wmatrix}}_l$ can be simply reshaped from $\gls{wmatrix}_l^\top$. Also note that the elements of $\Delta \gls{vectory}_l$ correspond to pixels in the output image that had a forward dependency on the input image pixel corresponding to $\Delta \gls{vectorx}$. In backpropagation\index{backpropagation}, each element $\Delta \gls{y}_l$ of $\Delta \gls{vectory}_l$ is related to an element $\Delta \gls{x}_{l+1}$ of some $\Delta \gls{vectorx}_{l+1}$ (\ie a back-propagated gradient in the next layer) by the derivative of the activation function $\gls{f}$:
\begin{equation}
\Delta \gls{y}_l = \gls{f}^\prime (\gls{y}_l) \Delta \gls{x}_{l+1},
\end{equation}
where $\gls{f}^\prime$ is the derivative of the activation function.
\newcommand{\Expect}{\gls{expected}}
\newcommand{\Var}{\gls{var}}
\paragraph{Weight Initialization}
Now let $\Delta \gls{y}_l$, $\Delta \gls{x}_l$ and $\gls{w}_l$ be scalar random variables that describe the distribution of elements in $\Delta \gls{vectory}_l$, $\Delta \gls{vectorx}_{l}$ and $\hat{\gls{wmatrix}}_l$ respectively. Then, assuming $f^\prime (\gls{y}_l)$ and $\Delta \gls{x}_{l+1}$ are independent,
\begin{equation}
\Expect{}[\Delta \gls{y}_l] = \Expect{}[\gls{f}^\prime (\gls{y}_l)] \, \Expect{}[ \Delta \gls{x}_{l+1}].
\end{equation}
For the \gls{relu} case, $\gls{f}'(\gls{y}_l)$ is zero or one with equal probability.
Like \citet{glorot2010understanding}, we assume that $\gls{w}_l$ and $\Delta \gls{y}_l$ are independent. Thus, \cref{eq:back_prop_gradient} implies that $\Delta \gls{x}_l$ has zero mean for all layers $l$, when $\gls{w}_l$ is initialized by a distribution that is symmetric around zero. Thus we have $\Expect{}[\Delta \gls{y}_l] = \frac{1}{2}\Expect{}[\Delta \gls{x}_{l+1}]= 0$ and also $\Expect{}[(\Delta \gls{y}_l)^2] = \Var{}[\Delta \gls{y}_l] = \frac{1}{2} \Var{}[\Delta \gls{x}_{l+1}]$. Now, since each element of $\Delta \gls{vectorx}_l$ is a summation of $\hat n$ products of elements of $\hat{\gls{wmatrix}}_l$ and elements of $\Delta \gls{vectory}_l$, we can compute the variance of the gradients in \cref{eq:back_prop_gradient}:
\begin{equation}
\begin{aligned}
\Var{}[\Delta \gls{x}_l] &= \hat{n} \Var{}[\gls{w}_l] \Var{}[\Delta \gls{y}_l]\\
&= \frac{1}{2} \hat{n} \Var{}[\gls{w}_l] \Var{} [\Delta \gls{x}_{l+1}].
\end{aligned}
\end{equation}
To avoid scaling the gradients in the convolutional layers (and so avoid exploding or vanishing gradients), we set the ratio between these variances to 1:
\begin{equation}
\frac{1}{2} \hat{n} \Var{}[\gls{w}_l] = 1.
\end{equation}
This leads to the result of \citet{He2015b}, in that a layer with $\hat{n}_l$ connections followed by a \gls{relu} activation function should be initialized with a zero-mean Gaussian distribution with standard deviation $\sqrt{2/ \hat{n}_l}$.
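Expressed as code, this rule amounts to the following minimal NumPy sketch (for illustration only, and not the configuration used in our experiments): each filter bank is drawn from a zero-mean Gaussian whose standard deviation is computed from $\hat{n}$.
\begin{verbatim}
import numpy as np

def init_conv_weights(d, c, h, w, rng=None):
    """Initialize d filters of spatial size h x w over c input channels
    for a layer followed by a ReLU.

    Back-propagation through this layer involves n_hat = w * h * d
    connections per input pixel, so weights are drawn from N(0, 2/n_hat).
    """
    if rng is None:
        rng = np.random.default_rng()
    n_hat = w * h * d
    std = np.sqrt(2.0 / n_hat)
    return rng.normal(loc=0.0, scale=std, size=(d, c, h, w))
\end{verbatim}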
\paragraph{Weight Initialization in \Glsfmtplural{compositelayer}}
%%% Figure
\begin{figure}[tbp]
\centering
\includegraphics[width=0.95\textwidth]{composite}
\caption[A composite convolutional layer]{{\bf A composite convolutional layer}. \Glsfmtplural{compositelayer}\index{composite layer} convolve an input \gls{featuremap}\index{feature map} with $N$ groups of convolutional filters of several different spatial dimensions. Here the $\gls{g}^\text{th}$ group has $\gls{d}^{[\gls{g}]}$ filters with spatial dimension $\gls{w}^{[\gls{g}]} \times \gls{filterh}^{[\gls{g}]}$. The outputs are concatenated to create a $\gls{d}$ channel output \gls{featuremap}\index{feature map}. \Glsfmtplural{compositelayer}\index{composite layer} require careful weight initialization to avoid vanishing/exploding gradients during training.}
\label{fig:compositelayers}
\end{figure}
%%%
The initialization scheme described above assumes that the layer comprises filters of spatial dimension $\gls{w}$$\times$$\gls{filterh}$. Now we extend this scheme to composite convolutional layers\index{composite layer} containing $N$ groups of filters of different spatial dimensions $\gls{w}^{[\gls{g}]} \times \gls{filterh}^{[\gls{g}]}$ (where the superscript $[\gls{g}]$ denotes the group index, with $\gls{g}\in \{1,\dots,N\}$). Now the layer response is the concatenation of the responses of each group of filters:
\begin{equation}
\gls{vectory}_l =\begin{bmatrix}\gls{wmatrix}_l^{[1]} \gls{vectorx}_l^{[1]} \\ \gls{wmatrix}_l^{[2]} \gls{vectorx}_l^{[2]} \\ \vdots \\ \gls{wmatrix}_l^{[N]} \gls{vectorx}_l^{[N]} \end{bmatrix} + \gls{vectorb}_l.
\end{equation}
As before $\gls{vectory}_l$ is a $\gls{d}$$\times$$1$ vector representing the response at one pixel of the output \gls{featuremap}\index{feature map}. Now each ${\gls{vectorx}}^{[\gls{g}]}$ is a $\gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{c} \times 1$ vector that represents a different shaped $\gls{w}^{[\gls{g}]} \times \gls{filterh}^{[\gls{g}]}$ sub-region of the input \gls{featuremap}\index{feature map}. Each $\gls{wmatrix}_l^{[\gls{g}]}$ is the $\gls{d}^{[\gls{g}]}\times n^{[\gls{g}]}$ weight matrix, where $\gls{d}^{[\gls{g}]}$ is the number of filters in group $\gls{g}$ and $n^{[\gls{g}]}$ is the size of each filter, \ie $n^{[\gls{g}]} = \gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{c}$ for a filter of spatial dimension $\gls{w}^{[\gls{g}]} \times \gls{filterh}^{[\gls{g}]}$ operating on an input \gls{featuremap}\index{feature map} of $\gls{c}_l = \gls{d}_{l-1}$ channels.
During backpropagation, the gradient of the composite convolutional layer\index{composite layer} is computed as a summation of the contributions from each group of filters:
\begin{equation}
\Delta \gls{vectorx}_l = \hat{\gls{wmatrix}}_l^{[1]} \Delta \gls{vectory}_l^{[1]} + \hat{\gls{wmatrix}}_l^{[2]} \Delta \gls{vectory}_l^{[2]} + \cdots+ \hat{\gls{wmatrix}}_l^{[N]} \Delta \gls{vectory}_l^{[N]},
\label{eq:back_prop_gradient_composite}
\end{equation}
where now $\Delta \gls{vectory}^{[\gls{g}]}$ represents $\gls{w}^{[\gls{g}]} \times \gls{filterh}^{[\gls{g}]}$ pixels in $\gls{d}^{[\gls{g}]}$ channels of the output \gls{featuremap}\index{feature map}. Each $\hat{\gls{wmatrix}}_l^{[\gls{g}]}$ is a $\gls{c}_l \times \hat{n}^{[\gls{g}]}$ matrix of weights arranged appropriately for backpropagation. Again, note that each $\hat{\gls{wmatrix}}_l^{[\gls{g}]}$ can be simply reshaped from $\gls{wmatrix}_l^{[\gls{g}]}$.
As before, each element of $\Delta \gls{vectorx}_l$ is a sum over $\hat n$ products between elements of $\hat{\gls{wmatrix}}^{[\gls{g}]}_l$ and elements of $\Delta \gls{vectory}^{[\gls{g}]}_l$, where here $\hat{n}$ is given by:
\begin{equation}
\hat{n} = \sum_{\gls{g}=1}^{N}{ \gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{d}^{[\gls{g}]}}.
\end{equation}
%
In the case of a \gls{relu} non-linearity, this leads to initializing the weights from a zero-mean Gaussian distribution with standard deviation:
%
\begin{equation}
\gls{stddev} = \sqrt{\frac{2}{\sum_{\gls{g}=1}^{N}{ \gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{d}^{[\gls{g}]}}}}.\label{eqn:reluinit}
\end{equation}
In conclusion, a \gls{compositelayer}\index{composite layer} of heterogeneously-shaped filter groups\index{filter groups}, where each filter group\index{filter groups} $\gls{g}$ has $\gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{d}^{[\gls{g}]}$ outgoing connections, should be initialized as if it were a single layer with $\hat{n} = \sum_{\gls{g}=1}^{N}{ \gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{d}^{[\gls{g}]}}$. Thus, in the case of a \gls{relu} non-linearity, we find that such a \gls{compositelayer}\index{composite layer} should be initialized with a zero-mean Gaussian distribution with the standard deviation given in \cref{eqn:reluinit}.
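A corresponding sketch for a \gls{compositelayer}\index{composite layer} is given below, again for illustration only, with each group described by a tuple of its spatial dimensions and filter count; every group is drawn from the same distribution, using the summed $\hat{n}$ of \cref{eqn:reluinit}.
\begin{verbatim}
import numpy as np

def init_composite_weights(groups, c_in, rng=None):
    """Initialize a composite layer of filter groups followed by a ReLU.

    groups: list of (w_g, h_g, d_g) tuples, one per filter group.
    All groups share the standard deviation sqrt(2 / n_hat), where
    n_hat = sum_g w_g * h_g * d_g.
    """
    if rng is None:
        rng = np.random.default_rng()
    n_hat = sum(w * h * d for (w, h, d) in groups)
    std = np.sqrt(2.0 / n_hat)
    return [rng.normal(0.0, std, size=(d, c_in, h, w))
            for (w, h, d) in groups]

# e.g. a composite layer of 32 horizontal (3x1) and 32 vertical (1x3) filters
weights = init_composite_weights([(3, 1, 32), (1, 3, 32)], c_in=64)
\end{verbatim}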
\section{Results}\label{spatialbasisresults}
To validate our approach, we show that we can replace the filters used in existing state-of-the-art network architectures with low-rank representations as described above to reduce computational complexity without reducing accuracy. Here we characterize the computational complexity of a \gls{cnn} using the number of multiply-accumulate operations required for a forward pass (which depends on the size of the filters in each convolutional layer as well as the input image size and stride).
\subsection[Multiply-Accumulate Operations and Caffe CPU/GPU Timings]{Multiply-Accumulate Operations and\texorpdfstring{\\}{ }Caffe CPU/GPU Timings}\label{mavstimings}
As noted above, we characterize the computational complexity of a \gls{cnn} by the number of multiply-accumulate operations required for a forward pass, so as to give an evaluation of our method that is as independent of hardware and implementation as possible.
\input{lrdata/mavstimings}
However, we have observed a strong correlation between multiply-accumulate counts and run-time for both \gls{cpu} and \gls{gpu} implementations of the networks described here (as shown in \cref{fig:mavstimings}). Note that the Caffe timings diverge more for the initial convolutional layers, where the number of input channels is much smaller (only 3) and \gls{blas} is less efficient for the relatively small matrices being multiplied.
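For reference, the multiply-accumulate count of a single convolutional layer can be computed as in the minimal sketch below; it assumes `same' \gls{padding} and an input whose dimensions are divisible by the stride, and is intended only to make the dependence on filter size, input size, and stride explicit rather than to reproduce our exact accounting.
\begin{verbatim}
def conv_layer_macs(H, W, c, d, h, w, stride=1):
    """Multiply-accumulate operations for one convolutional layer.

    H, W: input spatial dimensions, c: input channels,
    d: number of filters, h x w: filter spatial dimensions.
    Each output pixel of each output channel costs h*w*c MACs.
    """
    out_h, out_w = H // stride, W // stride   # 'same' padding assumed
    return out_h * out_w * d * h * w * c
\end{verbatim}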
\subsection{Methodology}
We augment our training set with randomly cropped and mirrored images, but do not use any scale or photometric augmentation, or over-sampling. This allows us to compare the efficiency of different network architectures without having to factor in the computational cost of the various augmentation methods used elsewhere. During training, for every model except GoogLeNet, we adjust the learning rate according to the schedule,
\begin{equation}
\gls{lr}_t = \gls{lr}_0(1+\gls{lr}_0\gls{weightdecay} \gls{t})^{-1},
\end{equation}
where $\gls{lr}_0,\gls{lr}_t$ and $\gls{weightdecay}$ are the initial learning rate, learning rate at iteration $\gls{t}$, and weight decay\index{weight decay} respectively~\citep{Bottou2012sgdtricks}. When the validation accuracy levels off we manually reduce the learning rate by further factors of 10 until the validation accuracy no longer increases. Unless otherwise indicated, aside from changing the standard deviation of the normally distributed weight initialization, as explained in \cref{initialization}, we used the standard hyper-parameters for each given model. Our results use no test-time augmentation.
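For illustration, the schedule can be computed as follows; the hyper-parameter values in the example are placeholders, not those of any particular model.
\begin{verbatim}
def learning_rate(t, lr0, weight_decay):
    """lr_t = lr_0 * (1 + lr_0 * weight_decay * t)^-1  (Bottou, 2012)."""
    return lr0 / (1.0 + lr0 * weight_decay * t)

# placeholder values for illustration
print(learning_rate(t=0, lr0=0.01, weight_decay=0.0005))       # 0.01
print(learning_rate(t=100000, lr0=0.01, weight_decay=0.0005))  # ~0.0067
\end{verbatim}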
\subsection[VGG-11 Architectures for \Glsfmttext{ilsvrc} Object Classification and MIT Places Scene Classification]{VGG-11 Architectures for \glsfmttext{ilsvrc} Object\\Classification and MIT Places Scene Classification}\label{vggresults}
We evaluated classification accuracy of the VGG-11 based architectures using two datasets, \gls{ilsvrc}~\citep{Jia2014} and MIT Places~\citep{zhou2014learning}. The \gls{ilsvrc} dataset comprises 1.2M training images of 1000 object classes, commonly evaluated by top-1 and top-5 accuracy on the 50K image validation set. The MIT Places dataset comprises 2.4M training images from 205 scene classes, evaluated with top-1 and top-5 accuracy on the 20K image validation set.
VGG-11 (`VGG-A') is an 11-layer convolutional network introduced by \citet{Simonyan2014verydeep}. It is in the same family of network architectures used by \citet{Simonyan2014verydeep,He2015b} to obtain the state-of-the-art accuracy for \gls{ilsvrc}, but uses fewer convolutional layers and therefore fits on a single \gls{gpu} during training. During training of our VGG-11 based models, we used the standard hyperparameters detailed by \citet{Simonyan2014verydeep} and the initialization of \citet{He2015b}.
\subsubsection{\Glsfmttext{vgg}-derived Model Table}
\begin{table}[tbp]
\centering
\caption[Low-rank \Glsfmttext{vgg} \Glsfmttext{ilsvrc} results]{\textbf{\Glsfmttext{vgg} \glsfmttext{ilsvrc} Results.} Accuracy, multiply-accumulate count, and number of parameters for the baseline VGG-11 network (both with and without \gls{gap}) and more efficient versions created by the methods described in this chapter.}\label{table:vggimagenetresults}
\pgfplotstableread[col sep=comma]{lrdata/vggma.csv}\data
% \scalebox{0.9}{
\tabcolsep=5pt
\pgfplotstabletypeset[
every head row/.style={
before row=\toprule,after row=\midrule},
every last row/.style={
after row=\bottomrule},
every first row/.style={
after row=\bottomrule},
%dec sep align, % Align at decimal point
fixed zerofill, % Fill numbers with zeros
columns={Network, Stride, Multiply-Acc., Param., Top-1 Acc., Top-5 Acc.},
column type/.add={lrp{5em}p{4em}rrr}{},
columns/Multiply-Acc./.style={
column name=FLOPS {\small $\times 10^{9}$},
preproc/expr={{##1/1e9}}
},
columns/Param./.style={
column name=Param. {\small $\times 10^{7}$},
preproc/expr={{##1/1e7}}
},
columns/Network/.style={string type},
columns/Stride/.style={precision=0},
columns/Top-1 Acc./.style={precision=3},
columns/Top-5 Acc./.style={precision=3},
highlight col max ={\data}{Top-1 Acc.},
highlight col max ={\data}{Top-5 Acc.},
highlight col min ={\data}{Param.},
highlight col min ={\data}{Multiply-Acc.},
col sep=comma]{\data}
% }
\end{table}
\begin{figure}[tbp]
\centering
\pgfplotstableread[col sep=comma]{lrdata/vggma.csv}\datatable
\pgfplotsset{major grid style={dotted,red}}
\begin{tikzpicture}
\begin{axis}[
width=0.95\textwidth,
height=0.33\textheight,
axis x line=bottom,
ylabel=Top-5 Error,
xlabel=Multiply-Accumulate Operations,
axis lines=left,
enlarge x limits=0.10,
grid=major,
%xmin=0,
ytick={0.01,0.02,...,0.21},
ymin=0.1,ymax=0.15,
yticklabel={\pgfmathparse{\tick*100}\pgfmathprintnumber{\pgfmathresult}\%},style={
/pgf/number format/fixed,
/pgf/number format/precision=1
},
legend style={at={(0.01,0.98)}, anchor=north west, column sep=0.5em},
legend columns=2,
\setplotcyclecat{2},
every axis plot/.append style={fill},
]
\addplot+[mark=square*,nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex>2\def\pgfmathresult{}\fi
},
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} },]{\datatable};
\addplot+[mark=*,nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex<3\def\pgfmathresult{}\fi
},
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} },]{\datatable};
\legend{Baseline Networks, Our Results}
\end{axis}
\end{tikzpicture}
\caption[Low-rank \Glsfmttext{vgg} \Glsfmttext{ilsvrc} results]{\textbf{\Glsfmttext{vgg} \glsfmttext{ilsvrc} Results.} Multiply-accumulate operations \vs top-5 error for VGG-derived models on the \gls{ilsvrc} object classification dataset; the most efficient networks are closest to the origin. Our models are significantly faster than the baseline network, in the case of `gmp-lr-2x' by almost 60\%, while slightly lowering error. Note that the `gmp-lr' and `gmp-lr-join' networks have the same accuracy, showing that an explicit linear combination layer may be unnecessary.}\label{fig:vggplots}
\end{figure}
\begin{table}[tbp]
\centering
\caption[Low-rank MIT Places results]{{\bf MIT Places Results.} Accuracy, multiply-accumulate operations, and number of parameters for the baseline `vgg-11-gmp' network, separable filter network as described by \citet{journals/corr/JaderbergVZ14}, and more efficient models created by the methods described in this chapter. All networks were trained at stride 2 for the MIT Places dataset.
}
%\resizebox{\textwidth}{!}{
\pgfplotstableread[col sep=comma]{lrdata/mitma.csv}\data
\pgfplotstabletypeset[
every head row/.style={
before row=\toprule,after row=\midrule},
every last row/.style={
after row=\bottomrule},
fixed zerofill, % Fill numbers with zeros
columns={Network, Stride, Multiply-Acc., Param., Top-1 Acc., Top-5 Acc.},
column type/.add={lp{5em}p{5em}rrrr}{},
columns/Multiply-Acc./.style={
column name=FLOPS {\small $\times 10^{8}$},
preproc/expr={{##1/1e8}}
},
columns/Param./.style={
column name=Param. {\small $\times 10^{7}$},
preproc/expr={{##1/1e7}}
},
columns/Network/.style={string type},
columns/Stride/.style={precision=0},
columns/Top-1 Acc./.style={precision=3},
columns/Top-5 Acc./.style={precision=3},
highlight col max ={\data}{Top-1 Acc.},
highlight col max ={\data}{Top-5 Acc.},
highlight col min ={\data}{Param.},
highlight col min ={\data}{Multiply-Acc.},
col sep=comma]{\data}
%}
\label{table:placesresults}
\end{table}
\input{lrdata/vggplacesplots}
\Cref{table:vggarch} shows the architectural details of the VGG-11-derived models used in \cref{vggresults}.
In what follows, we compare the accuracy of these architectures. Results for \gls{ilsvrc} are given in \cref{table:vggimagenetresults}, and plotted in \cref{fig:vggplots}. Results for MIT Places are given in \cref{table:placesresults}, and plotted in \cref{fig:placesresults}.
\paragraph{Baseline (Global Max Pooling)} Compared to the version of the network described by \citet{Simonyan2014verydeep}, we use a variant that replaces the final $2$$\times$$2$ max pooling layer before the first fully-connected layer with a global max pooling operation, similar to the global average pooling used by \citet{Lin2013NiN,Szegedy2014going}. We evaluated the accuracy of the baseline VGG-11 network with global max-pooling (\textbf{vgg-gmp}) and without (\textbf{vgg-11}) on the two datasets. We trained these networks at stride 1 on the \gls{ilsvrc} dataset and at stride 2 on the larger MIT Places dataset. This globally max-pooled variant of VGG-11 uses over 75\% fewer parameters than the original network and gives consistently better accuracy -- almost 3 percentage points lower top-5 error on \gls{ilsvrc} than the baseline VGG-11 network (see \cref{table:vggimagenetresults}). We used this network as the baseline for the rest of our experiments.
\paragraph{Separable Filters} To evaluate the separable filter approach described in \cref{seqsep} (illustrated in \cref{fig:separableseq}), we replaced each convolutional layer in VGG-11 with a sequence of two layers, the first containing horizontally-oriented $1$$\times$$3$ filters and the second containing vertically-oriented $3$$\times$$1$ filters (\textbf{vgg-gmp-sf}). These filters applied in sequence represent $3$$\times$$3$ kernels using a low-dimensional basis space. Unlike \citet{journals/corr/JaderbergVZ14}, we trained this network from scratch instead of approximating the full-rank filters in a previously trained network. Compared to the original VGG-11 network, the separable filter version requires approximately 14\% less computation. Results are shown in \cref{table:vggimagenetresults} for \gls{ilsvrc} and \cref{table:placesresults} for MIT Places. Accuracy for this network is approx.~0.8\% lower than that of the baseline vgg-11-gmp network for \gls{ilsvrc} and broadly comparable for MIT Places. This approach does not give as significant a reduction in computational complexity as the methods that follow, but it is nonetheless interesting that separable filters are capable of achieving quite high classification accuracy on such challenging tasks.
\paragraph{Simple Horizontal/Vertical Basis} To demonstrate the efficacy of the simple low rank filter representation illustrated in \cref{fig:separablemethods}c, we created a new network architecture (\textbf{vgg-gmp-lr-join}) by replacing each of the convolutional layers in VGG-11 (original filter dimensions were $3$$\times$$3$) with a sequence of two layers. The first layer comprises half $1$$\times$$3$ filters and half $3$$\times$$1$ filters whilst the second layer comprises the same number of $1$$\times$$1$ filters. The resulting network is approximately 49\% faster than the original and yet it gives broadly comparable accuracy (within 1 percentage point) for both the \gls{ilsvrc} and MIT Places datasets.
\paragraph{Full-Rank Mixture} An interesting question concerns the impact on accuracy of combining a small proportion of 3$\times$3 filters with the 1$\times$3 and 3$\times$1 filters used in `vgg-gmp-lr-join'. To answer this question, we trained a network, \textbf{vgg-gmp-lr-join-wfull}, with a mixture of 25\% $3$$\times$$3$ and 75\% $1$$\times$$3$ and $3$$\times$$1$ filters, while preserving the total number of filters of the baseline network (as illustrated in \cref{fig:ourmethodfullrank}). This network was significantly more accurate than both `vgg-gmp-lr-join' and the baseline, with a top-5 centre-crop accuracy of 89.7\% on \gls{ilsvrc} and a computational saving of approximately 16\% over our baseline. We note that this accuracy is approx.~1 percentage point higher than that of GoogLeNet.
\paragraph{Implicitly Learned Combinations} In addition, we trained a network similar to `vgg-gmp-lr-join' but without the $1$$\times$$1$ convolutional layer (as shown in \cref{fig:ourmethod}) used to sum the contributions of the $3$$\times$$1$ and $1$$\times$$3$ filters (\textbf{vgg-gmp-lr}). Interestingly, because of the elimination of the extra $1$$\times$$1$ layers, this gives an additional computational saving such that this model requires only $1/3$ of the computation of our baseline, with no reduction in accuracy. This seems to be a consequence of the fact that the subsequent convolutional layer is itself capable of learning effective combinations of filter responses, even after the intermediate \gls{relu} non-linearity.
We also trained such a network with double the number of convolutional filters (\textbf{vgg-gmp-lr-2x}), \ie with an equal number of $1$$\times$$3$ and $3$$\times$$1$ filters, or $2c$ filters as shown in \cref{fig:ourmethod}. We found this to increase accuracy further (88.9\% Top-5 on \gls{ilsvrc}) while still being approximately 58\% faster than our baseline network.
\paragraph{Low-Dimensional Embeddings}
% \input{lrdata/vggmodeltable}
\afterpage{\begin{landscape}
\renewcommand{\arraystretch}{1.1}
\setlength{\tabcolsep}{0.3em}
\begin{table*}[p]
\scriptsize
\centering
\caption[VGG model low-rank architectures]{{\bf VGG Model Architectures}. Here ``3$\times$3, 32'' denotes 32 3$\times$3 filters, ``/2'' denotes stride 2, ``fc'' denotes fully-connected, and ``$\|$'' denotes a concatenation within a composite layer.}\label{vggmodeltable}
%\resizebox{0.97\textwidth}{!}{
\begin{tabular}{@{}|c||c|c|c|c|c|c|c|c|@{}}
\hline
Layer & VGG-11 & \textbf{GMP} & \textbf{GMP-SF} & \textbf{GMP-LR} & \textbf{GMP-LR-2X} & \textbf{GMP-LR-JOIN} & \textbf{GMP-LR-LDE} & \textbf{GMP-LR-JOIN-WFULL}\\
\hline
\hline
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONV1
\textbf{conv1}& \multicolumn{2}{c|}{3$\times$3, 64} & 1$\times$3, 64 & 3$\times$1, 32 $\|$ 1$\times$3, 32 & 3$\times$1, 64 $\|$ 1$\times$3, 64 & \multicolumn{2}{c|}{3$\times$1, 32 $\|$ 1$\times$3, 32}& 3$\times$1, 24 $\|$ 1$\times$3, 24 $\|$ 3$\times$3, 16\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 64 & & & 1$\times$1, 64 & 1$\times$1, 32 & 1$\times$1, 64 \\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{8}{c|}{2$\times$2 maxpool, /2}\\
\hline
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONV2
\textbf{conv2} & \multicolumn{2}{c|}{3$\times$3, 128} & 1$\times$3, 128 & 3$\times$1, 64 $\|$ 1$\times$3, 64 & 3$\times$1, 128 $\|$ 1$\times$3, 128 & \multicolumn{2}{c|}{3$\times$1, 64 $\|$ 1$\times$3, 64} & 3$\times$1, 48 $\|$ 1$\times$3, 48 $\|$ 3$\times$3, 32\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 128 & & & 1$\times$1, 128 & 1$\times$1, 64 & 1$\times$1, 128 \\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{8}{c|}{2$\times$2 maxpool, /2}\\
\hline
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONV3
\textbf{conv3} & \multicolumn{2}{c|}{3$\times$3, 256} & 1$\times$3, 256 & 3$\times$1, 128 $\|$ 1$\times$3, 128 & 3$\times$1, 256 $\|$ 1$\times$3, 256 & \multicolumn{2}{c|}{3$\times$1, 128 $\|$ 1$\times$3, 128} & 3$\times$1, 96 $\|$ 1$\times$3, 96 $\|$ 3$\times$3, 64 \\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 256 & & & 1$\times$1, 256 & 1$\times$1, 128 & 1$\times$1, 256\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{2}{c|}{3$\times$3, 256} & 1$\times$3, 256 & 3$\times$1, 128 $\|$ 1$\times$3, 128 & 3$\times$1, 256 $\|$ 1$\times$3, 256 &\multicolumn{2}{c|}{3$\times$1, 128 $\|$ 1$\times$3, 128} & 3$\times$1, 96 $\|$ 1$\times$3, 96 $\|$ 3$\times$3, 64 \\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 256 & & & 1$\times$1, 256 & 1$\times$1, 128 & 1$\times$1, 256\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{8}{c|}{2$\times$2 maxpool, /2}\\
\hline
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONV4
\textbf{conv4} & \multicolumn{2}{c|}{3$\times$3, 512} & 1$\times$3, 512 & 3$\times$1, 256 $\|$ 1$\times$3, 256 & 3$\times$1, 512 $\|$ 1$\times$3, 512 &\multicolumn{2}{c|}{3$\times$1, 256 $\|$ 1$\times$3, 256} & 3$\times$1, 192 $\|$ 1$\times$3, 192 $\|$ 3$\times$3, 128\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 512 & & & 1$\times$1, 512 & 1$\times$1, 256 & 1$\times$1, 512\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{2}{c|}{3$\times$3, 512} & 1$\times$3, 512 & 3$\times$1, 256 $\|$ 1$\times$3, 256 & 3$\times$1, 512 $\|$ 1$\times$3, 512 & \multicolumn{2}{c|}{3$\times$1, 256 $\|$ 1$\times$3, 256} & 3$\times$1, 192 $\|$ 1$\times$3, 192 $\|$ 3$\times$3, 128\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 512 & & & 1$\times$1, 512 & 1$\times$1, 256 & 1$\times$1, 512\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{8}{c|}{2$\times$2 maxpool, /2}\\
\hline
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONV5
\textbf{conv5} & \multicolumn{2}{c|}{3$\times$3, 512} & 1$\times$3, 512 & 3$\times$1, 256 $\|$ 1$\times$3, 256 & 3$\times$1, 512 $\|$ 1$\times$3, 512 & \multicolumn{2}{c|}{3$\times$1, 256 $\|$ 1$\times$3, 256} & 3$\times$1, 192 $\|$ 1$\times$3, 192 $\|$ 3$\times$3, 128\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 512 & & & 1$\times$1, 512 & 1$\times$1, 256 & 1$\times$1, 512\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{2}{c|}{3$\times$3, 512} & 1$\times$3, 512 & 3$\times$1, 256 $\|$ 1$\times$3, 256 & 3$\times$1, 512 $\|$ 1$\times$3, 512 & \multicolumn{2}{c|}{3$\times$1, 256 $\|$ 1$\times$3, 256} & 3$\times$1, 192 $\|$ 1$\times$3, 192 $\|$ 3$\times$3, 128\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 512 & & & 1$\times$1, 512 & 1$\times$1, 256 & 1$\times$1, 512\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& 2$\times$2 maxpool, /2 & \multicolumn{7}{c|}{global maxpool}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FC
\hline
\textbf{fc6} & $7^2$ $\times$ 512 $\times$ 4096 & \multicolumn{7}{c|}{512 $\times$ 4096}\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\hline
\textbf{fc7} & \multicolumn{8}{c|}{4096 $\times$ 4096}\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\hline
\textbf{fc8} & \multicolumn{8}{c|}{4096 $\times$ 1000}\\
\cline{2-9}
& \multicolumn{8}{c|}{softmax}\\
\hline
\end{tabular}
%}
\label{table:vggarch}
\end{table*}
\end{landscape}}
We attempted to reduce the computational complexity of our `gmp-lr' network further in the \textbf{vgg-gmp-lr-lde} network by using a stride of 2 in the first convolutional layer, and adding low-dimensional embeddings, as in \citet{Lin2013NiN,Szegedy2014going}. We reduced the number of output channels by half after each convolutional layer using $1$$\times$$1$ convolutional layers, as detailed in \cref{vggmodeltable,table:vggarch}. While this reduces computation significantly, by approx.~86\% compared to our baseline, we saw a decrease in top-5 accuracy on \gls{ilsvrc} of 1.2 percentage points. We note, however, that this network remains 2.5 percentage points more accurate than the original VGG-11 network, while being 87\% faster.
\subsection{GoogLeNet for \Glsfmttext{ilsvrc} Object Classification}
GoogLeNet, introduced by \citet{Szegedy2014going}, is one of the most efficient networks for \gls{ilsvrc}, coming close to state-of-the-art results with a fraction of the computation and model size of even VGG-11. The GoogLeNet \gls{inception}\index{inception} module is a composite layer comprising square filters of several sizes ($1$$\times$$1$, $3$$\times$$3$, and $5$$\times$$5$) together with the output of a $3$$\times$$3$ pooling operation. All of these outputs are concatenated and used as input for successive layers (see \cref{backgroundinception}).
For the \textbf{googlenet-lr} network, within the \gls{inception}\index{inception} modules only, we replaced each layer of $3$$\times$$3$ filters with low-rank $3$$\times$$1$ and $1$$\times$$3$ filters, and each layer of $5$$\times$$5$ filters with a set of low-rank $5$$\times$$1$ and $1$$\times$$5$ filters. For the \textbf{googlenet-lr-conv1} network, we similarly replaced the first and second convolutional layers with $7$$\times$$1$~/~$1$$\times$$7$ and $3$$\times$$1$~/~$1$$\times$$3$ layers respectively.
Results are shown in \cref{table:googlenetimagenetresultsch4} and \cref{fig:googlenetimagenetresultsch4}. GoogLeNet uses intermediate losses and fully-connected layers only at training time; these are removed at test time, so the test-time model is significantly smaller than the training-time model. \Cref{table:googlenetimagenetresultsch4} accordingly reports test-time model size. The low-rank network delivers comparable classification accuracy using 26\% less compute; no other network produces comparable accuracy within an order of magnitude of compute. We note that although the Caffe pre-trained GoogLeNet model~\citep{Jia2014} has a top-5 accuracy of 0.889, our training of the same network using the given model definition, including the hyper-parameters and training schedule, but with a different random initialization, reached a top-5 accuracy of 0.883.
\begin{table}[tbp]
\centering
\caption[Low-rank \Glsfmttext{googlenet} \Glsfmttext{ilsvrc} results]{\textbf{\Glsfmttext{googlenet} \glsfmttext{ilsvrc} Results.} Accuracy, multiply-accumulate count, and number of parameters for the baseline GoogLeNet network and more efficient versions created by the methods described in this chapter.
}
% \resizebox{\textwidth}{!}{
\pgfplotstableread[col sep=comma]{lrdata/googlenetma.csv}\data
\pgfplotstabletypeset[
every head row/.style={
before row=\toprule,after row=\midrule},
every last row/.style={
after row=\bottomrule},
every first row/.style={
after row=\bottomrule},
fixed zerofill, % Fill numbers with zeros
columns={Network, Multiply-Acc., Test Param., Top-1 Acc., Top-5 Acc.},
columns/Multiply-Acc./.style={
column name=FLOPS {\small $\times 10^{9}$},
preproc/expr={{##1/1e9}}
},
columns/Test Param./.style={
column name=Test Param. {\small $\times 10^{6}$},
preproc/expr={{##1/1e6}}
},
column type/.add={lrrrrrr}{},
columns/Network/.style={string type},
columns/Top-1 Acc./.style={precision=3},
columns/Top-5 Acc./.style={precision=3},
highlight col max ={\data}{Top-1 Acc.},
highlight col max ={\data}{Top-5 Acc.},
highlight col min ={\data}{Test Param.},
highlight col min ={\data}{Multiply-Acc.},
col sep=comma]{\data}
% }
\label{table:googlenetimagenetresultsch4}
\end{table}
\begin{figure}[tbp]
\centering
\pgfplotstableread[col sep=comma]{lrdata/googlenetma.csv}\datatable
\pgfplotsset{major grid style={dotted,red}}
\begin{tikzpicture}
\begin{axis}[
width=0.95\textwidth,
height=0.33\textwidth,
axis x line=bottom,
ylabel=Top-5 Error,
xlabel=Multiply-Accumulate Operations,
axis lines=left,
enlarge x limits=0.10,
grid=major,
ytick={0.01,0.02,...,0.21},
ymin=0.10,ymax=0.15,
yticklabel={\pgfmathparse{\tick*100}\pgfmathprintnumber{\pgfmathresult}\%},style={
/pgf/number format/fixed,
/pgf/number format/precision=1
},
legend style={at={(0.98,0.98)}, anchor=north east, column sep=0.5em},
legend columns=2,
\setplotcyclecat{2},
every axis plot/.append style={fill},
]
\addplot+[mark=square*,nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex>0\def\pgfmathresult{}\fi
}
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} },]{\datatable};
\addplot+[mark=*,nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex<1\def\pgfmathresult{}\fi
}
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} },]{\datatable};
\legend{Baseline, Our Results}
\end{axis}
\end{tikzpicture}
\caption[Low-Rank \Glsfmttext{googlenet} \glsfmttext{ilsvrc} results]{\textbf{\Glsfmttext{googlenet} \glsfmttext{ilsvrc} Results.} Multiply-accumulate operations \vs top-5 error for \glsfmttext{googlenet}-derived models on \glsfmttext{ilsvrc} object classification dataset.}
\label{fig:googlenetimagenetresultsch4}
\end{figure}
\subsection{\Glsfmttext{nin} for \Glsfmttext{cifar10} Object Classification}
The \gls{cifar10} dataset consists of 60,000 $32\times 32$ images in 10 classes, with 6,000 images per class, split into standard sets of 50,000 training images and 10,000 test images~\citep{CIFAR10}. As a baseline for the \gls{cifar10} dataset, we used the \gls{nin} architecture~\citep{Lin2013NiN}, which has a published test-set error of 8.81\%; with random crops during training, our own training of this network reaches an error of 8.1\%. Like most state-of-the-art \gls{cifar10} results, this was with ZCA pre-processed training and test data~\citep{goodfellow2013maxout} and training-time mirror augmentation, in addition to the random sub-crops. The results of our \gls{cifar10} experiments are listed in \cref{table:cifarresultsch4} and plotted in \cref{fig:cifarresultsch4}.
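For reference, a minimal sketch of ZCA whitening is given below; the regularization constant \texttt{eps} is an assumption, and the exact pre-processing follows \citet{goodfellow2013maxout}.
\begin{verbatim}
import numpy as np

def zca_whiten(train, test, eps=1e-1):
    """train, test: arrays of shape (n_samples, n_features),
    e.g. flattened 32x32x3 images. Statistics come from train only."""
    mean = train.mean(axis=0)
    x = train - mean
    cov = x.T @ x / x.shape[0]
    u, s, _ = np.linalg.svd(cov)                    # cov is symmetric PSD
    w = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T   # ZCA transform
    return (train - mean) @ w, (test - mean) @ w
\end{verbatim}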
\begin{table}[tbp]
\centering
\caption[Low-rank \Glsfmttext{cifar10} results]{\textbf{\Glsfmttext{nin} \glsfmttext{cifar10} Results.} Accuracy, multiply-accumulate operations, and number of parameters for the baseline \gls{nin} model and more efficient versions created by the methods described in this chapter.}
%\resizebox{\textwidth}{!}{
\pgfplotstableread[col sep=comma]{lrdata/cifarma.csv}\data
\pgfplotstabletypeset[
every head row/.style={
before row=\toprule,after row=\midrule},
every last row/.style={
after row=\bottomrule},
every first row/.style={
after row=\bottomrule},
fixed zerofill, % Fill numbers with zeros
columns={Network, Multiply-Acc., Param., Accuracy},
columns/Multiply-Acc./.style={
column name=FLOPS {\small $\times 10^{8}$},
preproc/expr={{##1/1e8}}
},
columns/Param./.style={
column name=Param. {\small $\times 10^{5}$},
preproc/expr={{##1/1e5}}
},
column type/.add={lrrr}{},
columns/Network/.style={string type},
columns/Accuracy/.style={precision=4},
highlight col max ={\data}{Accuracy},
highlight col min ={\data}{Param.},
highlight col min ={\data}{Multiply-Acc.},
col sep=comma]{\data}
%}
\label{table:cifarresultsch4}
\end{table}
\begin{figure}[tbp]
\centering
\pgfplotstableread[col sep=comma]{lrdata/cifarma.csv}\datatable
\pgfplotsset{major grid style={dotted,red}}
\begin{tikzpicture}
\begin{axis}[
width=0.95\textwidth,
height=0.33\textwidth,
axis x line=bottom,
ylabel=Error,
xlabel=Multiply-Accumulate Operations,
axis lines=left,
enlarge x limits=0.10,
grid=major,
xticklabel style={
/pgf/number format/fixed,
/pgf/number format/fixed zerofill,
/pgf/number format/precision=1
},
ytick={0.01,0.015,0.02,...,0.21},
yticklabel={\pgfmathparse{\tick*100}\pgfmathprintnumber{\pgfmathresult}\%},style={
/pgf/number format/fixed,
/pgf/number format/fixed zerofill,
/pgf/number format/precision=1
},
ymin=0.07,ymax=0.1,
legend style={at={(0.98,0.98)}, anchor=north east, column sep=0.5em},
legend columns=2,
\setplotcyclecat{2},
every axis plot/.append style={fill},
]
\addplot+[mark=square*,
nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex>1\def\pgfmathresult{}\fi
}
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Accuracy} }]{\datatable};
\addplot+[mark=*,nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex<2\def\pgfmathresult{}\fi
}
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Accuracy} }]{\datatable};
\legend{Baseline Networks, Our Results}
\end{axis}
\end{tikzpicture}
\caption[Low-rank \Glsfmttext{cifar10} results]{\textbf{\Glsfmttext{nin} \glsfmttext{cifar10} Results.} Multiply-accumulate operations \vs error for \gls{nin} derived models on \gls{cifar10} object classification dataset.}
\label{fig:cifarresultsch4}
\end{figure}
This architecture uses $5$$\times$$5$ filters in some layers. We found that we could replace all of these with $3$$\times$$3$ filters with comparable accuracy. As suggested by \citet{Simonyan2014verydeep}, stacked $3$$\times$$3$ filters have the effective receptive field of larger filters with less computational complexity. In this \textbf{nin-c3} network, we replaced the first convolutional layer with one $3$$\times$$3$ layer, and the second convolutional layer with two $3$$\times$$3$ layers. This network is 26\% faster than the standard \gls{nin} model, with only 54\% of the model parameters. Using our low-rank filters in this network, we trained the \textbf{nin-c3-lr} network, which has similar accuracy (91.8\% \vs 91.9\%) but requires only approximately 54\% of the original network's computation and 45\% of its parameters.
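The following sketch contrasts a single $5$$\times$$5$ convolution with two stacked $3$$\times$$3$ convolutions having the same effective receptive field; channel counts and the intermediate non-linearity are illustrative.
\begin{verbatim}
import torch.nn as nn

def five_by_five(c):
    # 25*c*c multiply-accumulates per output pixel
    return nn.Conv2d(c, c, kernel_size=5, padding=2)

def stacked_three_by_three(c):
    # same 5x5 effective receptive field,
    # 2*9*c*c = 18*c*c multiply-accumulates per output pixel
    return nn.Sequential(
        nn.Conv2d(c, c, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(c, c, kernel_size=3, padding=1))
\end{verbatim}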
\subsection{Comparing with \Glsfmttext{ilsvrc} State-of-the-Art Networks}
\Cref{fig:bigpicturema,fig:bigpictureparam} compare published top-5 \gls{ilsvrc} validation error \vs multiply-accumulate operations and number of model parameters (respectively) for several state-of-the-art networks~\citep{Simonyan2014verydeep,Szegedy2014going,He2015b}. The error rates for these networks are reported only as obtained with various combinations of computationally expensive training and test-time augmentation methods, including scale and photometric augmentation, multi-model ensembles, and multi-view/dense oversampling. This can make it difficult to compare model architectures, especially with respect to their computational requirements.
State-of-the-art networks such as MSRA-C\footnote{At the time of these experiments.}, VGG-19, and oversampled GoogLeNet are orders of magnitude more computationally complex than our networks. From \cref{fig:bigpicturema}, where multiply-accumulate operations are plotted on a log scale, increasing the model size and/or the computational cost of test-time augmentation of \glspl{cnn}\index{CNN} appears to yield diminishing returns in validation error. Our models \emph{without} training or test-time augmentation show comparable accuracy to networks such as VGG-13 \emph{with} training and test-time augmentation, while having far less computational complexity and a smaller model size. In particular, the `googlenet-lr' model has a much smaller test-time model size than any network of comparable accuracy.
\afterpage{
\begin{landscape}
\begin{figure}[p]
\centering
\pgfplotstableread[col sep=comma]{lrdata/bigpicture.csv}\datatable
\pgfplotstableread[col sep=comma]{lrdata/bigpicture_ours.csv}\datatableours
\pgfplotstableread[col sep=comma]{lrdata/bigpicture_aug.csv}\datatableaug
\pgfplotsset{major grid style={dotted,red}}
\pgfplotsset{minor grid style={dotted,red}}
\begin{tikzpicture}
\begin{axis}[
width=1.37\textwidth,
height=0.95\textheight,
axis x line=bottom,
ylabel=Top-5 Error,
xlabel=$\log_{10}$(Multiply-Accumulate Operations),
axis lines=left,
enlarge x limits=0.10,
enlarge y limits=0.05,
grid=both,
ytick={0.01,0.02,...,0.2},
xmode=log,
yticklabel={\pgfmathparse{\tick*100}\pgfmathprintnumber{\pgfmathresult}\%},style={
/pgf/number format/fixed,
/pgf/number format/precision=1
},
\setplotcyclecat{3},
every axis plot/.append style={fill},
legend style={at={(0.01,0.01)},anchor=south west},
]
\addplot+[mark=*,
nodes near coords,only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatableours};
\addplot+[mark=*,
nodes near coords,only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatable};
\addplot+[mark=square*,
nodes near coords,only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Test Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatableaug};
\legend{Our Results, Crop \& Mirror Aug., Extra Augmentation}
\end{axis}
\end{tikzpicture}
\caption[Computational complexity of state-of-the-art \glsfmttext{ilsvrc} models]{\textbf{Computational complexity of state-of-the-art \Glsfmttext{ilsvrc} models.} Test-time multiply-accumulate operations \vs top-5 error on state-of-the-art networks using a \emph{single} model. Note the difference in accuracy and computational complexity for the VGG-11 model with and without extra augmentation. Our `vgg-gmp-lr-join-wfull' model \emph{without} extra augmentation is more accurate than VGG-11 \emph{with} extra augmentation, and is much less computationally complex.}
\label{fig:bigpicturema}
\end{figure}
\end{landscape}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\afterpage{
\begin{landscape}
\begin{figure}[p]
\centering
\pgfplotstableread[col sep=comma]{lrdata/bigpicture.csv}\datatable
\pgfplotstableread[col sep=comma]{lrdata/bigpicture_ours.csv}\datatableours
\pgfplotstableread[col sep=comma]{lrdata/bigpicture_aug.csv}\datatableaug
\pgfplotsset{major grid style={dotted,red}}
\pgfplotsset{minor grid style={dotted,red}}
\begin{tikzpicture}
\begin{axis}[
width=1.37\textwidth,
height=0.95\textheight,
axis x line=bottom,
ylabel=Top-5 Error,
xlabel=$\log_{10}$(Number of Parameters),
axis lines=left,
enlarge y limits=0.05,
grid=both,
ytick={0.01,0.02,...,0.2},
xmode=log,
xmin=10e5,xmax=10e8,
yticklabel={\pgfmathparse{\tick*100}\pgfmathprintnumber{\pgfmathresult}\%},style={
/pgf/number format/fixed,
/pgf/number format/precision=1
},
\setplotcyclecat{3},
every axis plot/.append style={fill},
legend style={at={(0.01,0.01)},anchor=south west},
]
\addplot+[mark=*,
nodes near coords,
only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Param.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatableours};
\addplot+[mark=*,
nodes near coords,
only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Param.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatable};
\addplot+[mark=square*,
nodes near coords,
only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Param.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatableaug};
\legend{Our Results, Crop \& Mirror Aug., Extra Augmentation}
\end{axis}
\end{tikzpicture}
\caption[The number of parameters of state-of-the-art \glsfmttext{ilsvrc} models]{\textbf{Number of Parameters of State-of-the-Art \Glsfmttext{ilsvrc} Models.} Test-time parameters \vs top-5 error for state-of-the-art models. The main factor in reduced model size is the use of global pooling or lack of fully-connected layers. Note that our `googlenet-lr' model is almost an order of magnitude smaller than any other network of comparable accuracy.}
\label{fig:bigpictureparam}
\end{figure}
\end{landscape}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{table}[tbp]
\centering
\caption[State-of-the-art single models with extra augmentation]{\textbf{State-of-the-Art Single Models with Extra Augmentation.} Top-5 \glsfmttext{ilsvrc} validation accuracy, single view and augmented test-time \glsfmttext{flops} (multiply-accumulate) count, and number of parameters for various state-of-the-art models \emph{with} various training and test-time augmentation methods. A multi-model ensemble of MSRA-C is the current state-of-the-art network.}
%\resizebox{\columnwidth}{!}{
\pgfplotstableread[col sep=comma]{lrdata/bigpicture_aug.csv}\data
\pgfplotstabletypeset[
every head row/.style={
before row=\toprule,after row=\midrule},
every last row/.style={
after row=\bottomrule},
empty cells with={--},
%dec sep align, % Align at decimal point
fixed zerofill, % Fill numbers with zeros
columns={Real Name, Multiply-Acc., Test Multiply-Acc., Param., Top-5 Acc.},
column type/.add={lrrrrrr}{},
columns/Multiply-Acc./.style={
column name=FLOPS {\small $\times 10^{9}$},
preproc/expr={{##1/1e9}}
},
columns/Test Multiply-Acc./.style={
column name=FLOPS w/ Aug. {\small $\times 10^{9}$},
preproc/expr={{##1/1e9}}
},
columns/Param./.style={
column name=Param. {\small $\times 10^{7}$},
preproc/expr={{##1/1e7}}
},
columns/Real Name/.style={string type},
columns/Stride/.style={precision=0},
columns/Top-1 Acc./.style={precision=3},
columns/Top-5 Acc./.style={precision=3},
highlight col max ={\data}{Top-5 Acc.},
highlight col min ={\data}{Param.},
highlight col min ={\data}{Multiply-Acc.},
highlight col min ={\data}{Test Multiply-Acc.},
col sep=comma]{\data}
\label{table:bigpicturetable}
\end{table}
\section{Discussion}
%We found our network architecture, which learns a small set of $3$$\times$$3$ basis filters along with many $1$$\times$$3$~/~$3$$\times$$1$ basis filters, gave the most impressive results. Such a model, `vgg-lr-wfull', increased the top-5 center crop validation accuracy on \gls{ilsvrc} by 1 percentage points in accuracy (89.7\% \vs 88.7\%) while reducing computation by 16\%, over our baseline network with global max-pooling. Although we did not try such a configuration of GoogLeNet, our `googlenet-lr' network using only $1$$\times$$3$~/~$3$$\times$$1$ and $1$$\times$$5$~/~$5$$\times$$1$ basis filters within the \gls{inception}\index{inception} modules obtained the smallest model size while maintaining comparable accuracy, using 26\% less compute than GoogLeNet and 41\% less model parameters.`
This chapter has presented a method to train \glspl{cnn} from scratch using low-rank filters. This is made possible by a new way of initializing the network's weights that accounts for the presence of differently-shaped filters in \glspl{compositelayer}\index{composite layer}.
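One way to realize such a shape-aware initialization is sketched below: each filter bank in a composite layer is given a He-style standard deviation computed from its own fan-in, so that the differently-shaped branches produce activations of similar variance. This is an illustrative sketch only, and does not necessarily reproduce the derivation used for our networks, which is given earlier in this chapter.
\begin{verbatim}
import math
import torch.nn as nn

def init_composite(branches):
    """branches: iterable of nn.Conv2d layers whose outputs are
    concatenated within one composite layer."""
    for conv in branches:
        k_h, k_w = conv.kernel_size
        fan_in = k_h * k_w * conv.in_channels   # differs for 3x3, 3x1, 1x3
        std = math.sqrt(2.0 / fan_in)           # He-style scaling per shape
        nn.init.normal_(conv.weight, mean=0.0, std=std)
        if conv.bias is not None:
            nn.init.zeros_(conv.bias)
\end{verbatim}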
Validation on image classification across three popular datasets confirms similar or higher accuracy than state-of-the-art models, with much greater computational efficiency.
It is somewhat surprising that networks based on learning filters with less representational ability are able to perform as well as, or better than, \glspl{cnn}\index{CNN} with full $k$$\times$$k$ filters on the task of image classification. However, much interesting small-scale image structure is well-characterized by low-rank filters, \eg edges and gradients. Our experiments training a separable (rank-1) model (`vgg-gmp-sf') on \gls{ilsvrc} and MIT Places show surprisingly high accuracy on what are considered challenging problems --- approx.\ 88\% top-5 accuracy on \gls{ilsvrc} --- but not high enough to match the accuracy of the models on which it is based.
Given that most discriminative filters learned for image classification appear to be low-rank, we instead structure our architectures with a set of basis filters in the way illustrated in \cref{fig:ourmethodfullrank}. This allows our networks to learn the most effective combinations of complex (\eg $k$$\times$$k$) and simple (\eg $1$$\times$$k$, $k$$\times$$1$) filters. Furthermore, by restricting how many complex spatial filters may be learned, this architecture helps to prevent overfitting and improves generalization. Even in our models that do not use square $k$$\times$$k$ filters, we obtain accuracies comparable to the baseline model, since the rank-2 cross-shaped filters effectively learned as a combination of $3$$\times$$1$ and $1$$\times$$3$ filters can represent more complex local pixel relations than rank-1 filters.
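A minimal sketch of such a composite basis layer follows, mirroring the `$3$$\times$$1$ $\|$ $1$$\times$$3$ $\|$ $3$$\times$$3$' pattern followed by a $1$$\times$$1$ combination used in our architecture tables; the channel splits shown are placeholders rather than those of any particular model.
\begin{verbatim}
import torch
import torch.nn as nn

class CompositeBasisLayer(nn.Module):
    """A few full-rank 3x3 basis filters alongside many 3x1 and 1x3
    basis filters, concatenated and mixed by a 1x1 projection
    (placeholder channel counts)."""
    def __init__(self, in_ch, out_ch, n_square=32, n_low_rank=96):
        super().__init__()
        self.square = nn.Conv2d(in_ch, n_square, kernel_size=3, padding=1)
        self.vertical = nn.Conv2d(in_ch, n_low_rank,
                                  kernel_size=(3, 1), padding=(1, 0))
        self.horizontal = nn.Conv2d(in_ch, n_low_rank,
                                    kernel_size=(1, 3), padding=(0, 1))
        # the 1x1 layer learns linear combinations of the basis responses
        self.combine = nn.Conv2d(n_square + 2 * n_low_rank, out_ch,
                                 kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        basis = torch.cat([self.square(x), self.vertical(x),
                           self.horizontal(x)], dim=1)
        return self.relu(self.combine(basis))
\end{verbatim}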
Recent advances in state-of-the-art accuracy with \glspl{cnn}\index{CNN} for image classification have come at the cost of increasingly large and computationally complex models. We believe our results show that learning computationally efficient models with fewer, more relevant parameters can prevent overfitting, improve generalization, and thus also increase accuracy.
\end{document}