% !TEX root = thesis.tex
\documentclass[thesis]{subfiles}
\begin{document}
\chapter{Spatial Connectivity}\label{lowrankfilters}
%\chapter{Learning a Basis for the Spatial Extents of Filters}
\begin{chapquote}{Yann LeCun, \textit{Backprop.\ Applied to Handwritten Zip Code Recognition}, 1989}
``Classical work in visual pattern recognition has demonstrated the advantage of extracting local features and combining them to form higher-order features. Such knowledge can be easily built into the network by forcing the hidden units to combine only local sources of information. Distinctive features of an object can appear at various location on the input image. Therefore it seems judicious to have a set of feature detectors that can detect a particular instance of a feature anywhere on the input plane.''
%Since the precise location of a feature is not relevant to the classification, we can afford to lose some position information in the process. Nevertheless, approximate position information must be preserved, to allow the next levels to detect higher order, more complex features.''
\end{chapquote}
\glspl{cnn}\index{CNN} (see \cref{cnns}) are a highly specialized form of neural network\index{neural network} for learning image representations. Their use of \emph{convolutional filters} allows \glspl{cnn}\index{CNN} to learn much more efficient representations, in terms of both memory and computation, than a fully connected network. Such filters usually have limited spatial extents (\ie width and height, as opposed to channels) and their learned weights are shared across the image's spatial domain to provide translation invariance~\citep{Fuk80,Lecun1998}.
Thus, as illustrated in \cref{fig:sparseconn}, in comparison with fully-connected network layers (\cref{fig:sparseconn}(a)), convolutional layers have a much sparser connection structure and use fewer parameters (\cref{fig:sparseconn}(b)).
This leads to faster training and inference, better generalization, and higher accuracy.
This chapter focuses on reducing the computational complexity of the convolutional layers of \glspl{cnn}\index{CNN} by further sparsifying their spatial connection structures. Specifically, we show that by representing convolutional filters using a basis space comprising groups of filters of different spatial dimensions (examples shown in \cref{fig:sparseconn}(c, d)), we can significantly reduce the computational complexity of existing state-of-the-art \glspl{cnn}\index{CNN} without compromising classification accuracy.
%%% Figure
\afterpage{
\begin{landscape}
\begin{figure}[p]
\centering
\includegraphics[height=0.8\textheight]{sparseconn3}
\caption[Image access map visualizing sparsity of convolutional filters]{
{\bf Network connection structure for convolutional layers.} For a single-layer neural network, the sparsity of a convolutional filter as compared to a fully connected network is illustrated. Connection weight maps (centre) show the pairwise dependencies between input and output pixels. In a fully-connected network (a), each output node is connected to all input pixels. For a \gls{cnn} (b,c,d), the output pixels depend only on a sparse subset of input pixels, where shared weights are represented by repeated unique colours, and white pixels represent pixels with no connection. Note that sparsity increases from (a) to (d), opening up the potential for more efficient implementations.
}
\label{fig:sparseconn}
\end{figure}
\end{landscape}
}
%%%
Our contributions include a novel method of learning a set of small basis filters that are combined to represent larger filters efficiently. Rather than approximating previously trained networks, we train networks \emph{from scratch} and show that our convolutional layer representation can improve both efficiency and classification accuracy. Unlike methods that approximate previously-trained models (as listed in \cref{approxmethods,factorized}), this allows us to reduce training time, and even increase accuracy over the original model. We further describe how to initialize connection weights effectively for training networks with composite convolutional layers containing groups of differently-shaped filters, which we found to be of critical importance to our training method\footnote{Note that much of this work was done before the widespread use of batch normalization; however, initialization still plays an important role.}.
\section{Related Work}
\label{relatedwork}
There has been much previous work on increasing the test-time efficiency of \glspl{cnn}\index{CNN}. Some promising approaches make use of more hardware-efficient representations. For example, \citet{1502.02551v1} and \citet{vanhoucke2011improving} achieve training- and test-time compute savings by further quantization of network weights that were originally represented as 32-bit floating point numbers. However, more relevant to our work are approaches that depend on new network connection structures, efficient approximations of previously trained networks, and learning low-rank filters.
\paragraph{Efficient Network Connection Structures}
The trained weights of \glspl{cnn}\index{CNN} have been shown to contain significant redundancy~\citep{Denil2013predicting}. \citet{lecun1989optimal} suggest a method of pruning unimportant connections within networks. However, this requires repeated network re-training and may be infeasible for modern, state-of-the-art \glspl{cnn}\index{CNN} requiring weeks of training time. \citet{Lin2013NiN} show that the geometric increase in the number and dimensions of filters with deeper networks can be managed using low-dimensional embeddings. The same authors show that global average-pooling may be used to decrease model size in networks with fully-connected layers. \citet{Simonyan2014verydeep} show that stacked filters with small spatial dimensions (\eg $3$$\times$$3$) can operate on the effective receptive field of larger filters (\eg $5$$\times$$5$) with less computational complexity.
\paragraph{Low-Rank Filter Approximations}
\label{approxmethods}
\citet{conf/cvpr/RigamontiSLF13} approximate {\em previously trained} \glspl{cnn}\index{CNN} with low-rank filters for the semantic segmentation of curvilinear structures within volumetric medical imagery. They discuss two approaches: enforcing an $\ell_1$-based regularization to learn approximately low rank filters, which are later truncated to enforce a strict rank, and approximating a set of pre-learned filters with a tensor decomposition into many rank-1 filters. Neither approach learns low rank filters directly, and indeed the second approach proved the more successful.
The work of \citet{journals/corr/JaderbergVZ14} also approximates the existing filters of previously trained networks. They find separable 1D filters through an optimization that minimizes the reconstruction error of the already-learned full-rank filters. They achieve a 4.5$\times$ speed-up with a loss of accuracy of 1\% in a text recognition problem. However, since the method is demonstrated only on text recognition, it is not clear how well it would scale to larger datasets or more challenging problems. A key insight of the paper is that filters can be represented by low-rank approximations not only in the spatial domain but also in the channel domain.
Both of these methods show that, at least for their respective applications, low rank approximations of full-rank filters learned in convolutional networks can increase test-time efficiency significantly. However, being approximations of pre-trained networks, they are unlikely to improve test accuracy, and can only increase the computational requirements during training.
\paragraph{Learning Separable (Factorized) Filters}
\label{factorized}
\citet{mamalet2012simplifying} propose training networks with separable filters on the task of digit recognition with the \gls{mnist} dataset. They train networks with \emph{sequential} convolutional layers of horizontal and vertical 1D filters, achieving a speed-up factor of 1.6$\times$, but with a relative increase in test error of 13\% (1.45\% \vs 1.28\%). Our approach differs in that it allows both horizontal and vertical 1D filters (and other shapes too) within the same layer, avoiding issues with ordering. We also demonstrate a decrease in error, and validate on more challenging datasets.
\section{Using Low-Rank Filters in CNNs}
%%% Figure
\begin{figure}[tbp]
\begin{subfigure}[b]{0.98\textwidth}
\centering
\includegraphics[height=0.165\textheight, page=1]{sparsification}
\caption{A standard convolutional layer.}\label{fig:fullrank}
\end{subfigure}\\
\begin{subfigure}[b]{0.98\textwidth}
\centering
\includegraphics[height=0.16\textheight, page=2]{sparsification}
\caption{Sequential separable filters~\citep{journals/corr/JaderbergVZ14}.}\label{fig:separableseq}
\end{subfigure}\\
\begin{subfigure}[b]{0.98\textwidth}
\centering
\includegraphics[height=0.19\textheight, page=3]{sparsification}
\caption{Our method, a learned basis space of filters that are rectangular in the spatial domain and oriented horizontally and vertically.}\label{fig:ourmethod}
\end{subfigure}\\
\begin{subfigure}[b]{0.98\textwidth}
\centering
\includegraphics[height=0.23\textheight, page=4]{sparsification}
\caption{Our method, a learned basis space of vertical/horizontal rectangular filters and square filters. Filters of other shapes are also possible.}\label{fig:ourmethodfullrank}
\end{subfigure}
\caption[Overview of methods of using low-rank filters]{\textbf{Methods of using low-rank filters in \glsfmtplural{cnn}\index{CNN}}. Methods from literature and our proposed methods for learning low rank filters. The activation function is not shown, coming after the last layer in each configuration.}\label{fig:separablemethods}
\end{figure}
%%%
\subsection{Convolutional Filters}
The convolutional layers of a \gls{cnn} produce output `images' (usually called \emph{feature maps}\index{feature map}) by convolving input images with one or more learned filters. %The output images of convolutional layers are to distinguish them from raw input images.
In a typical convolutional layer, as illustrated in \cref{fig:fullrank}, a $c$-channel input image of size $H$$\times$$W$ pixels is convolved with $d$ filters of size $h$$\times$$w$$\times$$c$ to create a $d$-channel output image. Each filter is represented by $h w c$ independent weights. Therefore the computational complexity for the convolution of the filter with a $c$-channel input image is $\gls{bigoh}(d w h c)$ (per pixel in the output \gls{featuremap}\index{feature map}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Here we will use our existing mathematical description of convolution from \cref{sec:convolutioninpractice}, where the incoming \gls{featuremap} is denoted by $\gls{fmX}$, the outgoing \gls{featuremap} by $\gls{fmY}$, and the convolutional filter, or kernel, by $\gls{fmK}$. The scalar elements of each \gls{featuremap} are $\gls{fmX}_{i,j,k}$, $\gls{fmY}_{i,j,k}$, where $i \in \{0,\ldots,\gls{c}\}$ is the \gls{featuremap} channel (\ie colour for an input image), and $j \in \{0,\ldots,\gls{filterh}\}$, $k \in \{0,\ldots,\gls{filterw}\}$ are the spatial coordinates, rows and columns respectively, of the channel $i$ image. The filter's scalar elements are $\gls{fmK}_{i,j,k,l}$, where $i$ is the filter's index in the convolutional layer's filter bank and the output channel in $\gls{fmY}$ to which the filter's result is written, $j$ is the input channel in $\gls{fmX}$ over which the filter's spatial elements are convolved, and $(k, l)$ are the row and column offsets between the output and input images.
A convolutional layer, as illustrated in \cref{fig:fullrank}, convolves each filter across the input \gls{featuremap} such that,
\begin{equation}
\begin{aligned}
\gls{fmY}_{i,j,k} &= \sum_{l,m,n} \gls{fmX}_{l,j+m,k+n}\; \gls{fmK}_{i,l,m,n},
\end{aligned}
\end{equation}
for all valid indices $l,m,n$, depending on the \gls{padding}\index{padding} of the input image. We will express this more compactly using the convolution operator $\gls{convolution}$, considering only a single output pixel (\ie fixed $j,\,k$) for simplicity,
\begin{equation}
\begin{aligned}
\gls{fmY}_i &= \sum_{l} \gls{fmX}_l \gls{convolution} \gls{fmK}_{il}.
\end{aligned}
\end{equation}
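For concreteness, the convolution above can be evaluated directly in NumPy. The following sketch is purely illustrative (it is not the implementation used for the experiments in this chapter) and assumes `valid' \gls{padding}\index{padding}; the arrays \texttt{X}, \texttt{K} and \texttt{Y} mirror $\gls{fmX}$, $\gls{fmK}$ and $\gls{fmY}$, and the innermost operation makes the per-pixel cost of $\gls{bigoh}(d w h c)$ explicit.
\begin{verbatim}
import numpy as np

def conv_layer(X, K):
    """Direct evaluation of Y[i,j,k] = sum_{l,m,n} X[l,j+m,k+n] * K[i,l,m,n].

    X: input feature map of shape (c, H, W)
    K: filter bank of shape (d, c, h, w)
    Returns Y of shape (d, H-h+1, W-w+1), i.e. 'valid' padding.
    Each output pixel of each of the d channels costs h*w*c
    multiply-accumulates, giving O(d w h c) per output pixel.
    """
    c, H, W = X.shape
    d, c_k, h, w = K.shape
    assert c == c_k
    Y = np.zeros((d, H - h + 1, W - w + 1))
    for i in range(d):                     # output channel
        for j in range(H - h + 1):         # output row
            for k in range(W - w + 1):     # output column
                Y[i, j, k] = np.sum(X[:, j:j + h, k:k + w] * K[i])
    return Y
\end{verbatim}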
%%%%%%%%%%%%%%%%%%%%%%%%%55
In what follows, we describe schemes for modifying the architecture of the convolutional layers so as to reduce computational complexity. The idea is to replace expensive, full-rank spatial convolutional filters with modified versions that represent the same number of effective filters by linear combinations of smaller basis filters.
\subsection{Sequential Separable Filters}\label{seqsep}
An existing scheme for reducing the computational complexity of convolutional layers~\citep{journals/corr/JaderbergVZ14} is to replace each one with a sequence of two regular convolutional layers but with filters that are rectangular in the spatial domain, as shown in \cref{fig:separableseq}.
The first convolutional layer has $m$ horizontal filters $\gls{fmK}_{i,\,l=0,\ldots,m}$ of size $w$$\times$$1$$\times$$c$, producing an output \gls{featuremap}\index{feature map} with $m$ channels. The second convolutional layer has $d$ vertical filters $\gls{fmK}_{i,\,l=0,\ldots,d}$ of size $1$$\times$$h$$\times$$m$, producing an output \gls{featuremap}\index{feature map} with $d$ channels.
Mathematically, the separable convolution illustrated in \cref{fig:separableseq} can be expressed as,
\begin{equation}
\begin{aligned}
\gls{fmY}_i &= \sum_{l} \gls{fmY}^\textrm{h}_l \gls{convolution} \gls{fmK}^\textrm{v}_{il}\\
&= \sum_{l} \left(\sum_{j}\gls{fmX}_j \gls{convolution} \gls{fmK}^\textrm{h}_{lj}\right) \gls{convolution} \gls{fmK}^\textrm{v}_{il},
\end{aligned}
\end{equation}
where $\gls{fmY}^\textrm{h}_l$ is the $l^\text{th}$ channel of the intermediate \gls{featuremap}\index{feature map} produced by convolving the input \gls{featuremap} with the horizontal filters $\gls{fmK}^\textrm{h}$, the inner sum over $j$ runs over the $\gls{c}$ input channels, and $\gls{fmK}^\textrm{v}$ are the vertical filters.
By these means the full rank original convolutional filter bank is represented by a low rank approximation formed from a linear combination of a set of separable $w$$\times$$h$ basis filters. The computational complexity of this scheme is $\gls{bigoh}(m c w)$ for the first layer of horizontal filters and $\gls{bigoh}(d m h)$ for the second layer of vertical filters, with a total of $\gls{bigoh}(m(c w + d h))$.
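As a concrete check of these complexity expressions, the short Python sketch below (illustrative only; the layer sizes are placeholders rather than values taken from our experiments) counts per-pixel multiply-accumulate operations for a standard full-rank layer and for the sequential separable scheme.
\begin{verbatim}
def full_rank_macs(c, d, w, h):
    # d filters of size w x h x c: O(d w h c) per output pixel
    return d * w * h * c

def separable_macs(c, d, m, w, h):
    # m horizontal w x 1 x c filters, then d vertical 1 x h x m filters:
    # O(m (c w + d h)) per output pixel
    return m * c * w + d * m * h

# placeholder example: a 3x3 layer with 256 input/output channels,
# replaced by m = 256 separable basis filters
print(full_rank_macs(c=256, d=256, w=3, h=3))         # 589824
print(separable_macs(c=256, d=256, m=256, w=3, h=3))  # 393216
\end{verbatim}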
Note that \citet{journals/corr/JaderbergVZ14} use this scheme to approximate existing full rank filters belonging to previously trained networks using a retrospective fitting step. In this work, by contrast, we {\em train} networks containing convolutional layers with this architecture from scratch. In effect, we learn the separable basis filters and their combination weights simultaneously during network training.
\subsection{Filters as Linear Combinations of a Basis}
We introduce a novel method for reducing convolutional layer complexity by training with low-rank filters. This works by representing convolutional filters as linear combinations of basis filters as illustrated in \cref{fig:ourmethod}. This scheme uses \emph{\glspl{compositelayer}\index{composite layer|textbf}} comprising several sets of filters where the filters in each set have different spatial dimensions (see \cref{fig:compositelayers}). The outputs of these basis filters may be combined in a subsequent layer containing filters with spatial dimensions $1$$\times$$1$.
This configuration is illustrated in \cref{fig:ourmethod}, and can be expressed as,
\begin{equation}
\begin{aligned}
\gls{fmY}_i &= \sum_{l} \gls{fmY}^\textrm{basis}_l \gls{convolution} \gls{fmK}^\textrm{weights}_{il}\\
&= \sum_{l} f^\textrm{weights}_{il} \gls{fmY}^\textrm{basis}_l \quad\textrm{(since $\gls{fmK}^\textrm{weights}_{il}$ is a scalar)}\\
&= \sum_{l=0}^{m/2-1} f^\textrm{weights}_{il} \gls{fmX}_l \gls{convolution} \gls{fmK}^\textrm{h}_{il} + \sum_{l=m/2}^{m-1} f^\textrm{weights}_{il} \gls{fmX}_l \gls{convolution} \gls{fmK}^\textrm{v}_{il},
\end{aligned}
\end{equation}
where $\gls{fmK}^\textrm{h}$ and $\gls{fmK}^\textrm{v}$ are the horizontal and vertical filters respectively, and $f^\textrm{weights}_{il}$ denotes the single scalar weight of the $1$$\times$$1$ filter $\gls{fmK}^\textrm{weights}_{il}$.
Here, our \gls{compositelayer}\index{composite layer} contains horizontal $w$$\times$$1$ and vertical $1$$\times$$h$ filters, the outputs of which are concatenated in the channel dimension, resulting in an intermediate $m$-channel \gls{featuremap}\index{feature map}. These filter responses are then linearly combined by the next layer of $d$ $1$$\times$$1$ filters to give a $d$-channel output \gls{featuremap}\index{feature map}. In this case, the basis filters are applied to the input \gls{featuremap}\index{feature map} with $c$ channels, and are followed by a set of $d$ $1$$\times$$1$ filters operating over the $m$ output channels of the basis filters. If the number of horizontal and vertical filters is the same, the computational complexity is $\gls{bigoh}( m(wc/2 +hc/2 + d))$.
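Purely for illustration, a \gls{compositelayer}\index{composite layer} of this kind could be sketched as follows in a modern framework such as PyTorch (the networks in this chapter were not implemented this way, and the layer sizes below are placeholders): half of the $m$ basis filters are horizontal, half are vertical, their outputs are concatenated along the channel dimension, and $d$ $1$$\times$$1$ filters learn the linear combination.
\begin{verbatim}
import torch
import torch.nn as nn

class CompositeLowRankLayer(nn.Module):
    """Sketch of a composite layer: horizontal and vertical basis filters,
    concatenated and then linearly combined by 1x1 filters."""

    def __init__(self, c_in, m, d_out):
        super().__init__()
        # PyTorch kernel_size is (height, width): (1, 3) spans 3 pixels
        # horizontally, (3, 1) spans 3 pixels vertically.
        self.horizontal = nn.Conv2d(c_in, m // 2, kernel_size=(1, 3),
                                    padding=(0, 1))
        self.vertical = nn.Conv2d(c_in, m // 2, kernel_size=(3, 1),
                                  padding=(1, 0))
        # d_out 1x1 filters learn the combination of the m basis responses
        self.combine = nn.Conv2d(m, d_out, kernel_size=1)

    def forward(self, x):
        basis = torch.cat([self.horizontal(x), self.vertical(x)], dim=1)
        return self.combine(basis)  # the non-linearity follows this layer
\end{verbatim}
Omitting the final $1$$\times$$1$ layer corresponds to letting the following layer learn the combination implicitly, a variant explored in \cref{spatialbasisresults}.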
The effective filters learned in our models are low-rank in the sense that, although most of the learned basis filters are much smaller than the filters of the original networks (\eg $1$$\times$$h$ and $w$$\times$$1$), when even a few full $w$$\times$$h$ basis filters are also used the effective filter size is still a full $w$$\times$$h$ (\eg $3$$\times$$3$). Necessarily, some of the parameters in our effective filters are linear combinations of others, since each effective filter is a learned linear combination of the low-rank basis filters.
Interestingly, the configuration of \cref{fig:ourmethod}, where we only use horizontal and vertical basis filters, gives rise to linear combinations of horizontal and vertical filters that are cross-shaped in the spatial domain. This is illustrated in \cref{fig:conv1filters} for filters learned in the first convolutional layer of the `vgg-gmp-lr-join' model that is described in \cref{spatialbasisresults}, where it is trained using the \gls{ilsvrc} dataset.
\begin{figure}[tbp]
\centering
\begin{tabular}[c]{rl}
&
\subcaptionbox{$3\times1$ filters.\label{fig:horizontalfilters}}
{
\includegraphics[width=0.25\textheight]{conv1_x}
}\\
\subcaptionbox{$1\times3$ filters.\label{fig:verticalfilters}}[0.1\textheight]
{
\includegraphics[height=0.25\textheight]{conv1_y}
}&
\subcaptionbox{Learned linear combinations.\label{fig:linearcomb}}
{
\includegraphics[width=0.25\textheight]{linearcombinations}
}\\
\end{tabular}
\caption[Learned cross-shaped filters]{\textbf{Learned Cross-Shaped Filters}. The cross-shaped filters (c) learned as weighted linear combinations of the (a) $3$$\times$$1$ and (b) $1$$\times$$3$ basis filters in the first convolutional layer of the `vgg-gmp-lr-join' model trained using the \gls{ilsvrc} dataset.}\label{fig:conv1filters}
\end{figure}
Note that, in general, more than two different sizes of basis filter might be used in the \gls{compositelayer}\index{composite layer}. In the more general case, for a set of heterogeneous filter groups $\gls{fmK}^{g=0,\ldots,G}$, we can express this as
\begin{equation}
\begin{aligned}
\gls{fmY}_i &= \sum_{l} f^\textrm{weights}_{il} \sum_g \gls{fmX}_l \gls{convolution} \gls{fmK}^g_{il}.
\end{aligned}
\end{equation}
For example, \cref{fig:ourmethodfullrank} shows a combination of three sets of filters with spatial dimensions $w$$\times$$1$, $1$$\times$$h$, and $w$$\times$$h$. Also note that an interesting option is to omit the $1$$\times$$1$ linear combination layer and instead allow the connection weights in a subsequent network layer to learn to combine the basis filters of the preceding layer (despite any intermediate non-linearity, \eg \glspl{relu}). This possibility is explored empirically in \cref{spatialbasisresults}.
In its use of a combination of filters in a \gls{compositelayer}\index{composite layer}, our method is similar to the `GoogLeNet' of \citet{Szegedy2014going}, which uses \gls{inception}\index{inception} modules comprising several (square) filters of different sizes ranging from 1$\times$1 to 5$\times$5. In our case, however, we are implicitly learning linear combinations of less computationally expensive filters with different orientations (\eg 3$\times$1 and 1$\times$3 filters), rather than combinations of filters of different sizes. Amongst networks with similar computational requirements, GoogLeNet is one of the most accurate for large-scale image classification tasks (see \cref{fig:vggplots}), partly due to the use of heterogeneous filters in the \gls{inception}\index{inception} modules, but also to the use of low-dimensional embeddings and global pooling.
\section[Training CNNs with Mixed-Shape Low-Rank Filters]{Training CNNs with Mixed-Shape\texorpdfstring{\\}{ }Low-Rank Filters}\label{initialization}
To determine the standard deviations to be used for weight initialization, we use an approach similar to that described by \citet{glorot2010understanding} (with the adaptation described by \citet{He2015b} for layers followed by a \gls{relu}). In \cref{initializationderivation}, we show the details of our derivation, generalizing the approach of \citet{He2015b} to the initialization of \glspl{compositelayer}\index{composite layer} comprising several groups of filters of different spatial dimensions (see \cref{initializationderivation}, \cref{fig:compositelayers}).
At the start of training, network weights are initialized at random using samples drawn from a Gaussian distribution with a standard deviation parameter specified separately for each layer. We found that the setting of these parameters was critical to the success of network training and difficult to get right, particularly because published parameter settings used elsewhere were not suitable for our new network architectures. With unsuitable weight initialization, training may fail due to {\em exploding gradients}, where back-propagated gradients grow so large as to cause numeric overflow, or {\em vanishing gradients}, where back-propagated gradients diminish to the point that their effect is dwarfed by that of weight decay\index{weight decay} and the loss does not decrease during training~\citep{Hochreiter01gradientflow}.
The approach of \citet{glorot2010understanding} works by ensuring that the magnitudes of back-propagated gradients remain approximately the same throughout the network. Otherwise, if the gradients were inappropriately scaled by some factor (\eg $\gls{beta}$) then the final back-propagated signal would be scaled by a potentially much larger factor ($\gls{beta}^L$ after $L$ layers) (see \cref{ssec:init}).
\subsection{Derivation of the Initialization for \Glsfmtplural{compositelayer}}\label{initializationderivation}
In what follows, we adopt notation similar to that of \citet{He2015b}, and follow their derivation of the appropriate standard deviation for weight initialization. However, we also generalize their approach to the initialization of \glspl{compositelayer}\index{composite layer} comprising several groups of filters of different spatial dimensions (see \cref{fig:compositelayers}).
\paragraph{Forward Propagation}
The response of the $l^\text{th}$ convolutional layer can be represented as,
\begin{equation}
\gls{vectory}_l =\gls{wmatrix}_l \gls{vectorx}_l + \gls{vectorb}_l,
\end{equation}
where $\gls{vectory}_l$ is a $\gls{d}$$\times$$1$ vector representing a pixel in the output \gls{featuremap}\index{feature map}, and $\gls{vectorx}_l$ is a $\gls{filterw} \gls{filterh} \gls{c} \times 1$ vector that represents a $\gls{filterw}$$\times$$\gls{filterh}$ sub-region of the $\gls{c}$-channel input \gls{featuremap}\index{feature map}. $\gls{wmatrix}_l$ is the $\gls{d}$$\times$$n$ weight matrix, where $\gls{d}$ is the number of filters and $n$ is the size of a filter, \ie $n = \gls{filterw} \gls{filterh} \gls{c}$ for a filter with spatial dimensions $\gls{filterw}$$\times$$\gls{filterh}$ operating on an input \gls{featuremap}\index{feature map} of $\gls{c}$ channels, and $\gls{vectorb}_l$ is the bias. Finally $\gls{vectorx}_l = f(\gls{vectory}_{l-1})$ is the output of the previous layer passed through an activation function $\gls{f}$ (\eg the application of a \gls{relu} to each element of $\gls{vectory}_{l-1}$).
\paragraph{Backward Propagation}
During backpropagation\index{backpropagation}, the gradient of a convolutional layer is computed as,
\begin{equation}
\Delta \gls{vectorx}_l = \hat{\gls{wmatrix}}_l \Delta \gls{vectory}_l,
\label{eq:back_prop_gradient}
\end{equation}
where $\Delta \gls{vectorx}_l$ and $\Delta \gls{vectory}_l$ denote the derivatives of loss $\gls{L}$ with respect to input and output pixels. $\Delta \gls{vectorx}_l$ is a $\gls{c}$$\times$$1$ vector of gradients with respect to the channels of a single pixel in the input \gls{featuremap}\index{feature map} and $\Delta \gls{vectory}$ represents $\gls{filterh}$$\times$$\gls{w}$ pixels in $d$ channels of the output \gls{featuremap}\index{feature map}. $\hat{\gls{wmatrix}}_l$ is a $\gls{c}$$\times$$\hat{n}$ matrix, and $\hat{n} = \gls{filterw}\gls{filterh}\gls{d}$. Note that $\hat{\gls{wmatrix}}_l$ can be simply reshaped from $\gls{wmatrix}_l^\top$. Also note that the elements of $\Delta \gls{vectory}_l$ correspond to pixels in the output image that had a forward dependency on the input image pixel corresponding to $\Delta \gls{vectorx}$. In backpropagation\index{backpropagation}, each element $\Delta \gls{y}_l$ of $\Delta \gls{vectory}_l$ is related to an element $\Delta \gls{x}_{l+1}$ of some $\Delta \gls{vectorx}_{l+1}$ (\ie a back-propagated gradient in the next layer) by the derivative of the activation function $\gls{f}$:
\begin{equation}
\Delta \gls{y}_l = \gls{f}^\prime (\gls{y}_l) \Delta \gls{x}_{l+1},
\end{equation}
where $\gls{f}^\prime$ is the derivative of the activation function.
\newcommand{\Expect}{\gls{expected}}
\newcommand{\Var}{\gls{var}}
\paragraph{Weight Initialization}
Now let $\Delta \gls{y}_l$, $\Delta \gls{x}_l$ and $\gls{w}_l$ be scalar random variables that describe the distribution of elements in $\Delta \gls{vectory}_l$, $\Delta \gls{vectorx}_{l}$ and $\hat{\gls{wmatrix}}_l$ respectively. Then, assuming $f^\prime (\gls{y}_l)$ and $\Delta \gls{x}_{l+1}$ are independent,
\begin{equation}
\Expect{}[\Delta \gls{y}_l] = \Expect{}[\gls{f}^\prime (\gls{y}_l)] \, \Expect{}[ \Delta \gls{x}_{l+1}].
\end{equation}
For the \gls{relu} case, $\gls{f}'(\gls{y}_l)$ is zero or one with equal probability.
Like \citet{glorot2010understanding}, we assume that $\gls{w}_l$ and $\Delta \gls{y}_l$ are independent. Thus, \cref{eq:back_prop_gradient} implies that $\Delta \gls{x}_l$ has zero mean for all layers $l$, when $\gls{w}_l$ is initialized by a distribution that is symmetric around zero. Thus we have $\Expect{}[\Delta \gls{y}_l] = \frac{1}{2}\Expect{}[\Delta \gls{x}_{l+1}]= 0$ and also $\Expect{}[(\Delta \gls{y}_l)^2] = \Var{}[\Delta \gls{y}_l] = \frac{1}{2} \Var{}[\Delta \gls{x}_{l+1}]$. Now, since each element of $\Delta \gls{vectorx}_l$ is a summation of $\hat n$ products of elements of $\hat{\gls{wmatrix}}_l$ and elements of $\Delta \gls{vectory}_l$, we can compute the variance of the gradients in \cref{eq:back_prop_gradient}:
\begin{equation}
\begin{aligned}
\Var{}[\Delta \gls{x}_l] &= \hat{n} \Var{}[\gls{w}_l] \Var{}[\Delta \gls{y}_l]\\
&= \frac{1}{2} \hat{n} \Var{}[\gls{w}_l] \Var{} [\Delta \gls{x}_{l+1}].
\end{aligned}
\end{equation}
To avoid scaling the gradients in the convolutional layers (and so avoid exploding or vanishing gradients), we set the ratio between these variances to 1:
\begin{equation}
\frac{1}{2} \hat{n} \Var{}[\gls{w}_l] = 1.
\end{equation}
This leads to the result of \citet{He2015b}, in that a layer with $\hat{n}_l$ connections followed by a \gls{relu} activation function should be initialized with a zero-mean Gaussian distribution with standard deviation $\sqrt{2/ \hat{n}_l}$.
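Expressed as code, this rule amounts to the following minimal NumPy sketch (for illustration only, and not the configuration used in our experiments): each filter bank is drawn from a zero-mean Gaussian whose standard deviation is computed from $\hat{n}$.
\begin{verbatim}
import numpy as np

def init_conv_weights(d, c, h, w, rng=None):
    """Initialize d filters of spatial size h x w over c input channels
    for a layer followed by a ReLU.

    Back-propagation through this layer involves n_hat = w * h * d
    connections per input pixel, so weights are drawn from N(0, 2/n_hat).
    """
    if rng is None:
        rng = np.random.default_rng()
    n_hat = w * h * d
    std = np.sqrt(2.0 / n_hat)
    return rng.normal(loc=0.0, scale=std, size=(d, c, h, w))
\end{verbatim}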
\paragraph{Weight Initialization in \Glsfmtplural{compositelayer}}
%%% Figure
\begin{figure}[tbp]
\centering
\includegraphics[width=0.95\textwidth]{composite}
\caption[A composite convolutional layer]{{\bf A composite convolutional layer}. \Glsfmtplural{compositelayer}\index{composite layer} convolve an input \gls{featuremap}\index{feature map} with $N$ groups of convolutional filters of several different spatial dimensions. Here the $\gls{g}^\text{th}$ group has $\gls{d}^{[\gls{g}]}$ filters with spatial dimension $\gls{w}^{[\gls{g}]} \times \gls{filterh}^{[\gls{g}]}$. The outputs are concatenated to create a $\gls{d}$ channel output \gls{featuremap}\index{feature map}. \Glsfmtplural{compositelayer}\index{composite layer} require careful weight initialization to avoid vanishing/exploding gradients during training.}
\label{fig:compositelayers}
\end{figure}
%%%
The initialization scheme described above assumes that the layer comprises filters of spatial dimension $\gls{w}$$\times$$\gls{filterh}$. Now we extend this scheme to composite convolutional layers\index{composite layer} containing $N$ groups of filters of different spatial dimensions $\gls{w}^{[\gls{g}]} \times \gls{filterh}^{[\gls{g}]}$ (where the superscript $[\gls{g}]$ denotes the group index, with $\gls{g}\in \{1,\dots,N\}$). Now the layer response is the concatenation of the responses of each group of filters:
\begin{equation}
\gls{vectory}_l =\begin{bmatrix}\gls{wmatrix}_l^{[1]} \gls{vectorx}_l^{[1]} \\ \gls{wmatrix}_l^{[2]} \gls{vectorx}_l^{[2]} \\ \vdots \\ \gls{wmatrix}_l^{[N]} \gls{vectorx}_l^{[N]} \end{bmatrix} + \gls{vectorb}_l.
\end{equation}
As before $\gls{vectory}_l$ is a $\gls{d}$$\times$$1$ vector representing the response at one pixel of the output \gls{featuremap}\index{feature map}. Now each ${\gls{vectorx}}^{[\gls{g}]}$ is a $\gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{c} \times 1$ vector that represents a different shaped $\gls{w}^{[\gls{g}]} \times \gls{filterh}^{[\gls{g}]}$ sub-region of the input \gls{featuremap}\index{feature map}. Each $\gls{wmatrix}_l^{[\gls{g}]}$ is the $\gls{d}^{[\gls{g}]}\times n^{[\gls{g}]}$ weight matrix, where $\gls{d}^{[\gls{g}]}$ is the number of filters in group $\gls{g}$ and $n^{[\gls{g}]}$ is the size of each filter, \ie $n^{[\gls{g}]} = \gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{c}$ for a filter of spatial dimension $\gls{w}^{[\gls{g}]} \times \gls{filterh}^{[\gls{g}]}$ operating on an input \gls{featuremap}\index{feature map} of $\gls{c}_l = \gls{d}_{l-1}$ channels.
During backpropagation, the gradient of the composite convolutional layer\index{composite layer} is computed as a summation of the contributions from each group of filters:
\begin{equation}
\Delta \gls{vectorx}_l = \hat{\gls{wmatrix}}_l^{[1]} \Delta \gls{vectory}_l^{[1]} + \hat{\gls{wmatrix}}_l^{[2]} \Delta \gls{vectory}_l^{[2]} + \cdots+ \hat{\gls{wmatrix}}_l^{[N]} \Delta \gls{vectory}_l^{[N]},
\label{eq:back_prop_gradient_composite}
\end{equation}
where now $\Delta \gls{vectory}^{[\gls{g}]}$ represents $\gls{w}^{[\gls{g}]} \times \gls{filterh}^{[\gls{g}]}$ pixels in $\gls{d}^{[\gls{g}]}$ channels of the output \gls{featuremap}\index{feature map}. Each $\hat{\gls{wmatrix}}_l^{[\gls{g}]}$ is a $\gls{c}_l \times \hat{n}^{[\gls{g}]}$ matrix of weights arranged appropriately for backpropagation. Again, note that each $\hat{\gls{wmatrix}}_l^{[\gls{g}]}$ can be simply reshaped from $\gls{wmatrix}_l^{[\gls{g}]}$.
As before, each element of $\Delta \gls{vectorx}_l$ is a sum over $\hat n$ products between elements of $\hat{\gls{wmatrix}}^{[\gls{g}]}_l$ and elements of $\Delta \gls{vectory}^{[\gls{g}]}_l$, where here $\hat{n}$ is given by:
\begin{equation}
\hat{n} = \sum_{\gls{g}=1}^{N}{ \gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{d}^{[\gls{g}]}}.
\end{equation}
%
In the case of a \gls{relu} non-linearity, this leads to initializing the weights from a zero-mean Gaussian distribution with standard deviation:
%
\begin{equation}
\gls{stddev} = \sqrt{\frac{2}{\sum_{\gls{g}=1}^{N}{ \gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{d}^{[\gls{g}]}}}}.\label{eqn:reluinit}
\end{equation}
In conclusion, a \gls{compositelayer}\index{composite layer} of heterogeneously-shaped filter groups\index{filter groups}, where each filter group\index{filter groups} $\gls{g}$ has $\gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{d}^{[\gls{g}]}$ outgoing connections, should be initialized as if it were a single layer with $\hat{n} = \sum_{\gls{g}=1}^{N}{ \gls{w}^{[\gls{g}]} \gls{filterh}^{[\gls{g}]} \gls{d}^{[\gls{g}]}}$. Thus, in the case of a \gls{relu} non-linearity, we find that such a \gls{compositelayer}\index{composite layer} should be initialized with a zero-mean Gaussian distribution with the standard deviation given in \cref{eqn:reluinit}.
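A corresponding sketch for a \gls{compositelayer}\index{composite layer} is given below, again for illustration only, with each group described by a tuple of its spatial dimensions and filter count; every group is drawn from the same distribution, using the summed $\hat{n}$ of \cref{eqn:reluinit}.
\begin{verbatim}
import numpy as np

def init_composite_weights(groups, c_in, rng=None):
    """Initialize a composite layer of filter groups followed by a ReLU.

    groups: list of (w_g, h_g, d_g) tuples, one per filter group.
    All groups share the standard deviation sqrt(2 / n_hat), where
    n_hat = sum_g w_g * h_g * d_g.
    """
    if rng is None:
        rng = np.random.default_rng()
    n_hat = sum(w * h * d for (w, h, d) in groups)
    std = np.sqrt(2.0 / n_hat)
    return [rng.normal(0.0, std, size=(d, c_in, h, w))
            for (w, h, d) in groups]

# e.g. a composite layer of 32 horizontal (3x1) and 32 vertical (1x3) filters
weights = init_composite_weights([(3, 1, 32), (1, 3, 32)], c_in=64)
\end{verbatim}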
\section{Results}\label{spatialbasisresults}
To validate our approach, we show that we can replace the filters used in existing state-of-the-art network architectures with low-rank representations as described above to reduce computational complexity without reducing accuracy. Here we characterize the computational complexity of a \gls{cnn} using the number of multiply-accumulate operations required for a forward pass (which depends on the size of the filters in each convolutional layer as well as the input image size and stride).
\subsection[Multiply-Accumulate Operations and Caffe CPU/GPU Timings]{Multiply-Accumulate Operations and\texorpdfstring{\\}{ }Caffe CPU/GPU Timings}\label{mavstimings}
As noted above, we characterize the computational complexity of a \gls{cnn} by the number of multiply-accumulate operations required for a forward pass, so as to give an evaluation of our method that is as independent of hardware and implementation as possible.
\input{lrdata/mavstimings}
However, we have observed a strong correlation between multiply-accumulate counts and run-time for both \gls{cpu} and \gls{gpu} implementations of the networks described here (as shown in \cref{fig:mavstimings}). Note that the Caffe timings diverge more for the initial convolutional layers, where the number of input channels is much smaller (only 3) and \gls{blas} is less efficient for the relatively small matrices being multiplied.
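For reference, the multiply-accumulate count of a single convolutional layer can be computed as in the minimal sketch below; it assumes `same' \gls{padding} and an input whose dimensions are divisible by the stride, and is intended only to make the dependence on filter size, input size, and stride explicit rather than to reproduce our exact accounting.
\begin{verbatim}
def conv_layer_macs(H, W, c, d, h, w, stride=1):
    """Multiply-accumulate operations for one convolutional layer.

    H, W: input spatial dimensions, c: input channels,
    d: number of filters, h x w: filter spatial dimensions.
    Each output pixel of each output channel costs h*w*c MACs.
    """
    out_h, out_w = H // stride, W // stride   # 'same' padding assumed
    return out_h * out_w * d * h * w * c
\end{verbatim}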
\subsection{Methodology}
We augment our training set with randomly cropped and mirrored images, but do not use any scale or photometric augmentation, or over-sampling. This allows us to compare the efficiency of different network architectures without having to factor in the computational cost of the various augmentation methods used elsewhere. During training, for every model except GoogLeNet, we adjust the learning rate according to the schedule,
\begin{equation}
\gls{lr}_t = \gls{lr}_0(1+\gls{lr}_0\gls{weightdecay} \gls{t})^{-1},
\end{equation}
where $\gls{lr}_0,\gls{lr}_t$ and $\gls{weightdecay}$ are the initial learning rate, learning rate at iteration $\gls{t}$, and weight decay\index{weight decay} respectively~\citep{Bottou2012sgdtricks}. When the validation accuracy levels off we manually reduce the learning rate by further factors of 10 until the validation accuracy no longer increases. Unless otherwise indicated, aside from changing the standard deviation of the normally distributed weight initialization, as explained in \cref{initialization}, we used the standard hyper-parameters for each given model. Our results use no test-time augmentation.
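For illustration, the schedule can be computed as follows; the hyper-parameter values in the example are placeholders, not those of any particular model.
\begin{verbatim}
def learning_rate(t, lr0, weight_decay):
    """lr_t = lr_0 * (1 + lr_0 * weight_decay * t)^-1  (Bottou, 2012)."""
    return lr0 / (1.0 + lr0 * weight_decay * t)

# placeholder values for illustration
print(learning_rate(t=0, lr0=0.01, weight_decay=0.0005))       # 0.01
print(learning_rate(t=100000, lr0=0.01, weight_decay=0.0005))  # ~0.0067
\end{verbatim}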
\subsection[VGG-11 Architectures for \Glsfmttext{ilsvrc} Object Classification and MIT Places Scene Classification]{VGG-11 Architectures for \glsfmttext{ilsvrc} Object\\Classification and MIT Places Scene Classification}\label{vggresults}
We evaluated classification accuracy of the VGG-11 based architectures using two datasets, \gls{ilsvrc}~\citep{Jia2014} and MIT Places~\citep{zhou2014learning}. The \gls{ilsvrc} dataset comprises 1.2M training images of 1000 object classes, commonly evaluated by top-1 and top-5 accuracy on the 50K image validation set. The MIT Places dataset comprises 2.4M training images from 205 scene classes, evaluated with top-1 and top-5 accuracy on the 20K image validation set.
VGG-11 (`VGG-A') is an 11-layer convolutional network introduced by \citet{Simonyan2014verydeep}. It is in the same family of network architectures used by \citet{Simonyan2014verydeep,He2015b} to obtain the state-of-the-art accuracy for \gls{ilsvrc}, but uses fewer convolutional layers and therefore fits on a single \gls{gpu} during training. During training of our VGG-11 based models, we used the standard hyperparameters detailed by \citet{Simonyan2014verydeep} and the initialization of \citet{He2015b}.
\subsubsection{\Glsfmttext{vgg}-derived Model Table}
\begin{table}[tbp]
\centering
\caption[Low-rank \Glsfmttext{vgg} \Glsfmttext{ilsvrc} results]{\textbf{\Glsfmttext{vgg} \glsfmttext{ilsvrc} Results.} Accuracy, multiply-accumulate count, and number of parameters for the baseline VGG-11 network (both with and without \gls{gap}) and more efficient versions created by the methods described in this chapter.}\label{table:vggimagenetresults}
\pgfplotstableread[col sep=comma]{lrdata/vggma.csv}\data
% \scalebox{0.9}{
\tabcolsep=5pt
\pgfplotstabletypeset[
every head row/.style={
before row=\toprule,after row=\midrule},
every last row/.style={
after row=\bottomrule},
every first row/.style={
after row=\bottomrule},
%dec sep align, % Align at decimal point
fixed zerofill, % Fill numbers with zeros
columns={Network, Stride, Multiply-Acc., Param., Top-1 Acc., Top-5 Acc.},
column type/.add={lrp{5em}p{4em}rrr}{},
columns/Multiply-Acc./.style={
column name=FLOPS {\small $\times 10^{9}$},
preproc/expr={{##1/1e9}}
},
columns/Param./.style={
column name=Param. {\small $\times 10^{7}$},
preproc/expr={{##1/1e7}}
},
columns/Network/.style={string type},
columns/Stride/.style={precision=0},
columns/Top-1 Acc./.style={precision=3},
columns/Top-5 Acc./.style={precision=3},
highlight col max ={\data}{Top-1 Acc.},
highlight col max ={\data}{Top-5 Acc.},
highlight col min ={\data}{Param.},
highlight col min ={\data}{Multiply-Acc.},
col sep=comma]{\data}
% }
\end{table}
\begin{figure}[tbp]
\centering
\pgfplotstableread[col sep=comma]{lrdata/vggma.csv}\datatable
\pgfplotsset{major grid style={dotted,red}}
\begin{tikzpicture}
\begin{axis}[
width=0.95\textwidth,
height=0.33\textheight,
axis x line=bottom,
ylabel=Top-5 Error,
xlabel=Multiply-Accumulate Operations,
axis lines=left,
enlarge x limits=0.10,
grid=major,
%xmin=0,
ytick={0.01,0.02,...,0.21},
ymin=0.1,ymax=0.15,
yticklabel={\pgfmathparse{\tick*100}\pgfmathprintnumber{\pgfmathresult}\%},style={
/pgf/number format/fixed,
/pgf/number format/precision=1
},
legend style={at={(0.01,0.98)}, anchor=north west, column sep=0.5em},
legend columns=2,
\setplotcyclecat{2},
every axis plot/.append style={fill},
]
\addplot+[mark=square*,nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex>2\def\pgfmathresult{}\fi
},
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} },]{\datatable};
\addplot+[mark=*,nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex<3\def\pgfmathresult{}\fi
},
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} },]{\datatable};
\legend{Baseline Networks, Our Results}
\end{axis}
\end{tikzpicture}
\caption[Low-rank \Glsfmttext{vgg} \Glsfmttext{ilsvrc} results]{\textbf{\Glsfmttext{vgg} \glsfmttext{ilsvrc} Results.} Multiply-accumulate operations \vs top-5 error for VGG-derived models on the \gls{ilsvrc} object classification dataset; the most efficient networks are closest to the origin. Our models are significantly faster than the baseline network, in the case of `gmp-lr-2x' by almost 60\%, while slightly lowering error. Note that the `gmp-lr' and `gmp-lr-join' networks have the same accuracy, showing that an explicit linear combination layer may be unnecessary.}\label{fig:vggplots}
\end{figure}
\begin{table}[tbp]
\centering
\caption[Low-rank MIT Places results]{{\bf MIT Places Results.} Accuracy, multiply-accumulate operations, and number of parameters for the baseline `vgg-11-gmp' network, separable filter network as described by \citet{journals/corr/JaderbergVZ14}, and more efficient models created by the methods described in this chapter. All networks were trained at stride 2 for the MIT Places dataset.
}
%\resizebox{\textwidth}{!}{
\pgfplotstableread[col sep=comma]{lrdata/mitma.csv}\data
\pgfplotstabletypeset[
every head row/.style={
before row=\toprule,after row=\midrule},
every last row/.style={
after row=\bottomrule},
fixed zerofill, % Fill numbers with zeros
columns={Network, Stride, Multiply-Acc., Param., Top-1 Acc., Top-5 Acc.},
column type/.add={lp{5em}p{5em}rrrr}{},
columns/Multiply-Acc./.style={
column name=FLOPS {\small $\times 10^{8}$},
preproc/expr={{##1/1e8}}
},
columns/Param./.style={
column name=Param. {\small $\times 10^{7}$},
preproc/expr={{##1/1e7}}
},
columns/Network/.style={string type},
columns/Stride/.style={precision=0},
columns/Top-1 Acc./.style={precision=3},
columns/Top-5 Acc./.style={precision=3},
highlight col max ={\data}{Top-1 Acc.},
highlight col max ={\data}{Top-5 Acc.},
highlight col min ={\data}{Param.},
highlight col min ={\data}{Multiply-Acc.},
col sep=comma]{\data}
%}
\label{table:placesresults}
\end{table}
\input{lrdata/vggplacesplots}
\Cref{table:vggarch} shows the architectural details of the VGG-11-derived models used in \cref{vggresults}.
In what follows, we compare the accuracy of these architectures. Results for \gls{ilsvrc} are given in \cref{table:vggimagenetresults}, and plotted in \cref{fig:vggplots}. Results for MIT Places are given in \cref{table:placesresults}, and plotted in \cref{fig:placesresults}.
\paragraph{Baseline (Global Max Pooling)} Compared to the version of the network described by \citet{Simonyan2014verydeep}, we use a variant that replaces the final $2$$\times$$2$ max pooling layer before the first fully-connected layer with a global max pooling operation, similar to the global average pooling used by \citet{Lin2013NiN,Szegedy2014going}. We evaluated the accuracy of the baseline VGG-11 network with global max-pooling (\textbf{vgg-gmp}) and without (\textbf{vgg-11}) on the two datasets. We trained these networks at stride 1 on the \gls{ilsvrc} dataset and at stride 2 on the larger MIT Places dataset. This globally max-pooled variant of VGG-11 uses over 75\% fewer parameters than the original network and gives consistently better accuracy -- almost 3 percentage points lower top-5 error on \gls{ilsvrc} than the baseline VGG-11 network (see \cref{table:vggimagenetresults}). We used this network as the baseline for the rest of our experiments.
\paragraph{Separable Filters} To evaluate the separable filter approach described in \cref{seqsep} (illustrated in \cref{fig:separableseq}), we replaced each convolutional layer in VGG-11 with a sequence of two layers, the first containing horizontally-oriented $1$$\times$$3$ filters and the second containing vertically-oriented $3$$\times$$1$ filters (\textbf{vgg-gmp-sf}). These filters applied in sequence represent $3$$\times$$3$ kernels using a low-dimensional basis space. Unlike \citet{journals/corr/JaderbergVZ14}, we trained this network from scratch instead of approximating the full-rank filters in a previously trained network. Compared to the original VGG-11 network, the separable filter version requires approximately 14\% less computation. Results are shown in \cref{table:vggimagenetresults} for \gls{ilsvrc} and \cref{table:placesresults} for MIT Places. Accuracy for this network is approx.~0.8\% lower than that of the baseline vgg-11-gmp network for \gls{ilsvrc} and broadly comparable for MIT Places. This approach does not give as significant a reduction in computational complexity as the methods that follow, but it is nonetheless interesting that separable filters are capable of achieving quite high classification accuracy on such challenging tasks.
\paragraph{Simple Horizontal/Vertical Basis} To demonstrate the efficacy of the simple low rank filter representation illustrated in \cref{fig:separablemethods}c, we created a new network architecture (\textbf{vgg-gmp-lr-join}) by replacing each of the convolutional layers in VGG-11 (original filter dimensions were $3$$\times$$3$) with a sequence of two layers. The first layer comprises half $1$$\times$$3$ filters and half $3$$\times$$1$ filters whilst the second layer comprises the same number of $1$$\times$$1$ filters. The resulting network is approximately 49\% faster than the original and yet it gives broadly comparable accuracy (within 1 percentage point) for both the \gls{ilsvrc} and MIT Places datasets.
\paragraph{Full-Rank Mixture} An interesting question concerns the impact on accuracy of combining a small proportion of 3$\times$3 filters with the 1$\times$3 and 3$\times$1 filters used in `vgg-gmp-lr-join'. To answer this question, we trained a network, \textbf{vgg-gmp-lr-join-wfull}, with a mixture of 25\% $3$$\times$$3$ and 75\% $1$$\times$$3$ and $3$$\times$$1$ filters, while preserving the total number of filters of the baseline network (as illustrated in \cref{fig:ourmethodfullrank}). This network was significantly more accurate than both `vgg-gmp-lr-join' and the baseline, with a top-5 centre-crop accuracy of 89.7\% on \gls{ilsvrc} and a computational saving of approximately 16\% over our baseline. We note that this accuracy is approx.~1 percentage point higher than that of GoogLeNet.
\paragraph{Implicitly Learned Combinations} In addition, we trained a network similar to `vgg-gmp-lr-join' but without the $1$$\times$$1$ convolutional layer (as shown in \cref{fig:ourmethod}) used to sum the contributions of the $3$$\times$$1$ and $1$$\times$$3$ filters (\textbf{vgg-gmp-lr}). Interestingly, because of the elimination of the extra $1$$\times$$1$ layers, this gives an additional computational saving such that this model requires only $1/3$ of the computation of our baseline, with no reduction in accuracy. This seems to be a consequence of the fact that the subsequent convolutional layer is itself capable of learning effective combinations of filter responses, even after the intermediate \gls{relu} non-linearity.
We also trained such a network with double the number of convolutional filters (\textbf{vgg-gmp-lr-2x}), \ie with an equal number of $1$$\times$$3$ and $3$$\times$$1$ filters, or $2c$ filters as shown in \cref{fig:ourmethod}. We found this to increase accuracy further (88.9\% Top-5 on \gls{ilsvrc}) while still being approximately 58\% faster than our baseline network.
\paragraph{Low-Dimensional Embeddings}
% \input{lrdata/vggmodeltable}
\afterpage{\begin{landscape}
\renewcommand{\arraystretch}{1.1}
\setlength{\tabcolsep}{0.3em}
\begin{table*}[p]
\scriptsize
\centering
\caption[VGG model low-rank architectures]{{\bf VGG Model Architectures}. Here ``3$\times$3, 32'' denotes 32 3$\times$3 filters, ``/2'' denotes stride 2, ``fc'' denotes fully-connected, and ``$\|$'' denotes a concatenation within a composite layer.}\label{vggmodeltable}
%\resizebox{0.97\textwidth}{!}{
\begin{tabular}{@{}|c||c|c|c|c|c|c|c|c|@{}}
\hline
Layer & VGG-11 & \textbf{GMP} & \textbf{GMP-SF} & \textbf{GMP-LR} & \textbf{GMP-LR-2X} & \textbf{GMP-LR-JOIN} & \textbf{GMP-LR-LDE} & \textbf{GMP-LR-JOIN-WFULL}\\
\hline
\hline
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONV1
\textbf{conv1}& \multicolumn{2}{c|}{3$\times$3, 64} & 1$\times$3, 64 & 3$\times$1, 32 $\|$ 1$\times$3, 32 & 3$\times$1, 64 $\|$ 1$\times$3, 64 & \multicolumn{2}{c|}{3$\times$1, 32 $\|$ 1$\times$3, 32}& 3$\times$1, 24 $\|$ 1$\times$3, 24 $\|$ 3$\times$3, 16\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 64 & & & 1$\times$1, 64 & 1$\times$1, 32 & 1$\times$1, 64 \\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{8}{c|}{2$\times$2 maxpool, /2}\\
\hline
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONV2
\textbf{conv2} & \multicolumn{2}{c|}{3$\times$3, 128} & 1$\times$3, 128 & 3$\times$1, 64 $\|$ 1$\times$3, 64 & 3$\times$1, 128 $\|$ 1$\times$3, 128 & \multicolumn{2}{c|}{3$\times$1, 64 $\|$ 1$\times$3, 64} & 3$\times$1, 48 $\|$ 1$\times$3, 48 $\|$ 3$\times$3, 32\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 128 & & & 1$\times$1, 128 & 1$\times$1, 64 & 1$\times$1, 128 \\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{8}{c|}{2$\times$2 maxpool, /2}\\
\hline
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONV3
\textbf{conv3} & \multicolumn{2}{c|}{3$\times$3, 256} & 1$\times$3, 256 & 3$\times$1, 128 $\|$ 1$\times$3, 128 & 3$\times$1, 256 $\|$ 1$\times$3, 256 & \multicolumn{2}{c|}{3$\times$1, 128 $\|$ 1$\times$3, 128} & 3$\times$1, 96 $\|$ 1$\times$3, 96 $\|$ 3$\times$3, 64 \\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 256 & & & 1$\times$1, 256 & 1$\times$1, 128 & 1$\times$1, 256\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{2}{c|}{3$\times$3, 256} & 1$\times$3, 256 & 3$\times$1, 128 $\|$ 1$\times$3, 128 & 3$\times$1, 256 $\|$ 1$\times$3, 256 &\multicolumn{2}{c|}{3$\times$1, 128 $\|$ 1$\times$3, 128} & 3$\times$1, 96 $\|$ 1$\times$3, 96 $\|$ 3$\times$3, 64 \\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 256 & & & 1$\times$1, 256 & 1$\times$1, 128 & 1$\times$1, 256\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{8}{c|}{2$\times$2 maxpool, /2}\\
\hline
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONV4
\textbf{conv4} & \multicolumn{2}{c|}{3$\times$3, 512} & 1$\times$3, 512 & 3$\times$1, 256 $\|$ 1$\times$3, 256 & 3$\times$1, 512 $\|$ 1$\times$3, 512 &\multicolumn{2}{c|}{3$\times$1, 256 $\|$ 1$\times$3, 256} & 3$\times$1, 192 $\|$ 1$\times$3, 192 $\|$ 3$\times$3, 128\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 512 & & & 1$\times$1, 512 & 1$\times$1, 256 & 1$\times$1, 512\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{2}{c|}{3$\times$3, 512} & 1$\times$3, 512 & 3$\times$1, 256 $\|$ 1$\times$3, 256 & 3$\times$1, 512 $\|$ 1$\times$3, 512 & \multicolumn{2}{c|}{3$\times$1, 256 $\|$ 1$\times$3, 256} & 3$\times$1, 192 $\|$ 1$\times$3, 192 $\|$ 3$\times$3, 128\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 512 & & & 1$\times$1, 512 & 1$\times$1, 256 & 1$\times$1, 512\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{8}{c|}{2$\times$2 maxpool, /2}\\
\hline
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONV5
\textbf{conv5} & \multicolumn{2}{c|}{3$\times$3, 512} & 1$\times$3, 512 & 3$\times$1, 256 $\|$ 1$\times$3, 256 & 3$\times$1, 512 $\|$ 1$\times$3, 512 & \multicolumn{2}{c|}{3$\times$1, 256 $\|$ 1$\times$3, 256} & 3$\times$1, 192 $\|$ 1$\times$3, 192 $\|$ 3$\times$3, 128\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 512 & & & 1$\times$1, 512 & 1$\times$1, 256 & 1$\times$1, 512\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& \multicolumn{2}{c|}{3$\times$3, 512} & 1$\times$3, 512 & 3$\times$1, 256 $\|$ 1$\times$3, 256 & 3$\times$1, 512 $\|$ 1$\times$3, 512 & \multicolumn{2}{c|}{3$\times$1, 256 $\|$ 1$\times$3, 256} & 3$\times$1, 192 $\|$ 1$\times$3, 192 $\|$ 3$\times$3, 128\\
\cline{4-4} \cline{7-9}
& \multicolumn{2}{c|}{} & 3$\times$1, 512 & & & 1$\times$1, 512 & 1$\times$1, 256 & 1$\times$1, 512\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\cline{2-9}
& 2$\times$2 maxpool, /2 & \multicolumn{7}{c|}{global maxpool}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% FC
\hline
\textbf{fc6} & $7^2$ $\times$ 512 $\times$ 4096 & \multicolumn{7}{c|}{512 $\times$ 4096}\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\hline
\textbf{fc7} & \multicolumn{8}{c|}{4096 $\times$ 4096}\\
\cline{2-9}
& \multicolumn{8}{c|}{ReLU}\\
\hline
\textbf{fc8} & \multicolumn{8}{c|}{4096 $\times$ 1000}\\
\cline{2-9}
& \multicolumn{8}{c|}{softmax}\\
\hline
\end{tabular}
%}
\label{table:vggarch}
\end{table*}
\end{landscape}}
We attempted to reduce the computational complexity of our `gmp-lr' network further in the \textbf{vgg-gmp-lr-lde} network by using a stride of 2 in the first convolutional layer, and adding low-dimensional embeddings, as in \citet{Lin2013NiN,Szegedy2014going}. We reduced the number of output channels by half after each convolutional layer using $1$$\times$$1$ convolutional layers, as detailed in \cref{vggmodeltable,table:vggarch}. While this reduces computation significantly, by approx.~86\% compared to our baseline, we saw a decrease in top-5 accuracy on \gls{ilsvrc} of 1.2 percentage points. We note, however, that this network remains 2.5 percentage points more accurate than the original VGG-11 network, while being 87\% faster.
\subsection{GoogLeNet for \Glsfmttext{ilsvrc} Object Classification}
GoogLeNet, introduced by \citet{Szegedy2014going}, is one of the most efficient networks for \gls{ilsvrc}, coming close to state-of-the-art results with a fraction of the computation and model size of even VGG-11. The GoogLeNet \gls{inception}\index{inception} module is a composite layer comprising square filters of several sizes ($1$$\times$$1$, $3$$\times$$3$, and $5$$\times$$5$) together with the output of a $3$$\times$$3$ pooling operation. All of these outputs are concatenated and used as input for successive layers (see \cref{backgroundinception}).
For the \textbf{googlenet-lr} network, within the \gls{inception}\index{inception} modules only, we replaced each layer of $3$$\times$$3$ filters with low-rank $3$$\times$$1$ and $1$$\times$$3$ filters, and each layer of $5$$\times$$5$ filters with a set of low-rank $5$$\times$$1$ and $1$$\times$$5$ filters. For the \textbf{googlenet-lr-conv1} network, we similarly replaced the first and second convolutional layers with $7$$\times$$1$~/~$1$$\times$$7$ and $3$$\times$$1$~/~$1$$\times$$3$ layers respectively.
Results are shown in \cref{table:googlenetimagenetresultsch4} and \cref{fig:googlenetimagenetresultsch4}. GoogLeNet uses intermediate losses and fully-connected layers only at training time; these are removed at test time, so the test-time model is significantly smaller than the training-time model. \Cref{table:googlenetimagenetresultsch4} accordingly reports test-time model size. The low-rank network delivers comparable classification accuracy using 26\% less compute; no other network produces comparable accuracy within an order of magnitude of compute. We note that although the Caffe pre-trained GoogLeNet model~\citep{Jia2014} has a top-5 accuracy of 0.889, our training of the same network using the given model definition, including the hyper-parameters and training schedule, but with a different random initialization, reached a top-5 accuracy of 0.883.
\begin{table}[tbp]
\centering
\caption[Low-rank \Glsfmttext{googlenet} \Glsfmttext{ilsvrc} results]{\textbf{\Glsfmttext{googlenet} \glsfmttext{ilsvrc} Results.} Accuracy, multiply-accumulate count, and number of parameters for the baseline GoogLeNet network and more efficient versions created by the methods described in this chapter.
}
% \resizebox{\textwidth}{!}{
\pgfplotstableread[col sep=comma]{lrdata/googlenetma.csv}\data
\pgfplotstabletypeset[
every head row/.style={
before row=\toprule,after row=\midrule},
every last row/.style={
after row=\bottomrule},
every first row/.style={
after row=\bottomrule},
fixed zerofill, % Fill numbers with zeros
columns={Network, Multiply-Acc., Test Param., Top-1 Acc., Top-5 Acc.},
columns/Multiply-Acc./.style={
column name=FLOPS {\small $\times 10^{9}$},
preproc/expr={{##1/1e9}}
},
columns/Test Param./.style={
column name=Test Param. {\small $\times 10^{6}$},
preproc/expr={{##1/1e6}}
},
column type/.add={lrrrrrr}{},
columns/Network/.style={string type},
columns/Top-1 Acc./.style={precision=3},
columns/Top-5 Acc./.style={precision=3},
highlight col max ={\data}{Top-1 Acc.},
highlight col max ={\data}{Top-5 Acc.},
highlight col min ={\data}{Test Param.},
highlight col min ={\data}{Multiply-Acc.},
col sep=comma]{\data}
% }
\label{table:googlenetimagenetresultsch4}
\end{table}
\begin{figure}[tbp]
\centering
\pgfplotstableread[col sep=comma]{lrdata/googlenetma.csv}\datatable
\pgfplotsset{major grid style={dotted,red}}
\begin{tikzpicture}
\begin{axis}[
width=0.95\textwidth,
height=0.33\textwidth,
axis x line=bottom,
ylabel=Top-5 Error,
xlabel=Multiply-Accumulate Operations,
axis lines=left,
enlarge x limits=0.10,
grid=major,
ytick={0.01,0.02,...,0.21},
ymin=0.10,ymax=0.15,
yticklabel={\pgfmathparse{\tick*100}\pgfmathprintnumber{\pgfmathresult}\%},style={
/pgf/number format/fixed,
/pgf/number format/precision=1
},
legend style={at={(0.98,0.98)}, anchor=north east, column sep=0.5em},
legend columns=2,
\setplotcyclecat{2},
every axis plot/.append style={fill},
]
\addplot+[mark=square*,nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex>0\def\pgfmathresult{}\fi
}
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} },]{\datatable};
\addplot+[mark=*,nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex<1\def\pgfmathresult{}\fi
}
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} },]{\datatable};
\legend{Baseline, Our Results}
\end{axis}
\end{tikzpicture}
\caption[Low-Rank \Glsfmttext{googlenet} \glsfmttext{ilsvrc} results]{\textbf{\Glsfmttext{googlenet} \glsfmttext{ilsvrc} Results.} Multiply-accumulate operations \vs top-5 error for \glsfmttext{googlenet}-derived models on \glsfmttext{ilsvrc} object classification dataset.}
\label{fig:googlenetimagenetresultsch4}
\end{figure}
\subsection{\Glsfmttext{nin} for \Glsfmttext{cifar10} Object Classification}
The \gls{cifar10} dataset consists of 60,000 $32\times 32$ images in 10 classes, with 6,000 images per class, split into standard sets of 50,000 training images and 10,000 test images~\citep{CIFAR10}. As a baseline for the \gls{cifar10} dataset, we used the \gls{nin} architecture~\citep{Lin2013NiN}, which has a published test-set error of 8.81\%; with random crops during training, our own training of this network reaches an error of 8.1\%. Like most state-of-the-art \gls{cifar10} results, this was with ZCA pre-processed training and test data~\citep{goodfellow2013maxout} and training-time mirror augmentation, in addition to the random sub-crops. The results of our \gls{cifar10} experiments are listed in \cref{table:cifarresultsch4} and plotted in \cref{fig:cifarresultsch4}.
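For reference, a minimal sketch of ZCA whitening is given below; the regularization constant \texttt{eps} is an assumption, and the exact pre-processing follows \citet{goodfellow2013maxout}.
\begin{verbatim}
import numpy as np

def zca_whiten(train, test, eps=1e-1):
    """train, test: arrays of shape (n_samples, n_features),
    e.g. flattened 32x32x3 images. Statistics come from train only."""
    mean = train.mean(axis=0)
    x = train - mean
    cov = x.T @ x / x.shape[0]
    u, s, _ = np.linalg.svd(cov)                    # cov is symmetric PSD
    w = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T   # ZCA transform
    return (train - mean) @ w, (test - mean) @ w
\end{verbatim}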
\begin{table}[tbp]
\centering
\caption[Low-rank \Glsfmttext{cifar10} results]{\textbf{\Glsfmttext{nin} \glsfmttext{cifar10} Results.} Accuracy, multiply-accumulate operations, and number of parameters for the baseline \gls{nin} model and more efficient versions created by the methods described in this chapter.}
%\resizebox{\textwidth}{!}{
\pgfplotstableread[col sep=comma]{lrdata/cifarma.csv}\data
\pgfplotstabletypeset[
every head row/.style={
before row=\toprule,after row=\midrule},
every last row/.style={
after row=\bottomrule},
every first row/.style={
after row=\bottomrule},
fixed zerofill, % Fill numbers with zeros
columns={Network, Multiply-Acc., Param., Accuracy},
columns/Multiply-Acc./.style={
column name=FLOPS {\small $\times 10^{8}$},
preproc/expr={{##1/1e8}}
},
columns/Param./.style={
column name=Param. {\small $\times 10^{5}$},
preproc/expr={{##1/1e5}}
},
column type/.add={lrrr}{},
columns/Network/.style={string type},
columns/Accuracy/.style={precision=4},
highlight col max ={\data}{Accuracy},
highlight col min ={\data}{Param.},
highlight col min ={\data}{Multiply-Acc.},
col sep=comma]{\data}
%}
\label{table:cifarresultsch4}
\end{table}
\begin{figure}[tbp]
\centering
\pgfplotstableread[col sep=comma]{lrdata/cifarma.csv}\datatable
\pgfplotsset{major grid style={dotted,red}}
\begin{tikzpicture}
\begin{axis}[
width=0.95\textwidth,
height=0.33\textwidth,
axis x line=bottom,
ylabel=Error,
xlabel=Multiply-Accumulate Operations,
axis lines=left,
enlarge x limits=0.10,
grid=major,
xticklabel style={
/pgf/number format/fixed,
/pgf/number format/fixed zerofill,
/pgf/number format/precision=1
},
ytick={0.01,0.015,0.02,...,0.21},
yticklabel={\pgfmathparse{\tick*100}\pgfmathprintnumber{\pgfmathresult}\%},style={
/pgf/number format/fixed,
/pgf/number format/fixed zerofill,
/pgf/number format/precision=1
},
ymin=0.07,ymax=0.1,
legend style={at={(0.98,0.98)}, anchor=north east, column sep=0.5em},
legend columns=2,
\setplotcyclecat{2},
every axis plot/.append style={fill},
]
\addplot+[mark=square*,
nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex>1\def\pgfmathresult{}\fi
}
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Accuracy} }]{\datatable};
\addplot+[mark=*,nodes near coords,only marks,
point meta=explicit symbolic,
x filter/.code={
\ifnum\coordindex<2\def\pgfmathresult{}\fi
}
] table[meta=Network,x=Multiply-Acc.,y expr={1 - \thisrow{Accuracy} }]{\datatable};
\legend{Baseline Networks, Our Results}
\end{axis}
\end{tikzpicture}
\caption[Low-rank \Glsfmttext{cifar10} results]{\textbf{\Glsfmttext{nin} \glsfmttext{cifar10} Results.} Multiply-accumulate operations \vs error for \gls{nin} derived models on \gls{cifar10} object classification dataset.}
\label{fig:cifarresultsch4}
\end{figure}
This architecture uses $5$$\times$$5$ filters in some layers. We found that we could replace all of these with $3$$\times$$3$ filters with comparable accuracy. As suggested by \citet{Simonyan2014verydeep}, stacked $3$$\times$$3$ filters have the effective receptive field of larger filters with less computational complexity. In this \textbf{nin-c3} network, we replaced the first convolutional layer with one $3$$\times$$3$ layer, and the second convolutional layer with two $3$$\times$$3$ layers. This network is 26\% faster than the standard \gls{nin} model, with only 54\% of the model parameters. Using our low-rank filters in this network, we trained the \textbf{nin-c3-lr} network, which has similar accuracy (91.8\% \vs 91.9\%) but requires only approximately 54\% of the original network's computation and 45\% of its parameters.
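The following sketch contrasts a single $5$$\times$$5$ convolution with two stacked $3$$\times$$3$ convolutions having the same effective receptive field; channel counts and the intermediate non-linearity are illustrative.
\begin{verbatim}
import torch.nn as nn

def five_by_five(c):
    # 25*c*c multiply-accumulates per output pixel
    return nn.Conv2d(c, c, kernel_size=5, padding=2)

def stacked_three_by_three(c):
    # same 5x5 effective receptive field,
    # 2*9*c*c = 18*c*c multiply-accumulates per output pixel
    return nn.Sequential(
        nn.Conv2d(c, c, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(c, c, kernel_size=3, padding=1))
\end{verbatim}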
\subsection{Comparing with \Glsfmttext{ilsvrc} State-of-the-Art Networks}
\Cref{fig:bigpicturema,fig:bigpictureparam} compare published top-5 \gls{ilsvrc} validation error \vs multiply-accumulate operations and number of model parameters (respectively) for several state-of-the-art networks~\citep{Simonyan2014verydeep,Szegedy2014going,He2015b}. The error rates for these networks are reported only as obtained with various combinations of computationally expensive training and test-time augmentation methods, including scale and photometric augmentation, multi-model ensembles, and multi-view/dense oversampling. This can make it difficult to compare model architectures, especially with respect to their computational requirements.
State-of-the-art networks such as MSRA-C\footnote{At the time of these experiments.}, VGG-19, and oversampled GoogLeNet are orders of magnitude more computationally complex than our networks. From \cref{fig:bigpicturema}, where multiply-accumulate operations are plotted on a log scale, increasing the model size and/or the computational cost of test-time augmentation of \glspl{cnn}\index{CNN} appears to yield diminishing returns in validation error. Our models \emph{without} training or test-time augmentation show comparable accuracy to networks such as VGG-13 \emph{with} training and test-time augmentation, while having far less computational complexity and a smaller model size. In particular, the `googlenet-lr' model has a much smaller test-time model size than any network of comparable accuracy.
\afterpage{
\begin{landscape}
\begin{figure}[p]
\centering
\pgfplotstableread[col sep=comma]{lrdata/bigpicture.csv}\datatable
\pgfplotstableread[col sep=comma]{lrdata/bigpicture_ours.csv}\datatableours
\pgfplotstableread[col sep=comma]{lrdata/bigpicture_aug.csv}\datatableaug
\pgfplotsset{major grid style={dotted,red}}
\pgfplotsset{minor grid style={dotted,red}}
\begin{tikzpicture}
\begin{axis}[
width=1.37\textwidth,
height=0.95\textheight,
axis x line=bottom,
ylabel=Top-5 Error,
xlabel=$\log_{10}$(Multiply-Accumulate Operations),
axis lines=left,
enlarge x limits=0.10,
enlarge y limits=0.05,
grid=both,
ytick={0.01,0.02,...,0.2},
xmode=log,
yticklabel={\pgfmathparse{\tick*100}\pgfmathprintnumber{\pgfmathresult}\%},style={
/pgf/number format/fixed,
/pgf/number format/precision=1
},
\setplotcyclecat{3},
every axis plot/.append style={fill},
legend style={at={(0.01,0.01)},anchor=south west},
]
\addplot+[mark=*,
nodes near coords,only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatableours};
\addplot+[mark=*,
nodes near coords,only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatable};
\addplot+[mark=square*,
nodes near coords,only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Test Multiply-Acc.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatableaug};
\legend{Our Results, Crop \& Mirror Aug., Extra Augmentation}
\end{axis}
\end{tikzpicture}
\caption[Computational complexity of state-of-the-art \glsfmttext{ilsvrc} models]{\textbf{Computational complexity of state-of-the-art \Glsfmttext{ilsvrc} models.} Test-time multiply-accumulate operations \vs top-5 error on state-of-the-art networks using a \emph{single} model. Note the difference in accuracy and computational complexity for the VGG-11 model with and without extra augmentation. Our `vgg-gmp-lr-join-wfull' model \emph{without} extra augmentation is more accurate than VGG-11 \emph{with} extra augmentation, and is much less computationally complex.}
\label{fig:bigpicturema}
\end{figure}
\end{landscape}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\afterpage{
\begin{landscape}
\begin{figure}[p]
\centering
\pgfplotstableread[col sep=comma]{lrdata/bigpicture.csv}\datatable
\pgfplotstableread[col sep=comma]{lrdata/bigpicture_ours.csv}\datatableours
\pgfplotstableread[col sep=comma]{lrdata/bigpicture_aug.csv}\datatableaug
\pgfplotsset{major grid style={dotted,red}}
\pgfplotsset{minor grid style={dotted,red}}
\begin{tikzpicture}
\begin{axis}[
width=1.37\textwidth,
height=0.95\textheight,
axis x line=bottom,
ylabel=Top-5 Error,
xlabel=$\log_{10}$(Number of Parameters),
axis lines=left,
enlarge y limits=0.05,
grid=both,
ytick={0.01,0.02,...,0.2},
xmode=log,
xmin=10e5,xmax=10e8,
yticklabel={\pgfmathparse{\tick*100}\pgfmathprintnumber{\pgfmathresult}\%},style={
/pgf/number format/fixed,
/pgf/number format/precision=1
},
\setplotcyclecat{3},
every axis plot/.append style={fill},
legend style={at={(0.01,0.01)},anchor=south west},
]
\addplot+[mark=*,
nodes near coords,
only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Param.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatableours};
\addplot+[mark=*,
nodes near coords,
only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Param.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatable};
\addplot+[mark=square*,
nodes near coords,
only marks,
point meta=explicit symbolic,
every node near coord/.append style={xshift=0.01em, anchor=west, font=\scriptsize\sffamily\sansmath},
] table[meta=Real Name,x=Param.,y expr={1 - \thisrow{Top-5 Acc.} }]{\datatableaug};
\legend{Our Results, Crop \& Mirror Aug., Extra Augmentation}
\end{axis}
\end{tikzpicture}
\caption[The number of parameters of state-of-the-art \glsfmttext{ilsvrc} models]{\textbf{Number of Parameters of State-of-the-Art \Glsfmttext{ilsvrc} Models.} Test-time parameters \vs top-5 error for state-of-the-art models. The main factor in reduced model size is the use of global pooling or lack of fully-connected layers. Note that our `googlenet-lr' model is almost an order of magnitude smaller than any other network of comparable accuracy.}
\label{fig:bigpictureparam}
\end{figure}
\end{landscape}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{table}[tbp]
\centering
\caption[State-of-the-art single models with extra augmentation]{\textbf{State-of-the-Art Single Models with Extra Augmentation.} Top-5 \glsfmttext{ilsvrc} validation accuracy, single view and augmented test-time \glsfmttext{flops} (multiply-accumulate) count, and number of parameters for various state-of-the-art models \emph{with} various training and test-time augmentation methods. A multi-model ensemble of MSRA-C is the current state-of-the-art network.}
%\resizebox{\columnwidth}{!}{
\pgfplotstableread[col sep=comma]{lrdata/bigpicture_aug.csv}\data
\pgfplotstabletypeset[
every head row/.style={
before row=\toprule,after row=\midrule},
every last row/.style={
after row=\bottomrule},
empty cells with={--},
%dec sep align, % Align at decimal point
fixed zerofill, % Fill numbers with zeros
columns={Real Name, Multiply-Acc., Test Multiply-Acc., Param., Top-5 Acc.},
column type/.add={lrrrrrr}{},
columns/Multiply-Acc./.style={
column name=FLOPS {\small $\times 10^{9}$},
preproc/expr={{##1/1e9}}
},
columns/Test Multiply-Acc./.style={
column name=FLOPS w/ Aug. {\small $\times 10^{9}$},
preproc/expr={{##1/1e9}}
},
columns/Param./.style={
column name=Param. {\small $\times 10^{7}$},
preproc/expr={{##1/1e7}}
},
columns/Real Name/.style={string type},
columns/Stride/.style={precision=0},
columns/Top-1 Acc./.style={precision=3},
columns/Top-5 Acc./.style={precision=3},
highlight col max ={\data}{Top-5 Acc.},
highlight col min ={\data}{Param.},
highlight col min ={\data}{Multiply-Acc.},
highlight col min ={\data}{Test Multiply-Acc.},
col sep=comma]{\data}
\label{table:bigpicturetable}
\end{table}
\section{Discussion}
%We found our network architecture, which learns a small set of $3$$\times$$3$ basis filters along with many $1$$\times$$3$~/~$3$$\times$$1$ basis filters, gave the most impressive results. Such a model, `vgg-lr-wfull', increased the top-5 center crop validation accuracy on \gls{ilsvrc} by 1 percentage points in accuracy (89.7\% \vs 88.7\%) while reducing computation by 16\%, over our baseline network with global max-pooling. Although we did not try such a configuration of GoogLeNet, our `googlenet-lr' network using only $1$$\times$$3$~/~$3$$\times$$1$ and $1$$\times$$5$~/~$5$$\times$$1$ basis filters within the \gls{inception}\index{inception} modules obtained the smallest model size while maintaining comparable accuracy, using 26\% less compute than GoogLeNet and 41\% less model parameters.`
This chapter has presented a method to train \glspl{cnn} from scratch using low-rank filters. This is made possible by a new way of initializing the network's weights that accounts for the presence of differently-shaped filters in \glspl{compositelayer}\index{composite layer}.
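One way to realize such a shape-aware initialization is sketched below: each filter bank in a composite layer is given a He-style standard deviation computed from its own fan-in, so that the differently-shaped branches produce activations of similar variance. This is an illustrative sketch only, and does not necessarily reproduce the derivation used for our networks, which is given earlier in this chapter.
\begin{verbatim}
import math
import torch.nn as nn

def init_composite(branches):
    """branches: iterable of nn.Conv2d layers whose outputs are
    concatenated within one composite layer."""
    for conv in branches:
        k_h, k_w = conv.kernel_size
        fan_in = k_h * k_w * conv.in_channels   # differs for 3x3, 3x1, 1x3
        std = math.sqrt(2.0 / fan_in)           # He-style scaling per shape
        nn.init.normal_(conv.weight, mean=0.0, std=std)
        if conv.bias is not None:
            nn.init.zeros_(conv.bias)
\end{verbatim}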
Validation on image classification across three popular datasets confirms similar or higher accuracy than state-of-the-art models, with much greater computational efficiency.
It is somewhat surprising that networks based on learning filters with less representational ability are able to perform as well as, or better than, \glspl{cnn}\index{CNN} with full $k$$\times$$k$ filters on the task of image classification. However, much interesting small-scale image structure is well-characterized by low-rank filters, \eg edges and gradients. Our experiments training a separable (rank-1) model (`vgg-gmp-sf') on \gls{ilsvrc} and MIT Places show surprisingly high accuracy on what are considered challenging problems --- approx.\ 88\% top-5 accuracy on \gls{ilsvrc} --- but not high enough to match the accuracy of the models on which it is based.
Given that most discriminative filters learned for image classification appear to be low-rank, we instead structure our architectures with a set of basis filters in the way illustrated in \cref{fig:ourmethodfullrank}. This allows our networks to learn the most effective combinations of complex (\eg $k$$\times$$k$) and simple (\eg $1$$\times$$k$, $k$$\times$$1$) filters. Furthermore, by restricting how many complex spatial filters may be learned, this architecture helps to prevent overfitting and improves generalization. Even in our models that do not use square $k$$\times$$k$ filters, we obtain accuracies comparable to the baseline model, since the rank-2 cross-shaped filters effectively learned as a combination of $3$$\times$$1$ and $1$$\times$$3$ filters can represent more complex local pixel relations than rank-1 filters.
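A minimal sketch of such a composite basis layer follows, mirroring the `$3$$\times$$1$ $\|$ $1$$\times$$3$ $\|$ $3$$\times$$3$' pattern followed by a $1$$\times$$1$ combination used in our architecture tables; the channel splits shown are placeholders rather than those of any particular model.
\begin{verbatim}
import torch
import torch.nn as nn

class CompositeBasisLayer(nn.Module):
    """A few full-rank 3x3 basis filters alongside many 3x1 and 1x3
    basis filters, concatenated and mixed by a 1x1 projection
    (placeholder channel counts)."""
    def __init__(self, in_ch, out_ch, n_square=32, n_low_rank=96):
        super().__init__()
        self.square = nn.Conv2d(in_ch, n_square, kernel_size=3, padding=1)
        self.vertical = nn.Conv2d(in_ch, n_low_rank,
                                  kernel_size=(3, 1), padding=(1, 0))
        self.horizontal = nn.Conv2d(in_ch, n_low_rank,
                                    kernel_size=(1, 3), padding=(0, 1))
        # the 1x1 layer learns linear combinations of the basis responses
        self.combine = nn.Conv2d(n_square + 2 * n_low_rank, out_ch,
                                 kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        basis = torch.cat([self.square(x), self.vertical(x),
                           self.horizontal(x)], dim=1)
        return self.relu(self.combine(basis))
\end{verbatim}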
Recent advances in state-of-the-art accuracy with \glspl{cnn}\index{CNN} for image classification have come at the cost of increasingly large and computationally complex models. We believe our results show that learning computationally efficient models with fewer, more relevant parameters can prevent overfitting, improve generalization, and thus also increase accuracy.
\end{document}