Tensor Decomposition for Neural Network Compression

In this notebook, we leverage the natural compression provided by tensor decomposition to reduce the number of parameters in a trained convolutional neural network. Specifically, by decomposing the weight tensors associated with each layer of ResNet-50, we obtain a network that is 7.2% the size of the original yet achieves 99.1% of the original accuracy. Possible benefits of this method include reduced power and memory requirements and faster inference, which would allow state-of-the-art networks to be deployed on the edge. We demonstrate the technique using Keras, a high-level API for TensorFlow.

Table of Contents

Layer Decomposition

In order to use ENSIGN to decompose convolutional layers, the ndarray representations of the weights need to be converted to the ENSIGN tensor format for decomposition. We define decompose_kernel, a wrapper for ensign.cp_decomp.cp_als(), that takes weights as an ndarray, constructs an ENSIGN tensor, and decomposes it using CP-ALS. We choose this decomposition algorithm, which assumes tensor entries are distributed according to Gaussian distributions, because it is best suited for continuous-valued data.
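A minimal sketch of decompose_kernel is shown below. The conversion from ndarray to the ENSIGN tensor format is represented by a hypothetical helper (ndarray_to_ensign_tensor), and the exact arguments to ensign.cp_decomp.cp_als() are assumptions; consult the ENSIGN documentation for the precise signatures.

```python
import ensign.cp_decomp as cp_decomp

def decompose_kernel(weights, rank):
    """Decompose a convolutional kernel (NumPy ndarray) with CP-ALS.

    `ndarray_to_ensign_tensor` stands in for whatever utility your ENSIGN
    installation provides for converting a dense ndarray into the ENSIGN
    tensor format; it is a hypothetical name, not part of the ENSIGN API.
    """
    tensor = ndarray_to_ensign_tensor(weights)  # hypothetical conversion step
    # CP-ALS models tensor entries as Gaussian-distributed, which suits
    # continuous-valued weights; `rank` controls the degree of compression.
    decomp = cp_decomp.cp_als(tensor, rank)
    return decomp
```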

To apply the decomposed convolution to incoming feature maps, the factor matrices of the decomposed weight tensor are used to set the weights of the decomposed layer architecture. The specific decomposed layer used here, implemented in factorized_conv, turns an n-dimensional convolution into a sequence of n+2 1D convolutions. Our implementation takes a decomposition object and the original layer parameters as inputs and infers the appropriate sequence of 1D convolutions.
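For the 2D case, the structure of such a decomposed layer can be sketched in Keras as below: a pointwise convolution into the R components of the decomposition, two depthwise spatial convolutions (d x 1 and 1 x d), and a pointwise convolution out to the T output channels. The helper name and parameters are illustrative, not the notebook's exact implementation, and the copying of factor matrices into the layer weights is omitted.

```python
from tensorflow import keras

def factorized_conv(rank, filters, kernel_size, strides=(1, 1), padding="same"):
    """Return a Keras block structured like a CP-decomposed Conv2D.

    The four layers below correspond to the four 1D convolutions; in the full
    implementation their weights are set from the CP factor matrices.
    """
    d_h, d_w = kernel_size

    def block(x):
        # 1x1 conv: map the S input channels onto the R components.
        x = keras.layers.Conv2D(rank, (1, 1), use_bias=False)(x)
        # Depthwise d x 1 conv: filter each component along the height dimension.
        x = keras.layers.DepthwiseConv2D((d_h, 1), strides=(strides[0], 1),
                                         padding=padding, use_bias=False)(x)
        # Depthwise 1 x d conv: filter each component along the width dimension.
        x = keras.layers.DepthwiseConv2D((1, d_w), strides=(1, strides[1]),
                                         padding=padding, use_bias=False)(x)
        # 1x1 conv: recombine the R components into the T output channels.
        x = keras.layers.Conv2D(filters, (1, 1), use_bias=False)(x)
        return x

    return block
```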

This technique is derived and discussed in more detail in Speeding-up Convolutional Neural Networks Using Fine-tuned CP-decomposition. The figures below depict the factorization of a 2D convolution into a sequence of four 1D convolutions. In the simplified standard 2D convolution mapping S input channels to T output channels, there are S filters for each output channel. Then for a kernel size of d, the tensor storing the layer weights has size d x d x S x T.

[Figure: a standard 2D convolution with a d x d x S x T weight tensor mapping S input channels to T output channels]

After decomposing the tensor and substituting the form of the decomposition for the kernel in the layer application rule, the following sequence of 1D convolutions is derived. The factor matrices contain the weights for the 1D convolutions.

[Figure: the same convolution factorized into a sequence of four 1D convolutions, with weights given by the CP factor matrices]

The compression results from the fact that the factor matrices describing the 1D convolutions have fewer parameters than the original tensor for appropriate choices of the rank. The original layer has d x d x S x T parameters while the decomposed layer has R x (d + d + S + T) parameters, where R is the rank. For example, consider a convolutional layer with a 3 x 3 kernel mapping 32 input channels to 64 output channels. The original layer has 3 x 3 x 32 x 64 = 18,432 parameters. Decomposing the layer with rank 16 results in 16 x (3 + 3 + 32 + 64) = 1,632 parameters.
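The arithmetic for this example can be checked directly:

```python
d, S, T, R = 3, 32, 64, 16

original_params = d * d * S * T            # 3 * 3 * 32 * 64 = 18,432
decomposed_params = R * (d + d + S + T)    # 16 * (3 + 3 + 32 + 64) = 1,632

print(original_params, decomposed_params, decomposed_params / original_params)
# 18432 1632 0.0885... -> roughly an 11x reduction for this layer
```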

Network Surgery

The original network must be modified to replace layers with their decomposed counterparts. In order to handle arbitrary network architectures, including non-sequential models like ResNet-50 that contain skip connections, we implement replace_layers. This "network surgery" tool inspects the input and output nodes of each layer to reconstruct the model's computational graph with the decomposed layers substituted for the originals.
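A simplified sketch of the idea is shown below for a plain chain of layers; the actual replace_layers additionally walks each layer's inbound and outbound nodes so that branching paths and skip connections, as in ResNet-50, are reconnected correctly. The callback names are illustrative.

```python
from tensorflow import keras

def replace_layers(model, should_replace, build_replacement):
    """Rebuild `model`, swapping selected layers for decomposed counterparts.

    This sketch only handles a simple chain of layers; the full network-surgery
    routine reconstructs the model's computational graph from each layer's
    inbound nodes so that skip connections are preserved.
    """
    inputs = keras.Input(shape=model.input_shape[1:])
    x = inputs
    for layer in model.layers:
        if isinstance(layer, keras.layers.InputLayer):
            continue
        if should_replace(layer):
            x = build_replacement(layer)(x)   # e.g. a factorized_conv block
        else:
            x = layer(x)
    return keras.Model(inputs, x)
```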

Network Compression Wrapper

Finally, we implement factorized_cnn to handle the coordination between decomposing weight tensors, constructing decomposed layers, and performing network surgery. This function takes a convolutional network and either a rank for decomposing the layers or a desired compression rate and returns a convolutional network with each layer factorized.
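A sketch of how factorized_cnn might tie the pieces together, assuming the decompose_kernel, factorized_conv, and replace_layers sketches above (selecting the rank from a target compression rate is omitted):

```python
from tensorflow import keras

def factorized_cnn(model, rank):
    """Return a copy of `model` with each Conv2D layer replaced by its
    CP-decomposed counterpart (sketch; compression-rate-driven rank
    selection is omitted)."""

    def should_replace(layer):
        return isinstance(layer, keras.layers.Conv2D)

    def build_replacement(layer):
        decomp = decompose_kernel(layer.get_weights()[0], rank)
        block = factorized_conv(rank, layer.filters, layer.kernel_size,
                                layer.strides, layer.padding)
        # In the full implementation, the factor matrices in `decomp` are
        # copied into the weights of the four 1D convolutions.
        return block

    return replace_layers(model, should_replace, build_replacement)
```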

Application to ResNet-50

Our specific application is to decompose each layer of ResNet-50 trained on the CIFAR-10 dataset. The resulting network is 7.2% the size of the original. Yet after fine-tuning, it achieves 99.1% of the original accuracy.

Model and Data

We decompose ResNet-50 trained on CIFAR-10 image data. The network consists of two parts: the feature extractor and the classifier. We decompose the feature extractor, which consists of the convolutional layers.
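One way to assemble this two-part model is sketched below; the notebook's construction and training pipeline may differ. A ResNet-50 without its classification head (as provided by keras.applications) has the 23,587,712 parameters reported as the baseline below.

```python
from tensorflow import keras

# Feature extractor: ResNet-50 without its top, adapted to 32x32 CIFAR-10 inputs.
feature_extractor = keras.applications.ResNet50(
    include_top=False, weights=None, input_shape=(32, 32, 3), pooling="avg")

# Classifier: a small dense head for the 10 CIFAR-10 classes.
classifier = keras.layers.Dense(10, activation="softmax")

inputs = keras.Input(shape=(32, 32, 3))
model = keras.Model(inputs, classifier(feature_extractor(inputs)))
```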

The model achieves 79.4% accuracy on the test data, and the feature extractor has 23,587,712 parameters. These numbers are the baseline and serve as the point of comparison for the factorized model.

Decomposition

We decompose each layer in ResNet-50 with rank 32, and the classifier from the original model is reused on top of the decomposed feature extractor. The lower the rank, the greater the compression; we found that decomposing each layer with 32 components results in a good tradeoff between compression and accuracy.
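With the pieces above, the compressed model is obtained in a single call; this snippet assumes the factorized_cnn sketch and the illustrative model and classifier names from earlier.

```python
from tensorflow import keras

# Decompose every convolutional layer with rank 32 and reattach the original
# classifier head to the compressed feature extractor.
compressed_extractor = factorized_cnn(feature_extractor, rank=32)

inputs = keras.Input(shape=(32, 32, 3))
compressed_model = keras.Model(inputs, classifier(compressed_extractor(inputs)))
```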

Fine-tuning

Finally, we fine-tune the decomposed model to achieve higher test accuracy by performing end-to-end training.
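A typical fine-tuning loop looks like the following; the optimizer, learning rate, epoch count, and preprocessing are illustrative choices rather than the notebook's exact settings.

```python
from tensorflow import keras

# CIFAR-10 data; normalization and augmentation are omitted here and may
# differ from the notebook's pipeline.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

compressed_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # small LR: weights start near a good solution
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

# End-to-end fine-tuning of the decomposed network.
compressed_model.fit(x_train, y_train, epochs=10,
                     validation_data=(x_test, y_test))
```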

The decomposed model achieves 78.7% accuracy on the test data, and its feature extractor has 1,709,536 parameters. The decomposed model is therefore 7.2% the size of the original while achieving 99.1% of its accuracy.

Key Takeaways

Following the work of Speeding-up Convolutional Neural Networks Using Fine-tuned CP-decomposition, we applied the compression technique to a much larger network and replicated the quality of results. We anticipate that applying this technique will enable the deployment of state-of-the-art architectures on the edge.

The natural representation of neural network layers as tensors allows for tensor decomposition to compress the number of parameters in each layer. The resulting decomposed layers require both less memory and less computation than their undecomposed counterparts. While we demonstrate the technique on convolutional layers, analogous implementations are possible for dense and recurrent layers as well as transformers. Future effort will be put toward factorizing models used for natural language processing, such as BERT.

It is important to note that this technique, which operates on the architecture of the neural network, is orthogonal to any compiler optimizations used to accelerate inference. In fact, these types of techniques may complement one another.

Additional future work will involve using pruning to sparsify the compressed network, further reducing the number of parameters.