Future Generation Computer Systems 84 (2018) 1–10
T1000: Mitigating the memory footprint of convolution neural networks with decomposition and re-fusion

Changxi Liu a, Hailong Yang a, Rui Wang a,*, Zhongzhi Luan a, Depei Qian a,b

a Sino-German Joint Software Institute, School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
b School of Data and Computer Science, Sun Yat-sen University, Guangzhou, 510275, China
Highlights

• We identify the memory problem when applying CP-decomposition to CNNs.
• We propose a decomposition and re-fusion approach to mitigate the memory problem.
• We demonstrate the effectiveness of our approach on two state-of-the-art CNNs.
• Our experiments with AlexNet and VGG-19 show great memory reduction and speedup.
Article info

Article history: Received 28 November 2017; Received in revised form 6 February 2018; Accepted 13 February 2018; Available online 16 February 2018.

Keywords: CNN; Tensor decomposition; Convolution re-fusion
Abstract

In recent years, convolution neural networks (CNNs) have significantly advanced the frontier of computer vision and other intelligent applications due to their promising accuracy. However, the improved accuracy comes with formidable computation complexity as the convolution layers grow deeper, which prevents the adoption of CNNs on resource-constrained systems such as embedded and mobile devices. Although research efforts have been devoted to reducing the computation complexity of convolution neural networks through tensor decomposition, the volume of intermediate data generated by the tensor decomposition grows dramatically, which consumes more memory resources and has not been addressed by existing work. In this work, we propose T1000, which re-fuses the convolutions across tensors after applying canonical polyadic decomposition to conventional convolution layers, so that we retain the benefit of reduced computation complexity while mitigating the memory occupancy of the intermediate data. We demonstrate the effectiveness of our approach by applying canonical polyadic decomposition and re-fusion to the convolution layers of two well-known convolution neural networks, AlexNet and VGG-19, implemented with Caffe. Compared to the default canonical polyadic decomposition, our approach reduces the memory occupancy of the intermediate data by 84.6% and 77.4% for AlexNet and VGG-19 respectively. In addition, our approach improves the performance of AlexNet and VGG-19 by 1.77× and 1.4× respectively.

© 2018 Elsevier B.V. All rights reserved.
1. Introduction

In recent years, the advances in image classification [1,2] and object detection [3,4] achieved by convolution neural networks (CNNs) have demonstrated that deep learning is an effective approach to developing intelligent computer vision applications such as self-driving cars, personal assistants and intelligent robots.
* Corresponding author.
E-mail addresses: [email protected] (C. Liu), [email protected] (H. Yang), [email protected] (R. Wang), [email protected] (Z. Luan), [email protected] (D. Qian).
https://doi.org/10.1016/j.future.2018.02.024
0167-739X/© 2018 Elsevier B.V. All rights reserved.
However, as the depth of neural networks grows, the computation demand of CNNs is becoming a major obstacle preventing their pervasive adoption. Even though training can be done on high-end servers with powerful accelerators (e.g., GPUs and FPGAs), there is rising interest from both industry and academia in deploying inference tasks in resource-constrained settings such as embedded systems and mobile devices [5–7]. Due to the limited computation and memory capacity, it is critical to mitigate the resource consumption of CNNs for their successful adoption in such settings. To address the above challenges from different perspectives, there is a growing body of research, such as developing smaller networks with negligible precision loss [8–11], advancing the mathematical computation methods [12,13] and modifying existing networks to adapt to the hardware architecture [5,14–16]
(e.g., transforming the convolution layers to reduce computation complexity). Among these studies, convolution dimensionality reduction (e.g., tensor decomposition [17]) is considered an effective way to mitigate the computation complexity. However, existing work [18,5,16] fails to consider the intermediate data generated after the transformation, which consumes a significant amount of memory.

Among the convolution dimensionality reduction approaches [18,17,5,16], canonical polyadic decomposition (CP-decomposition) [5] has been widely used to optimize convolution operations. CP-decomposition involves two steps: kernel decomposition and convolution composition. Decomposition breaks the high-dimensional convolution kernel tensor into a series of low-rank ones, whereas composition replaces the original convolution layer with a composition of four convolution layers built from the decomposed kernel tensors. The fundamental idea is similar to service composition in the field of cloud computing [19,20]. By approximating a multi-dimensional convolution with the sum of several low-rank tensors, CP-decomposition can effectively cut down the number of convolution operations with negligible precision loss (detailed discussion in Section 2.2). However, when applied to convolution layers, CP-decomposition generates a large amount of intermediate data between the low-rank tensors, exacerbating the problem of memory consumption.

To further illustrate, we apply CP-decomposition to the most time-consuming convolution layers (e.g., conv2) of AlexNet [21] and VGG-19 [22], and measure the memory footprint of each convolution layer after CP-decomposition. Note that the precision loss of both networks is less than 1% after applying CP-decomposition. The experimental details are given in Section 4.1. In Fig. 1, the left two bars show the results of AlexNet, while the right two bars show the results of VGG-19. The memory footprint of AlexNet and VGG-19 is shown on the left y-axis and right y-axis respectively. The bar labeled Traditional-Conv shows the results of the original convolution layers, while the bar labeled CP-Conv shows the results after applying CP-decomposition. Compared to the original convolution layer, CP-decomposition generates a large amount of intermediate data while reducing the computational complexity. For instance, the size of intermediate data for AlexNet and VGG-19 increases by more than 26× and 7× respectively. The reason for this dramatic increase is that CP-decomposition uses three cascaded tensors (much smaller than the original convolution kernels) to reduce the number of convolution operations, and the intermediate data generated by each tensor is passed to the next one, which requires additional memory space to store. Moreover, we observe that the volume of the intermediate data is closely related to the batchsize chosen by the network. Therefore, it is critical to mitigate the memory footprint of the intermediate data from CP-decomposition for its successful adoption in resource-constrained settings.

The idea of CNN layer re-fusion is explored by [14]. However, several challenges need to be addressed in order to re-fuse the convolutions across the tensors from CP-decomposition. (1) Since the original convolution is decomposed across several tensors, it remains unclear which convolutions should be re-fused and how the decision affects the memory occupancy as well as the accuracy.
(2) Convolution re-fusion itself consumes memory to store temporary data, therefore it is important to optimize the memory usage during re-fusion. (3) The intermediate data generated from one tensor is the input for the subsequent tensor, so the efficiency of the convolution operations across the tensors determines the performance of the layer after re-fusion.

To address the above challenges, we propose a decomposition and re-fusion approach, T1000.
Fig. 1. The memory footprint of convolution layers of AlexNet and VGG-19 after applying the CP-decomposition.
T1000 leverages the advantage of CP-decomposition to reduce the computation complexity of convolution operations, and then uses re-fusion to mitigate the volume of intermediate data introduced by CP-decomposition. We evaluate the effectiveness of T1000 from several aspects by applying CP-decomposition and re-fusion to two state-of-the-art CNNs (AlexNet and VGG-19). The experimental results demonstrate that T1000 can effectively reduce the size of the intermediate data while improving the performance of the original CNN models. Specifically, this paper makes the following contributions:
• We identify, with comprehensive analysis, the memory problem caused by the large amount of intermediate data introduced by CP-decomposition when it is applied to CNNs.
• To overcome the memory problem, we propose a decomposition and re-fusion approach (T1000) that mitigates the volume of intermediate data by combining the separate convolution operations across tensors into an integrated computation process.
• We demonstrate the effectiveness of T1000 on two state-of-the-art CNNs (AlexNet and VGG-19), significantly reducing the size of the intermediate data as well as improving the performance of the convolution layers.

The rest of the paper is organized as follows. Section 2 describes the background of CNNs and CP-decomposition that motivates our study. Section 3 presents our decomposition and re-fusion approach. Section 4 details the evaluation. Section 5 discusses the related work. Section 6 concludes our work.

2. Background

2.1. Convolution neural networks

Neural networks have demonstrated success in many fields such as speech signal prediction [23] and medical data classification [24]. In particular, the convolution neural network (CNN) [21] has received dramatic research attention recently with its extraordinary performance on ImageNet [25]. A CNN is a multiple-layer neural network in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex. In general, a CNN is composed of a series of convolution and pooling layers followed by a classification layer. The main function of the convolution operation is to extract features from the input and produce higher-level feature maps as the layers go up. Taking advantage of local perception and parameter sharing, CNNs greatly reduce the number of weight parameters compared to fully-connected neural networks. The feature maps are computed through convolution between feature maps and convolution kernels.
Fig. 2. Canonical Polyadic Decomposition (CP-decomposition).
When a three-dimensional feature map of depth M is convolved with N convolution kernels of the same depth, it produces another three-dimensional feature map of depth N. For ease of implementation, the convolution operation is usually transformed into matrix multiplication (e.g., in Caffe [26], Torch [27] and MxNet [28]). However, this transformation requires changing the shape of the input feature map, which not only introduces additional computation, but also increases the memory usage. Fortunately, the special 1 × 1 convolution kernel can be used directly in matrix multiplication without reshaping the feature map. For instance, a feature map with dimension x × y × z can be treated as a matrix A with length x × y and width z, whereas at a layer with N 1 × 1 convolution kernels, the kernels can be treated as a matrix K with length z and width N. Therefore, the convolution can be done as the matrix multiplication A × K.

2.2. CP-decomposition

CP-decomposition [17] is a special form of tensor decomposition that decomposes a high-dimensional tensor into a series of low-rank ones. It approximates the original convolution as a composition of four convolutions with small kernels (1 × d or d × 1), which significantly reduces the computational complexity with negligible accuracy loss. The details of CP-decomposition are shown in Fig. 2. Suppose the original convolution layer has N convolution kernels with the same dimension d × d × S. CP-decomposition approximates the original four-dimensional convolution kernel (generalizing the number of kernels as another dimension) with the sum of four two-dimensional tensors. Accordingly, the number of convolution operations is reduced from N × d × d × S to R × (N + 2d + S) (the sum of R × S, R × d, R × d and R × N for the four transformed tensors). The computation complexity of the original convolution kernel is reduced significantly if the decomposition granularity R satisfies Eq. (1). For CNNs, there are several filters (i.e., convolution kernels) performing the convolution at each layer, which can be treated as high-dimensional tensors. Therefore, employing CP-decomposition on these convolution kernels can effectively optimize the performance of CNNs.

R < (N × d × d × S) / (N + 2 × d + S)    (1)
As seen from Eq. (1), the smaller R is, the fewer weight parameters are used in the tensors after CP-decomposition, which in turn means less computation is needed during the convolution. However, a smaller R also means that fewer weight parameters in the two-dimensional tensors are available to approximate the four-dimensional kernel of the original convolution layer. The decreasing number of weight parameters deteriorates the representativeness of the original convolution layer, and thus degrades the accuracy of the CNN. For CP-decomposition to be effective, R is therefore chosen to balance the computation reduction against an acceptable accuracy loss.
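To make Eq. (1) concrete, the short C++ sketch below computes the bound on R and the resulting reduction in convolution operations for an illustrative layer; the dimensions and the rank are example values, not the configurations measured later in the paper.

```cpp
#include <cstdio>

// Parameters of a convolution layer: N kernels of size d x d x S.
struct ConvLayer { int N, d, S; };

// Upper bound on the CP-decomposition rank R from Eq. (1); any R below this
// bound reduces the number of convolution operations.
static double rank_bound(const ConvLayer& l) {
    return static_cast<double>(l.N) * l.d * l.d * l.S / (l.N + 2 * l.d + l.S);
}

int main() {
    // Illustrative layer dimensions (not taken from the paper's experiments).
    ConvLayer layer{256, 5, 96};
    int R = 400;  // example rank, in the range the paper mentions elsewhere (e.g., R = 400)

    double original   = static_cast<double>(layer.N) * layer.d * layer.d * layer.S;
    double decomposed = static_cast<double>(R) * (layer.N + 2 * layer.d + layer.S);

    printf("Eq. (1) bound on R: %.1f\n", rank_bound(layer));
    printf("original: %.0f, CP-decomposed: %.0f, reduction: %.2fx\n",
           original, decomposed, original / decomposed);
    return 0;
}
```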
2.3. Deep learning frameworks

With the success of CNNs in various fields, deep learning frameworks have been built to ease the development of CNN applications. Among the existing frameworks, we describe three of them (Caffe [26], TensorFlow [29] and Torch [27]), proposed by both academia and industry, for the brevity of our discussion. Caffe, one of the most popular deep learning frameworks, was initially proposed at UC Berkeley. It is well recognized and widely used in the field of computer vision. It supports distributed training as well as parallel computing with CUDA [30], which improves the performance and scalability of the framework. However, its support for recurrent neural networks (RNNs) [31] is quite limited. TensorFlow was open sourced by Google in 2015 and has drawn tremendous attention from the public since then. It is well known for expressing computations as stateful dataflow graphs, which improves the computing efficiency of neural networks. It provides Python APIs that are easy to use for the development of new applications. Torch, backed by Facebook, provides a wide range of machine learning libraries. To use Torch, users have to write scripts in the Lua programming language [32]. However, Torch does not support distributed training for now. Since our approach requires modifications to the internal implementation of the convolution layers, we choose Caffe as our target platform due to its rich documentation and low learning curve. Note that although we use Caffe in this study, our approach is also applicable to other deep learning frameworks.

2.4. The limitation of CP-decomposition

Although CP-decomposition is quite effective in reducing the computation complexity by approximating the original high-dimensional convolution kernels with low-dimensional tensors, it requires a large memory space to store the intermediate data during the convolution across the tensors. As shown in Fig. 1, after applying CP-decomposition, the size of the intermediate data for AlexNet and VGG-19 increases by more than 26× and 7× respectively. The surging volume of intermediate data from CP-decomposition offsets its computation advantage and limits its adoption on resource-constrained platforms such as embedded and mobile devices.

The large memory footprint of CP-decomposition lies in the data flow of the convolution operations across the four tensors shown in Fig. 2. These convolution operations are cascaded, and the intermediate data passed between the tensors requires a large memory space to store. In addition, in most deep learning frameworks, convolution operations are commonly implemented with matrix multiplications, which requires a transformation of the input feature maps (e.g., im2col in Caffe). These transformations consume extra memory space to store the temporary data and introduce computation overhead. For CP-decomposition, which performs more convolution operations across the low-dimensional tensors, these transformations become another source of inefficiency in terms of both computation and memory resources.
Fig. 3. The convolution operation within (a) stock CNN, (b) CNN with CP-decomposition and (c) CNN with CP-decomposition and Re-fusion.
This overhead becomes more prominent when the size of the input feature maps grows larger (e.g., when a large batchsize is used). The large memory footprint and computation overhead of the intermediate data introduced by CP-decomposition motivate our work, which re-fuses the convolution operations across tensors to mitigate the memory occupancy and computation overhead of the intermediate data.

3. Decomposition and re-fusion

In this section, we first describe the design overview of our approach, which re-fuses the convolutions across tensors after CP-decomposition. Then we illustrate the implementation details of convolution re-fusion. In addition, we give a quantitative analysis of the reduction in memory occupancy as well as computation complexity achieved by our approach.
3.1. Overview

Fig. 3 illustrates the convolution operations within the stock CNN, CP-decomposition (CP-CNN), and our CP-decomposition and re-fusion approach (CP-Refusion-CNN). As shown in Fig. 3(a), the stock CNN performs the convolution operations over the input feature maps against high-dimensional kernels, which causes high computation complexity. CP-decomposition in Fig. 3(b) decomposes the convolution kernels into low-dimensional tensors, which reduces the computation complexity but introduces a large volume of intermediate data across the tensors. Fig. 3(c) shows our approach of convolution re-fusion. To mitigate the memory usage of the intermediate data after CP-decomposition, we re-fuse the convolution operations across the tensors.

We notice that the first (Ks) and last tensor (Kn) after CP-decomposition are one-dimensional (1 × 1) tensors, upon which the convolution can be done directly through matrix multiplication without an additional im2col transformation. The second (Ki) and third tensor (Kj) have special properties when the convolution is performed on them: the convolution is one-dimensional and independent of the tensor depth. In the default CP-decomposition these properties are not exploited, and the convolutions among the tensors are treated separately as ordinary convolutions. In our approach, we take advantage of this observation and re-fuse the separate convolutions among the tensors into an integrated operation, which reduces the memory occupancy of the intermediate data as well as eliminates the computation overhead of the im2col transformation.

The process of our approach to re-fuse the convolutions among the tensors is shown in Algorithm 1. Different from the traditional convolution, our approach first allocates the memory space to store the intermediate data for the fused convolution operations (Calculate) on Ki and Kj (line 2). The size of the allocated memory equals the width of the input feature maps multiplied by the length of the output feature maps. After that, for each image in the batch it performs the 1 × 1 convolution on Ks (line 5) and then the convolution operations on Ki and Kj along the length and width dimensions respectively (line 6). Note that in line 6, we re-fuse the three separate convolutions (Ks to Ki, Ki to Kj and Kj to Kn) into an integrated operation within Calculate. Similarly to Ks, the 1 × 1 convolution on Kn is performed in line 7.

Algorithm 1 The overall process of our approach (T1000)
1: function Forward(Bottom, Top, Ki, Kj, Ks, Kn)
2:   Tmp = Malloc(input_w × output_h)
3:   batch_id = 0
4:   while batch_id < batch_size do
5:     MatrixMultiply(Temp1, Ks, Bottom(batch_id))
6:     Calculate(Temp2, Ki, Kj, Temp1, Tmp)
7:     MatrixMultiply(Top(batch_id), Kn, Temp2)
8:     batch_id++
9:   end while
10:  Free(Tmp)
11: end function
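As a side note, the 1 × 1 convolutions on Ks (line 5) and Kn (line 7) reduce to plain matrix multiplications over the flattened feature map, as described in Section 2.1. The following self-contained C++ sketch illustrates that step with a naive loop nest; the buffer layout and names are our own assumptions, not the paper's implementation.

```cpp
#include <vector>

// 1x1 convolution as a matrix multiplication (Section 2.1):
// the input feature map of depth S is viewed as an S x (W*H) matrix and the
// R kernels of size 1x1xS as an R x S matrix, so the output is R x (W*H).
// Layout assumption: channel-major, i.e., in[c * W * H + pixel].
std::vector<float> conv1x1(const std::vector<float>& kernel,  // R x S
                           const std::vector<float>& in,      // S x (W*H)
                           int R, int S, int WH) {
    std::vector<float> out(static_cast<size_t>(R) * WH, 0.0f);
    for (int r = 0; r < R; ++r)
        for (int s = 0; s < S; ++s)
            for (int p = 0; p < WH; ++p)
                out[r * WH + p] += kernel[r * S + s] * in[s * WH + p];
    return out;
}

int main() {
    int R = 2, S = 3, W = 4, H = 4;                  // toy sizes
    std::vector<float> kernel(R * S, 1.0f);          // all-ones 1x1 kernels
    std::vector<float> fmap(S * W * H, 2.0f);        // constant feature map
    auto out = conv1x1(kernel, fmap, R, S, W * H);   // each output value = S * 2
    return out[0] == 6.0f ? 0 : 1;
}
```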
3.2. Convolution re-fusion across tensors

Compared to the traditional convolution, which requires not only calculating along the length and width dimensions but also accumulating along the depth dimension, the convolutions with Ki and Kj are one-dimensional and independent of the tensor depth. Therefore, as shown in Algorithm 2, after tensor re-fusion we can perform the convolutions on Ki and Kj for each depth individually (line 3–7). In addition, we store the convolution results from Kj (line 4) and use them as the input to the convolution on Ki while iterating through the depth dimension (line 5). This reduces the depth of the buffered intermediate data from R (the rank chosen by the default CP-decomposition) to 1. The reason for the significant memory reduction is that in the default CP-decomposition the convolutions on Ki and Kj are treated separately, and thus the
entire convolution results from Kj need to be stored before proceeding to Ki. In our approach, since the convolutions on Ki and Kj are fused into one integrated operation, Calculate, only the partial results of each iteration need to be buffered within Calculate, which saves a large amount of memory space.
Algorithm 2 The convolutions on Ki and Kj after re-fusion
1: r = 0
2: /* R is the depth of the tensors Ki and Kj */
3: while r < R do
4:   Calculate_h(Tmp, Kj, in(r))
5:   Calculate_w(out(r), Ki, Kj, Tmp)
6:   r++
7: end while

In addition to the convolution re-fusion, we also optimize the performance of the convolution on Kj. In the default CP-decomposition, the cache miss rate is quite high during the convolution on Kj because it computes only one column of the input feature maps at a time. This causes a performance penalty, since the convolution cannot leverage the data already residing in the cache and instead fetches data from memory at every iteration over the columns. We observe that since the convolution on Kj is one-dimensional, both its stride and the width of its convolution kernel equal one. This special property means we can improve the cache hit rate by processing, in each pass, the number of columns that fits into a cache line. As shown in Algorithm 3, we pay special attention to input feature maps whose width is not a multiple of the cache line by aligning and zero padding (line 8), which further improves the utilization of the cache.

Algorithm 3 The convolution process on Kj
1: Width_Cache = Input_Width / CacheLine × CacheLine
2: width_id = 0
3: while width_id < Width_Cache do
4:   height_id = 0
5:   while height_id < output_height do
6:     Conv_Begin = GetConvBegin(height_id, ...)
7:     Conv_End = GetConvEnd(height_id, ...)
8:     Setzero(Tmp(width_id:Width_Cache+CacheLine, height_id))
9:     Conv_id = Conv_Begin
10:    while Conv_id < Conv_End do
11:      weight_id = 0
12:      cache_id = 0
13:      while cache_id < CacheLine do
14:        Tmp(width_id + cache_id, height_id) += Kj(weight_id) × in(width_id + cache_id, Conv_id)
15:        weight_id++
16:        cache_id++
17:      end while
18:      Conv_id++
19:    end while
20:    height_id++
21:  end while
22:  width_id += CacheLine
23: end while
The convolution process on Ki is shown in Algorithm 4. Since the convolution on Ki proceeds along the width dimension (line 4–9), it does not suffer the performance penalty of a high cache miss rate as Kj does.

Algorithm 4 The convolution process on Ki
1: height_id = 0
2: while height_id < Output_height do
3:   width_id = 0
4:   while width_id < Output_width do
5:     Conv_Begin = GetConvBegin(width_id, ...)
6:     Conv_End = GetConvEnd(width_id, ...)
7:     Out(width_id, height_id) = Ki × Tmp(Conv_Begin:Conv_End, height_id)
8:     width_id++
9:   end while
10:  height_id++
11: end while
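To make the re-fused computation on Ki and Kj more concrete, the following self-contained C++ sketch follows the structure of Algorithm 2: for each depth slice r, a one-dimensional convolution along the height (Kj) writes into a small buffer of size input width × output height, which is immediately consumed by the one-dimensional convolution along the width (Ki). Valid padding and stride 1 are assumed for brevity, and the cache-line blocking of Algorithm 3 is omitted; names and layouts are illustrative rather than the authors' code.

```cpp
#include <vector>

// Fused per-depth convolutions on Ki and Kj (Algorithm 2, simplified):
// the intermediate result for one depth slice is kept in a small tmp buffer
// (input width x output height) instead of storing the full R-deep tensor.
void fused_ki_kj(const std::vector<float>& in,   // R x in_w x in_h
                 const std::vector<float>& kj,   // R x d (1-D kernel along height)
                 const std::vector<float>& ki,   // R x d (1-D kernel along width)
                 std::vector<float>& out,        // R x out_w x out_h
                 int R, int d, int in_w, int in_h) {
    const int out_h = in_h - d + 1;
    const int out_w = in_w - d + 1;
    std::vector<float> tmp(static_cast<size_t>(in_w) * out_h);  // reused per depth

    for (int r = 0; r < R; ++r) {
        const float* slice = &in[static_cast<size_t>(r) * in_w * in_h];
        // Calculate_h: 1-D convolution along the height with Kj.
        for (int w = 0; w < in_w; ++w)
            for (int h = 0; h < out_h; ++h) {
                float acc = 0.0f;
                for (int k = 0; k < d; ++k)
                    acc += kj[r * d + k] * slice[w * in_h + h + k];
                tmp[w * out_h + h] = acc;
            }
        // Calculate_w: 1-D convolution along the width with Ki, consuming tmp.
        float* oslice = &out[static_cast<size_t>(r) * out_w * out_h];
        for (int w = 0; w < out_w; ++w)
            for (int h = 0; h < out_h; ++h) {
                float acc = 0.0f;
                for (int k = 0; k < d; ++k)
                    acc += ki[r * d + k] * tmp[(w + k) * out_h + h];
                oslice[w * out_h + h] = acc;
            }
    }
}

int main() {
    int R = 2, d = 3, in_w = 8, in_h = 8;
    std::vector<float> in(R * in_w * in_h, 1.0f), kj(R * d, 1.0f), ki(R * d, 1.0f);
    std::vector<float> out(static_cast<size_t>(R) * (in_w - d + 1) * (in_h - d + 1));
    fused_ki_kj(in, kj, ki, out, R, d, in_w, in_h);   // every output equals d * d
    return out[0] == 9.0f ? 0 : 1;
}
```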
3.3. Memory occupancy analysis

The memory occupancy of a convolution layer in a CNN comes from four parts: the input feature map, the weight parameters (kernels or tensors), the intermediate data and the output feature map. The major difference in memory occupancy among the three approaches (CNN, CP-CNN and CP-Refusion-CNN) lies in the third part, therefore we only consider the size of the intermediate data in this analysis. In addition to the intermediate data, CNNs implemented on modern deep learning frameworks (e.g., Caffe) commonly transform the convolution operation into matrix multiplication (e.g., im2col in Caffe), which consumes extra memory space to store the transformed data. For instance, with an input feature map W × H × S, an output feature map W × H × N and a convolution kernel d × d × S × N, storing the transformed data consumes d × d × S × W × H additional memory space.

In CP-decomposition, the original d × d × S × N convolution is decomposed into four convolutions Ks, Ki, Kj and Kn with low-dimensional tensors, as shown in Fig. 3. Since Ks and Kn are 1 × 1 convolutions, they do not require the im2col data transformation. However, Ki and Kj each need R × W × H × d memory space to store the transformed data from im2col. In addition, Ks, Ki and Kj generate intermediate data as input for their subsequent convolutions (Ki, Kj and Kn), which consumes W × H × R memory space each. Worse, the above analysis assumes the batchsize equals 1; in a more realistic CNN, the batchsize is much larger than 1 (e.g., 50). The memory occupancy of CP-decomposition (memory_CP) is shown in Eq. (2). It is clear that when the batchsize becomes larger, the memory occupancy of CP-decomposition increases significantly.

memory_inter = 3 × batchsize × R × W × H
memory_im2col = 2 × R × W × H × d
memory_CP = memory_inter + memory_im2col    (2)

In our approach, the four convolutions Ks, Ki, Kj and Kn of CP-decomposition are re-fused into an integrated operation instead of four separate convolutions. Therefore, it reduces the memory space needed to store the intermediate data between the separate convolutions. After re-fusion, only a small amount of data is buffered for reuse during each convolution. Moreover, it eliminates the memory usage of the im2col transformation, since the interface of our approach to the input and output feature maps is still a 1 × 1 convolution. The total memory usage of our approach is shown in Eq. (3).

memory_inter = (2 × R + 1) × W × H
memory_T1000 = memory_inter    (3)

Fig. 4. The memory occupancy of AlexNet (conv2 layer) with original convolution layer (Traditional-Conv), CP-decomposition (CP-Conv) and our approach (T1000-Conv) when batchsize = 1.

Based on Eqs. (2) and (3), the comparison of memory usage between CP-decomposition and our approach is shown in Eq. (4). Empirically, the value of R is much larger than the batchsize and d (e.g., R = 400, batchsize = 8 and d = 3) in order to retain acceptable accuracy of the CNNs after CP-decomposition. Therefore, the value of Eq. (4) approximates 1.5 × batchsize + d. Since both the batchsize and d are larger than 1, the memory occupancy of CP-decomposition is more than 2.5× that of our approach, which re-fuses the convolutions across the tensors.

memory_CP / memory_T1000 = (memory_inter + memory_im2col) / memory_inter
                         = (3 × batchsize × R × W × H + 2 × R × W × H × d) / ((2 × R + 1) × W × H)
                         = (3 × batchsize × R + 2 × R × d) / (2 × R + 1)
                         ≈ 1.5 × batchsize + d    (4)
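As a quick sanity check of Eqs. (2)–(4), the sketch below evaluates both memory estimates and their ratio; the plugged-in values follow the example quoted above (R = 400, batchsize = 8, d = 3), while W and H are illustrative.

```cpp
#include <cstdio>

// Memory estimates (in number of elements) from Eqs. (2) and (3).
static double memory_cp(double batchsize, double R, double W, double H, double d) {
    double inter  = 3.0 * batchsize * R * W * H;   // intermediate data between tensors
    double im2col = 2.0 * R * W * H * d;           // im2col buffers for Ki and Kj
    return inter + im2col;                          // Eq. (2)
}

static double memory_t1000(double R, double W, double H) {
    return (2.0 * R + 1.0) * W * H;                 // Eq. (3)
}

int main() {
    double R = 400, batchsize = 8, d = 3, W = 27, H = 27;  // W and H are illustrative
    double ratio = memory_cp(batchsize, R, W, H, d) / memory_t1000(R, W, H);
    // Eq. (4) predicts roughly 1.5 * batchsize + d = 15 for these values.
    printf("memory ratio (CP / T1000) = %.2f\n", ratio);
    return 0;
}
```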
3.4. Computation complexity analysis

Since the convolution is transformed into matrix multiplication, the computation complexity of convolution layers in CNNs mainly comes from the matrix multiplication. For the original convolution layer, the computation complexity of the matrix multiplication is given by Eq. (5). For CP-decomposition, the computation complexity is the sum over its convolution operations. For the convolutions Ks and Kn, which do not need the im2col transformation, the computation complexity is S × W × H × R and N × W × H × R respectively, whereas for the convolutions on Ki and Kj, the computation complexity of the matrix multiplication is d × W × H × R each. The overall computation complexity of CP-decomposition is shown in Eq. (6). It is clear that the computation complexity is reduced significantly by CP-decomposition compared to the original convolution layer if R is chosen appropriately. According to our empirical study, Eq. (6) is always much smaller than Eq. (5) with the values of R used in our experiments.

computation_conv = d × d × S × W × H × N    (5)

computation_cp = (2 × d × W × H + S × W × H + N × W × H) × R    (6)
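For illustration only, the following sketch compares Eq. (5) and Eq. (6) for a hypothetical layer; the dimensions and rank are example values rather than the measured AlexNet or VGG-19 configurations.

```cpp
#include <cstdio>

int main() {
    // Hypothetical layer dimensions and rank (illustrative values only).
    double d = 3, S = 256, N = 256, W = 56, H = 56, R = 400;

    double conv = d * d * S * W * H * N;                        // Eq. (5)
    double cp   = (2 * d * W * H + S * W * H + N * W * H) * R;  // Eq. (6)

    printf("original: %.3g multiply-accumulates\n", conv);
    printf("CP-decomposition: %.3g, reduction: %.2fx\n", cp, conv / cp);
    return 0;
}
```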
For our approach, the re-fusion of the four separate convolutions from CP-decomposition does not introduce additional computation overhead. Moreover, the re-fusion eliminates the im2col transformation for Ki and Kj. Therefore, the computation complexity of our approach is lower than that of CP-decomposition.

4. Evaluation

4.1. Experimental setup

We evaluate the effectiveness of our approach with AlexNet and VGG-19 implemented on Caffe [26]. In our experiments, we use the original Caffe implementation as the baseline and compare it with CP-decomposition and our approach. We choose the top-1 error as the metric to measure the accuracy of the CNN models. Our experiments are conducted on an SMP server with six Intel Xeon E5-2620 cores, each running at a maximum frequency of 2.4 GHz. The memory capacity is 48 GB. The operating system is Ubuntu 15.04 with the 4.4.0-57 Linux kernel. We use Caffe v1.0. To maintain the accuracy of the CNNs, the R in CP-decomposition is chosen so that the accuracy loss is less than 2% compared to the original model. Note that we do not fine-tune the CNN models after CP-decomposition and re-fusion.
4.2. Memory occupancy

In this section, we present the experimental results on memory occupancy for AlexNet and VGG-19 using our approach. To show its effectiveness, we compare our approach with the original convolution and with CP-decomposition in all experiments.

4.2.1. AlexNet

In our experiments, we measure the memory usage of the second convolution layer (conv2) in AlexNet, which dominates the execution time of the entire model. Fig. 4 presents the memory usage of the original convolution layer, CP-decomposition and our approach when batchsize = 1. It is clear that the intermediate data dominates the overall memory usage for all three approaches. For the original convolution layer, the intermediate data takes more than 75.6% of the total memory usage, which can be attributed to the transformation from convolution to matrix multiplication (im2col). For CP-decomposition, the amount of intermediate data increases by 1.63× compared to the original convolution layer. This is due to the additional memory space CP-decomposition requires to store the intermediate data passed through the four separate convolutions. It is also interesting to notice that, although the size of the intermediate data increases by 1.63×, the overall memory usage only increases by 1.38× for CP-decomposition. The reason is that CP-decomposition uses small tensors to approximate the original high-dimensional convolution kernels, so the size of the weight parameters shrinks significantly, by more than 69.3% in our experiment. The decrease in weight size offsets the increase in intermediate data to some extent. Our approach is quite effective in mitigating the large amount of intermediate data introduced by CP-decomposition, and thus reduces the overall memory usage of the entire convolution layer. As shown in Fig. 4, the size of the intermediate data is reduced by 84.6% and 75% compared to CP-decomposition and the original convolution respectively. At the same time, the overall memory usage of the entire convolution layer with our approach also drops noticeably, by 75.3% and 65.9% respectively.

We also evaluate our approach with AlexNet when a large batchsize is used. Fig. 5 shows the comparison of memory occupancy of the original convolution layer, CP-decomposition and our approach when batchsize = 50. As the batchsize becomes larger, the output feature maps dominate the memory occupancy of the original convolution layer. However, after applying CP-decomposition, the size of the intermediate data increases dramatically, by more than 20× compared to the original layer, and again becomes the dominating factor of the high memory occupancy. With our approach re-fusing the convolutions across the tensors, the size of the intermediate data is reduced significantly, by more than 98.7%.

Fig. 5. The memory occupancy of AlexNet (conv2 layer) with original convolution layer (Traditional-Conv), CP-decomposition (CP-Conv) and our approach (T1000-Conv) when batchsize = 50.

4.2.2. VGG

Similar to the experiments with AlexNet, we evaluate our approach with the second convolution layer in VGG-19 and compare the memory occupancy against CP-decomposition. The dominating factor of the memory occupancy in Figs. 6 and 7 is similar to Figs. 4 and 5. When the batchsize is small (batchsize = 1), the intermediate data takes a large portion of the memory occupancy. As the batchsize increases (batchsize = 50), the output feature maps consume more memory space than the intermediate data. However, the difference in Fig. 6 is that after applying CP-decomposition, the size of the intermediate data decreases noticeably compared to the original convolution layer. The reason is that the im2col transformation in Caffe requires additional memory space to store the intermediate data, whose size depends on the size of the input/output feature maps as well as the dimension of the convolution. After CP-decomposition, the original high-dimensional convolution is decomposed into several low-dimensional tensors, which requires less memory space to store the intermediate data from im2col. In Fig. 7, with batchsize equal to 50, the memory occupancy of the intermediate data instead increases significantly compared to the original convolution after applying CP-decomposition. In this case, the input/output feature maps become much larger with the larger batchsize, which offsets the memory benefit from the low-dimensional tensors and eventually leads to a dramatic increase of the intermediate data size. Nevertheless, with both small and large batchsize in Figs. 6 and 7, our approach reduces the memory occupancy of the intermediate data by 77.4% and 98.7% respectively compared to CP-decomposition (89.4% and 89.5% compared to the original convolution). The experimental results demonstrate the effectiveness of our approach in mitigating the large memory occupancy of the intermediate data while embracing the low computation complexity of CP-decomposition.

Fig. 6. The memory occupancy of VGG-19 (conv2 layer) with original convolution layer (Traditional-Conv), CP-decomposition (CP-Conv) and our approach (T1000-Conv) when batchsize = 1.

Fig. 7. The memory occupancy of VGG-19 (conv2 layer) with original convolution layer (Traditional-Conv), CP-decomposition (CP-Conv) and our approach (T1000-Conv) when batchsize = 50.

4.3. Execution time
In this section, we present the experimental results on execution time for AlexNet and VGG-19 using our approach. To show the speedup of our approach, we compare it with the original convolution and with CP-decomposition in all experiments.

4.3.1. AlexNet

According to our computation complexity analysis in Section 3.4, after applying CP-decomposition, the performance of the conv2 layer of AlexNet should improve by more than 3×. However, as shown in Fig. 8, in the Caffe implementation the execution time of the conv2 layer after applying CP-decomposition increases slightly, which contradicts our expectation. After further investigation, we find that the distribution of the execution time across the four low-dimensional tensors with CP-decomposition is 1:6.2:5.3:2.0. This means most of the execution time with CP-decomposition is spent on the second and third tensor convolutions. Meanwhile, the size of the input/output feature maps of AlexNet is quite small, which leaves little room for performance improvement. Moreover, in the Caffe implementation, additional computation is required to perform the im2col transformation before the second and third tensor convolutions, which offsets the performance improvement from the reduced computation of the low-dimensional convolutions. With our approach, the im2col transformation is eliminated by the convolution re-fusion across tensors, which improves the execution time by 1.67× compared to the original convolution (1.77× compared to the default CP-decomposition).

4.3.2. VGG

Similarly, we measure the execution time of the conv2 layer in VGG-19 to compare the performance among the original model, CP-decomposition and our approach.
Fig. 8. The execution time of AlexNet (conv2 layer) with original convolution layer (Traditional-Conv), CP-decomposition (CP-Conv) and our approach (T1000-Conv) when batchsize = 50.
Fig. 9. The execution time of VGG-19 (conv2 layer) with original convolution layer (Traditional-Conv), CP-decomposition (CP-Conv) and our approach (T1000-Conv) when batchsize = 50.
Different from AlexNet, the input/output feature maps in VGG-19 are large enough for CP-decomposition to compensate for the computation overhead introduced by the im2col transformation. As shown in Fig. 9, the execution time improves by 2.3× after applying CP-decomposition. With our approach, the execution time is further reduced by 1.4× compared to CP-decomposition, since it does not require the additional computation of the im2col transformation.

5. Related work

Recently, improving the performance of CNNs has attracted tremendous research attention. Most of the research on improving the performance of CNNs focuses on compressing the network and advancing the computation method.

In compressing the network, Jimmy Ba et al. [9] debate whether deeper neural networks deliver better performance; their experiments show that compressed neural networks can maintain the same expressiveness as deeper ones. Adriana Romero et al. [10] propose an approach that achieves the same recognition precision as deeper networks on smaller networks through high-accuracy learning. SqueezeNet [11] uses a ten-layer CNN composed of eight well-designed Fire modules to achieve significant parameter reduction compared to state-of-the-art CNN models with similar accuracy.

In the case of advanced computation methods, Yangqing Jia et al. [26] convert the convolution operation to matrix multiplication when implementing the Caffe framework. Michael Mathieu et al. [12] use FFTs to reduce the computation complexity and improve the efficiency of CNNs. Based on [12], Nicolas Vasilache
et al. [33] accelerate the computation of CNNs on GPUs using FFTs. Moreover, Yu Cheng et al. [13] explore the redundancy of parameters in the fully connected layer and propose rotation matrix multiplication to reduce the memory footprint. Manoj Alwani et al. [14] propose fusing convolutional layers in CNN accelerators to reduce the memory occupancy.

Recently, a class of research work decomposes the convolution kernels to improve the performance of CNNs. Emily Denton et al. [16] leverage SVD decomposition to approximate the high-dimensional convolution kernels with low-dimensional tensors to reduce the computation complexity. Jian Xue et al. [15] apply SVD decomposition to the weight matrices of DNNs and then reconstruct the model through fine-tuning to achieve the desired accuracy. To further speed up the computation after applying SVD, Vadim Lebedev et al. [5] explore CP-decomposition [34] and model fine-tuning to improve the convolution performance and control the accuracy loss. After the initial exploration of applying kernel decomposition to CNNs [15,5], more research has been conducted to advance the application of kernel decomposition in CNNs. Yong-Deok Kim et al. [35] leverage Bayesian matrix factorization to select a desirable rank value to implement Tucker decomposition on neural networks. Xiangyu Zhang et al. [36] develop GSVD, using a nonlinear method to replace the SGD-based solvers in previous work to optimize the decomposition. Peisong Wang et al. [37] propose a block term decomposition method to approximate the original convolution kernels by the sum of a small number of low-rank tensors; in addition, the sparse tensors are grouped to speed up the convolution. However, none of the existing works addresses the large amount of intermediate data introduced by CP-decomposition, which motivates our work.

As mobile devices take a large portion of the emerging AI market, there are research efforts to reduce the memory and storage overhead of CNN models in order to fit them onto mobile devices. Wenlin Chen et al. [38] propose converting filter weights to the frequency domain and using a low-cost hash function to randomly group frequency parameters into hash buckets, which exploits the redundancy in both the convolutional and fully-connected layers of a deep learning model to reduce memory and storage consumption. Jiaxiang Wu et al. [39] propose a unified framework for CNNs that quantizes the network parameters of both the filter kernels in convolutional layers and the weight matrices in fully-connected layers via approximate inner product computation, which significantly compresses the CNN models with a small loss of accuracy. Xiangyu Zhang et al. [40] introduce a new CNN architecture named ShuffleNet that utilizes pointwise group convolution and channel shuffle to reduce the computation cost while maintaining accuracy. Our work complements the research efforts in this direction by mitigating the memory consumption of the intermediate data after applying CP-decomposition to CNNs.

6. Conclusion and future work

In this paper, we propose a decomposition and re-fusion approach to mitigate the memory occupancy of the intermediate data generated by the default CP-decomposition. By re-fusing the convolutions across tensors, our approach reduces the size of the intermediate data by 84.6% and 77.4% for AlexNet and VGG-19 respectively.
Meanwhile, we optimize the cache locality of the convolution operations after re-fusion, which in turn improves the performance of AlexNet and VGG-19 by 1.77× and 1.4× respectively. Moreover, we provide a formal analysis of the memory occupancy and computation complexity of our approach in order
to demonstrate the advantage of our approach compared to the default CP-decomposition. With our approach, we retain the benefit of reduced computation complexity from CP-decomposition while mitigating the memory occupancy of the intermediate data through convolution re-fusion.

For future work, we would like to evaluate more CNN models such as GoogLeNet [41] and ResNet [1] with our approach. In addition, we are also interested in applying our approach on accelerators such as GPUs. Since the PCIe bandwidth between the CPU and GPU is quite limited for CNN applications, we expect our approach to be effective in alleviating this performance bottleneck. Furthermore, previous work [5] concludes that the accuracy of the CNN models after CP-decomposition can be improved significantly with model fine-tuning. We would like to explore this opportunity to experiment with more radical decomposition and re-fusion methods that achieve better performance in both computation complexity and memory occupancy with acceptable accuracy.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable feedback. This work is supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000304) and the National Natural Science Foundation of China (Grant No. 61502019).

References

[1] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[2] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, 2016, arXiv preprint arXiv:1602.07261.
[3] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: integrated recognition, localization and detection using convolutional networks, 2013, arXiv preprint arXiv:1312.6229.
[4] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, S. Yan, Attentive contexts for object detection, IEEE Trans. Multimedia (2016).
[5] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, V. Lempitsky, Speeding-up convolutional neural networks using fine-tuned cp-decomposition, 2014, arXiv preprint arXiv:1412.6553.
[6] J. Wu, C. Leng, Y. Wang, Q. Hu, J. Cheng, Quantized convolutional neural networks for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4820–4828.
[7] Z. Zhengj, J. Weng, Mobile device based outdoor navigation with on-line learning neural network: A comparison with convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 11–18.
[8] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: ImageNet classification using binary convolutional neural networks, 2016, arXiv preprint arXiv:1603.05279.
[9] J. Ba, R. Caruana, Do deep nets really need to be deep? in: Advances in Neural Information Processing Systems, 2014, pp. 2654–2662.
[10] A. Romero, N. Ballas, S.E. Kahou, A. Chassang, C. Gatta, Y. Bengio, Fitnets: hints for thin deep nets, 2014, arXiv preprint arXiv:1412.6550.
[11] F.N. Iandola, M.W. Moskewicz, K. Ashraf, S. Han, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1 MB model size, 2016, arXiv preprint arXiv:1602.07360.
[12] M. Mathieu, M. Henaff, Y. LeCun, Fast training of convolutional networks through ffts, 2013, arXiv preprint arXiv:1312.5851.
[13] Y. Cheng, F.X. Yu, R.S. Feris, S. Kumar, A. Choudhary, S.-F. Chang, An exploration of parameter redundancy in deep networks with circulant projections, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2857–2865.
[14] M. Alwani, H. Chen, M. Ferdman, P. Milder, Fused-layer CNN accelerators, in: Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, IEEE, 2016, pp. 1–12.
[15] J. Xue, J. Li, Y. Gong, Restructuring of deep neural network acoustic models with singular value decomposition, in: INTERSPEECH, 2013, pp. 2365–2369.
[16] E.L. Denton, W. Zaremba, J. Bruna, Y. LeCun, R. Fergus, Exploiting linear structure within convolutional networks for efficient evaluation, in: Advances in Neural Information Processing Systems, 2014, pp. 1269–1277.
[17] R. Rigamonti, A. Sironi, V. Lepetit, P. Fua, Learning separable filters, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2754–2761.
[18] M. Jaderberg, A. Vedaldi, A. Zisserman, Speeding up convolutional neural networks with low rank expansions, 2014, arXiv preprint arXiv:1405.3866.
[19] T. Baker, M. Asim, H. Tawfik, B. Aldawsari, R. Buyya, An energy-aware service composition algorithm for multiple cloud-based iot applications, J. Netw. Comput. Appl. 89 (2017) 96–108.
[20] T. Baker, A. Taleb-Bendiab, M. Randles, A. Hussien, Understanding elasticity of cloud services compositions, in: Utility and Cloud Computing, UCC, 2012 IEEE Fifth International Conference on, IEEE, 2012, pp. 231–232.
[21] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[22] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[23] D. Al-Jumeily, A.J. Hussain, P. Fergus, N. Radi, Self-organized neural network inspired by the immune algorithm for the prediction of speech signals, in: International Conference on Intelligent Computing, Springer, 2015, pp. 654–664.
[24] M. Khalaf, A.J. Hussain, D. Al-Jumeily, R. Keight, R. Keenan, P. Fergus, H. Al-Askar, A. Shaw, I.O. Idowu, Training neural networks as experimental models: Classifying biomedical datasets for sickle cell disease, in: International Conference on Intelligent Computing, Springer, 2016, pp. 784–795.
[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: CVPR09, 2009.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 675–678.
[27] Torch, http://torch.ch/.
[28] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems, 2015, arXiv preprint arXiv:1512.01274.
[29] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: A system for large-scale machine learning, in: OSDI, vol. 16, 2016, pp. 265–283.
[30] C. Nvidia, Compute unified device architecture programming guide, 2007.
[31] L. Medsker, L. Jain, Recurrent neural networks, Des. Appl. 5 (2001).
[32] R. Ierusalimschy, Programming in lua, Roberto Ierusalimschy, 2006.
[33] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, Y. LeCun, Fast convolutional nets with fbfft: a gpu performance evaluation, 2014, arXiv preprint arXiv:1412.7580.
[34] T.G. Kolda, B.W. Bader, Tensor decompositions and applications, SIAM Rev. 51 (3) (2009) 455–500.
[35] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, D. Shin, Compression of deep convolutional neural networks for fast and low power mobile applications, 2015, arXiv preprint arXiv:1511.06530.
[36] X. Zhang, J. Zou, K. He, J. Sun, Accelerating very deep convolutional networks for classification and detection, IEEE Trans. Pattern Anal. Mach. Intell. 38 (10) (2016) 1943–1955.
[37] P. Wang, J. Cheng, Accelerating convolutional neural networks for mobile applications, in: Proceedings of the 2016 ACM on Multimedia Conference, ACM, 2016, pp. 541–545.
[38] W. Chen, J.T. Wilson, S. Tyree, K.Q. Weinberger, Y. Chen, Compressing convolutional neural networks, 2015, arXiv preprint arXiv:1506.04449.
[39] J. Wu, C. Leng, Y. Wang, Q. Hu, J. Cheng, Quantized convolutional neural networks for mobile devices, in: The IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016.
[40] X. Zhang, X. Zhou, M. Lin, J. Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices, 2017, arXiv preprint arXiv:1707.01083.
[41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Computer Vision and Pattern Recognition, CVPR, 2015. http://arxiv.org/abs/1409.4842.
Changxi Liu is a Master student in School of Computer Science and Engineering, Beihang University. He is currently working on identifying performance opportunities for both scientific and AI applications. His research interests include HPC, performance optimization and deep learning.
Hailong Yang is an assistant professor in the School of Computer Science and Engineering, Beihang University. He received the Ph.D. degree from the School of Computer Science and Engineering, Beihang University in 2014. He has been involved in several scientific projects such as performance analysis for big data systems and performance optimization for large scale applications. His research interests include parallel and distributed computing, HPC, performance optimization and energy efficiency. He is also a member of IEEE and the China Computer Federation (CCF).
Rui Wang is an assistant professor in the School of Computer Science and Engineering, Beihang University. He received his BS and MS degrees in computer science from Xi'an Jiaotong University in 2000 and 2003, respectively, and his Ph.D. in computer science from Beihang University in 2009. His research interests include computer architecture and computer networks. He is a member of IEEE and the China Computer Federation (CCF).
Zhongzhi Luan is an Associate Professor in the School of Computer Science and Engineering, and Assistant Director of the Sino-German Joint Software Institute (JSI) at Beihang University, China. In 2003, he completed his Ph.D. in the Department of Computer Science of Xi'an Jiaotong University. He has been involved in more than 15 scientific projects, mostly as the project leader or a core researcher. He is now in charge of the international data placement testbed project, which is funded by the international cooperation program of the National Science Foundation of China. His research interests include distributed computing, parallel computing, grid computing, HPC and new generation network technology.
Depei Qian is a professor at the Department of Computer Science and Engineering, Beihang University, China. He received his master degree from University of North Texas in 1984. He is currently serving as the chief scientist of China National High Technology Program (863 Program) on high productivity computer and service environment. He is also a fellow of China Computer Federation (CCF). His research interests include innovative technologies in distributed computing, high performance computing and computer architecture.