A Fourier domain acceleration framework for convolutional neural networks


Jinhua Lin a,b,∗, Lin Ma c, Yu Yao a

a School of Computer Application Technology, Changchun University of Technology, Yan'an Street No. 2055, Changchun, China
b Machinery & Electronics Engineering, University of Chinese Academy of Sciences, Yu Quan Road No. 19, Beijing, China
c FAW Foundry Co., Ltd., DongFeng Street No. 83, Changchun, China

∗ Corresponding author at: School of Computer Application Technology, Changchun University of Technology, Yan'an Street No. 2055, Changchun, China. E-mail address: [email protected] (J. Lin).

Article history: Received 28 February 2019; Revised 15 June 2019; Accepted 24 June 2019. Communicated by Jun Yu.

Keywords: Convolutional neural networks; Deep learning; Forward/backward propagation passes; Activation function; Downsampling operations

Abstract

Acceleration of training and inference of convolutional neural networks (CNNs) plays a significant role in deep learning efforts for large-scale datasets. However, it is difficult to accelerate the training and inference of CNNs with traditional Fourier domain acceleration frameworks because Fourier domain training and inference involve many complicated factors, such as the architecture of the Fourier domain propagation passes, the representation of the activation function and the design of the downsampling operations. A conceptually intuitive, useful and general Fourier domain acceleration framework for CNNs is proposed in this paper. Taking the proposed Fourier domain rectified linear unit (FReLU) as the activation function and the proposed Fourier domain pooling function (FPool) as the downsampling function, a Fourier domain acceleration framework is established for CNNs, and the inverse activation function (FReLU−1) and inverse downsampling function (FPool−1) are further derived for the backward propagation pass. Furthermore, a block decomposition pipeline is integrated into the Fourier domain forward/backward propagation passes of CNNs to accelerate training and inference. The results show that the proposed acceleration framework can accelerate the training and inference of CNNs by a significant factor without reducing the recognition precision.

1. Introduction

Convolutional neural networks (CNNs) play an important role in the machine learning field. CNNs have become one of the most enabling techniques for solving large-scale learning problems, such as object detection, document processing and natural language processing [1–9]. Accelerating the training and inference of CNNs plays an important role in deep learning efforts for large-scale datasets. However, the training and inference of CNNs are affected by many complicated factors, such as the scale of the dataset, the depth of the convnet layers, and the architecture of the CNNs, so accelerating them is challenging. Many efforts have been made in the past decade on the modelling of baseline frameworks of CNNs with the goal of accelerating the training and inference of CNNs [10]. The baseline frameworks of CNNs can be divided into two major groups: spatial domain acceleration frameworks and Fourier domain acceleration frameworks. Spatial domain acceleration frameworks refer to


reducing the computational expense of convnet layers by directly implementing convolution operations in the spatial domain. Sharan Chetlur [11] presented an efficient library (cuDNN) of deep learning primitives, which makes it easy to optimize the parallel kernels of deep learning frameworks for given hardware; the optimization routines of [12] are similar to those of [11]. cuDNN can be easily integrated into existing CNN frameworks [13–17] when these frameworks are coded on an NVIDIA graphics processing unit (GPU) using CUDA [18,19]. cuDNN accelerates training and inference of CNNs by implementing spatial domain convolution operations in parallel on NVIDIA GPUs. Andrew Lavin [20] presented a spatial domain fast algorithm (FA) for CNNs based on Winograd's minimal filtering algorithms [21,22]. FA is most applicable for CNNs with small filters and small batch sizes, reducing the computational expense of a convolutional layer by a factor of 4 compared to cuDNN's direct convolution. Furthermore, Jason Cong and Bingjun Xiao [23] used the fast matrix multiplication algorithm proposed by Volker Strassen [24] to reduce the arithmetic complexity of CNNs in the spatial domain. Fourier domain acceleration frameworks refer to calculating spatial domain convolutions as dot products in the Fourier domain in parallel, with the same transformed feature map being reused in all of the propagation passes. Souheil Ben-Yacoub [25] presented a fast Fourier transform


(FFT) [26,27] based multilayer perceptron (MLP) to reduce the inference time of a three-layer neural network. [25] explored the possibility of using the FFT to accelerate neural networks, yet it calculates the Fourier transforms off-line and cannot be used during training. Michael Mathieu [28] presented a Fourier domain acceleration algorithm to accelerate training and inference of CNNs during the three propagation passes. [28] set a precedent for using FFT-based convolutions in CNNs, yet it is most applicable when the filter size is close to that of the input feature map. The NVIDIA corporation presented the NVIDIA CUDA FFT (cuFFT) library [29] for computing FFTs on an NVIDIA GPU, which can be integrated into advanced CNN frameworks. cuFFT uses the common Cooley-Tukey algorithm [30] to reduce the computational complexity of the discrete Fourier transform (DFT) to O(NlogN). cuFFT-integrated CNNs [31] perform spatial domain convolutions as pointwise products of size N in the Fourier domain at each convnet layer. [31] presented a useful algorithm to reduce the transformation time of FFTs from the spatial domain to the Fourier domain, accelerating the training and inference of CNNs by a factor of 8 for 2D image classification [32–34]. The performance of the cuFFT library is also related to register usage and memory latency, and thus a key improvement over cuFFT is to design a well-tuned in-register approach that performs better than the vendor-tuned libraries. Jaewook Shin [35] presented a compiler-based autotuning strategy to speed up the computations for a specific architecture, showing high performance gains over several vendor-tuned libraries. An open-source library called Facebook FFT (fbFFT) [36] was built on the if-conversion properties of the GPU CUDA compiler and automatic code generation tools [37], making full use of the tuning potential of GPU registers. For training and inference of CNNs, fbFFT outperforms cuFFT by a factor of 1.5 at batch sizes of 8–64. Although these methods can accelerate the training and inference of CNNs to a certain degree, they cannot avoid the transformations to and from the Fourier domain; i.e., FFT and inverse FFT (IFFT) operations are required between every convnet layer. Moreover, frequent transformation operations occupy a large amount of memory bandwidth and significantly increase the computational complexity of the convnet layer. Therefore, many researchers have realized that an efficient solution is to train and test CNNs entirely in the Fourier domain, i.e., with the activation function, pooling operations and convolution operations all implemented in the Fourier domain. Rippel et al. [38] explored the possibility of training CNNs in the Fourier domain, presenting a spectral pooling strategy for dimensionality reduction of input feature maps. Different from traditional pooling methods, [38] entirely truncates the higher frequency band of the spectrum representation of the input feature map. This spectral truncation strategy can preserve more feature map information during the forward propagation pass, yet the precision of the loss gradients with respect to the weights is significantly decreased during the backward propagation pass; i.e., it is difficult to propagate the truncated weight spectrum to the prior layer. Ko et al. [39] presented a frequency-domain accelerator for training CNNs, using a discrete sync interpolation operation to replace the zero-padding operation for filters in the Fourier domain. Using the spectral pooling function proposed in [38] as the downsampling function and the approximated tanh/sigmoid functions [40–42] as activation functions, [39] achieved training and inference of CNNs in the Fourier domain. However, the sync interpolation operation requires an additional K2 × N2 complex multiplications in each convnet layer, making it a computationally intensive framework at the expense of memory latency. Furthermore, the tanh/sigmoid functions used by Ko et al. [39] are two saturated activation functions that decrease the backpropagation accuracy of the gradients with respect to the weights and make it difficult for the CNNs to converge in the training process.

Training and inference of CNNs in the Fourier domain is an important issue in the field of machine learning. Not only should the computational expense of training and inference of CNNs be emphasized, but large-scale learning accuracy should also be maintained in the machine learning process. Therefore, the importance of the Fourier domain CNN acceleration framework proposed in this paper includes two aspects. First, in terms of training and inference speed, it can be regarded as a baseline framework for accelerating training and inference of CNNs in the Fourier domain. Second, in terms of detection accuracy, it contributes rich Fourier domain representations of the activation function, pooling operations and convolution operations to the field of deep learning. In this paper, an acceleration framework based on a block decomposition pipeline, a Fourier domain rectified linear unit (FReLU) and a Fourier domain downsampling function (FPool) is proposed for accelerating the training and inference of CNNs in the Fourier domain (see Fig. 1). The framework speeds up CNNs by studying the neuron decomposition mechanism and the Fourier domain representation of key operations instead of depending on hardware acceleration or simple algorithmic improvement. To decrease the computational costs of CNNs, the functions and representations of the convnet architecture should be comprehensively explored or reconstructed rather than relying on simple algorithmic improvement. First, a block decomposition pipeline is introduced into the forward and backward propagation passes of the Fourier domain acceleration framework to reduce the computational complexity and memory latency. Second, the FReLU and the Fourier domain downsampling function (FPool) are established to capture the non-linear features of the input data in the Fourier domain and to decrease the output feature map dimensions in the Fourier domain, respectively. Finally, the forward and backward propagation passes are investigated together to yield the corresponding representations of the inverse activation function (FReLU−1) and the inverse Fourier domain downsampling function (FPool−1).

This paper is organized as follows. The proposed Fourier domain CNN acceleration framework (abbreviated as FCNN) is outlined in Section 2. The Fourier domain forward propagation pass (Ffprop) of FCNN is proposed in Section 3, in parallel with the Fourier domain representations of FReLU and FPool. The Fourier domain backward propagation pass of FCNN is proposed in Section 4, in parallel with the inverse Fourier domain representations of FReLU−1 and FPool−1. The arithmetic complexity analysis for the proposed acceleration framework is presented in Section 5. The evaluation results of FCNN are presented in Section 6. Finally, the conclusions are drawn in Section 7.

2. Overall framework

The FCNN framework makes it possible to learn features in the Fourier domain rather than in the spatial domain, since it removes the need for inverse Fourier transforms in each convnet layer (i.e., an activation function and a downsampling method of the Fourier domain are proposed) and further accelerates training and inference by a block decomposition pipeline. Fig. 2 shows the overall architecture of FCNN. FCNN performs deep learning in the Fourier domain through two passes: an Ffprop and a Fourier domain backward propagation pass (Fbprop), as shown in Fig. 2.
Specifically, the Fourier domain backward propagation pass contains two sub-passes: sub-pass 1 computes the gradients of the inputs, and sub-pass 2 computes the loss gradients of the weights. In each pass of the FCNN, the large-scale input feature maps are decomposed into small feature blocks, and the size of the feature blocks corresponds to the size of the filters in each convnet layer. The convolutional operations are thereby decomposed into block-sized product operations in the Fourier domain.


Fig. 1. CNN frameworks in forward propagation pass with (a) traditional spatial-domain pipeline, (b) other Fourier-domain acceleration pipeline, and (c) the proposed Fourier-domain acceleration pipeline.

Fig. 2. The FCNN pipeline.

This block decomposition pipeline avoids the time-consuming padding steps in traditional FFT-based CNNs. Furthermore, non-linearity is introduced into FCNN by the non-linear activation function of the Fourier domain. The rectified linear unit in the frequency domain, called FReLU (labelled as “FR” in Fig. 2), is proposed to capture the non-linear features of the output feature maps in the Fourier domain in the forward propagation pass. Correspondingly, the inverse FReLU (labelled as “FR−1” in Fig. 2) is proposed to capture the non-linear features of the loss gradients with respect to the inputs in the Fourier domain in the backward propagation pass. The

FReLU and inverse FReLU do not weaken neuron backpropagation and make FCNN converge more easily in the training process. Next, the downsampling method in the frequency domain, called FPool (labelled as “FP” in Fig. 2), is proposed to reduce the resolution of the feature map in the Fourier domain. Correspondingly, the inverse FPool (labelled as “FP−1” in Fig. 2) is proposed to reduce the number of parameters in the backward propagation pass. Both the maximum and average operations are introduced into FPool and the inverse FPool; i.e., Fourier domain maximum pooling and average pooling methods are both implemented in FCNN. The proposed


activation function and downsampling method remove the need for the inverse Fourier transforms required in each convnet layer of traditional FFT-based CNNs. The detailed implementation process of FCNN is described in the following sections.

3. Fourier domain forward propagation pass (Ffprop)

Fourier domain forward propagation refers to mapping an input x to an output y in the frequency domain, ensuring that information flows forward through the Fourier domain network (FCNN). First, input x flows through our block decomposition pipeline in each convnet layer, and the large-scale convolution operations between input x and filter w are decomposed into block-sized product operations in the Fourier domain. Second, the Fourier domain activation function (FReLU) is applied to the product result F(y), yielding the output feature map of each convnet layer. Finally, Fourier domain downsampling methods (FPool) are implemented to reduce the resolution of the output feature map in each convnet layer. The detailed implementation process of Ffprop is described in this section.

3.1. Block decomposition pipeline in Ffprop

The state-of-the-art CNN architectures use deep networks with small filters (3 × 3 in general). However, the traditional FFT-based CNN architecture is more applicable for networks with large filters. Therefore, there is a strong need for a Fourier domain CNN framework (FCNN) suited to small filters. In this paper, a block decomposition pipeline is proposed for FCNN to achieve fast convolutions between large inputs and small filters. When the size of the input feature map x is much larger than that of the filter, the filter needs to be zero-padded to the same size as x, which is suboptimal with respect to speed. The block decomposition pipeline instead decomposes the input feature map x into a set of feature blocks of the same size as the filters. The product operations are then implemented for each block in parallel, and finally, the results are combined to obtain the output feature map F(y). Before introducing the presented method, we first clarify the meanings of several notations. For a given convnet layer, the size of the minibatch is denoted as S. The number of input feature maps is denoted as f; the input feature map xk is indexed by k, with k ranging from 1 to f, and is of size n1 × n2. The number of output feature maps is denoted as f′; the output feature map yk′ is indexed by k′, with k′ ranging from 1 to f′, and is of size m1 × m2. The number of filter kernels is denoted as f′f; the filter kernel wk′k is indexed by k′k, with k′k ranging from 1 to f′f, and is of size k1 × k2. The tilde notation ∼ refers to padding zero values to the feature map in the base-2 DIT FFT. In our block decomposition pipeline, each input feature map xk is divided into p blocks, and each block is of size l. The block and the filter kernel have the same order of magnitude, and thus the value of l is close to that of k1 × k2 (the notation m is used to label the value of k1 × k2 in our algorithm). The ith block of input feature map xk is represented as:

$$x_k^i(n) = \begin{cases} x_k(n), & il \le n \le (i+1)l - 1 \\ 0, & \text{other } n \end{cases} \qquad i = 0, 1, \cdots, p-1 \tag{1}$$

Then, the input feature map xk can be denoted as:

$$x_k(n) = \sum_{i=0}^{p-1} x_k^i(n) \tag{2}$$

Therefore, the convolution between xk and wk′k is computed by convolving each block xki with the corresponding filter kernel wk′k, i.e., the output feature map yk′ is denoted as:

$$y_{k'}(n) = x_k(n) * w_{k'k}(n) = \sum_{i=0}^{p-1} x_k^i(n) * w_{k'k}(n) \tag{3}$$

where each $x_k^i(n) * w_{k'k}(n)$ can be transformed into product operations in the Fourier domain. $x_k^i(n) * w_{k'k}(n)$ is of size $l + m - 1$, so $x_k^i(n)$ and $w_{k'k}(n)$ are zero-padded to size N, respectively. To use the base-2 FFT in the block decomposition pipeline, N is set to $2^\alpha$, which is no smaller than $l + m - 1$, i.e., $N = 2^\alpha \ge l + m - 1$ ($\alpha \in \mathbb{I}$). Therefore, the output feature block $y_{k'}^i(n)$ is computed by:

$$y_{k'}^i(n) = x_k^i(n) \circledast w_{k'k}(n) \tag{4}$$

where $\circledast$ denotes N-point circular convolution. As $x_k^i(n)$ consists of $l$ points and $y_{k'}^i(n)$ consists of N points (N is initially set to $l + m - 1$), there must be an overlapping area of $m - 1$ points between two adjacent output feature blocks, as shown in Fig. 3. According to formula (3), the overlapping and non-overlapping parts should be combined to form the output feature map yk′. The specific steps of the proposed block decomposition pipeline in Ffprop are as follows.

Step 1. An N-point FFT is implemented for the filter kernel $w_{k'k}(n)$, i.e., $W_{k'k}(U) = F(w_{k'k}(n)) = FFT(w_{k'k}(n))$, where U denotes the argument of the Fourier domain;
Step 2. An N-point FFT is implemented for the input feature block $x_k^i(n)$, i.e., $X_k^i(U) = F(x_k^i(n)) = FFT(x_k^i(n))$;
Step 3. N-point product operations are implemented for $X_k^i(U)$ and $W_{k'k}(U)$ to yield $Y_{k'}^i(U)$, i.e., $Y_{k'}^i(U) = F(y_{k'}^i(n)) = X_k^i(U) \cdot W_{k'k}(U)$;
Step 4. Each output feature block $Y_{k'}^i(U)$ (including overlapping parts) is added to form the final Fourier domain feature map $Y_{k'}$, i.e., $Y_{k'} = \sum_{i=0}^{p-1} Y_{k'}^i(U)$;
Step 5. Except for the last FC layer, the above four steps are repeated in each convnet layer. Before the end of the forward propagation pass, i.e., before training of the FC layer, an N-point inverse FFT is implemented for $Y_{k'}^i(U)$ to yield $y_{k'}^i(n)$, i.e., $y_{k'} = \sum_{i=0}^{p-1} y_{k'}^i(n) = \sum_{i=0}^{p-1} IFFT(Y_{k'}^i(U))$.
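Steps 1–5 amount to an overlap-add convolution evaluated through block-sized FFT products. A minimal NumPy sketch for one 1-D input map and one filter kernel is given below; the function and variable names are ours, not from the paper's implementation, and for verification the sketch applies the inverse FFT per block, whereas FCNN keeps the summed blocks Yk′ in the Fourier domain until the FC layer:

```python
import numpy as np

def block_fft_conv(x, w):
    """Sketch of Steps 1-5: convolve input map x with filter kernel w
    as block-sized pointwise products in the Fourier domain (overlap-add)."""
    l = m = len(w)                                 # block length l chosen close to the kernel size m
    N = 1 << int(np.ceil(np.log2(l + m - 1)))      # N = 2^alpha >= l + m - 1 for the base-2 FFT
    W = np.fft.fft(w, N)                           # Step 1: N-point FFT of the kernel, reused for all blocks
    y = np.zeros(len(x) + m - 1)
    for i in range(0, len(x), l):                  # decompose x into blocks of length l
        X = np.fft.fft(x[i:i + l], N)              # Step 2: N-point FFT of the i-th block
        Y = X * W                                  # Step 3: block-sized pointwise product
        y_blk = np.fft.ifft(Y).real                # inverse transform (per block, for verification only)
        y[i:i + N] += y_blk[:min(N, len(y) - i)]   # Step 4: overlap-add the m-1 overlapping points
    return y

x, w = np.random.randn(64), np.random.randn(9)
assert np.allclose(block_fft_conv(x, w), np.convolve(x, w))  # matches a direct spatial convolution
```

The same pipeline is reused in Fbprop (Section 4.1), with the loss gradient blocks and the transposed kernels taking the places of x and w.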

3.2. FReLU and FPool

The activation function is a non-linear mathematical operation applied after each convolution operation. It is introduced into the convnet layer to extract non-linear characteristics of the input data. Several kinds of activation functions have been proposed for training CNNs in the spatial domain, such as sigmoid, tanh and the rectified linear unit (ReLU). Beyond these spatial domain activation functions, there is a strong need for Fourier domain activation functions that can identify non-linear characteristics of the Fourier domain input. This paper nests the traditional spatial domain ReLU to construct the FReLU for FCNN, as shown in Fig. 4. For each convnet layer, the FReLU function is presented as:

$$A_{k'} = FR(Y_{k'}) \tag{5}$$

where Yk′ is the output feature map of the convolution operation, Ak′ is the output of the activation function, and FR(·) is the Fourier domain ReLU function. k′ indexes the output feature maps, taking values from 1 to f′. Because Yk′ is the DFT of the spatial domain output yk′, we can transform Yk′ into a form of discrete Fourier series as follows:

$$Y_{k'}(U) = y_{k'}(0) + y_{k'}(1)e^{-j\frac{2\pi}{N}U} + y_{k'}(2)e^{-j\frac{2\pi}{N}2U} + \cdots + y_{k'}(N-1)e^{-j\frac{2\pi}{N}(N-1)U} \tag{6}$$

where N is the size of the output feature map, and U is the argument of the Fourier series, indexing the elements of the output feature map.

Please cite this article as: J. Lin, L. Ma and Y. Yao, A Fourier domain acceleration framework for convolutional neural networks, Neurocomputing, https://doi.org/10.1016/j.neucom.2019.06.080

ARTICLE IN PRESS

JID: NEUCOM

[m5G;July 29, 2019;7:58]

J. Lin, L. Ma and Y. Yao / Neurocomputing xxx (xxxx) xxx

5

Fig. 3. Serialized maps for Fourier domain multiplications in the block decomposition pipeline.

Fig. 4. Fourier domain activation function (FReLU).

For example, in Fig. 4, N is equal to 3 × 3, and U denotes the integers between 0 and N−1, i.e., U ∈ [0, 8]. The ReLU function curve in the spatial domain is shown in the left half of Fig. 4, and the corresponding FReLU function curve in the frequency domain is shown in the right half. The spatial domain curve shows that the ReLU function eliminates the elements that are less than zero in the output yk′ and resets them to zero. In the frequency domain, however, the FReLU function eliminates only the coefficients that are less than zero in the output Yk′ rather than the whole elements, which indicates that only the non-positive items in Eq. (6) are reset to zero. To pick out the non-positive items from the output Yk′, the absolute value of Eq. (6) is taken term by term as:

$$|Y_{k'}(U)| = |y_{k'}(0)| + \left|y_{k'}(1)e^{-j\frac{2\pi}{N}U}\right| + \left|y_{k'}(2)e^{-j\frac{2\pi}{N}2U}\right| + \cdots + \left|y_{k'}(N-1)e^{-j\frac{2\pi}{N}(N-1)U}\right| \tag{7}$$

where | · | denotes taking the absolute value. Eqs. (6) and (7) are combined to yield the FReLU function as follows:

$$A_{k'}(U) = FR(Y_{k'}(U)) = \frac{1}{2}\left(Y_{k'}(U) + |Y_{k'}(U)|\right) \tag{8}$$
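Read this way, Eq. (8) zeroes exactly those series terms whose coefficients yk′(n) are non-positive while leaving the exponential factors untouched. The following NumPy sketch (our naming, illustrative values) realizes the definition by operating on the coefficient vector; it recovers the coefficients with an explicit inverse FFT purely for illustration, whereas the point of Eq. (8) is that the rule can be applied to the series terms without leaving the Fourier domain:

```python
import numpy as np

def frelu(Y):
    """Sketch of Eq. (8): zero the Fourier-series terms of Y whose
    coefficients y(n) are non-positive, keeping the exponential factors."""
    y = np.fft.ifft(Y).real                    # the series coefficients are the spatial samples y(n)
    return np.fft.fft(0.5 * (y + np.abs(y)))   # (y + |y|)/2 == max(y, 0), re-expanded as a series

# a 3 x 3 output feature map flattened to N = 9 elements, as in Fig. 4
Y = np.fft.fft(np.array([3., -1., 1., -2., -0.5, 0., 3., 2., -4.]))
A = frelu(Y)
assert np.allclose(np.fft.ifft(A).real, [3., 0., 1., 0., 0., 0., 3., 2., 0.])
```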

The elements of Ak′ are all non-zero; however, the Fourier series items with non-positive coefficients are reset to zero in the output Yk′. For example, in Fig. 4, the second element of Ak′ is equal to the value of FR(Yk′(1)), i.e., $a_{01} = 3 + e^{-j\frac{2\pi}{9}2} + 0e^{-j\frac{2\pi}{9}5} + 3e^{-j\frac{2\pi}{9}6} + 2e^{-j\frac{2\pi}{9}7}$. The coefficients indexed by 1, 3, 4 and 8 are eliminated, and these coefficients correspond to the non-positive feature elements in the spatial domain output yk′. In summary, the proposed FReLU function eliminates the non-positive items of the Fourier domain output Yk′ rather than entire feature elements. This eliminating strategy ensures that the spatial domain ReLU function is mapped seamlessly to the frequency domain.

Activation operations are usually followed by downsampling operations to reduce the dimensions of the output feature map; the number of weight parameters and the training complexity are thereby also reduced. For each convnet layer, the Fourier domain downsampling function (FPool) is presented as:

$$Y_{k'} = FP(A_{k'}) \tag{9}$$

where Yk′ represents the pooled output feature map in the Fourier domain, Ak′ represents the output feature map from the activation operation, and FP(·) is the Fourier domain downsampling function. In general, maximum pooling and average pooling are two important


Fig. 5. Fourier domain downsampling method (FPool).

downsampling methods for alleviating overfitting problems in the CNN. In the spatial domain, maximum pooling extracts the maximum eigenvalue from each specific region of the output feature map and replaces the other eigenvalues of the original specific region with this maximum eigenvalue. Average pooling computes the average eigenvalue over each specific region of the output feature map and replaces the other eigenvalues of the original specific region with this average eigenvalue. Following the spatial domain pooling idea, we nest the spatial domain downsampling methods to construct the Fourier domain maximum pooling and average pooling methods for FCNN. Fourier domain maximum pooling refers to extracting the maximum coefficient of the Fourier item of each element from the specific regions, taking the maximum coefficients as new coefficients of the corresponding elements, and replacing the other elements of the original specific region with this new element (see Fig. 5). These maximum coefficients correspond to the maximum eigenvalues extracted by the spatial domain maximum pooling. The specific region refers to the minimum neighbourhood unit for pooling operations in the output feature map. In general, the size of the specific region is fixed to the same size as the filter. Therefore, considering the Fourier domain maximum pooling idea, formula (9) is rewritten as:

$$Y_{k'}(U) \equiv FPm(A_{k'}(\cap)) = \sum_{n=\beta_0}^{\beta_j} \max(y_{k'}(n))\, e^{-j\frac{2\pi}{N}n\cap} \tag{10}$$

where FPm(·) represents the Fourier domain maximum pooling function. β indexes the Fourier items with non-negative coefficients in $A_{k'}(\cap)$. U indexes the elements of the pooled output feature map. ∩ indexes the location of the maximum eigenvalues extracted by the spatial domain maximum pooling. The pooled output feature map in the frequency domain is of size (N − m)/str + 1, i.e., U ∈ [1, (N − m)/str + 1], where N is the size of the original output feature map, m is the size of the filter, and str is the step size.

Fourier domain average pooling refers to extracting the average coefficients of the Fourier item of each element from the specific regions, taking the average coefficients as new coefficients of the corresponding elements, and replacing the other elements of the original specific region with this new element (see Fig. 5). These average coefficients correspond to the average eigenvalues extracted by the spatial domain average pooling. Therefore, the Fourier domain average pooling function is written as:

$$Y_{k'}(U) \equiv FPa(A_{k'}(\cap)) = \sum_{n=\beta_0}^{\beta_j} \mathrm{avg}(y_{k'}(n))\, e^{-j\frac{2\pi}{N}n\cap} \tag{11}$$

where FPa(·) represents the Fourier domain average pooling function. The location of the average eigenvalues extracted by the spatial domain average pooling does not exist as such, and the location of the non-negative coefficients is fixed for each element of the feature map; therefore, whatever the value of ∩ is, the pooled Fourier terms are invariant. In our experiment, ∩ is set to 1. In Fig. 5, the dimension of the output feature map Yk′ is reduced to half its original size, and the dimension reduction proportion in the frequency domain is the same as that in the spatial domain, which indicates that the spatial domain pooling operations are accurately mapped to the Fourier domain. Our Fourier domain pooling methods preserve the important training features, reducing the number of weight parameters, the computation complexity and the dimension of the output feature map, and alleviating overfitting problems in the FCNN.
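Since the series coefficients in Eqs. (10) and (11) are the spatial samples yk′(n), both pooling rules can be sketched by aggregating the coefficient vector and re-expanding the result as a shorter series. The NumPy sketch below (our naming, 1-D for brevity) assumes a region size m and stride str as defined above:

```python
import numpy as np

def fpool(A, m=2, stride=2, mode="max"):
    """Sketch of Eqs. (10)-(11): pool the Fourier-series coefficients of A.
    Output size is (N - m)/str + 1, mirroring the spatial pooling layout."""
    a = np.fft.ifft(A).real                          # coefficients a(n) of the series
    n_out = (len(a) - m) // stride + 1
    agg = np.max if mode == "max" else np.mean       # FPm vs. FPa aggregation
    pooled = np.array([agg(a[i * stride: i * stride + m]) for i in range(n_out)])
    return np.fft.fft(pooled)                        # pooled feature map, back as a Fourier series

A = np.fft.fft(np.random.randn(16))
Y_max, Y_avg = fpool(A, mode="max"), fpool(A, mode="avg")   # halved resolution, as in Fig. 5
```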

4. Fourier domain backward propagation pass (Fbprop)

Fourier domain backward propagation refers to mapping the loss gradient of the output, ∂L/∂y, to the gradient of the input, ∂L/∂x, and the gradient of the weight, ∂L/∂w, in the frequency domain, ensuring that loss deviation information flows backward through the Fourier domain network. First, the loss gradient with respect to the output flows through our block decomposition pipeline in each convnet layer. Second, the Fourier domain inverse activation function (FR−1) is implemented for F(∂L/∂x), yielding the loss gradients with respect to the inputs and the weights in each convnet layer. Finally, the Fourier domain inverse downsampling methods (FP−1) are implemented to propagate the loss deviation to the specific regions; these specific regions are located by the pooling methods in Ffprop. The detailed implementation process of Fbprop is described in this section.

4.1. Block decomposition pipeline in Fbprop

In the Fourier domain backward propagation pass, the block decomposition pipeline can decompose the loss gradient map with respect to the output into a set of feature blocks of the same size as the transposed weight kernels. Then, the product operations are implemented for each feature block in parallel. Finally, the results are combined to obtain the loss gradient map with respect to the


input, i.e., F(∂L/∂x). Fbprop uses the same notations as the Fourier domain forward propagation process. In the block decomposition pipeline of Fbprop, each loss gradient map with respect to the output (i.e., ∂L/∂yk′) is divided into p′ blocks, and each block is of size l′. Therefore, the ith block of the loss gradient map with respect to the output is represented as:

$$\frac{\partial L}{\partial y_{k'}^i}(n) = \begin{cases} \dfrac{\partial L}{\partial y_{k'}}(n), & il' \le n \le (i+1)l' - 1 \\ 0, & \text{other } n \end{cases} \qquad i = 0, 1, \cdots, p'-1 \tag{12}$$

Then, the loss gradient map with respect to the output can be denoted as:

$$\frac{\partial L}{\partial y_{k'}}(n) = \sum_{i=0}^{p'-1} \frac{\partial L}{\partial y_{k'}^i}(n) \tag{13}$$

Therefore, the convolutions between $\partial L/\partial y_{k'}$ and the transposed weight kernel (i.e., $w_{k'k}^{-1}$) are computed by convolving each block $\partial L/\partial y_{k'}^i$ with the corresponding $w_{k'k}^{-1}$, i.e., the loss gradient map with respect to the input, $\partial L/\partial x_k$, is denoted as:

$$\frac{\partial L}{\partial x_k}(n) = \frac{\partial L}{\partial y_{k'}}(n) * w_{k'k}^{-1}(n) = \sum_{i=0}^{p'-1} \frac{\partial L}{\partial y_{k'}^i}(n) * w_{k'k}^{-1}(n) \tag{14}$$

where each $\frac{\partial L}{\partial y_{k'}^i}(n) * w_{k'k}^{-1}(n)$ can be transformed into product operations in the Fourier domain. The $\frac{\partial L}{\partial y_{k'}^i}(n)$ and $w_{k'k}^{-1}(n)$ are zero-padded to matrices of size N, respectively. N is set to $2^\alpha$, which is no smaller than $l' + m - 1$, i.e., $N = 2^\alpha \ge l' + m - 1$ ($\alpha \in \mathbb{I}$). Therefore, the loss gradient block with respect to the input (i.e., $\frac{\partial L}{\partial x_k^i}(n)$) is computed by:

$$\frac{\partial L}{\partial x_k^i}(n) = \frac{\partial L}{\partial y_{k'}^i}(n) \circledast w_{k'k}^{-1}(n) \tag{15}$$

where $\circledast$ denotes the N-point circular convolution. The overlapping areas between two adjacent loss gradient blocks are processed by the same steps as in Ffprop (see Section 3.1).

4.2. The inverse FReLU and inverse FPool

In the Fourier domain backward propagation pass, the inverse activation function alleviates the gradient explosion problem, quickly updating the parameters in the convnet layers. In this paper, a Fourier domain inverse ReLU activation is implemented to propagate the loss values to the prior layer and update the specific features accordingly. The inverse FReLU needs to determine whether the coefficients of the Fourier items of the original inputs are non-negative (i.e., whether the input eigenvalue is greater than 0 in the spatial domain). When these coefficients are non-negative, the loss value is directly transmitted to the prior layer; otherwise, the loss value is set to zero before being transmitted to the prior layer. This means that only parts of the eigenvalues are modified in the FCNN. This paper nests the spatial domain inverse ReLU to construct the Fourier domain inverse rectified linear unit (FR−1) for the FCNN, as shown in Fig. 6. For each convnet layer, the FR−1 function is presented as:

$$A_k^{-1} = FR^{-1}\left(\frac{\partial L}{\partial X_k}\right) \tag{16}$$

where $\frac{\partial L}{\partial X_k}$ is the loss gradient with respect to the input feature map in the Fourier domain, $A_k^{-1}$ is the output of the inverse activation function, $FR^{-1}(\cdot)$ is the Fourier domain inverse ReLU function, and L represents the loss function. $\frac{\partial L}{\partial X_k}$ is the DFT of the spatial domain loss gradient map of the input (i.e., $\frac{\partial L}{\partial x_k}$), and thus we can transform $\frac{\partial L}{\partial X_k}$ into a form of discrete Fourier series as:

$$\frac{\partial L}{\partial X_k}(U) = \frac{\partial L}{\partial x_k}(0) + \frac{\partial L}{\partial x_k}(1)e^{-j\frac{2\pi}{N}U} + \frac{\partial L}{\partial x_k}(2)e^{-j\frac{2\pi}{N}2U} + \cdots + \frac{\partial L}{\partial x_k}(N-1)e^{-j\frac{2\pi}{N}(N-1)U} \tag{17}$$

where N is the size of the gradient map with respect to the input, and U indexes the elements of the gradient map. The locations of the non-negative eigenvalues of the original output feature map are preserved during the forward propagation pass. Using these preserved locations to determine the locations of the elements on the loss gradient map, the spatial domain inverse ReLU function updates the feature map of the prior layer with these located elements. As the coefficients of $\frac{\partial L}{\partial X_k}$ correspond to the elements of the spatial domain loss gradient map of the input, only the non-negative coefficients need to be updated by the located $\frac{\partial L}{\partial x_k}$, which indicates that only the non-positive items in Eq. (17) are reset to zero. Therefore, formula (16) and formula (17) are combined and rewritten as:

$$A_k^{-1}(U) = FR^{-1}\left(\frac{\partial L}{\partial X_k}(U)\right) = \sum_{n=\chi_0}^{\chi_j} \frac{\partial L}{\partial x_k}(n)\, e^{-j\frac{2\pi}{N}nU} \tag{18}$$

where χ represents the locations of the non-negative eigenvalues of the original output feature map. The Fourier items with non-positive coefficients are reset to zero in $\frac{\partial L}{\partial X_k}$. For example, in Fig. 6, the second element of $A_k^{-1}$ is equal to the value of $FR^{-1}(\frac{\partial L}{\partial X_k}(1))$, i.e., $a_{01}^{-1} = 2.5 + 1.5e^{-j\frac{2\pi}{9}2} - 3e^{-j\frac{2\pi}{9}5} + 1.7e^{-j\frac{2\pi}{9}6} + 2e^{-j\frac{2\pi}{9}7}$. The coefficients indexed by 1, 3, 4 and 8 are eliminated, and these coefficients correspond to the non-positive eigenvalues in the spatial domain loss gradient map of the input. In summary, the proposed FR−1 function updates the specific Fourier items of the original Fourier domain output Yk′ rather than entire feature elements. This updating strategy ensures that the Fourier domain loss values accurately propagate to the previous layer of the FCNN.
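In coefficient form, Eq. (18) is the familiar ReLU backward rule: a loss-gradient term passes through only at the preserved locations χ where the forward coefficient was non-negative. A minimal NumPy sketch under our naming:

```python
import numpy as np

def frelu_backward(dL_dX, Y_fwd):
    """Sketch of Eq. (18): mask the loss-gradient series by the locations chi
    of the non-negative coefficients preserved in the forward pass."""
    y = np.fft.ifft(Y_fwd).real           # forward coefficients y(n)
    g = np.fft.ifft(dL_dX).real           # loss-gradient coefficients dL/dx(n)
    g[y < 0] = 0.0                        # zero the terms outside the preserved locations
    return np.fft.fft(g)                  # masked gradient, re-expanded as a Fourier series
```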


Fig. 6. Fourier domain inverse activation function (FR−1 ).

Fig. 7. Fourier domain inverse pooling method (FP−1 ).

Inverse activation operations are usually followed by inverse pooling operations to propagate the loss values to the specific regions of the original input feature map. For each convnet layer, the Fourier domain inverse pooling function (FP−1) is presented as:

$$\frac{\partial L}{\partial X_k} = FP^{-1}(A_k^{-1}) \tag{19}$$

where $\frac{\partial L}{\partial X_k}$ represents the inverse-pooled loss gradient map with respect to the input in the Fourier domain, and $FP^{-1}(\cdot)$ is the Fourier domain inverse pooling function. Maximum pooling and average pooling are two important pooling methods for alleviating overfitting problems in the CNN. According to the Fourier domain maximum pooling and average pooling operations (see Section 3.2), we construct the Fourier domain maximum inverse pooling and average inverse pooling methods based on the spatial domain inverse pooling methods. Fourier domain maximum inverse pooling refers to updating the coefficients of the Fourier items of the specific region elements with the loss gradient value; the locations of these coefficients within each element were obtained by the Fourier domain maximum pooling operation during the forward propagation pass (see Fig. 7). The locations of the updated elements correspond to the locations of the maximum eigenvalues extracted by the spatial domain maximum pooling. The specific region refers to the minimum neighbourhood unit for pooling operations in the original output feature map. Based on the Fourier domain maximum inverse pooling idea, formula (19) is rewritten as:

$$\frac{\partial L}{\partial X_k}(U) \equiv FPm^{-1}(A_k^{-1}(\cap)) = \sum_{n=\delta_0}^{\delta_j} \frac{\partial L}{\partial x_k}(n)\, e^{-j\frac{2\pi}{N}n\cap} \tag{20}$$

where $FPm^{-1}(\cdot)$ represents the Fourier domain maximum inverse pooling function, δ indexes the updated coefficients of the Fourier items in $A_k^{-1}(\cap)$, U indexes the elements of the inverse-pooled loss gradient map with respect to the input, and ∩ indexes the location of the updated elements extracted by the spatial domain maximum inverse pooling. Fourier domain average inverse pooling refers to updating the coefficients of the Fourier items of the elements of the specific regions with the average loss gradient value. All coefficients within the element are updated by the average loss gradient values; these values are obtained by allocating the loss gradient value of the input to the corresponding specific regions equally (see Fig. 7). The locations of the specific regions within the feature map are obtained by the Fourier domain average pooling operation during the forward propagation pass. Based on the Fourier domain average inverse pooling idea, formula (19) is rewritten as:

$$\frac{\partial L}{\partial X_k}(U) \equiv FPa^{-1}(A_k^{-1}(\cap)) = \sum_{n=0}^{m-1} \left(\frac{\partial L}{\partial x_k}(n) \middle/ m\right) e^{-j\frac{2\pi}{N}n\cap} \tag{21}$$

where FPa−1 (· ) represents the Fourier domain average inverse pooling function. m represents the size of the specific region. In Fig. 7, the gradients of the input are updated to the specific regions accordingly. The weight updating strategy in the Fourier

domain is the same as that in the spatial domain, which indicates that the spatial domain inverse pooling operations are mapped accurately to the Fourier domain. Our Fourier domain inverse pooling methods preserve the locations of the specific regions of the Fourier domain well, accurately propagating the loss values to the previous layer.

5. Arithmetic complexity analysis

Following the common evaluation metric, the speedup of the GPU parallel implementations of the proposed method is measured on an NVIDIA GEFORCE RTX 2080 GPU with a peak throughput of 8.92 trillion floating-point operations per second (TFLOPS). Using direct convolutions in the forward propagation pass requires S·f′·f·(n1·n2)(k1·k2) operations. The FFT-based convolution in forward propagation requires S·f′·(n1n2 + k1k2 − 1)(1 + 3/2·log(n1n2 + k1k2 − 1)) operations. The proposed FCNN in forward propagation requires S·f′·(2k1k2 − 1)(1 + log(2k1k2 − 1)) operations. Here, an operation represents a complex multiplication, containing four real multiplications and two real additions. As the direct convolution network implements convolution operations for each convnet layer in the spatial domain, each of the input feature maps needs to be convolved with each of the filters; i.e., each element in the input feature map must be multiplied by all elements of a filter. Therefore, a single convolution operation requires (n1·n2)(k1·k2) operations in the direct convolution network. The FFT-based network transforms the convolution operations to pointwise product operations in the Fourier


Table 1. Arithmetic complexity analysis for three important frameworks. Unit: #complex multiplications (time); #MB (RAM memory).

Time (#complex multiplications):

| | Direct conv (cuDNN) | FFT conv (fbFFT) | Ours |
| --- | --- | --- | --- |
| fprop/Ffprop | S·f′·f·(n1·n2)(k1·k2) | S·f′·(n1n2 + k1k2 − 1)(1 + 3/2·log(n1n2 + k1k2 − 1)) | S·f′·(2k1k2 − 1)(1 + log(2k1k2 − 1)) |
| bprop/Fbprop | S·f·f′·(n1 − k1 + 1)(n2 − k2 + 1)(k1·k2) | S·f·(m1m2 + k1k2 − 1)(1 + 3/2·log(m1m2 + k1k2 − 1)) | S·f·(2k1k2 − 1)(1 + log(2k1k2 − 1)) |
| Weight calc. | S·f·f′·(n1 − k1 + 1)(n2 − k2 + 1)(n1·n2) | f·f′·(m1m2 + n1n2 − 1)(1 + 3/2·log(m1m2 + n1n2 − 1)) | f·f′·(m1m2 + n1n2 − 1)(1 + log(m1m2 + n1n2 − 1)) |

Memory (#MB):

| | Direct conv (cuDNN) | FFT conv (fbFFT) | Ours |
| --- | --- | --- | --- |
| weight | f·f′·k1·k2 | f·f′·k1·k2 | f·f′·k1·k2 |
| feature | S·f·n1·n2 / S·f′·m1·m2 | S·f·n1·n2 / S·f′·m1·m2 | S·f·n1·n2 / S·f′·m1·m2 |
| cache for FFT | – | 4ñ(ñ + 1)(S·f + S·f′ + f·f′) | 4k̃(k̃ + 1)(S·f + S·f′ + f·f′) |

domain. The FFT is implemented for each of the inputs and filters; i.e., in addition to the transformation complexity, only one multiplication operation is required for a convolution operation. Therefore, a single convolution operation requires (n1n2 + k1k2 − 1)(1 + 3/2·log(n1n2 + k1k2 − 1)) operations in the FFT-based network. When the filter size is 8, 16 or 32, i.e., k1 × k2 is equal to 3 × 3, 4 × 4 or 5 × 5, the FFT-based network incurs at least twice the arithmetic complexity of the direct convolution network. However, when the size of the filter is larger than 512, i.e., k1 × k2 is larger than 16 × 16, the FFT-based network provides at least an 8× arithmetic complexity reduction, which indicates that the traditional FFT-based convolution method is more applicable for training CNNs with large filters. The proposed FCNN implements product operations for each convnet layer in the Fourier domain, and the large input feature maps are divided into p small blocks. The size of the blocks approximates the size of the filters; i.e., a large convolution operation is decomposed into block-sized product operations in the Fourier domain. Therefore, a single convolution operation requires only (2k1k2 − 1)(1 + log(2k1k2 − 1)) operations in the proposed FCNN. When the input feature map is twice as large as the filter, FCNN provides at least a 4.9× arithmetic complexity reduction compared to the direct convolution network, undoubtedly performing better than the FFT-based network. When the size ratio of the input feature map to the filter is more than two, the FCNN provides at least an 11.2× arithmetic complexity reduction compared to the direct convolution network and a 10.9× reduction compared to the FFT-based network. Table 1 shows the arithmetic complexity analysis for the frameworks in the forward propagation pass, the backward propagation pass and the weight gradient calculation. Further comparison results with respect to arithmetic complexity are presented in Section 6.
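The per-convolution counts quoted above can be tabulated directly. The short script below is our own illustration; it treats n = n1·n2 and k = k1·k2 as flattened sizes and assumes the logarithms are base-2. The per-layer factors S, f, f′ and p cancel in a single-convolution comparison:

```python
import math

def ops_direct(n, k):            # direct spatial convolution: (n1*n2)(k1*k2)
    return n * k

def ops_fft(n, k):               # traditional FFT-based convolution
    t = n + k - 1
    return t * (1 + 1.5 * math.log2(t))

def ops_fcnn(k):                 # proposed block-sized Fourier products
    t = 2 * k - 1
    return t * (1 + math.log2(t))

# reduction ratios of FCNN over the direct and FFT-based networks
for n, k in [(32 * 32, 3 * 3), (32 * 32, 16 * 16), (2 * 3 * 3, 3 * 3)]:
    print(n, k, ops_direct(n, k) / ops_fcnn(k), ops_fft(n, k) / ops_fcnn(k))
```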

6. Results and discussion

To demonstrate the advantages of FCNN, we test FCNN in two respects. First, in terms of precision, a complete comparison of FCNN to the state-of-the-art methods is performed under multiple backbone architectures, in parallel with their ablation experiments. Second, in terms of speed, the throughput of the GPU implementation of the FCNN is measured, in parallel with a thorough comparison against several state-of-the-art Fourier domain CNNs. Before the experiments, the implementation details are presented. The hyperparameters used for precision evaluation are set to the same values as in the state-of-the-art object detection CNNs [43–45]. The threshold of the intersection-over-union (IoU) is set to 0.45; i.e., when the IoU score of the candidate object is larger than 0.45, the object is labelled as positive; otherwise, the object is labelled as negative. IoU refers to the intersection between the candidate box and the ground truth detection box. The multi-task loss function is applied for each positive candidate object during training, and the function is defined as in [43]. We train the CNNs on four GPUs (NVIDIA GEFORCE RTX 2080), and the effective batch size of each GPU is 32. The learning rate is 0.01, and the number of iterations is 200k; the learning rate is divided by 10 at 160k iterations. The fitting accuracy is set to 0.05. The NYUv2 dataset [46] and ImageNet dataset [6] are used for training the CNNs. The precision evaluation metric is average precision (AP) with different subscripts, where the different subscripts refer to detecting objects in different dimensions. The backbone architectures used for our experiments include ResNet50, VGGNet19 and AlexNet7, where the numbers refer to the depth of the network. Except for AlexNet7, ResNet50 and VGGNet19 use 3 × 3 filters in convnet layers and 2 × 2 filters in deconvnet layers. The stride is set to 1 or 2. ReLU activation is used in the hidden layers. The speed of training and inference of the CNNs is measured on an NVIDIA GEFORCE RTX 2080 GPU with a peak throughput of 8.92 TFLOPS. TFLOPS is used to measure the arithmetic complexity of a given algorithm; i.e., when the TFLOPS for a convnet layer at a specific batch size exceed the device peak throughput, the algorithm (or framework) provides a reduction in arithmetic complexity for this convnet layer and batch size.

6.1. Evaluation of learning precision

During training and inference, FCNN is compared to the state-of-the-art methods by drawing the learning precision curves under three backbone architectures. The comparison results are shown in Fig. 8. Combined with the three backbone architectures, all learning curves of FCNN outperform koCNN [39] and perform similarly to cuDNN [11]. As shown in Fig. 8, the convergence process of FCNN goes through three stages. In terms of FCNN with the ResNet50 backbone, the first stage occurs in the first 30k iterations of training. In this stage, the predicted outputs differ greatly from the theoretical outputs, and the large training error drives the network to converge rapidly. As the proposed FReLU alleviates the saturation problem in the Fourier domain training stage, FCNN converges more easily than koCNN with the ResNet50 backbone. FReLU is an intuitive extension of the spatial domain ReLU activation, so the learning curves of cuDNN and FCNN are not very different at this stage. The second stage takes place between 30k and 170k iterations. In this stage, the network training is almost finished, and the training loss between the predicted outputs and the theoretical outputs falls below 0.1. The convergence speed decreases greatly compared with the previous stage, and the network tends to be stable. Compared with koCNN, the FCNN learning curve fluctuates less during training and inference. koCNN transfers the original kernels to Fourier domain zero-padded kernels by a sync interpolation function; this strategy reduces the cost of memory, but the learning precision is also reduced. The third stage occurs


Fig. 8. Learning curves of FCNN, cuDNN and koCNN.

after the 170k-th iteration, when the training loss has been reduced to less than 0.1 for FCNN with the ResNet50 backbone. In this stage, the network has entered the fine-tuning process, and the weight parameters of each node of the network no longer change significantly. Finally, after 278k training iterations, FCNN reaches the pre-set accuracy, and the training stages finish. The learning precision of FCNN reaches 0.0499995 during the inference process, which fits well with the pre-set precision of 0.05. In terms of FCNN with the VGGNet19 backbone, the second stage occurs between the 10k and 280k iterations. The network needs more time to converge to the pre-set precision than FCNN with the ResNet50 backbone. This verifies that FCNN is more applicable for convnets with small filters and deep layers. Fig. 8 also shows that koCNN with the VGGNet19 backbone performs worse than FCNN with the VGGNet19 backbone. The learning curve of koCNN fluctuates violently from beginning to end, which shows the poor compatibility between the Fourier domain architecture of koCNN and the state-of-the-art backbone architecture. In terms of FCNN with the AlexNet7 backbone, the second stage occurs between 30k and 260k iterations, which is much closer to the second stage of FCNN with the ResNet50 backbone. Despite the lower precision of this backbone architecture, FCNN, which uses FReLU in the hidden layers, still performs similarly to the network with a more advanced backbone and much better than koCNN.

Following the above comparison results, ablation experiments are performed to verify the advantages of the Fourier domain convolution architecture of FCNNs. The ablation experiment results are presented in Table 2, and the discussions are presented as follows. Architecture: Table 2(a) presents the average learning precision of FCNNs with different backbone architectures. FCNNs with better backbones show higher precision than those with the basic backbones. This results from the advanced architecture of FCNN and the deep layers of ResNetX. The advanced architecture of FCNN benefits from the block decomposition pipeline running through the backpropagation passes. This means that training the FCNN in the Fourier domain shows no precision loss compared with training its variants in the spatial domain. Additionally, it should be emphasized that not all architectures benefit from good backbones and deeper convnet layers. Activation function: Table 2(b) shows that the FCNN converges relatively easily in the Fourier domain training process. To evaluate the convergence performance of FReLU, we compare FReLU to the Fourier domain sigmoid and tanh functions in this experiment. The sigmoid and tanh functions are two saturated activation functions in the CNN pipeline. For each convnet layer, when the input feature matrix is very small or very large, the output of the sigmoid function is between 0 and 1, and the output of the tanh function


Table 2. Ablation experiments for FCNN.

(a) Architecture: better architectures and deeper convnet layers improve the learning precision during training and inference processes.

| backbone-depth-archt. | AP | AP25 | AP50 |
| --- | --- | --- | --- |
| AlexNet-7-cuFFT | 39.5 | 56.8 | 42.5 |
| VGGNet-19-cuFFT | 46.5 | 65.5 | 50.0 |
| VGGNet-19-FCNN | 47.6 | 66.9 | 50.2 |
| ResNet-34-FCNN | 50.2 | 70.4 | 55.2 |
| ResNet-50-FCNN | 52.4 | 73.5 | 57.1 |

(b) Activation: the network with the FReLU function shows better learning precision than the variants with the sigmoid/tanh function (ResNet-50 backbone is used in this experiment).

| activation | AP | AP25 | AP50 |
| --- | --- | --- | --- |
| sigmoid | 46.0 | 65.3 | 49.6 |
| tanh | 46.5 | 65.7 | 49.7 |
| FReLU | 52.4 | 73.5 | 57.1 |
| D-value1 | +6.4 | +8.2 | +7.5 |
| D-value2 | +5.9 | +7.8 | +7.4 |

(c) Pooling: Fourier-domain learning results with two frequency pooling functions. Our Fourier-domain inverse pooling function improves the learning precision by 2.5 AP (ResNet50 backbone is used in this experiment).

| pooling | spectrum | bilinear | agg. | AP | AP25 | AP50 |
| --- | --- | --- | --- | --- | --- | --- |
| spectral pool | √ | – | – | 52.8 | 74.5 | 57.1 |
| FPm | √ | √ | max | 52.2 | 73.1 | 56.9 |
| FPa | √ | √ | avg | 52.5 | 73.6 | 56.5 |
| inverse FPm | √ | – | max | 55.3 | 76.2 | 58.6 |
| inverse FPa | √ | – | avg | 55.6 | 76.6 | 58.7 |

saturates at −1 or 1. The gradients at the saturated regions are almost zero, which means that it is difficult for sigmoid and tanh to converge in the Fourier domain training process. In addition, koCNN employs the sigmoid and tanh functions to introduce non-linear features into the Fourier domain, yet the sigmoid/tanh function makes it difficult to fit the pre-set precision in the Fourier domain; this can also be verified in Fig. 8. The FCNN extends the spatial domain ReLU activation intuitively; i.e., FReLU is proposed for alleviating the saturation problem in the Fourier domain training process. The advantages of FReLU are shown by using sigmoid and FReLU as activation functions to train the networks with the ResNet50 backbone. The comparison results are presented in Table 2(b). Tanh decreases neuronal backpropagation, leading to a loss of almost 5.9 AP. Sigmoid decreases neuronal backpropagation even more seriously, leading to a severe loss of almost 6.4 AP, which indicates that once the network is trained in the Fourier domain, it is better to employ an activation function that is designed exclusively for the Fourier domain CNN. Pooling: Table 2(c) shows the comparison results between several state-of-the-art pooling functions. The ResNet50 backbone architecture is employed in this experiment. In the backpropagation pass, the proposed inverse FPool function outperforms the spectral pool function by at least 2.5 AP, showing larger gains under higher multi-scale regions (i.e., AP50). The maximum inverse FPool performs similarly to the average inverse FPool, and both are better than the spectral pool. As the spectral pool function truncates the high-frequency signal of the input data in the Fourier domain forward propagation pass, the gradients with respect to the high-frequency outputs cannot be restored in the backpropagation pass. More importantly, every Fourier domain input consists of high-frequency and low-frequency signals, which suggests that each output and weight running back through the propagation pass is defective. This results in the lower precision of the spectral pool in the backpropagation pass. The proposed inverse FPool function is an intuitive extension of the proven spatial pooling function that properly updates the features of the inputs and weights. In the forward propagation pass, the spectral pool function performs

similarly to the FPool function and even better in some cases, which indicates that using the frequency truncation strategy to reduce the dimension of the inputs is more applicable in the forward propagation pass.

6.2. Evaluation of learning speed

Based on the arithmetic complexity analysis in Section 5, the speedup ratios of several advanced methods are calculated to evaluate the learning speed of FCNN. The advanced methods are fbFFT, cuDNN, FA and koCNN; in this experiment, they are all combined with the three backbones (i.e., AlexNet7, VGGNet19 and ResNet50) and compared to the FCNN. Their speedup performance is measured on an NVIDIA GEFORCE RTX 2080 GPU. The speedup of a given method is computed by dividing the runtime required by the multiplication operations tabulated in Table 1 by the runtime on the GPU. The number of GPUs determines the batch size; the batch size of each GPU is 32, and four GPUs are used for training the CNNs. Thus, the effective batch size ranges from 1 to 128 in this experiment. The arithmetic complexity of FCNN is evaluated by the speedup ratio of FCNN to the four other advanced methods, calculated at batch sizes from 1 to 128 as follows:

$$
S_p = \sum_{i=1}^{128} \frac{S_i \cdot f_i \cdot p_i \cdot \left(2k_{1i}k_{2i} - 1\right)\left(1 + \log\left(2k_{1i}k_{2i} - 1\right)\right)}{T_{p_i}} \tag{22}
$$

where $T_{p_i}$ refers to the speedup of the GPU implementation of a given method at batch size $i$, and $S_p$ represents the speedup ratio of FCNN to that method. A higher speedup ratio indicates that FCNN reduces a larger proportion of the arithmetic complexity than the compared method does.
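For concreteness, Eq. (22) can be transcribed directly into code once the per-batch-size terms are known. The sketch below is a literal evaluation of the formula; the input arrays (S_i, f_i, p_i, the kernel dimensions k_{1i} and k_{2i}, and the per-method speedups T_{p_i}) are placeholders standing in for the measured quantities, which are not reproduced here, and np.log is the natural logarithm (swap in np.log2 if Eq. (22) intends base 2).

```python
import numpy as np

def speedup_ratio(S, f, p, k1, k2, T_p):
    """Evaluate Eq. (22): the speedup ratio of FCNN to a given method,
    summed over batch sizes i = 1..128. All arguments are 1-D arrays of
    equal length holding the per-batch-size quantities."""
    terms = 2.0 * k1 * k2 - 1.0                       # (2*k1i*k2i - 1)
    return np.sum(S * f * p * terms * (1.0 + np.log(terms)) / T_p)

# Placeholder inputs for 128 batch sizes (illustrative values only).
n = 128
rng = np.random.default_rng(0)
Sp = speedup_ratio(S=np.arange(1, n + 1, dtype=float),
                   f=rng.uniform(1, 4, n), p=rng.uniform(1, 2, n),
                   k1=np.full(n, 3.0), k2=np.full(n, 3.0),
                   T_p=rng.uniform(1, 10, n))
```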


Fig. 9. Speedup ratio of FCNN with different backbones.

Fig. 9 shows the speedup ratio of FCNN with the three backbones. The detailed analysis is as follows.

Fig. 9(a) shows the speedup ratio of FCNN with the AlexNet7 backbone. Compared with fbFFT, FCNN outperforms fbFFT by 10.7260 points at a batch size of 1. As the batch size increases, the speedup ratio of FCNN to fbFFT decreases, to 3.9745 points at a batch size of 128, but FCNN still performs better than fbFFT at every batch size; the arithmetic complexity of FCNN is reduced by 5.6159 points on average. This benefit comes from the block decomposition pipeline embedded in the FCNN architecture (see the sketch following this discussion), which decomposes the large convolution operations of the spatial domain into block-sized products in the Fourier domain, reducing the zero-padding steps required by fbFFT. Additionally, the minimum speedup ratio of FCNN to fbFFT occurs at a batch size of 128, which suggests that training the FFT-based CNN tends to be stable regardless of the batch size.

Compared with cuDNN, FCNN performs better by 4.8072 points at a batch size of 1 and by 6.4096 points at a batch size of 128. A difference of nearly two points suggests that the arithmetic complexity of spatial domain convolution grows in proportion to the depth of the convnet layers, so the FCNN-based Fourier domain framework is more applicable for training networks with deeper layers and larger batches. Moreover, cuDNN performs better than fbFFT only when the batch size is less than 8, which results from the initial transformation time required by the Fourier domain frameworks.

Another important spatial domain framework is built upon the Winograd algorithm, labelled FA. FCNN outperforms FA by 1.8884 points at a batch size of 1 and by 2.5751 points at a batch size of 128. The arithmetic complexity of FCNN decreases by 2.2532 points on average, a relatively small gain compared to the other spatial domain convolution frameworks; FA is still the fastest algorithm for training CNNs in the spatial domain. However, FA shows poor computational performance once the batch size exceeds 64 because more transformation time is needed to implement the tile-sized multiplication operations in the convnet layers.

Fig. 9(a) also shows the comparison between FCNN and koCNN, which is thus far the only other Fourier domain convolution framework besides our FCNN. The speedup ratio of FCNN to koCNN is 3.9726 × at a batch size of 1 and 1.4720 × at a batch size of 128, and FCNN performs better than koCNN at every batch size. This suggests that the number of multiplication operations required by the sinc interpolation function is still larger than that of the proposed block-sized multiplication operations, as verified in Sections 3.1 and 4.1.

Fig. 9(b) and (c) show the speedup ratio of FCNN with the VGGNet19 and ResNet50 backbones, respectively. With more advanced backbones, the performance of FCNN and the other frameworks also improves. For example, the arithmetic complexity of FCNN with the AlexNet7 backbone decreases by at least 2.0799 points on average; this number increases to 3.1720 points for FCNN with the ResNet50 backbone. A difference of more than one point suggests that the depth and performance of the backbone have a positive effect on the learning speed of FCNN.
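The block decomposition pipeline credited above is, at heart, a divide-and-conquer FFT convolution. The sketch below shows the idea in its classical 1-D overlap-add form, purely as an illustration: the paper's pipeline operates on 2-D feature maps with its own block sizes, so this function is an assumption-laden analogue, not the FCNN implementation.

```python
import numpy as np

def overlap_add_conv(x, h, block=64):
    """1-D FFT convolution via block decomposition (overlap-add).
    Illustrative sketch only; the FCNN pipeline operates on 2-D maps."""
    n_fft = block + len(h) - 1             # linear-convolution length per block
    H = np.fft.rfft(h, n_fft)              # filter spectrum, computed once
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        Y = np.fft.rfft(x[start:start + block], n_fft) * H  # block-sized product
        out = np.fft.irfft(Y, n_fft)       # partial spatial-domain result
        end = min(start + n_fft, len(y))
        y[start:end] += out[:end - start]  # overlap-add accumulation
    return y

x, h = np.random.randn(1000), np.random.randn(9)
assert np.allclose(overlap_add_conv(x, h), np.convolve(x, h))
```

Note that only each block, rather than the whole input, is zero-padded to the linear-convolution length, which is the source of the padding savings referred to above.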

In addition to the above experiment, the throughput of FCNN and koCNN is calculated layer by layer to further evaluate the learning speed of FCNN. The throughput results are presented in Fig. 10 and discussed below. We carried out the throughput experiment by training FCNN and koCNN with the AlexNet7 backbone. The throughput of the GPU implementations of FCNN and koCNN is measured on an NVIDIA GEFORCE RTX 2080 GPU, whose base clock rate is 1515 MHz and whose number of NVIDIA CUDA cores is 2944. The GPU peak throughput is therefore 2 × 1515 MHz × 2944 = 8.92 TFLOPS. The throughput for a given layer is obtained by dividing the ground-truth number of floating-point operations by the runtime of the GPU implementation. When the throughput of a given layer is higher than 8.92 TFLOPS, the layer is considered to have effective arithmetic complexity; that is, the framework reduces the arithmetic complexity of training and inference if the throughput of the convnet layers exceeds the GPU peak throughput.

Fig. 10 shows the throughput of the GPU implementations of FCNN and koCNN at various batch sizes; single-precision floats are used throughout. In the forward propagation pass of layer 1, FCNN shows a speedup of 2.00 × at a batch size of 128 and 1.63 × at a batch size of 1. The throughput at a batch size of 128 is 19.6292 TFLOPS, well above the peak throughput, and FCNN outperforms koCNN by 6.6867 TFLOPS at a batch size of 1 and by 9.8146 TFLOPS at a batch size of 128. In the Fig. 10(a) depth maps, red points indicate the maximum throughput over the three propagation passes when FCNN is used for training and inference, and blue points indicate the maximum throughput when koCNN is used. Brighter areas indicate higher throughput, i.e., lower arithmetic complexity, in that region. Compared with koCNN, FCNN has a larger bright area, especially at large batch sizes. In Fig. 10(b), koCNN yields less than 8.92 TFLOPS for the three propagation passes from layers 3 to 7, which indicates that the sinc interpolation implementation uses large complex multiplications in place of FFT transformations, resulting in inefficient multiplication procedures unless the batch size is small; at small batch sizes, koCNN performs much better, with throughput above 8.92 TFLOPS. Fig. 10(c) shows that FCNN outperforms koCNN at every convnet layer and batch size, especially in the Faccgrad pass, where the gradient with respect to the weights is calculated; this pass contributes more than a third of the total arithmetic computation.

In general, FCNN performs well unless the size of the filters is very large. In that case, the block decomposition strategy either uses large blocks or a single block per input feature map, which requires more GPU memory to implement the arithmetic computation.
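The peak-throughput arithmetic quoted above is easy to reproduce. The sketch below computes the theoretical peak (two floating-point operations per CUDA core per cycle) and a per-layer achieved throughput; apart from the RTX 2080 clock rate and core count quoted in the text, the numbers are placeholders.

```python
def peak_tflops(base_clock_mhz, cuda_cores, flops_per_core_per_cycle=2):
    """Theoretical single-precision peak, as used in the text:
    2 x 1515 MHz x 2944 cores ~= 8.92 TFLOPS for the RTX 2080."""
    return flops_per_core_per_cycle * base_clock_mhz * 1e6 * cuda_cores / 1e12

def layer_throughput_tflops(layer_flops, runtime_s):
    """Achieved throughput of one convnet layer: operation count / runtime."""
    return layer_flops / runtime_s / 1e12

peak = peak_tflops(1515, 2944)                     # ~8.92 TFLOPS
achieved = layer_throughput_tflops(3.2e12, 0.163)  # placeholder numbers
effective = achieved > peak  # layer counts as "effective" if above peak
```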


Fig. 10. The throughput analysis for FCNN and koCNN.


Therefore, FCNN is more applicable for backbones with small filters and deep layers. Overall, FCNN improves learning speed by at least 11.4813 TFLOPS on an NVIDIA GEFORCE RTX 2080 GPU with a peak throughput of 8.92 TFLOPS, showing larger gains for large batches.

7. Conclusion

A Fourier domain acceleration framework is proposed in this paper to speed up the training and inference of CNNs in the forward and backward propagation passes. The main contributions are as follows.

(1) A Fourier domain block decomposition pipeline is proposed and integrated into the forward and backward propagation passes of CNNs. The computational expense and memory latency of training and inference of CNNs are reduced by this decomposition strategy.

(2) We extend the traditional spatial domain ReLU to construct FReLU and FReLU−1 for FCNN. The non-linear features of the Fourier domain input can be captured by FReLU and FReLU−1 in the forward and backward propagation passes.

(3) The Fourier domain downsampling functions (FPool and FPool−1) are established by representing the maximum and average pooling operations in the Fourier domain. The dimensions of the Fourier domain output can be reduced by FPool and FPool−1.

In conclusion, the proposed framework, based on the new Fourier domain propagation passes, activation function and downsampling operations, can efficiently accelerate the training and inference of CNNs in the Fourier domain and play a significant role in large-scale learning efforts, with a positive impact on the fields of pattern recognition and machine learning.

Declaration of Competing Interest

None.

Funding

This work was supported by the National Natural Science Foundation of China [grant number 51705032] and the National High-tech R&D Program [grant number 2014AA7031010B].



Lin Ma received his B.S. degree in material shaping and control engineering in 2000 from Sichuan University. He is currently a senior engineer at FAW Foundry Co., Ltd. His research interests are casting process simulation and the engineering application of artificial intelligence.

Jinhua Lin received her B.S. and M.S. degrees in Computer Science and Technology from Xi'an Jiaotong University in 2004 and 2008, respectively, and her Ph.D. in Mechatronic Engineering from the University of Chinese Academy of Sciences in 2017. She is currently an associate professor at Changchun University of Technology. Her research interests are computational neuroscience, machine learning, computer vision, and the engineering application of artificial intelligence.

Yu Yao received her B.S. and M.S. degrees in Mechanical Engineering from Changchun University of Technology in 2006 and 2009, respectively, and her Ph.D. in Mechanical Engineering from Jilin University in 2016, specializing in tracked vehicle engineering. Her main research interests are computational neuroscience, industrial robotics, and vehicle engineering.
