Efficiency of deep networks for radially symmetric functions

Brendan McCane∗, Lech Szymanski
Department of Computer Science, University of Otago, Dunedin, New Zealand
∗ Corresponding author. E-mail address: [email protected] (B. McCane).
Article history: Received 28 July 2017; Revised 8 May 2018; Accepted 11 June 2018; Available online xxx. Communicated by Dr. Q. Wei.

Abstract

We prove that radially symmetric functions in d dimensions can be approximated by a deep network with fewer neurons than the previously best known result. Our results are much more efficient in terms of the support radius of the radial function and the error of approximation. Our proofs are all constructive and we specify the network architecture and almost all of the weights. The method relies on space-folding transformations that allow us to approximate the norm of a high-dimensional vector using relatively few neurons.

Keywords: Deep networks; Function approximation
1. Introduction

Deep networks have been stunningly successful in many machine learning domains since the area was reinvigorated by the work of Krizhevsky et al. [6]. Despite their success, relatively little is known about them theoretically, although this is changing. In particular, it would be very useful to know theoretically for which problems deep networks are more effective than shallow learners. This paper extends previous work on approximating radially symmetric functions. It provides a new upper bound for the number of neurons required in a deep network with rectified linear units (ReLUs) to approximate a radially symmetric function and does so using a constructive proof. A method for building ReLU networks that perform the approximation is given.

2. Related work

Most theoretical work on deep networks consists of existence proofs that give no insight into how to build a network for the problem under consideration. For example, ReLU networks with $n_0$ inputs and $L$ hidden layers of width $n \ge n_0$ can compute functions that have $\Omega\!\left((n/n_0)^{(L-1)n_0} n^{n_0}\right)$ linear regions, compared to $\sum_{j=0}^{n_0}\binom{n}{j}$ for a shallow network [7]. More generally, Telgarsky [10] proved for semi-algebraic neurons (including ReLU, sigmoid, etc.) that there exist networks with $\Theta(k^3)$ layers and up to a constant number of nodes per layer that require at least $2^k$ nodes to approximate with a network of $O(k)$ layers. Delalleau and Bengio [3] show that deep sum-product networks exist for which a shallow network would
require exponentially more neurons to simulate. For convolutional arithmetic circuits (similar to sum-product networks), Cohen et al. [2], in an important result, show that "besides a negligible (zero measure) set, all functions that can be realized by a deep network of polynomial size, require exponential size in order to be realized, or even approximated, by a shallow network." The above works, except for that of Cohen et al. [2], focus on approximating deep networks with shallow networks, but do not indicate which problems are best attacked with deep networks. For manifolds, Basri and Jacobs [1] show how deep networks can efficiently represent low-dimensional manifolds and that these networks are almost optimal, but they do not discuss the limitations of shallow networks on the same problem. Somewhat similarly, Shaham et al. [8] show that depth-4 networks can approximate a function on a manifold where the number of neurons depends on the complexity of the function and the dimensionality of the manifold, and only weakly on the embedding dimension. Again, they do not discuss the limitations of shallow networks for this problem. Importantly, both of these results are constructive and allow one to actually build the network. Szymanski and McCane [9] show that deep networks can approximate periodic functions of period P over $\{0,1\}^N$ with $O(\log_2 N - \log_2 P)$ parameters versus $O(P\log_2 N)$ for shallow networks. Eldan and Shamir [4] show that networks with two hidden layers exist that can approximate a radially symmetric function with $O(d^{19/4})$ neurons, whereas a network with one hidden layer requires $\Omega(e^d)$ neurons. They do not extend the result to deeper networks. Therefore, evidence is building that deep networks are more powerful than their shallow counterparts in terms of the number of parameters or neurons required. Nevertheless, more work is needed. In particular, it would be useful to determine which
problems are best solved with deep networks, and how to build networks for those particular problems.

In this work we directly extend the work of Szymanski and McCane [9] and Eldan and Shamir [4]. The latter work [4] is extended to deeper networks for approximating radially symmetric functions that require fewer parameters than their shallow counterparts. The former [9] is extended by generalising their notion of folding transformations to work in multiple dimensions and more simply with ReLU networks. The proofs are constructive and allow us to build networks for approximating radially symmetric functions.

3. Context and notation

A radially symmetric function is a function whose value depends only on the norm of its input. We are interested in L-Lipschitz functions f, $|f(x) - f(y)| \le L|x - y|$, as this covers many functions common in classification tasks. The number of dimensions of the input is d, and we assume that f is constant outside a radius R. This is a similar context to that used by Eldan and Shamir [4]. Further, we restrict ourselves to ReLU networks only, which is more restrictive than Eldan and Shamir [4], but allows us to explicitly construct the networks of interest. A network with N layers has N − 1 hidden layers. Layer 0 is the input layer (not counted in the number of layers), and layer N is the output layer. Proofs are only sketched in the main body of the paper; detailed proofs are provided in the supplementary material.

4. 3-layer networks

We start by stating a modified form of Lemma 18 from Eldan and Shamir [4] concerning 3-layer networks.

Lemma 1 (Modified form of Lemma 18 from Eldan and Shamir [4]). Let $\sigma(z) = \max(0, z)$. Let $f : \mathbb{R} \to \mathbb{R}$ be an L-Lipschitz function supported on [0, R]. Then for any δ > 0, there exists a function $g : \mathbb{R}^d \to \mathbb{R}$ expressible by a 3-layer network of width at most $\frac{6d^2R^2 + 3RL}{\delta}$, such that

$$\sup_{x \in \mathbb{R}^d} |g(x) - f(\|x\|)| < \delta + L\sqrt{\delta}.$$
The proof follows the basic plan of Eldan and Shamir [4]: the first layer is the input layer, the second layer approximates $x_i^2$ for each dimension i, and the third layer computes $\sum_i x_i^2$ and approximates f. Since several sections of the second layer do the same thing (computing the square of their input), a weight-sharing corollary follows immediately in which only one copy of the square approximation is needed.

Corollary 1 (3-layer weight sharing). Let $\sigma(z) = \max(0, z)$. Let $f : \mathbb{R} \to \mathbb{R}$ be an L-Lipschitz function supported on [0, R]. Then for any δ > 0, there exists a function $g : \mathbb{R}^d \to \mathbb{R}$ expressible by a 3-layer weight-sharing network with at most $\frac{6dR^2 + 3RL}{\delta}$ weights, such that

$$\sup_{x \in \mathbb{R}^d} |g(x) - f(\|x\|)| < \delta + L\sqrt{\delta}.$$
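To make the one-dimensional building block concrete, here is a minimal numpy sketch (our own illustration, not code from the paper) of a single-hidden-layer ReLU approximation of $t \mapsto t^2$ on $[-R, R]$ written in the form $a + \sum_i \alpha_i\,\sigma(t - \beta_i)$; this is the role played by each square-approximating section of the second layer above, and the same one-dimensional form reappears in Lemma 5 below. The helper names and the uniform knot placement are our own choices.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def square_approx_weights(R, n):
    """ReLU coefficients for a piecewise-linear interpolation of t -> t^2 on [-R, R],
    written as a + sum_i alpha_i * relu(t - beta_i) with n linear pieces."""
    knots = np.linspace(-R, R, n + 1)
    slopes = np.diff(knots ** 2) / np.diff(knots)        # slope of each linear piece
    alphas = np.diff(np.concatenate(([0.0], slopes)))    # slope change introduced at each knot
    return knots[:-1], alphas, knots[0] ** 2             # betas, alphas, offset a

def square_approx(t, betas, alphas, a):
    return a + relu(t[:, None] - betas[None, :]) @ alphas

R, n = 2.0, 64
betas, alphas, a = square_approx_weights(R, n)
t = np.linspace(-R, R, 20001)
err = np.max(np.abs(square_approx(t, betas, alphas, a) - t ** 2))
print(err, (2 * R / n) ** 2 / 8)   # observed error vs. the Delta^2/8 interpolation bound
```

Each of the d square-approximating sections in Lemma 1 can share these same $\alpha_i, \beta_i$, which is exactly the saving expressed in Corollary 1.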
5. Deep folding networks

In this section, we show how folding transformations can be used to create a much deeper network with the same error, but many fewer weights, than needed in Lemma 1. A folding transformation is one in which half of a space is reflected about a hyperplane, and the other half remains unchanged. Fig. 1a demonstrates how a sequence of folding transformations can transform a circle in 2D to a small sector. After enough folds, we can discard the almost-zero coordinates to approximate the norm. We will use this general idea to prove the following theorem:
Theorem 1. Let $x \in \mathbb{R}^d$, and $\sigma(z) = \max(0, z)$. Let $f : \mathbb{R} \to \mathbb{R}$ be an L-Lipschitz function supported on [0, R]. Fix L, δ, R > 0. There exists a function $g : \mathbb{R}^d \to \mathbb{R}$ expressible by an $O\!\left(d\log_2(d) + \log_2(d)\log_2\!\left(\tfrac{R}{\sqrt{\delta}}\right)\right)$-layer network where the number of weights and the number of neurons satisfy $N_w, N_n = O\!\left(d^2 + d\log_2\!\left(\tfrac{R}{\sqrt{\delta}}\right) + \tfrac{3RL}{\delta}\right)$, such that

$$\sup_{x \in \mathbb{R}^d} |g(x) - f(\|x\|)| < \delta + L\sqrt{\delta}.$$
The approach taken here is a constructive one and specifies the architecture of the network needed to approximate f; in fact, all of the weights are specified by the construction. The approach is somewhat different to that used to prove Lemma 1: we build a sequence of layers to directly approximate $\|x\|$ and then approximate f in the last layer. To build our layers, we need a few helper lemmas.

Lemma 2 (2D fold). There exists a function $g : \mathbb{R}^2 \to \mathbb{R}^2$, expressible by a ReLU network with 4 ReLU units and 2 sum units, that can compute a folding transformation about a line through the origin, represented by the unit direction vector $l = (l_x, l_y)^T$. The function g is of the form:

$$g(x) = \begin{cases} x & l \cdot x^{\perp} > 0 \\[4pt] \begin{pmatrix} l_x^2 - l_y^2 & 2 l_x l_y \\ 2 l_x l_y & l_y^2 - l_x^2 \end{pmatrix} x & \text{otherwise.} \end{cases}$$
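As an illustration of Lemma 2 (a sketch under our own conventions, not the authors' code), the function below realises one fold with exactly four ReLU units and two summations by splitting $x$ into its component along the fold line and its component along the perpendicular. Which half-plane is left unchanged depends on the sign convention chosen for $x^{\perp}$, and the fold schedule in the demo (angles $0, \pi/2, 3\pi/4, \dots$, squeezing every point into a narrow sector about the negative x-axis) is one convenient variant of the schedule sketched in Fig. 1a.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def fold2d(x, l):
    """One 2D fold (Lemma 2): points on one side of the line through the origin with
    unit direction l stay fixed, the other side is reflected onto it.
    Uses exactly 4 ReLU units and 2 summations per point."""
    lx, ly = l
    a = lx * x[..., 0] + ly * x[..., 1]     # component along the fold line
    b = -ly * x[..., 0] + lx * x[..., 1]    # component along the perpendicular
    a_thru = relu(a) - relu(-a)             # identity on a, written with ReLUs
    b_abs = relu(b) + relu(-b)              # |b|: folds one half-plane onto the other
    return np.stack([lx * a_thru - ly * b_abs,
                     ly * a_thru + lx * b_abs], axis=-1)

# Usage: stacking folds about lines at angles 0, pi/2, 3pi/4, 7pi/8, ... squeezes every
# point into a sector of angle pi/2^(f-1), so a single coordinate (here the negative
# x-coordinate) approximates the norm, as in Lemma 3.
rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 2))
f = 12
z = pts
for k in range(1, f + 1):
    phi = 0.0 if k == 1 else np.pi * (1.0 - 2.0 ** (1 - k))
    z = fold2d(z, (np.cos(phi), np.sin(phi)))
err = np.max(np.abs(-z[:, 0] - np.linalg.norm(pts, axis=1)))
bound = np.max(np.linalg.norm(pts, axis=1)) * (1 - np.cos(np.pi / 2 ** (f - 1)))
print(err, bound)   # observed error vs. the max-norm * (1 - cos(pi/2^(f-1))) bound
```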
The requisite ReLU network is shown in Fig. 1b. Only one of the nodes labelled $x_-$ ($y_-$) and $x_+$ ($y_+$) is active at any one time, so there are four possible cases depending on which two nodes are active. Note that $x_-$ is active when $l \cdot x^{\perp} < 0$ and $x_+$ is active when $l \cdot x^{\perp} > 0$. To approximate the 2D norm, we simply stack layers of the type shown in Fig. 1b with a suitable choice of $l_x, l_y$ at each layer. Note that the summation nodes are not required, since they can be incorporated into the summations and weights of the next ReLU layer. These 2D folds can be used to estimate the norm of a vector as per the following lemma.

Lemma 3 (Approximate $\|x\|$, $x \in \mathbb{R}^2$, $\|x\| < R$). There exists a function $g : \mathbb{R}^2 \to \mathbb{R}$, expressible by a ReLU network with no more than $\log_2\!\left(\frac{R\pi}{\delta}\right)$ layers and 4 nodes per layer such that:

$$\sup_{x \in \mathbb{R}^2,\, \|x\| \le R} \big|g(x) - \|x\|\big| \le \delta.$$
Proof. The proof is short and simple. After f layers, each data point is within an angle of $\pi/2^f$ of the x-axis. Simple geometry and appropriate approximations lead to a worst-case error of

$$\|x\| - \|x\|\cos\!\left(\frac{\pi}{2^f}\right) \;\le\; R\left(1 - \cos\!\left(\frac{\pi}{2^f}\right)\right) \;=\; 2R\sin^2\!\left(\frac{\pi}{2^{f+1}}\right) \;\le\; \frac{R\pi^2}{2^{2f+1}},$$

which is at most δ whenever $f \ge \log_2\!\left(\frac{R\pi}{\delta}\right)$ (in the regime of interest, δ ≤ 2R). □
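As a quick numerical check of this prescription (a hypothetical example using the σ = 4, δ = 0.01 setting from the simulations in Section 6):

```python
import numpy as np

R, delta = 3 * 4.0 * np.pi / 2, 0.01            # sigma = 4, so R = 3*sigma*pi/2
f = int(np.ceil(np.log2(R * np.pi / delta)))    # fold layers prescribed by Lemma 3
worst = R * np.pi ** 2 / 2 ** (2 * f + 1)       # R(1 - cos(pi/2^f)) <= R*pi^2/2^(2f+1)
print(f, worst, worst <= delta)                 # 13 layers; worst-case error ~1.4e-6
```

The slack reflects the loosening in the chain above; the worst-case error falls by roughly a factor of four per additional fold layer.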
The following lemma generalises this construction to folds in d dimensions.

Lemma 4 (Approximate $\|x\|$, $x \in \mathbb{R}^d$). There exists a function $g : \mathbb{R}^d \to \mathbb{R}$, expressible by a ReLU network whose number of layers $N_l$, neurons $N_n$, and weights $N_w$ satisfy

$$N_l \le \log_2(d)\,\log_2\!\left(\frac{R\pi}{\delta}\left(2^{\frac{d+1}{2}} - 1\right) + \frac{\sqrt{d}}{2}\left(2^{\frac{d}{2}} - 1\right)\right),$$

$$N_n \le 4(d-1)\,\log_2\!\left(\frac{R\pi}{\delta}\left(2^{\frac{d+1}{2}} - 1\right) + \frac{\sqrt{d}}{2}\left(2^{\frac{d}{2}} - 1\right)\right),$$

$$N_w \le 8(d-1)\,\log_2\!\left(\frac{R\pi}{\delta}\left(2^{\frac{d+1}{2}} - 1\right) + \frac{\sqrt{d}}{2}\left(2^{\frac{d}{2}} - 1\right)\right),$$

such that

$$\sup_{x \in \mathbb{R}^d,\, \|x\| \le R} \big|g(x) - \|x\|\big| \le \delta.$$

[Fig. 1. Folding in 2D. (a) Series of folding transformations for 2D. (b) A network to produce a 2D fold: ReLU nodes $x_-$, $y_-$, $x_+$, $y_+$ fed by weights $\pm l_x$, $\pm l_y$, followed by two summation nodes.]
We note that a fold in a 2D plane in $\mathbb{R}^d$ will leave all coordinates perpendicular to the plane unchanged. We can therefore apply the approximation of Lemma 2 to pairs of input coordinates to produce d/2 new coordinates, then apply the same reduction to produce d/4 coordinates, and continue in this way until there is only one coordinate left. In effect, we are calculating the norm via the following scheme:

$$\sqrt{\sum_{i=1}^{d} x_i^2} \;=\; \sqrt{\Biggl(\sqrt{\Bigl(\sqrt{x_1^2+x_2^2}\Bigr)^{2} + \Bigl(\sqrt{x_3^2+x_4^2}\Bigr)^{2}}\Biggr)^{\!2} + \cdots + \Biggl(\sqrt{\Bigl(\sqrt{x_{d-3}^2+x_{d-2}^2}\Bigr)^{2} + \Bigl(\sqrt{x_{d-1}^2+x_d^2}\Bigr)^{2}}\Biggr)^{\!2}}$$
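The pairwise reduction can be sketched in a few lines (our own illustration; np.hypot stands in for the approximate 2D fold network of Lemma 3, which is where the approximation error would actually enter):

```python
import numpy as np

def norm_by_pairing(x):
    """Pairwise reduction of Fig. 2: repeatedly replace coordinate pairs by their 2D
    norm until one coordinate remains (log2(d) levels)."""
    z = np.asarray(x, dtype=float)
    while z.shape[-1] > 1:
        if z.shape[-1] % 2:                                    # pad odd widths with a zero
            z = np.concatenate([z, np.zeros(z.shape[:-1] + (1,))], axis=-1)
        z = np.hypot(z[..., 0::2], z[..., 1::2])               # d/2, then d/4, ... pairs
    return z[..., 0]

x = np.random.default_rng(1).normal(size=(5, 8))
print(np.max(np.abs(norm_by_pairing(x) - np.linalg.norm(x, axis=-1))))  # zero up to float error
```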
Fig. 2 shows the resulting network. The proof is by induction and is rather long, so it is not reproduced here; it appears in full in the supplementary material.
[Fig. 2. A network to approximate the norm of a vector: layer 1 applies d/2 fold networks to the input pairs $(x_1, x_2), \dots, (x_{d-1}, x_d)$; layer 2 applies d/4 fold networks to the resulting pairs; and so on up to layer $\log_2(d)$, whose single output approximates $\sqrt{\sum_{i=1}^{d} x_i^2}$. Each fold network consists of multiple layers.]
At this point we make use of Lemma 19 from Eldan and Shamir [4], which we reproduce here:

Lemma 5 (Lemma 19 from [4]). Let $\sigma(z) = \max(0, z)$ be the ReLU activation function, and fix L, δ, R > 0. Let $f : \mathbb{R} \to \mathbb{R}$ be constant outside an interval [−R, R]. There exist scalars $a, \{\alpha_i, \beta_i\}_{i=1}^{w}$, where $w \le \frac{3RL}{\delta}$, such that the function

$$h(x) = a + \sum_{i=1}^{w} \alpha_i\,\sigma(x - \beta_i) \qquad (1)$$

is L-Lipschitz and satisfies

$$\sup_{x \in \mathbb{R}} |h(x) - f(x)| \le \delta. \qquad (2)$$

Moreover, one has $|\alpha_i| \le 2L$ and $w \le \frac{3RL}{\delta}$.

We can now prove the main theorem (some steps are left out for brevity).

Theorem 1. Let $x \in \mathbb{R}^d$, and $\sigma(z) = \max(0, z)$. Let $f : \mathbb{R} \to \mathbb{R}$ be an L-Lipschitz function supported on [0, R]. Fix L, δ, R > 0. There exists a function $g : \mathbb{R}^d \to \mathbb{R}$ expressible by an $O\!\left(d\log_2(d) + \log_2(d)\log_2\!\left(\tfrac{R}{\sqrt{\delta}}\right)\right)$-layer network where the number of weights and the number of neurons satisfy $N_w, N_n = O\!\left(d^2 + d\log_2\!\left(\tfrac{R}{\sqrt{\delta}}\right) + \tfrac{3RL}{\delta}\right)$, such that

$$\sup_{x \in \mathbb{R}^d} |g(x) - f(\|x\|)| < \delta + L\sqrt{\delta}.$$

Proof. From Lemma 4 we can approximate $\|x\|$ to within $\sqrt{\delta}$, and the function h from Lemma 5 is L-Lipschitz and within δ of f, so

$$f(\|x\|) - L\sqrt{\delta} - \delta \;\le\; h\!\left(\|x\| + \sqrt{\delta}\right) \;\le\; f(\|x\|) + L\sqrt{\delta} + \delta$$

and

$$f(\|x\|) - L\sqrt{\delta} - \delta \;\le\; h\!\left(\|x\| - \sqrt{\delta}\right) \;\le\; f(\|x\|) + L\sqrt{\delta} + \delta,$$

and therefore

$$\sup_{x \in \mathbb{R}^d} |g(x) - f(\|x\|)| < \delta + L\sqrt{\delta}.$$

The number of weights and neurons required by Lemma 5 is $\frac{3RL}{\delta}$. The number of weights and neurons required to estimate $\|x\|$ is given by Lemma 4 (substituting $\sqrt{\delta}$ for δ). Stacking the network from Lemma 5 (one layer) onto the end of the network from Lemma 4 ($O\!\left(d\log_2(d) + \log_2(d)\log_2\!\left(\tfrac{R}{\sqrt{\delta}}\right)\right)$ layers) thus requires a total number of neurons no more than

$$N_n \le 4(d-1)\,\log_2\!\left(\frac{R\pi}{\sqrt{\delta}}\left(2^{\frac{d+1}{2}} - 1\right) + \frac{\sqrt{d}}{2}\left(2^{\frac{d}{2}} - 1\right)\right) + \frac{3RL}{\delta} = O\!\left(d^2 + d\log_2\!\left(\frac{R}{\sqrt{\delta}}\right) + \frac{3RL}{\delta}\right)$$

and a total number of weights no more than

$$N_w \le 8(d-1)\,\log_2\!\left(\frac{R\pi}{\sqrt{\delta}}\left(2^{\frac{d+1}{2}} - 1\right) + \frac{\sqrt{d}}{2}\left(2^{\frac{d}{2}} - 1\right)\right) + \frac{3RL}{\delta} = O\!\left(d^2 + d\log_2\!\left(\frac{R}{\sqrt{\delta}}\right) + \frac{3RL}{\delta}\right). \qquad \square$$

Again, there is an obvious weight-sharing corollary:

Corollary 2 (Deep weight-sharing network). Let $x \in \mathbb{R}^d$, and $\sigma(z) = \max(0, z)$. Let $f : \mathbb{R} \to \mathbb{R}$ be an L-Lipschitz function supported on [0, R]. Fix L, δ, R > 0. There exists a function $g : \mathbb{R}^d \to \mathbb{R}$ expressible by a network where the number of weights is at most $N_w = O\!\left(d + \log_2\!\left(\tfrac{R}{\sqrt{\delta}}\right) + \tfrac{3RL}{\delta}\right)$, such that

$$\sup_{x \in \mathbb{R}^d} |g(x) - f(\|x\|)| < L\sqrt{\delta} + \delta.$$

Comparing Theorem 1 to Lemma 1, both are of order $d^2$; however, the folding-network version is more efficient in terms of R and δ. This means the deeper network can be much more efficient when either d or R is large, or δ is small.

6. Simulations
[Fig. 3. Comparing the number of neurons between deep networks and three-layer networks. All the deep network curves are at the bottom of the plot. The y-axis is on a log scale.]
To validate the theoretical results we have built deep and three-layer approximation networks for a simple radially symmetric function and counted the actual number of neurons in each. The results further demonstrate how much better the deep network is compared to the three-layer network. The function to approximate is $\psi_{3,1}$ from Fornefett et al. [5], which is a compact-support approximation of the Gaussian kernel. It is defined as

$$\psi_{3,1}(x, \sigma) = \left(1 - \frac{\|x\|}{3\sigma\pi/2}\right)_{+}^{4}\left(\frac{4\|x\|}{3\sigma\pi/2} + 1\right), \qquad (3)$$

where σ is the standard deviation of the related Gaussian, and the + subscript on the first bracketed term indicates that negative values are thresholded to 0 prior to taking the exponent. The function becomes 0 at $R = 3\sigma\pi/2$. This function was chosen because Gaussians are so widespread, yet they cannot be directly approximated using the methods in this paper because they have infinite support. We have produced results by varying the number of dimensions of the input, 2 ≤ d ≤ 128, the maximum allowed error, δ ∈ {0.1, 0.01}, and the standard deviation of the Gaussian, σ ∈ {1.0, 2.0, 4.0} (R = 3σπ/2). Fig. 3 shows the results, directly comparing deep and three-layer networks as the number of dimensions increases. It should be clear from the figure that the deep network is many orders of magnitude more efficient than the three-layer network, and this holds true even for very few dimensions. For example, with only two dimensions, δ = 0.01 and σ = 4.0, the three-layer network requires more than 180,000 neurons compared to only 298 for the deep network. There are two other points of note made obvious by the plots. First, although both networks are $O(d^2)$, the three-layer network is much more sensitive to δ and σ, as the theory predicts. Second, neither of these networks provides a practical way of approximating radially symmetric functions. The result is of theoretical interest: if all you have is a ReLU network, then for radially symmetric functions, deeper is much better.
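For reference, a direct implementation of Eq. (3) as reconstructed above (the function and the $3\sigma\pi/2$ scaling follow the text; the code itself is our own sketch):

```python
import numpy as np

def psi31(x, sigma):
    """Compactly supported Wendland-type function psi_{3,1} of Eq. (3); the support
    radius R = 3*sigma*pi/2 follows the scaling used in the text."""
    R = 3.0 * sigma * np.pi / 2.0
    r = np.linalg.norm(np.atleast_2d(x), axis=-1) / R
    return np.clip(1.0 - r, 0.0, None) ** 4 * (4.0 * r + 1.0)   # threshold before the 4th power

print(psi31(np.zeros((1, 3)), sigma=1.0))        # [1.] at the origin
print(psi31(np.full((1, 3), 10.0), sigma=1.0))   # [0.] outside the support radius
```

At the origin the kernel equals 1 and it vanishes for $\|x\| \ge R = 3\sigma\pi/2$, mirroring the compact support assumed throughout.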
7. Discussion

We have derived a new upper bound for deep networks approximating radially symmetric functions and have shown that deeper networks are more efficient than the 3-layer network of Eldan and Shamir [4]. The central concept in this construction is a space fold: halving the volume of space that each subsequent layer needs to handle. We hypothesise that to take full advantage of deep networks we need to apply operations that work on multiple areas of the input space simultaneously, analogous to taking advantage of the multiple linear regions noted by Montufar et al. [7]. Folding transformations are one way to ensure that any operation applied in a later layer is simultaneously applied to many regions in the input layer. We believe there are many other possible transformations, but the reflections used here might be considered fundamental in a sense due to the Cartan–Dieudonné–Scherk theorem, which states that all orthogonal transformations can be decomposed into a sequence of reflections. We are yet to fully investigate the consequences of this theorem.

Acknowledgements

Funding: The Titan X used for this research was donated by the NVIDIA Corporation.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.neucom.2018.06.003.

References

[1] R. Basri, D. Jacobs, Efficient representation of low-dimensional manifolds using deep networks, 2016, arXiv:1602.04723.
[2] N. Cohen, O. Sharir, A. Shashua, On the expressive power of deep learning: a tensor analysis, in: JMLR: Workshop and Conference Proceedings, 49, 2016, pp. 1–31.
[3] O. Delalleau, Y. Bengio, Shallow vs. deep sum-product networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2011, pp. 666–674.
[4] R. Eldan, O. Shamir, The power of depth for feedforward neural networks, in: JMLR: Workshop and Conference Proceedings, 49, 2016, p. 134.
[5] M. Fornefett, K. Rohr, H.S. Stiehl, Radial basis functions with compact support for elastic registration of medical images, Image Vis. Comput. 19 (1) (2001) 87–96.
[6] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[7] G.F. Montufar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2924–2932.
[8] U. Shaham, A. Cloninger, R.R. Coifman, Provable approximation properties for deep neural networks, Appl. Comput. Harmon. Anal. 44 (3) (2018) 537–557.
[9] L. Szymanski, B. McCane, Deep networks are effective encoders of periodicity, IEEE Trans. Neural Netw. Learn. Syst. 25 (10) (2014) 1816–1827.
[10] M. Telgarsky, Benefits of depth in neural networks, in: JMLR: Workshop and Conference Proceedings, 49, 2016, pp. 1–23.
Brendan McCane received the B.Sc. (Hons.) and Ph.D. degrees from the James Cook University of North Queensland, Townsville City, Australia, in 1991 and 1996, respectively. He joined the Computer Science Department, University of Otago, Otago, New Zealand, in 1997. He served as the Head of the Department from 2007 to 2012. His current research interests include computer vision, pattern recognition, machine learning, and medical and biological imaging. He also enjoys reading, swimming, fishing and long walks on the beach with his dogs.
Lech Szymanski received the B.A.Sc. (Hons.) degree in computer engineering and the M.A.Sc. degree in electrical engineering from the University of Ottawa, Ottawa, ON, Canada, in 2001 and 2005, respectively, and the Ph.D. degree in computer science from the University of Otago, Otago, New Zealand, in 2012. He took up a Lecturer position at the Computer Science Department, University of Otago, New Zealand in 2016. His current research interests include machine learning, artificial neural networks, and deep architectures.