Probing the geometry of data with diffusion Fréchet functions

Probing the geometry of data with diffusion Fréchet functions

Appl. Comput. Harmon. Anal. 47 (2019) 935–947 Contents lists available at ScienceDirect Applied and Computational Harmonic Analysis www.elsevier.com...

929KB Sizes 0 Downloads 21 Views

Appl. Comput. Harmon. Anal. 47 (2019) 935–947

Contents lists available at ScienceDirect

Applied and Computational Harmonic Analysis www.elsevier.com/locate/acha

Probing the geometry of data with diffusion Fréchet functions Diego H. Díaz Martínez a , Christine H. Lee b , Peter T. Kim b,c , Washington Mio a,∗ a

Department of Mathematics, Florida State University, Tallahassee, FL 32306-4510, USA Department of Pathology and Molecular Medicine McMaster University, St Joseph’s Healthcare, 50 Charlton Avenue E, 424 Luke Wing, Hamilton, ON L8N4A6, Canada c Department of Mathematics and Statistics, University of Guelph, Guelph, ON N1G 2W1, Canada b

a r t i c l e

i n f o

Article history: Received 10 May 2016 Received in revised form 24 November 2017 Accepted 31 January 2018 Available online 13 February 2018 Communicated by Dominique Picard MSC: 62-07 92C50 Keywords: Distributions on networks Fréchet functions Diffusion distances Community detection

a b s t r a c t The state of many complex systems, such as ecosystems formed by multiple microbial taxa that interact in intricate ways, is often summarized as a probability distribution on the nodes of a weighted network. This paper develops methods for modeling the organization of such data, as well as their Euclidean counterparts, across spatial scales. Using the notion of diffusion distance, we introduce diffusion Fréchet functions and diffusion Fréchet vectors associated with probability distributions on Euclidean space and the vertex set of a weighted network, respectively. We prove that these functional statistics are stable with respect to the Wasserstein distance between probability measures, thus yielding robust descriptors of their shapes. We provide several examples that illustrate the geometric characteristics of a distribution that are captured by multi-scale Fréchet functions and vectors. © 2018 Elsevier Inc. All rights reserved.

1. Introduction The primary goal of this paper is to study the shape of probability distributions on the nodes of weighted networks such as those describing the evolving states of complex ecosystems. For example, a microbial community may be modeled on a network in which each node represents a taxon and the main interactions between pairs of taxa are represented by weighted edges, the weights reflecting levels of interaction. In this formulation, the relative abundance of microbial taxa in a sample may be viewed as a probability distribution on the vertex set of the network. Modeling and quantifying the organization and variation of communities across samples and scales are core problems. Similar questions arise in other contexts such as probability measures defined on Euclidean spaces or other metric spaces. We develop methods that (i) capture the geometry of data and probability measures across a continuum of spatial scales and (ii) integrate information * Corresponding author. E-mail address: [email protected] (W. Mio). https://doi.org/10.1016/j.acha.2018.01.003 1063-5203/© 2018 Elsevier Inc. All rights reserved.

936

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

obtained at all scales. In this paper, we focus on the Euclidean and network cases. We analyze distributions with the aid of a 1-parameter family of diffusion distances dt , t > 0, where t is treated as the scale parameter [1]. The definition of dt is recalled in Section 2. For (Borel) probability measures α on Rd , our approach is based on a functional statistic Vα,t : Rd → R derived from dt and termed the diffusion Fréchet function of α at scale t. Similarly, we define diffusion Fréchet vectors Fξ,t associated with a distribution ξ on the vertex set of a network. The (classical) Fréchet function of a random variable y ∈ Rd with finite second moment and distributed according to a probability measure α is defined as 

Vα (x) = Eα y − x

2



ˆ y − x2 dα(y) .

=

(1)

Rd

It is well-known that the mean (or expected value) of y is the unique minimizer of Vα . This is a useful statistic, however, it fails to capture the most basic characteristics of the shape of multimodal or more complex distributions. Aiming at more effective descriptors, we introduce diffusion Fréchet functions and show that they capture richer information about the shape of α. At scale t > 0, we define the diffusion Fréchet function Vα,t : Rd → R by   Vα,t (x) = Eα d2t (y, x) =

ˆ d2t (y, x) dα(y) ,

(2)

Rd

simply replacing the Euclidean distance in (1) with the diffusion distance dt . Vα,t (x) may be viewed as a localized second moment of α about x. In the network setting, Fréchet vectors are defined in a similar manner through the diffusion kernel associated with the graph Laplacian. Unlike Vα , diffusion Fréchet functions and vectors typically exhibit rich profiles that reflect the shape of the distribution at different scales. This point is illustrated with simulated data in Section 3. We prove stability theorems for diffusion Fréchet functions and vectors with respect to the Wasserstein distance between probability measures [2]. The results ensure that both Vα,t and Fξ,t yield robust descriptors, useful for data analysis. Community detection in networks has been studied with a variety of methods (cf. [3–5]), including the problem of selecting particular scales that seem to be most relevant to a problem. Over the years, there has been a growing trend in data analysis that takes the somewhat dual view (cf. [6–8]) that multiscale data summaries are especially informative, as important features of a distribution may be revealed at different scales. This is the point of view adopted in this paper, but not to say that identifying important scales is relegated to a secondary plane. Rather, the view is that machine learning and other techniques may be applied to a multiscale summary to identify combinations of scales that are most informative. Another important angle in a multi-scale formulation is that persistence of features across scales is a powerful indicator of their relevance. We provide a simple illustration of this point in Section 4. We also point out that the present method is tailored to probability measures on weighted networks, not just structures solely characterized by the underlying network. This is a key difference that allows us to use the method, for example, to model variation in the organization of ecosystems such as the microbiome of different individuals. As a matter of fact, this paper was largely motivated by the problem of understanding changes in the composition of the human gut microbiome related to Clostridium difficile infection and treatment with fecal microbiota transplantation [9]. The rest of the paper is organized as follows. Section 2 reviews the concept of diffusion distance in the Euclidean and network settings. We define multiscale diffusion Fréchet functions in Section 3 and prove that they are stable with respect to the Wasserstein distance between probability measures in Euclidean spaces. The network counterpart is developed in Section 4. We conclude with a summary and some discussion in Section 5.

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

937

2. Diffusion distances Our multiscale approach to probability measures uses a formulation derived from the notion of diffusion distance [1]. In this section, we review diffusion distances associated with the heat kernel on Rd , followed by a discrete analogue for weighted networks. For t > 0, let Gt : Rd × Rd → R be the diffusion (heat) kernel Gt (x, y) =

  y − x2 1 exp − , Cd (t) 4t

(3)

where Cd (t) = (4πt)d/2 . If an initial distribution of mass on Rd is described by a Borel probability measure α and its time evolution is governed by the heat equation ∂t u = Δu, then the distribution at time t has a smooth density function αt (y) = u(y, t) given by the convolution ˆ u(y, t) =

Gt (x, y) dα(x) .

(4)

Rd

For an initial point mass located at x ∈ R, α is the Dirac measure δx and u(y, t) = Gt (x, y). The diffusion map Ψt : Rd → L2 (Rd ), given by Ψt (x) = Gt (x, ·), embeds Rd into L2 (Rd ). The diffusion distance dt on Rd is defined as dt (x1 , x2 ) = Ψt (x1 ) − Ψ2 (x2 )2 ,

(5)

the metric that Rd inherits from L2 (Rd ) via the embedding ψt . A calculation shows that d2t (x1 , x2 ) = Ψt (x1 )22 + Ψt (x2 )22 − 2G2t (x1 , x2 )   1 − G2t (x1 , x2 ) , =2 Cd (2t)

(6)

where we used the fact that Ψt (x)22 = 1/Cd (2t), for every x ∈ Rd . Note that this implies that the metric space (Rd , dt ) has finite diameter. Diffusion distances on weighted networks may be defined in a similar way by invoking the graph Laplacian and the associated diffusion kernel. Let v1 , . . . , vn be the nodes of a weighted network K. The weight of the edge between vi and vj is denoted wij , with the convention that wij = 0 if there is no edge between the nodes. We let W be the n × n matrix whose (i, j)-entry is wij . The graph Laplacian is the n × n n matrix Δ = D − W , where D is the diagonal matrix with dii = k=1 wik , the sum of the weights of all edges incident with the node vi (cf. [10]). This definition is based on a finite difference discretization of the Laplacian, except for a sign that makes Δ a positive semi-definite, symmetric matrix. The diffusion (or heat) kernel at t > 0 is the matrix e−tΔ . Let ξ be a probability distribution on the vertex set V of K. If ξi ≥ 0, 1 ≤ i ≤ n, is the probability of the vertex vi , we write ξ as the vector ξ = [ξ1 . . . ξn ]T ∈ Rn , where T denotes transposition and ξ1 +. . .+ξn = 1. Note that u = e−tΔ ξ solves the heat equation ∂t u = −Δu with initial condition ξ. (The negative sign is due to the convention made in the definition of Δ.) Mass initially distributed according to ξ, whose diffusion is governed by the heat equation, has time t distribution ut = e−tΔ ξ. A point mass at the ith vertex is described by the vector ei ∈ Rn , whose jth entry is δij . Thus, e−tΔ ei may be viewed as a network analogue of the Gaussian Gt (x, ·). For t > 0, the diffusion mapping ψt : V → Rn is defined by ψt (vi ) = e−tΔ ei and the time t diffusion distance between the vertices vi and vj by dt (i, j) = e−tΔ ei − e−tΔ ej  ,

(7)

938

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

where  ·  denotes Euclidean norm. If 0 = λ1 ≤ . . . ≤ λn are the eigenvalues of Δ with orthonormal eigenvectors φ1 , . . . , φn , then d2t (i, j) =

n 

e−2λk t (φk (i) − φk (j)) , 2

(8)

k=1

where φk (i) denotes the ith component of φk [1]. 3. Diffusion Fréchet functions The classical Fréchet function Vα of a probability measure α on Rd with finite second moment is a useful statistic. However, for complex distributions, such as those with a multimodal profile or more intricate geometry, their Fréchet functions usually fail to provide a good description of their shape. Aiming at more effective descriptors, we introduce diffusion Fréchet functions and show that they encode a wealth of information across a full range of spatial scales. At scale t > 0, define the diffusion Fréchet function Vα,t : Rd → R by ˆ d2t (y, x) dα(y) .

Vα,t (x) =

(9)

Rd

Note that Vα,t is defined for any Borel probability measure α, not just those with finite second moment, because (Rd , dt ) has finite diameter. Moreover, Vα,t is uniformly bounded. We also point out that this construction is distinct from the standard kernel trick that pushes α forward to a probability measure (ψt )∗ (α) on L2 (Rd ), where the usual Fréchet function of (ψt )∗ (α) can be used to define such notions as an extrinsic mean of α. Vα,t is an intrinsic statistic and typically a function on Rd with a complex profile, as we shall see in the examples below. From (2) and (6), it follows that Vα,t/2 (x) =

2 −2 Cd (t)

ˆ Gt (x, y) dα(y) =

Rd

2 − 2αt (x) , Cd (t)

(10)

where αt (x) = u(x, t), the solution of the heat equation ∂t u = Δu with initial condition α. If p1 , . . . , pn ∈ Rd are data points sampled from α, then the diffusion Fréchet function of the empirical n n measure αn = n1 i=1 δpi is closely related to the Gaussian density estimator α n,t = n1 i=1 Gt (x, pi ) derived from the sample points. Indeed, it follows from (10) that 2 −2 Vαn ,t/2 (x) = Cd (t)

ˆ Gt (x, y) dαn (y) =

Rd

2 − 2 αn,t (x) . Cd (t)

(11)

Thus, diffusion Fréchet functions provide a new interpretation of Gaussian density estimators, essentially as second moments with respect to diffusion distances. This opens up interesting new perspectives. For example, the classical Fréchet function Vα : Rd → R (see (1)) may be viewed as the trace of the covariance tensor field Σα : Rd → Rd ⊗ Rd given by ˆ Σα (x) = Eα [(y − x) ⊗ (y − x)] =

(y − x) ⊗ (y − x) dα(y) .

(12)

Rd

Thus, it is natural to ask whether it is possible to define diffusion covariance tensor fields Σα,t that capture the modes of variation of α, about each x ∈ Rd at all scales, bearing a close relationship to Vα,t . A multiscale approach to covariance along related lines has been developed in [11,12].

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

939

Fig. 1. Evolution of the diffusion Fréchet function across scales.

The following examples illustrate that Fréchet functions encode valuable structural and dynamical information about datasets. Example 1. Here we consider the dataset highlighted in blue in Fig. 1, comprising n = 400 points p1 , . . . , pn on the real line, grouped into two clusters. The figure shows the evolution of the diffusion Fréchet function n for the empirical measure αn = i=1 δpi /n for increasing values of the scale parameter. At scale t = 0.1, the Fréchet function has two well defined “valleys”, each corresponding to a data cluster. At smaller scales, Vαn ,t captures finer properties of the organization of the data, whereas at larger scales Vαn ,t essentially views the data as comprising a single cluster. For probability distributions on the real line, the Fréchet √ function levels off to 1/ 2πt, as |x| → ∞. The local minima of Vα,t provide a generalization of the mean of the distribution and a summary of the organization of the data into sub-communities across multiple scales. Elucidating such organization in data in an easily interpretable manner is one of our principal aims. Example 2. In this example, we consider the dataset formed by n = 1000 points in R2 , shown on the first panel of Fig. 2. The other panels show the diffusion Fréchet functions for the corresponding empirical measure αn , calculated at increasing scales, and the gradient field −∇Vαn at t = 2, whose behavior reflects the organization of the data at that scale. Example 3. This example illustrates the evolution of a dataset consisting of n = 3158 points under the (negative) gradient flow of the Fréchet function of the associated empirical measure at scale t = 0.2. Panel (a) in Fig. 3 shows the original data and the other panels show various stages of the evolution towards the attractors of the system. This may be viewed as a technique for simplifying a dataset at a given scale. We now prove stability results for Vα,t and its gradient field ∇Vα,t , which provide a basis for their use as robust functional statistics in data analysis. We begin by reviewing the definition of the Wasserstein distance between two probability measures. A coupling between two probability measures α and β is a probability measure μ on Rd × Rd such that (p1 )∗ (μ) = α and (p2 )∗ (μ) = β. Here, p1 and p2 denote the projections onto the first and second coordinates, respectively. The set of all such couplings is denoted Γ(α, β).

940

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

Fig. 2. Data points in R2 , heat maps of their diffusion Fréchet functions at increasing scales, and the gradient field at t = 2.

Fig. 3. Panel (a) shows the original data and panels (b)–(f) display various stages of the evolution of the data towards the attractors of the gradient flow of the Fréchet function at scale t = 0.2.

For each p ∈ [1, ∞), let Pp (Rd ) denote the collection of all Borel probability measures α on Rd whose ´ pth moment Mp (α) = yp dα(y) is finite. Definition 1 (cf. [2]). Let α, β ∈ Pp (Rd ). The p-Wasserstein distance Wp (α, β) is defined as ⎛

¨

⎜ Wp (α, β) = ⎝inf μ

⎞1/p ⎟ x − yp dμ(x, y)⎠

,

(13)

Rd ×Rd

where the infimum is taken over all μ ∈ Γ(α, β). It is well-known that the infimum in (13) is realized by some μ ∈ Γ(α, β) [2]. Moreover, if p > q ≥ 1, then Pp (Rd ) ⊂ Pq (Rd ) and Wq (α, β) ≤ Wp (α, β).

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

941

The following lemma will be useful in the proof of the stability results. Part (a) of the lemma is a special case of Lemma 1 of [12]. Let Kt : Rd → R be the heat kernel centered at zero. In the notation used in (3), Kt (y) = Gt (0, y). Lemma 1. Let Cd (t) = (4πt)d/2 . For any y1 , y2 ∈ Rd and t > 0, we have: y1 − y2  √ ; Cd (t) 2t · e e+2 y1 − y2 . (b) y1 Kt (y1 ) − y2 Kt (y2 ) ≤ eCd (t) (a) |Kt (y1 ) − Kt (y2 )| ≤

Proof. (a) For each ξ ∈ [0, 1], let y(ξ) = ξy1 + (1 − ξ)y2 . Then,  1  ˆ  ˆ1      d d Kt (y(ξ)) dξ  ≤  Kt (y(ξ)) dξ Kt (y1 ) − Kt (y2 ) =  dξ  dξ  0

0

ˆ1 |∇Kt (y(ξ)) · (y1 − y2 )| dξ

=

(14)

0

ˆ1 ≤ y1 − y2 

∇Kt (y(ξ)) dξ . 0

Note that   y2 1 y y √ exp − ≤ ∇Kt (y) =  Kt (y) ≤ . 2t 2tCd (t) 4t Cd (t) 2t · e

(15)

   2 In the last inequality we used the fact that y exp − y ≤ 2t 4t e . Hence, |Kt (y1 ) − Kt (y2 )| ≤

y1 − y2  √ . Cd (t) 2t · e

(b) Using the same notation as in (a), Kt (y) d [yKt (y)] = Kt (y)(y1 − y2 ) − y y · (y2 − y2 ) . dξ 2t

(16)

    2  d  [yKt (y)] ≤ Kt (y) + y Kt (y) y2 − y2    dξ 2t   2 1 1+ y1 − y2  . ≤ Cd (t) e

(17)

Hence,

In the last inequality we used the facts that Kt (y) ≤ 1/Cd (t) and y2 Kt (y) ≤ 4t/eCd (t). Writing ˆ1 y1 Kt (y1 ) − y2 Kt (y2 ) = 0

d [yKt (y)] dξ , dξ

(18)

942

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

it follows from (17) and (18) that y1 Kt (y1 ) − y2 Kt (y2 ) ≤

1 Cd (t)

 1+

2 e

 y1 − y2  ,

(19)

as claimed. 2 Theorem 1 (Stability of Fréchet functions). Let α and β be Borel probability measures on Rd with diffusion Fréchet functions Vα,t and Vβ,t , t > 0, respectively. If α, β ∈ P1 (Rd ), then 1 √ W1 (α, β) . Cd (2t) t · e

Vα,t − Vβ,t ∞ ≤

Proof. Fix t > 0 and x ∈ Rd . Let μ ∈ Γ(α, β) be a coupling such that ¨ z1 − z2 dμ(z1 , z2 ) = W1 (α, β) .

(20)

Rd ×Rd

Then, we may write ¨ Vα,t (x) =

d2t (z1 , x) dμ(z1 , z2 )

(21)

d2t (z2 , x) dμ(z1 , z2 ).

(22)

Rd ×Rd

and ¨ Vβ,t (x) = Rd ×Rd

By (6), d2t (z, x) = 2/Cd (2t) − 2G2t (z, x). Hence, ¨ Vα,t (x) − Vβ,t (x) = −2

(G2t (z1 , x) − G2t (z2 , x)) dμ(z1 , z2 ) ,

(23)

|G2t (z1 , x) − G2t (z2 , x)| dμ(z1 , z2 ) .

(24)

Rd ×Rd

which implies that ¨ |Vα,t (x) − Vβ,t (x)| ≤ 2 Rd ×Rd

After translating α and β, we may assume that x = 0. Thus, by Lemma 1(a), |Vα,t (x) − Vβ,t (x)| ≤

1 √ Cd (2t) t · e

¨ z1 − z2  dμ(z1 , z2 ) Rd ×Rd

(25)

1 √ = W1 (α, β) , Cd (2t) t · e as claimed. 2 Theorem 2 (Stability of gradient fields). Let α and β be Borel probability measures on Rd with diffusion Fréchet functions Vα,t and Vβ,t , t > 0, respectively. If α, β ∈ P1 (Rd ), then

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

sup ∇Vα,t (x) − ∇Vβ,t (x) ≤ x∈Rd

943

e+2 W1 (α, β) . eCd (2t)

Proof. Let μ ∈ Γ(α, β) be a coupling that realizes W1 (α, β). From (23), it follows that ¨ ∇Vα,t (x) − ∇Vβ,t (x) ≤ 2 ∇x G2t (z1 , x) − ∇x G2t (z2 , x) dμ(z1 , z2 ) ¨ 1 (x − z1 )G2t (z1 , x) − (x − z2 )G2t (z2 , x) dμ(z1 , z2 ) . ≤ 2t

(26)

We may assume that x = 0, so we rewrite the previous inequality as ∇Vα,t (x) − ∇Vβ,t (x) ≤

1 2t

¨ z1 K2t (z1 ) − z2 K2t (z2 ) dμ(z1 , z2 ) .

(27)

Therefore, by Lemma 1(b), ¨ e+2 z1 − z2 dμ(z1 , z2 ) eCd (2t) e+2 = W1 (α, β) , eCd (2t)

∇Vα,t (x) − ∇Vβ,t (x) ≤

(28)

which proves the theorem. 2 Remark. It is well known that the Wasserstein distance metrizes weak convergence of probability measures. This implies that if y1 , . . . , yn are independent random variables distributed according to α ∈ P1 (Rd ) and αn is the corresponding empirical measure, then w1 (αn , α) → 0 almost surely, as n → ∞ [13]. Hence, the stability theorem ensures that the Fréchet function of α and its gradient field may be estimated reliably from sufficient, independently sampled data. Another basic question is how well one may estimate the Fréchet function of α at a given sample size n. Stability combined with convergence rate estimates by Fournier and Guillin [14] provide insights in this direction (cf. [12]). 4. Diffusion Fréchet vectors on networks In this section, we define a network analogue of diffusion Fréchet functions and prove a stability theorem. Let ξ = [ξ1 . . . ξn ]T ∈ Rn represent a probability distribution on the vertex set V = {v1 , . . . , vn } of a weighted network K. For t > 0, define the diffusion Fréchet vector (DFV) as the vector Fξ,t ∈ Rn whose ith component is

Fξ,t (i) =

n 

d2t (i, j)ξj .

(29)

j=1

Example 4. We illustrate the behavior of DFVs across scales for three distributions on a weighted network with six nodes, as shown in Figs. 4(a) and 4(b). The DFVs may be viewed as paths in R6 parameterized by the scale parameter t > 0. Fig. 4(c) shows a 3D reduction via principal component analysis that let us visualize the paths. As may be observed, some of the main contrasts between the distributions are observed at different scales suggesting that they may be learned from training data in applications. It also becomes apparent that some differences persist over a whole range of scales and the level of persistence may provide valuable information.

944

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

Fig. 4. Diffusion Fréchet vectors: (a) a weighted network; (b) three distributions on the node set; (c) 3D PCA representation of the DFVs viewed as paths in R3 parameterized by the scale parameter.

To state a stability theorem for DFVs, we first define the Wasserstein distance between two probability distributions on the vertex set V of a weighted network K. This requires a base metric on V to play a role similar to that of the Euclidean metric in (13). We use the commute-time distance between the nodes of a weighted network (cf. [15]). Definition 2. Let K be a connected, weighted network with nodes v1 , . . . , vn , and let φ1 , φ2 , . . . , φn be an orthonormal basis of Rn formed by eigenvectors of the graph Laplacian with eigenvalues 0 = λ1 < λ2 ≤ . . . ≤ λn . The commute-time distance between vi and vj is defined as  dCT (i, j) =

n  1 2 (φk (i) − φk (j)) λk

1/2 .

(30)

k=2

It is simple to verify that d2CT (i, j) = 2

´∞ 0

d2t (i, j) dt.

Definition 3. Let ξ, ζ ∈ Rn represent probability distributions on V . The p-Wasserstein distance, p ≥ 1, between ξ and ζ (with respect to the commute-time distance on V ) is defined as

Wp (ξ, ζ) =

min

μ∈Γ(α,β)

⎛ ⎞1/p n  n  ⎝ dpCT (i, j)μij ⎠ ,

(31)

j=1 i=1

where Γ(ξ, ζ) is the set of all probability measures μ on V × V satisfying (p1 )∗ (μ) = ξ and (p2 )∗ (μ) = ζ. Theorem 3. Let ξ, ζ ∈ Rn be probability distributions on the vertex set of a connected, weighted network. For each t > 0, their diffusion Fréchet vectors satisfy  Fξ,t − Fζ,t ∞ ≤ 4

Tr e−2tΔ − 1 W1 (ξ, ζ) . 2et

(32)

Proof. Fix t > 0 and a node v . Let μ ∈ Γ(ξ, ζ) be such that n  n 

dCT (i, j)μ(i, j) = W1 (ξ, ζ) .

(33)

j=1 i=1

Since μ has marginals α and β, we may write Fξ,t ( ) =

n  n  j=1 i=1

d2t (i, )μi,j

and Fζ,t ( ) =

n  n  j=1 i=1

d2t (j, )μi,j ,

(34)

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

945

which implies that |Fξ,t ( ) − Fζ,t ( )| ≤

n  n   2  dt (i, ) − d2t (j, ) μi,j .

(35)

j=1 i=1

By Lemma 2 below, n n    |Fξ,t ( ) − Fζ,t ( )| ≤4 Tr e−2Δt − 1 dt (i, j)μi,j .

(36)

j=1 i=1

Observe that d2t (i, j) =

n 

e−2λk t (φk (i) − φk (j))

2

k=1

=

n 

(37) λk e

−2λk t

k=1

since λk e−2λk t ≤

1 2et .

1 2 1 2 d (i, j) (φk (i) − φk (j)) ≤ λk 2et CT

The theorem follows from (36) and (37).

2

Lemma 2. Let vi , vj , v be nodes of a connected, weighted network. For any t > 0,    2 dt (i, ) − d2t (j, ) ≤ 4 Tr e−2tΔ − 1 dt (i, j) , where Tr denotes the trace operator. Proof. Since the eigenfunction φ1 is constant, we may write the diffusion distance as d2t (i, ) =

n 

  e−2λk t φ2k (i) − 2φk (i)φk ( ) + φ2k ( ) .

(38)

k=2

Thus, d2t (i, ) − d2t (j, ) =

n 

  e−2λk t φ2k (i) − φ2k (j)

k=2

−2

n 

e−2λk t φk ( ) (φk (i) − φk (j))

k=2

=

n 

e

(39) −2λk t

(φk (i) + φk (j))(φk (i) − φk (j))

k=2

−2

n 

e−2λk t φk ( ) (φk (i) − φk (j)) .

k=2

Since each φk has unit norm, |φk ( )| ≤ 1 and |φk (i) + φk (j)| ≤ 2 . Therefore, n    2 dt (i, ) − d2t (j, ) ≤ 4 e−2λk t |φk (i) − φk (j)| . k=2

(40)

946

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

   The Cauchy–Schwarz inequality, applied to the vectors a = e−λ2 t , . . . , e−λn t and b = e−λ2 t |φ2 (i) − φ2 (j)| ,  . . . , e−λn t |φn (i) − φn (j)| , yields n 

   n  n    2 −2λk t −2λ t   k e |φk (i) − φk (j)| ≤ e e−2λk t (φk (i) − φk (j))

k=2

= The lemma follows from (40) and (41).



k=2

k=2

(41)

Tr e−2tΔ − 1 dt (i, j) .

2

5. Concluding remarks We introduced diffusion Fréchet functions and diffusion Fréchet vectors associated with probability measures on Euclidean space and weighted networks that integrate geometric information about their distributions across spatial scales, yielding a pathway to the quantification of their shapes. We proved that these functional statistics are stable with respect to the Wasserstein distance, providing a theoretical basis for their use in data analysis. To demonstrate the usefulness of these concepts, we provided examples using simulated data that illustrate some of their key properties. Diffusion Fréchet functions may be defined in the broader context of metric measure spaces, however their stability and other properties in this general setting remains to be investigated. As remarked in Section 3, the present work suggests the development of refinements of diffusion Fréchet functions such as diffusion covariance tensor fields for Borel measures on Riemannian manifolds to make their geometric properties more readily accessible. Acknowledgments This research was supported in part by NSF grants DBI-1262351 and DMS-1418007, CIHR-NSERC Collaborative Health Research Projects (413548-2012), and NSF grant DMS-1127914 to SAMSI. References [1] R.R. Coifman, S. Lafon, Diffusion maps, Appl. Comput. Harmon. Anal. 21 (1) (2006) 5–30, https://doi.org/10.1016/ j.acha.2006.04.006. [2] C. Villani, Optimal Transport, Old and New, Springer, 2009. [3] M. Schaub, J.-C. Delvenne, S.S. Yaliraki, M. Barahona, Markov dynamics as a zooming lens for multiscale community detection: non clique-like communities and the field-of-view limit, PLoS ONE 7 (2012) e32210. [4] R. Lambiotte, Multi-scale modularity and dynamics in complex networks, in: A. Mukherjee, M. Choudhury, F. Peruani, N. Ganguly, B. Mitra (Eds.), Dynamics On and Of Complex Networks, in: Modeling and Simulation in Science, Engineering and Technology, vol. 2, Birkhäuser, New York, NY, 2013, pp. 125–141. [5] N. Tremblay, P. Borgnat, Graph wavelets for multiscale community mining, IEEE Trans. Signal Process. 62 (2014) 5227–5239. [6] T. Lindeberg, Scale Space Theory in Computer Vision, Kluwer, Boston, 1994. [7] P. Chaudhuri, J. Marron, Scale space view of curve estimation, Ann. Statist. 28 (2) (2000) 408–428. [8] G. Carlsson, Topology and data, Bull. Amer. Math. Soc. 46 (2) (2009) 255–308. [9] C. Lee, J. Belanger, Z. Kassam, M. Smieja, D. Higgins, G. Broukhanski, P. Kim, The outcome and long-term follow-up of 94 patients with recurrent and refractory Clostridium difficile infection using single to multiple fecal microbiota transplantation via retention enema, Eur. J. Clin. Microbiol. Infect. Dis. 33 (8) (2014) 1425–1428, https://doi.org/10.1007/ s10096-014-2088-9. [10] U. von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (4) (2007) 395–416, https://doi.org/10.1007/ s11222-007-9033-z. [11] D. Díaz Martínez, F. Mémoli, W. Mio, Multiscale covariance fields, local scales, and shape transforms, in: F. Nielsen, F. Barbaresco (Eds.), Geometric Science of Information, in: Lecture Notes in Computer Science, vol. 8085, Springer, 2013, pp. 794–801. [12] D. Díaz Martínez, F. Mémoli, W. Mio, The shape of data and probability measures, arXiv:1509.04632, 2015.

D.H. Díaz Martínez et al. / Appl. Comput. Harmon. Anal. 47 (2019) 935–947

947

[13] R.M. Dudley, Real Analysis and Probability, Cambridge Studies in Advanced Mathematics, vol. 74, Cambridge University Press, Cambridge, England, 2002. [14] N. Fournier, A. Guillin, On the rate of convergence in Wasserstein distance of the empirical measure, Probab. Theory Related Fields 162 (2015) 707–738. [15] L. Lovász, Random walks on graphs: a survey, in: Combinatorics, Paul Erdös is Eighty, vol. 2, 1996, pp. 353–398.