A preconditioned fast quadratic spline collocation method for two-sided space-fractional partial differential equations


PII: S0377-0427(19)30179-7
DOI: https://doi.org/10.1016/j.cam.2019.03.048
Reference: CAM 12219

To appear in: Journal of Computational and Applied Mathematics

Received date: 15 March 2018. Revised date: 18 March 2019.

Please cite this article as: J. Liu, H. Fu, H. Wang et al., A preconditioned fast quadratic spline collocation method for two-sided space-fractional partial differential equations, Journal of Computational and Applied Mathematics (2019), https://doi.org/10.1016/j.cam.2019.03.048

Highlights

- A quadratic spline collocation (QSC) method combined with the Crank-Nicolson time discretization is proposed for two-sided variable-coefficient space-fractional diffusion equations in one and two dimensions.
- A matrix-free fast Krylov subspace iterative solver for the corresponding QSC-CN scheme is carefully designed and analyzed.
- Circulant-type preconditioners are constructed to further accelerate the convergence of the fast Krylov subspace iterative method.
- The developed preconditioned fast solution method reduces the computational cost from O(N^2) to O(N log N) per iteration and the memory requirement from O(N^2) to O(N), while still approximating the fractional diffusion equations without any loss of accuracy.
- Numerical experiments are given to demonstrate the efficiency and effectiveness of the method.


Jun Liu^a, Hongfei Fu^{a,*}, Hong Wang^b, Xiaochao Chai^a

^a College of Science, China University of Petroleum (East China), Qingdao, Shandong 266580, China
^b Department of Mathematics, University of South Carolina, Columbia, South Carolina 29208, USA

Abstract

Quadratic spline collocation methods for one- and two-dimensional fractional diffusion equations are proposed. By carefully exploring the mathematical structure of the coefficient matrix, we propose a matrix-free fast Krylov subspace iterative solver for the corresponding quadratic spline collocation scheme. A preconditioning technique is then applied to further accelerate the convergence of the fast Krylov subspace iterative method. We show that the fast quadratic spline collocation scheme reduces the computational cost from O(N^2) to O(N log N) per iteration and the memory requirement from O(N^2) to O(N), while it still approximates the fractional diffusion equations without any loss of accuracy. Numerical experiments are given to demonstrate the efficiency and effectiveness of the fast method.

Keywords: fractional diffusion equations, quadratic spline collocation method, fast matrix-vector multiplication, circulant (block) preconditioner

1. Introduction

Fractional partial differential equations (FPDEs) are emerging as powerful and competitive tools for modeling challenging phenomena in science and engineering; see [2, 5, 31, 33, 34, 35]. In the past decades, numerous numerical discretization methods for FPDEs have been proposed, for example, finite difference methods [25, 26, 30, 37, 43, 44], finite element methods [8, 9, 42], spectral methods [29, 45], finite volume methods [10, 16, 27], non-polynomial spline methods [7, 17, 21, 22, 23, 24], and so on. However, due to the nonlocal property of the space-fractional differential operators, the corresponding numerical schemes all produce dense stiffness matrices. Thus, the significantly increased computational complexity and memory requirements become the main obstacles in numerically solving FPDEs. In recent years, a large number of works based on finite difference/finite element/finite volume/spectral methods have been devoted to fast solution methods for various FPDEs; see [4, 11, 12, 13, 14, 18, 19, 20, 32, 38, 39, 40], where [4, 14, 38, 39, 40] are based on fast matrix-vector multiplications, [18, 19, 20, 32] are based on preconditioning techniques, and [11, 12] are based on both.

* Corresponding author. Email addresses: [email protected] (Jun Liu), [email protected] (Hongfei Fu), [email protected] (Hong Wang), [email protected] (Xiaochao Chai)

Preprint submitted to Journal of Computational and Applied Mathematics, March 18, 2019

Quadratic spline collocation (QSC) approximation of one-dimensional fractional diffusion equations was first proposed in [28], where the authors proved unconditional stability and an a priori error estimate in a discrete L^2-norm. As shown there, the QSC method combined with the Crank-Nicolson time discretization (QSC-CN) is an efficient tool for solving FPDEs. However, the resulting large, dense stiffness matrix remains the main obstacle to efficient numerical computation. In this paper, by exploring the mathematical structure of the coefficient matrix resulting from the QSC approximation, we propose a matrix-free fast Krylov subspace iterative method that does not resort to any lossy compression. A preconditioning technique is then applied to reduce the number of iterations and thus accelerate the convergence of the fast Krylov subspace iterative method. Consequently, the fast QSC scheme significantly reduces the computational cost and memory requirement, while it still approximates the fractional diffusion equations without any loss of accuracy. Numerical experiments show the strong potential of the method.

The rest of the paper is organized as follows. We first review the QSC-CN method for one-dimensional two-sided time-dependent fractional diffusion equations in Section 2. In Section 3, by exploring the structure of the coefficient matrix, we develop a matrix-free fast Krylov subspace iterative solver with significantly reduced computational complexity and memory requirement for the QSC method; to further reduce the number of iterations and the computational cost of the developed fast solver, a circulant preconditioner is developed for the fast version of the QSC-CN method.
In Section 4, we extend the discussion to two-dimensional time-dependent fractional diffusion equations: a biquadratic spline collocation method combined with Crank-Nicolson time discretization (BiQSC-CN) is proposed, its stability and convergence are rigorously proved, and fast solution techniques are again developed. Finally, numerical experiments are presented in Section 5 to illustrate the performance of the fast BiQSC-CN method.

2. QSC-CN discretization for the one-dimensional FPDE model

In this section, we are concerned with the following variable-coefficient two-sided fractional diffusion equation in one space dimension:

$$\begin{cases}
\dfrac{\partial u(x,t)}{\partial t} = a_+(x,t)\dfrac{\partial^\alpha u(x,t)}{\partial_+ x^\alpha} + a_-(x,t)\dfrac{\partial^\alpha u(x,t)}{\partial_- x^\alpha} + s(x,t), & x_L < x < x_R,\ 0 < t \le T,\\[1mm]
u(x_L,t) = u(x_R,t) = 0, & 0 \le t \le T,\\[1mm]
u(x,0) = u_0(x), & x_L \le x \le x_R,
\end{cases}\tag{2.1}$$

where 1 < α < 2 is the fractional order of the spatial derivative, the left and right transport-related coefficients satisfy a_±(x,t) > 0, and s(x,t) is a given force term. Here $\partial^\alpha u(x,t)/\partial_+ x^\alpha$ and $\partial^\alpha u(x,t)/\partial_- x^\alpha$ are the left-sided and right-sided Riemann-Liouville (R-L) fractional derivatives of order α, defined by [33]

$$\frac{\partial^\alpha u(x,t)}{\partial_+ x^\alpha} := \begin{cases} \dfrac{1}{\Gamma(2-\alpha)}\dfrac{\partial^2}{\partial x^2}\displaystyle\int_{x_L}^{x}\frac{u(\xi,t)}{(x-\xi)^{\alpha-1}}\,d\xi, & 1 < \alpha < 2,\\[2mm] \dfrac{\partial^2 u(x,t)}{\partial x^2}, & \alpha = 2,\end{cases}$$

$$\frac{\partial^\alpha u(x,t)}{\partial_- x^\alpha} := \begin{cases} \dfrac{1}{\Gamma(2-\alpha)}\dfrac{\partial^2}{\partial x^2}\displaystyle\int_{x}^{x_R}\frac{u(\xi,t)}{(\xi-x)^{\alpha-1}}\,d\xi, & 1 < \alpha < 2,\\[2mm] \dfrac{\partial^2 u(x,t)}{\partial x^2}, & \alpha = 2.\end{cases}\tag{2.2}$$

For α = 2, the fractional derivative of order α reduces to the classical second-order derivative, and model (2.1) reduces to the classical diffusion equation. Now we consider the QSC-CN discretization for model (2.1) based on uniform space-time mesh partitions. Let M and N be two positive integers. First, define tm = m∆t for m = 0, 1, . . . , M with ∆t = T/M the uniform temporal mesh size. Then, we employ uniform spatial partition of [xL , xR ] such that ∆ x = {xL = x0 < x1 < . . . < xN = xR },

with spatial mesh size ∆x = (x_R − x_L)/N and x_j = x_L + j∆x for j = 0, 1, ..., N. Let P_{2,∆x} be the space of piecewise quadratic polynomials with respect to the partition ∆_x. Furthermore, let S_{2,∆x} = P_{2,∆x} ∩ C_0^1([x_L, x_R]) be the space of piecewise quadratic polynomials on [x_L, x_R] that have a continuous first derivative with respect to x and satisfy the homogeneous Dirichlet boundary condition. We now define the basis functions for the space S_{2,∆x}. First, let

$$\phi(x) = \frac{1}{2}\begin{cases} x^2, & 0 \le x \le 1,\\ -2(x-1)^2 + 2(x-1) + 1, & 1 \le x \le 2,\\ (3-x)^2, & 2 \le x \le 3,\\ 0, & \text{elsewhere},\end{cases}$$

and define

$$\phi_j(x) = \phi\Big(\frac{x - x_L}{\Delta x} - j + 2\Big), \qquad j = 0, 1, \dots, N+1. \tag{2.3}$$

Then a set of basis functions for the one-dimensional quadratic spline space S_{2,∆x} = P_{2,∆x} ∩ C_0^1([x_L, x_R]) can be chosen as

$$\tilde\phi_j(x) = \begin{cases} \phi_1(x) - \phi_0(x), & j = 1,\\ \phi_j(x), & j = 2, \dots, N-1,\\ \phi_N(x) - \phi_{N+1}(x), & j = N.\end{cases}\tag{2.4}$$

With these basis functions, the quadratic spline approximate solution u_∆(x, t_m) ∈ S_{2,∆x} of model (2.1) can be written as

$$u_\Delta(x, t_m) = \sum_{j=1}^{N} c_j^m \tilde\phi_j(x), \qquad m = 0, 1, \dots, M, \tag{2.5}$$

where c_j^m, j = 1, ..., N, are to be determined.

Replacing u(x,t) by u_∆(x,t) in (2.1) and combining with the Crank-Nicolson (CN) time discretization, we obtain

$$\frac{u_\Delta(x,t_{m+1}) - u_\Delta(x,t_m)}{\Delta t} = \frac{1}{2} a_+^{m+1/2}(x)\left[\frac{\partial^\alpha u_\Delta(x,t_{m+1})}{\partial_+ x^\alpha} + \frac{\partial^\alpha u_\Delta(x,t_m)}{\partial_+ x^\alpha}\right] + \frac{1}{2} a_-^{m+1/2}(x)\left[\frac{\partial^\alpha u_\Delta(x,t_{m+1})}{\partial_- x^\alpha} + \frac{\partial^\alpha u_\Delta(x,t_m)}{\partial_- x^\alpha}\right] + s^{m+1/2}(x), \tag{2.6}$$

or equivalently

$$\sum_{j=1}^{N}\left[\tilde\phi_j(x) - \frac{\Delta t}{2} a_+^{m+1/2}(x)\frac{d^\alpha \tilde\phi_j(x)}{d_+ x^\alpha} - \frac{\Delta t}{2} a_-^{m+1/2}(x)\frac{d^\alpha \tilde\phi_j(x)}{d_- x^\alpha}\right] c_j^{m+1} = \sum_{j=1}^{N}\left[\tilde\phi_j(x) + \frac{\Delta t}{2} a_+^{m+1/2}(x)\frac{d^\alpha \tilde\phi_j(x)}{d_+ x^\alpha} + \frac{\Delta t}{2} a_-^{m+1/2}(x)\frac{d^\alpha \tilde\phi_j(x)}{d_- x^\alpha}\right] c_j^{m} + \Delta t\, s^{m+1/2}(x), \tag{2.7}$$

where

$$a_\pm^{m+1/2}(x) := a_\pm(x, t_{m+1/2}), \qquad s^{m+1/2}(x) := s(x, t_{m+1/2}).$$

To determine the N unknowns c_j^{m+1}, j = 1, ..., N, we require equation (2.7) to hold at the collocation points

$$\{\tau_i = (x_{i-1} + x_i)/2,\ 1 \le i \le N\}.$$

This leads to the following quadratic spline collocation equations (QSC-CN):

$$\Big[Q - \frac{\Delta t}{2} D_{a+}^{m+1/2} Q_{\alpha+} - \frac{\Delta t}{2} D_{a-}^{m+1/2} Q_{\alpha-}\Big] c^{m+1} = \Big[Q + \frac{\Delta t}{2} D_{a+}^{m+1/2} Q_{\alpha+} + \frac{\Delta t}{2} D_{a-}^{m+1/2} Q_{\alpha-}\Big] c^{m} + \Delta t\, s^{m+1/2}, \tag{2.8}$$

for 0 ≤ m ≤ M − 1. Here the matrices

$$Q(i,j) = \tilde\phi_j(\tau_i), \qquad Q_{\alpha\pm}(i,j) = \frac{d^\alpha \tilde\phi_j(\tau_i)}{d_\pm x^\alpha}$$

for 1 ≤ i, j ≤ N, and

$$c^m = \big[c_1^m, \dots, c_N^m\big]^\top, \qquad s^{m+1/2} = \big[s(\tau_1, t_{m+1/2}), \dots, s(\tau_N, t_{m+1/2})\big]^\top,$$
$$D_{a\pm}^{m+1/2} = \mathrm{diag}\big(a_\pm(\tau_1, t_{m+1/2}), \dots, a_\pm(\tau_N, t_{m+1/2})\big).$$

In particular, the initial value c^0 ∈ R^N is determined by solving the linear system

$$Q c^0 = u_0, \tag{2.9}$$

where u_0 = [u_0(τ_1), u_0(τ_2), ..., u_0(τ_N)]^⊤. A direct calculation shows that

$$Q = \frac{1}{8}\begin{pmatrix} 5 & 1 & & & 0\\ 1 & 6 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 6 & 1\\ 0 & & & 1 & 5 \end{pmatrix}, \tag{2.10}$$
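For concreteness, the assembly of Q and the initial solve (2.9) can be sketched in a few lines. The following NumPy sketch is our own illustration (not the authors' code); the initial condition u_0 below is an arbitrary sample choice:

```python
import numpy as np

def assemble_Q(N):
    """Tridiagonal collocation matrix Q of (2.10): interior rows (1/8)[1, 6, 1],
    with the two corner diagonal entries equal to 5/8."""
    Q = np.diag(np.full(N, 6.0 / 8.0))
    Q += np.diag(np.full(N - 1, 1.0 / 8.0), k=1)
    Q += np.diag(np.full(N - 1, 1.0 / 8.0), k=-1)
    Q[0, 0] = Q[-1, -1] = 5.0 / 8.0
    return Q

# Determine c^0 from the collocation values of u_0, as in (2.9).
N, xL, xR = 8, 0.0, 1.0
dx = (xR - xL) / N
tau = xL + (np.arange(N) + 0.5) * dx      # collocation points (midpoints)
u0 = np.sin(np.pi * tau)                   # sample initial condition (assumption)
c0 = np.linalg.solve(assemble_Q(N), u0)
```

Since Q is tridiagonal, the dense solve above could of course be replaced by an O(N) tridiagonal solver in practice.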

and

$$Q_{\alpha+} = \frac{1}{(\Delta x)^\alpha}\begin{pmatrix}
q_1^{(\alpha)} - q_2^{(\alpha)} & q_0^{(\alpha)} & & & 0\\
q_2^{(\alpha)} - q_3^{(\alpha)} & q_1^{(\alpha)} & q_0^{(\alpha)} & & \\
\vdots & q_2^{(\alpha)} & \ddots & \ddots & \\
q_{N-1}^{(\alpha)} - q_N^{(\alpha)} & \vdots & \ddots & q_1^{(\alpha)} & q_0^{(\alpha)}\\
q_N^{(\alpha)} - q_{N+1}^{(\alpha)} & q_{N-1}^{(\alpha)} & \cdots & q_2^{(\alpha)} & q_1^{(\alpha)} - q_0^{(\alpha)}
\end{pmatrix}, \tag{2.11}$$

$$Q_{\alpha-} = \frac{1}{(\Delta x)^\alpha}\begin{pmatrix}
q_1^{(\alpha)} - q_0^{(\alpha)} & q_2^{(\alpha)} & \cdots & q_{N-1}^{(\alpha)} & q_N^{(\alpha)} - q_{N+1}^{(\alpha)}\\
q_0^{(\alpha)} & q_1^{(\alpha)} & \ddots & & q_{N-1}^{(\alpha)} - q_N^{(\alpha)}\\
 & \ddots & \ddots & \ddots & \vdots\\
 & & q_0^{(\alpha)} & q_1^{(\alpha)} & q_2^{(\alpha)} - q_3^{(\alpha)}\\
0 & & & q_0^{(\alpha)} & q_1^{(\alpha)} - q_2^{(\alpha)}
\end{pmatrix}, \tag{2.12}$$

where $q_i^{(\alpha)} := q(i + \frac{1}{2})$ for i = 0, 1, ..., N+1, and q(x) is a piecewise function defined as follows:

$$q(x) = \frac{1}{\Gamma(3-\alpha)}\begin{cases} 0, & x < 0,\\ x^{2-\alpha}, & 0 \le x < 1,\\ x^{2-\alpha} - 3(x-1)^{2-\alpha}, & 1 \le x < 2,\\ x^{2-\alpha} - 3(x-1)^{2-\alpha} + 3(x-2)^{2-\alpha}, & 2 \le x < 3,\\ x^{2-\alpha} - 3(x-1)^{2-\alpha} + 3(x-2)^{2-\alpha} - (x-3)^{2-\alpha}, & x \ge 3.\end{cases}\tag{2.13}$$
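As a concrete illustration of (2.11) and (2.13), the following NumPy sketch (ours, not the authors' code) evaluates the coefficients q_i^{(α)} and assembles the dense matrix Q_{α+} for a small N; the truncated-power form of q(x) collapses the five cases of (2.13) into one expression:

```python
import math
import numpy as np

def q_fun(x, alpha):
    """The piecewise function q(x) of (2.13); the mesh factor appears
    separately as the 1/(dx)^alpha prefactor of Q_{alpha+}."""
    p = lambda y: y ** (2.0 - alpha) if y > 0.0 else 0.0  # truncated power
    if x < 0.0:
        return 0.0
    return (p(x) - 3.0 * p(x - 1.0) + 3.0 * p(x - 2.0) - p(x - 3.0)) / math.gamma(3.0 - alpha)

def assemble_Qalpha_plus(N, alpha, dx):
    """Dense Q_{alpha+} of (2.11): a lower-Hessenberg Toeplitz part with entries
    q_{i-j+1}, corrected in the first column (-q_{i+1}) and the (N,N) corner (-q_0)."""
    q = [q_fun(i + 0.5, alpha) for i in range(N + 2)]   # q_0, ..., q_{N+1}
    A = np.zeros((N, N))
    for i in range(N):                    # 0-based indices here
        for j in range(min(i + 2, N)):    # nonzero for j <= i + 1 (1-based)
            A[i, j] = q[i - j + 1]
    A[:, 0] -= np.array(q[2:N + 2])       # first-column correction
    A[N - 1, N - 1] -= q[0]               # corner correction
    return A / dx ** alpha
```

The superdiagonal of the result is the constant q_0^{(α)}/(∆x)^α and the first column carries the differences q_i^{(α)} − q_{i+1}^{(α)}, matching (2.11); Q_{α−} is its centrosymmetric counterpart.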

In [28], we proved that the QSC-CN scheme (2.8) is unconditionally stable and convergent with order 3 − α in space and second order in time. However, due to the nonlocal property of the space-fractional differential operators, the coefficient matrix of (2.8) is dense and full with a complicated structure. Therefore, an efficient and fast solver for this kind of large-scale modeling problem is of great importance. Traditionally, the Gaussian elimination method used to solve such a system requires O(N^2) storage and O(N^3) computational cost, which has been a main obstacle to numerically solving FPDE models.

3. Efficient and fast solution method

We can see that Q_{α+} (see (2.11)) and Q_{α−} (see (2.12)) are Toeplitz-like matrices, and they are centrosymmetric to each other. For the QSC-CN method (2.8), we have to solve an N-dimensional linear algebraic system with a dense coefficient matrix at each time step, which has a memory requirement of O(N^2). Even if we use a Krylov subspace iterative method, e.g., the Bi-Conjugate Gradient Stabilized method (Bi-CGSTAB), the main computational work lies in several matrix-vector multiplications, which need O(N^2) operations per iteration. This is clearly too expensive for large-scale modeling problems.

In this section, we aim to develop an efficient yet faithful fast Krylov subspace iterative method; as we shall see, the fast method is nothing but the standard iterative method equipped with fast matrix-vector multiplications. As an illustration, we utilize the preconditioned Bi-CGSTAB method (see [1]) to solve the nonsymmetric linear system (2.8).

Algorithm 3.1 (Preconditioned Bi-CGSTAB Method). At each time level t_m, choose an initial guess c^(0) = c^m and compute

$$r^{(0)} = \Delta t\,\big[ D_{a+}^{m+1/2} Q_{\alpha+} c^m + D_{a-}^{m+1/2} Q_{\alpha-} c^m + s^{m+1/2} \big].$$

Choose r̃ (for example, r̃ = r^(0)).
For i = 1, 2, ...
    ρ_{i−1} = r̃^⊤ r^(i−1)
    If ρ_{i−1} = 0, the method fails
    If i = 1
        p^(i) = r^(i−1)
    Else
        β_{i−1} = (ρ_{i−1}/ρ_{i−2})(α_{i−1}/ω_{i−1})
        p^(i) = r^(i−1) + β_{i−1}(p^(i−1) − ω_{i−1} v^(i−1))
    Endif
    Solve M p̂ = p^(i)
    v^(i) = [ Q − (∆t/2)( D_{a+}^{m+1/2} Q_{α+} + D_{a−}^{m+1/2} Q_{α−} ) ] p̂
    α_i = ρ_{i−1} / (r̃^⊤ v^(i)),    s = r^(i−1) − α_i v^(i)
    If ‖s‖ is small enough, set c^(i) = c^(i−1) + α_i p̂ and break
    Solve M ŝ = s
    t = [ Q − (∆t/2)( D_{a+}^{m+1/2} Q_{α+} + D_{a−}^{m+1/2} Q_{α−} ) ] ŝ
    ω_i = t^⊤ s / (t^⊤ t),    c^(i) = c^(i−1) + α_i p̂ + ω_i ŝ,    r^(i) = s − ω_i t
    Check for convergence; continue if necessary
End
c^{m+1} = c^(i)
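A compact, self-contained realization of Algorithm 3.1 might look as follows. This is a NumPy sketch under our own naming (not the authors' implementation), with a generic preconditioner solve (the identity by default), demonstrated on a generic diagonally dominant test system rather than the QSC-CN matrix:

```python
import numpy as np

def bicgstab(matvec, b, x0, M_solve=lambda v: v, tol=1e-10, maxit=200):
    """Preconditioned Bi-CGSTAB following Algorithm 3.1. `matvec` applies the
    coefficient matrix; `M_solve` applies M^{-1} (identity = no preconditioning)."""
    x = x0.copy()
    r = b - matvec(x)
    rtld = r.copy()                  # the fixed shadow residual r~
    rho_old = alpha = omega = 1.0
    p = np.zeros_like(b)
    v = np.zeros_like(b)
    for it in range(maxit):
        rho = rtld @ r
        if rho == 0.0:
            break                    # breakdown
        if it == 0:
            p = r.copy()
        else:
            beta = (rho / rho_old) * (alpha / omega)
            p = r + beta * (p - omega * v)
        phat = M_solve(p)
        v = matvec(phat)
        alpha = rho / (rtld @ v)
        s = r - alpha * v
        if np.linalg.norm(s) < tol:
            x += alpha * phat
            break
        shat = M_solve(s)
        t = matvec(shat)
        omega = (t @ s) / (t @ t)
        x += alpha * phat + omega * shat
        r = s - omega * t
        rho_old = rho
        if np.linalg.norm(r) < tol:
            break
    return x

# Generic well-conditioned test system (not the QSC-CN matrix).
rng = np.random.default_rng(0)
n = 30
A = 4.0 * np.eye(n) + 0.1 * rng.random((n, n))
b = rng.random(n)
x = bicgstab(lambda w: A @ w, b, np.zeros(n))
```

For the QSC-CN system, `matvec` would be replaced by the O(N log N) product of Section 3.1 and `M_solve` by the circulant preconditioner of Section 3.2.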

3.1. Fast matrix-vector multiplications

It can be observed that, in the above algorithm, the matrix-vector multiplications with the matrices D_{a+}^{m+1/2} Q_{α+}, D_{a−}^{m+1/2} Q_{α−} and Q in the initial residual and within each iteration require O(N^2) operations, while all other computations require only the optimal-order cost O(N). By exploiting the mathematical structures of the matrices Q_{α+}, Q_{α−} and Q, we shall show that the algorithm can be realized in O(N log N) operations per iteration.

Theorem 3.1. The matrix-vector multiplication $D_{a\pm}^{m+1/2} Q_{\alpha\pm} f$ can be performed in O(N log N) operations, and Q f in O(N) operations, for any f ∈ R^N.

Proof. Note that the matrix Q_{α+} can be decomposed as the sum of two matrices

$$Q_{\alpha+} = Q_{\alpha+,1} + Q_{\alpha+,2}, \tag{3.1}$$

such that

$$Q_{\alpha+,1}(i,j) = \frac{1}{(\Delta x)^\alpha}\begin{cases} q^{(\alpha)}_{i-j+1}, & j \le i+1,\\ 0, & j > i+1,\end{cases}\tag{3.2}$$

$$Q_{\alpha+,2}(i,j) = -\frac{1}{(\Delta x)^\alpha}\begin{cases} q^{(\alpha)}_{i+1}, & j = 1,\\ q^{(\alpha)}_{0}, & i = j = N,\\ 0, & \text{elsewhere}.\end{cases}\tag{3.3}$$

It can be seen that Q_{α+,1} is an N-by-N Toeplitz matrix, and thus it can be embedded into a 2N-by-2N circulant matrix C_{2N} [6, 15]:

$$C_{2N} = \begin{pmatrix} Q_{\alpha+,1} & B_N\\ B_N & Q_{\alpha+,1}\end{pmatrix}, \tag{3.4}$$

where

$$B_N(i,j) = \frac{1}{(\Delta x)^\alpha}\begin{cases} q^{(\alpha)}_{N+i-j+1}, & i < j,\\ q^{(\alpha)}_0, & i = N,\ j = 1,\\ 0, & \text{elsewhere}.\end{cases}\tag{3.5}$$

It is well known that the circulant matrix C_{2N} can be diagonalized by the discrete Fourier transform matrix F_{2N} [6, 15]:

$$C_{2N} = F_{2N}^{-1}\,\mathrm{diag}(F_{2N} c_{2N})\,F_{2N}, \tag{3.6}$$

where c_{2N} is the first column vector of C_{2N}. Note that the matrix-vector multiplications F_{2N} w and F_{2N}^{-1} w for any w ∈ R^{2N} can be carried out in O(2N log(2N)) = O(N log N) operations via the fast Fourier transform (FFT) and the inverse fast Fourier transform (iFFT). Besides, the diagonal matrix-vector multiplication diag(v)·w for any v, w ∈ R^{2N} can be evaluated as a Hadamard product in O(2N) = O(N) operations. Let f̃ = [f^⊤, 0^⊤]^⊤ ∈ R^{2N}. Then the evaluation of C_{2N} f̃ requires O(N log N) operations, and Q_{α+,1} f is just the first half of C_{2N} f̃; thus evaluating Q_{α+,1} f also needs O(N log N) operations.

Furthermore, note that Q_{α+,2} is a low-rank matrix with only N + 1 nonzero entries. Therefore, Q_{α+,2} f can be computed in O(N) operations, which implies that Q_{α+} f = Q_{α+,1} f + Q_{α+,2} f can be performed in O(N log N) operations, and thus so can $D_{a+}^{m+1/2} Q_{\alpha+} f$.

Similarly, Q_{α−} can be decomposed as

$$Q_{\alpha-} = Q_{\alpha-,1} + Q_{\alpha-,2}, \tag{3.7}$$

such that

$$Q_{\alpha-,1}(i,j) = \frac{1}{(\Delta x)^\alpha}\begin{cases} q^{(\alpha)}_{j-i+1}, & j \ge i-1,\\ 0, & j < i-1,\end{cases}\tag{3.8}$$

$$Q_{\alpha-,2}(i,j) = -\frac{1}{(\Delta x)^\alpha}\begin{cases} q^{(\alpha)}_{N-i+2}, & j = N,\\ q^{(\alpha)}_0, & i = j = 1,\\ 0, & \text{elsewhere}.\end{cases}\tag{3.9}$$

Thus, $D_{a-}^{m+1/2} Q_{\alpha-} f$ can also be evaluated in O(N log N) operations. Finally, since Q is a tridiagonal matrix, Q f can be computed in O(N) operations. □
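The circulant-embedding argument of the proof translates directly into code. The following NumPy sketch (our illustration, with hypothetical helper names) multiplies a general Toeplitz matrix by a vector in O(N log N) using a 2N-point FFT, in the spirit of (3.4)-(3.6); the padding entry at position N of the embedding circulant is arbitrary and is set to zero here:

```python
import numpy as np

def toeplitz_matvec(col, row, f):
    """Compute T f in O(N log N) for the Toeplitz matrix T with first column
    `col` and first row `row` (row[0] must equal col[0]), by embedding T into
    a 2N-by-2N circulant and diagonalizing it with the FFT, cf. (3.6)."""
    N = len(col)
    # First column of the embedding circulant C_2N: [col, 0, reversed(row[1:])].
    c = np.concatenate([col, [0.0], row[:0:-1]])
    fpad = np.concatenate([f, np.zeros(N)])          # f~ = [f; 0]
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(fpad))
    return y[:N].real                                # first half of C_2N f~
```

For Q_{α±}, one would take `col` and `row` from the coefficients q_i^{(α)} of the Toeplitz parts Q_{α±,1} and add the O(N) low-rank corrections Q_{α±,2} f separately.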

Theorem 3.2. The coefficient matrix $Q - \frac{\Delta t}{2} D_{a+}^{m+1/2} Q_{\alpha+} - \frac{\Delta t}{2} D_{a-}^{m+1/2} Q_{\alpha-}$ can be stored in O(N) memory.

Proof. To store the coefficient matrix $Q - \frac{\Delta t}{2} D_{a+}^{m+1/2} Q_{\alpha+} - \frac{\Delta t}{2} D_{a-}^{m+1/2} Q_{\alpha-}$, we need to efficiently store Q, Q_{α±} and $D_{a\pm}^{m+1/2}$, respectively. First, note that Q is a symmetric tridiagonal matrix, so it suffices to store its three distinct nonzero entries 1/8, 5/8 and 6/8. Second, from the proof of Theorem 3.1 we see that the Toeplitz matrix Q_{α+,1} can be stored in O(N) memory by storing its N + 1 distinct entries, and the sparse matrix Q_{α+,2} can also be stored in O(N) memory, since it has only N + 1 nonzero entries. Thus, (3.1) implies that Q_{α+} can be stored in O(N) memory, and similarly for Q_{α−}. Finally, the diagonal matrices $D_{a+}^{m+1/2}$, $D_{a-}^{m+1/2}$ can be stored in O(N) memory via their diagonal entries. This proves the conclusion. □

Remark 3.1. With the fast matrix-vector multiplications developed in Theorem 3.1 and the efficient storage of the coefficient matrix developed in Theorem 3.2, a fast version of the Bi-CGSTAB method can be proposed, which reduces the computational cost from O(N^2) to O(N log N) per iteration and the storage from O(N^2) to O(N).

3.2. Circulant preconditioner

The fast version of the Bi-CGSTAB method developed in the last subsection greatly reduces the computational complexity. However, the number of iterations, as well as the condition number of the stiffness matrix, may increase rapidly as the spatial and temporal mesh sizes tend to zero; consequently, even a fast solution method may diverge. Thus, it is crucial to develop an effective and efficient preconditioner for the QSC method to reduce the number of iterations and hence the computational cost.

As the stiffness matrix $Q - \frac{\Delta t}{2} D_{a+}^{m+1/2} Q_{\alpha+} - \frac{\Delta t}{2} D_{a-}^{m+1/2} Q_{\alpha-}$ is an almost-Toeplitz matrix, we employ T. Chan's circulant preconditioner developed for Toeplitz systems [36]. First, recall that for any general n-by-n matrix $T = [t_{i,j}]_{i,j=1}^n$, the best circulant approximation Circ_F(T), which minimizes ‖C_F − T‖_F over all circulant matrices C_F of order n, is obtained by taking arithmetic averages of the entries of T along wrapped diagonals; the first-column entries of Circ_F(T) are given by [36]

$$c_{l+1} = \frac{1}{n}\sum_{l = i-j\ (\mathrm{mod}\ n)} t_{i,j}, \qquad 0 \le l \le n-1. \tag{3.10}$$
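In code, the diagonal-averaging rule (3.10) amounts to a few lines. The following NumPy sketch (ours, not from the paper) builds the first column of Circ_F(T) for an arbitrary dense T; applied to the tridiagonal matrix Q, it reproduces the closed-form column given in Step 1 below:

```python
import numpy as np

def chan_first_column(T):
    """First column of T. Chan's best circulant approximation Circ_F(T):
    average the entries of T along wrapped diagonals, cf. (3.10)."""
    n = T.shape[0]
    c = np.zeros(n)
    for i in range(n):
        for j in range(n):
            c[(i - j) % n] += T[i, j]
    return c / n

def circulant(c):
    """Dense circulant matrix with first column c (for checking only)."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])
```

The double loop is O(n^2) and is shown only for clarity; for the structured matrices Q and Q_{α±}, the averages are available in closed form, so the preconditioner is never assembled this way in practice.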

Then, a generalized T. Chan's circulant preconditioner [36] for the matrix $Q - \frac{\Delta t}{2} D_{a+}^{m+1/2} Q_{\alpha+} - \frac{\Delta t}{2} D_{a-}^{m+1/2} Q_{\alpha-}$ can be constructed by the following steps:

Step 1. Construct a circulant approximation Circ_F(Q) to the tridiagonal matrix Q, where the first-column entries of Circ_F(Q) are defined as

$$c_j = \frac{1}{8N}\begin{cases} 6N - 2, & j = 1,\\ N - 1, & j = 2, N,\\ 0, & 3 \le j \le N-1.\end{cases}\tag{3.11}$$

Here and in what follows, we use Circ_F(·) to denote the best circulant approximation of a given matrix in the sense of (3.10).

Step 2. Construct circulant approximations Circ_F(Q_{α+}) and Circ_F(Q_{α−}) to the matrices Q_{α+} and Q_{α−}, respectively, where the first-column entries of Circ_F(Q_{α+}) and Circ_F(Q_{α−}) are defined as

$$c_{\alpha+,j} = \frac{1}{N(\Delta x)^\alpha}\begin{cases} N q_1^{(\alpha)} - q_0^{(\alpha)} - q_2^{(\alpha)}, & j = 1,\\ (N - j + 1)\, q_j^{(\alpha)} - q_{j+1}^{(\alpha)}, & 2 \le j \le N-1,\\ (N-1)\, q_0^{(\alpha)} + q_N^{(\alpha)} - q_{N+1}^{(\alpha)}, & j = N,\end{cases}\tag{3.12}$$

$$c_{\alpha-,j} = \frac{1}{N(\Delta x)^\alpha}\begin{cases} N q_1^{(\alpha)} - q_2^{(\alpha)} - q_0^{(\alpha)}, & j = 1,\\ (N-1)\, q_0^{(\alpha)} + q_N^{(\alpha)} - q_{N+1}^{(\alpha)}, & j = 2,\\ (j-1)\, q_{N-j+2}^{(\alpha)} - q_{N-j+3}^{(\alpha)}, & 3 \le j \le N.\end{cases}\tag{3.13}$$

Step 3. Formulate the generalized T. Chan's circulant preconditioner $C_F^m$ for $Q - \frac{\Delta t}{2} D_{a+}^{m+1/2} Q_{\alpha+} - \frac{\Delta t}{2} D_{a-}^{m+1/2} Q_{\alpha-}$ as

$$C_F^m := \mathrm{Circ}_F(Q) - \frac{\Delta t}{2} D_+^{m+1/2}\,\mathrm{Circ}_F(Q_{\alpha+}) - \frac{\Delta t}{2} D_-^{m+1/2}\,\mathrm{Circ}_F(Q_{\alpha-}), \tag{3.14}$$

where the scalars $D_\pm^{m+1/2}$ are defined as

$$D_\pm^{m+1/2} := \frac{1}{N}\sum_{j=1}^{N} a_\pm(\tau_j, t_{m+1/2}). \tag{3.15}$$

The generalized T. Chan's circulant preconditioner $C_F^m$ can be diagonalized by the Fourier transform matrix F_N, i.e.,

$$C_F^m = F_N^{-1}\,\mathrm{diag}(F_N c_F^m)\, F_N,$$

where $c_F^m$ is the first column vector of $C_F^m$. Therefore, the preconditioner $C_F^m$ can be inverted in O(N log N) operations. The preconditioning steps M p̂ = p^(i) and M ŝ = s in Algorithm 3.1, where M is chosen as $C_F^m$, can then be expressed as follows:

Algorithm 3.2 (Fast Circulant Preconditioning). At each time step t_m:
1. Compute v^m = fft(c_F^m).
2. Compute v^m = 1 ./ v^m.
3. Replace the computation M p̂ = p^(i) in the preconditioned Bi-CGSTAB method by the operations w^(i) = fft(p^(i)), z = v^m .* w^(i), p̂ = ifft(z).
4. Replace the computation M ŝ = s by similar operations.

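Applying a circulant preconditioner inside Bi-CGSTAB thus costs two FFTs and one pointwise division per solve. A minimal NumPy sketch of this circulant solve (our illustration, not the authors' code):

```python
import numpy as np

def circulant_solve(c, b):
    """Solve C x = b for the circulant matrix C with first column c, using
    C = F^{-1} diag(F c) F: forward FFT, pointwise division, inverse FFT."""
    return np.fft.ifft(np.fft.fft(b) / np.fft.fft(c)).real
```

In practice, v^m = 1 ./ fft(c_F^m) is computed once per time step and reused for every preconditioner application within that step, exactly as laid out in Algorithm 3.2.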

4. Fast BiQSC-CN method for the two-dimensional FPDE

In this section, we consider the following two-sided space-fractional diffusion equation with fractional orders 1 < α, β < 2 in two space dimensions:

$$\begin{cases}
\dfrac{\partial u(x,y,t)}{\partial t} = a_+(x,y,t)\dfrac{\partial^\alpha u(x,y,t)}{\partial_+ x^\alpha} + a_-(x,y,t)\dfrac{\partial^\alpha u(x,y,t)}{\partial_- x^\alpha} + b_+(x,y,t)\dfrac{\partial^\beta u(x,y,t)}{\partial_+ y^\beta}\\[2mm]
\qquad\qquad +\, b_-(x,y,t)\dfrac{\partial^\beta u(x,y,t)}{\partial_- y^\beta} + s(x,y,t), & (x,y) \in \Omega,\ 0 < t \le T,\\[1mm]
u(x,y,t) = 0, & (x,y) \in \partial\Omega,\ 0 \le t \le T,\\[1mm]
u(x,y,0) = u_0(x,y), & (x,y) \in \Omega.
\end{cases}\tag{4.1}$$

Here, Ω = (x_L, x_R) × (y_L, y_R) is a rectangular domain, and ∂Ω denotes its boundary. The left and right transport-related coefficients satisfy a_±(x,y,t) > 0 and b_±(x,y,t) > 0. The left-sided (+) and right-sided (−) fractional derivatives $\partial^\alpha u(x,y,t)/\partial_\pm x^\alpha$ and $\partial^\beta u(x,y,t)/\partial_\pm y^\beta$ are defined analogously to Section 2.

4.1. BiQSC-CN scheme

Let N_x and N_y be two positive integers, and denote N := N_x N_y. We introduce a uniform spatial partition ∆ := ∆_x × ∆_y such that

$$\Delta_x = \{x_L = x_0 < x_1 < \dots < x_{N_x} = x_R\}, \qquad \Delta_y = \{y_L = y_0 < y_1 < \dots < y_{N_y} = y_R\},$$

with mesh sizes ∆x = (x_R − x_L)/N_x and ∆y = (y_R − y_L)/N_y. Furthermore, let $\{(\tau_j^x, \tau_k^y),\ j = 1, \dots, N_x;\ k = 1, \dots, N_y\}$ be the collocation points, where $\{\tau_j^x\}$ and $\{\tau_k^y\}$ are the midpoints of ∆_x and ∆_y, respectively.

As in the one-dimensional case, let P_{2,∆} := P_{2,∆x} ⊗ P_{2,∆y} be the space of piecewise biquadratic polynomials with respect to the partition ∆, and denote S_{2,∆} = P_{2,∆} ∩ C_0^1(Ω). Similar to (2.4), let $\{\tilde\phi_k(y)\}_{k=1}^{N_y}$ be the basis functions for the one-dimensional quadratic spline space S_{2,∆y} = P_{2,∆y} ∩ C_0^1([y_L, y_R]). The basis functions for S_{2,∆} can then be constructed as tensor products of the basis functions of S_{2,∆x} and S_{2,∆y}. Therefore, the biquadratic spline approximate solution u_∆(x,y,t_m) ∈ S_{2,∆} of model (4.1) can be expressed as

$$u_\Delta(x,y,t_m) = \sum_{j=1}^{N_x}\sum_{k=1}^{N_y} c_{j,k}^m\, \tilde\phi_j(x)\tilde\phi_k(y), \qquad m = 0, 1, \dots, M, \tag{4.2}$$

where $c_{j,k}^m$ for j = 1, ..., N_x and k = 1, ..., N_y are to be determined.

Similar to (2.7), the Crank-Nicolson/biquadratic spline collocation method (BiQSC-CN) for


model (4.1) is proposed as follows:

$$\begin{aligned}
&\sum_{j=1}^{N_x}\sum_{k=1}^{N_y}\Big[\tilde\phi_j(x)\tilde\phi_k(y) - \frac{\Delta t}{2} a_+^{m+1/2}(x,y)\frac{d^\alpha \tilde\phi_j(x)}{d_+ x^\alpha}\tilde\phi_k(y) - \frac{\Delta t}{2} a_-^{m+1/2}(x,y)\frac{d^\alpha \tilde\phi_j(x)}{d_- x^\alpha}\tilde\phi_k(y)\\
&\qquad - \frac{\Delta t}{2} b_+^{m+1/2}(x,y)\tilde\phi_j(x)\frac{d^\beta \tilde\phi_k(y)}{d_+ y^\beta} - \frac{\Delta t}{2} b_-^{m+1/2}(x,y)\tilde\phi_j(x)\frac{d^\beta \tilde\phi_k(y)}{d_- y^\beta}\Big] c_{j,k}^{m+1}\\
&= \sum_{j=1}^{N_x}\sum_{k=1}^{N_y}\Big[\tilde\phi_j(x)\tilde\phi_k(y) + \frac{\Delta t}{2} a_+^{m+1/2}(x,y)\frac{d^\alpha \tilde\phi_j(x)}{d_+ x^\alpha}\tilde\phi_k(y) + \frac{\Delta t}{2} a_-^{m+1/2}(x,y)\frac{d^\alpha \tilde\phi_j(x)}{d_- x^\alpha}\tilde\phi_k(y)\\
&\qquad + \frac{\Delta t}{2} b_+^{m+1/2}(x,y)\tilde\phi_j(x)\frac{d^\beta \tilde\phi_k(y)}{d_+ y^\beta} + \frac{\Delta t}{2} b_-^{m+1/2}(x,y)\tilde\phi_j(x)\frac{d^\beta \tilde\phi_k(y)}{d_- y^\beta}\Big] c_{j,k}^{m} + \Delta t\, s^{m+1/2}(x,y).
\end{aligned}\tag{4.3}$$

Let (x,y) = (τ_l^x, τ_p^y) for all l = 1, ..., N_x and p = 1, ..., N_y in (4.3). Then we derive the corresponding biquadratic spline collocation equations:

$$A^{m+1/2} c^{m+1} = B^{m+1/2} c^m + \Delta t\, s^{m+1/2}, \tag{4.4}$$

where

$$A^{m+1/2} = (Q_x \otimes Q_y) - \frac{\Delta t}{2}\Big[D_{a+}^{m+1/2}(Q_{\alpha+} \otimes Q_y) + D_{a-}^{m+1/2}(Q_{\alpha-} \otimes Q_y)\Big] - \frac{\Delta t}{2}\Big[D_{b+}^{m+1/2}(Q_x \otimes Q_{\beta+}) + D_{b-}^{m+1/2}(Q_x \otimes Q_{\beta-})\Big], \tag{4.5}$$

and

$$B^{m+1/2} = (Q_x \otimes Q_y) + \frac{\Delta t}{2}\Big[D_{a+}^{m+1/2}(Q_{\alpha+} \otimes Q_y) + D_{a-}^{m+1/2}(Q_{\alpha-} \otimes Q_y)\Big] + \frac{\Delta t}{2}\Big[D_{b+}^{m+1/2}(Q_x \otimes Q_{\beta+}) + D_{b-}^{m+1/2}(Q_x \otimes Q_{\beta-})\Big]. \tag{4.6}$$

Here ⊗ denotes the Kronecker product. The matrices Q_x and Q_y have the same structure as Q in Section 2, but with sizes N_x-by-N_x and N_y-by-N_y, respectively. The matrices Q_{α±}, both of size N_x-by-N_x, are as defined in Section 2, while Q_{β±} are defined analogously to Q_{α±} with α and N replaced by β and N_y, respectively. Furthermore, for m = 0, 1, ..., M, the two-dimensional unknowns $\{c_{j,k}^m\}$ are arranged into an N = N_x N_y-dimensional column vector

$$c^m = \big[c_{1,1}^m, \dots, c_{1,N_y}^m,\ c_{2,1}^m, \dots, c_{2,N_y}^m,\ \dots,\ c_{N_x,1}^m, \dots, c_{N_x,N_y}^m\big]^\top.$$

Besides, $D_{a\pm}^{m+1/2}$ and $D_{b\pm}^{m+1/2}$ are N-dimensional diagonal matrices,

$$D_{a\pm}^{m+1/2} = \mathrm{diag}\big(a_\pm(\tau_1^x, \tau_1^y, t_{m+1/2}), \dots, a_\pm(\tau_1^x, \tau_{N_y}^y, t_{m+1/2}),\ \dots,\ a_\pm(\tau_{N_x}^x, \tau_1^y, t_{m+1/2}), \dots, a_\pm(\tau_{N_x}^x, \tau_{N_y}^y, t_{m+1/2})\big),$$

with $D_{b\pm}^{m+1/2}$ defined in the same way with b_± in place of a_±, and s^{m+1/2} is the N-dimensional column vector

$$s^{m+1/2} = \big[s(\tau_1^x, \tau_1^y, t_{m+1/2}), \dots, s(\tau_1^x, \tau_{N_y}^y, t_{m+1/2}),\ \dots,\ s(\tau_{N_x}^x, \tau_1^y, t_{m+1/2}), \dots, s(\tau_{N_x}^x, \tau_{N_y}^y, t_{m+1/2})\big]^\top.$$

It is seen that the BiQSC-CN scheme (4.4) requires solving a sequence of linear systems with an N-by-N dense coefficient matrix, where N = N_x N_y; this is typically the square of the system size of the corresponding one-dimensional problem. Thus, compared with the one-dimensional case, the computation is much more expensive, whether a direct solver or a Krylov subspace iterative solver is used. In the following subsections, we first analyze the stability and convergence of the BiQSC-CN scheme; then, by carefully exploring the structure of the coefficient matrices, we develop a preconditioned fast Bi-CGSTAB solution method, which is considerably more involved than in the one-dimensional case.

4.2. Numerical analysis of the BiQSC-CN scheme

We start from the matrix form (4.4) to prove the stability and convergence of the BiQSC-CN scheme. For simplicity, we restrict ourselves to the case of constant transport-related coefficients, i.e., a_±(x,y,t) = a_± and b_±(x,y,t) = b_± in (4.1). First, we define the discrete L^2-norm

$$\|v\|_{L^2} := \sqrt{\Delta x \Delta y\, v^\top v}, \qquad \forall v \in \mathbb{R}^N, \tag{4.7}$$

and the Q-weighted norm

$$\|v\|_Q^2 := \Delta x \Delta y\, v^\top (Q_x \otimes Q_y) v, \qquad \forall v \in \mathbb{R}^N. \tag{4.8}$$

It is easy to check that the two norms are equivalent, with

$$\frac{\sqrt{2}}{4}\|v\|_{L^2} \le \|v\|_Q \le \|v\|_{L^2}. \tag{4.9}$$

Theorem 4.1. For γ* (≈ 1.2576) ≤ α, β < 2, the solution vectors {c^m} of the BiQSC-CN scheme (4.4) satisfy

$$\|c^m\|_Q \le \|c^0\|_Q + 2\sqrt{2}\,\Delta t \sum_{k=0}^{m-1} \|s^{k+1/2}\|_{L^2}, \qquad 1 \le m \le M.$$

Proof. First, denote

$$G := a_+(Q_{\alpha+} \otimes Q_y) + a_-(Q_{\alpha-} \otimes Q_y) + b_+(Q_x \otimes Q_{\beta+}) + b_-(Q_x \otimes Q_{\beta-}).$$

Similar to the proof in [28], one can check that the matrices Q_{α±} and Q_{β±} are negative definite for γ* (≈ 1.2576) ≤ α, β < 2, and thus the matrix G is negative definite. Then, we rewrite the BiQSC-CN scheme (4.4) as

$$(Q_x \otimes Q_y)(c^{m+1} - c^m) - \frac{\Delta t}{2}\, G\, (c^{m+1} + c^m) = \Delta t\, s^{m+1/2}. \tag{4.10}$$

Multiplying both sides of (4.10) on the left by ∆x∆y (c^{m+1} + c^m)^⊤ and using the negative definiteness of G, we obtain

$$\Delta x \Delta y\, \big(c^{m+1} + c^m\big)^\top (Q_x \otimes Q_y)\big(c^{m+1} - c^m\big) \le \Delta t\, \Delta x \Delta y\, \big(c^{m+1} + c^m\big)^\top s^{m+1/2}.$$

It then follows from the Cauchy-Schwarz inequality and (4.9) that

$$\|c^{m+1}\|_Q^2 - \|c^m\|_Q^2 \le \Delta t \big(\|c^{m+1}\|_{L^2} + \|c^m\|_{L^2}\big)\|s^{m+1/2}\|_{L^2} \le 2\sqrt{2}\,\Delta t \big(\|c^{m+1}\|_Q + \|c^m\|_Q\big)\|s^{m+1/2}\|_{L^2},$$

which implies

$$\|c^{m+1}\|_Q \le \|c^m\|_Q + 2\sqrt{2}\,\Delta t\, \|s^{m+1/2}\|_{L^2}.$$

Theorem 4.1 then follows by recursion. □
Theorem 4.2. Let u^m and u_∆^m be, respectively, the solution vectors of (4.1) and (4.4) at the collocation points. Then, for γ* ≤ α, β < 2, we have

$$\|u^m - u_\Delta^m\|_{L^2} \le C\big( h^{\min\{3-\alpha,\,3-\beta\}} + \Delta t^2 \big), \qquad 1 \le m \le M,$$

where C is a positive constant and h = max{∆x, ∆y}.

Proof. Let $\breve u^m(x,y) \in S_{2,\Delta}$ be the biquadratic spline interpolant of u^m(x,y) = u(x,y,t_m) at the collocation points, i.e.,

$$\breve u^m(\tau_l^x, \tau_p^y) = u^m(\tau_l^x, \tau_p^y), \qquad l = 1, \dots, N_x, \quad p = 1, \dots, N_y. \tag{4.11}$$

Thus, $\breve u^m(x,y)$ can be expanded in the spline basis as

$$\breve u^m(x,y) = \sum_{j=1}^{N_x}\sum_{k=1}^{N_y} d_{j,k}^m\, \tilde\phi_j(x)\tilde\phi_k(y), \qquad m = 0, 1, \dots, M. \tag{4.12}$$

Moreover, denote

$$\mathcal{L}u^m(x,y) := a_+\frac{\partial^\alpha u^m(x,y)}{\partial_+ x^\alpha} + a_-\frac{\partial^\alpha u^m(x,y)}{\partial_- x^\alpha} + b_+\frac{\partial^\beta u^m(x,y)}{\partial_+ y^\beta} + b_-\frac{\partial^\beta u^m(x,y)}{\partial_- y^\beta}.$$

Then, we conclude from (4.11) that $\breve u^m(x,y)$ also satisfies

$$\breve u^{m+1}(\tau_l^x, \tau_p^y) - \frac{\Delta t}{2}\mathcal{L}\breve u^{m+1}(\tau_l^x, \tau_p^y) = \breve u^m(\tau_l^x, \tau_p^y) + \frac{\Delta t}{2}\mathcal{L}\breve u^m(\tau_l^x, \tau_p^y) + \Delta t\, s^{m+1/2}(\tau_l^x, \tau_p^y) + g^m(\tau_l^x, \tau_p^y) + O(\Delta t^3), \tag{4.13}$$

where

$$g^m(\tau_l^x, \tau_p^y) = -\frac{\Delta t}{2}\mathcal{L}\big(\breve u^{m+1} - u^{m+1}\big)(\tau_l^x, \tau_p^y) - \frac{\Delta t}{2}\mathcal{L}\big(\breve u^m - u^m\big)(\tau_l^x, \tau_p^y).$$

Let the vectors $d^m = \{d_{l,p}^m\}$ and $g^m = \{g^m(\tau_l^x, \tau_p^y)\}$ be defined in the natural way as before. Then, similar to (4.4), equation (4.13) is equivalent to the matrix form

$$A d^{m+1} = B d^m + \Delta t\, s^{m+1/2} + g^m + O(\Delta t^3)\, E, \tag{4.14}$$

where E = [1, 1, ..., 1]^⊤ ∈ R^N; in this constant-coefficient setting the matrices A and B are independent of time, so we omit the superscript m + 1/2. Denote e^m = d^m − c^m. Then, subtracting (4.4) from (4.14), we obtain the error equation

$$A e^{m+1} = B e^m + g^m + O(\Delta t^3)\, E. \tag{4.15}$$

Hence, an application of the stability result in Theorem 4.1 yields

$$\|e^m\|_Q \le 2\sqrt{2}\sum_{k=0}^{m-1}\big(\|g^k\|_{L^2} + O(\Delta t^3)\big) \le 2\sqrt{2}\,\sqrt{(x_R - x_L)(y_R - y_L)}\,\sum_{k=0}^{m-1}\max_{l,p}\big|g^k(\tau_l^x, \tau_p^y)\big| + O(\Delta t^2).$$

Note that, by the analysis in [28],

$$\max_{l,p}\big|g^k(\tau_l^x, \tau_p^y)\big| \le \Delta t\, M_0\, h^{\min\{3-\alpha,\,3-\beta\}}, \tag{4.16}$$

where M_0 is a positive constant. Then we have

$$\|e^m\|_Q \le C\big( h^{\min\{3-\alpha,\,3-\beta\}} + \Delta t^2 \big). \tag{4.17}$$

Finally, let $\breve u^m = \{\breve u^m(\tau_l^x, \tau_p^y)\} \in \mathbb{R}^N$ be the vector of values of (4.12) at the collocation points. We then derive from (4.2) and (4.12) that

$$\|u^m - u_\Delta^m\|_{L^2} = \|\breve u^m - u_\Delta^m\|_{L^2} = \|(Q_x \otimes Q_y)\, e^m\|_{L^2} \le \|e^m\|_{L^2} \le 2\sqrt{2}\,\|e^m\|_Q, \tag{4.18}$$

which together with (4.17) proves the theorem. □

4.3. Efficient storage of the coefficient matrices $A^{m+1/2}$ and $B^{m+1/2}$

We can see from expressions (4.5) and (4.6) that $A^{m+1/2}$ and $B^{m+1/2}$ have similar structures. In this part, we show that the coefficient matrix $A^{m+1/2}$ can be stored in O(N) memory and explain what information needs to be stored.

Theorem 4.3. The coefficient matrix $A^{m+1/2}$ can be stored in O(N) memory.

Proof. We analyze each component of the matrix $A^{m+1/2}$ individually. First, the matrix $Q_x \otimes Q_y$ contains only three different blocks, namely $\frac{1}{8} Q_y$, $\frac{5}{8} Q_y$, and $\frac{6}{8} Q_y$. All these blocks are symmetric tridiagonal matrices, and only three entries need to be stored for each block, so we only need to store 9 entries for the matrix $Q_x \otimes Q_y$.

Second, for the matrix $Q_{\alpha+} \otimes Q_y$, following (3.1) we have $Q_{\alpha+} \otimes Q_y = Q_{\alpha+,1} \otimes Q_y + Q_{\alpha+,2} \otimes Q_y$. Note that $Q_{\alpha+,1}$ is a Toeplitz matrix with $(N_x + 1)$ different nonzero entries, and $Q_{\alpha+,2}$ is a sparse matrix with $(N_x + 1)$ nonzero entries. Thus the product $Q_{\alpha+,1} \otimes Q_y$ contains $(N_x + 1)$ different blocks, i.e., $(\Delta x)^{-\alpha} q_j^{(\alpha)} Q_y$, $j = 0, 1, \ldots, N_x$, and the product $Q_{\alpha+,2} \otimes Q_y$ contains $(N_x + 1)$ nonzero blocks, i.e., $-(\Delta x)^{-\alpha} q_0^{(\alpha)} Q_y$ and $(\Delta x)^{-\alpha} q_{j+1}^{(\alpha)} Q_y$, $j = 1, \ldots, N_x$. It is clear that each of these blocks has only three different entries, so at most $(6N_x + 6)$ entries need to be stored for $Q_{\alpha+} \otimes Q_y$. The same conclusion holds for $Q_{\alpha-} \otimes Q_y$.

Third, the matrix $Q_x \otimes Q_{\beta+}$ is a block tridiagonal symmetric matrix containing only three different blocks, namely $\frac{1}{8} Q_{\beta+}$, $\frac{5}{8} Q_{\beta+}$, and $\frac{6}{8} Q_{\beta+}$. All of them are Toeplitz-like matrices and, following the decomposition (3.1)-(3.3), each of them has $(2N_y + 2)$ different entries. So $(6N_y + 6)$ entries need to be stored in total for $Q_x \otimes Q_{\beta+}$. Likewise, another $(6N_y + 6)$ entries need to be stored for $Q_x \otimes Q_{\beta-}$.

Finally, the $N$-dimensional diagonal matrices $D_{a\pm}^{m+1/2}$ and $D_{b\pm}^{m+1/2}$ can be stored in $4N$ memory. Therefore, the total storage for the matrix $A^{m+1/2}$ is $4N + 2(6N_x + 6N_y + 12) + 9 = O(N)$. □

Remark 4.1. The matrix $B^{m+1/2}$ has the same structure as $A^{m+1/2}$. Thus Theorem 4.3 is also valid for $B^{m+1/2}$, i.e., it can also be stored in O(N) memory.

4.4. O(N log N) matrix-vector multiplication of $A^{m+1/2} f$ and $B^{m+1/2} f$

As in the one-dimensional case, in this part we prove that the matrix-vector multiplications $A^{m+1/2} f$ and $B^{m+1/2} f$ for any vector $f \in \mathbb{R}^N$ can be achieved in O(N log N) operations.

Theorem 4.4. The matrix-vector multiplications $A^{m+1/2} f$ and $B^{m+1/2} f$ can be performed in O(N log N) operations for any vector $f \in \mathbb{R}^N$.

To prove the theorem, we first establish the following two lemmas.

Lemma 4.1. The matrix-vector multiplication $(Q_x \otimes Q_{\beta\pm}) f$ can be performed in $O(N_x N_y \log N_y)$ operations for any vector $f \in \mathbb{R}^N$.

Proof. For any $N = N_x N_y$-dimensional vector $f = [f_1, f_2, \ldots, f_N]^\top$, we write it as an $N_x$-dimensional block vector
$$f = \big[ (f^{(1)})^\top, (f^{(2)})^\top, \ldots, (f^{(N_x)})^\top \big]^\top \tag{4.19}$$
with each $f^{(j)} = [f_{(j-1)N_y+1}, f_{(j-1)N_y+2}, \ldots, f_{jN_y}]^\top$ an $N_y$-dimensional column vector. Then the matrix-vector multiplication
$$(Q_x \otimes Q_{\beta+}) f = \frac{1}{8} \begin{bmatrix} 5Q_{\beta+} f^{(1)} + Q_{\beta+} f^{(2)} \\ Q_{\beta+} f^{(1)} + 6Q_{\beta+} f^{(2)} + Q_{\beta+} f^{(3)} \\ \vdots \\ Q_{\beta+} f^{(N_x-2)} + 6Q_{\beta+} f^{(N_x-1)} + Q_{\beta+} f^{(N_x)} \\ Q_{\beta+} f^{(N_x-1)} + 5Q_{\beta+} f^{(N_x)} \end{bmatrix}. \tag{4.20}$$

We observe that the main computational cost in (4.20) lies in the $N_x$ matrix-vector multiplications $Q_{\beta+} f^{(j)}$. As in the proof of Theorem 3.1, each multiplication $Q_{\beta+} f^{(j)}$ can be performed in $O(N_y \log N_y)$ operations, and thus the total computational cost for $(Q_x \otimes Q_{\beta+}) f$ is $O(N_x N_y \log N_y)$. Similarly, the matrix-vector multiplication $(Q_x \otimes Q_{\beta-}) f$ can also be carried out in $O(N_x N_y \log N_y)$ operations. □

Lemma 4.2. The matrix-vector multiplication $(Q_{\alpha\pm} \otimes Q_y) f$ can be performed in O(N log N) operations for any vector $f \in \mathbb{R}^N$.

Proof. Let $f$ be any column vector of order $N$ expressed in the form (4.19). Then the matrix-vector multiplication $(Q_{\alpha+} \otimes Q_y) f$ can be expressed in the same block form:
$$(Q_{\alpha+} \otimes Q_y) f = (Q_{\alpha+} \otimes I_{N_y})(I_{N_x} \otimes Q_y) f = (Q_{\alpha+,1} \otimes I_{N_y} + Q_{\alpha+,2} \otimes I_{N_y}) \big[ (Q_y f^{(1)})^\top, (Q_y f^{(2)})^\top, \ldots, (Q_y f^{(N_x)})^\top \big]^\top,$$

where $I_{N_x}$ and $I_{N_y}$ are the identity matrices of sizes $N_x$ and $N_y$, respectively. As $Q_y$ is a tridiagonal matrix, $Q_y f^{(j)}$ can be performed in $O(N_y)$ operations, which accounts for $O(N_x N_y) = O(N)$ operations in total for all $j = 1, 2, \ldots, N_x$. Next, note that $Q_{\alpha+,1} \otimes I_{N_y}$ is an $N$-by-$N$ block-Toeplitz-circulant-block (BTCB) matrix. Similarly to (3.4), we can embed $Q_{\alpha+,1} \otimes I_{N_y}$ into a $2N \times 2N$ block-circulant-circulant-block (BCCB) matrix as follows [6, 15]:
$$C_{2N,+} := \begin{bmatrix} Q_{\alpha+,1} \otimes I_{N_y} & B_{N_x} \otimes I_{N_y} \\ B_{N_x} \otimes I_{N_y} & Q_{\alpha+,1} \otimes I_{N_y} \end{bmatrix} = \begin{bmatrix} Q_{\alpha+,1} & B_{N_x} \\ B_{N_x} & Q_{\alpha+,1} \end{bmatrix} \otimes I_{N_y}, \tag{4.21}$$

where $B_{N_x}$ is defined similarly to (3.5), with size $N_x$. We further denote
$$\hat f = \big[ (Q_y f^{(1)})^\top, (Q_y f^{(2)})^\top, \ldots, (Q_y f^{(N_x)})^\top \big]^\top,$$
and expand $\hat f$ to a vector of double size, $\hat f_{2N} = [\hat f^\top, 0^\top]^\top \in \mathbb{R}^{2N}$. It is easy to check that
$$C_{2N,+} \hat f_{2N} = \begin{bmatrix} (Q_{\alpha+,1} \otimes I_{N_y}) \hat f \\ (B_{N_x} \otimes I_{N_y}) \hat f \end{bmatrix}, \tag{4.22}$$

and the first half of (4.22) is exactly what we need. Let $c_{2N,+}$ be the first column of $C_{2N,+}$, and let $\hat F_{2N} = F_{2N_x} \otimes F_{N_y}$ be the corresponding two-dimensional discrete Fourier transform matrix. Then the matrix-vector multiplication $C_{2N,+} \hat f_{2N}$ can be realized through the diagonalization [6, 15]
$$C_{2N,+} \hat f_{2N} = \hat F_{2N}^{-1} \operatorname{diag}(\hat F_{2N} c_{2N,+}) \hat F_{2N} \hat f_{2N}. \tag{4.23}$$
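The diagonalization (4.23) rests on the fact that a circulant matrix is diagonalized by the DFT. The following one-dimensional sketch (a generic circulant with made-up entries, mirroring the building block of Section 3 rather than the actual $C_{2N,+}$) checks the FFT-based matrix-vector product against the dense product.

```python
import numpy as np

def circulant_dense(c):
    """Dense circulant matrix whose first column is c."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

def circulant_matvec(c, f):
    """C f = F^{-1} diag(F c) F f: two FFTs and one inverse FFT, O(n log n)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(f)))

rng = np.random.default_rng(0)
c = rng.standard_normal(8)  # first column of a hypothetical circulant
f = rng.standard_normal(8)
assert np.allclose(circulant_dense(c) @ f, circulant_matvec(c, f))
```

A Toeplitz block such as $Q_{\alpha+,1}$ is handled by first embedding it into a circulant of twice the size, exactly as in (4.21)-(4.22), and discarding the second half of the product.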

It is well known that $\hat F_{2N} \hat f_{2N}$ can be performed in $O(2N_x N_y \log N_y) + O(2N_x N_y \log(2N_x)) = O(N \log N)$ operations via the two-dimensional FFT. Furthermore, the computational costs of the Hadamard product and the two-dimensional inverse FFT in (4.23) are $O(2N) = O(N)$ and $O(N \log N)$, respectively. So $(Q_{\alpha+,1} \otimes I_{N_y}) \hat f$ can be obtained in $O(N \log N)$ operations. On the other hand, as $Q_{\alpha+,2}$ is sparse with only $(N_x + 1)$ nonzero entries, the multiplication $(Q_{\alpha+,2} \otimes I_{N_y}) \hat f$ reduces to the $(N_x + 1)$ nonzero blocks identified in the proof of Theorem 4.3, each a scalar multiple of $Q_y f^{(1)}$, namely $-(\Delta x)^{-\alpha} q_0^{(\alpha)} Q_y f^{(1)}$ and $(\Delta x)^{-\alpha} q_{j+1}^{(\alpha)} Q_y f^{(1)}$, $j = 1, \ldots, N_x$, which can be carried out in O(N) operations. Thus $(Q_{\alpha+} \otimes Q_y) f$ can be obtained in $O(N) + O(N \log N) = O(N \log N)$ operations in total, and similarly $(Q_{\alpha-} \otimes Q_y) f$ consumes the same computational cost. □

We now turn to the proof of Theorem 4.4, combining Lemmas 4.1 and 4.2.

Proof of Theorem 4.4. It follows from (4.5) that $A^{m+1/2} f$ can be split as
$$A^{m+1/2} f = (Q_x \otimes Q_y) f - \frac{\Delta t}{2}\Big[ D_{a+}^{m+1/2} (Q_{\alpha+} \otimes Q_y) + D_{a-}^{m+1/2} (Q_{\alpha-} \otimes Q_y) \Big] f - \frac{\Delta t}{2}\Big[ D_{b+}^{m+1/2} (Q_x \otimes Q_{\beta+}) + D_{b-}^{m+1/2} (Q_x \otimes Q_{\beta-}) \Big] f. \tag{4.24}$$

Since multiplication by a diagonal matrix has linear complexity, it follows from Lemmas 4.1 and 4.2 that the last four terms in (4.24) can be carried out in O(N log N) operations. Finally, note that $Q_x \otimes Q_y$ is a block-tridiagonal matrix with tridiagonal blocks, so $(Q_x \otimes Q_y) f$ can be performed in O(N) operations. Therefore, the total computational cost of the multiplication $A^{m+1/2} f$ is $O(N \log N) + O(N) = O(N \log N)$. Likewise, the matrix-vector multiplication $B^{m+1/2} f$ can also be performed in O(N log N) operations. □

4.5. A circulant block preconditioner for the matrix $A^{m+1/2}$

In this subsection, we present a circulant block preconditioner for the BiQSC-CN scheme (4.4), based on T. Chan's block preconditioner [3] for Toeplitz-block systems.

We observe from (4.5) that, to construct a circulant block preconditioner, it is necessary to propose a corresponding circulant block approximation to each term of (4.5). As an illustration, we discuss the circulant block approximation to the second term, $(Q_{\alpha+} \otimes Q_y)$; all the other terms can be treated similarly. A simple, so-called level-1, circulant block approximation to $(Q_{\alpha+} \otimes Q_y)$, denoted by $\mathrm{Circ}_F^1(Q_{\alpha+} \otimes Q_y)$, is given by [3]
$$\mathrm{Circ}_F^1(Q_{\alpha+} \otimes Q_y) := Q_{\alpha+} \otimes \mathrm{Circ}_F(Q_y), \tag{4.25}$$
where, as pointed out before, $\mathrm{Circ}_F(\cdot)$ denotes T. Chan's point-circulant approximation [36], and the first column $c_y$ of $\mathrm{Circ}_F(Q_y)$ can be computed using (3.10):
$$c_{y,j} = \frac{1}{8N_y} \begin{cases} 6N_y - 2, & j = 1, \\ N_y - 1, & j = 2, N_y, \\ 0, & 3 \le j \le N_y - 1. \end{cases} \tag{4.26}$$
It is well known that $\mathrm{Circ}_F(Q_y)$ can be diagonalized by the Fourier transform matrix $F_{N_y}$ [6, 15]:
$$\mathrm{Circ}_F(Q_y) = F_{N_y}^{-1} \operatorname{diag}(F_{N_y} c_y) F_{N_y}. \tag{4.27}$$
Therefore, the level-1 circulant block approximation to $(Q_{\alpha+} \otimes Q_y)$ can be rewritten as
$$\mathrm{Circ}_F^1(Q_{\alpha+} \otimes Q_y) = \tilde F_N^{-1} \big( Q_{\alpha+} \otimes \operatorname{diag}(F_{N_y} c_y) \big) \tilde F_N, \tag{4.28}$$
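For a generic Toeplitz matrix with diagonals $t_k$, T. Chan's approximation $\mathrm{Circ}_F(\cdot)$ is the Frobenius-norm-closest circulant, whose first column averages the wrapped diagonals: $c_k = \big((n-k)t_k + k\,t_{k-n}\big)/n$. Formulas such as (4.26) are instances of this rule. The sketch below (a random Toeplitz matrix with made-up entries, not $Q_y$ itself) computes this column and checks the optimality numerically.

```python
import numpy as np

def toeplitz_dense(col, row):
    n = len(col)
    return np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)]
                     for i in range(n)])

def circulant_dense(c):
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

def chan_first_col(col, row):
    """First column of T. Chan's circulant approximation:
    c_k = ((n - k) t_k + k t_{k-n}) / n, with t_k = col[k] (k >= 0), t_{-m} = row[m]."""
    n = len(col)
    c = np.empty(n)
    c[0] = col[0]
    for k in range(1, n):
        c[k] = ((n - k) * col[k] + k * row[n - k]) / n
    return c

rng = np.random.default_rng(1)
col = rng.standard_normal(6)
row = rng.standard_normal(6)
row[0] = col[0]
T = toeplitz_dense(col, row)
c = chan_first_col(col, row)
best = np.linalg.norm(circulant_dense(c) - T)
# Chan's circulant is the unique Frobenius-norm minimizer among circulants,
# so perturbing any generator cannot move the approximation closer to T.
for k in range(6):
    e = np.zeros(6)
    e[k] = 1e-3
    assert np.linalg.norm(circulant_dense(c + e) - T) >= best
```

The same diagonal-averaging rule, applied to the specific Toeplitz-like factors of the scheme, produces the explicit columns $c_y$, $c_x$, $c_{\alpha\pm}$, and $c_{\beta\pm}$ used below.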

where $\tilde F_N = I_{N_x} \otimes F_{N_y}$. For any $N$-dimensional vector $f$ of the form (4.19), we introduce a permutation matrix $P \in \mathbb{R}^{N \times N}$ such that
$$P f = \big[ (\tilde f^{(1)})^\top, (\tilde f^{(2)})^\top, \ldots, (\tilde f^{(N_y)})^\top \big]^\top, \tag{4.29}$$
with each $\tilde f^{(j)} = [f_j, f_{N_y+j}, \ldots, f_{(N_x-1)N_y+j}]^\top \in \mathbb{R}^{N_x}$. Then (4.28) can be expressed in the form
$$\mathrm{Circ}_F^1(Q_{\alpha+} \otimes Q_y) = \tilde F_N^{-1} P^\top \big( \operatorname{diag}(F_{N_y} c_y) \otimes Q_{\alpha+} \big) P \tilde F_N. \tag{4.30}$$

Note that the Kronecker product $\operatorname{diag}(F_{N_y} c_y) \otimes Q_{\alpha+}$ is a block-diagonal matrix with $N_y$ diagonal blocks, each a constant multiple of $Q_{\alpha+}$, i.e., a lower Hessenberg (not Toeplitz!) matrix. Hence, such a circulant block preconditioner would be computationally expensive, because inverting it in the preconditioning steps $M\hat p = p^{(i)}$ and $M\hat s = s$ of Algorithm 3.1 is very costly for large $N_x$. To reduce the computational cost, we further replace $Q_{\alpha+}$ with $\mathrm{Circ}_F(Q_{\alpha+})$, which gives rise to the so-called level-2 circulant block approximation to $Q_{\alpha+} \otimes Q_y$:
$$\mathrm{Circ}_F^2(Q_{\alpha+} \otimes Q_y) := \tilde F_N^{-1} P^\top \big( \operatorname{diag}(F_{N_y} c_y) \otimes \mathrm{Circ}_F(Q_{\alpha+}) \big) P \tilde F_N. \tag{4.31}$$

Also note that $\mathrm{Circ}_F(Q_{\alpha+})$ can be diagonalized by the Fourier transform matrix $F_{N_x}$ [6, 15] as $\mathrm{Circ}_F(Q_{\alpha+}) = F_{N_x}^{-1} \operatorname{diag}(F_{N_x} c_{\alpha+}) F_{N_x}$, where $c_{\alpha+}$ is the first column of $\mathrm{Circ}_F(Q_{\alpha+})$:
$$c_{\alpha+,j} = \frac{1}{N_x (\Delta x)^\alpha} \begin{cases} N_x q_1^{(\alpha)} - q_0^{(\alpha)} - q_2^{(\alpha)}, & j = 1, \\ (N_x - j + 1) q_j^{(\alpha)} - q_{j+1}^{(\alpha)}, & 2 \le j \le N_x - 1, \\ (N_x - 1) q_{N_x}^{(\alpha)} + q_0^{(\alpha)} - q_{N_x+1}^{(\alpha)}, & j = N_x. \end{cases}$$

Therefore, the level-2 circulant block approximation (4.31) can be reformulated as
$$\mathrm{Circ}_F^2(Q_{\alpha+} \otimes Q_y) = \tilde F_N^{-1} P^\top \hat F_N^{-1} \operatorname{diag}\big( (F_{N_y} c_y) \otimes (F_{N_x} c_{\alpha+}) \big) \hat F_N P \tilde F_N, \tag{4.32}$$
where $\hat F_N = I_{N_y} \otimes F_{N_x}$. Likewise, the level-2 circulant block approximations to the other terms in $A^{m+1/2}$ can be expressed as
$$\begin{aligned}
\mathrm{Circ}_F^2(Q_x \otimes Q_y) &= \tilde F_N^{-1} P^\top \hat F_N^{-1} \operatorname{diag}\big( (F_{N_y} c_y) \otimes (F_{N_x} c_x) \big) \hat F_N P \tilde F_N, \\
\mathrm{Circ}_F^2(Q_{\alpha-} \otimes Q_y) &= \tilde F_N^{-1} P^\top \hat F_N^{-1} \operatorname{diag}\big( (F_{N_y} c_y) \otimes (F_{N_x} c_{\alpha-}) \big) \hat F_N P \tilde F_N, \\
\mathrm{Circ}_F^2(Q_x \otimes Q_{\beta+}) &= \tilde F_N^{-1} P^\top \hat F_N^{-1} \operatorname{diag}\big( (F_{N_y} c_{\beta+}) \otimes (F_{N_x} c_x) \big) \hat F_N P \tilde F_N, \\
\mathrm{Circ}_F^2(Q_x \otimes Q_{\beta-}) &= \tilde F_N^{-1} P^\top \hat F_N^{-1} \operatorname{diag}\big( (F_{N_y} c_{\beta-}) \otimes (F_{N_x} c_x) \big) \hat F_N P \tilde F_N,
\end{aligned} \tag{4.33}$$

where $c_x$, $c_{\alpha-}$, $c_{\beta+}$, and $c_{\beta-}$ are the first columns of $\mathrm{Circ}_F(Q_x)$, $\mathrm{Circ}_F(Q_{\alpha-})$, $\mathrm{Circ}_F(Q_{\beta+})$, and $\mathrm{Circ}_F(Q_{\beta-})$, respectively, and
$$c_{x,j} = \frac{1}{8N_x} \begin{cases} 6N_x - 2, & j = 1, \\ N_x - 1, & j = 2, N_x, \\ 0, & 3 \le j \le N_x - 1, \end{cases}$$
$$c_{\alpha-,j} = \frac{1}{N_x (\Delta x)^\alpha} \begin{cases} N_x q_1^{(\alpha)} - q_2^{(\alpha)} - q_0^{(\alpha)}, & j = 1, \\ (N_x - 1) q_0^{(\alpha)} + q_{N_x}^{(\alpha)} - q_{N_x+1}^{(\alpha)}, & j = 2, \\ (j - 1) q_{N_x-j+2}^{(\alpha)} - q_{N_x-j+3}^{(\alpha)}, & 3 \le j \le N_x, \end{cases}$$
$$c_{\beta+,j} = \frac{1}{N_y (\Delta y)^\beta} \begin{cases} N_y q_1^{(\beta)} - q_0^{(\beta)} - q_2^{(\beta)}, & j = 1, \\ (N_y - j + 1) q_j^{(\beta)} - q_{j+1}^{(\beta)}, & 2 \le j \le N_y - 1, \\ (N_y - 1) q_{N_y}^{(\beta)} + q_0^{(\beta)} - q_{N_y+1}^{(\beta)}, & j = N_y, \end{cases}$$
$$c_{\beta-,j} = \frac{1}{N_y (\Delta y)^\beta} \begin{cases} N_y q_1^{(\beta)} - q_2^{(\beta)} - q_0^{(\beta)}, & j = 1, \\ (N_y - 1) q_0^{(\beta)} + q_{N_y}^{(\beta)} - q_{N_y+1}^{(\beta)}, & j = 2, \\ (j - 1) q_{N_y-j+2}^{(\beta)} - q_{N_y-j+3}^{(\beta)}, & 3 \le j \le N_y, \end{cases}$$
where $q_i^{(\beta)}$ is defined similarly to $q_i^{(\alpha)}$, with $\alpha$ replaced by $\beta$. Furthermore, denote

$$\bar D_{a\pm}^{m+1/2} = \frac{1}{N} \sum_{j=1}^{N_x} \sum_{k=1}^{N_y} a_\pm(\tau_j^x, \tau_k^y, t_{m+1/2}), \qquad \bar D_{b\pm}^{m+1/2} = \frac{1}{N} \sum_{j=1}^{N_x} \sum_{k=1}^{N_y} b_\pm(\tau_j^x, \tau_k^y, t_{m+1/2}), \tag{4.34}$$
and define a vector $c^{m+1/2} \in \mathbb{R}^N$ as

$$c^{m+1/2} = (F_{N_y} c_y) \otimes (F_{N_x} c_x) - \frac{\Delta t}{2}\Big[ \bar D_{a+}^{m+1/2} (F_{N_y} c_y) \otimes (F_{N_x} c_{\alpha+}) + \bar D_{a-}^{m+1/2} (F_{N_y} c_y) \otimes (F_{N_x} c_{\alpha-}) \Big] - \frac{\Delta t}{2}\Big[ \bar D_{b+}^{m+1/2} (F_{N_y} c_{\beta+}) \otimes (F_{N_x} c_x) + \bar D_{b-}^{m+1/2} (F_{N_y} c_{\beta-}) \otimes (F_{N_x} c_x) \Big]. \tag{4.35}$$

Then, following (4.32)-(4.35), the level-2 circulant block preconditioner for the coefficient matrix $A^{m+1/2}$ can be constructed as
$$\mathrm{Circ}_F^2(A^{m+1/2}) = \tilde F_N^{-1} P^\top \hat F_N^{-1} \operatorname{diag}(c^{m+1/2})\, \hat F_N P \tilde F_N. \tag{4.36}$$
Let $\mathrm{Circ}_F^2(A^{m+1/2})$ be the preconditioner $M$ in Algorithm 3.1. Then the solution $x$ of the linear system $Mx = b$ in the preconditioning procedure can be computed via
$$x = \tilde F_N^{-1} P^\top \hat F_N^{-1} \big( \operatorname{diag}(c^{m+1/2}) \big)^{-1} \hat F_N P \tilde F_N b. \tag{4.37}$$
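Up to the permutation $P$, the preconditioning step (4.37) is a pointwise division in the two-dimensional Fourier domain. The sketch below (a generic BCCB generator with made-up entries standing in for $c^{m+1/2}$, and toy sizes) solves $Mx = b$ for a block-circulant-circulant-block matrix in O(N log N) operations and verifies the result against a dense build.

```python
import numpy as np

def bccb_dense(g):
    """Dense BCCB matrix generated by the m x n array g, with row-major
    flattening: M[(i,j),(k,l)] = g[(i-k) % m, (j-l) % n]."""
    m, n = g.shape
    M = np.empty((m * n, m * n))
    for i in range(m):
        for j in range(n):
            for k in range(m):
                for l in range(n):
                    M[i * n + j, k * n + l] = g[(i - k) % m, (j - l) % n]
    return M

def bccb_solve(g, b_grid):
    """x = F^{-1} diag(F g)^{-1} F b with F the 2-D DFT: one FFT2, one IFFT2,
    and a pointwise division, i.e. O(N log N) work in total."""
    return np.real(np.fft.ifft2(np.fft.fft2(b_grid) / np.fft.fft2(g)))

# Hypothetical generator chosen so that all eigenvalues fft2(g) are nonzero.
g = np.zeros((4, 5))
g[0, 0], g[0, 1], g[1, 0], g[2, 3] = 4.0, 1.0, -1.0, 0.5

rng = np.random.default_rng(2)
b_grid = rng.standard_normal((4, 5))
x = bccb_solve(g, b_grid)
assert np.allclose(bccb_dense(g) @ x.ravel(), b_grid.ravel())
```

The eigenvalue grid `fft2(g)` plays the role of the vector $c^{m+1/2}$ after the Fourier transforms in (4.35), and the dense matrix is built here only for verification, never in the actual solver.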

Remark 4.2. Note that the five tensor products in (4.35) are independent of time and thus can be precomputed once in $O(N_y \log N_y + N_x \log N_x + N)$ operations via fast Fourier transforms and tensor products of two known vectors. Besides, at each time step we need to compute the four scalars $\bar D_{a+}^{m+1/2}$, $\bar D_{a-}^{m+1/2}$, $\bar D_{b+}^{m+1/2}$, $\bar D_{b-}^{m+1/2}$ and multiply them with the corresponding vectors, which consumes O(N) operations. Finally, the inverse-matrix-vector multiplication (4.37) can be computed in $O(2N_x N_y \log N_y + 2N_y N_x \log N_x + N) = O(N \log N)$ operations, comprising $2N_x$ (inverse) fast Fourier transforms of order $N_y$, $2N_y$ (inverse) fast Fourier transforms of order $N_x$, and a Hadamard product of two vectors of order $N$. Thus, preconditioning adds another O(N log N) operations per iteration to the fast Bi-CGSTAB method. However, as we shall see in the numerical experiments, the number of iterations of the preconditioned fast Bi-CGSTAB method is significantly reduced for large-scale problems, so the overall computational cost is lower than that of the fast Bi-CGSTAB method. For a detailed analysis of T. Chan's circulant block preconditioner, we refer the reader to [3].

5. Numerical Experiments

We carry out numerical experiments to investigate the performance of the QSC method, solved by the Bi-CGSTAB iterative solver, the fast Bi-CGSTAB (FBi-CGSTAB) iterative solver, and the preconditioned fast Bi-CGSTAB (PFBi-CGSTAB) iterative solver, as well as Gaussian elimination (Gauss). All numerical experiments are performed using Matlab R2017b on a Lenovo desktop with an Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz. The stopping criterion for the iterative solvers is $\|r^{(i)}\| / \|r^{(0)}\| \le 10^{-10}$, where $i$ denotes the $i$-th iteration.
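A matrix-free preconditioned Bi-CGSTAB iteration with this stopping criterion can be sketched as follows. This is the textbook form of the algorithm [1] with callable operators; the function and variable names, the Jacobi preconditioner, and the toy test matrix are illustrative assumptions, not the paper's Algorithm 3.1 verbatim.

```python
import numpy as np

def pbicgstab(matvec, b, precond=lambda v: v, tol=1e-10, maxit=1000):
    """Preconditioned Bi-CGSTAB: `matvec` applies A, `precond` applies M^{-1}.
    Stops when ||r_i|| / ||r_0|| <= tol."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    r0 = r.copy()
    rho = alpha = omega = 1.0
    v = p = np.zeros_like(b)
    r0norm = np.linalg.norm(r)
    for it in range(maxit):
        rho_new = r0 @ r
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        p_hat = precond(p)            # preconditioning step: solve M p_hat = p
        v = matvec(p_hat)
        alpha = rho_new / (r0 @ v)
        s = r - alpha * v
        s_hat = precond(s)            # preconditioning step: solve M s_hat = s
        t = matvec(s_hat)
        omega = (t @ s) / (t @ t)
        x = x + alpha * p_hat + omega * s_hat
        r = s - omega * t
        rho = rho_new
        if np.linalg.norm(r) / r0norm <= tol:
            break
    return x, it + 1

# Hypothetical test problem: a diagonally dominant nonsymmetric matrix,
# applied matrix-free, with a Jacobi (diagonal) preconditioner.
rng = np.random.default_rng(3)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
d = np.diag(A)
x, iters = pbicgstab(lambda v: A @ v, b, precond=lambda v: v / d)
assert np.linalg.norm(A @ x - b) <= 1e-8 * np.linalg.norm(b)
```

In the actual scheme, `matvec` would be the O(N log N) multiplication of Theorem 4.4 and `precond` the FFT-based solve (4.37), so the dense matrix above never appears.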

Example 5.1. For the first example, we consider the two-sided one-dimensional fractional diffusion equation (2.1) with
$$a_+(x,t) = \Gamma(3-\alpha)\, x^\alpha, \qquad a_-(x,t) = \Gamma(3-\alpha)\, (2-x)^\alpha,$$
and
$$s(x,t) = -32 e^{-t} \Big[ x^2 + (2-x)^2 + 0.125\, x^2 (2-x)^2 - \frac{3}{3-\alpha}\big( x^3 + (2-x)^3 \big) + \frac{3}{(4-\alpha)(3-\alpha)}\big( x^4 + (2-x)^4 \big) \Big]$$
for $0 < x < 2$ and $0 < t < 1$. Let the initial condition be $u(x,0) = 4x^2(2-x)^2$ and the boundary condition $u(0,t) = u(2,t) = 0$. Then the true solution of model (2.1) is given by [30, 41]
$$u(x,t) = 4 e^{-t} x^2 (2-x)^2.$$

Table 1: Performance of the QSC-CN method by different solvers for α = 1.8

Solver        N      M    Error     CPU          Aver. iter.
------------  -----  ---  --------  -----------  -----------
Gauss         2^10   43   1.53e-04  15m 47s      —
              2^11   64   6.63e-05  3h 54m 1s    —
Bi-CGSTAB     2^10   43   1.53e-04  31m 18s      507.37
              2^11   64   6.64e-05  7h 25m 32s   1011.34
FBi-CGSTAB    2^10   43   1.53e-04  6s           484.74
              2^11   64   6.63e-05  16m 56s      1986.64
              2^12   98   —         —            4096
PFBi-CGSTAB   2^10   43   1.53e-04  0.6s         22.91
              2^11   64   6.63e-05  4s           26.92
              2^12   98   2.90e-05  6s           31.21
              2^13   148  1.27e-05  19s          35.92
              2^14   223  5.63e-06  56s          40.41
              2^15   338  2.46e-06  5m 6s        43.76

Table 2: Performance of the QSC-CN method by different solvers for α = 1.5

Solver        N      M    Error     CPU          Aver. iter.
------------  -----  ---  --------  -----------  -----------
Gauss         2^10   108  5.10e-05  37m 26s      —
              2^11   182  1.80e-05  11h 13m 34s  —
Bi-CGSTAB     2^10   108  5.10e-05  41m 33s      249.38
              2^11   182  1.80e-05  8h 26m 18s   418.29
FBi-CGSTAB    2^10   108  5.10e-05  10s          249.67
              2^11   182  1.80e-05  2m 31s       408.65
              2^12   305  6.36e-06  3m 43s       662.43
              2^13   512  2.08e-06  6h 27m 37s   4141.45
PFBi-CGSTAB   2^10   108  5.10e-05  1s           19.99
              2^11   182  1.80e-05  8s           19.33
              2^12   305  6.36e-06  9s           18.05
              2^13   512  2.08e-06  33s          18.42
              2^14   862  4.86e-07  1m 37s       18.31

We present the numerical errors, CPU times, and average numbers of iterations (Aver. iter.) for the QSC-CN method (2.8) in Tables 1-3 for α = 1.8, 1.5, and 1.2, respectively. In these tests, we choose the time steps such that $(1/M)^2 \approx (2/N)^{3-\alpha}$, and the linear system (2.8) is solved by four different solvers. From the numerical results displayed in Tables 1-3 we make the following observations:

(1) All the solvers generate numerical solutions of the same accuracy. For example, with α = 1.5, N = 2^11, and M = 182, the numerical errors are all 1.80e-05 for the QSC-CN method. We note that in some cases, e.g., α = 1.8, N = 2^12, and M = 98, even the fast Bi-CGSTAB solver diverges, whereas the developed preconditioned fast Bi-CGSTAB solver does converge to the theoretical result.

(2) The fast matrix-vector multiplication developed in Section 3 improves the performance of the QSC-CN method significantly, with less memory requirement and CPU usage. For example, when α = 1.2, N = 2^10, and M = 275, Gauss uses more than one and a half hours of CPU time, while the fast Bi-CGSTAB consumes only 8 seconds! Even for the Bi-CGSTAB with α = 1.5, N = 2^11, and M = 182, about eight and a half hours of CPU time are consumed, while the fast Bi-CGSTAB needs less than 3 minutes!

Table 3: Performance of the QSC-CN method by different solvers for α = 1.2

Solver        N      M     Error     CPU          Aver. iter.
------------  -----  ----  --------  -----------  -----------
Gauss         2^9    148   3.58e-05  3m 46s       —
              2^10   275   1.03e-05  1h 42m 50s   —
Bi-CGSTAB     2^9    148   3.58e-05  1m 52s       59.98
              2^10   275   1.03e-05  30m 34s      75.91
FBi-CGSTAB    2^9    148   3.58e-05  2s           59.81
              2^10   275   1.03e-05  8s           76.07
              2^11   513   2.95e-06  1m 37s       88.06
              2^12   956   8.61e-07  1m 48s       104.71
              2^13   1783  4.37e-07  11m 27s      112.95
PFBi-CGSTAB   2^9    148   3.58e-05  1s           31.65
              2^10   275   1.03e-05  4s           32.92
              2^11   513   2.95e-06  41s          34.01
              2^12   956   8.61e-07  57s          34.73
              2^13   1783  4.37e-07  4m 9s        35.06

(3) The circulant preconditioner developed in Section 3 further improves the efficiency of the QSC-CN method solved by the fast Bi-CGSTAB. For example, it reduces the CPU time from more than 3.5 minutes to 9 seconds in the case α = 1.5, N = 2^12, and M = 305, while the average number of iterations drops from about 662 to 18. Moreover, with α = 1.8, N = 2^15, and M = 338, the preconditioned fast Bi-CGSTAB solver uses only about 5 minutes of CPU time, in contrast to more than 7 hours for the Bi-CGSTAB solver with N = 2^11 and M = 64.

Example 5.2. For the second example, we consider a two-dimensional problem defined on the rectangular domain Ω = (0,1) × (0,1) with T = 1. Let
$$a_+(x,y,t) = 1 + x^\alpha + t, \qquad a_-(x,y,t) = 1 + (1-x)^\alpha + t,$$
$$b_+(x,y,t) = 1 + y^\beta + t, \qquad b_-(x,y,t) = 1 + (1-y)^\beta + t,$$
and $s(x,y,t)$ is chosen such that the true solution of model (4.1) is given by [28]
$$u(x,y,t) = 4096\, e^{-t} x^3 (1-x)^3 y^3 (1-y)^3.$$
The convergence order of the BiQSC-CN scheme has been verified in [28] and is completely consistent with the conclusion of Theorem 4.2. Moreover, although the proof of Theorem 4.2 assumes constant transport-related coefficients, the numerical results show that the convergence theory also holds in this more general setting. In the following runs we carry out the same numerical tests, where model (4.1) with (α, β) = (1.9, 1.8), (1.5, 1.4), (1.1, 1.2) is solved by the

Table 4: Performance of the BiQSC-CN method by different solvers for (α, β) = (1.9, 1.8)

Solver        N = N_x N_y  M    Error     CPU          Aver. iter.
------------  -----------  ---  --------  -----------  -----------
Gauss         2^8          5    1.06e-02  9m 49s       —
              2^10         7    3.95e-03  17h 33m 26s  —
Bi-CGSTAB     2^8          5    1.06e-02  1m 24s       19.00
              2^10         7    3.95e-03  1h 1m 48s    38.14
FBi-CGSTAB    2^8          5    1.06e-02  0.4s         19.00
              2^10         7    3.95e-03  3s           38.29
              2^12         10   1.58e-03  19s          74.90
              2^14         15   5.92e-04  2m 21s       143.80
              2^16         22   2.41e-04  24m 31s      272.05
              2^18         31   1.10e-04  3h 42m 11s   527.13
PFBi-CGSTAB   2^8          5    1.06e-02  0.5s         13.00
              2^10         7    3.95e-03  3s           20.00
              2^12         10   1.58e-03  7s           29.20
              2^14         15   5.92e-04  46s          44.80
              2^16         22   2.41e-04  7m 23s       69.50
              2^18         31   1.10e-04  1h 7m 58s    106.39

Table 5: Performance of the BiQSC-CN method by different solvers for (α, β) = (1.5, 1.4)

Solver        N = N_x N_y  M    Error     CPU          Aver. iter.
------------  -----------  ---  --------  -----------  -----------
Gauss         2^8          8    3.50e-03  15m 35s      —
              2^10         14   8.31e-04  34h 40m 57s  —
Bi-CGSTAB     2^8          8    3.50e-03  1m 42s       14.38
              2^10         14   8.31e-04  1h 17m 21s   23.43
FBi-CGSTAB    2^8          8    3.50e-03  0.3s         14.38
              2^10         14   8.31e-04  2s           23.43
              2^12         23   2.07e-04  11s          40.44
              2^14         39   6.50e-05  1m 14s       61.13
              2^16         64   2.13e-05  13m 29s      89.58
              2^18         108  7.28e-06  2h 7m 0s     133.26
PFBi-CGSTAB   2^8          8    3.50e-03  0.2s         10.00
              2^10         14   8.31e-04  0.7s         13.00
              2^12         23   2.07e-04  5s           16.09
              2^14         39   6.50e-05  39s          20.00
              2^16         64   2.13e-05  6m 10s       24.05
              2^18         108  7.28e-06  55m 49s      26.81

BiQSC-CN method (4.4) via different solvers. Numerical comparisons for the three cases, including the numerical errors, CPU times, and average numbers of iterations, are shown in Tables 4-6, respectively. We can basically draw the same conclusions as in the one-dimensional case: both Gauss and the Bi-CGSTAB solvers are computationally very expensive, while the fast Bi-CGSTAB solver based on the fast matrix-vector multiplication developed in Section 4 greatly reduces the CPU usage. For example, Gauss uses more than 34 hours with (α, β) = (1.5, 1.4), N = 2^10, and M = 14, while the fast Bi-CGSTAB uses only 2 seconds of CPU time! Moreover, T. Chan's circulant block preconditioner requires fewer iterations than the fast Bi-CGSTAB solver, and thus less CPU time is consumed. For example, when (α, β) = (1.9, 1.8), N = 2^18, and M = 31, the average number of iterations is reduced from about 527 to 106, and only one third of the CPU time is consumed by the preconditioned fast Bi-CGSTAB solver compared with the fast version. However, for the case (α, β) = (1.1, 1.2), the acceleration provided by T. Chan's circulant block preconditioner is not as pronounced as in the other two cases.

Table 6: Performance of the BiQSC-CN method by different solvers for (α, β) = (1.1, 1.2)

Solver        N = N_x N_y  M    Error     CPU          Aver. iter.
------------  -----------  ---  --------  -----------  -----------
Gauss         2^8          13   2.02e-03  24m 7s       —
              2^10         —    —         —            —
Bi-CGSTAB     2^8          13   2.02e-03  2m 3s        10.39
              2^10         23   4.88e-04  1h 10m 40s   13.17
FBi-CGSTAB    2^8          13   2.02e-03  0.3s         10.39
              2^10         23   4.88e-04  2s           13.17
              2^12         43   1.29e-04  9s           16.16
              2^14         79   3.50e-05  1m 4s        19.41
              2^16         148  9.77e-06  9m           23.24
              2^18         275  2.74e-06  1h 6m 46s    25.67
              2^20         513  7.79e-07  9h 14m 5s    28.06
PFBi-CGSTAB   2^8          13   2.02e-03  0.4s         9.00
              2^10         23   4.88e-04  2s           10.22
              2^12         43   1.29e-04  12s          11.02
              2^14         79   3.50e-05  1m 8s        11.34
              2^16         148  9.77e-06  7m 9s        11.03
              2^18         275  2.74e-06  1h 2m 50s    10.98
              2^20         513  7.79e-07  7h 46m 30s   10.83

6. Concluding remarks

This paper appears to be the first attempt to develop fast solution techniques for fractional diffusion equations discretized by the quadratic spline collocation method. Both one- and two-dimensional fractional diffusion equations are considered, and preconditioned matrix-free fast Krylov subspace iterative solvers are developed for the corresponding QSC schemes by carefully exploiting the mathematical structure of the coefficient matrix. Numerical experiments show that the method greatly reduces the computational cost and memory requirement, while still approximating the fractional diffusion equations well without any loss of accuracy.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 91630207 and 11571115, by the Shandong Provincial Natural Science Foundation under

Grant ZR2017MA006, by the Taishan Project of Shandong, by the Fundamental Research Funds for the Central Universities under Grants 18CX02044A, 18CX02049A and 17CX02066, by the OSD/ARO MURI Grant W911NF-15-1-0562, and by the National Science Foundation under Grant DMS-1620194. The authors would like to express their sincere thanks to the referees for their very helpful comments and suggestions, which greatly improved the quality of this paper.

References

[1] R. Barrett, M. Berry, T.F. Chan, J. Demmel, J.M. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, H.V. der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, 1994.
[2] D. Benson, S.W. Wheatcraft, M.M. Meerschaert, The fractional-order governing equation of Lévy motion, Water Resour. Res. 36 (2000) 1413–1423.
[3] T.F. Chan, J.A. Olkin, Circulant preconditioners for Toeplitz-block matrices, Numer. Algor. 6 (1994) 89–101.
[4] S. Chen, F. Liu, X. Jiang, I. Turner, V. Anh, A fast semi-implicit difference method for a nonlinear two-sided space-fractional diffusion equation with variable diffusivity coefficients, Appl. Math. Comput. 257 (2015) 591–601.
[5] W. Chen, Y. Liang, S. Hu, H. Sun, Fractional derivative anomalous diffusion equation modeling prime number distribution, Frac. Cal. Appl. Anal. 18 (2015) 789–798.
[6] P.J. Davis, Circulant Matrices, Wiley-Intersciences, New York, 1979.
[7] T. El-Danaf, A. Hadhoud, Parametric spline functions for the solution of the one time fractional Burgers' equation, Appl. Math. Model. 36 (2012) 4557–4564.
[8] V.J. Ervin, J.P. Roop, Variational formulation for the stationary fractional advection dispersion equation, Numer. Methods Part. Diff. Eqn. 22 (2006) 558–576.
[9] W. Fan, F. Liu, X. Jiang, I. Turner, A novel unstructured mesh finite element method for solving the time-space fractional wave equation on a two-dimensional irregular convex domain, Fract. Calc. Appl. Anal. 20 (2017) 352–383.
[10] L. Feng, P. Zhuang, F. Liu, I. Turner, Stability and convergence of a new finite volume method for a two-sided space-fractional diffusion equation, Appl. Math. Comput. 257 (2015) 52–65.
[11] H. Fu, M.K. Ng, H. Wang, A divide-and-conquer fast finite difference method for space-time fractional partial differential equation, Comput. Math. Appl. 73 (2017) 1233–1242.
[12] H. Fu, H. Wang, A preconditioned fast finite difference method for space-time fractional partial differential equations, Fract. Calc. Appl. Anal. 20 (2017) 88–116.
[13] H. Fu, Z. Zhu, H. Wang, POD/DEIM reduced-order modeling of time-fractional partial differential equations with applications in parameter identification, J. Sci. Comput. 74 (2018) 220–243.
[14] X. Gu, T. Huang, C. Ji, B. Carpentieri, A.A. Alikhanov, Fast iterative method with a second order implicit difference scheme for time-space fractional convection-diffusion equations, J. Sci. Comput. 72 (2017) 957–985.
[15] R.M. Gray, Toeplitz and Circulant Matrices: A Review, in: Foundations and Trends in Communications and Information Theory, Vol 2, Issue 3, Now, Boston, 2006.
[16] H. Hejazi, T. Moroney, F. Liu, Stability and convergence of a finite volume method for the space fractional advection-dispersion equation, J. Comput. Appl. Math. 255 (2014) 684–697.
[17] S. Hosseini, R. Ghaffari, Polynomial and nonpolynomial spline methods for fractional sub-diffusion equations, Appl. Math. Model. 38 (2014) 3554–3566.
[18] X. Jin, F. Lin, Z. Zhao, Preconditioned iterative methods for two-dimensional space fractional diffusion equations, Commun. Comput. Phys. 18 (2015) 468–488.
[19] R. Ke, M.K. Ng, H. Sun, A fast direct method for block triangular Toeplitz-like with tridiagonal block systems from time-fractional partial differential equations, J. Comput. Phys. 303 (2015) 203–211.
[20] S. Lei, H. Sun, A circulant preconditioner for fractional diffusion equations, J. Comput. Phys. 242 (2013) 715–725.
[21] X. Li, P. Wong, A higher order non-polynomial spline method for fractional sub-diffusion problems, J. Comput. Phys. 328 (2017) 47–65.
[22] X. Li, P. Wong, A non-polynomial numerical scheme for fourth-order fractional diffusion-wave model, Appl. Math. Comput. 331 (2018) 80–95.
[23] X. Li, P. Wong, An efficient numerical treatment of fourth-order fractional diffusion-wave problems, Numer. Meth. PDE. 34 (2018) 1324–1347.
[24] X. Li, P. Wong, An efficient nonpolynomial spline method for distributed order fractional subdiffusion equations, Math. Method Appl. Sci. 41 (2018) 4906–4922.
[25] F. Liu, S. Chen, I. Turner, K. Burrage, V. Anh, Numerical simulation for two-dimensional Riesz space fractional diffusion equations with a nonlinear reaction term, Cent. Eur. J. Phys. 11 (2013) 1221–1232.


[26] F. Liu, P. Zhuang, I. Turner, V. Anh, K. Burrage, A semi-alternating direction method for a 2-D fractional FitzHugh-Nagumo monodomain model on an approximate irregular domain, J. Comput. Phys. 293 (2015) 252–263.
[27] F. Liu, P. Zhuang, I. Turner, K. Burrage, V. Anh, A new fractional finite volume method for solving the fractional diffusion equation, Appl. Math. Model. 38 (2014) 3871–3878.
[28] J. Liu, H. Fu, X. Chai, Y. Sun, H. Guo, Stability and convergence analysis of the quadratic spline collocation method for time-dependent fractional diffusion equations, Appl. Math. Comput. 346 (2019) 633–648.
[29] Z. Mao, J. Shen, Efficient spectral-Galerkin methods for fractional partial differential equations with variable coefficients, J. Comput. Phys. 307 (2016) 243–261.
[30] M.M. Meerschaert, C. Tadjeran, Finite difference approximations for two-sided space-fractional partial differential equations, Appl. Numer. Math. 56 (2006) 80–90.
[31] R. Metzler, J. Klafter, The restaurant at the end of the random walk: recent developments in the description of anomalous transport by fractional dynamics, J. Phys. A. 37 (2004) 161–208.
[32] J. Pan, R. Ke, M.K. Ng, H. Sun, Preconditioning techniques for diagonal-times-Toeplitz matrices in fractional diffusion equations, SIAM J. Sci. Comput. 36 (2014) 2698–2719.
[33] I. Podlubny, Fractional Differential Equations, Academic Press, New York, 1999.
[34] E. Scalas, R. Gorenflo, F. Mainardi, Fractional calculus and continuous-time finance, Physica A. 284 (2000) 376–384.
[35] H. Sun, Y. Zhang, W. Chen, D.M. Reeves, Use of a variable-index fractional-derivative model to capture transient dispersion in heterogeneous media, J. Contam. Hydrol. 157 (2014) 47–58.
[36] E.E. Tyrtyshnikov, Optimal and superoptimal circulant preconditioners, SIAM J. Matrix Anal. Appl. 13 (1992) 459–473.
[37] W. Tian, H. Zhou, W. Deng, A class of second order difference approximations for solving space fractional diffusion equations, Math. Comput. 84 (2015) 1703–1727.
[38] H. Wang, T.S. Basu, A fast finite difference method for two-dimensional space-fractional diffusion equations, SIAM J. Sci. Comput. 34 (2012) 2444–2458.
[39] H. Wang, N. Du, A fast finite difference method for three-dimensional time-dependent space-fractional diffusion equations and its efficient implementation, J. Comput. Phys. 253 (2013) 50–63.
[40] H. Wang, N. Du, Fast alternating-direction finite difference methods for three-dimensional space-fractional diffusion equations, J. Comput. Phys. 258 (2013) 305–318.
[41] H. Wang, K.X. Wang, T. Sircar, A direct O(N log^2 N) finite difference method for fractional diffusion equations, J. Comput. Phys. 229 (2010) 8095–8104.
[42] H. Wang, D. Yang, Wellposedness of variable-coefficient conservative fractional elliptic differential equations, SIAM J. Numer. Anal. 51 (2013) 1088–1107.
[43] Q. Yang, I. Turner, F. Liu, M. Ilić, Novel numerical methods for solving the time-space fractional diffusion equation in two dimensions, SIAM J. Sci. Comput. 33 (2011) 1159–1180.
[44] S.B. Yuste, J. Quintana-Murillo, A finite difference scheme with non-uniform timesteps for fractional diffusion equations, Comput. Phys. Commun. 183 (2012) 2594–2600.
[45] F. Zeng, F. Liu, C. Li, K. Burrage, I. Turner, V. Anh, A Crank-Nicolson ADI spectral method for a two-dimensional Riesz space fractional nonlinear reaction-diffusion equation, SIAM J. Numer. Anal. 52 (2014) 2599–2622.
