Author’s Accepted Manuscript Optimisation of Language-Integrated Queries by Query Unnesting Tomasz Marek Kowalski, Radosław Adamus
www.elsevier.com/locate/cl
PII: DOI: Reference:
S1477-8424(16)30024-0 http://dx.doi.org/10.1016/j.cl.2016.09.002 COMLAN236
To appear in: Computer Language Received date: 8 February 2016 Accepted date: 11 September 2016 Cite this article as: Tomasz Marek Kowalski and Radosław Adamus, Optimisation of Language-Integrated Queries by Query Unnesting, Computer Language, http://dx.doi.org/10.1016/j.cl.2016.09.002 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Optimisation of Language-Integrated Queries by Query Unnesting Authors: Tomasz Marek Kowalski, email:
[email protected] (corresponding author) Radosław Adamus, email:
[email protected] Affiliation: Institute of Applied Computer Science Lodz University of Technology ul. Stefanowskiego 18/22 90-924 Lodz, Poland
Abstract Native functional-style querying extensions for programming languages (e.g., LINQ or Java 8 streams) are widely considered as declarative. However, their very limited degree of optimisation when dealing with local collection processing contradicts this statement. We show that developers constructing complex LINQ queries or combining queries expose themselves to the risk of severe performance deterioration. For an inexperienced programmer, a way of getting an appropriate query form can be too complicated. Also, a manual query transformation is justified by the need of improving performance, but achieved at the expense of reflecting an actual business goal. As a result, benefits from a declarative form and an increased level of abstraction are lost. In this paper, we claim that moving of selected methods for automated optimisation elaborated for declarative query languages to the level of imperative programming languages is possible and desired. Our approach is based on the assumption that programmer is able distinguish whether a language-integrated query is intentionally used to introduce some side-effects or its sole purpose is to only query the data. We propose two optimisation procedures through query unnesting designed to avoid unnecessary multiple calculations in collection-processing constructs based on higher-order functions. We have implemented and verified this idea as a simple proof-of-concept LINQ optimiser library.
1 Introduction Since the release of LINQ (Language-Integrated Query) for the Microsoft .NET platform in 2007, there has been a significant progress in the topic of extending programming languages with native querying capabilities [1]. Programming languages are mostly imperative; their semantics relies on the program stack concept. They operate on volatile data and the meaning of collections is rather secondary. On the other hand, query languages are usually declarative and their semantics often bases on some forms of algebras or logics; these languages operate mostly on collections of persistent data. Declarativity of a query language reveals itself mostly when considering operators for collections. In the case of an imperative language, operating on a collection takes a form of an explicit loop iterating over collection elements in a specified order, while in query languages one declares a desired result (e.g., a sub-collection containing elements of a base collection matching a given selection predicate) and an algorithm of filtration itself is not an element of an expression representing the query. Based on characteristics of data structures, a database state and existence of additional auxiliary structures (e.g., indices), an execution environment can choose the most promising algorithm (a plan) for evaluation of the query. Declarativity allows one to postpone selection of an algorithm even to the moment of an actual query execution. In this paper we discuss to what extent solutions for processing of collections within programming languages are actually declarative. To do so, we made an extensive research on query optimisation. In databases it is a
crucial process that allows a programmer to be relieved from thinking about details of a processing control flow, auxiliary data structures and algorithms. LINQ seems to be the most robust solution introducing a promise of declarative collection processing within an imperative programming environment. It is commonly used for direct processing of collections and a mapper to resources devoid of a robust declarative query API or query optimisation. When encountering performance issues, developers are forced to manually optimise LINQ expressions or partly resign from declarative constructs in favour of an imperative code. var maxQuery = from p in products where (from p2 in products select p2.unitPrice).Max() == p.unitPrice select p.productName;
Listing 1 Example 1 – query expression syntax. Consider a LINQ query expression in Listing 1 (the database diagram including the Products table is available at http://northwinddatabase.codeplex.com/) whose purpose is to find names of products with a unit price equal to the maximal unit price among products. If the query addresses a native collection of objects, its execution is severely inefficient as the nested subquery, searching for the maximal price of products, is unnecessarily evaluated for each product addressed by the outer query. Although this task could be resolved in a time linearly proportional to the collection's cardinality, the LINQ engine induces an outer loop and a nested loop, both iterating over the products’ collection. Using this example, in further sections we show that manual optimisation of complex LINQ queries is not an easy task. Besides the query business semantics a programmer must carefully consider a query execution strategy (e.g., lazy evaluation) and possibility of sideeffects (e.g., an exception is thrown by the Max method if called on an empty collection). LINQ enables to express the same goal in many different ways. However, evaluation times of two semantically equivalent queries may differ by several orders of magnitude. In particular, in the context of the LINQ query expressions’ declarative syntax, it violates the declarative programming principle. Without knowledge on how a query engine works in a context of given data, the optimisation process is too complex and time-consuming. This is particularly true if a programmer wants to preserve semantics and properties of his query construct. To the best of our knowledge, the problem of automated global optimisation of LINQ queries for direct processing of collections of objects has not been addressed in the literature so far. By global optimisation we understand the ability to define an efficient query execution plan based on the whole query structure as opposed to the local optimisation that usually only targets a single operator. Below we prove that global optimisation can be done automatically making LINQ genuinely declarative. Nonetheless, the problem that this paper deals with is not limited to LINQ. Surprisingly, it extends to dozens of programming environments that support functional-style operations on collections of elements, such as filter, map or reduce. Pipelines and streams introduced in Java 8 are a solution equivalent to LINQ to Objects [2]. The main difference lies in the naming convention of new operators corresponding to their functional prototypes (e.g., map and filter instead of LINQ’s Select and Where). Furthermore, list comprehension constructs are examples of a shorthand syntax for specifying operations of projection and selection (filtering). Consequently, discussed issues concern many imperative languages exploiting this feature (e.g., Python). Fowler summarises such a functional-style programming pattern using a term collection pipeline [3]. Examples given in LINQ can be expressed in many imperative and functional programming languages. While we extend the conclusions of our work to the universe of imperative programming, they do not directly apply to functional languages (e.g., Haskell) since their principles of program evaluation are significantly different [4].
The rest of the paper is organised in the following way. First, we present a brief description of the state of the art followed by characteristics of language-integrated query constructs. In section 4 we describe the problem with improving performance of independent and correlated subqueries that reveal a huge optimisation potential. Next, we propose the solution, i.e. two query rewriting procedures which can be applied to functional-style operations on collections. Section 6 explains principles of our optimisation approach enabling implementation of automatic optimisers. Finally, we present measured results. The paper is concluded with a short summary.
2 Related Work and the State-of-the-Art Databases are the area of the computer science where declarative programming and query optimisation have developed extensively. Over 40 years of the research on relational systems resulted in various optimisation techniques [5][6] and numerous solutions are incorporated in available commercial products. Our research presented in the paper particularly addresses query optimisations analogous to query unnesting, dating back to the early 80s [7]. This topic is constantly appearing in the context of arising database technologies. Different approaches to handle nested queries evaluation have been proposed for object-oriented databases [8][9] and XML documentbased stores [10]. However, NoSQL solutions marginalise the topic of query languages and usually rely on a minimalistic programming interface and domain-specific optimisations, mostly implemented by high redundancies and storing data in the form matching assumed queries. Most attention from the scientific community concentrates on the topic of distributed data-parallel computing using the Map/Reduce paradigm (like Hadoop or Dryad for Azure) [11]. This paradigm can be transparently used in declarative collection processing. The Dryad programming environment based on LINQ [12] takes advantage of mechanisms similar to Map/Reduce in order to write scalable, parallel and distributed programs. To increase sharing of computations in a data centre, Dryad can benefit from the Nectar system [13]. It is able to cache results of frequently used queries and incrementally update them. The use of cached results is achieved through automatic query rewriting. Robust query and program optimisations have been developed for solutions based on the functional paradigm. According to Fegaras [14], an optimisation framework for a functional lambda-DB object-oriented database relies on mathematical bases, i.e. the monoid comprehensions calculus. It generalises many unnesting techniques proposed in the literature. Glasgow Haskell Compiler (GHC) for the Haskell non-strict purely functional language introduces many methods based on code rewriting. They range from relatively simple rules that can be used to improve efficiency of programs through modifications on a high syntactic level to more complex low-level core language transformations (e.g., let-floating, beta reduction, case swapping, case elimination) [15]. In particular, a procedure called full laziness (or fully lazy lambda lifting) has been proposed to avoid reevaluation of inner expressions for which result could be pre-calculated only once [16][17]. Currently, due to introduction of lambda abstractions into object-oriented languages, functional style of programming became ubiquitous. Stream and collection processing constructs derived from functional languages can be naturally evaluated in parallel using multiple processor cores. Therefore, the most popular solutions, like Java 8 streams, LINQ or Scala, enable such optimisation through various libraries or frameworks [18]. In the field of functional-style queries integrated into a programming language, the topic of query optimisation seems the most advanced in LINQ. A LINQ provider library can implement direct processing of data (e.g., LINQ to Objects, LINQ to XML) or delegate processing to a remote external resource by sending a translated query (e.g., LINQ to SQL, LINQ to Entities). To be precise, a mixture of both approaches can be used, e.g. when the query language of a remote resource cannot completely express the semantics of a LINQ query. In the case when LINQ sends a translated query, it also delegates the responsibility for query optimisation. Consequently, if the external resource engine provides optimisation, developers can fully rely on a declarative style of
programming. However, in the context of LINQ to SQL, the problem of analysing and normalising of LINQ queries in order to provide minimal and cohesive mapping to SQL has drawn attention of the scientific community. This is caused mostly by some drawbacks of the original Microsoft’s solution that in some cases may fail or produce a so-called “query avalanche” [19][20]. The issue of performance deficiencies while processing collections of objects has not passed unnoticed by the LINQ community. In order to cope with the shortage in optimisation comparing to database engines, the i4o project (abbr. index for objects) solution adapted the idea of indexing to native objects’ collections [21]. It is implemented as an alternative for the LINQ to Objects provider library. Utilising the concept of secondary access structures, i4o can produce several orders of magnitude of a gain in performance for queries filtering data at the cost of a data modification overhead. Another examples of LINQ query optimisation tools are Steno [22] and LinqOptimizer [23] provider libraries. Their authors focused on a significant performance deficiency of LINQ queries in contrast to the equivalent manually optimised code that can be several times faster. Experiments have shown that Steno allows one to obtain up to 14-fold increase in processing of sequential data and 2-fold comparing to a problem processed by the DryadLINQ distributed engine [12]. The main idea behind Steno is to eliminate the overhead introduced by virtual calls to iterators that are the fundamental mechanism used by the LINQ engine. This problem has been solved by automatic generation of an imperative code omitting iterators. The optimisation addresses mainly implementation of individual operators. This also concerns the case of nested loops’ optimisation when Steno has to analyse a series of operators only to preserve the order of iteration induced by the LINQ to Objects library implementation. This is justified by the loop fusion efficiency and consideration of side-effects that are allowed in LINQ. Steno is also capable of higher-level optimisation giving an example of the GroupBy-Aggregate optimisation. It involves a local term rewriting, addressing a pair of neighbouring operators, i.e. GroupBy followed by Aggregate. When encountering such a sequence of operators, Steno replaces it by a dedicated GroupByAggregate operator that saves memory by storing per-key partial aggregates instead of the whole collection of group values. This optimisation takes advantage of LINQ declarativity by changing the course of evaluation. As a result, introducing side-effects would cause its incorrectness. Being aware of a difficulty of automatic reasoning about side-effects within queries, Steno’s authors suggest developer-guided optimisation. Optimisation similar to GroupBy-Aggregate is considered in the SkyLINQ project [24] that develops an alternative operator called Top. This operator can be used to substitute a sequence of OrderBy and Take method calls (i.e. an operation to get top k elements). The significance of LINQ grew up with introducing LINQ to Events, an extension enabling declarative programming according to the reactive paradigm [25]. The solution derives from Functional Reactive Programming and is well suited for composing asynchronous and event-based programs [26]. Recently, this approach has attracted attention of commercial and scientific communities and, as a programming paradigm, faces efficiency issues indicating possible areas for optimisation [27][28]. Declarative functional-style constructs in general-purpose object-oriented languages are not pure. As a result, decisions concerning optimisation have to be made by programmers. Transparent and aggressive compile-time optimisations can be achieved by introducing a query language extension into a programming language compiler [29]. One of numerous examples of extending compilers of existing languages with declarative constructs is SBQL4J [30]. It enables seamless integration of SBQL queries with language instructions and executing them in a context of Java collections. SBQL4J is based on the Stack-Based Architecture (SBA) approach instead of the functional approach and offers capabilities comparable to the LINQ technology [31][32]. What distinguishes it from other programming language-integrated queries is incorporation of several automatic optimisation methods developed for SBA. One of these methods, i.e. factoring out independent subqueries [8], enables SBQL4J to cope with optimisation of queries equivalent to examples discussed in this paper. It belongs to the group of optimisation methods that
are based on query rewriting. Factoring out concerns a subquery (that in SBA represent any subexpression) that is processed many times in loops implied by so called non-algebraic operators despite that in subsequent loop cycles its result is the same. In SBQL4J rewriting is applied at a compile-time and a resulting performance improvement can be very significant, sometimes giving query response times shorter by several orders of magnitude. Table 1: Language-integrated query optimisation tools Optimisation type
time
Target language/ environment
GHC
Global query rewriting Parallelisation
Compile
Haskell
Functional operations
PLINQ
Parallelisation
Runtime
Linq (.NET)
Functional-style methods
Java 8 parallel streams
Parallelisation
Runtime
Java
Functional-style methods
ScalaBlitz
Parallelisation
Runtime
Scala
Functional-style methods
i4o
Secondary access structures
Runtime
Linq (.NET)
Functional-style methods
LinqOptimizer
Low-level code generation Parallelisation
Runtime
Linq (.NET)
Functional-style methods
Steno
Local query rewriting Low-level code generation
Runtime
Linq (.NET)
Functional-style methods
SkyLinq
Local query rewriting
Runtime
Linq (.NET)
Functional-style methods
PQL/Java
Global query rewriting Parallelisation
Compile
Java
PQL queries
SBQL4J
Global query rewriting
Compile
Java
SBQL queries
Tool
Collection processing constructs type
Table 1 summarises described language-integrated queries optimisation tools. It focuses on programming environments targeting in memory collections as it lays in the centre of our research. In particular, we strive towards filling the gap in global query rewriting mechanisms for environments introducing functional-style language-integrated query constructs.
3 Characteristics of Language-Integrated Query Constructs Declarative style programming (especially in the context of databases) is often associated with the select-from-where syntactic sugar known from SQL that was adapted into LINQ. The query in Listing 1 is expressed using the LINQ query expression syntax. That form lacks explicit information on an order of performed operations and virtually a compiler could translate it to any semantically equivalent lower-level code that could be considered a query execution plan. Consequently, programmers must be particularly careful about potential side-effects within declarative constructs in order to avoid the risk of unpredicted violations. Technically, query expressions are syntactic sugar over the implementation layer using lambda expressions, higher-
order functions and, so called, extension methods [33]. An executable query, after removing the LINQ syntax sugar, will take the form presented in Listing 2. var maxQuery = products. Where(p => products.Max(p2 => p2.unitPrice) == p.unitPrice). Select(p => p.productName);
Listing 2 Example 1 – de-sugared. The translated query uses the traditional, non-declarative object-oriented programming syntax. When processing collections or XML documents directly, the most crucial LINQ library extension methods (e.g., Select and Where) expose iterators that perform a specified operation on elements of a given collection. Lambda expressions are used to express details concerning such an operation, e.g. the selection predicate for the Where operator. Despite of similarity of Listing 2 to the original query expression, such composition of method calls on the products collection determines the order of evaluation. Due to the specific implementation based on iterators and lambda abstractions, the execution strategy of LINQ queries is deferred. Execution is performed in presence of functions or instructions forcing iteration over elements specified by a query. However, a result of an iteration is not saved or cached, so each execution reevaluates a query against a given (current) data state. The approach used in the LINQ to the objects’ library implementation is generally ubiquitous (however, not uniform) in numerous programming languages (e.g., Python, Java 8, Elixir, Ruby) [3]. A good summary describing the possible set of properties of functional-style constructs can be found in the documentation of Java 8 streams [2]. The functional nature makes a language construct a good candidate for optimisation due to intelligible querying. All operations in a query processing chain produce a new queryable result instead of modifying original data; hence they do not introduce side-effects. For example, during filtration on a list, no element is actually removed. Even though filter and map (common functionalstyle operators) are often used to directly process elements of a local in-memory collection, in reality elements can be obtained one by one from any so called queryable data source, e.g., a data structure, a generator function, an iterator, an I/O channel and a chained pipeline of collection operations. Such generality allows, usually time-consuming, querying of remote data sources, additionally making optimisation desirable. The above properties are common in implementations of language-integrated query mechanisms. However a programmer must be sensitive to possible differences in various programming languages. For example, some languages implement consumable evaluation of queries. In such a strategy, elements of a queryable data source instance can be visited only once during its life. As a result, each query instance can be evaluated only once. The last property is present in Java 8 streams whereas LINQ operators are not consumable. The laziness-seeking property has the most profound impact on evaluation and semantics of language-integrated queries. It is connected with the lazy evaluation strategy assuming that a next element is returned for further processing only if necessary. Usually, a place of a lazy construct definition does not determine an actual moment of query execution (i.e. deferred execution strategy). Actual query execution occurs when its result is required, for example elements referred by a query are iterated or counted. Operations like selection, projection, and removal of duplicates are often implemented lazily. Consequently, to ensure coherence execution of eager constructs (e.g., grouping or ordering) is also deferred. Lazy evaluation usually results in better performance. It is cache-friendly since an element is processed by a chain of collection pipeline operations before proceeding to the next element. Moreover, in cases when the desired query result has been reached before visiting all elements it is not necessary to continue iterating (e.g., a query finding the first product with the given name). In the context of query optimisation, it is important to preserve properties of optimised constructs. In a general case, any change in this matter can affect semantics of application code. Switching an
expression evaluation strategy from lazy to eager or forcing immediate execution can have serious consequences. Only laziness-seeking constructs can deal with possibly unbounded data sources. Eager evaluation of selection and projection operators on an infinite data source would require infinite computational resources and time, while lazy evaluation can return partial results. In the next section, we show that considering side-effects and preserving deferred execution, which is implied by the lazy evaluation strategy, are factors impeding query optimisation.
4 Manual Optimisation Pitfalls One of the main assumptions enabling the optimisation procedure is that a programmer's intent is to query data, not to intentionally introduce side-effects. For example, such queries cannot be used to add or modify data in the data sources. The most ideal situation, from the query optimisation point of view, is when the queried data are devoid of null values and there are no irregularities in the schema (e.g., optional attributes). Unfortunately, in real cases, the optimisation cannot assume existence of such “pure” environment and has to be proved against unintended, but possible, situations that can affect query execution. Because of this we divide possible side-effects that can be generated during query execution into two categories: 1. Intended side-effect, when a query definer's intention is to use the query syntax for changing the state of an application data. This case needs to be excluded from optimisation and we omit it on our further considerations. 2. Unintended (or unwanted) side-effects, when a programmer's intention is just to query the data and any side-effect that can appear during query evaluation is not related to query semantics. An occurrence of unintended side-effects is a consequence of: 1. Irregularities in data (e.g. optional values that can be evaluated to null values, empty collections, exceptions thrown, etc.). We can argue that a programmer should be aware of the data schema and prepare the query in a way allowing dealing with them. For example, the order of predicates in the filtering part of a query is intentional and allows to avoid null values, or a known exception that can be thrown is used in the catch clause of try-catch statement. If the query is not prepared for such a situation is to say that it is wrong. 2. A lack of knowledge about external programming abstractions used in a query. In imperative languages, a programmer never can be sure that calling a given method/function does not imply some side-effects (e.g. writing a log file, throwing an unknown exception, etc.). The key assumption here is that this ‘external’ behaviour does not imply the query semantics. Because a programmer is not aware of it, the query itself cannot be designedly prepared for any negative influences of a possible result of an external side-effect. Our goal is to design query optimisation that is efficient (not deteriorating performance in reasonable cases), preserves query evaluation strategy (e.g., deferred evaluation), and is semantically correct (particularly, in the presence of unintended side-effects).
4.1 Evaluation of Uncorrelated Nested Expressions Although LINQ queries' execution is deferred, an execution strategy of some expressions comprising LINQ queries can be immediate. Such expressions are evaluated locally in the place of the definition. Some operators, like aggregate functions returning a single value instead of a queryable data source, force immediate execution of a query. The query in Listing 2 contains such an expression determining the greatest unit price in the products’ collection. However, it will not be evaluated until execution of maxQuery since it is contained in a lambda expression defining a
selection predicate for the Where method. Analysing the query, it becomes obvious that the nested expression will be executed multiple times, since it is a part of the lambda abstraction (specifying the selection predicate) called against each product (the external Where operator induces a loop iterating over elements of the products' collection). This form is not efficient and makes the computational complexity quadratic (i.e., O(n2)). However, the expression determining the greatest unit price among products is independent of the parent query and could be evaluated just once. Let us refer to such constructs as “free expressions”. In order to improve query performance, a programmer must transform it. A natural way for optimisation seems to be factoring out of the problematic free subexpression to a separate instruction and assigning it to a new variable (see Listing 3). The changes could also be presented in the LINQ query expression, but because the form with extension methods is actually executed, it will be a basis for this study. var nestedMaxPrice = products.Max(p2=>p2.unitPrice); var maxQuery = products. Where(p => nestedMaxPrice == p.unitPrice). Select(p => p.productName);
Listing 3 Example 1 – loops in separate instructions. A result of the transformation shown in Listing 3 may seem effective. The solution is more efficient (a linear computational complexity) than the query in Listing 2. However, a part of the original query specified in the nestedMaxPrice variable is executed but the other part remains deferred to the moment of an actual demand. It is possible that data in a collection may change between creation of a query and its evaluation. The original query form (and the programmer’s intention) is insusceptible to it – the query is always completely executed on a current data state. After immediate execution of the nested expression one cannot be sure about it – maxQuery can be evaluated when data needed for calculating the nestedMaxPrice subexpression got already modified. As a result of the transformation, there occurred a change of the query semantics that is very difficult to detect by a programmer or tests. Ultimately, a programmer is forced to resign from deferred execution of the whole query, which is shown in Listing 4. There exist several techniques to force immediate execution of a LINQ query. For example, the ToList method returns a list containing a materialised query result. var maxPrice = products.Max(p2=>p2.unitPrice); var maxQuery = products. Where(p => maxPrice == p.unitPrice). Select(p => p.productName).ToList();
Listing 4 Example 1 – fully immediate execution (linear computational complexity). Notice that the ToList operation should be applied also to a factored out expression if its evaluation strategy is lazy. It cannot be applied to the expression defining maxPrice due to its inherent immediate execution. In our previous paper we discuss details concerning factoring out of a construct evaluated lazily [34]. Due to explicit materialisation, reusing the optimised query against a different data state becomes troublesome. For an inexperienced programmer, a way of getting an appropriate query form can be too complicated. Without deeper knowledge on the LINQ internal semantics in a context of object data, obtaining optimal structure of code is a tricky, time-consuming and error-prone task. The example shows a lack of real independence of LINQ from a type of a data source. Despite the fact that LINQ allows unified processing on various types and sources of data, an actual execution plan relies on them. It seems that the basis for elaborating this layer of the language was mostly the integration of the object-relational mapping with a type system of a programming language (what also shows at the level of the LINQ query expression syntax and implementing providers [35]).
4.2 Consequences of Changing the Evaluation Order There exist some subtle consequences concerning evaluation after the manual optimisation. In the original forms of example query (Listing 2), the nested expression would be evaluated only if the products’ collection is not empty. In the optimised form (Listing 4) it will be unnecessarily evaluated also when the collection is empty. This is particularly important for performance when a nested query operates on a collection different than the external query does. The current example concerns just one collection, but it is easy to imagine a situation when collections are distinct (e.g., products from other shops kept in separate collections). In extreme cases, if calculation of a factored out expression is time-consuming, this can worsen overall query performance. Aside from performance issues, the transformation presented in previous sections can have dangerous impact on query semantics. In the example (Listing 2), in the case of an empty collection of products, the selection predicate is not evaluated at all and the final result is simply an empty collection of product names. After optimisation (Listing 4), the expression products.Max(p2 => p2.unitPrice) is always evaluated at the beginning. The Max method applied to an empty collection throws an exception. Consequently, the behaviour of the optimised query is unsafe and inconsistent with the intent of a programmer. To make optimisation immune to the described risk, the original order of evaluation should be restored. This could be achieved by applying the lazy loading pattern to the free expression determining maxPrice. In Listing 5 we introduce an improved transformation. var maxPriceThunk = new Lazy
(() => products.Max(p2=>p2.unitPrice)); var maxQuery = products. Where(p => maxPriceThunk.Value == p.unitPrice). Select(p => p.productName).ToList();
Listing 5 Example 2 – immediate execution (linear computational complexity). A Lazy class instance is a simple thunk – an object in memory representing an unevaluated (suspended) computation, used in the call-by-need evaluation strategy. The argument of the Lazy constructor specifies a function that should be evaluated at most once, only if its value is requested for the first time. The request is signalised by accessing the Value property of the Lazy instance. Consequently, the original order of the query and the free expression evaluation are restored (except the free expression being processed at most once) making the optimisation semantically safe. It is achieved at the expense of overhead concerning access to the Value property.
4.3 Optimisation of Filtering in Correlated Queries A common query optimisation technique in databases is unnesting a correlated subquery [7][9][10]. An analogous pattern occurring in complex functional-style collection processing constructs concerns nested queries performing filtering according to a specified locally constant criterion. The query in Listing 6 contains such a subquery that counts products with the same category as a product currently processed by an outer loop induced by the first Where operator. var uniqueQuery = products. Where(p => products.Where(p2 => p.category.Equals(p2.category)).Count() == 1). Select(p => p.productName);
Listing 6 Example 2 – extension methods syntax (quadratic computational complexity). The purpose of the uniqueQuery query is to find products that alone have a specific category. The LINQ engine will perform comparisons of pairs rendered by the Cartesian square of the products' collection. The query does not contain any free expressions that are suitable for factoring out (we assume that products and p.category do not introduce any costly computations). In order to solve
the problem efficiently, a programmer must apply another strategy, e.g., he can take advantage of grouping operators (see Listing 7). var uniqueQueryUsingGroupBy = products.GroupBy(p => p.category). Where(prodGroup => prodGroup.Count() == 1). SelectMany(prodGroup => prodGroup.Select(p => p.productName))
Listing 7 Example 2 – solution using grouping operator (near-linear computational complexity). The GroupBy operator groups products according to the category property in the near-linear complexity and eliminates the need for further comparison of product categories. The rest of the query in evaluated in a linear time. Despite the improvement in performance, this solution has several drawbacks (taking as a reference the original Listing 6 form): 1. Grouping requires additional memory for an underlying data structure. 2. It might be complicated for an inexperienced programmer since it uses 4 different LINQ operators (the original one uses only Where and Select operators). 3. The original semantics has not been preserved in at least two aspects: 1. The order of products in a final result is not consistent with the order of source the products' collection. 2. In the case of a null reference in the category property, the original query would throw an exception, whereas in the grouping variant null is treated as an individual category. It is natural that the optimised query evaluation strategy takes advantage of additional data structures and more complex algorithms, but changes in query semantics are impermissible. It seems that using grouping is not a simple approach to finding general and correct transformation for optimisation of correlated queries similar to the one in Listing 6. Reconstruction of the original query semantics would require complex and careful modifications to the query in Listing 7 in order to restore the proper order in the final result and to react appropriately in presence of null references. We propose another approach for correlated query optimisation by substituting a nested filter operation with a call to an appropriate secondary access structure (e.g. an index). In C# there is a facility enabling simple creation of a lookup table of products according to the category property (i.e., the key value of the lookup table). It can be instantiated by calling an extension method ToLookup on any collection. Similarly to evaluation of the subquery determining the greatest unitPrice of the product in the first example (see Listing 2), it should be created only once during execution of the uniqueQuery query and therefore needs to be factored out. Using the same procedure as presented in the previous sections, we break the query into two instructions and, perform immediate execution (see Listing 8). To prevent creation of the lookup table before evaluation of the Where operator predicate we apply the lazy loading pattern. var productsCategoryLookupThunk = new Lazy(() => products.ToLookup(p2 => p2.category)); var uniqueQueryUsingLookup = products. Where(p => productsCategoryLookupThunk.Value[p.category].Count() == 1). Select(p => p.productName).ToList();
Listing 8 Example 2 – solution using lookup table (near-linear computational complexity). Notice that together with the products free expression also a part of the selection predicate, i.e. the p2.category expression, has been factored out and used to define the lookup table. Technically, this expression is not free in the original query, yet its value (according to Where operator semantics) depends only on the currently processed element of products collection, which itself is independent in the query. Let us call constructs such as p2.category strongly dependent expressions.
Building the lookup table defined in productsCategoryLookupThunk is more time-consuming (nearlinear complexity) then a single evaluation of a nested Where. However, the lookup table call (productsCategoryLookupThunk.Value[p.category]) takes only a fraction of the time of the original inner query and together with the Count operation can be evaluated in a constant time. The last solution, in contrast to the grouping transformation, preserves the order of the final result. Still, problem with reconstruction of the original query behaviour in the presence of null references in the category property remains. To solve this, a custom implementation of a lookup table is necessary. It must simulate the actual evaluation of Where operator, but without a need to reevaluate the recurring computations. Let us name such a structure a “volatile index” (the name will be explained in further sections). To properly fulfil its task, the volatile index must consider all aspects of business semantics and unintended side-effects associated with performing a filter operation. 1. The order of filtered elements must remain original. The current implementation of the C# lookup structure preserves it, however it is not stated in the language specifications. 2. Evaluation of expressions defining a filtered collection and a selection predicate may render an unintended side-effect (specifically, throw an exception). Even in the case of such a simple Where clause as in Listing 6, it is very difficult to predict where an exception may occur since both the products variable and the category property may encapsulate complex calculations, e.g., a query or a getter call. The following aspects must be considered to simulate the original order of evaluation and a possibility of unintended side-effects:
iteration over elements of a source collection (in our example the products' collection) is performed only when the following element is demanded for filtration,
evaluation of condition operands (the key value and the criterion) in a filtration predicate (usually performed starting from the first one on the left),
comparing the operands (assuming a correct implementation of a comparator operator, the risk occurs when it is a method called on null reference to the left operand object specified in the condition),
original evaluation strategy of Where operator (in case of lazy evaluation if condition in a predicate is true an element is further processed by collection pipeline operators before the next one is demanded).
In the example in Listing 8, the whole source collection and all key values are computed during the ToLookup method call and the criterion operand is evaluated at the end. Besides the obvious change in the order of evaluation, such an approach cannot be used in the case of filtering of an infinite data source. In the next section we present a general approach to optimisations that preserves semantics and characteristics of original queries while reducing theirs computational complexity.
5 Unnesting Uncorrelated and Correlated Queries The solutions presented for the examples share a common shortcoming; they do not preserve the deferred execution property. Our main aim is to propose general query rewriting rules overcoming this and other mentioned problems. In order to keep the solution generic, additional constraints have been assumed: (1) transformation should not break a query into separate instructions (in contrary to what is shown, for example in Listing 5), (2) we express the rules in general terms rather than LINQ specific ones (e.g., operators common in functional programming). Obviously, we assume also that the intent of a programmer is simply to query and not to introduce side-effects deliberately.
Generalisation of the factoring out of free expressions and the volatile indexing procedures should take into account queries more complex than presented above. A nesting level of a lambda expression in the examples in Listing 2 and Listing 6 is shallow, but conceptually can be arbitrary with no need to modify the procedures. Free or strongly dependent expressions can be bound either globally, i.e. to an environment independent from a query, or to a lambda expression at any nesting level lower than the lambda expression containing a free or strongly dependent expression. The examples presented in Listing 2 and Listing 6 concern the former case. A generalising solution to the latter case can be achieved by treating any subquery as a separate query and the rest as a global environment. The basic idea behind the transformation of factoring out of free expressions is, first, to identify free expressions that could be evaluated before a loop induced by an operator containing them, and next, to apply an appropriate rewriting rule. This is generalisation of the standard procedure called loopinvariant code motion known from the compilation theory [36]. An example of incorporating this idea to programming language-integrated queries can be found in the Stack-Based Architecture theory [8]. To optimise evaluation in functional languages, a similar procedure of fully lazy lambda lifting (called also full laziness) has been also proposed [17]. In both cases rewriting rules are straightforward and make use only of the basic set of language operators. Our attempt to generalise, in a similar manner, factoring out of free expressions within LINQ queries using only methods supplied by LINQ has been unsuccessful. In particular, LINQ operators in presence of a queryable data source (e.g., a collection) cause iteration over elements, whereas factoring out requires treating empty, single or multiple elements as an individual result cached for reuse in further calculation. In contrast to factoring out, the volatile indexing procedure addresses the problem of correlated nested subqueries. Various approaches to automatic query decorrelation have been proposed in the research literature for relational, object-oriented and XML databases [7][9][10], whereas in the field of functional programming and language-integrated queries this topic has not received any noticeable interest. In particular, this shortcoming concerns frequently used constructs that define correlation inside a selection predicate. Our idea for optimisation is derived from basic SQL query evaluation strategies where joins or repetitive scans over a table are performed efficiently by introducing a suitable, temporary secondary access structure (e.g., a hash table index). Volatile indexing shares many aspects of query analysis and transformation with factoring out of free expressions. Additionally, the optimisation considers semantics of data filtering operators to discover actual independence of some subexpressions and to substitute a filtering loop with a semantically equivalent index call.
5.1 Factoring out of Free Expressions The procedure of factoring out of free expressions can be applied to the following query pattern: queryUnoptimized ::= queryExpr(freeExpr) where queryExpr(freeExpr) denotes a query expression containing freeExpr that is free from any lambda abstraction within the query. Additionally, we assume that freeExpr should be evaluated several times during the execution in order to make factoring out profitable. This pattern is not restricted to a whole query. It can match any subquery. The solution requires introducing transformation of factoring out a free expression before a construct inducing a loop using it and applying the lazy loading evaluation strategy. Several aspects need to be addressed to make such optimisation effective and general. (1) In imperative programming languages deferring execution is often achieved through enclosing code in a function or by introducing an iterator. In both cases, repeated execution (e.g., inside a loop implied by map or filter collection pipeline operators) causes repeated evaluation. If this applies to a factored out expression, then it is usually necessary to force materialisation of its result before entering a loop using it. (2) Moreover, materialisation solves the issue with factoring out consumable data sources since they cannot be evaluated more than once. Before factoring out, the problem does not exist
since such constructs are introduced inside lambda abstractions (that are parameters of collection pipeline operators) and therefore a single lambda call evaluates a new instance of a consumable data source abstraction. (3) As stated earlier, it is possible that a free expression is skipped during evaluation of the original query. In a general case it is safe to preserve an order of evaluation by suspending materialisation of a factored out expression and prevent its immediate execution before entering a loop using it. (4) An instance of a mechanism used for suspending materialisation of a factored out expression should not be shared between query executions. To solve this problem, it can be additionally enclosed in a lambda abstraction. Otherwise, following executions would share a cached result determined during the first execution. (5) In order to prevent collection pipeline operators from iterating over a collection, it has to be nested into a new collection as a single element. The same procedure can be applied to a single result to enable usage of collection pipeline operators. Let us denote the following abstract operations:
Collection(arg) – creates and returns a collection consisting of one element specified by an argument, e.g., if an argument is a collection it returns a nested collection),
Materialise(expr) – applies a mechanism which materialises a result of an expression passed as an argument while preserving its original evaluation strategy (it is unnecessary when the expr execution strategy is already immediate),
Suspend(lambda) – returns an instance of a thunk (mechanism for lazy loading of an expression) specified by a lambda abstraction passed as an argument,
thunk.Value() – returns a lazily initialised value stored by a specified thunk instance.
Taking advantage of above operations, we introduce the following rewriting procedure: queryOptimized ::= Collection(() => Suspend(() => Materialise(freeExpr))). map(lambdaParam => lambdaParam()). map(freeExprThunk => queryExpr(freeExprThunk.Value())).flatten() where queryExpr(freeExprThunk.Value()) is a nested query queryExpr(freeExpr) with an occurrence of freeExpr expression substituted by freeExprThunk.Value(). This form ensures that execution of all components of the original query is deferred assuming that collection pipeline operators map and flatten have such an execution strategy. The first part of the rewritten query Collection(() => Suspend(() => Materialise(freeExpr))) creates a collection consisting of a single element that is a lambda function creating an instance of a mechanism for suspended materialisation of the factored out of free expression. The following map operator ensures execution of the lambda function. As a result, the following map operator will process queryExpr(freeExprThunk.Value()) expression only once for freeExprThunk assigned the lazily loaded cached value of freeExpr. Therefore, the result of evaluation of queryExpr(freeExprThunk.Value()) is equal to the result of evaluation of queryExpr(freeExpr). The flatten operator eliminates an outer collection implied by the Collection operator. Consequently, the final result of the optimised query is taken from evaluation of the queryExpr(freeExprThunk.Value()) expression. In the case when freeExpr defines a queryable data source evaluated in a lazy manner, optimisation
cannot rely on a trivial materialisation mechanism that simply converts a result of freeExpr to an array or a list. First, it may perturb the order of evaluation since in the original query an element of freeExpr result may be processed (e.g., by a chained pipeline of collection operations) before the next one is demanded. Secondly, a lazy evaluated queryable data source can be infinite and eager materialisation would result in an out of memory error. To avoid these issues the materialisation mechanism should implement a lazy evaluation strategy. Namely, a specific element of the freeExpr result should be calculated and memorised when it is demanded for the first time. Subsequent requests for the same element can be taken from the cache.
5.2 Volatile Indexing The general query (or subquery) pattern suitable for applying the volatile indexing procedure is slightly more complex: queryUnoptimized ::= queryExpr( queryableDataSourceFreeExpr.filter(e => partlyDependentConditionExpr)) partlyDependentConditionExpr ::= eqOp(stronglyDependentOperand, criterionOperand) | eqOp(criterionOperand, stronglyDependentOperand) | stronglyDependentOperand.eqOp(criterionOperand) | criterionOperand.eqOp(stronglyDependentOperand) where queryExpr denotes a larger query expression comprising the filter operation subexpression, that meets specified conditions. First, the collection being filtered (defined by queryableDataSourceFreeExpr) is free from any lambda abstraction within the query. Secondly, the transformation is possible when the filtering predicate takes one of the forms presented in the partlyDependentConditionExpr definition. In general, the predicate tests equality of some key property of collection elements against the given criterion. The filtration criterion is defined by the criterionOperand expression that is free in the lambda abstraction describing the filtering predicate (i.e. it does not use the e parameter). Its value is therefore constant inside a loop, implied by the filter operator, iterating over elements of the queryableDataSourceFreeExpr collection, but it may change between subsequent filter calls. On the other hand, the key property defined by stronglyDependentOperand is bound only to the lambda abstraction describing the filtering predicate (i.e., it uses only the e parameter and global ones). Considering the semantics of the filter operator, a collection of key values is constant for all filter calls during evaluation of the query. A final element defining the predicate is the eqOp operator. It is used to denote the most common ways to indicate equivalence of two objects in programming languages, i.e. a static method or an operator and an object method. Finally, we assume that in order to make volatile indexing profitable, the filter operator should be called several times during the execution. The solution requires introducing a transformation substituting a filter call with a fast index lookup. To achieve this, it is necessary to introduce an appropriate index before a construct inducing the loop using it. Let us call it a volatile index because, in contrast to indices in a database, such an index is not maintained and is valid only during the single query execution (a state of data can change between query executions). Just like through factoring out of free expressions, using a volatile index allows avoiding recurring calculations of queryableDataSourceFreeExpr and criterionOperand. Moreover, there is an additional gain from the procedure since key values defined by stronglyDependentOperand are also computed at most once during creation of the index. Building an index, in many aspects, is a step similar to materialisation of a factored out construct. In both cases, query transformation should consider its influence on changes in the order of evaluation. Consequently, some solutions can be shared, e.g., creating index instance using the lazy loading evaluation strategy. However, since the volatile indexing procedure is more complex, new issues arise that need to be addressed. In the original order criterionOperand is evaluated after requesting the first element of the collection defined by queryableDataSourceFreeExpr. Accurately, it happens directly before or after (depending on the order of operands in partlyDependentConditionExpr) a key value of the first
filtered element. Maintaining the original execution order requires a carefully implemented index. Ideally, an index should insure that materialisation of criterionOperand value, elements of the indexed collection, and individual key values is performed in a lazy manner, i.e. only when they are demanded, according to the semantics of the original query. Let us denote the following abstract operations concerning a volatile index:
VolatileIndex(querableArg, keyLambda) – returns an instance of a lazily initialised volatile index on a collection specified by the first argument indexed using a key defined by a lambda abstraction in the second argument,
index.Lookup(criterionLambda) – is an index object method returning (in the original order) elements of the indexed collection matching a criterion defined in a function given as its argument (ideally, the index is materialised during Lookup calls while consecutive filtration results are demanded).
Finally, we can introduce the following rewriting procedure: queryOptimized ::= Collection(() => Suspend(() => VolatileIndex(queryableDataSourceFreeExpr, e => stronglyDependentOperand))). map(lambdaParam => lambdaParam()).map(indexThunk => queryExpr(indexThunk.Value().Lookup(() => criterionOperand))).flatten() Expressions queryableDataSourceFreeExpr and stronglyDependentOperand have been factored out before queryExpr in order to create VolatileIndex instance before its evaluation. It is permissible since both expressions are independent from the rest of the query. The transformation, in general, is very similar to the procedure of factoring out of free expressions. Using the same reasoning as in previous subsections it is easy to prove that the final result of the optimised query is taken from single evaluation of the queryExpr(indexThunk.Value().Lookup(() => criterionOperand)) expression. It is equal to the final result of queryUnoptimized because individual index lookups are equivalent to the filter operation. Predicate operands are embraced by lambda abstractions (notice that the lambda wrapping criterionOperand does not need the e parameter since the criterion was not bound to e), which enables volatile index to manage order of their execution. The transformation omits the details concerning passing information about the original order of operands in partlyDependentConditionExpr and the method for testing the equality. This aspect can be easily handled in an index constructor or a lookup method call. 5.3 Auxiliary
Transformations
The query pattern for the volatile indexing procedure is constrained only to the filter operation with a single condition in the predicate. In practice, programmers often introduce conjunction of predicates or use dedicated methods that have semantics associated with filtering. There exist transformations that can potentially enable optimisation procedures on such queries. Programming languages with language-integrated queries are often equipped with redundant operators requiring defining a selection predicate (e.g. an existential quantifier, a count function). For example, the LINQ's Any() operator has its overloaded version Any(predicateFunction). It allows a programmer to write queries that test for existence of filtered data in a more compact, readable (and potentially optimised) way. Some of them can be easily substituted using filter operations. Such transformation takes the following query (where FBOp is a filter-based operation suitable to be transformed): queryableDataSourceExpr.FBOp(e => selectionPredicateExpr) and explicitly applies filtering before desired operation:
queryableDataSourceExpr.filter(e => selectionPredicateExpr).FBOp(e => true) Usually, FBOp operation has an overload without a parameter semantically equal to FBOp(e => true) call. For example in LINQ query: products.Any(p => p.UnitPrice == 99.99) without any change in semantics and the order of evaluation can be substituted with: products.Where(p => p.UnitPrice == 99.99).Any() Second auxiliary transformation helps in dealing with a conjunction of predicates implemented using logical shortcut AND operator: queryableDataSourceExpr.filter(e => firstPredicateExpr && restOfPredicatesExpr) In functional-style programming this query is semantically equal to double filtration: queryableDataSourceExpr.filter(e => firstPredicateExpr).filter(e => restOfPredicatesExpr) This transformation may enable the volatile indexing procedure as well as factoring out of free expressions since bindings from the restOfPredicatesExpr expression are removed from the first filter call. Enabling volatile indexing may be problematic if the partlyDependentConditionExpr condition is not the first one in conjunction of predicates chain: queryExpr(queryableDataSourceFreeExpr.filter(e => firstPredicatesExpr && partlyDependentConditionExpr)) In some cases, the only option to use a volatile index is to change the order of predicates. queryExpr(queryableDataSourceFreeExpr.filter(e => partlyDependentConditionExpr).filter(e => firstPredicatesExpr)) However, such small modification may have serious consequences. First of all, firstPredicatesExpr may be aimed to prevent evaluation of partlyDependentConditionExpr if it would otherwise lead to an exception (e.g. testing for null references or empty collections). Secondly, when partlyDependentConditionExpr evaluates to false, then evaluation of the firstPredicatesExpr expression would be skipped. Although, it does not affect the bussiness goal of the query, it might lead to a situation where some unintended side-effects are skipped. Therefore, we believe that allowing such optimisation is permissible only if a programmer is sure that his original query is safe from any unintended side-effects and all exceptions rendered by transformation can be ignored. This assumption simplifies implementation of a volatile index since there is no need for exception management and considering the order of evaluation of conditions in partlyDependentConditionExpr.
5.4 Implementing Optimisation in C# Both procedures share the same optimised query pattern: queryOptimized ::= Collection(() => Suspend(() => suspendedExpression)). map(lambdaParam => lambdaParam()). map(thunk => queryExpr(optimisedExpressionSubstitute)).flatten() In C#, to simplify optimisation we introduce an auxiliary method AsGroup to take care of the Collection operation and suspended evaluation of a lambda expression returning a machanism lazily mechanism materialising the value of the free expression or the volatile index instance. Listing 9 shows the implementation of the auxiliary operator.
static IEnumerable AsGroup(Func sourceFunc) { yield return new Lazy(sourceFunc); }
Listing 9 Auxiliary optimisation method. The Suspend operation is achieved by a Lazy class constructor new Lazy(sourceFunc). The yield return statement is a syntax sugar enabling creating a collection available through an iterator deferring any computations until an iteration starts. In this way, a programmer avoids using a concrete type of a collection and enables a compiler to choose the best implementation on its own. AsGroup exposes an iterator that returns only one element, i.e. an instance of a mechanism for suspending expression execution. The instance is created directly before yielding replacing the projection map(lambdaParam => lambdaParam()). Consequently, the rewritten queries in the case of LINQ take the following form: LINQ-deferredQueryOptimized ::= AsGroup(() => suspendedExpression). SelectMany(thunk => queryExpr(optimisedExpressionSubstitute)) where the SelectMany LINQ operator substitutes map and flatten operations. The above transformation can be adapted to a situation when queryExpr(optimisedExpressionSubstitute) is a construct executed immediately (e.g., when it returns a single value). In that case SelectMany needs to be replaced with two operations: Select realising projection and First responsible for flattening and immediate execution: LINQ-immediateExpressionOptimized ::= AsGroup(() => suspendedExpression). Select(thunk => queryExpr(optimisedExpressionSubstitute)).First() The procedure of factoring out of free expressions introduces the Materialise operation. It is required only in the case when a factored out expression freeExpr is a LINQ query deferred in execution. Explicit materialisation requires implementing a custom LINQ mechanism. In LINQ it can be applied to a query using the extension methods syntax, e.g., freeExpr.ToMaterializable(). The same approach can be used to create a volatile index instance, e.g., queryableDataSourceFreeExpr.ToVolatileIndex(). Implementations of these mechanisms are available in a public repository with a project concerning development of optimisations in LINQ (available at https://github.com/radamus/OptimizableLINQ). The transformation constitutes the general rewriting rule for optimisation of LINQ queries through factoring out of free expressions and volatile indexing procedures. Applying them to the examples from Listing 2 and Listing 6 is shown in Listing 10 and Listing 11, respectively. var maxQueryWithFactoringOutOfFreeExpressions = AsGroup(() => products.Max(p2=>p2.unitPrice)). SelectMany(maxPriceThunk => products.Where(p => maxPriceThunk.Value == p.unitPrice). Select(p => p.productName));
Listing 10 Example 1 – after factoring out suspended free expressions optimisation. var uniqueCategoryQueryWithVolatileIndex = AsGroup(() => products.ToVolatileIndex(p => p.category)). SelectMany(prodIdxThunk => products.Where(p => prodIdxThunk.Value.Lookup(() => p.category, false, true).Count() == 1). Select(p => p.productName));
Listing 11 Example 2 – after volatile index procedure optimisation. Accessing the Value property of thunk objects realises the thunk.Value() operation. To imitate order
of original calculations Lookup method in Listing 11 takes additionally two arguments:
first one indicating if partlyDependentConditionExpr expression operand defining a search key is evaluated before criterion,
second one informs if the comparison in condition has been performed using Equals object method.
The queries execution strategy after optimisation remains deferred and in the case of the first example (Listing 10), the problem of an exception while addressing an empty products’ collection does not occur. The form in Listing 11 in the presence of unintended exceptions (e.g., caused by null references) behaves like the original query in Listing 6. We have chosen an alternative approach to implement the volatile index in C# that satisfies the assumptions described in section 5.2. The easy and efficient way to build an index is to materialise all key values before a lookup on the index is called with a criterion as an argument. Such an approach is permissible only for indexing finite data sources and requires implementing a mechanism for handling unintended exceptions. Premature and eager evaluation of queryableDataSourceFreeExpr and key values may result in throwing exceptions that would not occur in the original query evaluation. To solve this issue, our volatile index caches exceptions and throws them lazily, i.e. if calculations that rendered an exception are demanded.
6 Automatic Optimisation 6.1 Free Expression Detection Described transformations are justified by the need to increase effectiveness, which is achieved at the expense of reflecting the business goal. As a result, benefits from a declarative form and an increased level of abstraction are lost. For this reason, it is desirable that such transformations should be performed automatically during compilation at run-time. LINQ expression trees enable run-time analysing and dynamic building of LINQ queries [37]. This feature allows developing an optimisation method relying on rewriting of a LINQ abstract syntax tree. Automated detection of specified query patterns and transformation to an optimised form are required to make LINQ queries truly declarative. The previous part of this paper deals with the latter, i.e. the definition of efficacious rewriting rules for factoring out of free expressions and volatile index procedures. This section describes assumptions and rules for implementing an algorithm for detection of free expressions within a query. The procedure does not address any details of implementation for the LINQ platform. It is general in terms of functional-style programming. Let us establish a set of definitions concerning expressions and lambda abstractions (inspired by the definitions introduced by Hughes [17]): Def. 1 (bound variables of lambda). An occurrence of a variable within lambda λA is bound to λA if and only if it is a parameter of λA, Def. 2 (bound expressions of lambda). An expression within lambda λA is bound to λA if and only if it contains a variable bound to λA. Def. 3 (native lambda of expression). The innermost lambda in which an expression e is bound is its native lambda. Let us denote this lambda nλ(e), Def. 4 (free expressions in lambda). An expression e within lambda λA is free in lambda λA if λA is nested in native lambda of expression e. Def. 5 (maximal free expressions). A maximal free expression (MFE) is a free expression of some λA that is not a proper subexpression of another free expression of λA.
Additionally, to simplify definitions and the description of the free expression detection rules, we assume that names of variables are unique. Moreover, we implicitly treat a whole query as a lambda abstraction with all free variables (constituting a global environment) as its parameters. In the case of examples from Listing 2 and Listing 6 native lambda of each MFE is the whole query. From the definitions above, it follows that any MFE e free in a lambda λA can be calculated before λA evaluation. Precisely, it could be determined any time during evaluation of nλ(e). The above statement is correct since: 1. e is a free expression (see definition 5). 2. λA is inside nλ(e) (see definition 4). 3. e is not bound to λA (see definition 3). 4. e does not contain variables bound to λA (see definition 2). 5. λA call does not introduce any variable (parameter) required by e (see definition 1) that makes e independent from λA. Consequently, it is possible to factor out the expression e from λA and evaluate it at the level of the nλ(e) lambda. The algorithm implementation can use the standard depth-first search approach and detects all MFEs during a single pass through a query expression tree. Expression visitation focuses on finding its bindings that we define as a set of lambda abstractions declaring variables (usually as lambda parameters) used in the expression. This information is further used to determine bindings of its parent. Usage of lambda abstraction parameters determines whether an expression is free or bound. Therefore, it is necessary to handle information about names of the parameters and lambda abstractions to which they are bound. This is a task of an auxiliary map called binders. To correctly manage parameters’ binding, the procedure specifically handles lambda abstractions and terminal name binding expressions. While visiting lambda abstraction, the binders’ map is filled with its parameters. They are visible only within the lambda abstraction. This sets the right context for the recursive visitation of the lambda body in order to detect free expressions bound specifically to the current lambda. Finally, the bindings set is returned to the lambda parent except for the current lambda that is removed (information on binding to the current lambda is not relevant outside). The binders’ map is used when visiting name-binding terminal expressions. These expressions consist only of an identifier name. If a name is found in the binders’ map, a corresponding lambda is returned (as a single-element bindings set). If a name is not bound to any lambda, then it is assumed to be a globally free variable. The described behaviour does not concern a name on the right hand side of a member access operator (e.g., field names). Such a name is bound locally to its left side, therefore field member access bindings are inherited from their left side expression. In general, bindings for remaining types of expressions are simply inherited from their children (a sum of the sets). To simplify the binding analysis the nesting level annotations for lambda abstractions and variables can be introduced in the algorithm implementation. Expression bindings provide sufficient information to determine all MFEs and their native lambdas. var promoProducts = products.Where(p => products. Any(p2 => p2.unitPrice == Math.Round(p.unitPrice / 1.2, 2))). Select(p => p.productName);
Listing 12 Example 3 – A query concerning the promoProducts problem before optimisation. To exemplify the sketched algorithm let us consider the promoProduct problem shown in Listing
12. The expression determining a price Math.Round(p.unitPrice / 1.2, 2)) is unnecessarily evaluated multiple times during execution of the inner loop implied by the Any operator. The query after factoring out of free expressions is presented in Listing 13. What distinguishes this and previous examples is that the transformation applies not to the whole query but only to the Where predicate. Additionally, the predicate is not a LINQ query but an expression returning a Boolean value. Therefore, Select and First methods were used instead of SelectMany. var promoProductsWithFactoringOutOfFreeExpressions = products.Where(p => AsGroup(() => Math.Round(p.unitPrice / 1.2, 2)). Select(priceThunk => products.Any(p2 => p2.unitPrice == priceThunk.Value)).First()). Select(p => p.productName);
Listing 13 Example 3 – factoring out of free expressions inside lambda abstraction. Partial results of the algorithm work for the unoptimised query are presented in Fig. 1. Each abstract syntax tree node of the query is annotated with three values: (1) a number indicating an order of visitation, (2) a lambda expression directly including an expression, (3) bindings set including the bolded element denoting a native lambda of an expression. Lambda expressions have been assigned unique numbers to facilitate their identification. Bindings that are removed at the end of lambda node visitation are indicated by a strikethrough symbol.
Fig. 1 Example abstract syntax tree nodes annotations. Free expressions have their native lambda (bolded lambda in bindings set) different from a nearest lambda (denoted by the second annotation), i.e. expressions with visitation order ranks 6, 11-17. After omitting terminal expressions such as literals (constant type nodes ranks 16 and 17) and name
bindings (bind type nodes ranks 6 and 15), the only MFE left to factor out is Math.Round(p.unitPrice / 1.2, 2)). Its native lambda is λ1. Hence, factoring out should be applied to its indirect parent: the Any node with a visitation order rank 5 (presented in Listing 12). It is an expression inducing iteration over the products collection at the highest level within λ1.
6.2 Volatile Indexing and Free Expressions Similarly, expression tree analysis supplies information about possibility to apply the volatile indexing procedure. Operator Any is a redundant operator taking as an argument a selection predicate and potentially can be submitted to volatile indexing procedure (detail explanation in section 5.3). The selection predicate condition takes as a first operand the p2.unitPrice expression which is bound only to lambda abstraction λ2.. Hence, it is strongly dependent to the collection defined by the products free expression (the first argument of the Any extension method). Since expression Math.Round(p.unitPrice / 1.2, 2)) is a MFE it can be considered the selection criterion while p2.unitPrice constitutes the key value definition. The query rewritten according to the transformation presented in section 5.2 takes the form presented in Listing 14. var promoProductsWithVolatileIndex = AsGroup(() => products.ToVolatileIndex(p2 => p2.unitPrice)). SelectMany(prodIdxThunk => products. Where(p => prodIdxThunk.Value. Lookup(() => Math.Round(p.unitPrice / 1.2, 2), true, false).Any()). Select(p => p.productName));
Listing 14 Example 3 – after volatile index procedure optimisation. Further analysis of a transformed expression tree may reveal that new optimisations are possible or that rewriting has cancelled a possibility of another optimisation. For example, after factoring out of free expressions (Listing 13) applying volatile index procedure is still possible, whereas the form in Listing 14 has no MFEs proper for factoring out (loop induced by the Any operator has been removed by introducing the index lookup call). Effectively, the volatile indexing procedure may in result prevent multiple calculation of some MFEs (particularly, the Math.Round(p.unitPrice / 1.2) expression in the discussed example).
6.3 Applying Optimisation Procedures The rewriting rules can be applied during visitation of lambda expressions. However, in case of factoring out of free expressions not all MFEs should be factored out. The conditions under which the optimisation can be profitable are described in analogous solutions [8][16], namely: (1) a free expression cannot be too simple (e.g., names and literals), (2) a free expressions’ result should be used more than once. They can be verified during preparation to the transformation. First, the complexity of an MFE can be examined. An appropriate threshold for applying transformation could be introduced, e.g., based on an arbitrarily set weight of language constructs comprising an MFE. Performance tests on the promoProducts problem involving factoring out a relatively simple expression have proven improvement in the case of collections consisting of at least tens of objects. For over hundred products the optimised query was about twice as fast. The second condition concerns a number of times that a MFE result is used in evaluation. An additional analysis may be necessary for confirming that nλ(MFE) contains a method that causes iteration over some collection that may require repeated evaluation of the MFE. For example, in LINQ this concerns mainly operators parameterised with a lambda abstraction (such as Select, Where, Max, etc). Operators operating on sets (e.g., Contains, Union) or custom ones are not any indication for the optimisation. Similarly, the volatile index should be used only if a substituted filter operation is likely to be executed multiple times. The more detailed cardinality analysis is doubtful in the case of a programming language environment and a lack of a cost model.
We have implemented a prototype LINQ provider library realising the factoring out of free expressions optimisation (available at https://github.com/radamus/OptimizableLINQ). The analysis and the transformation are performed using the LINQ expression trees’ representation available at runtime. Access to expression trees is provided though the IQueryable interface that does not allow direct query execution. Instead, it exposes an abstract syntax tree of a query (in a form of a type-checked expression tree) to a data store provider. The provider makes use of this representation to build a query in a form (language) dedicated for a given data model (e.g., LINQ to SQL) [37]. Implementing optimisation in the form of a LINQ provider library gives a developer possibility to resign from aggressive, global query optimisation, e.g. when the order of evaluation is important considering some planned side-effects. To enable automatic optimisation, the AsOptimizable extension method should be applied to a source collection. It is shown in Listing 15 for the maxPrice example. var maxQuery = products.AsOptimizable(). Where(p => products. Max(p2 => p2.unitPrice) == p.unitPrice). Select(p => p.productName);
Listing 15 Example 1 – automatically optimised. As a result, a rewritten query is compiled and becomes available for multiple use. One-time overhead occurring at the site of the definition is about a millisecond. A developer should consider runtime optimisation with caution when a query is used only once over a small collection. In contrast to LINQ, Java 8 streams operators are consumable, which prevents multiple usages of the same query. We are not aware of any mechanism enabling rewriting optimisations of Java 8 stream queries at runtime; nevertheless, in the case of consumable constructs the cost of optimisation done at runtime would burden each query execution.
7 Performance Tests We have evaluated the impact of the optimisation procedures in C# by applying them manually to a number of problems: maxPrice – given a collection of products, find products with the maximal price in the collection, pythagoreanTriples – from natural numbers between 1 and n find a number of triples satisfying the Pythagorean theorem, promoProducts – given a collection of products, find names of products in the imaginary sale promotion, i.e. exactly k times more expensive than any other product, uniqueCategory – given a collection of products, find categories in which there is only one product, and, samePrices – given a collection of products and set of product categories for audit, find all pairs of products from different categories with the same price, where one product belongs to audited category and the other is taken from the whole collection of products. var pythagoreanTriples = Enumerable.Range(1, max + 1).SelectMany(a => Enumerable.Range(a, max + 1 - a).SelectMany(b => Enumerable.Range(b, max + 1 – b).Where( c => a * a + b * b == c * c))).Count();
Listing 16 A query concerning the pythagoreanTriples problem before optimisation. var samePrices = products. Where(p => categoriesToAudit.Contains(p.category)). SelectMany(p => products.Where(p2 => !p2.category.Equals(p.category) && p2.unitPrice == p.unitPrice). Select(p2 => p.productName + " and " + p2.productName));
Listing 17 A query concerning the samePrices problem before optimisation. In experimental tests, the collection of products ranged from 1 to 1,000,000 elements. The size of each product averaged to 175 bytes. Tests for maxPrice, pythagoreanTriples, promoProducts,
uniqueCategory and samePrices problems have been conducted using queries in Listing 2, Listing 16, Listing 12, Listing 6, and Listing 17 accordingly. Each of the first three problems has at least one free expression suitable for the factoring out optimisation. A solution to maxPrice have a free nested query, whereas pythagoreanTriples and promoProducts introduce simple mathematical calculations that can be factored out. Problems promoProducts, uniqueCategory and samePrices introduce a nested filtration that is suitable for applying the volatile indexing procedure. We have tested a combination of proposed optimisations with the promoProducts problem. Plots show also the results of alternative solutions to selected problems, i.e. using the grouping operation GroupBy or the Lookup structure. The samePrices problem concerns the case discussed at the end of section 5.3 where a condition suitable for a volatile index is not the first one in the selection predicates' chain. Consequently, we assumed that the optimisation can ignore the unintended side-effects and focus only on a business goal. We conducted our experiments on a workstation with a 4-core Intel Core i7 4790 3.6 GHz processor, 32 GB of DDR3 1600MHz RAM, hosting Windows Server 2012 R2. Benchmarks have been compiled for a x64 platform with enabled code optimisations using target .NET Framework v. 4.5.51650. Tests results for following problems are presented in Fig. 2, Fig. 3, Fig. 4, Fig. 5 and Fig. 6. The tests include comparison with PLINQ and the LinqOptimizer optimisation framework which take advantage of parallel processing. We also combine them manually with our optimisation to explore limits and further opportunities. The maxPrice problem test includes factoring out of free expressions performed by our LINQ provider library denoted as OptimizableLINQ (the tested query is shown in Listing 15). Since a LINQ query can be executed repeatedly, the tests focus on query execution times and omit a one-time overhead, i.e. optimisation and compilation of a query in case of OptimizableLINQ and LinqOptimizer. Most of the plots use logarithmic scales to more clearly reveal differences in performance for various collection sizes. To improve readability, some plots omit optimisation variants that are generally worse. In particular, the better alternative between PLINQ and LinqOptimizer is selected.
Fig. 2 Query evaluation times for maxPrice problem.
Fig. 5 Query evaluation times for pythagoreanTriples problem.
Fig. 4 Query evaluation times for promoProducts problem.
Fig. 3 Query evaluation times for uniqueCategory problem (on the right side enlarged part with linear scale on execution time axis).
Fig. 6 Query evaluation times for samePrices problem (on the right side enlarged part with linear scale on execution time axis). Results of the tests are as follows: The results are consistent with an expected computational complexity. In all problems concerning the products' collection it has been reduced from quadratic to linear, achieving a gain in orders of magnitude for large collections, e.g.,: in the maxPrice example the query after factoring out is more than 60,000 times faster for 100,000 products (boost from ~229 s to ~3.5 ms), in the promoProducts example the query after applying a volatile index is nearly 3,000 times faster for 1,000,000 products (boost from ~903 s to ~329 ms). Except for the pythagoreanTriples problem, the profitability threshold of an individual factoring out optimisation is very low when comparing to PLINQ and LinqOptimizer. Even for a collection of a few objects, optimised queries can work faster than original ones (e.g., the maxPrice problem). This conclusions (however, in a smaller degree) apply also to the volatile indexing procedure (building an index imposes a small overhead). The performance penalty of optimisation procedures in the case of a collection consisting of a single element is below 1 μs (which corresponds to a ~80% deterioration in the uniqueCategory problem). When processing large collections, the proposed transformations can give several times better performance by taking advantage of PLINQ (especially in the case of the maxPrice and samePrices problem). For smaller collections, PLINQ imposes significant overhead. The pythagoreanTriples problem optimisation tests show that it may be difficult to obtain a significant gain when factoring out a simple expression (i.e. a * a + b * b). A ~11% gain is achieved for n equal to 10,000. The LinqOptimizer framework seems to be designed for optimising queries involving numbers rather than complex objects. Only in the pythagoreanTriples problem optimisation, it outperforms both PLINQ and factoring out. In general, combining factoring out of free expressions with LinqOptimizer is not likely to produce the best solution. However, it seems that tuning of the LinqOptimizer algorithm should be possible. In the pythagoreanTriples problem, PLINQ is able to produce a more efficient query after factoring out, whereas LinqOptimizer favours the original query. Unfortunately, the differences are too small to be seen on the plot. The promoProducts problem test confirms that applying the factoring out optimisation before the volatile indexing procedure is ineffective since the latter transformation factors out some expressions and eliminates a filtering operation loop. Consequently, changes introduced with the first transformation may impose unnecessary burden (~29%
deterioration for 1,000,000 objects in collection). In general, GroupBy alternatives to uniqueCategory and samePrices problems are faster solutions, however they interact negatively with optimisations that enable parallel processing. For larger collections, the volatile indexing procedure boosted using PLINQ offers comparable performance without a loss in the query business semantics and occasionally even slightly outperforms the locally best GroupBy variant (e.g., ~7% more efficient for ~500,000 products in the samePrices problem). Query optimised and compiled automatically by OptimizableLINQ during the run-time is noticeably slower than its compile-time equivalent (see the factoring out variant in maxPrice problem). This is probably due to the usage of run-time C# code compiler in current optimiser implementation. Namely, compilation of some inner lambda expressions occurs not before but during query execution. Still run-time optimisation is extremely profitable when processing large collections. The tests show that run-time optimisation of queries executed only once during an application lifetime should not be considered in case of collections smaller than thousands of objects. This issue concerns OptimizableLINQ and LinqOptimizer which are burden with an overhead of the order of milliseconds incurred before the first query execution (the overhead is not shown on the plots). C# libraries offer a Lazy class realising the Suspend operation, but considering performance, we have implemented our own lightweight version. We have experimented with different variants of performing Collection, Suspend operations but the presented solutions generally resulted in performance better than others.
8 Summary In this paper, we have presented issues concerning the unnesting procedure of correlated and uncorrelated language-integrated queries. The proposed solution proves that it is possible to provide programming languages offering functional-style access to querying data collections with resourceindependent static optimisation mechanisms. We proposed a formal methods – factoring out of free expressions and volatile indexing – based on higher-order functions rewriting. Their essence is to avoid unnecessary recurring calculations. Presented methods may reduce computational complexity producing a robust performance gain. Such optimisations can be fully automated and do not require any interference or implementation-specific knowledge from a programmer. Using simple examples, we emphasise the significance of an order and a strategy of evaluation implied by semantics of functional-style operators. Finally, we elaborate general and safe optimisation procedures, considering characteristics of functional-style querying and a possibility of unintended side-effects in imperative programming languages. In contrast to the Nectar system [13], which also uses term rewriting to increase sharing of computations, our work addresses functional-style queries in general, i.e. without context of application which would limit our optimisation. We take advantage of the similar approach to optimisation as Steno [22], LinqOptimizer [23], or SkyLinq [24]. However, we make an attempt to explore more aggressive, global optimisations comparable to optimisations of database query languages. The presented approach was verified in Microsoft .NET environment and its Language-Integrated Query technology. However, the automated solution has not been straightforward to elaborate due to necessity of considering several variants implied by execution strategies of constructs comprising LINQ queries and complexity of implementing LINQ providers [35]. Our optimisation for LINQ can be combined automatically with other ones as long as they preserve queries in an expression trees form. In other cases, fusion of optimisations has to be done manually.
For example, PLINQ enables to take advantage of multiple cores and achieve several times better efficiency in processing of large collections. Moreover, the optimiser in some cases could automatically (or by a programmer’s decision) resign from suspending evaluation of a factored out expression and remove overhead that it imposes. The tests showed that it results in further improvement of performance, up to ~18%. Finally, it seems that transformations would be the most profitable if incorporated in a compiler. Considering source-to-source transformations already performed by the C# compiler on LINQ query expressions [33] this solution imposes itself. We believe that our work is as a real step towards genuine declarative language-integrated queries. We conduct further works on optimisation of functional-style constructs processing collections. One branch of our research concerns the elaboration of methods that are aware of operators semantics, Currently, we are working on extending our LINQ query optimisation provider with the volatile indexing procedure, which shows performance gain of several orders of magnitude. However, there exist other well promising optimisation approaches addressing nested queries in the context of specified operators (e.g., pushing selection [38]). We also consider adapting other methods, such as revealing weak dependencies within queries that enable performing further factoring out [39].
References [1] E. Meijer, “The world according to LINQ,” Commun. ACM, vol. 54, no. 10, pp. 45–51, Oct. 2011. http://dx.doi.org/10.1145/2001269.2001285 [2] Oracle, Java API docs, “Package java.util.stream”, http://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html, accessed: December 2014. [3] M. Fowler, “Collection Pipeline”, 28 July 2014, http://martinfowler.com/articles/collectionpipeline/, accessed: December 2014. [4] Hudak, "Conception, Evolution, and Application of Functional Programming Languages". ACM Computing Surveys 21 (3), pp. 383–385, 1989. http://dx.doi.org/10.1145/72551.72554 [5] Chaudhuri, “An Overview of Query Optimization in Relational Systems”, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of data-base systems, Seattle, Washington, United States, pp. 34-43, 1998. http://dx.doi.org/10.1145/275487.275492 [6] M. Jarke, and J. Koch, “Query Optimization in Database Systems”, ACM Computing Surveys 16(2), pp. 111-152, 1984. http://dx.doi.org/10.1145/356924.356928 [7] W. Kim, “On optimizing an SQL-like nested query”, ACM Trans. on Database Systems, 7(3), pp. 443–469, 1982. http://dx.doi.org/10.1145/319732.319745 [8] J. Plodzien, and A. Kraken, “Object Query Optimization through Detecting Independent Subqueries”. Information Systems 25(8), pp. 467-490, 2000. [9] S. Cluet, and G. Moerkotte, “Nested Queries in Object Bases”, In DBPL'93, pp. 226-242, 1993. [10] N. May, S. Helmer, and G. Moerkotte, “Strategies for Query Unnesting in XML Databases”, ACM Transactions on Database Systems (TODS), Volume 31 Issue 3, September 2006, pp. 9681013, 2006. http://dx.doi.org/10.1145/1166074.1166081 [11] E. Zdravevski, P. Lameski, A. Kulakov, S. Filiposka, D. Trajanov and B. Jakimovski, “Parallel computation of information gain using Hadoop and MapReduce”, Proceedings of the 2015 Federated Conference on Computer Science and Information Systems, Annals of Computer Science and Information Systems, Volume 5, pp. 181-192, 2015. http://dx.doi.org/10.15439/2015F89 [12] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey, “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language”, Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008. [13] P. K. Gunda, L. Ravindranath, C. A. Thekkath, Y. Yu, and L. Zhuang. “Nectar: Automatic Management of Data and Computation in Datacenters”, In OSDI, 2010. [14] L. Fegaras, and D.Maier, “Optimizing object queries using an effective calculus”, ACM
Transactions on Database Systems (TODS), Volume 25 Issue 4, pp. 457-516, 2000. http://dx.doi.org/10.1145/1166074.1166081 [15] S. Jones, A. Tolmach, and T. Hoare, “Playing by the Rules: Rewriting as a practical optimisation technique in GHC”, Proceedings of the 2001 Haskell Workshop, pp. 203-233, 2001. [16] S. P. Jones, W. Partain, and A. Santos, "Let-Floating: Moving Bindings to Give Faster Programs", Proceedings of the 1996 ACM SIGPLAN International Conference on Functional Programming, 1996. http://dx.doi.org/10.1145/232629.232630 [17] J. Hughes, “The Design and Implementation of Programming Languages”, Oxford University, D.Phil. Thesis, 1983. [18] A. Biboudis, N. Palladinos, and Y. Smaragdakis, “Clash of the Lambdas”, 9th ICOOOLPS (Implementation, Compilation, Optimization of OO Languages, Programs and Systems) workshop, Uppsala, Sweden, 2014. [19] T. Grust, J. Rittinger, and T.Schreiber, “Avalanche-safe LINQ compilation”, Proceedings of the VLDB Endowment, Volume 3 Issue 1-2, pp. 162–172, 2010. http://dx.doi.org/10.14778/1920841.1920866 [20] J. Chaney, S. Lindley, and P Wadler, “A practical theory of language-integrated query”, ICFP '13 18th ACM SIGPLAN international conference on Functional programming, ACM SIGPLAN Notices - ICFP'13, Volume 48 Issue 9, pp. 403-416, 2013. http://dx.doi.org/10.1145/2500365.2500586 [21] i4o, “i4o - Indexed LINQ”, http://i4o.codeplex.com, accessed: September 2014. [22] D. G. Murray, M. Isard, and Y. Yu, "Steno: Automatic Optimization of Declarative Queries", PLDI’11 June 4–8, San Jose, California, USA, 2011. http://dx.doi.org/10.1145/1993498.1993513 [23] N. Palladinos, and K. Rontogiannis., “LinqOptimizer: an automatic query optimizer for LINQ to objects and PLINQ”, http://nessos.github.io/LinqOptimizer/, accessed: December 2014. [24] Sky LINQ, “Sky LINQ”, https://skylinq.codeplex.com, accessed: December 2014. [25] The Reactive Manifesto, “The Reactive Manifesto”, http://www.reactivemanifesto.org, 23 September 2013, accessed: December 2014. [26] E. Meijer, “Your mouse is a database”, Commun. ACM, 55(5), pp. 66-73, 2012. http://dx.doi.org/10.1145/2160718.2160735 [27] I. Maier, and M. Odersky, “Higher-Order Reactive Programming with Incremental Lists”, ECOOP 2013 – Object-Oriented Programming, Lecture Notes in Computer Science Volume 7920, pp. 707-731, 2013. http://dx.doi.org/10.1007/978-3-642-39038-8_29 [28] G. Schueller, and A. Begrend, “Stream fusion using reactive programming, LINQ and magic updates”, 16th International Conference on Information Fusion (FUSION), pp. 1265-1272, 2013. [29] C. Reichenbach, Y. Smaragdakis, and N. Immerman. PQL, “A purely-declarative java extension for parallel programming”. ECOOP ’12, LNCS 7313, pp. 53–78, 2012. http://dx.doi.org/10.1007/978-3-642-31057-7_4 [30] E. Wcislo, P. Habela, and K. Subieta, “Implementing a Query Language for Java Object Database”, ADBIS 2012, pp. 413-426, 2012. http://dx.doi.org/10.1007/978-3-642-33074-2_31 [31] R. Adamus., P. Habela, K. Kaczmarski., M. Lentner, K. Stencel, and K. Subieta, “StackBased Architecture and Stack-Based Query Language”, ICOODB Proceedings of the First International Conference on Object Databases. Germany, Berlin, pp. 77-95, 2008. [32] K. Subieta, “Stack-Based Approach (SBA) and Stack-Based Query Language (SBQL)”, http://www.sbql.pl, 2011, accessed: December 2014. [33] G. M. Bierman, E. Meijer, and M. Torgersen, “Lost In Translation: Formalizing Proposed Extensions to C#”, Proceedings of the 22nd annual ACM SIGPLAN conference on Objectoriented programming systems and applications, pp. 479-498, 2007. http://dx.doi.org/10.1145/1297105.1297063
[34] R. Adamus, T.M. Kowalski and J. Wiślicki, “A step towards genuine declarative languageintegrated queries”, Proceedings of the 2015 Federated Conference on Computer Science and Information Systems, Annals of Computer Science and Information Systems, Volume 5, pp. 935-946, 2015. http://dx.doi.org/10.15439/2015F156 [35] O. Eini, “The Pain of Implementing LINQ Providers”, ACM Queue - Mobile Devices in the Enterprise, Volume 9 Issue 7, 2011. http://dx.doi.org/10.1145/1297105.1297063 [36] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, “Compilers - principles, techniques and tools”, Pearson Education, Inc., 2006. [37] T. Petricek, “Building LINQ queries at runtime in C#”, http://tomasp.net/blog/dynamic-linqqueries.aspx/, 2007, accessed: December 2014. [38] M. Drozd, M. Bleja, K. Stencel, and K. Subieta, “Optimization of Object-Oriented Queries through Pushing Selections”, Advances in Databases and Information Systems, Advances in Intelligent Systems and Computing, Volume 186, pp. 57-68, 2013. http://dx.doi.org/10.1007/978-3-642-32741-4_6 [39] M. Bleja, K.Stencel, and K. Subieta, “Optimization of object-oriented queries addressing large and small collections” IMCSIT 2009, pp. 643-650, 2009. http://dx.doi.org/10.1007/978-3642-32741-4_6
Highlights ●
Selected optimization methods from query languages work with imperative languages.
●
Optimisation relies on higher-order functions analysis.
●
Factoring out of free expressions avoid unnecessary multiple calculations.
●
Volatile indexing procedure performs unnesting of correlated queries.
●
Proof-of-concept LINQ optimiser with results for sample queries is presented.