Future Generation Computer Systems 21 (2005) 717–724
An efficient incremental marshaling framework for distributed systems

K. Popov a,∗, V. Vlassov b, P. Brand a, S. Haridi b

a Swedish Institute of Computer Science (SICS), Box 1263, SE-164 29 Kista, Sweden
b Department of Microelectronics and Information Technology, Royal Institute of Technology, Electrum 229, SE-164 40 Kista, Sweden

∗ Corresponding author. E-mail address: [email protected] (K. Popov).
Available online 5 November 2004. doi:10.1016/j.future.2004.05.012
Abstract

We present an efficient and incremental (un)marshaling framework designed for distributed applications. A marshaler/unmarshaler pair converts arbitrary structured data between its host and network representations. This technology can also be used for persistent storage. Our framework simplifies the design of efficient and flexible marshalers. The network latency is reduced by concurrent execution of (un)marshaling and network operations. The framework is actually used in Mozart, a distributed programming system that implements Oz, a multi-paradigm concurrent language.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Distributed computing; Marshaling; Concurrency; Latency; Throughput
1. Introduction

This work addresses marshaling/unmarshaling in distributed systems. When a message (a data structure) is sent from one host to another (see Fig. 1), it is copied from the memory of the source host to the network, and then from the network to the memory of the destination host. This process is rather trivial if the data to be sent has a regular structure such as an array of bytes. In this case, the bytes are sequentially copied to the network, forming a serial (network) representation of the array's memory (host) representation. Our work targets arbitrary structured data, where marshaling is less trivial, as illustrated in Fig. 1. Furthermore, the memory representation of transferred data on the receiving host does not have to be the same as on the sending host, in particular on hosts with different architectures. Persistent storage is another application of marshaling. Our goals are good run-time performance and compact serial representations, even at the expense of portability between systems, e.g. in the sense of XML. We also address software engineering issues such as flexibility and maintainability.

The way marshaling interfaces with the rest of the software of a host in a distributed system can affect the network latency and throughput. To explain this, consider the architecture of a node in a distributed system shown in Fig. 2. Here, a message is first constructed by the host application software. Then a reference to that data is passed to the marshaler, which constructs the serial representation of the message in the marshaling buffer, or just the buffer hereafter.
Fig. 1. (Un)marshaling.
The buffer is copied to the network by the network layer. Observe that messages are shared between the application and the marshaler, whereas the buffer is shared between the marshaler and the network layer. To fully utilize the network, the network layer has to be invoked sufficiently frequently and with a sufficient amount of data on hand: lost bandwidth cannot be "made up" later by calling the network layer more frequently or with bigger chunks of data. The application, marshaler and network layers can run sequentially, as shown in Fig. 2 (case Sequential). In this case, the network layer waits until marshaling completes, causing some network bandwidth to be lost.

Alternatively, the network layer can run concurrently with the application layer and with the marshaling of subsequent message(s), as shown in Fig. 2 (case Concurrency I). This approach does not affect the latency, but improves the throughput of the whole system because of the better utilization of the network bandwidth.

We take the third approach, depicted in Fig. 2 as case Concurrency II. In our approach, the marshaler and the network layer run concurrently on the same message, reducing the latency since the network layer can start sending a message before its marshaling completes. This requires the marshaler to be preemptable. Preemption also allows the use of a fixed-size buffer for the serial representation, and limits the time spent in the marshaler at every invocation (a sketch of this interleaving follows Fig. 2). The concurrently running marshaler and network layers must synchronize on the shared buffer. Synchronization between the concurrently running application and marshaler is only necessary if data in messages is mutable; it can be avoided by copying messages. If the application's data model distinguishes between immutable and mutable data, it can be arranged that only mutable data is copied.

A marshaler in our framework consists of two parts: (1) a set of methods that marshal structural elements of data, which we call nodes, and (2) a traverser that applies those methods to nodes according to some traversal strategy. There is a dedicated marshaling method for each node type. A serial representation consists of a series of tokens that correspond to data elements.
Fig. 2. Concurrency in a distributed system.
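To make the Concurrency II scheme concrete, the following self-contained C++ sketch interleaves a preemptable toy marshaler with "network" sends over one small fixed-size buffer. All names here are illustrative assumptions, not Mozart's API; the toy "marshaler" just emits the bytes of a string piecewise.

```cpp
// Sketch of "Concurrency II": a preemptable marshaler interleaved with
// network sends over one reusable fixed-size buffer.
#include <cstddef>
#include <cstdio>
#include <string>

struct Buffer {
    char data[8];               // deliberately tiny, to force preemption
    std::size_t used = 0;
    bool full() const { return used == sizeof(data); }
};

struct ToyMarshaler {
    std::string msg;            // the "message" being serialized
    std::size_t pos = 0;        // traversal state preserved across calls
    // Fill 'buf' until it is full (preemption) or the message is complete;
    // return true when the whole message has been marshaled.
    bool marshalSome(Buffer& buf) {
        while (!buf.full() && pos < msg.size())
            buf.data[buf.used++] = msg[pos++];
        return pos == msg.size();
    }
};

void sendToNetwork(Buffer& buf) {        // stands in for the network layer
    std::fwrite(buf.data, 1, buf.used, stdout);
    buf.used = 0;                        // the buffer is reusable after a send
}

int main() {
    ToyMarshaler m{"structured message data\n"};
    Buffer buf;
    bool done = false;
    while (!done) {                      // marshaling and sending interleave,
        done = m.marshalSome(buf);       // so sending starts before marshaling
        sendToNetwork(buf);              // of the message completes
    }
}
```

Because the marshaler's state survives each preemption, the same small buffer cycles between the marshaler and the network layer for the whole message.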
An unmarshaler, in turn, is a procedure that reads the serial representation token by token, constructs data elements from the tokens, and passes those nodes to the builder, which assembles them into the structured data. There are explicit, fixed interfaces between the traverser and the marshaling methods, as well as between the unmarshaling procedure and the builder. The presence of these interfaces simplifies design and code maintenance. The same traverser/builder pair can also be used for persistent storage, for inspecting data, and even for copying data in an incremental garbage collector.

References within a node can be retrieved either by the traverser or passed back by specific node marshaling methods. In our scenario, where we needed different marshalers for the same data structures, reference retrieval is, for the most part, encoded into the traverser itself. For example, the node marshaling methods for persistent storage, for data exchange in distributed applications, and for snapshotting are different. However, for nodes that have an irregular structure, reference retrieval is best performed by the marshaling methods. Such nodes have to be analyzed in order to find references, and that analysis can be done along with marshaling the node itself, i.e. we avoid doing the analysis twice. For example, a chunk of bytecode in a bytecode-interpreted language can contain references to data structures, but locating them requires analysis of all instructions from the beginning of the chunk. Another special case is the marshaling of large nodes, which needs to be preemptable; when this type of preemption occurs, the status of marshaling within the node needs to be saved. We developed the concept of opaque node processors to handle both these types of nodes. An opaque node processor is an abstract data type that provides a "marshal" method and encapsulates the state of marshaling. A marshaler gives an opaque node processor to the traverser, and the traverser runs it. References in an opaque node are passed back to the traverser and marshaled in the usual way. A corresponding approach is also used for unmarshaling.

Our framework suits distributed systems that are implemented in languages like C++ and run on conventional single-processor hardware under an OS like Unix or Windows. To the best of our knowledge, this is the first paper that addresses concurrent marshaling of structured data in this context.
Serialization is known for programming languages such as Java, Python, and Erlang [1], but we have not seen any publications about the serialization in their implementations. Efficient serialization is known for, e.g., RPC/XDR [5,6], CORBA [3], and parallel programming environments such as MPI [7], but these implementations address non-structured data. Work on portable serialization is being carried out (e.g. WDDX [4] or XML-RPC [8]), but it focuses on neither efficiency nor incrementality.
2. Data model

In this work, we consider marshaling of structured data that consists of nodes, each containing values and references to other nodes. Nodes containing references are called compound; other nodes are primitive. References are directional; nodes that are pointed to by references are called (direct) descendants. There is a root node, and there can be cycles in a data structure. In a C++ application, for example, this data model naturally maps to memory objects such as records, and their addresses.

The serial representation of a data structure consists of a sequence of tokens that represent nodes. A token contains representations of the corresponding node's values. The simplest representation of a value is its memory representation, but in practice other representations are used for reasons of cross-platform portability, space efficiency, network security, and others. Note that tokens do not contain any representation of references. Instead, references are usually represented implicitly through the order in which the data is traversed and nodes are marshaled to the buffer. A reoccurring node, e.g. in a cyclic structure, is represented by a special REF(erence) token containing the node's index; the first, or defining, occurrence of such a node also carries its REF index.

The example in Figs. 3 and 4 depicts a class in an object-oriented language and presents a possible memory representation and a corresponding serial representation. The serial representation is built according to the depth-first, left-to-right traversal strategy. The class has two references: a list of static class members and a list of class methods.
Fig. 3. Memory representation.
Fig. 4. Serial representation.
The serial representations of these descendants follow the class node representation.
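To make the model concrete, the following C++ sketch shows compound nodes referring to descendants by address, and the token form in which references stay implicit except for REF tokens. The types are illustrative assumptions, not Mozart's actual data structures.

```cpp
// Illustrative node and token types for the data model of Section 2.
#include <string>
#include <vector>

struct Node {
    std::string tag;            // e.g. "class", "list", "method"
    std::string value;          // payload of a primitive node, if any
    std::vector<Node*> refs;    // empty for primitive nodes; addresses
};                              // of descendants for compound nodes

// One token of the serial representation. References are not stored
// explicitly: traversal order implies them, and only nodes seen again
// are encoded as REF tokens carrying the defining occurrence's index.
struct Token {
    enum Kind { NODE, REF } kind;
    std::string tag, value;     // copied from the node (kind == NODE)
    int index;                  // node-table index (both kinds)
};

int main() {
    // The class example of Fig. 3: a class node referring to a list of
    // static members and a list of methods (details elided).
    Node members{"list", "", {}}, methods{"list", "", {}};
    Node cls{"class", "name", {&members, &methods}};
    return cls.refs.size() == 2 ? 0 : 1;
}
```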
3. Marshaling

Marshaling a data structure is a process of marshaling the nodes encountered while traversing that data. A straightforward implementation of the traversal is recursive: whenever a descendant node is to be marshaled, its node marshaling procedure is invoked. This approach is widely used, e.g. in Java and Erlang, and was used in Mozart. Our traverser (see Fig. 7) is instead iterative: it processes the elements to be marshaled one by one. In this paper, we stick to the depth-first traversal strategy, which corresponds to a stack as the repository of nodes to be marshaled. Observe that the stack explicitly represents the frontier between traversed and not yet traversed nodes; initially it contains the root node. A marshaler is a set of node marshaling procedures that are executed under the guidance of the traverser. Marshaling the "class" node 1 shown in Fig. 3 is illustrated in Fig. 5. When the traverser reaches the node, it calls the corresponding marshaling method (arrow 1), which, in turn, handles the "name" field in the node (arrow 2). After that, the traverser itself retrieves references to descendant nodes (arrow 3) and proceeds iteratively.
Fig. 5. Marshaling a node.
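In the spirit of the skeleton in Fig. 7, the following self-contained C++ sketch shows such an iterative traverser: an explicit stack holds the frontier, a node table (discussed below) resolves reoccurring nodes into REF tokens, and clearing the running flag preempts the loop with all state saved. Types and the token output format are illustrative assumptions, not Mozart's actual code.

```cpp
// Iterative depth-first traverser with an explicit stack and a node table.
#include <cstdio>
#include <map>
#include <stack>
#include <string>
#include <vector>

struct Node {
    std::string tag;
    std::vector<Node*> refs;      // descendants (empty for primitive nodes)
};

struct Traverser {
    std::stack<Node*> todo;       // the frontier; survives preemption
    std::map<Node*, int> nodeTable;   // compound node -> index
    bool running = true;          // reset to preempt the loop

    void start(Node* root) { todo.push(root); }

    void traverse() {
        while (running && !todo.empty()) {
            Node* n = todo.top(); todo.pop();
            if (!n->refs.empty()) {               // compound node
                auto it = nodeTable.find(n);
                if (it != nodeTable.end()) {      // reoccurring node:
                    std::printf("REF(%d) ", it->second);
                    continue;                     // marshal as a REF token
                }
                int idx = (int)nodeTable.size();
                nodeTable[n] = idx;
                std::printf("%s(#%d) ", n->tag.c_str(), idx);
                // Depth-first, left-to-right: push descendants in reverse.
                for (auto r = n->refs.rbegin(); r != n->refs.rend(); ++r)
                    todo.push(*r);
            } else {                              // primitive node
                std::printf("%s ", n->tag.c_str());
            }
        }
    }
};

int main() {
    Node methods{"methods", {}};
    Node members{"members", {}};
    Node cls{"class", {&members, &methods}};
    members.refs.push_back(&cls);     // a cycle, resolved via a REF token
    Traverser t;
    t.start(&cls);
    t.traverse();   // prints: class(#0) members(#1) REF(0) methods
    std::printf("\n");
}
```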
Cycles are resolved using a "node table" that records compound nodes and their indices assigned by the traverser. Reoccurring nodes are marshaled as REF tokens. Preemption of marshaling merely implies breaking the loop and saving the stack until resumption, for which the flag running is reset; resumption proceeds by calling the traverser again. In comparison, preempting and resuming a recursive marshaler in a single-threaded application requires the expensive operations of saving and restoring the marshaler's execution stack.

A typical "marshal" method for the "method" node shown in Fig. 3 is presented in Fig. 7. Two of the node's values are marshaled: the method's name and the code size. The method's code area is marshaled as an opaque node: CodeAreaProc is an opaque node processor, and traverseOpaque pushes the processor (proc) onto the stack. The traverser eventually reaches the code node and calls the processor (the code in Fig. 7 is extended to handle such stack entries). Note that the formal argument of traverseOpaque has the type Proc, which is an abstract superclass of CodeAreaProc; this allows the traverser to handle different processors in a uniform way. The code area can contain references to other nodes, declared by means of the traverseNode traverser's method, which pushes a node onto the stack. An opaque node processor returns true when the area is finished; otherwise marshaling is preempted and the processor's stack entry is preserved.
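For concreteness, here is a hedged sketch of how a "marshal" method and an opaque node processor might fit together. The names Proc, CodeAreaProc, traverseOpaque follow the text; the bounded per-call "budget" stands in for real preemption on buffer exhaustion, and everything else is an illustrative assumption rather than Mozart's implementation.

```cpp
// An opaque node processor encapsulating the state of marshaling a
// method's code area, resumed by the traverser until it reports completion.
#include <cstdio>
#include <memory>
#include <stack>
#include <vector>

struct Proc {                       // abstract opaque node processor
    virtual ~Proc() = default;
    virtual bool marshal() = 0;     // true when the node is finished
};

struct CodeAreaProc : Proc {
    const std::vector<int>& code;   // the bytecode being marshaled
    size_t pc = 0;                  // marshaling state, saved across preemptions
    explicit CodeAreaProc(const std::vector<int>& c) : code(c) {}
    bool marshal() override {
        // Marshal a bounded chunk per invocation, so a large code area
        // cannot monopolize the marshaler.
        for (int budget = 2; budget > 0 && pc < code.size(); --budget)
            std::printf("instr(%d) ", code[pc++]);
        return pc == code.size();
    }
};

struct Traverser {
    std::stack<std::unique_ptr<Proc>> todo;   // only opaque entries here
    void traverseOpaque(std::unique_ptr<Proc> p) { todo.push(std::move(p)); }
    void run() {
        while (!todo.empty()) {
            if (todo.top()->marshal())   // finished: drop the entry;
                todo.pop();              // otherwise keep it for resumption
        }
    }
};

// The marshal method for a "method" node: plain values first, then the
// code area as an opaque node.
void marshalMethod(Traverser& t, const char* name, const std::vector<int>& code) {
    std::printf("method name=%s size=%zu ", name, code.size());
    t.traverseOpaque(std::make_unique<CodeAreaProc>(code));
}

int main() {
    Traverser t;
    std::vector<int> code{10, 11, 12, 13, 14};
    marshalMethod(t, "draw", code);
    t.run();    // resumes CodeAreaProc until the whole code area is done
    std::printf("\n");
}
```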
To avoid synchronization between the marshaler and the application, we use a method of message copying that works best in a system with automatic memory management and a distinction between mutable and immutable data. A set of "marshal" methods is defined that records all mutable values and references of a data structure in a special structure that we call a snapshot. When marshaling is resumed, the actual values from the data structure are exchanged with the saved ones, and restored back when marshaling is finished or preempted.
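A minimal sketch of the snapshot idea as we read it: mutable cells are recorded with their values at snapshot time, and a single swap operation serves both to expose the saved values to the marshaler on resumption and to restore the application's values on preemption or completion. The types are illustrative assumptions, not Mozart's snapshot machinery.

```cpp
// A snapshot of mutable cells, swapped in around every marshaler run.
#include <utility>
#include <vector>

struct Snapshot {
    // (address of a mutable cell, value it held when snapshotted)
    std::vector<std::pair<int*, int>> cells;

    void record(int* cell) { cells.emplace_back(cell, *cell); }

    // Exchange live and saved values; calling it again undoes the swap,
    // so the same operation serves both "resume" and "preempt/finish".
    void swap() {
        for (auto& c : cells) std::swap(*c.first, c.second);
    }
};

int main() {
    int mutableField = 42;
    Snapshot s;
    s.record(&mutableField);   // done by the snapshotting 'marshal' methods
    mutableField = 7;          // the application mutates after preemption
    s.swap();                  // resume: the marshaler sees 42 again
    int seen = mutableField;
    s.swap();                  // preempt/finish: the application sees 7 again
    return (seen == 42 && mutableField == 7) ? 0 : 1;
}
```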
4. Unmarshaling

Unmarshaling a serial representation of a data structure is a process of unmarshaling individual nodes and linking them together. Again, a straightforward implementation is recursive: a node's descendants are unmarshaled by application of the same "unmarshal" procedure. Our unmarshaler is iterative (see Fig. 8): it reads tokens from the stream and executes the corresponding node unmarshalers. Node unmarshalers construct and initialize nodes but do not fill in the references; they then pass the nodes to the builder, which assembles them into structured data. The builder code is shared between all our unmarshalers, just as the traverser is shared between all our marshalers. In our example, illustrated in Fig. 6, unmarshaling a "class" token includes constructing the class node itself (arrow 1), unmarshaling and storing its content (arrow 2), and passing the node over to the builder (arrow 3). The builder contains a stack of entries representing references to be created; in the example, the builder knows about the NIL fields in the partially constructed "class" node (arrow 4). Specifically, a stack entry contains a memory address where a reference, i.e. the memory address of a node, should be stored. A reference is created and the stack entry dropped when the unmarshaler passes a node to the builder, which happens by means of the builder's build*() methods. Note that the stack represents the boundary between unmarshaled (i.e. constructed) and not yet unmarshaled nodes.
Fig. 6. Unmarshaling a node.
Fig. 7. Marshaler skeleton.
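As a complement to Figs. 6 and 8, here is a minimal, self-contained C++ sketch of the builder's stack discipline: each stack entry is the address of a reference field still to be filled, and entries for a node's descendants are pushed right-to-left so that the left-most descendant's entry is consumed first. All names and the token handling are illustrative assumptions.

```cpp
// Builder with a stack of "holes": addresses of reference fields to fill.
#include <cstdio>
#include <stack>
#include <string>
#include <vector>

struct Node {
    std::string tag;
    std::vector<Node*> refs;      // filled in by the builder
};

struct Builder {
    Node* root = nullptr;
    std::stack<Node**> holes;     // where the next node will be attached
    Builder() { holes.push(&root); }      // unmarshaling starts with the
                                          // cell that will hold the root
    void build(Node* n, size_t numDescendants) {
        *holes.top() = n;                 // attach n at the current hole
        holes.pop();
        n->refs.assign(numDescendants, nullptr);
        // Push right-to-left so the left-most descendant's hole is on top.
        for (size_t i = numDescendants; i-- > 0; )
            holes.push(&n->refs[i]);
    }
    Node* finish() { return root; }
};

int main() {
    // Token stream "class(members, methods)" in depth-first order.
    // (Nodes are leaked here for brevity; a real system manages memory.)
    Builder b;
    b.build(new Node{"class"}, 2);     // leaves two holes: members, methods
    b.build(new Node{"members"}, 0);   // fills the left-most hole first
    b.build(new Node{"methods"}, 0);
    Node* root = b.finish();
    std::printf("%s -> %s, %s\n", root->tag.c_str(),
                root->refs[0]->tag.c_str(), root->refs[1]->tag.c_str());
}
```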
The top entry marks the spot in a value being constructed by the unmarshaler where the next node will be attached. Unmarshaling starts with a memory address where the reference to the root node should finally appear; this memory address points to the cell in the builder that is retrieved by means of the builder's finish() method. In our example, buildCLASS first stores the reference to the "class" node as dictated by the top stack entry, after which the stack entry is discarded. Since this node is the root, the reference is stored in the cell in the builder that is returned by finish(). Second, buildCLASS pushes two more entries onto the stack that represent the static members and the methods. Note that the builder and the traverser constitute a matching pair: the strategy of the traverser must correspond to the order in which the builder expects the entries. Specifically, since our traverser works in the left-to-right depth-first order, the builder must build values depth-first, i.e. use the stack, and also push the entry for the left-most descendant last, so that it is retrieved and used first.

Like the marshaler, the unmarshaler treats an opaque node as a node. When a node being unmarshaled has an opaque node descendant, the builder is informed, and the opaque node is represented in the builder's stack by means of the buildOpaque builder's method (see Fig. 8). Such a stack entry corresponds to a forthcoming opaque node token in the serial representation. When that opaque node token is reached and its unmarshaling is in progress, the corresponding stack entry is at the top of the stack and can be accessed by the unmarshaler through the fillOpaque method. In this way, the builder stack keeps the information necessary for unmarshaling the opaque node, which is supplied by the parent node. Finally, if unmarshaling of the area is preempted, the stack entry is preserved using the suspendFillOpaque method. Note that the builder handles different kinds of opaque nodes in a uniform way, similarly to the traverser.

Fig. 8. Unmarshaler skeleton.

5. Marshaling and unmarshaling in Mozart
We have used our framework in the programming system Mozart [2,9]. Mozart is an implementation of Oz, a multi-paradigm concurrent programming language. Oz offers symbolic computation with a variety of built-in data types, and distinguishes between mutable (e.g. dataflow single-assignment variables) and immutable (e.g. records) data. Mozart provides network-transparent distribution: an application can run on a network of computers as if on a single computer. Oz entities can be shared between processes. Stateless data that becomes shared is replicated between sites. The consistency of distributed stateful Oz entities, such as dataflow variables, is guaranteed by entity type-specific distributed protocols. Whenever an operation on a distributed stateful Oz entity is performed, the distribution subsystems of the involved Mozart processes exchange messages according to the protocol of the entity. Protocol messages can contain Oz values.

The core of Mozart is a run-time system that interprets the intermediate code that an Oz program is compiled to. The run-time system, including the distribution subsystem, is implemented in C++ and runs as a single-threaded application. Marshaling in the Mozart system also synchronizes with the garbage collector. The code of Mozart's marshaler is optimized in its own right, e.g. static resolution of C++ methods is used whenever possible.

The performance of marshaling in Mozart compares favourably with the performance of serialization in Sun Java J2RE 1.4.1. We implemented a linked list holding integers in Oz and in Java. Our comparison actually favours Java, because list elements in Mozart are always polymorphic, so the virtual machine has to determine the element types at run time, whereas in the Java code the types of the elements are known at compile time. For the case of 5000 list elements (Java failed to serialize larger structures), marshaling and unmarshaling in Java took 246 and 223 ms, respectively, whereas marshaling and unmarshaling in Mozart took 1.1 and 0.6 ms, respectively. We used a 1 GHz AMD Athlon.
Fig. 9. Overhead due to preemption, and latency reduction due to concurrent marshaling.
The performance of the marshaler and unmarshaler is illustrated by the two plots on the left-hand side of Fig. 9. We used lists of integers, with sizes ranging from 5 × 10³ to 5 × 10⁶ nodes; the serial representations of these lists range from 28 kB to 39 MB. Each of these lists is marshaled into buffers of different sizes. The marshaler is preempted whenever it runs out of buffer space; the buffer content is then flushed and marshaling is resumed. The unmarshaler's performance is measured indirectly: we actually measure the performance of the marshaler/unmarshaler pair and subtract the time spent by the marshaler. The plots in the figure show that preemption of marshaling and unmarshaling induces negligible overhead. This means that distributed systems and applications can be built using only a small, fixed-size marshaling buffer. The same plots also demonstrate the good scalability of the (un)marshaler: (un)marshaling time increases roughly proportionally to the data structure's size. Probably the most important factor making unmarshaling faster than marshaling is that the marshaler has to maintain the node table (introduced in Section 3) with all potentially reoccurring nodes, while the unmarshaler's node table contains only actually reoccurring nodes. We believe that the slight performance increase with smaller unmarshaling buffers arises because smaller buffers fit better into the computer's limited-size cache memory.

The plot on the right-hand side of Fig. 9 illustrates the impact of concurrency between the marshaling and network layers. In these experiments, one Mozart process sends a list of 10⁶ integers, the serial representation of which is approximately 5 MB, and the other process sends it back. Different test runs use marshaling buffers of different sizes. The 10 MB marshaling buffer is large enough to hold the complete serial representation of the entire list, so the marshaler is never preempted and there is no concurrency between the (un)marshaler and network layers. The plot also confirms that the "user" time does not depend significantly on the size of the buffer. The communication latency increases as the buffer grows, as indicated by the increase of the "total" (wall-clock) process time. Moreover, the latency shown with small buffers is close to the minimum: the two marshaling steps are necessarily sequential, as the list is not exposed to the application until it is completely constructed, and the total time with small buffers is close to the time of these two marshaling steps.
6. Conclusions

We have presented an efficient and incremental (un)marshaling framework for distributed systems. It minimizes the latency of communication between computers in the system through concurrency between the application, marshaling, and network operations. Concurrency is achieved by preemption of (un)marshaling, which, however, imposes very little overhead. Furthermore, preemption of marshaling enables fixed-size marshaling buffers and limits the time per marshaler invocation. We have developed, evaluated, and actually use our framework within the Mozart programming system.
References

[1] J. Armstrong, R. Virding, M. Williams, Concurrent Programming in Erlang, Prentice-Hall, 1993.
[2] The Mozart Programming System, 1998–2004, http://www.mozart-oz.org/.
[3] Object Management Group (OMG), Common Object Request Broker Architecture (CORBA), 1997–2003, http://www.omg.org/.
[4] Open WDDX, The Web Distributed Data Exchange (WDDX), 1998–2003, http://www.openwddx.org/.
[5] R. Srinivasan, RPC: Remote Procedure Call Protocol Specification, Version 2, Network Working Group Request for Comments (RFC) 1831, 1995.
[6] R. Srinivasan, XDR: External Data Representation Standard, Network Working Group Request for Comments (RFC) 1832, 1995.
[7] The MPI Forum, MPI: A Message Passing Interface, Supercomputing '93, 1993.
[8] UserLand Software, Inc., XML-RPC, 1998–2003, http://www.xmlrpc.com/.
[9] P. Van Roy, S. Haridi, Concepts, Techniques, and Models of Computer Programming, MIT Press, 2004.
K. Popov is a researcher at SICS, the Swedish Institute of Computer Science. His research interests include high-performance symbolic parallel and distributed computation, large-scale simulation, security, and software engineering. He has participated in the design and implementation of the Mozart programming system since its very origins at the DFKI and the University of Saarland, Germany. Before joining academia, he received his M.Sc. degree from the Electrotechnical University of St. Petersburg, Russia, and had 5 years of industry experience in the development of Internet applications, administration, and system maintenance.
V. Vlassov is an associate professor of computer systems at the Department of Microelectronics and Information Technology (IMIT), Royal Institute of Technology (KTH), Stockholm, Sweden. His current research interests include parallel and distributed processing, peer-to-peer and Grid computing, computer architecture, and performance evaluation. He received his Candidate of Technical Science degree (Ph.D.) in computer engineering in 1984 from the Department of Computer Science, Electrotechnical University of St. Petersburg, Russia, where he was an assistant professor (1985–1990) and an associate professor (1990–1993). In 1998 he was a visiting scientist in the Laboratory for Computer Science, Massachusetts Institute of Technology (MIT).

P. Brand is currently co-manager of the Distributed Systems Laboratory (DSL) at the Swedish Institute of Computer Science (SICS), located in Kista, Stockholm, Sweden. His current research interests are in the field of distributed computing, with a particular focus on the design of programming systems that simplify the construction of distributed applications. He is a co-designer of the Mozart Programming System.

S. Haridi is a professor at the Royal Institute of Technology (KTH) and chief scientist of the Swedish Institute of Computer Science (SICS). He leads research groups at KTH and SICS in the area of distributed computer systems, whose activities include peer-to-peer computing, programming language design and implementation for distributed and parallel computing, middleware for distributed computing, and distributed, highly available systems. He is a co-designer of the programming language Oz and the Mozart programming platform (see http://www.mozart-oz.org). He has done earlier research on the implementation of logic and constraint-based languages, including SICStus Prolog and AKL (the Andorra Kernel Language), and on scalable cache-coherent parallel computers. He is a co-inventor of COMA architectures, scalable cache-coherent multiprocessors with only caches, a concept that has been taken up by Sun Microsystems. He is currently the scientific coordinator of the EU project PEPITO on peer-to-peer systems.