LOCUS operating system, a transparent system

Greg Thiel examines the architecture and functionality of the LOCUS OS, and IBM's derivative Transparent Computing Facility

Locus Computing Corporation, 9800 La Cienega Boulevard, Inglewood, CA 90301, USA

The primary system components and architecture of the LOCUS operating system and the Transparent Computing Facility are identified. The current level of TCF functionality is described. References to the LOCUS OS are used in areas where the fundamental concepts are described. The initial two sections outline the vision of the LOCUS Operating System and provide definitions of key terms. The third section identifies the functionality provided by IBM's Transparent Computing Facility. The fourth section describes several of the key algorithms and basic system structure, and the last two sections provide insight into future directions for the technology and summarize the material presented.

Keywords: LOCUS OS, transparency, Transparent Computing Facility, architecture

INTRODUCTION

The LOCUS operating system project [1-5] started in early 1979 at the University of California Los Angeles. The objective of this work was to establish that machine boundaries in a network of computers could be hidden by the operating system. 'Transparency' became the cornerstone of the effort. Fundamentally, we believed that via 'transparency' we could dramatically ease the user's effort to use a networked environment. Now in the 1990s the concept of 'transparency' is common in many areas of software design. The objective of the LOCUS operating system project was successfully realized in a UNIX* derivative called the LOCUS Operating System. Subsequent to the initial work at UCLA, a version of the system was developed by Locus Computing Corporation and IBM as the first product-level distributed system. The product, IBM's Transparent Computing Facility (TCF), became generally available in March 1990 on AIX/370 and AIX PS/2 [6]. A second release became available in March 1991.

*UNIX is a trademark of AT&T.

VISION OF THE LOCUS OS

Transparency was the fundamental goal of the LOCUS operating system. Intrinsically, this meant that the user should not be aware of machine boundaries. Data should be available using the same programs, independently of whether the data was local or remote. An administrator should be able to reallocate user data to different drives and machines without surfacing these changes to the user. The same behaviour should be provided for all resources, including processing. In addition to the obvious application of transparency to data, the LOCUS operating system designers viewed processing as also requiring full transparency. In some cases bringing data to the processing is not the most efficient answer: moving processing to the data may be more efficient, or the program may only be available on another machine in the network. A user program request within the network environment should be surfaced by the same interface, independent of where the request is initiated, thus providing a single system image.

The second design goal was to allow users to control their data and processing within the network, if they so chose. Data and processing location information should be available if the user requests this information. In addition, the controls for allocation of data and processing within the network should be available as constructs orthogonal to the normal access mechanisms. An example is program initiation: a program is normally initiated via an exec() system call.

• exec() should initiate a program on another machine type, if appropriate.
• execution site path functionality can be used to direct the exec() operation to initiate a program on a specific network host.
• rexec(), a new system service, will be available to force a program to a specific network host.

The third design goal required each network host to be able to operate independently. If other network hosts are unavailable, then their data and processing are unavailable. The missing data and processing should not preclude operation of the local host. This is in contrast to networks where specific machines are designated as file servers, print servers, mail servers, etc. By designing for the complete environment, the LOCUS OS is able to provide superior configuration flexibility over a strict server environment.

The fourth design goal addressed the kernel level functionality provided. Following the UNIX convention, the goal was to provide key functions at the lowest level. The key functionality enabled construction of more complex constructs at the higher system levels. For example, moving tasks between machines dynamically was supported by providing a kernel primitive to move an active task between two machines. Using this primitive a user, user application or system daemon can identify tasks which should change execution sites, and the kernel primitive is used to move the task.

The fifth design goal was applied largely to user interface decisions. In spite of our best attempts, there are some single system abstractions that do not scale cleanly to a network environment. Typically, such problems occurred in obscure situations, but the semantics must be defined. An error was returned in the cases where the network prevented correct emulation of the single machine semantics. Our experience indicated that it was better to fail an operation than allow its completion and not operate as intended.

A definition of key concepts is given below.

DEFINITIONS

This section briefly describes several key Transparent Computing Facility (TCF) concepts which are used throughout the remainder of this paper. The initial terms separate the various facets of transparency into distinct concepts:

Name transparency
Name transparency means that the same name used from any site in the network will result in the same object. Without name transparency a user must always be aware of the task's location and the object's location before constructing a name.

Location transparency
Location transparency means that the name of the object does not identify the location of the object. Name and location transparency enable the administrator or user to change the storage location of an object without changing the name used to reference the object.

Access transparency
Access transparency means that the same interface used to access an object locally is used to access the object when it is remote. If open() is the function used to gain access to a local file then open() is used to gain access to a remote file (and vice versa).

Semantic transparency
Semantic transparency means that the same command or operation on a network host should operate identically when initiated on any other network host. For example, the sort command on Host A should accept the same input and options and give the same output as the sort command on Host B.

Data transparency
Data transparency is the combination of name, location, access and semantic transparency when applied to file data. By providing data transparency, a user or administrator is given complete freedom; tasks can be executed anywhere and data may be created, controlled and manipulated independent of storage location.

Device transparency
Device transparency is distinguished from data transparency due to the common convention of network system builders to separate the two. Device transparency is identical to data transparency except that the objects are devices. Device transparency enables a user or administrator to exercise complete freedom over where devices are connected, controlled, read, written, created and manipulated.

Process transparency
Process transparency delivers a combination of name, location, access and semantic transparency for process objects. TCF delivers data, device and process transparency to the programmer, user and administrator, thus giving the ultimate level of flexibility in determining how and where operations are completed.

Performance transparency
Performance transparency means an administrator or user is not able to identify the location of data or processing by measuring the performance of the operation. The time required to complete the same operation locally and remotely should be approximately equal.

Network transparency
Network transparency combines data, device and process transparency into a single concept. In a network transparent environment (like TCF) a user is delivered a single system image of the networked environment.

Site
A site is a host running a copy of the LOCUS operating system.

Cluster
A cluster is the set of sites that have chosen to integrate themselves into a single network transparent environment. The cluster provides a single system image on the set of sites.

Storage site
A storage site is a site which stores an object (or a replica). In the LOCUS OS, access to data requires a storage site of the object to be present.

Using site
A using site is a site which is accessing an object (or a replica). A task is accessing a local object when the object's storage site and the using site are the same site.

Current synchronization site
The current synchronization site (CSS) is the storage site which coordinates access to data files. A CSS is dynamically appointed for each file system. Every file in the system has an assigned CSS. If a file object is replicated there is only one CSS for the file object, and it is responsible for guaranteeing access transparency, with complete UNIX semantics.

Partition
The minimum partition is a single cluster site which either is not attempting communication with other cluster sites, or is unable to reach the other cluster sites. Two partitions can merge, producing a larger partition. Simply put, partitions are one or more sites that are actively cooperating.

This completes the terminology introduction. The next section describes the functionality provided by TCF.

TCF FUNCTIONALITY

TCF is a functionally rich distributed operating system. All systems wishing to cooperate to produce a single system image must run TCF. TCF provides data, device and process transparency. In addition to these transparency features, other significant functionality is provided. TCF delivers the functionality described below to all the sites in the cluster. This section briefly describes the major functional components. In each area the interfaces are described. (Further details are available in several publications [7, 8].)

Data transparency
The TCF data transparency functionality is subdivided into several major components:

• Global name space
• Distributed file system
• File data tokens
• Pipes
• Replication
• Replicated root
• Heterogeneity and local support
• Atomic commit

UNIX provides a hierarchical file system composed of files and directories. Files can contain data or can be references to special file objects like pipes, devices or sockets. A directory contains a list of files or directories. A subtree of files and directories is grouped into a file system. Every file system is stored on a logically contiguous portion of disc. The file systems are pasted together using the UNIX mount operation. A mount operation logically attaches the root of a file system's subtree onto an existing directory in the name space. TCF provides cluster-wide enforced consistency of the global data name space. All mount operations are cluster wide and the root directory (the base of the UNIX data name space) is common to all sites in the cluster. The global data name space provides name transparency for the file system. Name transparency for the data name space is the first key to network transparency. Using the global data name space the system is able to assemble a name space with all the local data of all sites in the cluster. Access to non-local data depends on a distributed file system.

TCF includes a distributed file system. TCF's distributed file system is UNIX compatible, including POSIX 1003.1 compatibility. The global name space translates character string file names into internal data object names. The internal data object name is a pair: global file system number and inode number (in UNIX an inode describes a specific file). TCF's distributed file system uses this pair to locate the data's storage site and data; then it does the appropriate permission checking. As data pages are read and written, the data is demand paged from the storage site to the using site. As with normal data access, these pages are cached in the kernel's cache at both sites.

A pipe is a UNIX interprocess communication object. TCF pipes are file system objects; a pipe connects writers to readers and allows them to exchange data. There can be one or more readers and writers. Data written to a pipe is consumed (destroyed) when it is read. The system is responsible for synchronization of the reading and writing operations. A single read or write is atomic; a pipe cannot overflow, and a read will block if insufficient data is available. Pipes can be named or unnamed. TCF's distributed file system includes network pipe support. Readers and writers can be located throughout the cluster, independent of the pipe's storage site. The correct synchronization semantics are provided for all pipe operations, independent of source and destination location.

UNIX file access semantics require each read operation to see all previously written data. If a process on Site A writes one byte to a file and Site B then reads one byte, Site B needs to see the byte written by Site A. TCF uses file data tokens to control consistency of the buffered data between using sites. A site must have the appropriate file data token to perform an operation.

TCF's distributed file system provides location transparency, since the character string name does not identify data location. In fact, even the internal name does not imply data location. Objects are located by using the global file system number to consult a table built by mount operations. The gmount table identifies the location of all mounted file systems within the cluster.
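The internal name pair and the gmount lookup described above can be pictured with a short sketch. This is a hypothetical illustration in C; the structure, field and function names are invented and are not TCF's actual data structures.

/* Hypothetical sketch of TCF-style internal data object names and the
 * gmount table lookup: the character string name has already been
 * translated to a <global file system number, inode number> pair.      */
#include <stdio.h>

struct gname {                  /* internal data object name            */
    unsigned gfs;               /* global file system number            */
    unsigned inode;             /* inode number within that file system */
};

struct gmount_entry {           /* one entry per mounted file system    */
    unsigned gfs;
    int storage_site;           /* site holding (a replica of) the data */
};

static const struct gmount_entry gmount[] = {   /* built by mount operations */
    { 1, 3 },                   /* file system 1 is stored on site 3 */
    { 2, 7 },                   /* file system 2 is stored on site 7 */
};

/* Locate the storage site for an object: only the global file system
 * number is consulted here; the inode is then resolved at that site.   */
static int storage_site_of(struct gname n)
{
    for (unsigned i = 0; i < sizeof gmount / sizeof gmount[0]; i++)
        if (gmount[i].gfs == n.gfs)
            return gmount[i].storage_site;
    return -1;                  /* file system not mounted */
}

int main(void)
{
    struct gname n = { 2, 1234 };       /* <global fs 2, inode 1234> */
    printf("object <%u,%u> is stored at site %d\n",
           n.gfs, n.inode, storage_site_of(n));
    return 0;
}

The point of the indirection is that neither the string name nor the internal pair carries a site identity; only the mount state does.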


Entries in the gmount table are added by mount and deleted by unmount operations.

One implication of the global data name space is the existence of exactly one root directory for the entire system. This may appear to violate the third design principle for the system by creating a single root file system server. Using file system replication this constraint was successfully overcome.

File system replication allows any file system to have replicas on different sites in the cluster. One replica is designated as the primary replica for the file system. The primary replica must be available for data in the file system to be updated. Other copies can be designated as complete read-only replicas (backbone replicas) or partial read-only replicas (secondary replicas). The system guarantees that all copies are kept consistent, and that all access is to the partition's most current available replica.

The administrator(s) and user(s) control the number of replicas maintained for a given file. The administrator controls the minimum and maximum number of replicas by management of the number and type of the file system replicas. For example, if a given file system has the primary, two secondaries and two backbones, the administrator has said 'I want at least three copies of everything and no more than five copies of anything'. The user must take action to get more than three copies. The user controls whether he gets three, four or five copies of his files by setting the replication control factors (called fstore) on files.

An fstore is a 32 bit mask. Each file has its own fstore value. Each secondary file system replica has a replica fstore. A secondary replica will store a copy of a file if the bitwise AND of the replica's fstore and the file's fstore is non-zero. TCF has defined two interpretations of the fstore mask: system and user. The interpretation of the fstore mask is a property of a file system and all its replicas. The system interpretation treats the fstore as a set of type bits. Each machine type is given a subset of the 32 bits. All sites of the designated machine type (storing a secondary file system replica) will have a replica fstore which is a subset of the machine type's fstore value. In simple terms, an fstore value does not specify the number of replicas of a file; instead, the fstore value identifies the category of the file. The user interpretation identifies the specific replicas. Each file system replica is assigned a unique bit in the mask. A file's fstore value will have the appropriate bit set based on which file system replicas should maintain a replica.
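The fstore check can be illustrated with a minimal sketch, assuming the 32 bit mask and a system-interpretation bit assignment as described above; the constants and names below are invented for the example and are not TCF definitions.

/* Invented illustration of the fstore replication-control check: a
 * secondary replica stores a file when the bitwise AND of its replica
 * fstore and the file's fstore is non-zero.                            */
#include <stdint.h>
#include <stdio.h>

#define FSTORE_S370  0x0000000Fu    /* bits assumed to be assigned to S/370 sites */
#define FSTORE_PS2   0x000000F0u    /* bits assumed to be assigned to PS/2 sites  */

static int replica_stores_file(uint32_t replica_fstore, uint32_t file_fstore)
{
    return (replica_fstore & file_fstore) != 0;
}

int main(void)
{
    uint32_t ps2_replica = FSTORE_PS2;             /* a PS/2 secondary replica     */
    uint32_t file_a = FSTORE_PS2 | FSTORE_S370;    /* file wanted on both types    */
    uint32_t file_b = FSTORE_S370;                 /* file for S/370 replicas only */

    printf("file_a stored on the PS/2 replica: %d\n",
           replica_stores_file(ps2_replica, file_a));
    printf("file_b stored on the PS/2 replica: %d\n",
           replica_stores_file(ps2_replica, file_b));
    return 0;
}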

File system replication delivers increased availability of data (e.g. allowing there to be multiple copies of key data objects). If a file is being read and the storage site fails, the system will automatically select a new storage site. The application will not detect an interruption of service. Another significant advantage is improved performance: with replication, multiple storage sites are available for data access, so read traffic benefits from the file being replicated. Processes on a storage site will primarily use the local copy*. TCF's most common replicated file system is the system's root file system†.

*If the local copy is not up-to-date (or is being written) then another copy is selected.
†In UNIX the root file system contains system binaries and other files necessary to system operation. In general, the bulk of this data is read only and rarely updated. Some files or directories do see extensive updates, e.g. temporary files, accounting data, configuration files, etc.

In TCF the standard UNIX root file system is separated into two distinct file systems. The bulk of the data is placed in a highly replicated file system (the replicated root file system) and the remaining data is placed in a separate, per site, file system (the local file system). The TCF architecture does not require a replicated root file system on each site, but the standard configuration and installation tools create this configuration. The local file system is, by convention, mounted on the directory /site-name. Contrary to the name, a local file system is visible throughout the cluster. However, it commonly contains directories, configuration files and devices specific to the particular site. Based on design goal three, a site's local file system is stored on the applicable site.

The separation of UNIX's root file system into two separate file systems would not be transparent to applications without some additional mechanism*. In addition, a given site-specific file needs to be mapped to different site-specific files, depending on the context in which the name is used. TCF provides a context-dependent naming mechanism to address this issue. An object moved to the local file system has a symbolic link placed in its original location. The contents of the symbolic link have a special first component, '(LOCAL)'. TCF replaces this keyword with a process's local alias when the symbolic link is evaluated. The local alias is a component of the process's state information, like the current working directory†.

*Files moved to the local file system have had their names changed from /filename to /site-name/filename.
†Temporary files are also handled in this manner: /tmp is a symbolic link to (LOCAL)/tmp.

Clusters are not assumed to be composed of homogeneous machine types. Therefore, a replicated root must integrate multiple system binaries under a given name. For example, UNIX users expect /bin/who to be the who program for the system. In a cluster with S/370 and PS/2 systems, which binary can it be? TCF created the concept of a hidden directory to allow pathnames like /bin/who to be concurrently identified with multiple objects. In a heterogeneous TCF replicated root, /bin/who is a hidden directory with S/370 and PS/2 binaries. All access operations will automatically select one of the objects inside the hidden directory (commonly referred to as 'sliding through'). The selection is based on the user's preference list, the execution site path. The system will always select the hidden directory component which is most preferred. The execution site path is an ordered list of site names, site types and the 'LOCAL' keyword. The LOCAL keyword always maps to the local site and site type.
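The 'sliding through' selection can be sketched as a simple search of the execution site path against the components stored in a hidden directory. Only the selection rule (the first preference that matches a stored component, with LOCAL mapping to the local site type) is taken from the description above; the function and data names are hypothetical.

/* Hypothetical sketch of hidden-directory selection for /bin/who in a
 * cluster holding S/370 and PS/2 binaries.                             */
#include <stdio.h>
#include <string.h>

#define NCOMP 2

static const char *hidden_who[NCOMP] = { "i370", "ps2" };  /* stored components */

/* Return the component selected for this process, or NULL if none match. */
static const char *slide_through(const char *const *exec_site_path, int npath,
                                 const char *local_type)
{
    for (int i = 0; i < npath; i++) {
        const char *want = exec_site_path[i];
        if (strcmp(want, "LOCAL") == 0)
            want = local_type;          /* LOCAL maps to the local site type */
        for (int j = 0; j < NCOMP; j++)
            if (strcmp(hidden_who[j], want) == 0)
                return hidden_who[j];
    }
    return NULL;
}

int main(void)
{
    const char *path[] = { "LOCAL", "i370" };   /* prefer local, then S/370 */
    printf("selected /bin/who component: %s\n", slide_through(path, 2, "ps2"));
    return 0;
}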

The LOCUS OS's file system provides increased data consistency, as compared to typical UNIX file systems. The increased consistency is provided via a single file atomic update and commit mechanism. When a file is updated, the old data is not overwritten. Instead, new pages are selected and linked into an in-core version of the inode*. When the update is completed the file is committed by flushing the new pages and writing the inode to disc. With this single I/O operation, the inode write, the new data is committed†. Finally the old pages are freed. System calls are provided to commit a file and to abort (roll back) changes to the last commit. By default, a file is committed on last close if an explicit commit is not issued. In addition, a system mode exists where the system will automatically commit all open files on a periodic basis. At open time users can override this mode on a file-by-file basis. The motivation for adding this mode is to avoid losing data written to long-open files‡.

*This includes indirect pages.
†By careful optimization of the commit algorithm the number of I/Os to write the file is not significantly increased.
‡The most common example is a command log file; without periodic commits it would have no data.
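A compressed sketch of this shadow-page commit is given below. It is an invented user-space illustration, not LOCUS code: the structures, helpers and page numbers are stand-ins, and only the ordering (flush the new pages, one inode write as the commit point, then free the old pages) reflects the description above.

/* Invented sketch of a single-file atomic update: new data goes to
 * freshly allocated pages linked into an in-core (shadow) inode; the
 * single write of the inode is the commit point.                       */
#include <stdio.h>

struct inode  { int pages[4]; int npages; };    /* on-disc page pointers  */
struct incore { struct inode shadow; };         /* in-core (shadow) inode */

static void page_flush(int p) { printf("flush new page %d\n", p); }
static void inode_write(void) { printf("write inode (the commit point)\n"); }
static void page_free(int p)  { printf("free page %d\n", p); }

static void commit_file(struct inode *disc, struct incore *ic)
{
    struct inode old = *disc;

    for (int i = 0; i < ic->shadow.npages; i++)   /* 1. flush the new pages    */
        page_flush(ic->shadow.pages[i]);

    inode_write();                                /* 2. a single I/O commits   */
    *disc = ic->shadow;

    for (int i = 0; i < old.npages; i++)          /* 3. release the old pages  */
        page_free(old.pages[i]);
}

static void abort_file(struct incore *ic, const struct inode *disc)
{
    for (int i = 0; i < ic->shadow.npages; i++)   /* roll back: drop new pages */
        page_free(ic->shadow.pages[i]);
    ic->shadow = *disc;                           /* back to the last commit   */
}

int main(void)
{
    struct inode  disc = { { 10, 11 }, 2 };           /* committed file state */
    struct incore ic   = { { { 20, 21, 22 }, 3 } };   /* a pending update     */
    struct incore ic2  = { { { 30, 31 }, 2 } };       /* a second update      */

    commit_file(&disc, &ic);
    abort_file(&ic2, &disc);
    return 0;
}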

As described, TCF provides extensive data transparency functionality. The system is capable of supporting a wide variety of distributed data configurations with excellent network transparency. Since the LOCUS OS and TCF are the evolution of a single site system, some non-transparencies do exist, primarily to provide binary compatibility with the UNIX interface. Primitives (the local alias and hidden directories) were added which provide a context-sensitive naming mechanism. In both cases it is possible to reference the underlying objects in a name transparent fashion*. The following section describes the device transparency functionality.

*In the case of the local file system, /site-name/filename is not a context-sensitive name and is part of the global name space. Hidden directories have an escape mechanism which disables the automatic slide through: an '@' is put at the end of the component name, so /bin/who@i370 would name the S/370 component of the /bin/who hidden directory.

Device transparency
TCF's device transparency functionality includes support for:

• Remote block devices
• Remote terminal devices
• Remote select

UNIX block devices are quite similar to file I/O. The kernel accesses the device via the buffer cache. Data is read/written in buffer size units by the kernel. The bulk of the remote block device support is the logical result of the LOCUS OS structure. The most significant issue is the difference in assignment of storage sites. For a file the storage site is the storage site of the file's inode. In the case of devices, the inode storage site may or may not have any relationship to the storage site of the device (the device site). A UNIX device is identified by a type and unit number pair*. The pair is kept in the inode and used when the device is opened or accessed. The result is that a device file has two storage sites, and different operations require different storage sites: a change of permission will use the inode storage site, while a read operation requires the device storage site.

To provide complete device transparency TCF extended the device identification pair to a triplet. The third element is the device site of the device. By including the device site in the inode, a device file will reference the same device from any site in the cluster†. Using the device site the system is able to locate and access the correct device. The system is structured to properly direct inode-specific operations to the inode storage site, even when the storage site is established as the device site. The device type and unit numbers are only used on the appropriate device site, thus giving complete autonomy to each site over its device numbering scheme.

*UNIX refers to the pair as the major/minor device numbers.
†This is a significant security issue. Some systems allow device files to select their device site at open time, thus allowing device files to reference different sites depending on when and how they are accessed, and to violate a site's security.
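The extended device identification can be pictured as follows; the structure layout and helper names are assumptions for illustration, not the TCF inode format.

/* Hypothetical sketch of TCF-style device identification: the classic
 * UNIX (type, unit) pair kept in the inode is extended with a device
 * site, so a device file names the same device from any cluster site.  */
#include <stdio.h>

struct tcf_dev {
    int type;          /* device type (major), used only on the device site */
    int unit;          /* unit number (minor), used only on the device site */
    int device_site;   /* site where the physical device is attached        */
};

/* Inode-specific operations (e.g. a permission change) go to the inode
 * storage site; data operations go to the device site.                 */
static int site_for_read(const struct tcf_dev *d)  { return d->device_site; }
static int site_for_chmod(int inode_storage_site)  { return inode_storage_site; }

int main(void)
{
    struct tcf_dev console = { 4, 0, 7 };   /* tty type 4, unit 0, on site 7 */
    printf("read  -> site %d\n", site_for_read(&console));
    printf("chmod -> site %d\n", site_for_chmod(3));   /* inode stored on site 3 */
    return 0;
}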

The most commonly accessed UNIX device is the terminal. In order to provide complete transparency, including remote processing, it must be possible for a remote program to write to a user's terminal. TCF provides this complete remote terminal transparency. Access to a terminal (or pseudo-terminal) is redirected to the appropriate site for input/output. Flow control is handled to prevent a process or terminal from inappropriately consuming system resources. All terminal operations are supported, including ioctl() operations. The known ioctl() operations are appropriately communicated between the using and device sites (even in the heterogeneous hardware environment).

One of the common device functions is the select() operation. Select() allows the user program to block concurrently on several file descriptors. When one or more of the underlying objects unblocks, the select() call completes and the system returns the appropriate status to the application. TCF permits the operands of a single select() call to be distributed throughout the cluster. The application is provided with the correct single site semantics.

This material describes the major TCF device transparency functionality. The following section describes the functions available for computing in the distributed environment.

Process transparency
The TCF process transparency functionality is subdivided into several major components:

• Global name space
• Remote program initiation
• Automatic site selection
• Process migration
• Software interrupts
• Load levelling
• File offset tokens

The UNIX process name space is a range of values within a signed integer. The name, or process identifier (PID), of a process is assigned when the process is created. The system allocates a PID using a moving wave algorithm. To prevent duplicates the system is obligated to verify that the PID is not in use prior to completing the allocation.

The UNIX process name space has both explicit and implicit naming functions. Explicit process naming occurs when a system call uses a PID and identifies a particular process. Implicit process naming occurs by referencing a group of processes or the parent process. POSIX defines two grouping mechanisms: sessions and process groups. Other process related system calls use the parent, child and sibling relationships in their semantics. Process groups and sessions both use the PIDs of their leader processes as their names.

The LOCUS OS provides a complete global transparent name space for processes. PIDs are uniquely allocated throughout the cluster, and the system retains relationships independent of process location. Since PIDs are unique within the cluster, process groups and session identifiers are also globally unique; providing a transparent PID name space produces a transparent name space for sessions and process groups. An efficient global PID allocation scheme was created by segmenting the PID values into a series of ranges. Each site is assigned one of the ranges from which to allocate PIDs. The PID range assignment is used to determine the origin site of the process. The origin site does not indicate where the process is currently executing; it does provide a hint for use when trying to locate an arbitrary process.

As with the data name space, the process name space has one case of non-transparency. The kernel threads and the initialization process are system specific. None of these processes are intended to be globally named. In fact, in most cases the local process is the right process to be referenced. These are the defined semantics in TCF.

UNIX commonly initiates a command via a process duplication operation, fork(), followed by a program initiation operation, exec(). The LOCUS OS has provided a performance enhancement via the introduction of a new system service, run(). Run() is the combination of a fork() and an exec(): a new process is created and the program is initiated in the new process. Since the memory image need not be copied (as fork implies), the overhead of invoking the command is reduced.

Remote program initiation causes a command to be invoked on another site. A program may be initiated remotely, either explicitly or implicitly. The common explicit remote program initiation user interface is the on command of the command interpreters (shells). The on syntax prepends 'on' and the site name or site type to the command and its arguments. To remotely invoke a command the interpreter first changes the local alias to the destination site, and then uses the rexec() system service to remotely initiate the command. Rexec() is an extended version of the UNIX exec() system service. The primary difference between exec() and rexec() is an additional argument identifying the destination site of the program initiation. Run() also has a destination site argument.

The common implicit remote program initiation user interface is 'automatic site selection'. All program initiations result in a check to verify that the specified program is appropriate for the underlying machine. If the system detects a mismatch it will automatically try to re-direct the initiation operation (and the current process) to another machine*. Alternatively, implicit remote program initiation can occur via the execution site path. The execution site path is a preference list of execution sites (normally, 'LOCAL' is the first element of the list). This is the same list used to evaluate hidden directories; there is a direct relationship between selecting a program to execute and selecting a program's execution site. The execution site path method of implicit remote program initiation is useful when pre-existing programs are extended to be distributed programs, without direct change†.

Rexec() and exec() can both initiate a program on a remote site. To initiate a program on a remote site the current process must first be moved to the destination site. When run() is used a new process is created on the remote site. In all three remote program initiation cases the process is in a known state with a limited amount of current environmental data. Process migration is quite different, since it will move a randomly executing process to a 'like' remote site‡. There are two interfaces for process migration: migrate() and SIGMIGRATE. The migrate() system call can be used to migrate the current process synchronously. The only argument is the destination site of the process. Process migration can only occur between like machine types. If the migrate operation fails, the process continues to execute on the original site. SIGMIGRATE is a new signal defined in the LOCUS OS. It can be sent to any process or process group for which the sender has permission. SIGMIGRATE's default behaviour attempts to move the process to the designated site. The sender only receives status on the delivery of the signal, not on the migration operation. Alternatively, the process may discard or catch the signal and move temporary files, alter context information, etc., prior to migrating. The process migration operation re-establishes the execution environment (open files, etc.) on the destination site; this includes moving the entire memory image.

Software interrupts or signals are used to asynchronously notify another process, or the current process, of an exception. Signals are initiated via the kill() system call using a PID or process group identifier. Kill() is able to locate an arbitrary process or process group. The system will locate the process or process group (all its members) and post the signal event, once and only once. If the process is not located the kill will fail.

*The ability to perform automatic site selection is the direct result of the data and processing transparency provided by the LOCUS OS and TCF. Without the high degree of transparency the user semantics would be unpredictable.
†The execution site path is an example of design goal two providing orthogonal control.
‡In spite of their significant user level differences, at the system level rexec() and process migration are quite similar. In fact, the bulk of the code is common between the two operations.
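The range-based PID allocation and the origin site hint described earlier in this section can be sketched as below. The range size, array bounds and helper names are invented, and a real allocator must also verify that a candidate PID is not in use before completing the allocation.

/* Invented sketch of range-based global PID allocation: each site owns
 * a disjoint slice of the PID space, so the origin site of any PID can
 * be recovered arithmetically and used when locating the process.      */
#include <stdio.h>

#define PIDS_PER_SITE 100000     /* size of each site's slice (assumed)   */

static int next_pid[32];         /* per-site moving-wave allocation state */

static int alloc_pid(int site)   /* allocate from this site's own slice   */
{
    int pid = site * PIDS_PER_SITE + next_pid[site];
    next_pid[site] = (next_pid[site] + 1) % PIDS_PER_SITE;
    return pid;                  /* a real system also checks for reuse   */
}

static int origin_site(int pid)  /* hint used to start a process search   */
{
    return pid / PIDS_PER_SITE;
}

int main(void)
{
    int pid = alloc_pid(3);
    printf("pid %d originated at site %d\n", pid, origin_site(pid));
    return 0;
}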


The locating algorithm uses the origin site of the PID as the starting point for the search. The origin site is obligated to know the current execution site of all processes it originated. Other transparent processing functions include priority management and group management.

TCF provides several load levelling mechanisms and the basic primitives to build additional mechanisms. The command interpreters provide a migrate command; it uses SIGMIGRATE to request that a process or group move to a new execution site. The fast command is provided to make load levelling decisions at program initiation. Recently, a prototype of a cluster-wide load levelling service was developed. The service checks loads throughout the cluster and uses SIGMIGRATE to automatically re-direct processing load within the cluster.

Process movement, either from a migrate or a remote process initiation, requires extensions to several underlying system services. UNIX defines shared file access characteristics across forks. The sharing includes the sharing of a single file offset. When one of the two processes moves to another site it is necessary for the offset and access flags effectively to become shared memory between the two sites. TCF's file offset tokens provide the necessary control and sharing of the file access information. The token algorithm provides fairness and correct synchronization semantics.

Other underlying areas which provide critical support are the distributed file system and locking services. These services must support the re-establishment of the active locks and open files for a process migration or remote program initiation.

The functionality described above provides complete process transparency. TCF is the only production level system to provide complete process transparency. The following section describes how system state is resolved when a site joins or leaves the partition.

Error handling
Distributed systems commonly have more failure modes than a single site system. One of the fundamental objectives of a transparent distributed system is to automatically recover from as many failure conditions as possible. One example is automatically switching to another replica when a replicated file is being read and the current storage site fails. However, some errors must be reported to the user or application. To avoid impacting existing applications the LOCUS OS attempts to map as many errors as possible to a sensible single site error. One example is the error returned if a write to a replicated file is attempted when the primary replica is not available: the error returned is the same error returned when a file system is mounted read-only.

The LOCUS OS allows cluster sites to come and go at will. Whenever a site failure is detected, or a new site is announced, the system must resolve key state information. This process is called dynamic reconfiguration. Phase I of this process is the verification of the sites in the partition. During this step all sites in the partition select one site to lead the reconfiguration process. The selected site then initiates the second phase. In Phase II the system reconciles the resources within the cluster. For example, a storage site which has lost its using site closes the file. If a new site is being added, the data and processing present on that site are added to the global name spaces. During Phase II each file system has its CSS verified. Any file system which lost its CSS, or gained its primary replica, has a new CSS selected and assigned. Each CSS of each replicated file system verifies that it has state information on all active files. If processes are present in the cluster but their origin site is not present, a surrogate origin site is appointed. A surrogate origin site is the origin site for all processes that are orphaned from their origin site.

During dynamic reconfiguration all local operations continue without interruption. Some remote operations will be temporarily suspended. On a typical system the suspension is a small number of seconds. Fundamentally, the reconfiguration process is responsible for re-establishing a consistent view of the new cluster; however, that view may have changed.

The preceding discussion has described all the key functions in the LOCUS OS and TCF systems. Those functions are in the areas of data, device and process transparency. The resulting system is a highly network transparent system, providing a single system image.

ARCHITECTURAL STRUCTURE

The LOCUS OS and TCF system architecture is a multi-levelled architecture. This section discusses several of the key layers.

The lowest level of the distributed system is an IP-based transport layer. This layer provides site-to-site reliable connections ('communication channels'). If a gateway fails, a route re-establishment protocol (LARP) is invoked to re-establish the connections in a timely fashion*. The transport layer handles byte flipping, acknowledgements, and fragmentation and reassembly of packets. In addition, a buffered I/O interface, for use by the file system, is provided. In this way the file system component can treat reads and writes of local files similarly to remote accesses. The transport layer also detects any partition changes and initiates dynamic reconfiguration, as appropriate.

*The LOCUS OS assumes connection transitivity between all sites in the cluster, i.e. if A can communicate with B and B with C then the system assumes A can talk to C.

TCF cluster sites keep a pool of kernel threads (processes) to be the servers for incoming requests. The servers are lightweight processes that execute only in kernel code. An incoming message is surveyed and assigned to a server process. The server process consults the message, determines the requested action, executes the requested action and then, if appropriate, responds to the message. When the actions associated with the message are completed the server process returns to the server pool for future requests. The server allocation support will automatically allocate new threads: if the allocation routines detect low server availability they allocate new threads. The system is prevented from excessive creation of servers.
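The kernel server pool can be pictured with a small user-space analogy using POSIX threads. It is an invented sketch, not TCF kernel code, and the message handling is reduced to a placeholder.

/* Invented user-space analogy of the server pool: each pool member
 * takes a message, performs the requested action, replies if needed
 * and returns to the pool for further work.                            */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct msg { int op; int needs_reply; };

static void *server(void *arg)
{
    int id = *(int *)arg;
    for (;;) {
        struct msg m = { 1, 1 };        /* stand-in for a dequeued message */
        printf("server %d: executing op %d\n", id, m.op);
        if (m.needs_reply)
            printf("server %d: replying\n", id);
        sleep(1);                       /* back to the pool for more work  */
    }
    return NULL;
}

int main(void)
{
    pthread_t pool[4];                  /* low availability would grow this */
    int ids[4];

    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&pool[i], NULL, server, &ids[i]);
    }
    sleep(3);                           /* let the pool run briefly         */
    return 0;
}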

locus allocation routines detect a low server availability they allocate new threads. The system is prevented from excessive creation of servers, Intrinsically, many of the system components are independent of the others in their implementation. The dependence of the components is the result of the design semantics for the system. For example, replication is not required by the distributed file system or process transparency. However, providing the necessary performance and availability needs for the cluster motivates the inclusion of replication, The data transparency components have several architectural relationships. The global name space is used by the heterogeneity and local alias functionality. The replication uses the atomic commit mechanism to guarantee that a new version of a replica is updated atomically. Replication is structured on top of the distributed file system. Access to individual replicas is done via the non-replicated protocol. Clearly, the replicated root support uses the system replication, local alias, and the hidden directory mechanism. All of the data transparency functions depend on the dynamic re-configuration support to collect cluster data, resolve orphaned active objects, and establish the new cluster view. The device transparency functionality has two major components.The remote terminal device support does not use the file system to access the terminals. The remote block device support does use the distributed file system support for data access, Process transparency functionality's major architectural interactions are with the global name space, distributed file system, remote terminal devices and dynamic reconfiguration. Without the underlying file system transparency the ability to automatically select execution sites and migrate processes would be inhibited. Terminal interaction is mandatory to allow users to freely direct tasks at the most appropriate execution location. Dynamic reconfiguration is responsible for resolving the process name space.

Design axioms and algorithms
The objective of this section is to give the reader insight into several key algorithms within the system.

Fundamentally, the LOCUS OS provides a single system image. A key to establishing a single system image is establishing name space transparency for each object type. The mechanism for building the name space should permit each cluster site to operate as independently as possible. The second concept that was repeatedly applied is to precisely emulate UNIX semantics. Distributed system behaviour was correlated to single system behaviour and the appropriate code provided. When no direct correlation could be found, the fifth design goal was applied.

Process management
The process migration algorithm is largely common to all forms of process movement. When the SIGMIGRATE signal is received or the migrate() call is issued, the system determines if the destination site is of the correct type. Next it collects some critical process state information (the process's UID, PID, etc.) and sends a migrate request to the destination site. At the destination site a server process executes the server code for the migrate request. After the server's verification of the request it forks; this new process will be the migrated process. The server process responds to the migrate request with an affirmative and the process identifier of the new process. The new process sets its PID to the PID of the migrating process (and its session ID and process group ID).

The new process, at the destination site, now takes control of the protocol. It requests information on the open files and active locks. These files are re-opened and re-locked within the new process at the destination site, and the appropriate file access structures are constructed. Next the destination process starts requesting the memory pages of the migrating process. Typically, the text will be read only and available directly from the original load module. The process private pages are pulled by the destination process. When all the writable pages have been transferred, process migration is almost complete. Finally, all newly posted signals must be transferred to the new process and the process locating information must be updated to indicate the process's new location. Next the new process sends a migrate complete message to the source process. The migrating process at the source node first changes its PID to 0, and then notifies the appropriate origin site that the new location of the process is the destination site (similarly for the session and process group). To complete the migrate operation the source site responds to the migrate complete message with any newly posted software interrupts.

Access synchronization
The file access synchronization algorithm is described below. The LOCUS OS uses a centralized synchronization algorithm, with the CSS coordinating file accesses. For each file system a CSS is appointed (from the set of sites storing a replica of the file system). If a partition loses the CSS for a replicated file system (due to a failure), then dynamic reconfiguration will appoint a new CSS. For replicated file systems the CSS is responsible for tracking the access modes of each using site. When a using site requests write access the CSS will force all subsequent accesses (by all using sites) to the primary replica. The initial message in the open file protocol is to the CSS, to select a storage site. The CSS also guarantees that the token mechanism grants only one write token at a time (and that all the read tokens are reclaimed). The replicated file system CSS is also responsible for guaranteeing that all file operations are directed to the most current replica of the requested file.
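A minimal sketch of the CSS token rule is given below, assuming a single write token and reclaimable read tokens as described above. The state kept by a real CSS is richer, and the outright refusal of a conflicting request here (rather than reclaiming tokens and redirecting access to the primary replica) is a simplification; all names are invented.

/* Invented sketch of the CSS token rule: at most one write token is
 * outstanding, and granting it reclaims all read tokens and marks the
 * file so that subsequent access goes to the primary replica.          */
#include <stdio.h>
#include <string.h>

#define MAX_SITES 8

struct css_file {
    int readers[MAX_SITES];    /* sites holding a read token             */
    int writer;                /* site holding the write token, or -1    */
    int use_primary_only;      /* set once a writer has appeared         */
};

static int grant_write(struct css_file *f, int site)
{
    if (f->writer >= 0 && f->writer != site)
        return -1;                                 /* one writer at a time  */
    memset(f->readers, 0, sizeof f->readers);      /* reclaim read tokens   */
    f->writer = site;
    f->use_primary_only = 1;                       /* all access -> primary */
    return 0;
}

static int grant_read(struct css_file *f, int site)
{
    if (f->writer >= 0 && f->writer != site)
        return -1;                                 /* simplification: refuse */
    f->readers[site] = 1;
    return 0;
}

int main(void)
{
    struct css_file f = { {0}, -1, 0 };
    printf("read  for site 1: %d\n", grant_read(&f, 1));
    printf("write for site 2: %d\n", grant_write(&f, 2));
    printf("read  for site 3: %d\n", grant_read(&f, 3));   /* refused */
    return 0;
}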

There are two performance optimizations to the synchronization mechanism. First, a directory search of a locally stored replicated directory does not require communication with the CSS. Directories are commonly small, changed relatively infrequently and automatically updated, so directories are normally up-to-date. If the directory is known to be out-of-date then the CSS is consulted. A similar optimization is performed for program invocation: on exec() no communication with the CSS is performed and the local copy is used.

TCF has a different approach to pathname expansion than is currently found in AFS [9] and NFS [10]. In AFS and NFS the system expands a pathname one component at a time, sending each component to the appropriate storage site for evaluation (AFS may have the file cached and not require the communication). TCF pathname expansion was optimized to send all of the remaining path to the storage site. The storage site expands the name until another storage site is required; then it returns the unresolved path and the low level name of the last component it evaluated. In most cases the number of messages is significantly reduced, and better performance results. The ability to optimize in this fashion is the direct result of the common, cluster-wide view of the name space.

Portions of the TCF replication design were outlined in the synchronization discussion above. The other significant replication algorithm is responsible for managing the update propagation between replicas. TCF does not use a broadcast write strategy. Instead, application writes are directed to the primary replica. When the application completes, the primary replica is committed to disc. The application proceeds when the commit completes. Each time a program commits a new version of a file, the primary replica assigns a file system-wide commit count to the file. Commit counts are used to detect replicas which are out-of-date. After the commit operation completes, the primary replica notifies all other replicas that a new version of the file is available. They are responsible for propagating the new version of the file at their leisure.

A special kernel process is responsible for propagating updates to all local replicas. The standard open and read protocols are used to open the file and read the data. The reads are typically from the primary replica and are always written to the local replica. The commit count from the primary replica is propagated to the local replica. In some conditions, if the file is only partially modified, the system will only propagate the modified pages.

The mechanism described would be completely correct if all sites were always available and no failures ever occurred. TCF replication does not require these assumptions; instead it uses commit count information to detect replicas that are out-of-date. In addition to a per file commit count, replication maintains additional information on a file system replica basis: a commit count high water mark, a commit count low water mark and a commit list. The high water mark is defined to be the largest commit count of any known change. The low water mark is defined as the commit count value whose change, and all previous changes, are successfully propagated to this replica. The commit list is used as a window onto the recent changes. The low water mark can only be updated by increments of one. The commit list is used to manage propagations which occur out of commit count order.
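The water mark bookkeeping can be illustrated with a small sketch. The 50-commit window matches the kernel reconciliation limit mentioned below, but the data layout and helper names are invented and the shifting window is a simplification of the actual commit list.

/* Invented sketch of per-replica commit-count bookkeeping: the low water
 * mark advances by one only when the next commit in order has arrived;
 * the commit list is a window used to absorb out-of-order propagation.  */
#include <stdio.h>

#define WINDOW 50                       /* kernel reconciliation window   */

struct replica_state {
    unsigned low;                       /* all commits <= low are present */
    unsigned high;                      /* largest commit count known     */
    int done[WINDOW];                   /* commits low+1 .. low+WINDOW    */
};

static void note_propagated(struct replica_state *r, unsigned cc)
{
    if (cc > r->high)
        r->high = cc;
    if (cc > r->low && cc <= r->low + WINDOW)
        r->done[cc - r->low - 1] = 1;

    while (r->done[0]) {                /* advance the low water mark by one */
        r->low++;
        for (int i = 0; i + 1 < WINDOW; i++)
            r->done[i] = r->done[i + 1];
        r->done[WINDOW - 1] = 0;
    }
}

/* Kernel reconciliation applies when the gap fits in the commit list;
 * otherwise user-level reconciliation scans the primary replica.        */
static const char *method(const struct replica_state *r, unsigned primary_high)
{
    return (primary_high - r->low <= WINDOW) ? "kernel" : "user-level";
}

int main(void)
{
    struct replica_state r = { 10, 10, {0} };
    note_propagated(&r, 12);            /* out of order: low stays at 10  */
    note_propagated(&r, 11);            /* gap filled: low advances to 12 */
    printf("low=%u high=%u, reconcile via %s\n", r.low, r.high, method(&r, 70));
    return 0;
}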

When a file system replica has been unavailable (unable to communicate with the primary replica) for a period of time it is likely to get out-of-date. There are two levels of reconciliation which can occur. The first method is called kernel reconciliation; it can handle the replica being up to 50 commits out-of-date. The second method is called user-level reconciliation; it handles an arbitrary out-of-date situation. When the replica becomes available it receives the commit list, high water mark and low water mark from the primary replica (actually, from the CSS). If its low water mark falls within the primary replica's commit list, the new replica uses this commit list to identify the files to propagate. User-level reconciliation is invoked by the kernel notifying a user-level daemon. The user daemon creates a recovery process specific to the replica. The recovery process uses the replica's current low water mark to find all recent updates. The primary replica is scanned for larger commit count values. Each file on the recently updated list is propagated to the out-of-date replica. If all propagation requests succeed, the new low water mark is forced onto the replica.

Process transparency and remote procedure call technologies are compared below, and some future directions for distributed systems technology are identified.

RPC AND PROCESS TRANSPARENCY

Over the last ten years remote procedure call (RPC) technology has developed into a significant component of distributed systems technology. To apply RPC technology to an application to produce a distributed application commonly requires the application to be redesigned. This is in contrast to TCF's process transparency, where significant effort is made to match existing interfaces and provide orthogonal control interfaces. More important than the fundamental application of either technology is the observation that process transparency is a higher architectural concept than RPC. In an RPC-based environment, process transparency would be built onto RPC services. The process transparency services can be exported and used independently. For example, the file re-open and file offset token support can be used in an arbitrary application to share a single file access path between two servers. As the number of RPC-based applications increases, the value of process transparency will increase. Distributed application builders will object to continually rebuilding the functions necessary to address an application's transparency needs. The following section explores possible future enhancements.

FUTURE DIRECTIONS

In 1989 the next generation of process transparency and clustering was submitted to the Open Software Foundation (OSF) in response to the DCE RFT*.

*These technologies were a portion of the DEcorum submission. DEcorum was composed of technology from Locus Computing Corporation, Transarc, HP and IBM.

The DEcorum process transparency and clustering support was designed to be portable, and to scale to large numbers of systems (approximately 4000). The clustering, full file system replication and process transparency support used the underlying RPC and distributed file systems to provide the network transport and remote file access. The portability objective of the DEcorum design was less than 2500 lines of hooks in the base and distributed file system software. Using these hooks, additional modules can be added to provide the clustering, full file system replication and process transparency support.

One of the major technology steps embedded in the DEcorum process transparency architecture was the creation of a virtual process (vprocs) layer within the system. Similar to the file system 'vnode' switch, vprocs separate the name space management from the specific operations available on a process object. Via vprocs the system is able to insert local process operations, client operations, etc., as the object operations for a vproc. Depending on the location of the underlying process, the correct operation set is established for a given virtual process. Modular architectures are also defined for full file system replication and clustering to allow the underlying system support to be fully leveraged. All DEcorum components are carefully designed to permit scaling to a large number of sites (the TCF implementation has a 31 site limit).
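The vproc idea can be pictured, in the style of a vnode-like operations switch, with the following invented sketch; it is not DEcorum code, and the operation set is reduced to two calls for brevity.

/* Invented sketch of a vproc operations switch: the name space layer
 * manipulates vprocs only; the operation vector bound to each vproc
 * depends on whether the underlying process is local or remote.        */
#include <stdio.h>

struct vproc;

struct vproc_ops {                     /* per-location operation set      */
    int (*send_signal)(struct vproc *, int sig);
    int (*current_site)(struct vproc *);
};

struct vproc {
    int pid;                           /* cluster-wide process identifier */
    int site;                          /* current execution site          */
    const struct vproc_ops *ops;       /* local or client (remote) ops    */
};

static int local_signal(struct vproc *v, int sig)
{ printf("post signal %d to local pid %d\n", sig, v->pid); return 0; }

static int remote_signal(struct vproc *v, int sig)
{ printf("send signal %d for pid %d to site %d\n", sig, v->pid, v->site); return 0; }

static int local_site(struct vproc *v)  { (void)v; return 0; }
static int remote_site(struct vproc *v) { return v->site; }

static const struct vproc_ops local_ops  = { local_signal,  local_site  };
static const struct vproc_ops remote_ops = { remote_signal, remote_site };

int main(void)
{
    struct vproc a = { 101, 0, &local_ops };    /* process on this site */
    struct vproc b = { 202, 5, &remote_ops };   /* process on site 5    */

    a.ops->send_signal(&a, 15);                 /* same call either way */
    b.ops->send_signal(&b, 15);
    printf("pid %d runs at site %d\n", b.pid, b.ops->current_site(&b));
    return 0;
}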

The marketplace has only recently started to analyze the importance and value of distributed systems and RPC technology. The architectural value and significance of clustering and process transparency technologies will not be fully realized until users gain more familiarity with distributed systems and RPC. Object technology is being combined with RPC to produce graphical access interfaces which manage data conversions. Future systems can be simplified by using process transparency functions to transfer process state and manage the active agents. Fundamentally, process transparency and clustering are higher level services than RPC. Other services, like object interfaces, are even higher level interfaces. The LOCUS OS and TCF experiences have completely verified the assertion that software development is significantly simplified by a network transparent environment. The application of these technologies will decrease development costs and in most cases enhance usability*.

*It is likely that some aspect of the underlying naming mechanism will be visible through the object interface. The transparency provided by clustering will increase the usability of this name space by providing some consistency.

DEcorum's distributed file system creates the concept of a cell. A cell is defined to be a single authentication domain with any number of sites. In DEcorum, a cluster is a subset of a cell which cooperates to provide a higher level of consistency for data, processing and naming. These two definition boundaries start to blur when the strict transparency definitions for clustering are eased. One obvious cell versus cluster example is remote command invocation: why should the interface change to start a command outside the cluster? The answer should be that the interface remains the same, but the underlying environment may impact the command's results. For example, if no distributed file system is present then local data may not be accessible. If the data is available, via the same name and access interface, then effectively cluster semantics have been provided to the remote command invocation.

This philosophy of subscription to different levels of service will become more common in the future. Most users will not immediately see the value of a fully transparent environment. However, specific services may be of value. As each new service is understood, users and administrators will understand the distributed environment better and use it more effectively. As the level of cooperation and coordination increases in a network, the non-transparencies can be decreased. The concept of service subscription is another degree of flexibility available for configuration establishment. The flexibility is required since different configurations have different sharing requirements. Further, the level of sharing often changes over time, as can the subscribed services.

Transitioning the LOCUS OS from a research project to the TCF product has provided numerous lessons. One of the most significant lessons surrounds software service (fixes). One of the significant features of a replicated environment is the ability to service the cluster only once; the fix is then automatically propagated to other sites. This is a significant feature to many administrators, but it also presents a problem. An administrator wants to test a new release prior to making the new code generally available. Alternatively, during a transition phase he may need to have two releases of the base code available. Logically, this is equivalent to providing each environment with all user data and providing a switch to select the current environment on a per user or per process basis. In each environment network transparency is provided. The value of network transparency remains, as remote access to resources is likely to be especially common during the transition.

This section has outlined several areas where recent work has occurred. The primary areas of attention have been to increase the number of participating sites in the cluster and to increase portability. The philosophy of installation has changed to consider configurations where systems subscribe to additional services over time. Service subscription and software fixes have both been motivated as a generalization of the single system image. The generalization provides more configuration flexibility to improve the change management properties of the installation.

SUMMARY

The material in this article has summarized the design principles, functions and some of the key algorithms of the LOCUS operating system and the evolved product, the Transparent Computing Facility. The systems provide a very high degree of network transparency, thus providing a true cluster-wide single system image. Many of the LOCUS OS design principles could be applied to non-UNIX systems and produce an equally transparent result. The future directions material identified several areas for extensions, the most significant of which is growth in the area of portability. A short comparison of RPC and process transparency is provided to aid the reader in positioning both technologies.

ACKNOWLEDGEMENTS

I wish to thank Yolanda Cueva for her extensive efforts in the production of this material. I also wish to thank Steve Kiser for his thoughtful review.

REFERENCES

1 Popek, G, Walker, B, Chow, J, Edwards, D, Kline, C, Rudisin, G and Thiel, G 'LOCUS: a network transparent, high reliability distributed system', Proc. 8th Symposium on Operating Systems Principles, Pacific Grove, CA, USA (December 1981)
2 Walker, B J Issues of Network Transparency and File Replication in Distributed Systems: LOCUS, PhD Dissertation, Computer Science Department, University of California, Los Angeles, USA (1983)
3 Walker, B J, Popek, G, English, B, Kline, C and Thiel, G 'The LOCUS distributed operating system', Proc. 9th Symposium on Operating Systems Principles, Bretton Woods, New Hampshire, USA (October 1983)
4 Walker, B J 'The LOCUS distributed computing environment', IEE Aerospace Applications Conf. Digest, IEE, USA (1984)
5 Popek, G and Walker, B J The LOCUS Distributed System Architecture, MIT Press, USA (1985)
6 AIX Announcement 3-1588 (Sections 288-130 and 288-132), IBM Corporation, USA
7 AIX Operating System Technical Reference, Volumes 1 and 2 (SC23-2300 and SC23-2301), IBM Corporation, USA (March 1990)
8 AIX Operating System Commands Reference, Volumes 1 and 2 (SC23-2292 and SC23-2184), IBM Corporation, USA (March 1990)
9 DEcorum File System, Transarc Corporation (January 24 1990)
10 NFS-VAX Documentation, Sun Microsystems, Inc., USA