Computer architecture


North-Holland Microprocessing and Microprogramming 12 (1983) 299-310

Euromicro Reports

Conference Reporter: N.A. Schreiner-Novick

Computer Architecture

The 10th Annual International Symposium on Computer Architecture was held at the Royal Institute of Technology in Stockholm, Sweden, from June 13-17, 1983. This conference, at which almost 50 lectures were delivered, was sponsored by the IEEE Computer Society, the ACM, EUROMICRO and the National Swedish Board for Technical Development. We present below a report on the topical lectures of this conference.

Keynote Speech

Size, Power and Speed

In the Keynote Address, entitled 'Size, Power and Speed', M. V. Wilkes (Digital Equipment Corporation and M.I.T., U.S.A.) discussed the roles of power and size in determining the speed of a computer. Wilkes noted that the developments in VLSI that are now going ahead so rapidly will make fast computers available in large quantities. They will not be supercomputers, he said, but they will be very fast and powerful compared with present-day personal computers. Those who now have to make do with a single personal computer, or a share in a large computer, can look forward to the day when they may hope to have several, perhaps many, such computers at their disposal, concluded Wilkes.

Computer Architecture Taxonomy

Machine Data Type View

According to W.K. Giloi (Technical University of Berlin and Univ. of Southwestern Louisiana), existing taxonomies of computer architecture lack the descriptive tools to deal with the large variety of principles, features, and mechanisms found in the existing spectrum of single-processor, multiprocessor, and multicomputer architectures. Consequently, Giloi believes, they lack the discriminating power needed to taxonomize computer architecture. In his lecture, Giloi presented a new approach toward a complete taxonomy. The key to the taxonomy is to start with the dichotomy of 'operational principle' and 'hardware structure' as the foundation of a computer architecture and to describe the constituents of the operational principle in terms of 'machine data types' consisting of 'machine data objects', their representations, and the functions applicable to the objects. The resulting taxonomy provides a systematic approach to the design of innovative computer architectures.

Fault-Tolerance Taxonomy

A conceptual framework was presented by A. Avizienis (UCLA, U.S.A.) which relates various aspects of fault-tolerance in the context of system structure and architecture. Such a framework, Avizienis noted, is an essential first step for the construction of a taxonomy of fault-tolerance. Avizienis used a design methodology for fault-tolerant systems as the means to identify and classify the major aspects of fault-tolerance: system pathology, fault detection and recovery algorithms, and methods of modeling and evaluation. A computing system was described in terms of four universes of observation and interpretation: physical, logic, information, and interface, or user's. The description was used by Avizienis to present a classification of faults, i.e. the causes of undesired behavior of computing systems.

Architecture Design Methods

Caddie

B. Pehrson and J. Parrow (Uppsala Institute of Technology, Sweden) reported on a design methodology and an experimental CAD system, named Caddie, based on this methodology. Caddie supports specification, analysis and synthesis of objects that can be described as communicating processes, e.g. electronic circuits, sequential networks, digital processors and programs. Pehrson and Parrow explained that the basic ideas behind Caddie are: (1) Top-down hierarchical design. Specifications are broken down hierarchically by stepwise refinement. (2) Integrated design. Design objects from different design phases are described in the same formalism. This technique allows universal design operations. (3) Formal systems with varying descriptive and decisive powers are exploited. (4) Incremental design. Parts can be modified separately without effect on other parts. (5) Interactive design, i.e. the designer can interact with the design tools in real time. The lecture was entitled 'Caddie: An Interactive Design Environment'.

Description Language

S. Dasgupta (University of Southwestern Louisiana, U.S.A.) introduced the notion of a family of languages for the multilevel design and description of computer architectures. Details of a particular language family, currently under development, have been previously described. One of the constituent members of this family, S*A, is intended for the specification of the outer (or exo-) and inner (or endo-) architectures of general purpose von Neumann style computers. In a lecture entitled 'On the Verification of Computer Architectures Using an Architecture Description Language', Dasgupta described the formalization and application of S*A to the formal proofs of correctness of architecture designs.

Concurrent Programs

The object of R.M. King's research (Kestrel Institute, U.S.A.) is the codification of programming knowledge for the synthesis of concurrent programs. He presented sample rules and techniques that he shows can be used to derive two concurrent algorithms: dynamic programming (for the class of problems that run in polynomial time on sequential machines) and array multiplication. For both derived concurrent versions the code runs in linear time. The concurrent versions are significant and complex algorithms, though they are not new and have already been reported in the literature. The synthesis knowledge for these derivations is embodied in seven synthesis rules. King expects these rules to generalize to other classes of algorithms. He has also discovered a pair of techniques called virtualization and aggregation. This pair of techniques (plus the seven rules) is shown to be powerful enough to synthesize Kung's systolic array architecture from a specification of matrix multiplication.

VLSI Architectures

Systolic Chip

In recent years, many systolic algorithms have been proposed as solutions to computationally demanding problems in signal and image processing and other areas. Such algorithms exploit the regularity and parallelism of problems to achieve high performance and low I/O requirements. Since systolic algorithms generally consist of a few types of simple processors, or systolic cells, connected in a regular pattern, they are less expensive to design and implement than more general machines. According to A.L. Fisher, H.T. Kung, L.M. Monier (Carnegie-Mellon University) and Y. Dohi (Yokohama National University, Japan), this advantage is offset by the fact that a particular systolic system can generally be used only on a narrow set of problems, and thus design cost cannot be amortized over a large number of units. One way to approach this problem is to provide a programmable systolic chip (PSC), many copies of which can be connected and programmed to implement many systolic algorithms. Fisher, Kung, Monier and Dohi described the CMU PSC, a single-chip microprocessor suitable for use in groups of tens or hundreds for the efficient implementation of a broad variety of systolic arrays. The processor has been fabricated in nMOS, and is undergoing testing.
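
The flavour of a systolic algorithm can be conveyed with a toy simulation (a sketch of the general principle only, not of the CMU PSC or its instruction set; the cell behaviour chosen here is an assumption for illustration): identical cells each hold one weight of a one-dimensional convolution, every cell performs the same small step on each clock tick, and partial results march through the array so that a finished output emerges from the last cell once per tick.

    # A toy simulation of a 1-D systolic array: each cell holds one weight,
    # inputs are broadcast, and partial sums march through the cells one
    # step per clock tick.  (Illustrative only; the CMU PSC is programmable
    # and far more general.)
    def systolic_convolution(weights, xs):
        k = len(weights)
        partial = [0] * k           # partial sum currently held in each cell
        outputs = []
        for t, x in enumerate(xs):
            # all cells work in the same clock tick
            new_partial = [0] * k
            for j, w in enumerate(weights):
                incoming = partial[j - 1] if j > 0 else 0
                new_partial[j] = incoming + w * x
            partial = new_partial
            if t >= k - 1:          # the last cell now holds a finished result
                outputs.append(partial[k - 1])
        return outputs              # outputs[i] == sum(w[j] * xs[i + j])

    ws = [1, 2, 3]
    xs = [1, 0, 2, 5, 1]
    print(systolic_convolution(ws, xs))                 # [7, 19, 15]
    print([sum(w * xs[i + j] for j, w in enumerate(ws))
           for i in range(len(xs) - len(ws) + 1)])      # same values, computed directly

On each tick every cell combines the broadcast input with the partial sum received from its left neighbour; this regular, purely local communication pattern is what makes such arrays cheap to design and lay out.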

Synchronization Schemes

Highly parallel VLSI computing structures consist of many processing elements operating simultaneously. In order for such processing elements to communicate among themselves, some provision must be made for synchronization of data transfer. According to A.L. Fisher and H.T. Kung (Carnegie-Mellon University, U.S.A.), the simplest means of synchronization is the use of a global clock. Unfortunately, large clocked systems can be difficult to implement because of the inevitable problem of clock skews and delays, which can be especially acute in VLSI systems as feature sizes shrink. For the near term, good engineering and technology improvements can be expected to maintain the feasibility of clocking in such systems; however, noted Fisher and Kung, clock distribution problems crop up in any technology as systems grow. An alternative means of enforcing necessary synchronization is the use of self-timed, asynchronous schemes, at the cost of increased design complexity and hardware cost. Realizing that different circumstances call for different synchronization methods, Fisher and Kung provided in their lecture a spectrum of synchronization models. Based on the assumptions made for each model, theoretical lower bounds on clock skew were derived, and appropriate or best-possible synchronization schemes for large processor arrays were proposed.

Boolean Vector Machine

R.A. Wagner (Duke University, U.S.A.) described the architecture of a class of machines intended to solve computationally intensive problems much faster than can today's machines, at no increase in cost. The architectural approach he advocates is conceptually simple, and is described as follows: Take the memory of a conventional machine, M, holding 2^k words, each p bits long. Reorganize the same components into p huge registers, each 2^k bits in length. Then, add a small amount of processing logic to each bit position of the register collection. Add also a communication network, allowing the 2^k pieces of processing logic to interact. According to Wagner, the resulting architecture, B, is capable of executing a wide variety of algorithms a factor of 2^k/p to 2^k/p^2 times faster than can the conventional machine M originally. Wagner calls this architecture a Boolean Vector Machine, reflecting the fact that the machine's basic operations perform Boolean operations on huge vectors of 0's and 1's.
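
To put rough numbers on the speedup range quoted above (an illustrative calculation under assumed parameters, not figures from the lecture), take 2^k = 2^20 words of p = 32 bits: B then applies about a million one-bit processing elements in parallel where M applies a single 32-bit ALU, and the claimed advantage lies between

    2^k / p^2 = 2^20 / 1024 = 1024     and     2^k / p = 2^20 / 32 = 32768

depending on how well an algorithm's word-level operations decompose into Boolean operations on long vectors.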

Relational Data Bases

A VLSI chip for performing relational data base operations was proposed by M.A. Bonuccelli, E. Lodi, F. Luccio, P. Maestrini and L. Pagli (Università di Pisa, Italy). The chip is a tree of processors (TOP), where each processor has elementary storage and processing capabilities. A relation is stored in the lowest levels of a TOP, while the upper levels of the tree are used for routing and bookkeeping purposes. A number of basic operations such as allocate and deallocate subtrees, insert and compare m-tuples, etc., were defined for the TOPs. Relational operations are effectively performed as simple combinations of basic operations. The architecture of a data base machine based on TOPs was sketched by Bonuccelli et al. Such a machine, they said, is feasible with current VLSI technology and could become attractive in a few years if the density and performance of VLSI keep improving at the current rate.
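
The following toy rendering of the tree-of-processors idea shows how a relational operation decomposes into simple local steps (the node structure and the single 'select' operation are assumptions made for this sketch, not the TOP operation repertoire): tuples sit in the leaves, a predicate is broadcast down from the root, and the upper nodes merely route the request down and merge the answers on the way back up.

    # Toy tree of processors (TOP): tuples are stored in the leaves, and a
    # selection predicate is broadcast from the root; each internal node only
    # routes the request down and merges the answers coming back up.
    class Node:
        def __init__(self, children=None, tuples=None):
            self.children = children or []     # internal node: routing only
            self.tuples = tuples or []         # leaf node: elementary storage

        def select(self, predicate):
            if not self.children:              # leaf: local processing
                return [t for t in self.tuples if predicate(t)]
            result = []
            for child in self.children:        # broadcast down, merge up
                result.extend(child.select(predicate))
            return result

    leaves = [Node(tuples=[(1, 'a'), (4, 'd')]), Node(tuples=[(2, 'b')]),
              Node(tuples=[(3, 'c'), (6, 'f')]), Node(tuples=[(5, 'e')])]
    root = Node(children=[Node(children=leaves[:2]), Node(children=leaves[2:])])
    print(root.select(lambda t: t[0] > 3))     # [(4, 'd'), (6, 'f'), (5, 'e')]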

Data Flow Architectures I

Implementing Streams

In several data flow architectures, 'streams' are proposed as special data structures able to improve parallel execution in functional programs by providing a pipelining effect between different program parts. L.J. Caluwaerts (Agfa-Gevaert, Belgium), J. Debacker and J.A. Peperstraete (Katholieke Universiteit Leuven, Belgium) described how streams are implemented on a data flow computer system based on a paged memory. This memory holds both the data flow programs and data structures such as streams. Streams, they noted, are stored in the memory as a linked list of pages while pointers to the streams flow as data tokens. A reference count was used to prevent excessive copying of data and to control the allocation and recovery of pages. In the lecture, input/output was treated as a special application of streams.

Piecewise Data Flow Machine

J.E. Requa (Lawrence Livermore National Laboratory, U.S.A.) presented the hardware register management and instruction block control flow sequencing provided by the block processing section of the Piecewise Data Flow (PDF) machine, a proposed high performance computer architecture. Combined, these capabilities provide the maximum allowed execution overlap of instruction blocks with minimum hardware contention and high hardware utilization. Requa's lecture was entitled 'The Piecewise Data Flow Architecture Control Flow and Register Management'.

Working Set Concept

M. Tokoro, H. Sunahara (Keio University, Japan) and J.R. Jagannathan (University of Waterloo, Canada) discussed the concept of the working set for data-flow machines in order to establish one of the criteria for the realization of cost-effective data-flow machines. The characteristics of program execution in conventional machines and data-flow machines were compared. Then, a definition of the working set for data-flow machines was proposed, based on the simultaneity of execution and the principle of locality. Several segmentation, fetch, and removal policies were described in the lecture. Evaluation was made in terms of feasibility, efficiency, and performance, through computer simulations.

Data Driven System

A hardware approach to the design of data driven computers using a microprogrammed processor as a building block was proposed by R.W. Marczyński (Polish Academy of Sciences) and J. Milewski (Warsaw University, Poland). The data driven computer is a network of processors virtually implementing the data flow program graph. The flexibility of microprogramming, they said, provides a wide variety of possible implementations for the fixed data driven network structure. Due to the proposed organization of communication, the network is deadlock-free. The lecture was appropriately entitled 'A Data Driven System Based on a Microprogrammed Processor Module'.

Cache Memories

VLSI Instruction Cache for a RISC

D.A. Patterson and his colleagues (University of Calif., Berkeley, U.S.A.) presented the 'Architecture of a VLSI Instruction Cache for a RISC' (Reduced Instruction Set Computer), an architectural philosophy promising higher performance using simpler hardware. Their long-term goal is to design a single chip that combines the CPU with an instruction cache, as this combination reduces off-chip references. The silicon processing available to them precludes an on-chip cache. This separate cache chip should, they noted, be considered a research prototype rather than a potential commercial product. They presented four architectural ideas potentially applicable to other VLSI machines, then explained the performance evaluation of each idea. They then described the implementation of the cache and finally summarized the results.

Shared-Cache Performance

According to P.C.C. Yeh (IBM Corporation, U.S.A.), J.H. Patel and E.S. Davidson (University of Illinois, U.S.A.), shared-cache memory organizations for parallel-pipelined multiple instruction stream processors avoid the cache coherence problem of private caches by sharing single copies of common blocks. A shared cache may have a higher hit ratio, but suffers performance degradation due to access conflicts. Yeh, Patel and Davidson proposed effective shared-cache organizations which retain the cache coherency advantage and which have very low access conflict even with very high request rates. Analytic expressions for performance based on a Markov model have been found for several important cases. Performance of shared cache organizations and design tradeoffs were discussed in the lecture.

Memory-Processor Traffic

In a lecture entitled 'Using Cache Memory to Reduce Processor-Memory Traffic', J.R. Goodman (University of Wisconsin-Madison, U.S.A.) recognized the importance of reducing processor-memory bandwidth in two distinct situations: single board computer systems and microprocessors of the future. Cache memory was investigated as a way to reduce the memory-processor traffic. Goodman showed that traditional caches which depend heavily on spatial locality (lookahead) for their performance are inappropriate in these environments because they generate large bursts of bus traffic. A cache exploiting primarily temporal locality (look-behind) was then proposed and demonstrated to be effective in an environment where process switches are infrequent. Goodman argued that such an environment is possible if the traffic to backing store is small enough that many processors can share a common memory and if the cache data consistency problem is solved.
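
A minimal trace-driven sketch of the kind of comparison this argument suggests (the cache parameters and the toy address trace are assumptions for illustration, not data from the lecture): a cache with multi-word blocks fetches a whole block on every miss, so references with little spatial locality drag unused words across the bus, while a one-word-block cache of the same capacity moves only the words actually touched.

    # Count words moved over the bus by a direct-mapped cache for a given
    # address trace.  block_words > 1 exploits spatial locality (lookahead);
    # block_words == 1 relies purely on temporal locality (look-behind).
    def bus_words(trace, total_words=64, block_words=4):
        n_blocks = total_words // block_words
        tags = [None] * n_blocks                 # one tag per cache block
        moved = 0
        for addr in trace:
            block = addr // block_words
            index = block % n_blocks
            if tags[index] != block:             # miss: fetch the whole block
                tags[index] = block
                moved += block_words
        return moved

    # A toy trace: a short loop executed many times plus a strided scan
    # with almost no spatial locality.
    trace = list(range(0, 16)) * 20 + list(range(1000, 1800, 4))
    print(bus_words(trace, block_words=4))   # 816: every strided miss drags in a full block
    print(bus_words(trace, block_words=1))   # 216: only the words actually touched cross the bus

For this trace the 4-word-block cache moves 816 words over the bus against 216 for the 1-word-block cache, even though both serve the tight loop almost entirely from the cache; this is only the fetch traffic, not the full comparison made in the lecture.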

Instruction Caches

Instruction caches were analyzed by J.E. Smith and J.R. Goodman (University of Wisconsin-Madison, U.S.A.) both theoretically and experimentally. The theoretical analysis began with a new model for cache referencing behavior - the loop model. This model was used to study cache organization and replacement policies. Smith and Goodman concluded theoretically that random replacement is better than LRU and FIFO, and that under certain circumstances, a direct-mapped or set-associative cache may perform better than a fully associative cache organization. Experimental results using instruction trace data were given in the lecture and the experimental results were shown to support the theoretical conclusions.

Multiple Functional Unit Processors

Very Long Instruction Word Architectures

By compiling ordinary scientific applications programs with a radical technique called trace scheduling, J.A. Fisher (Yale University, U.S.A.) is generating code for a parallel machine that will run these programs faster than an equivalent sequential machine. Trace scheduling, he explained, generates code for machines called Very Long Instruction Word architectures. In VLIW machines, many statically scheduled, tightly coupled, fine-grained operations execute in parallel within a single instruction stream. VLIWs are more parallel extensions of several current architectures. Once it became clear to Fisher that code for a VLIW machine could actually be compiled, some new questions appeared, and answers were presented in his lecture: How do we put enough tests in each cycle without making the machine too big? How do we put enough memory references in each cycle without making the machine too slow?

Low-level Parallelism

W. Tomita, K. Shibayama, T. Kitamura, T. Nakata and H. Hagiwara (Kyoto University, Japan) described the architecture of a dynamically microprogrammable computer with low-level parallelism, called QA-2, which is designed as a high-performance, local host computer for laboratory use. The architectural principle of the QA-2 is the marriage of the high-speed, parallel processing capability offered by four powerful Arithmetic and Logic Units (ALUs) with the architectural flexibility provided by large scale, dynamic user-microprogramming. By changing its writable control storage dynamically, the QA-2 can be tailored to a wide spectrum of research-oriented applications covering high-level language processing and real-time processing.


Reliability

Combining Tags

Many computer systems include extra bits in each word of storage to allow detection (and possibly correction) of memory failures. According to R.H. Gumpertz (Carnegie-Mellon University, U.S.A.), these same bits can be used to implement tag-checking without sacrificing their normal error-handling properties. The tagging facility so provided can be used to detect a variety of hardware and software errors that might otherwise go undetected. Because no extra storage is required, the cost of adding such tagging can be small even though the benefit derived can be large.
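
A crude sketch of the flavour of such a scheme (the toy checksum below is an assumption made for illustration, not Gumpertz's encoding, which works with the machine's existing error-checking bits): the check bits stored with a word are computed over both the data and the tag the word is supposed to carry, so reading it under the wrong tag is indistinguishable from a storage error and is caught by the normal checking machinery.

    # Toy illustration: fold the tag into the check bits, so that checking a
    # word against the wrong tag looks like a storage error.  (The real scheme
    # uses the machine's existing ECC bits; this is only a sketch.)
    def check_bits(data, tag):
        return (data ^ (tag * 0x0101)) % 251     # toy checksum, not real ECC

    def store(memory, addr, data, tag):
        memory[addr] = (data, check_bits(data, tag))

    def load(memory, addr, expected_tag):
        data, stored_check = memory[addr]
        if stored_check != check_bits(data, expected_tag):
            raise RuntimeError("tag mismatch or memory error")
        return data

    mem = {}
    store(mem, 0, 0x1234, tag=7)                  # written as an integer, tag 7
    print(hex(load(mem, 0, expected_tag=7)))      # ok: 0x1234
    try:
        load(mem, 0, expected_tag=3)              # wrong tag: detected as an error
    except RuntimeError as e:
        print(e)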


Interconnection Networks I

(d,k) Problem

M.A. Fiol, I. Alegre and J.L.A. Yebra (ETS Ingenieros de Telecomunicación, Spain) considered in their lecture the (d,k) problem for directed graphs: to maximize the number of vertices in a digraph of degree d and diameter k. For any values of d and k, they constructed a graph with a number of vertices larger than (d^2-1)/d^2 times the (non-attainable) Moore bound. In particular, this solves the (d,k) digraph problem for k = 2. They also showed that these graphs can be obtained as line digraph iterations and that this technique provides them with a simple local routing algorithm for the corresponding networks.
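
For reference (standard background rather than material from the lecture), the Moore bound simply counts the vertices reachable from any vertex in at most k steps in a digraph of maximum out-degree d:

    N(d,k) <= 1 + d + d^2 + ... + d^k = (d^(k+1) - 1) / (d - 1),   d > 1.

For k = 2 the bound is d^2 + d + 1; a construction reaching more than (d^2 - 1)/d^2 of it therefore has more than d^2 + d - 1 vertices, i.e. at least d^2 + d, just one short of the unattainable bound, which is why the k = 2 case can be regarded as solved.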

Resource Allocation

E. Opper, M. Malek and G.J. Lipovski (University of Texas at Austin, U.S.A.) studied a resource allocation problem in a reconfigurable multicomputer architecture based on a rectangular CC-banyan multistage interconnection network with arbitrary fanout and arbitrary number of levels. Four commonly used problem structures, ring, pipeline, broadcast and macropipeline, were introduced and the mapping problem of these structures onto the system model, which is comparable to the resource allocation problem, was discussed. Opper, Malek and Lipovski presented analytic solutions to several mapping questions and gave a new fault diagnosis method based on the solution to the problem of mapping the ring structure.

Shuffle-Exchange Networks

F. Sovis (Slovak Academy of Sciences, Czechoslovakia) presented a uniform theory for describing shuffle-exchange type permutation networks - the theory of Ek stages. The use of this new approach was demonstrated by applying it to the flip, omega, and other incomplete permutation networks and to some complete networks, e.g. the Benes network. Sovis dealt in particular with the p-, 2p-, and (2p-1)-stage networks, where p = log2 n, and n is the number of inputs (or outputs) of these networks.

Performance Evaluation of Scientific Computers

Cray-1S Architecture

An analysis of the Cray-1S architecture based on dataflow graphs was presented by V.P. Srini (University of Alabama, U.S.A.) and J.F. Asenjo (City Bank, Puerto Rico). The approach consists of representing the components of a Cray-1S system as the nodes of a dataflow graph and the interconnections between the components as the arcs of the dataflow graph. The elapsed time and the resources used in a component are represented by the attributes of the node corresponding to the component. The resulting dataflow graph model is simulated to obtain timing statistics using as input a control stream that represents the instruction and data stream of the real computer system. Srini and Asenjo analyzed the Cray-1S architecture by conducting several experiments with the model. They observed that the architecture is a well balanced one and that performance improvements are hard to achieve without major changes. Significant improvement in performance was shown when parallel instruction issue is allowed with multiple CIP/LIPs in the architecture.

Pipelined MIMD Computer

A pipelined implementation of MIMD operation is embodied in the HEP computer. This architectural concept should be carefully evaluated now that such a computer is available commercially, H.F. Jordan (University of Colorado, U.S.A.) believes. Jordan studied the degree of utilization of pipelines in the MIMD environment. A detailed analysis of two extreme cases indicates that pipeline utilization is quite high. Although no direct comparisons are made with other computers in Jordan's lecture, the low pipeline idle time in this machine indicates that this architectural technique may be more beneficial in an MIMD machine than in either SISD or SIMD machines.

Sparse Matrix Solving Machine

In analyzing electronic circuits, it is usually necessary to solve simultaneous linear equations with a sparse coefficient matrix. In order to treat these problems effectively, H. Amano, T. Yoshida and H. Aiso (Keio University, Japan) proposed a dedicated parallel machine called the 'Sparse Matrix Solving Machine', (SM)². (SM)² is composed of multiple clusters, each consisting of multiple processing units (PUs) attached to a sophisticated shared memory which enables the PUs to read data from memory simultaneously without conflict. The structure of the cluster is organized so as to make the best use of the neighboring effects involved in the problems. Amano, Yoshida and Aiso discussed the characteristics of the problems to be solved and the behavior of the PUs, showed the optimum configuration for (SM)², and justified its effectiveness through behavior analysis and simulations.

Educational Aspects

Experimental System

A number of educational institutions of the world offer academic programs in Computer Science for undergraduate and graduate students. Many of these programs have used a medium sized to large computer system as a facility for the students. According to R. Kalyana Krishnan, A.K. Rajasekar and C.S. Moghe (Indian Institute of Technology, Madras), there is a need for a system in which the students have full access to the machine in terms of hardware and software details. Most commercial systems do not normally provide this information. In a lecture entitled 'An Experimental System for Computer Science Instruction', Kalyana Krishnan, Rajasekar and Moghe described an experimental system designed and built at I.I.T., Madras, which is used for illustrating several concepts in Machine Organization, Computer Architecture, Operating Systems, Distributed Computing, etc. Present use of the system for Computer Science instruction was identified and proposed applications were indicated.

Data Flow Architectures II

Data Flow Signal Processor

The architecture of the Data Flow Signal Processor (DFSP) was discussed by K. Kronlöf (Helsinki University of Technology, Finland) with the emphasis on its control mechanism. He argued that the data flow principle can be efficiently applied to block processing operations of nonrecursive DSP computations, when shared data structures are avoided. Simulation results involving the optimal operand size and the memory use of the control section were presented. Due to the expandability and convenient programmability of the DFSP architecture, the range of its potential applications extends beyond signal processing, as demonstrated by a DFSP based database machine.

Distributed Data Driven Processor

M. Kishi, H. Yasuhara and Y. Kawamura (OKI Electric Industry Co., Ltd., Japan) described an architecture of a data flow computer named the Distributed Data Driven Processor (DDDP), and presented an experimental system and the results of experiments using several benchmarks. The experimental system has four processing elements connected by a ring bus, and a structured data memory. The main features of their system are that each processing element is provided with a hardware hashing mechanism to implement token coloring, and a ring bus is used to pass tokens concurrently among processing elements. A hardware monitor was used to measure the performance of the experimental system.

Processor Array System

N. Takahashi and M. Amamiya (Nippon Telegraph and Telephone Public Corp., Japan) presented the architecture of a highly parallel processor array system which executes programs by means of a data driven control mechanism. The data driven control mechanism makes it easy to construct a MIMD system, since it unifies interprocessor data transfer and intra-processor execution control. The design philosophy of the data flow processor array system presented in the lecture is to achieve high performance by adapting a system structure to operational characteristics of application programs, and also to attain flexibility through executing instructions based on a data driven mechanism.

Processor and I/O Architectures

Dorado

In late 1975, members of the Xerox Palo Alto Research Center embarked on the specification of a high-performance successor to the Alto personal minicomputer, in use since 1973. After four years, the resulting machine, called the Dorado, was in use within the research community at PARC. K.A. Pier (Xerox Palo Alto Research Center, U.S.A.) began his presentation with an overview of the design goals, architecture, and implementation of the Dorado and then provided a retrospective view and critique of the Dorado project as a whole. The major machine architectural features were evaluated, other project aspects such as design automation and management structures were explained, a chronological history with milestones was included, and a variety of accomplishments, red herrings, and shortfalls was discussed. Pier concluded his lecture with some speculations on what the project might have done differently and what might be done differently today instead of in the late 1970s.

Channel Subsystem

The 370-XA channel-subsystem architecture represents an evolutionary and significant extension of the System/370 channel architecture. R.J. Dugan (IBM Corp., N.Y., U.S.A.) examined the programming-machine interface of the 370-XA channel subsystem and how it was designed to meet the requirements called for by the evolution of IBM's large-scale systems. In particular, emphasis has been placed upon meeting the needs of multiprocessing, maintaining availability, and supporting large I/O configurations while at the same time preserving compatibility for running System/370 channel programs.

Adaptive Interpretation

R.L. Norton and J.A. Abraham (University of Illinois, U.S.A.) concentrated in their lecture on the effect of instruction set architecture on the performance potential of a computer system. These issues, they believe, are key in considerations of what instruction set is most appropriate for the support of high level languages on general purpose machines. Norton and Abraham proposed a method of instruction set interpretation that takes advantage of the architectural features of complex instruction sets. These methods have been simulated executing real programs and, in the case of the VAX instruction set, have resulted in a typical improvement of a factor of two, assuming the same cycle time as the VAX-11/780. The techniques presented exploit the context available in a complex instruction and retain this information for use in subsequent executions of that instruction.

Interconnection Networks II

Switching Strategies

M. Kumar, J.R. Jump (Rice University, U.S.A.) and D.M. Dias (Bell Laboratories, U.S.A.) investigated some methods for improving the performance of Single Stage Shuffle Exchange Networks (SENs) and Multistage Interconnection Networks (MINs). The three new switching strategies proposed use extra buffers to enhance performance. Approximate analysis and simulation results indicate significant improvement in performance for both SENs and MINs. An intuitive method for determining the applicability of the approximate analysis was discussed and some performance measures which should be useful in evaluating the performance of networks were defined.

Distributed Resource Sharing

B.W. Wah (Purdue University, U.S.A.) studied the interconnection of resources to multiprocessors and the distributed scheduling of these resources. Three different classes of interconnection networks have been investigated; namely, single shared bus, multiple shared buses, and networks with logarithmic delays such as the cube and Omega networks. For a given network, the resource mapping problem entails the search for one (or more) of the free resources which can be connected to each requesting processor. To prevent the bottleneck of sequential scheduling, the type(s) and number(s) of resources desired by a processor are given to the network and it is the responsibility of the network to find the necessary resources and connect them to the processor. The addressing mechanism is, thus, distributed in the network. Wah explained that this is a generalization of conventional interconnection networks with routing tags in which all the resources are of different types.

Concurrent Error Detection

Comprehensive VLSI fault models were proposed by W.K. Fuchs, J.A. Abraham and K.-H. Huang (University of Illinois, U.S.A.) for three broad classes of interconnection networks between multiple processors and multiple memory modules. System-level algorithms were given for concurrent detection of errors produced by these faults during the normal use of the networks. The proposed algorithms were shown to be applicable to the three classes of interconnection networks with minimal changes in their classical design. The algorithms are appropriate for the broad classes of permanent and transient faults predominant in dense VLSI and wafer-scale integration, with a minimal amount of network redundancy required for implementation.


Multicomputers & Multiprocessors

Hierarchical Function Distribution

An abstract view of a computer system is provided by a hierarchy of functions, ranging from the high-level operating system functions down to the primitive functions of the hardware. Vertical migration of high-level functions into the microcode of a CPU or horizontal migration of hardware functions out of the CPU into dedicated processors alone is not an adequate realization method for innovative computer architectures with complex functionality, according to W.K. Giloi and P. Behr (Technical University of Berlin). In their lecture, a new design principle called hierarchical function distribution was introduced to cope with the task of designing innovative multicomputer systems with complex functionality. The design rules of hierarchical function distribution were presented, and the advantages of the approach were discussed and illustrated by examples.

Communication Structure

An experimental multiprocessor computer was designed and built by L. Philipson, B. Nilsson and B. Breidegard (University of Lund, Sweden) to explore the feasibility of certain internal communication mechanisms. The system consisted of seven processing elements, each containing a part of the global memory connected to a local bus. For each processor the global memory is seen as one single, linearly addressable structure. The processing elements were all connected to a common, global bus, consisting of three separate busses in order to increase the capacity. A bus selection unit was designed, capable of making a unique bus selection for each request, within a fraction of a memory cycle. The experiments have shown that communication structures based on distributed global memory and global bus systems can be used efficiently for medium scale systems.

Architectural Support for High Level Languages

ALPHA

ALPHA is a dedicated machine designed for high-speed list processing. In their lecture, H. Hayashi, A. Hattori and H. Akimoto (Fujitsu Laboratories Ltd., Japan) described a highly effective stack which can support a value cache and virtual stack, and a high-speed garbage collection algorithm for virtual memory. These new ideas have been studied in ALPHA, which is designed as a back end processor for a large computer under TSS. ALPHA allows TSS users to do list processing at higher speed than a large computer does. Currently UTILISP is operating on ALPHA and runs several times faster than MACLISP on the DEC 2060.

Logic Programs

A logic programming language offers several kinds of parallelism for its execution. Among these, S. Umeyama and K. Tamura (Electrotechnical Laboratory, Japan) concentrated on OR-parallelism, which is an alternative to the backtracking mechanism of a serial interpreter, and proposed an abstract model for OR-parallel interpretation. It consists of tokens and five kinds of function units mutually connected as a process graph. The overall processing is done by the flows of tokens among these units. Umeyama and Tamura also presented a mechanism for token labeling, which makes this process graph reentrant. A simulation result was given to show how efficiently the model works in terms of parallelism.

Concurrent Evaluation

C. Schmittgen and W. Kluge (University of Bonn and Gesellschaft für Mathematik und Datenverarbeitung mbH, F.R.G.) outlined the principles for the concurrent evaluation of applicative programs based on Berkling's reduction language. The recursive style of program design supported by this language lends itself to a recursive partitioning scheme which, for suitable program expressions, dynamically generates a hierarchy of processes for the concurrent evaluation of subexpressions. This hierarchy, according to Schmittgen and Kluge, can elegantly be mapped onto a system of cooperating reduction machines featuring a stack architecture. A special ticket mechanism enforces an upper limit on the number of processes that, at any time, may exist within the system, which does not significantly exceed the number of available machines.

Lisp-Based Data-Driven Machine

A Lisp-based data-driven machine with a novel parallel control mechanism and its performance evaluation were presented by Y. Yamaguchi, K. Toda and T. Yuba (Electrotechnical Laboratory, Japan). The proposed control mechanism is the natural extension of a data-driven scheme to function evaluation and is achieved by a packet communication architecture. First, Yamaguchi, Toda and Yuba described the organization of the data-driven machine and then they showed the results of the simulation studies which confirm the effectiveness of the control mechanism. The performance characteristics of the data-driven machine obtained by the software simulator were also given.

Architectures for Image Processing

Pyramidal Approach

S.L. Tanimoto (University of Washington, U.S.A.) presented the architecture of a parallel computer called a pyramid machine. The system consists of a pyramidal array of processing elements, each of which executes the instructions broadcast by a controller. Each processing element except those on the outside of the array is directly connected to thirteen neighboring elements: eight on the same level, four on the next finer level and one on the next coarser level. The architecture combines features of tree machines and features of mesh-connected parallel computers. As a result, explained Tanimoto, it is able to rapidly perform computations involving local and global processing. The main areas of application are image processing, graphics, and spatial problem solving. Tanimoto discussed the motivation, basic structure, and applications of the system.

Parallel Processor

G. Gaillat (MATRA Espace Produits & Technologies, France) described a parallel MIMD type processor for use in image processing applications on board satellites. Emphasis was given to the application requirements in terms of processing power, type of parallelism and communication need, and to the impact of these requirements on the architecture design. Gaillat presented the choice of a MIMD processor with a ring bus, the convenience of a multiple bus structure, the definition of the bus protocol, the synchronization mechanism and the typical performances as successive choices, and he discussed these in regard to the requirements. Possibilities and limits of the architecture were carefully analyzed. Typical examples of efficiently implementable applications in other fields of image processing were given. In addition, Gaillat pointed out the limits of the structure for other types of parallel processing.

Image Creation

In a lecture entitled 'LINKS-1: A Parallel Pipelined Multimicrocomputer System for Image Creation', H. Nishimura, H. Ohno, T. Kawata, I. Shirakawa and K. Omura (Osaka University, Japan) described a multimicrocomputer system, stressing mainly software and hardware architectures, which has been constructed mainly for image creation. This system is distinctive mainly in that (1) 64 unit computers, each of equal performance, are interconnected with a root computer, such that a number of unit computers constitute a pipelined computer and such pipelined computers work in parallel, all controlled by the root computer, and (2) an intercomputer memory swapping unit is introduced, which is to be linked with a pair of unit computers to transfer a great amount of data at a time from one to the other through the use of a bus exchange switch.

LIPP

LIPP (Linköping Image Parallel Processor), explained T. Ericsson and P.E. Danielsson (Linköping University, Sweden), is a multiprocessor system intended mainly for image analysis and image processing, but also for other computing tasks where large amounts of data must be manipulated in the form of matrices, such as weather forecasts and related problems, namely systems of differential equations. The processors within the processor array are of bit-serial type with the capability of directly processing data with word lengths in the range of 1 bit to 32 bits, in one-bit increments, without time penalty. Bit-serial operation gives the possibility of designing surprisingly fast algorithms. To each processor is associated a fairly large memory (64 Kbit). A processor can instantly reach 8 neighboring memories through an interconnecting network. The processor array, whose size is planned to be 16 by 16, runs in SIMD mode. In this way memory access collisions can be minimized. Image and matrix data are mapped into the memory space so that each memory holds a subimage. Ericsson and Danielsson call this mapping distributed processor topology. Because of the memory mapping and the interconnection network, neighborhood operations such as two-dimensional convolution are easily performed.
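
A small sketch of the distributed mapping described above (the grid size, subimage size and toy image are assumptions made for this example, not LIPP's actual parameters): each processor's memory holds one subimage, and a 3 x 3 neighborhood operation near a subimage border transparently reads from the neighboring processors' memories.

    # Each processor in a 4x4 grid owns one 8x8 subimage in its local memory;
    # a pixel is fetched by locating the owning processor, so a 3x3 operation
    # near a subimage border transparently reads the neighbours' memories.
    GRID, SUB = 4, 8                      # 4x4 processors, 8x8 pixels each
    memories = {(pr, pc): [[(pr * SUB + r) + (pc * SUB + c)      # toy image: value = y + x
                            for c in range(SUB)] for r in range(SUB)]
                for pr in range(GRID) for pc in range(GRID)}

    def pixel(y, x):
        owner = (y // SUB, x // SUB)      # which processor's memory holds it
        return memories[owner][y % SUB][x % SUB]

    def mean3x3(y, x):
        vals = [pixel(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        return sum(vals) / 9.0

    # Pixel (8, 7) sits on a corner between four subimages, so its 3x3
    # neighbourhood spans four different processor memories.
    print(mean3x3(8, 7))                  # 15.0 for this toy image

Since every processor sees the same access pattern shifted to its own subimage, the whole array can execute such a neighborhood operation in SIMD fashion using only neighboring memory references.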

Special Topic: Applied Artificial Intelligence and its Influence on Computer Architecture

New Generation

Four major areas of research are involved in attempting to identify the fifth generation of computers: the investigation of (1) knowledge processing systems, (2) data and demand driven computers, (3) integrating communications and computers and (4) VLSI processor architectures. From these four areas, two approaches to the fifth generation are emerging: one 'revolutionary', a parallel logic machine supporting knowledge processing applications, and the other 'evolutionary', a decentralized control flow system consisting of a network of heterogeneous processors. P.C. Treleaven (University of Newcastle upon Tyne, England) described the above four areas of research and discussed how their computing technologies are converging to produce fifth generation computers. He then contrasted the revolutionary logic machine approach adopted by Japan's Fifth Generation Project, and favored by the artificial intelligence community, with the evolutionary control flow computer approach favored by the data communications and microelectronics communities.


Inference Machine

In a lecture entitled 'Inference Machine: From Sequential to Parallel', S. Uchida (Institute for New Generation Computer Technology, Japan) described the research and development plan for computer architecture in the fifth generation computer system project (FGCS Project), focussing on the research on the inference machine. In the FGCS project, a logic programming language has been chosen as its base language and it is named FGKL: Fifth Generation Kernel Language. The goal of this project, Uchida pointed out, is to develop basic computer technology to build an intelligent computer system and its prototype, which will have an inference function, a knowledge base function and an intelligent interface function.

Relational Data Base Machine

Japan's Fifth Generation Computer System project is divided into three stages. In the first three-year stage, a working relational data base machine is being developed for a software development support system to be used in the second stage and also for an experimental system which provides a research tool for the knowledge base machine. K. Murakami, T. Kakuta, N. Miyazaki, S. Shibayama and H. Yokota (Institute for New Generation Computer Technology, Japan) briefly described the concepts and architecture of the relational data base machine named 'Delta', which is currently under development at ICOT.

Overview to Fifth Generation

According to T. Moto-oka (University of Tokyo, Japan), computers which have high performance for non-numeric data processing should be developed in order to satisfy and expand new applications which will become predominant fields in information processing of the 1990s. Knowledge information processing, forming the main part of applied artificial intelligence, is expected to be one of the important fields of 1990s information processing, and dedicated computers for this have been selected as the main theme of the national Fifth Generation computer project. The key technologies for the FGCS seem to be VLSI architecture, parallel processing such as data flow control, logic programming, knowledge bases based on relational databases, and applied artificial intelligence and pattern processing. Moto-oka discussed how inference machines and relational algebra machines are typical of the core processors which constitute the FGCS.

Critique

In recent years, there have been many attempts to construct multiple-processor computer systems. The majority of these systems are based on von Neumann style uniprocessors. To exploit the parallelism in algorithms, any high performance multiprocessor system must, however, address two very basic issues - the ability to tolerate long latencies for memory requests and the ability to achieve unconstrained, yet synchronized, access to shared data. R.A. Iannucci (M.I.T., U.S.A.) defined these two problems in his lecture and examined the ways in which they are addressed by some of the current and past von Neumann multiprocessor projects. He then proceeded to hypothesize that the problems cannot be solved in a von Neumann context. Iannucci offered the data flow model as one possible alternative, and described his research in this area.

The Proceedings of this 10th Annual International Symposium on Computer Architecture are available from the IEEE Computer Society, P.O. Box 80452, Worldway Postal Center, Los Angeles, Calif. 90080, U.S.A. Computer Society Order No. 473, 1983, x + 438 pages.