Interfaces in Computing, 2 (1984) 111 - 130

SUPPORTING A PARALLEL PROGRAMMING ENVIRONMENT IN A DISTRIBUTED UNIX

ARTHUR I. KARSHMER and ALAN WEHE

Computer Science Department, New Mexico State University, Las Cruces, NM 88003 (U.S.A.)

(Received October 31, 1983)
Summary

A distributed processing system based on UNIX (trademark of Bell Laboratories) is currently operational at New Mexico State University. The system, which is composed of a variety of PDP-11 and LSI-11 processing elements, allows users to schedule tasks to run totally or in parallel on its satellite units. A UNIX-like kernel is run on the satellite processors, and this allows virtually any process to run on any processing element. The architecture of the system is a star configuration with a PDP-11/34a as the central or host node. To support experiments in parallel processing, a parallel version of the C programming language has been developed which allows users to write programs as a collection of functional units which can be automatically scheduled to run on the satellite processors. In this paper the structure of the system is described in terms of hardware and software, and our implementation of pc, a parallel C language, is discussed.
1. Introduction

As a direct result of the decreases in cost and the increases in performance of microprocessors, coupled with the advent of very high speed local area network technologies over the past few years, we have seen a renewed interest in distributed processing system research. The types of system that are called distributed vary greatly in form and function.

At one end of the distributed computing spectrum we find the systems that are connected together over large distances via reasonably low speed communications lines. Probably the best example of this type of distributed system is found in the ARPANET [1] which was designed to interconnect a large number of dissimilar computers for the purpose of program and data sharing. The concept of parallel program execution to decrease computation time was not a design goal of ARPANET or the other long-distance computing networks such as TYMENET, Telenet, CSNET, the ALOHA system [2], WangNet etc.
In terms of a more local topology, we see a variety of distributed computing environments that have been designed primarily to serve similar functions to the geographically large-scale networks, with enhancements to support some strictly local functionality. Examples of the special local functionality found in these networks are various types of "servers" such as mass storage and printing. While the primary function of these networks is not intended to support parallel execution, their raw communication speed, usually in the 10 Mbit s⁻¹ range, makes them interesting tools on which distributed processing research can be carried out. Typical of this communication environment are systems such as Ethernet [3] and its variants such as Ungermann-Bass Net/One, 3Com and a host of IEEE 802.3 systems soon to be announced. Also competing in this market we see local area network schemes based on rings such as the Cambridge Digital Communication Ring [4] and Pronet [5]. These communications media are just starting to find their way into distributed processing research in systems designed to improve computation speed through the distribution of the work load.

Traditionally, research into distributed processing as described above has been based on a wide variety of interconnection schemes which were designed for the particular project [6]. The schemes vary not only in their interconnection topology but also in their physical properties [7]. For example, in HXDP [8] and CM* [9] high speed busses were used to interconnect the various processing elements, while in Micronet [10] interprocessor and interprocess communications were carried on a regular network of communication paths with an adaptive routing scheme. Another system to employ an adaptive routing scheme is the New Mexico State University ring-star system [11, 12] in which both a store and forward ring and a star network were used to provide two independent but overlapping communication paths. Other distributed processing research systems have been based on a variety of other off-the-shelf communication schemes. The COCANET system [13] at the University of California at Berkeley, for example, was based on the local network interface (LNI) developed by the University of California at Irvine [14, 15]. In another high speed ring-based environment, we find the University of York distributed UNIX [16] environment which is built on top of the Cambridge Digital Communication Ring.

The software base on which the various distributed processing systems were built was as varied as their communication techniques. CM* used an operating system designed specifically to augment the hardware design and to support "caching" of operating system functionality [17]. Micronet used a variant of the PASCAL environment at the University of California at San Diego to support its research environment. While UNIX (trademark of Bell Laboratories) was the basis on which the University of York and New Mexico State University systems were built, the two research groups used totally different approaches in the implementation of their systems. At York, the PULSE (personal UNIX-like system environment) was written in ADA, while at New Mexico State University the actual UNIX kernel was modified to run in the various processing elements.
Above the operating system level in distributed processing environments one must examine the software that is implemented to support the concept of parallel program execution. Although it is receiving more and more attention, concurrent programming still suffers from the problem that it has no accepted model or formal foundation [18]. Most models are either too complicated or too simple.

At one end of the software spectrum, we find systems in which the detection of parallelism is done by software; this is certainly a non-trivial process which forces the user to be aware of the detection schemes used by the compiler. Such software schemes allow detection of parallelism in inner loops, which forces the programmer to be knowledgeable about certain constructs such as assignment of scalars, go to commands, conditional statements and operations across elements that might cause problems during parallel execution [19]. The programming language EDISON [20] is this type of environment. When a process reaches a concurrent statement, the current context of the process becomes the initial context of the concurrent processes. One of the problems with this approach is that, when concurrent processes operate on intersecting common variables, the results are unpredictable. Programs must be written with extreme care as this class of error may go undetected at compilation time [21].

The language CFD [19] has a syntax based on the parallelism of the underlying machine: the Illiac IV. The user must decide in which processor certain variables must be manipulated and therefore convert all data structures to match the machine structure. This process places a very large responsibility on the programmer to understand both the nature of the machine and the nature of the problem solution. In contrast with this machine level approach, we find concurrent PASCAL [22] which extends a popular high level language for concurrent execution. Yet another example of high level concurrency can be seen in ADA [23] in which tasks are defined in terms of parallel subprocesses of the main process. Since the run time environment supports pseudo-concurrency, the task appears to be executing in parallel. The concept of rendezvous is used to help the user to solve some of the basic problems inherent in concurrent programming.
Finally, in the area of pseudo-concurrent programming languages we find MODULA-1 [24] and C [25] which use tools shared by other concurrent languages running on a single processor (semaphores, monitors and rendezvous) to simulate a truly concurrent environment.

There have also been implementations of truly parallel programming languages. One such implementation is the Starmod distributed system [26] which uses a collection of processors with fixed communication paths that all run the same kernel. This distribution using identical kernels implies that the job of selecting a processor for a task is somewhat simpler than in a heterogeneous processing environment. In Starmod the MODULA language was modified to create an environment that allows processes to execute independently in various processing units. In the PPL/1 implementation [27] we find a parallel version of the PL/1 language running in a "one-level tree" network as a master-slave relationship. The user is required to specify which processes are to be run in parallel. At run time the slave processors are loaded with the appropriate code and execution begins.

In the current work, which is built on the New Mexico State University distributed UNIX system [11, 12], the programmer is given a great deal of flexibility in assigning processes to processors, with a variety of default options available. In the remainder of this paper, we shall describe the New Mexico State University system in more detail in terms of its underlying hardware and software as well as give a description of our implementation of pc, a parallel C language. Finally, a summary of our current results will be presented, followed by a description of future work to be carried out to enhance our current system.
2. Software structure

The New Mexico State University system [11, 12, 28] is based on a version of the UNIX operating system [29] which has been modified to run in a distributed environment. While UNIX was not originally designed to support distributed processing, it was chosen for our system for a variety of reasons. First, our research group determined that we did not have the personnel to develop our own well-structured operating system from the ground up and, if we did, it would have many of the features already found in UNIX. Second, after examination of other operating systems for the PDP-11 family of computers, it became obvious to us that UNIX would be the most easily modified as it is primarily written in a high level systems programming language, C [25]. Third, the UNIX file system is ideally suited to our needs in that path names refer to devices as well as files, and a path name provides a simple method of describing paths between nodes in a network. Finally, UNIX is modular in nature and lends itself to the assembly of a variety of kernels that we needed to support our set of PDP-11 and LSI-11 processors which vary in memory size and hardware capabilities.

In two versions of the system, a star architecture (Fig. 1) and a ring-star architecture (Fig. 2), each satellite processor executes a UNIX "look-alike" kernel which is less than 5 kbytes in length. At system boot time the PDP-11/34 processor loads each satellite with a kernel which has been built specifically for the individual processing unit. In this way the running kernel has support software for its actual hardware configuration. This feature avoids the loading of superfluous software at a satellite processor.

When a user task needs to use one or more of the satellite processors, it forks a monitor which requests the use of processing elements by entering its process identifier and processor request into a shared data file. The monitor signals the scheduler to assign a communication path for the monitor's request in the shared data file. The monitor uses this communication path to load and execute its code on a satellite processor. When the monitor is finished with the assigned satellite processor, it removes its process identifier from the shared data area and again signals the scheduler.
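The request handshake just described can be pictured with a short sketch. This is purely illustrative and is not the actual NMSU monitor code: the shared file path, the record layout and the choice of signal are our assumptions, standing in for whatever conventions the real monitor and scheduler share.

    /* Illustrative sketch only: a monitor records its request in a shared
       data file and signals the scheduler.  Path, record layout and signal
       number are assumptions, not the NMSU implementation.                 */
    #include <stdio.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct request {
        int pid;          /* monitor's process identifier        */
        int x, y;         /* processor request (see Section 6.1) */
    };

    void request_satellite(int x, int y, pid_t scheduler_pid)
    {
        struct request r;
        FILE *fp = fopen("/usr/lib/pc/requests", "a");  /* assumed shared data file */

        if (fp == NULL)
            return;
        r.pid = getpid();
        r.x = x;
        r.y = y;
        fwrite(&r, sizeof r, 1, fp);
        fclose(fp);
        kill(scheduler_pid, SIGUSR1);   /* wake the scheduler to assign a path */
    }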
Fig. 1. The New Mexico State University star network.
Fig. 2. The New Mexico State University ring star network. (Each satellite has a Z80-based communication processor; all communication lines are 9600 Baud serial lines.)

Processes running on a satellite processor are able to execute virtually any system call that can be executed at the host node. As the satellite processor kernels and hardware cannot support all system calls, the kernel has several options in relation to the disposition of a given call. Those that can be handled locally are satisfied at the satellite processor, while those that are more global in nature are handled by the satellite processor's monitor running at the host node. In this case the satellite processor kernel sends back the necessary information to the host, which then recreates the system call.
Once this call has been executed, the monitor returns the resultant data to its satellite processor.

In an early version of the system all communications were handled by the host node at the center of the star via serial communications lines. Each satellite processor was responsible for its own communications on an interrupt basis to and from the host. While this scheme was satisfactory for a simple master-slave star configuration, it lacked the flexibility of communication and the fault tolerance required in a system of more distributed control. To this end, each satellite processor was fitted with a communication front-end processor which interfaced the satellite processor to two communications networks: the existing star plus a store and forward ring (Fig. 2). In the upgraded system the satellite processors have the ability to communicate directly with other processes or processors on either the ring or the star.

3. Hardware structure

The New Mexico State University distributed UNIX system is built of PDP-11 and LSI-11 processing units. The unit at the center of the star or the ring-star is a PDP-11/34a running version 6.5 UNIX. Among its various peripherals are two 67-Mbyte disk drives, one RK05 disk, two line printers and 32 serial communications lines. Of the 32 lines, most are used for user terminals and the remainder are used to interface the satellite processors to the host node.

The satellite processors in the system range from an old PDP-11/10-based system running mini-UNIX to a PDP-11/23 processor which is itself a UNIX-based multiuser system. The PDP-11/23 system with a 10-Mbyte disk drive was included to support studies in areas such as distributed filing systems and host system migration. The remaining satellite processors are LSI-11- and LSI-11/2-based systems containing a variety of peripheral devices and hardware mathematical support. All satellite processors contain 56 kbytes of memory except for the PDP-11/23 system which contains 256 kbytes.

Figure 2 shows the hardware configuration of the new system currently being completed. It should be noted that the original star configuration still exists with the PDP-11/34a development system as the center node. Each satellite processor in the new system has a second communication path which is totally separate from the star: a ring. In this configuration, each satellite processor may communicate with other processes or processors in the system via the ring or the star.
To facilitate the communication process, each satellite processor is equipped with its own Z80-based ring interface unit (RIU) which handles both ring and star communications for its satellite processor. Currently, each RIU has four communications channels: one to and from its satellite processor, one to and from the host node in the star, one from the predecessor node on the ring and one to the successor node on the ring. All message assembly and error checking is carried out by the RIU.
The communications channels connecting all units are serial RS232 lines running at 9600 Baud. In the next version of the system we plan to replace these channels with a variety of communication media which will run at substantially higher speeds. Currently the communications lines are being replaced by an implementation of the Cambridge Digital Communication Ring [4] which has a raw communication speed of 10 Mbit s⁻¹.
4. The parallel C language: pc

Parallel C, pc, is an extension of the C programming language [25] which was designed and implemented to exploit the features of the New Mexico State University distributed UNIX system. In pc the user has complete control in defining which functions, or groups of functions, are to run on certain types of processing element found in the computational environment. For example, the following pc program demonstrates some of the features of the language.
main()
{
    int a, b, c, d, e, f;

    a = foo(b);
    bar(&d);
    wait;
    c = baz(f);
    resume;
    waitfor(a, c);
}

parallel [1, 0] foo()
{
    main(x)
    int x;
    {
        int k;

        k += foobar(x);
        return(k);
    }

    foobar(m)
    int m;
    {
        int o, p, q;

        return(o + p - m);
    }
}

parallel [1, 1] bar()
{
    int a, b, c;

    main(x)
    int *x;
    {
        *x = a + b + c;
    }
}

parallel [2, 0] baz()
{
    char a[20];
    int i, j, k;

    main(l)
    int l;
    {
        return(i*j/k);
    }
}
The preceding example demonstrates some of the basic features found in the pc language. First, the function "main" which is not included in a block delimited by { and } and preceded by the keyword "parallel" runs on the host processor in the New Mexico State University system. All of the other user-defined sections of code will be scheduled to run on one of the satellite processors or on the host, depending on the compilation parameters enclosed between the [ and the ] immediately following the keyword "parallel". The scheduler uses these parameters as described in Section 6.1.

It should be noted that the code to be executed either on a satellite processor or on the host (if no satellite is available) is structured in such a way as to allow the programmer to think of it as a completely independent C program consisting of any number of functions. Each code section contains one main program, which is called by the main program running on the host processor, and any number of functions that can be invoked locally on the satellite processor. The parameters passed to the satellite resident function are defined as parameters in the definition of the local main routine. The use of main and other user-defined functions being defined for execution on a parallel processor was chosen to give the programmer the feeling that he/she was creating a set of communicating asynchronous processes, each having some sort of independent existence. The pc system actually generates code that can be executed on the host processor as well as on a satellite processor, so that a parallel function can still run when no satellite processors are available.
The final point to be made in our general introduction to pc concerns the new primitives resume, wait and waitfor. These features were added to the language to allow the programmer to continue processing on the host if the results of a parallel function were not yet available, or to wait at the point of invocation for the results to be returned. With the waitfor primitive, the host function is allowed to proceed until a point in execution when certain data are needed and then either to continue or to wait for their return, depending on the availability of the data.

5. The design of parallel C

The parallel C implementation at New Mexico State University was designed as a preprocessor for the C programming language. The preprocessor is written in standard C and consists of four phases: a lexical analyzer phase, a syntactic analyzer phase, a code generation phase and a compilation phase. The final product of the four phases is a set of executable code and data segments that can be run on the host and satellite processors as well as a run time support package that runs on the host.
5.1. Lexical analyzer

The lexical analyzer phase of the pc preprocessor reads the user's source text and then translates it into tokens. A token can be either a major construct or an actual character. All tokens and C constructs are then passed via a UNIX pipe to the syntactic analyzer phase.
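The phase-to-phase hookup is ordinary UNIX plumbing. The sketch below is only a minimal illustration of joining two phases through a pipe, assuming hypothetical program names pclex and pcparse; it is not the actual preprocessor driver.

    /* Minimal sketch: run the lexical analyzer and the syntactic analyzer as
       two processes joined by a UNIX pipe.  The program names "pclex" and
       "pcparse" are illustrative assumptions, not the real pc phase names.  */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd[2];

        if (pipe(fd) == -1) {
            perror("pipe");
            exit(1);
        }
        if (fork() == 0) {
            /* Child: lexical analyzer writes its token stream to the pipe. */
            dup2(fd[1], 1);              /* stdout -> pipe write end */
            close(fd[0]);
            close(fd[1]);
            execlp("pclex", "pclex", (char *)0);
            perror("pclex");
            exit(1);
        }
        /* Parent: syntactic analyzer reads the token stream from the pipe. */
        dup2(fd[0], 0);                  /* stdin <- pipe read end */
        close(fd[0]);
        close(fd[1]);
        execlp("pcparse", "pcparse", (char *)0);
        perror("pcparse");
        return 1;
    }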
5.2. The syntactic analyzer

The parser employed in processing pc programs is not a complete syntax analyzer for the C language but is rather an analyzer for the new constructs supported by parallel C. Other tasks carried out by this phase of the preprocessor include symbol table construction, error recognition and creation of files to receive object code for the various satellite programs.

The symbol table used in the preprocessor was designed as a linked list of parallel calls, with each element being associated with a linked list of formal parameters, actual parameters, assignments and return results. This structure also contains tokens for the parameters and code generation data fields.

The error detection and message generation process is contained in the syntactic analyzer phase of the preprocessor. Errors are reported by line number within a given parallel procedure. Unbalanced { and } in the source pc text may cause line numbers to be reported inaccurately and, in general, error reporting is not elegant as the source code will be passed through the complete C compiler.

During the syntactic analyzer phase, files are created to physically separate the globals, the parallel functions and the main C program which will run on the host processor. These individual files will later be passed to the full C language compiler after certain communications constructs are added which support interprocessor or interprocess communications.
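As a rough picture of the data structure just described, the declarations below sketch one plausible layout: a list of parallel calls, each carrying lists of parameters and results. All type and field names are our own illustration, not the preprocessor's actual declarations.

    /* Illustrative sketch of the preprocessor symbol table: a linked list of
       parallel calls, each with linked lists of formal parameters, actual
       parameters, assignments and return results.  Names are assumptions.   */
    struct pc_param {
        char *name;                 /* identifier text                   */
        int   token;                /* token class recorded by the lexer */
        struct pc_param *next;
    };

    struct pc_call {
        char *func_name;            /* name of the parallel function     */
        int   x, y;                 /* scheduler parameters [x, y]       */
        struct pc_param *formals;   /* formal parameters                 */
        struct pc_param *actuals;   /* actual parameters                 */
        struct pc_param *results;   /* assignments and return results    */
        struct pc_call  *next;      /* next parallel call in the program */
    };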
5.3. Code generation

Code generation takes place during a second pass through the pc source text, during which new C code is merged with the user's pc code. During this phase, all special pc constructs are removed and the interprocess communication code needed to support the distributed UNIX environment is put in place.

Parameter passing from the process running on the host to the parallel process running on the satellite processor is carried out via a pipe and a protocol implemented in the monitor process running on the host and in the kernel running in the remote processor. The protocol [11] was designed to make the communication subnetwork robust and has proved to correct transmission errors on the first try in the vast majority of cases. Unfortunately, the protocol adds a fair amount of overhead to the transmission process, making our 9600 Baud communications lines run at a substantially lower effective speed. The actual size of the parameters is sent via the pipe to the satellite process, allowing it to allocate buffer space for the actual parameters dynamically (a sketch of this size-first framing appears at the end of this section).

5.4. Compilation phase

If no errors are detected during the preprocessing phase, the UNIX C compiler is then automatically invoked on each of the files generated by the previous phase. The output of the compilation phase is a set of executable files that represent the host and satellite programs that comprise the parallel task.

5.5. Run time support

Included with the code that runs on the host processor is linkage to a run time package which handles communications with the various satellite processors, such as queuing and dequeuing pipe descriptors. A call to a parallel procedure causes all parameters to be passed and pipe descriptors to be queued. When a wait or a waitfor is executed, the pipe descriptor for the appropriate parallel function is dequeued and the pipe is then read. All further processing at the host then waits for the read from the pipe to be completed. When the requested data are received at the host, the pipes are closed and host processing continues.

In addition to handling communications, the run time package also makes calls to the process that schedules the satellite processors. As the programmer can specify several options in response to the availability of satellite processors, the run time package has the ability to take several steps, from running the parallel process on the host to aborting the entire run. The options available to the programmer are described in Section 6.1.
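The size-first transfer mentioned in Section 5.3 can be pictured with a few lines of C. This is a hedged sketch of the framing only (a length word followed by the parameter bytes); the real pc protocol adds error detection and retransmission, and the function names here are ours.

    /* Minimal sketch of size-first parameter transfer over a pipe: the host
       writes the length, then the bytes; the satellite side reads the length,
       allocates a buffer of that size, then reads the data.  Error handling
       and the pc protocol's checking/retransmission are omitted.             */
    #include <stdlib.h>
    #include <unistd.h>

    void send_param(int fd, const void *data, int len)
    {
        write(fd, &len, sizeof len);    /* size first ...               */
        write(fd, data, len);           /* ... then the parameter bytes */
    }

    void *recv_param(int fd, int *lenp)
    {
        char *buf;

        read(fd, lenp, sizeof *lenp);   /* learn how much space is needed  */
        buf = malloc(*lenp);            /* allocate the buffer dynamically */
        read(fd, buf, *lenp);           /* then read the parameter bytes   */
        return buf;
    }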
6. pc program structure

A program written in parallel C takes the same general form as any program written in standard C: a main function, any number of callable functions and a number of callable parallel functions. While the number of parallel functions is logically limited to the number of satellite processors available at any given time, the programmer may define as many as he/she requires because, in the absence of parallel resources, the parallel functions can be run on the host processor. An extremely simple pc program takes the following form:

main()
{
    int a;

    a = foo();
    wait;
    printf("%d", a);
}

parallel [1, 0] foo()
{
    main()
    {
        return(1);
    }
}
6.1. Scheduler variables

In the example above, the main program running on the host processor would invoke function "foo" on a satellite processor, which in turn would simply return the value 1, which would then be printed at the host. The keyword "parallel" followed by [x, y] supplies information to the preprocessor on how to generate code for each particular parallel function. The parameters x and y can take on any of the following values.

x = 0  Run the function on a satellite processor only. If the scheduler indicates that no satellite processors are available, abort further processing on the entire run.
x = 1  Run the function on a satellite processor if possible but, if no satellite processors are available, run the function on the host processor.
x = 2  If no satellite processors are currently available, wait until one becomes available and then continue execution of the overall task.
y = 0  Request a satellite processor with any hardware configuration, as no special hardware is needed for function execution. In the New Mexico State University system there is a variety of hardware in the PDP-11 and LSI-11 family. While they all execute the basic PDP-11 instruction set, some of the processors have other special purpose hardware available which can be requested by the user.
y = 1  Request a satellite processor with a floating point unit which is compatible with that on the host (PDP-11/34a). Some of the satellite processors have the floating point chip, which is not software compatible.
y = 2  Request a satellite with a local mass storage device for local file handling. Most of our satellite processors have no local mass storage.
y = 3  Request a satellite processor with both a compatible floating point processor and a local mass storage device.

The identifier which follows the scheduler information is the name of the parallel function. Each parallel function may contain global data definitions and function definitions as found in any standard C program. A parallel definition may not contain any other parallel function definitions.
6.2. Globals and defined values

As in any C program, the user may define macro replacement constants and global variables to be used in functions called by the main routine. For example, the following code would work in the multiprocessor environment in exactly the way that it would perform in the traditional single-processor environment.

#define SIZE 128
int array[SIZE];

main()
{
    int i;

    for (i = 0; i < SIZE; i++)
        array[i] = i;
    printf("array sum is %d", sum(SIZE));
}

parallel [1, 0] sum()
{
    main(upbound)
    int upbound;
    {
        int j, sum = 0;

        for (j = 0; j < upbound; j++)
            sum += array[j];
        return(sum);
    }
}

As in standard C, all parameters are passed by "value" unless the user specifies otherwise, either explicitly or implicitly. In the case of pass by "reference" an entire copy of the data structure is passed to the satellite processor and then copied back to the host processor at the termination of function execution. While this feature does perform properly, it is quite slow when using large data structures, as our current communication subnetwork runs at 9600 Baud. With the addition of a high speed local area computer network, the transmission of large data structures should become a far less costly matter.
6.3. Expressions and pointers

Expressions included in calls to parallel functions are first evaluated and then passed by value to the appropriate function. The transmission of expression values conforms with the type defined in the parallel function definition. Further, "casting" as supported in standard C is also fully supported in parallel C.
Pointers used in function invocation imply the transmission of the corresponding data to the satellite and the use of its address during processing at the remote site. Problems occur when a user specifies the address of an array element to be passed. In the normal C environment there is no problem as the entire array is resident and available to any function. In pc the decision was made to pass the entire array with a pointer to the specific element. In this way the user may still reference in any direction within the array.
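A small constructed example may make this concrete. The fragment below is ours, not from the pc distribution: passing &data[50] causes the whole array to be shipped to the satellite along with a pointer positioned at element 50, so the satellite code may index backwards as well as forwards from that point.

/* Constructed illustration of passing the address of an array element.
   The whole array travels with the pointer, so p[-1] and p[1] are valid
   on the satellite.  Function and variable names are ours.             */
int data[100];

main()
{
    int r;

    r = smooth(&data[50]);   /* whole array sent; pointer marks element 50 */
    wait;
    printf("%d", r);
}

parallel [1, 0] smooth()
{
    main(p)
    int *p;
    {
        return((p[-1] + p[0] + p[1]) / 3);   /* reference in either direction */
    }
}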
6.4. Resume, wait and waitfor

Resume, wait and waitfor are new constructs added to the C language to permit synchronization between host and parallel function execution. In the presence of the resume option, or in the absence of the wait option, the host process is allowed to continue processing with no regard for the current status of the function running in the satellite processor. If the wait option is specified, the host process is put to sleep until satellite execution has finished, at which time the host process is woken and results are returned via the various methods described above.

The waitfor function call may be invoked at any time after a call to a parallel function. The parameter list for the waitfor function contains the name of the variable in the host program that is to receive the value returned by the parallel function. Every parameter in the waitfor list is associated with a pipe descriptor which can be dequeued by the run time package. If the satellite processor or processors are not finished, the host waits for termination of the appropriate processing units and the return of their values via their pipe descriptors.

6.5. Other notes

The scheduler for the New Mexico State University distributed UNIX system runs as a demon which maintains run time tables reflecting the allocation of satellite processing elements. At the end of processing on a satellite unit, the unit is placed in a ready pool and could be allocated to a different host process at any time that it is free. In a future refinement, we expect to allow the user to specify some upper bound on the time for which the scheduler may hold a satellite processor that is not in use.

Structures and arrays passed to parallel functions may not logically include pointers to other data structures. If such data are passed, results will most certainly be incorrect.

As the host processor runs a multiuser environment and the satellite processor runs a dedicated single-process environment, it is not always clear whether the user will gain some benefit from writing code to run on a combination of host and satellite processors. In an earlier study [12], tests were conducted that indicated that most process-bound tasks would generally enjoy a speed-up by running in a satellite processor regardless of its raw speed and arithmetic support.
7. System performance

During a 6 month period, we carried out extensive testing to measure the performance of the distributed UNIX environment using the pc language. The tests measured overall system performance under a variety of user loads and computational tasks. The tests were automatically scheduled using the UNIX cron feature so that they could be run with no operator intervention during different hours of the day. After the various tests had been run, data were recorded which contained information that described the state of the system during the test and the actual results of the particular test.

The test programs fell into three major categories:
(1) computationally bound tasks that carried out large numbers of integer computations;
(2) computationally bound tasks that carried out large numbers of floating point calculations;
(3) input-output tasks that transferred several thousand records of 512 bytes over the communication lines from the host to a satellite and return.

The test environment was the New Mexico State University multiprocessor UNIX system in which all satellite processors were connected to the host via 9600 Baud serial lines. The different hardware configurations tested were
(1) a host processor with both a floating point processor and an integer multiply-divide unit,
(2) a host processor without a floating point processor but with the integer multiply-divide unit,
(3) an LSI-11/23 with a floating point processor and an integer multiply-divide unit,
(4) an LSI-11/23 without the floating point processor but with the integer multiply-divide unit,
(5) LSI-11-based processing systems with no hardware mathematical support,
(6) LSI-11-based processing systems with only an integer multiply-divide unit (or an extended instruction set (EIS)) and
(7) LSI-11-based processing systems with both an EIS and the floating instruction set.

The tests run on the system over the evaluation period gave us data which allowed us to come to several general conclusions concerning the efficiency of the multiprocessor UNIX approach as currently implemented. After running more than a thousand tests on various computational loads and hardware environments, we can draw the following very general conclusions.
(a) Although the raw speed of our communication lines is 9600 Baud, we have found that on the average we can expect a 4800 Baud effective rate on the lines. This reduction in actual speed is a function of several factors including user load, errors and retransmissions and reduction in efficiency due to the overhead imposed by our communication protocol.

(b) Given the speed of the communication subsystem, we feel that carrying out any task that entails more than a small amount of input-output processing on the current system is not productive. Therefore, tasks that require any more than a small amount of disk support from the host processor should not be scheduled to run on a satellite unit. Further, this statement affects all use of satellite processors for any purpose which requires the transmission of large quantities of data on the communication links.

(c) In all cases except one, the run times of our integer and floating point computational tasks on the LSI-11/23 were faster than or equal to those on the host processor. This included the case in which the floating point test task was run on the PDP-11/34 system with a floating point processor compared with the same task on the LSI-11/23 system with no floating point processor. Further, we should note that all comparisons of speed include the transmission time required to load the test task on the satellite processor, to run the test and to send back a result to the host processor.

(d) Floating point tasks run on the LSI-11/2 processors using their somewhat slower version of a floating point unit were still faster than the same task run on the host processor; the times obtained again take into account the transmission time involved in sending the program and data to the satellite processor. When the LSI-11/2 processors with no floating point unit were tested against the host with a floating point unit, the LSI-11/2's performance was better than that of the host when the user load on the host exceeded a single user.

(e) Integer-based computational tasks run on the LSI-11 processors with the EIS are of marginal utility when compared with the same task run on the host processor. The break-even point seems to be when six or more users log onto the host.

(f) Integer-based computational tasks run on the LSI-11 processors without the EIS never seem even to approach the speed of the same task run on the host processor under any user load.

The conclusions drawn above can be easily deduced from the three tables that follow. Each table shows a comparison of computational tasks between various host configurations and a variety of satellite processor configurations. The integer and floating point test programs were both approximately 4 kbytes in length and both returned 9 bytes of data to the scheduling data-recording task run under cron. Thus to read the tables we might ask how an integer task run on the LSI-11/23 processor compared with the same task run on the host processor running four users. Table 1 would tell us that the LSI-11/23 was able to complete the same task in about 55% of the time required by the host. Further, we can see that the LSI-11/2 with the EIS required about 103% of the time required by the host and that an LSI-11/2 without the EIS took almost four times as long to complete the same task.

While Tables 1 - 3 show us how our test programs compared in execution time with the same procedure on the host, it would be useful to be able to predict whether a particular type of task was suitable to run in the multiprocessor environment.
Tables 4 - 6 show strictly a comparison of execution times of the different types of process without program and data transmission times. Using these tables and the simple equation that follows, we are able to ascertain whether it is useful to schedule a task for satellite execution:

    SET = HET * SEF + (PS + DI + DO)/600

where SET is the satellite execution time, HET is the host execution time or the number of seconds required to run the specific task on the host processor with a given number of users sharing the system, SEF is the satellite execution factor or the speed ratio of a particular satellite processor compared with the speed of the host processor under a given user load (these values can be found in Tables 4 - 6), PS is the program size or the size of the executable module measured in bytes, DI is the data in or the number of bytes of data that must be transferred to the procedure to be used as the basis for its computation, measured in bytes, DO is the data out or the number of bytes of data to be returned by the procedure in question and 600 is the number of bytes per second that will be transferred on the communications channels of the system (a raw speed of 9600 Baud and an effective speed of 4800 Baud).

TABLE 1
Execution factors (including transit times) for the host processor (PDP-11/34a) versus satellite processors (integer programs)

Number of users    Execution factor
                   LSI-11/23    LSI-11/2 with EIS    LSI-11/2 with no EIS
0                  1.00         1.66                 7.04
1                  0.88         1.52                 6.05
2                  0.77         1.34                 5.38
3                  0.60         1.14                 4.18
4                  0.55         1.03                 3.80
5                  0.52         1.02                 3.76
6                  0.45         0.84                 3.07
7                  0.37         0.73                 2.66
Weighted average   0.51         1.16                 4.45

TABLE 2
Execution factors (including transit times) for the host processor (PDP-11/34a) with a floating point unit (FPU) versus satellite processors

Number of users    Execution factor
                   LSI-11/23     LSI-11/23      LSI-11/2      LSI-11/2
                   with FPU      with no FPU    with FPU      with no FPU
0                  0.56          1.64           0.88          2.88
1                  0.25          0.67           0.39          1.19
2                  0.12          0.30           0.18          0.54
3                  0.10          0.27           0.19          0.50
4                  0.08          0.22           0.13          0.38
5                  0.08          0.21           0.13          0.36
6                  0.07          0.19           0.12          0.33
Weighted average   0.10          0.31           0.16          0.50
TABLE 3
Execution factors (including transit times) for the host processor (PDP-11/34a) without a floating point unit (FPU) versus satellite processors

Number of users    Execution factor
                   LSI-11/23     LSI-11/23      LSI-11/2      LSI-11/2
                   with FPU      with no FPU    with FPU      with no FPU
0                  0.12          0.35           0.20          0.62
1                  0.10          0.27           0.16          0.48
2                  0.06          0.19           0.12          0.33
3                  0.06          0.18           0.12          0.32
4                  0.05          0.15           0.09          0.26
5                  0.05          0.14           0.09          0.26
6                  0.05          0.13           0.09          0.21
Weighted average   0.07          0.19           0.10          0.33

TABLE 4
Satellite execution factors (not including transit times) for the host processor (PDP-11/34a) versus satellite processors (integer programs)

Number of users    Satellite execution factor
                   LSI-11/23    LSI-11/2 with EIS    LSI-11/2 with no EIS
0                  0.94         1.60                 6.96
1                  0.81         1.37                 6.06
2                  0.72         1.22                 5.32
3                  0.56         0.95                 4.14
4                  0.50         0.86                 3.75
5                  0.50         0.85                 3.72
6                  0.41         0.70                 3.03
7                  0.36         0.60                 2.63
Weighted average   0.54         0.92                 4.08
To apply this formula to a sample case, let us assume that we have run a floating point job on the host processor equipped with a floating point unit and would like to determine whether it would be better to run it on an LSI-11/23 processor with no floating point unit. Let us assume that the statistics for the job on the host processor are four users on the host processor, a 60 s run time on the host, a program size of 5022 bytes, a data-in size of 1208 bytes and a data-out size of 200 bytes. Our formula is then

    SET = 60 * 0.16 + (5022 + 1208 + 200)/600 = 20.32 s
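For convenience, the same computation can be written as a small C helper; the function and parameter names below are ours and not part of the pc system.

    /* Scheduling estimate from Section 7: satellite execution time.
       het  - host execution time (s) under the current user load
       sef  - satellite execution factor from Tables 4 - 6
       ps   - program size in bytes
       di   - bytes of data sent to the procedure
       dout - bytes of data returned by the procedure
       600 bytes/s is the effective rate of the 9600 Baud lines.      */
    double satellite_execution_time(double het, double sef,
                                    long ps, long di, long dout)
    {
        return het * sef + (double)(ps + di + dout) / 600.0;
    }

    /* Example from the text: satellite_execution_time(60.0, 0.16, 5022, 1208, 200)
       gives about 20.3 s.                                             */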
TABLE 5
Satellite execution factors (not including transit times) for the host processor (PDP-11/34a) with a floating point unit (FPU) versus satellite processors

Number of users    Satellite execution factor
                   LSI-11/23     LSI-11/23      LSI-11/2      LSI-11/2
                   with FPU      with no FPU    with FPU      with no FPU
0                  0.29          1.32           0.56          2.56
1                  0.12          0.54           0.23          1.05
2                  0.05          0.24           0.11          0.47
3                  0.05          0.23           0.10          0.44
4                  0.04          0.16           0.07          0.32
5                  0.03          0.16           0.07          0.31
6                  0.03          0.14           0.06          0.27
Weighted average   0.05          0.23           0.09          0.45

TABLE 6
Satellite execution factors (not including transit times) for the host processor (PDP-11/34a) without a floating point unit (FPU) versus satellite processors

Number of users    Satellite execution factor
                   LSI-11/23     LSI-11/23      LSI-11/2      LSI-11/2
                   with FPU      with no FPU    with FPU      with no FPU
0                  0.07          0.28           0.12          0.55
1                  0.05          0.22           0.09          0.42
2                  0.03          0.15           0.06          0.28
3                  0.03          0.14           0.06          0.27
4                  0.02          0.11           0.05          0.22
5                  0.02          0.11           0.05          0.21
6                  0.02          0.10           0.04          0.19
Weighted average   0.03          0.14           0.06          0.28
It seems clear in this case that execution on an LSI-11/23 either with or without a floating point unit would be faster than on the host, and, in fact, on the average it would be wiser to schedule such a task on any available satellite processor than on the host. However, if the user load were only one, and the only processor available was an LSI-11/2 with no floating point unit, it would be better to run the task on the host processor. The choice of particular satellite processing units and their scheduling is completely in the hands of the programmer, as described in the previous sections of this paper.
8. Summarizing remarks and future work

The system described in the current paper is now operational at New Mexico State University. It is being used as a test bed for further studies in the area of distributed processing and concurrent programming. There are many enhancements yet to be made to the system if we expect to classify it as a truly distributed processing environment. To this end, work will be carried out to install a less-centralized control strategy such as that used in the contract net protocol [30].

The greatest weakness in the current system is the underlying communications network which runs at 9.6 kBaud. This problem is currently being corrected with the installation of a Cambridge Ring (10 Mbit s⁻¹) and an Ethernet (10 Mbit s⁻¹) which will connect the various processors via a high speed interconnection. Further, as both of these communication systems work in an environment that needs no central control, many of the problems that led us into a centrally controlled architecture will disappear. With communications running at these higher speeds, file manipulation on a satellite processor will also become feasible.
Acknowledgment

The work reported herein was supported in part by the Rome Air Development Center Headquarters, Air Force Scientific Center, under Contract F49620-82-C-0023.
References

1 L. G. Roberts and B. D. Wessler, Computer network development to achieve resource sharing, Proc. AFIPS Spring Joint Computer Conf., 1970, Vol. 36, American Federation of Information Processing Societies, Montvale, NJ, 1970, pp. 513 - 519.
2 N. Abramson, The ALOHA system, Proc. AFIPS Fall Joint Computer Conf., 1970, Vol. 37, American Federation of Information Processing Societies, Montvale, NJ, 1970, pp. 281 - 285.
3 R. M. Metcalfe and D. R. Boggs, Ethernet: distributed packet switching for local computer networks, Commun. ACM, 19 (7) (1976) 395 - 404.
4 M. V. Wilkes and D. J. Wheeler, The Cambridge Digital Communication Ring, Proc. Local Area Computing Network Symp., Mitre Corporation, Bedford, MA, 1979, pp. 47 - 61.
5 H. C. Salwen, In praise of ring architecture for local area networks, Comput. Des., 22 (3) (1983) 183 - 192.
6 G. A. Anderson and E. D. Jensen, Computer interconnection structures: taxonomy, characteristics, and examples, Comput. Surveys, 7 (4) (1975) 197 - 213.
7 K. J. Thurber and H. A. Freeman, Architectural considerations for local computer networks, Proc. 1st Int. Conf. on Distributed Computing Systems, October 1979, IEEE, New York, 1979, pp. 131 - 142.
8 E. D. Jensen, The Honeywell experimental distributed processor: an overview, IEEE Trans. Comput., 11 (1) (1978) 28 - 38.
9 R. J. Swan, S. H. Fuller and D. P. Siewiorek, CM* -- a modular multi-microprocessor, Proc. AFIPS National Computer Conf., 1977, Vol. 46, American Federation of Information Processing Societies, Montvale, NJ, 1977, pp. 637 - 644.
10 L. D. Wittie and A. M. van Tilborg, MICROS, a distributed operating system for Micronet, a reconfigurable network computer, IEEE Trans. Comput., 29 (12) (1980) 1123 - 1144.
11 A. I. Karshmer, D. J. DePree and J. Phelan, The New Mexico State University ring star system: a distributed UNIX environment, Software -- Pract. Exper., to be published.
12 A. I. Karshmer, J. Phelan, B. Kempton and D. J. DePree, The New Mexico State University distributed UNIX system: evaluation and extension, Proc. 16th Hawaii Int. Conf. on Systems Science, Honolulu, 1983, Western Periodicals, North Hollywood, CA, 1983, pp. 225 - 233.
13 L. A. Rowe and K. B. Birman, A local network based on the UNIX operating system, IEEE Trans. Software Eng., 8 (2) (1982) 137 - 145.
14 P. Mockapetris, M. Lyle and D. Farber, On the design of local network interfaces, Proc. 1977 IFIP Congr., August 1977, Vol. 77, Elsevier, New York, 1977, pp. 427 - 430.
15 D. J. Farber, A ring network, Datamation, 21 (2) (1975) 44 - 46.
16 A. J. Wellings, I. C. Wand and G. M. Tomlinson, Distributed UNIX project 1981, Rep. YCS.47, 1981 (Computer Science Department, University of York, Heslington, York).
17 J. K. Ousterhout, D. A. Scelza and S. S. Pardeep, MEDUSA: an experiment in distributed operating system structure, Commun. ACM, 23 (2) (1980) 92 - 105.
18 B. M. Broy, On language constructs for concurrent programming, Proc. Conf. on Analysing Problem Classes and Programming for Parallel Computing, June 1981, in Lect. Notes Comput. Sci., 111 (1981) 141 - 153.
19 R. H. Perrott, Language design approaches for parallel processors, Proc. Conf. on Analysing Problem Classes and Programming for Parallel Computing, June 1981, in Lect. Notes Comput. Sci., 111 (1981) 115 - 126.
20 P. Brinch Hansen, The programming language concurrent PASCAL, IEEE Trans. Software Eng., 1 (2) (1975) 199 - 206.
21 P. Brinch Hansen, The design of EDISON, Software -- Pract. Exper., 7 (April 1981) 363 - 396.
22 P. Brinch Hansen, EDISON: a multiprocessor language, Software -- Pract. Exper., 7 (April 1981) 325 - 362.
23 V. A. Downes and S. J. Goldsack, Programming Embedded Systems with ADA, Prentice-Hall, Englewood Cliffs, NJ, 1982, pp. 71 - 162.
24 J. McCormack and R. Gleaves, MODULA-2: a worthy successor to PASCAL, Byte, 8 (April 1983) 385 - 395.
25 B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, Englewood Cliffs, NJ, 1978.
26 R. P. Cook, *MOD -- a language for distributed programming, Proc. 1st Annu. Conf. on Distributed Computing Systems, IEEE, New York, 1979, pp. 233 - 241.
27 Y. Wallach, Alternating sequential/parallel processing, Lect. Notes Comput. Sci., 127 (1982) 1 - 102.
28 A. B. Barak, A. Shapir, G. Steinberg and A. I. Karshmer, A modular distributed UNIX, Proc. 14th Annu. Hawaii Int. Conf. on Systems Science, Honolulu, January 1981, Western Periodicals, North Hollywood, CA, 1981.
29 D. M. Ritchie and K. Thompson, The UNIX time sharing system, Commun. ACM, 17 (7) (1974) 365 - 375.
30 R. G. Smith, The contract net protocol: high level communication and control in a distributed problem solver, Proc. 1st Int. Conf. on Distributed Computing Systems, IEEE, New York, 1979, pp. 185 - 192.