An efficient routing strategy to support process migration

Microprocessing and Microprogramming 32 (1991) 153-160
North-Holland

F. Delaplace, J.L. Giavitto

Université de Paris-Nord, CSP - av. Jean-Baptiste Clément, 93430 Villetaneuse
LRI-UA 410, Bât. 490 - Université de Paris-Sud, 91405 Orsay Cedex

This paper presents an original method of process migration based upon routing tables. The originality of the mechanism is the ability to determine the shortest path between the emitter and the receiver of a message without owning a complete description of the current location of the receiver. Due to this property our mechanism is well fitted to process migration and constitutes an efficient background to load-balancing mechanisms.

1 Introduction

M.I.M.D architectures can support thousands of processes [GER 90]. One of the main problems in managing concurrent processes on such architectures is the spreading of the activity over the entire network. This problem is known as the load-balancing problem. Static load-balancing consists in a static mapping of the processes on the processors; it implies a static process management (e.g. no dynamic creation of processes). Dynamic load-balancing is achieved through process migration [Smith 88]: a process is moved, during its execution, onto a less-loaded processor to spread the computation load. This mechanism avoids the hot-spot phenomenon and increases the effective parallelism. However it implies a communication management overhead to route messages to flying processes. This paper is devoted to the presentation of a routing strategy well fitted to support process migration.

Although process migration is clearly devoted to load-balancing, migrating processes are useful in other fields [Powell 83]:

- Process migration gives the ability to evacuate processes from a site that is breaking down (when the failure event can be foreseen).

- Linking processes and resources is communication consuming. Gathering resources and processes in one site by migrating processes constitutes an interesting solution to reduce message exchanges over the network.

- Heterogeneous computer networks contain several upgrades and miscellaneous execution environments. Process migration is then viewed as an extension of RPC procedures [Theimer 85]. Process execution can also be started at one site owning a particular environment and continued in a remote site, to pick up other environment features when required by the execution context.

- [Ming 90]: allocation and deallocation of sub-networks induce a network fragmentation similar to the fragmentation problem in segmented memories. The process migration mechanism is also used to implement a network garbage collector.

- [Vautherin 88]: due to the capability of getting autonomous code (as opposed to static code location in processor memory), (explicit and programmer managed) process migration induces a new programming style that supports the programming of distributed applications.

The first part of this paper presents the main problems of process migration and the standard solutions. The second part is dedicated to the description of a routing strategy based on a concept of region. Regions widely restrict the updates needed to route messages to migrating processes. Consequently the method reduces the migration overhead and [...] with the scale of the network. The segmentation of hypercubes in regions is given. Then we discuss the implementation of the method on a network of Transputers using Occam.

2 Process migration

Dynamic load-balancing, and the subsequent process migrations, is under the responsibility of the execution level and must be transparent to the programmer. So three problems occur on process migration: moving data space and code space, ensuring message arrival, and avoiding the breaking of dependence links between a process and its resources (channels, external devices, ...). The migration mechanisms induce overheads in space and time: they are due to the time wasted to transfer a process from one site to another and to the amount of data flowing across the network to realize this transfer. Thus a key point in process migration is to reduce the pending time and to minimize the amount of exchanged messages.

2.1 Moving data and code

The easiest method to transfer data and code has been developed for the DEMOS/MP machine [Powell 83]. It merely consists in stopping the process and sending all data space and code space to a new site. Then the process is reloaded and the execution continues from the breaking point.

An alternative method developed in the V-system [Theimer 85] consists in sending a succession of bulks called pre-copies to the new site during the execution of the process. The process is stopped when the exchanged space reaches a minimal state, and then the last modifications are sent to the new site. In comparison to the previous mechanism, this method tends to reduce the pending time but does not reduce the message overhead.

In order to reduce the message overhead due to data and code migration, a migrating system based on general shared memory can be implemented on a distributed memory [Zayas 87]. Pages of a process are embodied by an imaginary segment. These imaginary segments are managed by the network manager: it maintains a table which describes where pages are located and is able to peek faulty pages. Migration is then reduced to core migration.

2.2 Message routing

Whatever method is used to move a process, the problem is then to ensure that messages can reach their receiver. The message equity problem is extended to an environment where processes are flying from node to node. This constraint is generally handled through an additional routing layer based on some standard message routing techniques.

A natural solution is to forward messages as in DEMOS/MP [Fowler 86]. Messages are sent to the site where the process was born and then re-sent to the next location of the process until the message reaches the receiver. An improvement was made in LOCUS and SPRITE [Douglis 87], where a direct logical link is made between the first site of the process and the current site. Further migration will cause the update of this link by sending the new address to the first
process location. Despite their easy implementation, those methods have several drawbacks:

- They definitively increase the path length between emitter and receiver.

- Chains of forwarders contribute to making the entire system brittle. A failure of a site may imply the failure of a computation even if the site is no longer involved (although it was involved in the past).

- Finally, they do not constitute an efficient mechanism to solve the load-balancing problem because message forwarding pollutes the network and induces additional load.

Another way to solve the message equity problem relies on the use of routing tables. These tables maintain the mapping process-reference/process-location. Each migration is followed by an update of those tables. For example, in the V-system a set of processes named a logical host is located on a processor named the physical host. The link between a logical host and a physical host is kept in a table. Each migration of a logical host invalidates this link. The update is performed in a lazy way by broadcasting a logical host location request.

[Ravi 88] defines a distributed routing tables system. Information concerning process location is distributed among sets of processors named sets of acquaintances. A set of acquaintances holds the whole location information of all the processes in a system. When a message should be sent to a process, the kernel interrogates its acquaintances. This method, which limits the size of routing tables, wastes time to get information on process location and to update tables.

The routing tables method is more complex than forwarders but has interesting properties:

- The redundancy of process/processor information makes it more failure tolerant.

- Message flows are directly sent to the appropriate target processor. So it minimizes the network load.

The main disadvantage of this method is the cost of updates. In order to limit this cost we have developed a new method based on a hierarchical partition of the network.

3 Process migration with regions

The routing strategy uses partial information on process location whilst preserving an optimal routing from emitter to receiver. Informally, the idea is to split the network into sub-networks called regions. An affectation process/region is valid until the process leaves the region. So, when the process migrates internally to a region, the other regions do not have to be warned. This reduces the amount of updates that must be done.

A message sent to a process is, in fact, sent to a region. Each region is aware of the processes it contains but not exactly where they are. In addition, regions are structured in sub-regions, and a region also knows in which sub-region the target process will be. This implies the update of the region information only when a migration between two sub-regions takes place. A message is sent from region to sub-region until it reaches the final destination. This mechanism alone does not ensure that the message follows a shortest path between an emitter and a receiver. This constraint on the message path is ensured structurally by the partition made over the network: the network is divided with respect to the ability to elect a site located on a shortest path inside a region.

More precisely, we represent a network as an oriented graph G(V, E). E is a set of edges and V a set of vertices. E represents the physical links and V the processors. Let [x..u..y]_G be a path from "x" to "y" crossing the "u" vertex. Min(x,y) is the set of shortest paths between the "x" and "y" vertices. A region C is a subgraph of G satisfying the following property:

$$\forall x \in V_G - V_C \;\Rightarrow\; \exists u \in V_C \text{ such that } \forall y \in V_C,\ [x..u..y]_G \in Min(x,y).$$
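As an illustration of the property above, the following Python sketch checks it on small networks. It is not taken from the paper: the function names, the adjacency-dictionary representation and the two test networks are assumptions made for the example, and links are assumed to be bidirectional. For each vertex x outside a candidate region, it lists the vertices u of the region that lie on a shortest path from x to every vertex y of the region, using the fact that u is on a shortest path exactly when d(x,u) + d(u,y) = d(x,y).

```python
from collections import deque


def distances(graph, start):
    """Breadth-first search: hop count from start to every reachable vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def gates(graph, region):
    """For each vertex x outside region, list the vertices u of region such
    that a shortest path from x to every y of region crosses u."""
    dist = {v: distances(graph, v) for v in graph}
    found = {}
    for x in graph:
        if x in region:
            continue
        found[x] = sorted(
            u for u in region
            if all(dist[x][u] + dist[u][y] == dist[x][y] for y in region)
        )
    return found


def is_region(graph, region):
    """True when every outside vertex has at least one such vertex u."""
    return all(gates(graph, region).values())


# Hypothetical test networks: a 3-cube and a ring of 6 sites.
cube = {v: [v ^ (1 << i) for i in range(3)] for v in range(8)}
ring = {v: [(v - 1) % 6, (v + 1) % 6] for v in range(6)}
```

On the 3-cube, the pair of sites {010, 011} satisfies the property, whereas two opposite sites of the ring do not, since no single site of the pair lies on the shortest paths towards both.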

The vertex "u" is called a gate of the region C. It means that for every pair of nodes (x, y), with x in the region and y outside the region, there exists a shortest path that includes u.

A hierarchical segmentation of the graph is obtained with a set of nested regions. The biggest region consists of the entire network and the smallest regions are the individual nodes themselves. The region hierarchy is represented by a tree where leaves are sites of the network and non-terminal nodes denote regions. The root of the tree corresponds to the region containing the whole network.

Routing a message is now viewed as finding a path between two leaves. This problem is equivalent to the determination of a succession of gates that a message should cross to enter the final region where the receiver is. In each gate a table is consulted to determine in which immediate subregion (sub-tree) the process is located. The lack of information means that the process still stands in the expected subregion (the subregion whose address is contained in the message's header). Then the message is sent to the gate of the selected subregion.

Divisions into regions of a 4x4 grid, a ring and a bus are represented in figure 1. Regions are figured by large boxes surrounding sites. The black boxes represent the emitters of a message. Gray boxes are gates of brother sub-regions. Below the networks are figured the associated trees.

Figure 1: Divisions of a grid, a ring and a bus into regions

4 Segmentation of hypercubes

The determination of a gate and the network segmentation are specific to each hardware topology. We give here, as a practical example, the treatment of a hypercube.

In the case of a hypercube, the segmentation is based on the network node address. Each node is uniquely identified by a binary number of length n for an n-cube. The identifier of node S is denoted by @S. We use the following conventions: A = {0,1} is the set of the digit values; A^n is the set of words over A of length n; and if a is a word, a_i is its ith digit.

The greedy routing strategy on a hypercube consists in moving a message from one site to another, following the dimensions in a fixed order. The bitwise XOR of the sender and the receiver addresses gives which links should be crossed [Saad 87].
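The greedy strategy can be made concrete with a minimal Python sketch. The function name `greedy_route` and the choice of visiting dimensions from the lowest bit to the highest are assumptions made for the example; the paper only requires that the order be fixed.

```python
def greedy_route(source, target, n):
    """Return the list of sites visited from source to target in an n-cube,
    correcting the differing address bits in a fixed order (lowest first)."""
    path = [source]
    current = source
    diff = source ^ target          # bits set where the addresses differ
    for i in range(n):
        if diff & (1 << i):         # the link along dimension i must be crossed
            current ^= 1 << i
            path.append(current)
    return path
```

Each set bit of the XOR costs exactly one hop, so the number of links crossed equals the Hamming distance between the two addresses, which is the shortest possible in a hypercube.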

In an n-cube, sites can be considered as points of an n-dimensional space. A subspace is defined by two chains which parameterize an equation. The first chain, called the Mask, defines which dimensions are selected. The second chain, named the Identifier, holds the fixed values of points in these dimensions:

$$R(Id_R, L_R) = \{\, v \in V \mid Id_R = v\ \mathrm{and}\ L_R \,\}$$

A segmentation of the network consists in a set of hierarchized subspaces. As masks represent dimensions of the spaces, the hierarchy of regions is entirely defined through the hierarchy of masks. The set of masks "M" is ordered from the least definite region (e.g. the whole space) to the most definite region (e.g. the smallest space): for example M = {000, 100, 110, 111} in figure 2. To fix the ideas, we take M = [...] (each set pM, where p is a permutation, is also an eligible set of regions).

Figure 2: example of a 3-cube segmentation

The problem of the gate determination is solved by finding a point located at the intersection of a line passing through the emitter and the space representing the region of the receiver. So some coordinates of the gate are the same as those of the emitter, as given by the line equation. The other coordinates are the same as those of the receiver, for the dimensions [...] subspace of the receiver point. The identifier of a gate p of a region R(Id_R, L) is defined as follows: @p_i = Id_R(i) for L(i) = 1, and @p_i = @a_i for the other dimensions. The definition of the gate ensures the optimality of the routing. The routing algorithm consists in the determination of 1) the region where the receiver is and 2) the gate of this region.

Region determination

When a message has to be sent from a node a to a process, the emitter consults its routing tables to determine a (possibly wrong) address. This address, b, is not the actual address of the process but represents enough information to determine the region where the receiver is and the emitter is not. This determination is equivalent to determining the highest subspace separating both sites. A move cannot be made inside the expected dimensions. R is determined by finding the highest matching mask with a bitwise XOR:

$$L = \mathrm{Min}\,\{\, l \in M \mid ((@a\ \mathrm{xor}\ @b)\ \mathrm{and}\ \mathrm{not}\ l) \neq (@a\ \mathrm{xor}\ @b) \,\}$$

Because of our choice of M, masks can be considered as numbers, and the higher this number is, the smaller the corresponding region. The identifier of the region R is given by:

$$Id_R = L\ \mathrm{and}\ @b$$

Gate determination

The gate is determined by a matching with the mask L of the region, the emitter identifier and the region identifier:

$$@p = (\mathrm{not}\ L\ \mathrm{and}\ @a)\ \mathrm{or}\ (L\ \mathrm{and}\ @b)$$
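The region and gate determinations can be tried on the 3-cube of figure 2 with a short Python sketch. It is only an illustration under stated assumptions: the names `MASKS`, `region_mask`, `region_identifier` and `gate` are hypothetical, addresses are plain integers, and the mask set is the example M = {000, 100, 110, 111} quoted in the text. The test `diff & mask` is equivalent to checking that clearing the masked bits of the XOR changes it.

```python
# Masks of figure 2, ordered from the whole space to a single site.
MASKS = [0b000, 0b100, 0b110, 0b111]


def region_mask(a, b, masks=MASKS):
    """Smallest mask (largest region) on which emitter a and address b differ."""
    diff = a ^ b
    for mask in masks:
        if diff & mask:             # a and b disagree on a selected dimension
            return mask
    return masks[-1]                # a == b: the region is the site itself


def region_identifier(mask, b):
    """Identifier of the region: the bits of b on the selected dimensions."""
    return mask & b


def gate(a, b, masks=MASKS):
    """Receiver's bits on the masked dimensions, emitter's bits elsewhere."""
    mask = region_mask(a, b, masks)
    return (~mask & a) | (mask & b)
```

For an emitter at 000 and a recorded address 011, the separating mask is 110, the region identifier is 010 and the gate is 010, which lies on a shortest path from 000 to 011.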

Here follows the algorithm in an Occam style. Addresses of nodes are integers.

    PROC compute.gate (VAL INT source, target, INT gate)
      VAL INT max IS 4:                      -- number of masks
      VAL []INT seg IS [0, 4, 6, 7]:         -- masks 000, 100, 110, 111
      BOOL continue:
      INT level.max, dep:
      SEQ
        -- assumes source <> target, so the last mask always stops the loop
        dep := source >< target              -- dimensions on which the sites differ
        level.max := 0
        continue := TRUE
        WHILE continue
          SEQ
            continue := level.max < max
            IF
              continue
                -- go on while the masked dimensions of both sites agree
                continue := dep = ((BITNOT seg[level.max]) /\ dep)
              TRUE
                SKIP
            level.max := level.max + 1
        level.max := level.max - 1
        gate := ((BITNOT seg[level.max]) /\ source) \/ (seg[level.max] /\ target)
    :

5 Performance comparison

The evaluation of our algorithm is based on the communication overhead induced by a process migration. For a hypercube topology, the minimal, maximal and average costs are:

$$C_{min} = 2, \qquad C_{max} = [\ldots], \qquad C_{av} = [\ldots]$$

This must be compared with the [Ravi & Jefferson 88] strategy inducing a maximal cost of 2 n. Figure 3 represents a qualitative evaluation of the increase of the communication load due to the migration of m processes when m/2 processes have already migrated. We assume a fixed communication flow. The figure does not take into account the travel time of a message, as it is equal for all strategies.

Forwarding methods persistently degrade the network performance. The routing table strategy suffers from a hot spot effect; however the contention decreases after process migrations. Our algorithm achieves the same behavior and at the same time minimizes the hot spot effect.

Figure 3: network load evolution

6 Implementation of the mechanism of migration

The implementation has been carried out on a Transputer based network with a hypercube topology [Delaplace 89]. Figure 4 gives an overview of the software architecture. Processes are symbolised by boxes and oriented channels are figured by arrows.

Figure 4: General Occam view of the Routing System

Each process performs a specific task which contributes to realizing the routing. Communications between unidirectional channels are realized through
a producer-consumer system. So the processes act as independently as possible. In order to keep the validity of data during transactions, the processes which manage tables, i.e. Man.Routing and Man.Process, require an exclusive dialogue. This constraint induces an exclusive direct dialogue. Functional descriptions of each process and of the structures are given below.

Man.Routing  : Routing table manager
Man.Process  : Process table manager
Ext.Comm.In  : Receive messages and direct them to Reception or Ext.Comm.Out
Ext.Comm.Out : Send messages to the next door
Int.Comm.In  : Receive internal messages and send them to Reception
Int.Comm.Out : Deliver messages to user process
Reception    : Compute the next gate and direct messages either on Ext.Comm.Out or Int.Comm.Out

6.2 Structures

MESSAGE  = HEADER + CONTAINS
HEADER   = [pid.source, pid.target, gate, target.site, messages.length]
CONTAINS = [character1, ..., charactern]

When a message reaches its current gate, identified by HEADER.gate, the address of the next gate is computed. This address depends on the current gate and on the receiver site. Once a consultation is made to determine whether the receiver site has changed, the Reception process is able to compute the next gate address. In order to harmonize the two sources of messages (Ext.Comm.In and Int.Comm.In), newly injected messages own the current site address as gate address and the born site address of the receiver as target site. A new computation of these addresses is performed in the Reception service.

The routing system is closely linked to the way the updates are processed. As the virtual routing system is based on hierarchy, updates are also based on hierarchy. So update messages are propagated from father to children, and each [...] node is a gate of a region where the gates of internal sub-regions must be updated. The update procedure is stopped as soon as messages reach an atomic region (a site is equal to a region).

Our algorithm was implemented and validated on a small 3-cube of Transputers with a total binary segmentation. The transmission of the code was done following the scheme of [Powell 83]. The Occam system support does not allow the migration of processes. Thus we have developed a minimal kernel to implement the migration. This kernel consists of a loader (for the dynamic loading of code and data), a monitor (to manage the various phases of a migration) and a memory allocator (used by the loader). Details can be found in [Delaplace 89]. The migration of a process consists of 6 steps (see figure 5).

1) The node a requests a migration to the node b.
2) The "high loader" of the node b checks that there is enough available memory space. The node b confirms the message.
3) The acknowledgement is transmitted to the "low-level task" in charge of stopping, encapsulating and transmitting the process. These tasks are very intricate with the Transputer process management and must be carefully performed (see [Delaplace 89] for details).
4) The process is transmitted.
5) At the end of the transmission, cleaning actions are performed by the "low-level task" (e.g. freeing the process memory).
6) In the receiving site, the "high loader" arranges the kernel data structures so that the new process is taken into account. [...] messages are forwarded.
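Returning to the message layout of section 6.2, the following Python sketch shows one way to represent it and to apply the injection rule for new messages. It is a hedged illustration, not the authors' Occam data layout: the class names, the use of integers for process identifiers and site addresses, and the helper `inject` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Header:
    pid_source: int        # emitter process identifier
    pid_target: int        # receiver process identifier
    gate: int              # address of the gate currently aimed at
    target_site: int       # site where the receiver is believed to be
    messages_length: int   # number of characters in the body


@dataclass
class Message:
    header: Header
    contains: str          # the characters of the message


def inject(pid_source, pid_target, current_site, born_site, text):
    """Build a newly injected message: the gate is the current site and the
    target site is the site where the receiver was born."""
    header = Header(pid_source, pid_target, current_site, born_site, len(text))
    return Message(header, text)
```

With this convention a message entering through Int.Comm.In looks the same as one arriving through Ext.Comm.In, so the Reception process can recompute the gate and target site in a single place.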

Figure 5: process migration

7 Conclusion

The routing performance is a critical point in parallel architectures. The region based migration achieves an effective routing with a cost comparable to the existing methods. Moreover, information redundancy in the several routing tables makes the algorithm more fault-tolerant than other strategies (the forwarding strategy for example). Our algorithm can be seen as a generalization of the [Ravi & Jefferson 88] method. That method can be seen as a region based migration where there exist only two levels of region: the region corresponding to a single node and the entire network. An interesting generalization of our work consists in making several segmentations of the same network cohabit; in this case, the migration process selects the region that minimizes the update operations. For making this possible, the segmentation process must at the same time determine the accurate gates.

Acknowledgments

The authors wish to thank Françoise Baude and Stéphane Boucheron for many helpful discussions. We are also indebted to Dr D Etiemble and Dr J-P Sansonnet, C. Germain, F. Cappello and J-L Béchennec for their help, suggestions and support.

References

[Delaplace 89] Delaplace, "Migration de Processus", Rapport de D.E.A, Université de Paris-Nord, 1989
[Douglis 87] Douglis, Ousterhout, "Process Migration in the Sprite Operating System", Distributed Computing, Computer Society Press, pp 18-24, 1987
[GER 90] C. Germain, J-L Béchennec, D. Etiemble, J-P. Sansonnet, "An Interconnection Network and a Routing Scheme for a Massively Parallel Message-Passing Multicomputer", Frontiers 90 conference on Massively Parallel Computation, October 8-10, College Park, MD
[Fowler 86] R.J. Fowler, "The Complexity of Using Forwarding Addresses for Decentralized Object Finding", Proc A.C.M Symposium on Principles of Distributed Computation, Calgary, Canada, August 1986
[Ming 90] Ming-Syan Chen, Kang G. Shin, "Subcube Allocation and Task Migration in Hypercube Multiprocessors", IEEE Transactions on Computers, Vol 39, N°9, pp 1146-1155, September 1990
[Maguire 88] Maguire, Smith, "Process Migration: Effects on Scientific Computation", A.C.M Sigplan, pp 102-106, March 1988
[Occam 88] C.A.R Hoare, OCCAM 2 Reference Manual, International Series in Computer Science, Prentice Hall
[Powell 83] Powell, Miller, "Process Migration in Demos/MP", Proc 9th Operating System Principles, pp 110-119, October 1983
[Ravi & Jefferson 88] Ravi, Jefferson, "A Basic Protocol in Migrating Processes", International Conference on Parallel Processing, pp 188-197, August 1988
[Saad 87] Y. Saad, M.H. Schultz, "Topological Properties of Hypercubes", IEEE Transactions on Computers, Vol 37, N°7, pp 867-871, July 1988
[Smith 88] Smith, "A Survey of Process Migration Mechanisms", Operating Systems Review, A.C.M Sigops, pp 28-40, July 1988
[Tanenbaum 81] Tanenbaum, Computer Networks, Prentice Hall Inc, 1981
[Theimer 85] Theimer, "Preemptable Remote Execution Facilities for the V-System", Proc of the 10th Operating System Principles, pp 2-12, December 1985
[Vautherin 88] Vautherin, Millet, "Dynamic Creation of Processes on Transputer Networks", Rapport de Recherche du L.R.I N°453, Université de Paris-Sud Orsay, July 1988
[Walker 83] Walker & Al, "The Locus Distributed Operating System", Proc of the 9th Symposium on Operating System Principles, 1983
[Zayas 87] Zayas, "Attacking the Process Migration Bottleneck", Proc of the 11th Symposium on Operating System Principles, pp 15-24, November 1987