Microprocessing and Microprogramming 32 (1991) 153-160 North-Holland
An Efficient Routing Strategy to Support Process Migration
F. Delaplace, J.L. Giavitto
Université de Paris-Nord, CSP, av. Jean-Baptiste Clément, 93430 Villetaneuse
LRI-UA 410, Bât. 490, Université de Paris-Sud, 91405 Orsay Cedex
This paper presents an original method of process migration based upon routing tables. The originality of the mechanism is its ability to determine the shortest path between the emitter and the receiver of a message without owning a complete description of the current location of the receiver. Due to this property, our mechanism is well fitted to process migration and constitutes an efficient background for load-balancing mechanisms.

1 Introduction

M.I.M.D architectures can support thousands of processes [GER 90]. One of the main problems in managing concurrent processes on such architectures is the spreading of the activity over the entire network. This problem is known as the load-balancing problem. Static load-balancing consists in a static mapping of the processes onto the processors; it implies a static process management (e.g. no dynamic creation of processes). Dynamic load-balancing is achieved through process migration [Smith 88]: a process is moved, during its execution, to a less-loaded processor in order to spread the computation load. This mechanism avoids hot-spot phenomena and increases the effective parallelism. However, it implies a communication management overhead to route messages to flying processes. This paper is devoted to the presentation of a routing strategy well fitted to support process migration.

Although process migration is clearly devoted to load-balancing, migrating processes are useful in other fields [Powell 83]:
- Process migration gives the ability to evacuate processes from a site that is breaking down (when the failure event can be detected).
- Linking processes and resources consumes communications. Gathering resources and processes on one site by migrating processes constitutes an interesting solution to reduce message exchanges on the network.
- Heterogeneous computer networks contain several upgrades and miscellaneous execution environments. Process migration is then viewed as an extension of RPC procedures [Theimer 85]. Process execution can also be started at one site owning a particular environment and continued at a remote site, to pick up other environment features when required by the execution context.
- [Ming 90]: allocation and deallocation of sub-networks induce a network fragmentation similar to the fragmentation problem in paginated memories. The process migration mechanism is also used to implement a network garbage collector.
- [Vautherin 88]: due to the capability of getting autonomous code (as opposed to static code location in processor memory), (explicit and programmer-managed) process migration induces a new programming style that supports the programming of distributed applications.

The first part of this paper presents the main problems of process migration and the standard solutions. The second part is dedicated to the description of a routing strategy based on a concept of region. Regions widely restrict the updates needed to route messages to migrating processes. Consequently the method reduces the migration overhead and [...] with the scale of the network. The segmentation of hypercubes in regions is given. Then we discuss the implementation of the method on a network of Transputers using Occam.

2 Process Migration

Dynamic load-balancing, and the subsequent process migrations, is under the responsibility of the execution level and must be transparent to the programmer. So three problems occur in process migration: moving the data space and the code space, ensuring message arrival, and avoiding the breaking of dependence links between a process and its resources (channels, external devices...). The migration mechanisms induce overheads in space and time: they are due to the time wasted to transfer a process from one site to another and to the amount of data flowing across the network to realize this transfer. Thus, a key point in process migration is to reduce the pending time and to minimize the amount of exchanged messages.

2.1 Moving Data and Code

The easiest method to transfer data and code has been developed for the DEMOS/MP machine [Powell 83]. It merely consists in stopping the process and sending all its data space and code space to a new site. Then the process is reloaded and the execution continues from the breaking point.

An alternative method developed in the V-system [Theimer 85] consists in sending a succession of bulks called pre-copies to the new site during the execution of the process. The process is stopped when the exchanged space reaches a minimal state, and then the last modifications are sent to the new site. In comparison to the previous mechanism, this method tends to reduce the pending time but does not reduce the message overhead.

In order to reduce the message overhead due to data and core migration, a migrating system based on general shared memory can be implemented on a distributed memory [Zayas 87]. Pages of a process are embodied by an imaginary segment. This imaginary segment is managed by the network manager: it maintains a table which describes where pages are located and is able to fetch faulty pages. Migration is then reduced to core migration.

2.2 Message routing

Whatever method is used to move a process, the problem is then to ensure that messages can reach their receiver. The message equity problem is extended to an environment where processes are flying from node to node. This constraint is generally handled through an additional routing layer based on some standard message routing techniques.

A natural solution is to forward messages as in DEMOS/MP [Fowler 86]. Messages are sent to the site where the process was born and then re-sent to the next location of the process until the message reaches the receiver. An improvement was made in LOCUS and SPRITE [Douglis 87], where a direct logical link is made between the first site of the process and the current site. Another migration will cause the update of this link by sending the new address to the first
process location. Despite their easy implementation, those methods have several drawbacks:
- They definitively increase the path length between emitter and receiver.
- Chains of forwarders contribute to making the entire system brittle. A failure of a site may imply the failure of a computation even if the site is not involved (although it was involved in the past).
- Finally, they do not constitute an efficient mechanism to solve the load-balancing problem, because message forwarding pollutes the network and induces additional load.

Another way to solve the message equity problem relies on the use of routing tables. These tables maintain the mapping process-reference/process-location. Each migration is followed by an update of those tables. For example, in the V-system a set of processes named a logical host is located on a processor named the physical host. The link between a logical host and a physical host is kept in a table. Each migration of a logical host invalidates this link. The update is performed in a lazy way by broadcasting a logical host location request.

[Ravi & Jefferson 88] defines a distributed routing table system. Information concerning process location is distributed among sets of processors named sets of acquaintances. A set of acquaintances holds the whole location information of all the processes in a system. When a message should be sent to a process, the kernel interrogates its acquaintances. This method, which limits the size of the routing tables, wastes time to get information on process location and to update tables.

The routing table method is more complex than forwarding but has interesting properties:
- The redundancy of process/processor information makes it more failure tolerant.
- Message flows are sent directly to the appropriate target processor, so it minimizes the network load.

The main disadvantage of this method is the cost of updates. In order to limit this cost we have developed a new method based on a hierarchical partition of the network.

3 Process migration with regions

The routing strategy uses partial information on process location whilst preserving an optimal routing from emitter to receiver. Informally, the idea is to split the network into sub-networks called regions. A process/region assignment is valid until the process leaves the region. So, when the process migrates inside a region, the other regions do not have to be warned. This reduces the amount of updates that must be done.

A message sent to a process is, in fact, sent to a region. Each region is aware of the processes it contains but not of exactly where they are. In addition, regions are structured in sub-regions, and a region also knows in which sub-region the target process is. This implies the update of the region information only when a migration between two sub-regions takes place. A message is sent from region to sub-region until it reaches the final destination. This mechanism alone does not ensure that the message follows a shortest path between an emitter and a receiver. This constraint on the message path is ensured structurally by the partition made over the network: the network is divided with respect to the ability to elect, inside a region, a site located on a shortest path.

More precisely, we represent a network as an oriented graph G(V,E). E is a set of edges and V a set of vertices. E represents the physical links and V the processors. Let [x..u..y]_G be a path from "x" to "y" crossing the "u" vertex. Min(x,y) is the set of shortest paths between the "x" and "y" vertices. A region C is a subgraph of G satisfying the following property:

$$\forall x \in V_G - V_C,\ \exists u \in V_C \text{ such that } \forall y \in V_C,\ [x..u..y]_G \in Min(x,y)$$
The vertex "u" is called a gate of the region C. It means that for every node x outside the region there is a gate u such that, for every node y in the region, there exists a shortest path from x to y that includes u.

A hierarchical segmentation of the graph is obtained with a set of nested regions. The biggest region consists of the entire network and the smallest regions are the individual nodes themselves. The region hierarchy is represented by a tree where leaves are sites of the network and non-terminal nodes denote regions. The root of the tree corresponds to the region containing the whole network.

Routing a message is now viewed as finding a path between two leaves. This problem is equivalent to the determination of a succession of gates that a message should cross to enter the final region where the receiver is. In each gate, a table is consulted to determine in which immediate sub-region (sub-tree) the process is located. The lack of information means that the process still stands in the expected sub-region (the sub-region whose address is contained in the message's header). Then the message is sent to the gate of the selected sub-region.

Divisions into regions of a 4x4 grid, a ring and a bus are represented in figure 1. Regions are figured by large boxes surrounding sites. The black boxes represent the emitters of a message. Gray boxes are gates of brother sub-regions. Below the networks are figured the associated trees.

Figure 1: Divisions of a grid, a ring and a bus into regions

4 Hypercube segmentation

The determination of a gate and the network segmentation are specific to each hardware topology. We give here, as a practical example, the treatment of a hypercube.

In the case of a hypercube, the segmentation is based on the network node address. Each node is uniquely identified by a binary number of length n for an n-cube. The identifier of node S is denoted by @S. We use the following conventions: A = {0,1} is the set of digit values; A^n is the set of words over A of length n, and if a is a word, a_i is its ith digit.

The greedy routing strategy on a hypercube consists in moving a message from one site to another, following the dimensions in a fixed order. The bitwise XOR of the sender and receiver addresses gives the links that should be crossed [Saad 87].
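As an illustration of the greedy strategy, the following Python sketch (not part of the original implementation, which was written in Occam) corrects the differing address bits one dimension at a time. The function name greedy_route and the choice of visiting dimensions from the least significant bit upwards are assumptions made for the example; any fixed order would do.

```python
def greedy_route(source: int, target: int, n: int) -> list[int]:
    """Return the node addresses visited from source to target in an n-cube."""
    path = [source]
    current = source
    diff = source ^ target            # bits set where the two addresses differ
    for dim in range(n):              # fixed order: dimension 0 first (assumption)
        bit = 1 << dim
        if diff & bit:                # this link must be crossed
            current ^= bit
            path.append(current)
    return path


print(greedy_route(0b000, 0b110, 3))  # [0, 2, 6]
```

The number of hops equals the number of bits set in the XOR of the two addresses, which is the length of a shortest path in a hypercube.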
Figure 2: example of a 3-cube segmentation

In an n-cube, sites can be considered as points of an n-dimensional space. A subspace is defined by two chains which parameterize an equation. The first chain, called the Mask, defines which dimensions are selected. The second chain, named the Identifier, holds the fixed values of the points in these dimensions:

$$R(Id_R, L_R) = \{ v \in V \mid Id_R = v \text{ and } L_R \}$$

A segmentation of the network consists of a set of hierarchized subspaces. As masks represent dimensions of the spaces, the hierarchy of regions is entirely defined through the hierarchy of masks. The set of masks "M" is ordered from the least definite region (e.g. the whole space) to the most definite region (e.g. the smallest space): for example M = {000, 100, 110, 111} in figure 2. To fix the ideas, we take M to be the set of masks made of i ones followed by n-i zeros, for i from 0 to n (each set pM, where p is a permutation, is also an eligible set of regions).

The problem of gate determination is solved by finding a point located at the intersection of a line passing through the emitter and the space representing the region of the receiver. So some coordinates of the gate are the same as those of the emitter, as given by the line equation. The other coordinates are the same as those of the receiver, for the dimensions fixed by the subspace of the receiver point. The identifier of a gate p of a region R(Id_R, L) is defined as follows: @p_i = Id_R(i) for L(i) = 1 and @p_i = @a_i for the other dimensions. This definition of the gate ensures the optimality of the routing.

The routing algorithm consists in the determination of 1) the region where the receiver is and 2) the gate of this region.

Region determination

When a message has to be sent from a node a to a process, the emitter consults its routing tables to determine a (possibly wrong) address. This address, b, is not necessarily the actual address of the process, but it represents enough information to determine the first region where the receiver is and the emitter is not. This determination is equivalent to determining the highest subspace separating both sites. A move cannot be operated inside the expected dimensions. R is expressed as the determination of the highest mask matching successfully with a bitwise XOR:

$$L = Min \{ l \in M \mid ((@a \text{ xor } @b) \text{ and not } l) \neq (@a \text{ xor } @b) \}$$

Because of our choice of M, masks can be considered as numbers, and the higher this number is, the smaller the corresponding region is. The identifier of the region R is given by: Id_R = L and @b.

Gate determination

The gate is determined by a matching with the mask L of the region, the emitter identifier and the region identifier: @p = (not L and @a) or (L and @b).
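The two determinations above can be checked on a small cube with the following Python sketch. It is only an illustration under stated assumptions: the helper names (make_masks, region_mask, compute_gate, hamming) are hypothetical, the mask set is assumed to be the one of figure 2 generalized to n dimensions, and the case a = b (not discussed in the text) is assumed to select the smallest region.

```python
def make_masks(n: int) -> list[int]:
    """Masks made of i ones followed by n-i zeros, for i = 0..n (assumed choice of M)."""
    full = (1 << n) - 1
    return [full & ~((1 << (n - i)) - 1) for i in range(n + 1)]


def region_mask(a: int, b: int, masks: list[int]) -> int:
    """Smallest mask L such that ((a xor b) and not L) differs from (a xor b)."""
    dep = a ^ b
    for mask in masks:                # masks are sorted in increasing order
        if (dep & ~mask) != dep:      # a and b differ on a dimension selected by the mask
            return mask
    return masks[-1]                  # a == b: assume the region is the node itself


def compute_gate(a: int, b: int, masks: list[int]) -> int:
    """Gate @p = (not L and @a) or (L and @b)."""
    mask = region_mask(a, b, masks)
    return (~mask & a) | (mask & b)


def hamming(x: int, y: int) -> int:
    """Length of a shortest path between two nodes of a hypercube."""
    return bin(x ^ y).count("1")


masks = make_masks(3)                 # [0b000, 0b100, 0b110, 0b111]
gate = compute_gate(0b001, 0b110, masks)
print(format(gate, "03b"))            # 101
```

In this example the gate 101 belongs to the region of identifier 100 (mask 100), and going from 001 to 110 through 101 takes three hops, which is the distance between the two nodes.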
Here follows the algorithm in an Occam style. Addresses of nodes are integers.

PROC compute.gate (VAL INT source, target, INT gate)
  BOOL continue:
  INT level.max, dep:
  VAL INT max IS 4:
  VAL [4]INT seg IS [0, 4, 6, 7]:
  SEQ
    -- dimensions on which source and target differ
    dep := source >< target
    level.max := 0
    continue := TRUE
    WHILE continue
      SEQ
        -- stop at the last mask so that seg[level.max] stays in bounds
        continue := level.max < (max - 1)
        IF
          continue
            -- go on while source and target agree on the dimensions of the mask
            continue := dep = ((BITNOT seg[level.max]) /\ dep)
          TRUE
            SKIP
        level.max := level.max + 1
    level.max := level.max - 1
    gate := ((BITNOT seg[level.max]) /\ source) \/ (seg[level.max] /\ target)
:

5 Performance comparison

The evaluation of our algorithm is based on the communication overhead induced by a process migration. For a hypercube topology, the minimal, maximal and average costs are:

$$C_{min} = 2, \quad C_{max} = 2^n, \quad C_{av} = [...]$$

This must be compared with the [Ravi & Jefferson 88] strategy, which induces a maximal cost of 2^n. Figure 3 represents a qualitative evaluation of the increase of the communication load due to the migration of m processes when m/2 processes have already migrated. We assume a fixed communication flow. The figure does not take into account the travel time of a message, as it is equal for all strategies.

Forwarding methods persistently degrade the network performance. The routing table strategy suffers from a hot-spot effect; however, the contention decreases after process migrations. Our algorithm achieves the same behavior and at the same time minimizes the hot-spot effect.

Figure 3: network load evolution

6 Implementation of the Mechanism of Migration

The implementation has been carried out on a Transputer-based network with a hypercube topology [Delaplace 89]. Figure 4 gives an overview of the software architecture. Processes are symbolised by boxes and oriented channels are figured by arrows.

Figure 4: General overview of the Routing System

Each process performs a specific task which contributes to realizing the routing. Communications between unidirectional channels are realized through
Rout/ng s ~ t ~ , to supp¢~ p~c~s r n / g ~ n a producer-consumer system. So. ~ e
proce.~ act
as independent as possible. In order to keep val~l[~y of data during transactions, proce.~ which manage table ~.~ M~n.Rouffng and M~n.~ocess r e ~ i r e a n exclusive dialogue.ThH
constraint induces a n
Th~ rouzh~gs y ~
15~ ~ incL~lly
l ~ : e d to d~e way
of the u ~ t ~ are p r ~ . As virtual r~m~ing s y ~ m is based on h~-~rc~y, u p d ~ e s ~ aL~o o n hfi:rarchy. So update messages are p t o ~ e d from father to c h ~ ' e n and each ~
node is a
exclusive d i r ~ dialogue. Functk~nal descriptions of
gate of a region where gate of ~ e m a | s u b - r e ~ o ~
each p r o c e ~ and ~n~ture~ are deac~ibed below.
m u ~ be u p . t e d . The o p t e d
procedure is ~ c ~ e d
as soon as m e . ~ a ~ leach an a~om~c region (a ~ e is equal to a re~on).
Routing table manager
Man.Routing
:
Man.Proce~
: Proce~ table manager
Ext.Comm,ln
: Recep~ messages and
direct them
to Reeep~a~ or Ex~.Comm.Om Eat.Comm.Out
: ,Send ~ g e s
to the neat door
Im.Comm.In
: Reeel~ internal messages and
lnt.Comm.Out
~¢nd them to Reception : Deliver messages to user process
Reception
: Compute the next gate and direct messages either on ~ t C . ~ r ~
or
Our a i ~ h m
was implememed and walk,areal o n a small 3-O~be of Transputer w ~ h a total b l n a ~ segmentation. The f f a n s ~ t i o n
of the code was
done fo|lowing the scheme of lPowell 83L The Oeeam system s u p p o ~ does not allow the migration of processes. Thus we have developed a minimal kernel to implemem the migration. This kernel consL~s in a loader (for the dimamk: loading of codes and data), a monkor (to manage the various phase of a migratkm) and a m e m o ~ allocator (used
6~2 ~mc~m'es
by the IoadeO. D e , ill can b e found in [Delaplace in 6 steps
891. The migration of a prneess c o n s ~ MESSAGE ~ HEADER+ CONTAINS:
(see f ' ~ r e 5).
HEADER - [
D The node a reque~ a migration to the no4e b. The "high loade~ of the node b che< ks ,',.hat there
pid.seurce, p~d.targel, gate,
target.sRe, messages.|eng~h ! CONTAINS - [chamc~erl, ..., charactern] When a message reaches its current gate k~emified by HEADER.gate, the add~e~ of the next gate is then computed. This edd~ess depends on the cu~ent gate and O~ r~zceiversite. Once a consultation is made to determine if the receiver site has not changed, the Reception p r o c e ~ is able to compute the neat gate address. In order to harmonize the two ,sources of messages
(Eat.Comm.in and lm.Comm.in), newly injected m ~ a g e s own the current size address as gate address ~,nd the born site addre..~ of the receiver as targe~ site. A new computation of these addresses is performed in ~ Reeep~ion Service.
is ~:nough availaHe memory space. The node b confirm ~ e m ~ a ~ 3) The acknowledgement is t r a ~ m ~ e d to the ~lowlevel task" in charge to ~op, encapaulate and transmit the proce~. These tasks are very h~fica~ with the Transputer p t o c e ~ managemem a n d m u ~ be carefully performed Csee [Delaplace 89] for details). The proce~ is wanam~ed. ~) ,~t t v a n ~ i s ~ end cleaning actions are pe~fonned by the "low ~ (e.g. freeing the
process memo~). In r~e receiving site, the "high k~tdar" awange hemel data ~ c ~ u r e so that the new process is taken into accou~.
me~gea
a ~ forwa~ed).
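The message layout of section 6.2 and the consultation performed by the Reception process can be sketched in Python as follows. This is a minimal illustration, not the Occam implementation: the Header class, the next_gate and reception functions, the representation of the process table as a dictionary from process identifier to site, and the returned channel names are all assumptions made for the example. The masks are those of the 3-cube of figure 2.

```python
from dataclasses import dataclass

MASKS = [0b000, 0b100, 0b110, 0b111]    # 3-cube segmentation of figure 2


@dataclass
class Header:
    pid_source: int
    pid_target: int
    gate: int           # gate the message is currently heading to
    target_site: int    # site where the receiver is believed to be
    length: int         # length of the message contents


def next_gate(current: int, target: int) -> int:
    """Gate of the first region containing target but not current."""
    dep = current ^ target
    for mask in MASKS:
        if dep & mask:                   # the two sites differ inside this mask
            return (~mask & current) | (mask & target)
    return target                        # current == target


def reception(header: Header, here: int, process_table: dict[int, int]) -> str:
    """Handle a message arrived at site `here`; return the output process to use."""
    # A table entry overrides the target site of the header; the lack of an
    # entry means the process is still where the header expects it (assumption).
    header.target_site = process_table.get(header.pid_target, header.target_site)
    if header.target_site == here:
        return "Int.Comm.Out"            # deliver to the local user process
    header.gate = next_gate(here, header.target_site)
    return "Ext.Comm.Out"                # send towards the next gate


# A message injected at site 001 for process 7, born at site 110.
h = Header(pid_source=1, pid_target=7, gate=0b001, target_site=0b110, length=0)
print(reception(h, 0b001, {}), format(h.gate, "03b"))   # Ext.Comm.Out 101
```

In this sketch a gate that knows the process has moved simply rewrites the target site before computing the next gate, so the emitter never needs the exact location of the receiver.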
Figure 5: process migration

7 Conclusion

Routing performance is a critical point in parallel architectures. The region-based migration achieves an effective routing with a cost comparable to the existing methods. Moreover, the information redundancy in the several routing tables makes the algorithm more fault-tolerant than other strategies (the forwarding strategy, for example). Our algorithm can be seen as a generalization of the [Ravi & Jefferson 88] method. That method can be seen as a region-based migration where there exist only two levels of regions: the region corresponding to a single node and the entire network. An interesting generalization of our work consists in making several segmentations of the same network cohabit; in this case, the migration process selects the region that minimizes the update operations. To make this possible, the segmentation process must at the same time determine the accurate gates.

Acknowledgments

The authors wish to thank Françoise Baude and Stéphane Boucheron for many helpful discussions. We are also indebted to Dr D. Etiemble and Dr J-P. Sansonnet, C. Germain, F. Cappello and J-L. Béchennec for their help, suggestions and support.

References

[Delaplace 89] Delaplace, "Migration de Processus", Rapport de D.E.A, Université de Paris-Nord, 1989.
[Douglis 87] Douglis, Ousterhout, "Process Migration in the Sprite Operating System", Distributed Computing, Computer Society Press, pp 18-24, 1987.
[GER 90] C. Germain, J-L. Béchennec, D. Etiemble, J-P. Sansonnet, "An Interconnection Network and a Routing Scheme for a Massively Parallel Message-Passing Multicomputer", Frontiers 90 conference on Massively Parallel Computation, October 8-10, College Park, MD.
[Fowler 86] R.J. Fowler, "The Complexity of Using Forwarding Addresses for Decentralized Object Finding", Proc A.C.M Symposium on Principles of Distributed Computation, Calgary, Canada, August 1986.
[Ming 90] Ming-Syan Chen, Kang G. Shin, "Subcube Allocation and Task Migration in Hypercube Multiprocessors", IEEE Transactions on Computers, Vol 39, No 9, pp 1146-1155, September 1990.
[Maguire 88] Maguire, Smith, "Process Migration: Effects on Scientific Computation", A.C.M Sigplan, pp 102-106, March 1988.
[Occam 88] C.A.R Hoare, OCCAM 2 Reference Manual, International Series in Computer Science, Prentice Hall.
[Powell 83] Powell, Miller, "Process Migration in Demos/MP", Proc 9th Operating System Principles, pp 110-119, October 1983.
[Ravi & Jefferson 88] Ravi, Jefferson, "A Basic Protocol in Migrating Processes", International Conference on Parallel Processing, pp 188-197, August 1988.
[Saad 87] Y. Saad, M.H. Schultz, "Topological Properties of Hypercubes", IEEE Transactions on Computers, Vol 37, No 7, pp 867-871, July 1988.
[Smith 88] Smith, "A Survey of Process Migration Mechanisms", Operating Systems Review, A.C.M Sigops, pp 28-40, July 1988.
[Tanenbaum 81] Tanenbaum, Computer Networks, Prentice Hall Inc, 1981.
[Theimer 85] Theimer, "Preemptable Remote Execution Facilities for the V-System", Proc of the 10th Operating System Principles, pp 2-12, December 1985.
[Vautherin 88] Vautherin, Millet, "Dynamic Creation of Processes on Transputer Networks", Rapport de Recherche du L.R.I N°453, Université de Paris-Sud Orsay, July 1988.
[Walker 83] Walker & Al, "The LOCUS Distributed Operating System", Proc of the 9th Symposium on Operating System Principles, 1983.
[Zayas 87] Zayas, "Attacking the Process Migration Bottleneck", Proc of the 11th Symposium on Operating System Principles, pp 15-24, November 1987.