Copyright © IFAC 11th Triennial World Congress, Tallinn, Estonia, USSR, 1990
A PROPOSED DESIGN APPROACH FOR A MARTIAN SPACESHIP ELECTRONIC BRAIN
B. Goncharov and G. Mozharov
Soviet Space Flight Center, 141070 Kaliningrad, Moscow Region, USSR
Abstract. The recently announced idea of a multinational manned Martian expedition must be reliably based on a high level of autonomy. The basic element of autonomy is an onboard fault-tolerant computer system (FTCS). This paper discusses a new, highly efficient dense packaging concept for an FTCS which emphasizes avionics maintainability and allows the crew to repair this most critical spaceship system during a deep space mission without soldering or welding operations. The architecture developed within the research is called the Wafer-Scale Integration Dice Element System (W-SIDE System). It is oriented toward easy-to-repair applications and can be tailored to achieve the specified reliability. The Martian spaceship electronic brain is an example of a possible artificial intelligence implementation of the so-called dice concept described here.
Keywords. Computer selection and evaluation; special purpose computers; space vehicles; reliability; maintenance engineering; autonomy; artificial intelligence; dice concept.
INTRODUCTION
Since 1961, the year of Gagarin's first successful orbital flight, mankind has obtained significant knowledge of extraterrestrial exploration. Nevertheless, the recently announced idea of a Martian expedition needs a multinational R&D effort before it can be realized as a reasonably safe and scientifically fruitful trip through the Solar system. Reliability, availability and survivability are the major concerns of such a difficult mission and the major design goals for spaceship systems.
Well-established principles of round-the-Earth manned spaceflights, with massive 24-hours-a-day ground support and radio links having insignificant delays, do not work if we specify the need for continuous uninterrupted control aboard a Martian spaceship drifting in deep space many millions of kilometers from the Earth. The maximum lag of a radio signal from the Earth to Mars and back is on the order of half an hour. So a high level of autonomy is a must.
Anderson (1983) underlined that autonomy is fundamentally provided by onboard fault-tolerant computer systems (FTCS). With a very high cost for every kilogram of mass onboard an interplanetary spaceship, it is reasonable to use all resources at hand to increase reliability and availability and, if necessary, to repair the most critical spaceship systems, with the FTCS as the first candidate. The ability to repair presumes some "pool" of redundant or spare elements (modules). Conventional printed circuit boards (PCB) may be of various dimensions, with components soldered or welded on one or both sides. A group of PCBs is electrically connected via a motherboard and encased in an electronic box. These boxes are linked together through cables. A serious decrease of packaging density in terms of gates per unit volume occurs at each packaging level, with the ultimate gate density $10^3$ to $10^4$ times less than that of a VLSI chip (Valkov, 1979). The requirement for maintainability makes these figures even worse.
Ordinary design packaging conventions allow three levels of redundancy: 1) high level -- an electronic box or a line replaceable unit (LRU); 2) medium level -- a PCB or a shop repairable unit (SRU); 3) low level -- electronic components, e.g. an IC package. Redundancy results in a mass, volume and power (MVP) penalty: the higher the level of redundancy, the greater the MVP penalty. Many examples of FTCSs support the conclusion that advanced avionics technology makes it possible to pack elaborate functions into a small VLSI chip but fails to allow tolerable repair at the IC level; repair at the PCB level is a rarity, and repair at the LRU level is most common. Lowering the level of redundancy is also a very difficult task, because "field" repair in deep space requires ease of manual handling of replaceable units (soldering or welding operations are not welcome even in the presence of artificial gravity), and the electrical connectors usually required to integrate PCB modules add latency, increase mass and volume, and degrade reliability and overall performance. So it is obvious that conventional packaging ideas are inadequate for this application. The purpose of this paper is to describe a new approach capable, from the authors' point of view, of fully exploiting the benefits of VLSI technology in the Martian spaceship electronic brain.
HOMOGENOUS SYSTEMS
Parallel multiprocessor systems (PMS) containing a large number of microprocessor cells (MPC) with regular links between them are now being thoroughly investigated by researchers in several countries. One possible advantage of these systems is the capability to achieve extremely high speed, on the order of 100 to 1000 MIPS, with a moderate performance of a single MPC (5 to 10 MIPS). Some other attractive features are high regularity of links and parallelism of algorithm grains execution, practically limitless expansion of the homogenous computer structure, technological identity of MPCs and, hence, better production potential and lower cost in comparison with non-homogenous structures.
New exotic principles of parallelism are being adopted together with new parallel algorithms. Unfortunately, PMS examples are confined to a few prototypes which, as far as we know, are mostly at an early stage of development (Kalyaev, 1984). Considering the solid background of interesting theoretical results, this phenomenon can be explained by the fact that until recently it was difficult to fabricate a large-scale cost-effective homogenous structure. Now the situation is much better due to the more optimistic VLSI fabrication yields achieved by the microelectronics industry. Very soon reasonable lots of VLSI devices can be processed economically and used for prototyping a system with several hundred MPCs. To obtain the highest performance each algorithm, if possible, is immersed in an adequate PMS structure or, in other terms, a graph network in which the MPCs (nodes) at any given algorithm step are interconnected in the same pattern as the algorithm grains. Figure 1 depicts a graph presentation of a PMS which is appropriate for the construction of linear and orthogonal sets and for the effective realization of the Discrete Fourier Transform, convolution, recurrent and non-recurrent filtering, matrix operations, solution of triangular equation sets, etc. At any moment the PMS structure corresponds to a graph whose properties are defined by the actual cell (node) interconnections or, in other terms, by the current switching network (SN) configuration. From a technological point of view, the complexity of these interconnections should be topologically constrained so that 1) no intersections are allowed (planar structure), or 2) the nodes are edges or vertices of a regular convex polyhedron (3D structure). To construct a homogenous PMS structure, let an equivalent graph $G=(V,E)$ be given, with a vertex enumeration $d$ such that for any pair $(i,j)$ another enumeration exists which gives the number $j$ to the $i$-th element and vice versa. This type of enumeration is an automorphism which is defined by $d(i,j)=$ [expression illegible in the source] and preserves interconnections.
A group graph is a directed graph $G$ whose vertices correspond to the elements of a group $\Gamma$ and vice versa. Two vertices of the graph $G$ are connected by the edge $(e,e')$ if an element $e_0 \in \Gamma_0$ exists such that $e' = e e_0$, where $\Gamma_0 \subset \Gamma$ is a finite subset with $e_0^{-1} \in \Gamma_0$ and $e_0 e_0^{-1} = 1$.
The elements of the subset $\Gamma_0$ are called group constituents. A group may be presented as a network with directed edges, the nodes corresponding to elements and the edges corresponding to multiplication by the group constituents $e_0$ and their inverses $e_0^{-1}$. Such graphs are known as Cayley graphs. We are more interested in groups than in the corresponding crystal lattice patterns because the latter can be easily constructed from a group graph by substituting each pair of opposite-directed edges connecting two vertices with one undirected edge (Grossman, 1964). A group may be described by certain defining relations which denote links between the constituents of the group. For example, $r s f r^{-1} s^{-1} f^{-1} = 1$ denotes the Abelian group having three constituents $r$, $s$, $f$, where 1 is the unity of the group $\Gamma$. Figure 1 shows the corresponding surface mapping with regular hexagons (1a) and its group graph (1b). It is evident that commutativity and isotropy may be useful for a PMS design. These properties allow equivalent alternative routing between two given interacting MPCs and are valuable for fault tolerance.
STRUCTURE EVALUATION
The fewer transit MPCs there are en route, the better the interaction performance between two interacting cells. System performance depends on the number of simultaneous interactions in the system: the more, the better. A PMS structure specifies not only both types of performance but also the reconfiguration and reliability potential. A useful, though not exhaustive, PMS structure evaluation list includes 1) system complexity, 2) expandability, 3) average inter-cell message delay, 4) commutativity, and 5) connectivity (Pyavtchenko, 1987).
System complexity is defined by the total number of nodes ($N$) and links ($N_w$) of the PMS structure graph. Expandability is determined as the minimum number of nodes ($\Delta D$) that may be added to the PMS without altering its homogenous graph pattern. Inter-cell message delay is supposed to be proportional to the distance between two interacting nodes. This distance is the minimum number of edges which must be traversed to get from one node to the other. The maximum distance $T_{max}$ between any pair of graph nodes is called the diameter. Obviously it is proportional to the maximum inter-cell message delay in the PMS. The average inter-cell message delay $T_{cp}$ corresponds to the mean graph diameter, which is calculated as the arithmetic average of all non-zero distances within the graph. Another way to calculate $T_{cp}$ is to find the sum of all absolute values of the powers of the group constituents. Commutativity is defined as the ratio of the total number of links between PMS cells, $N_w$, to the number of links of a full graph, $N(N-1)/2$, that is, $v^* = 2N_w/(N(N-1))$. Connectivity addresses the problem of fault tolerance: some cells, if failed, can disconnect a PMS graph. We define connectivity as the minimum number of nodes whose removal splits the PMS graph into two disjoint graphs. Relative connectivity is formulated as the ratio of the graph cyclomatic number to the minimum number of links necessary to support the graph structure: $C_n^* = (N_w-N+1)/(N-1)$.
Let us calculate the abovementioned parameters for the structure given in Fig. 1. Its equivalent transformation is shown in Fig. 2 as a triangular pattern lattice. Then
$N = w_1 w_2 w_3$;
$N_w = w_1 w_2 (4w_3-1) - w_3 (2(w_1+w_2)-1)$;
$\Delta D = \min\{w_1 w_2;\ w_2 w_3;\ w_1 w_3\}$;
$T_{cp} = K \sum (\ldots)$, where $K = 1/(N(N-1))$; the summand, which the source gives separately for odd and even octants in terms of the constituents $r$, $s$, $f$, with $g = rs$ and $t = \min\{i-n_1;\ j-n_2\}$, is illegible in the source;
$v^* = \dfrac{2\,(w_1 w_2 (4w_3-1) - w_3 (2(w_1+w_2)-1))}{w_1 w_2 w_3\,(w_1 w_2 w_3 - 1)}$;
$C_n^* = \dfrac{w_1 w_2 (3w_3-1) - w_3 (2(w_1+w_2)-1) + 1}{w_1 w_2 w_3 - 1}$.
FAULT TOLERANCE
Fault-tolerant systems practice yields effective methods of switching from a failed cell to a backup when the total number of cells in the system is not large. In this case one can afford to develop a switching network capable of arbitrarily including any appropriate backup cell from a "pool" in place of a failed one. A systematic application of this approach to homogenous FTCSs with more and more cells leads to a limit where the additional complexity of the switching network absorbs further growth of reliability. Due to this fact large-scale FTCSs can scarcely use a full-graph switching network, and therefore other approaches are badly needed. One of the crucial points of a redundant system is the selection of diagnosis and recovery methods which are devised to sustain the proper functioning of an FTCS. These methods have been under investigation since the early 1960s (Avizienis, 1970, 1985, 1986; Hecht, 1977; Pugatchev, 1966). The general approach consists of several steps including fault detection, fault isolation, system reconfiguration and system recovery. Sophisticated FTCSs tolerate multiple faults and demonstrate graceful degradation, discarding low-penalty functions one after another and vigorously supporting the most vital high-penalty services until the final (lethal) fault.
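Because connectivity and alternative routing decide how well a lattice survives cell failures, it can help to evaluate the structural parameters of the STRUCTURE EVALUATION section on a small example before a backup scheme is chosen. The Python sketch below is an illustration only. It assumes a plain rectangular w1 x w2 x w3 lattice with links to the nearest neighbours along each axis, which is not the triangular pattern lattice of Fig. 2, so its link count is not expected to match the paper's expression for the number of links. The function names `grid_lattice`, `bfs_distances` and `structure_metrics` are hypothetical, and the graph is assumed to be connected.

```python
from collections import deque
from itertools import product


def grid_lattice(w1, w2, w3):
    """Adjacency lists of a w1 x w2 x w3 rectangular lattice (assumed topology)."""
    nodes = list(product(range(w1), range(w2), range(w3)))
    node_set = set(nodes)
    adj = {v: [] for v in nodes}
    for (i, j, z) in nodes:
        # Link each cell to its successor along every axis, in both directions.
        for di, dj, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            u = (i + di, j + dj, z + dz)
            if u in node_set:
                adj[(i, j, z)].append(u)
                adj[u].append((i, j, z))
    return adj


def bfs_distances(adj, src):
    """Minimum number of edges from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def structure_metrics(adj):
    """Complexity, diameter, mean distance, commutativity, relative connectivity."""
    n = len(adj)
    n_w = sum(len(nb) for nb in adj.values()) // 2  # each link is stored twice
    total, t_max = 0, 0
    for v in adj:
        d = bfs_distances(adj, v)
        total += sum(d.values())
        t_max = max(t_max, max(d.values()))
    return {
        "N": n,
        "N_w": n_w,
        "T_max": t_max,                       # graph diameter
        "T_cp": total / (n * (n - 1)),        # mean of all non-zero distances
        "v_star": 2 * n_w / (n * (n - 1)),    # commutativity
        "C_rel": (n_w - n + 1) / (n - 1),     # relative connectivity
    }
```

For a 2 x 2 x 2 lattice, `structure_metrics(grid_lattice(2, 2, 2))` gives N = 8, N_w = 12 and T_max = 3. The node connectivity itself (the smallest set of cells whose removal splits the graph) is left out of the sketch because it needs a minimum vertex cut computation.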
Let a lattice of some hundred MPCs be given where each cell has physical links with six neighbours. When a physical link of an MPC is masked in its interface, it is a void link; otherwise it is a plenipotent link. Masking allows a PMS to be conformed to different graph configurations. The requirement to back up any working cell by any cell from a pool of spares cannot be met due to the abovementioned SN complexity. The dual-cell approach leads to a two-fold increase in cells and physical links but is seemingly not the best decision, considering the non-zero probability of both the working and the backup cell failing in the same node. Redundancy decreases if only one backup cell is entered in each column (Myamlin, 1988). Strictly speaking, in this case the working lattice with cell coordinates $w_1, w_2, w_3$ is complemented with a backup lattice with cell coordinates $w_2, w_3$. A failed cell is replaced by a backup one in the same column through a masking procedure that blocks the failed cell's links and activates new links, as if "shifting" workable cells to reconstruct the column. The number of additional physical links is not very high for this approach. The masking procedure presumes that a "column state word" (CSW) exists which allows the links of a failed column to be reconstructed. The whole process should then be split into a sequence of steps; each step for every column is described in a non-volatile memory by a step identifier, a CSW, and a "rollback point". An error detection invokes diagnosis and fault isolation by column recovery, followed by a retry from a rollback point one step back (Avizienis, 1985).
Table 1 illustrates the comparative reliability increase for the "one-spare-cell-in-a-column" approach. Reliability data have been calculated for different time intervals and failure rates. Cell failures are considered to be independent events, and the probability of a cell failure is found as $P(t)=1-\exp(-\lambda t)$, where $\lambda$ is the cell failure rate for the exponential law. Then the probability that no failure has occurred in a non-redundant system of $w_1 w_2 w_3$ cells in a given time interval $t$ can be found as $p_0(t)=\exp(-\lambda t\, w_1 w_2 w_3)$.
Now let a system of $N$ working cells and $s$ spare (backup) cells be given, where a system failure occurs after the lethal $(s+1)$-th cell failure. For an $(N+s)$-cell lattice the number of lattice states equals $2^{N+s}$. The probability of exactly $j$ workable cells and $(N+s-j)$ failed cells can be found using the binomial distribution as follows:
$P_j = C_{N+s}^{j}\, p^{j} (1-p)^{N+s-j}$, (1)
where $p = 1-P(t) = \exp(-\lambda t)$ is the probability that a single cell is workable. The probability that the system has not failed is calculated as the sum of Eq. (1) from $j=N$ to $j=N+s$. This yields
$P_1 = \sum_{j=N}^{N+s} C_{N+s}^{j}\, p^{j} (1-p)^{N+s-j}$ [the closed-form expression that follows in the source is illegible]. (2)
For $N=w_1 w_2 w_3$ and $s=w_2 w_3$ the probability of this system being operational is found from Eq. (2) [the resulting expression is missing in the source].
It can be noted that the proposed approach promises satisfactory reliability figures with a moderate additional mass, volume and power for the backup structure.
NEW DICE PACKAGING CONCEPT
The mass and volume penalty of electrical connectors and interconnections is of prime concern for a designer striving to increase the packaging density of an FTCS. The maximum gain is obtained when the interconnection count is maximized at the lowest possible level (Valkov, 1979). A critical fundamental issue often neglected in the desire to achieve high packaging density is the growing number of IC leads, which in turn are a source of unreliability. According to Rent's law, the more gates $G$ are placed on a chip area, the more interconnections $N$ one needs to put the chip into the electronic scheme (Blodget, 1983): $N = (2.5 \text{ to } 3.5)\, G^{(0.5 \text{ to } 0.75)}$. This law holds for LSI technology but fails as the gate count approaches 1M gates per chip and goes beyond. Then $N$ becomes 2 to 3 orders of magnitude less than that calculated with Rent's law; the exact number of leads depends on the FTCS design and the way it is split into VLSI chips. So the basic idea for lessening the lead number of a VLSI chip is to achieve a functionally self-contained VLSI design, highly sophisticated inside and with a moderate input-output data rate.
To further reduce the lead number $N$ one can use serial data paths instead of parallel data paths; in order to sustain the throughput, the data rate should be increased to transfer the same volume of data in the same time interval. Serial data paths are common practice for regular-structure (homogenous) systems with distributed memory in each cell and programmable interconnection switching (Kalyaev, 1984). Bionic systems show another example of a serial (and often analog) type of data transfer. It is worth noting that a very high rate digital data flow can be replaced with an equivalent analog signal of a predetermined precision and time scale. The problem of high-frequency cross-talk is nullified if the signal is supplied at optical frequencies. That also makes the system highly immune to electromagnetic interference.
The dense packaging arrangement can be illustrated with a 3D "brickwall" rectangular lattice. If we have a brickwall consisting of cubic-shaped bricks, then each cube is a cell surrounded by six others; peripheral cubes are exceptions. To get rid of unreliable electrical leads and to simplify repair in deep space, each cubic cell is provided with light-emitting sources and photoreceivers, so it can exchange information with its neighbours "eye-to-eye". Hence a cubic cell has no external soldered or welded joints and can be easily extracted from the "brickwall" if failed. Power plates on the top and bottom of a cube are the only electrical contacts; information and control signals outside the cells travel in the form of lightwaves. Special panels interlined between cell layers supply the electric power, remove heat from the cells, and provide mechanical stiffness and radiation hardness. Waveguides embedded in the panels are used for system purposes. This is the main idea of the proposed "dice concept"; the name is derived from the image of cubic cells resembling dice, as illustrated in Fig. 3 (Goncharov, 1985).
Figure 4 describes another (hexagonal) type of cell which is called a Wafer-Scale Integration Dice Element or a W-SIDE. The hexagonal form factor of a W-SIDE offers the possibility to simultaneously contact and exchange information with eight neighbours. This provides enough flexibility to conform its links to various algorithm graphs. Wafer-Scale Integration is now the most promising technology to pack a cell with elaborate functions and to reach the highest packaging density when W-SIDEs are assembled into a complete system, as shown in Fig. 5. It is a sandwich 3D structure with special panels between layers of W-SIDEs.
The multipurpose role of the panels should be disclosed in more detail. These panels, shown in Fig. 6 with heat pipes inside, are used as effective cold plates which transfer heat from adjoining cells to a fluid moving through exchanger ducts in the panels and then to a heat radiator of the spaceship thermal control system (TCS). As the conductive properties of the panels are used to supply electric power to W-SIDEs, odd panels are connected to a positive bus and even panels to a negative (ground) bus of the spaceship power system. For this reason two TCS loops are available, one for each polarity, to prevent a short circuit and to back up the heat removal function. Planar waveguides embedded in the panels have special outcoupling gratings to extract a photonic signal from them and to send it into W-SIDE photoreceivers. The orthogonal structure of the waveguides allows any W-SIDE to be selected, addressed and instructed by choosing its row and column.
This function is provided by peripheral electronics encapsulated inside the walls of the panels. Waveguides are also used to synchronize cooperating W-SIDEs during a work phase.
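To put numbers on the lead-count argument that motivates the optical links and panel waveguides described above, Rent's law as quoted at the start of this section can be evaluated directly. The Python sketch below is a numerical illustration only. The function names `rent_leads` and `rent_range` are hypothetical, and pairing the lower coefficient with the lower exponent (and the upper with the upper) to obtain bounds is an assumption about how the quoted ranges are meant to be combined.

```python
def rent_leads(gates, k, exponent):
    """Lead count predicted by Rent's law, N = k * gates**exponent."""
    return k * gates ** exponent


def rent_range(gates):
    """Lower and upper lead counts for the ranges quoted in the text:
    coefficient 2.5 to 3.5, exponent 0.5 to 0.75 (pairing is an assumption)."""
    return rent_leads(gates, 2.5, 0.5), rent_leads(gates, 3.5, 0.75)


# Print the predicted range for a few illustrative gate counts.
for gates in (1_000, 100_000, 1_000_000):
    low, high = rent_range(gates)
    print(f"{gates:>9} gates: {low:10.0f} to {high:10.0f} leads")
```

For a chip of one million gates the quoted ranges give roughly 2,500 to 110,000 leads, which is the scale at which the text says the law stops holding and a functionally self-contained design with a moderate input-output data rate becomes necessary.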
The information essential to the fundamental operation of a W-SIDE within the system is permanently stored in firmware and read-only memory (ROM). The data and code needed to compute certain portions of a task at a predetermined phase of the mission are transmitted beforehand from the external mass memory via the peripheral electronics to a chosen group of W-SIDEs and are stored in the non-volatile random-access memory (RAM) of the cells. So the difference between W-SIDEs is determined during three stages of their individualization: first the firmware design, then the ROM implementation and, at last, the RAM loading. From the evolutionary point of view, further individualization is possible should a W-SIDE have the capability to learn by experience.
CONCLUSION
The huge logic complexity and voluminous memory inherent in a W-SIDE cell open enormous prospects for AI-based fault tolerance mechanisms where a W-SIDE System is treated as a society with a common goal, a single W-SIDE is a member, a group of members working at the same task forms a shop, etc. Voting and judgement are usual procedures in a pre-work phase, when the abnormal behavior of any member is discussed by its "eye-to-eye" partners and a verdict is reached. Modified wave algorithms are used to find optimal, sub-optimal or alternative paths for a given job. The ideas of dormancy are best implemented in the realm of W-SIDEs with safeguard members and a switched-off society. Preliminary research into these and other ideas gives grounds for optimism about a dice concept applied to a Martian expedition. These ideas are yet to be explored in detail. While there is considerable potential for use of a W-SIDE electronic brain onboard the manned Martian spaceship, much additional development effort will be required before autonomous spacecraft control is considered to be a state-of-the-art capability.
REFERENCES
Anderson, J.L. (1983). Space station autonomy requirements. In Proc. 4th AIAA Computers in Aerospace Conf., Hartford, Connecticut, USA. pp. 164-170.
Avizienis, A.A. (1970). Self-testing and repairing computer. US Patent 3,517,171, US Cl. 235-153.
Avizienis, A. (1985). The N-version approach to fault-tolerant software. IEEE Trans. Software Eng., 11, 1491-1501.
Avizienis, A. (1986). Dependable computing: from concepts to design diversity. Proc. IEEE, 74, 629-638.
Blodget, E.J., Jr. (1983). IC assembly and mounting methods. Inside the World of Science, No. 9, 46-58. (Russian edition of Sci. American).
Goncharov, B. (1985). Homogenous computer structure. USSR Invention Certificate SU 1.161.937, Int. Cl. G 06 F 7/00.
Grossman, I. and W. Magnus (1964). Groups and their graphs. Random House, L.W. Singer Co., London. (Trans. into Russian by MIR Publishers, 1971).
Hecht, H. (1977). Fault-tolerant computers for spacecraft. J. Spacecraft, 14, 579.
Kalyaev, A.V. (1984). Multiprocessor systems with programming architecture. Radio and Comm., Moscow. (In Russian).
Myamlin, A.N., L.A. Pozdnyakov, E.I. Kotov, I.B. Zadykhailo (1988). Increasing reliability of the matrix PMS. In V.V. Przyalkovsky (Ed.), Electronic Computer Technology, 2nd issue, Radio and Comm., Moscow. (In Russian).
Pugatchev, V.S. (Ed.) (1966). Redundancy methods for computer systems. Soviet Radio, Moscow. (In Russian).
Pyavtchenko, O.N., B. Goncharov, G. Mozharov (1987). On the evaluation of computer systems structural parameters. In Proc. 7th North Caucasus Workshop on Computer Sci. and Tech., Taganrog, USSR. (In Russian).
Valkov, V.M. (1979). Microelectronic computer complexes for control. Mashinostroenie, Leningrad. (In Russian).
TABLE 1
Reliability of Non-Redundant versus Redundant Homogenous Systems

Reliability of a system with given dimensions (w/o = without backup, red. = redundant)

Failure     Time         4*4*4                 6*6*6                 9*9*9
rate, 1/h   interval, h  w/o       red.        w/o       red.        w/o       red.
10^-4         10         0.938005  0.999840    0.805735  0.999248    0.482391  0.996384
              50         0.726149  0.996067    0.339596  0.981670    0.026121  0.915451
             250         0.201897  0.911122    0.004517  0.652577    0.000000  0.139257
            1250         0.000335  0.159507    0.000000  0.000000    0.000000  0.000000
10^-5         10         0.993620  0.999998    0.978632  0.999992    0.929694  0.999964
              50         0.968507  0.999960    0.897628  0.999811    0.694544  0.999092
             250         0.852144  0.999008    0.582748  0.995336    0.161621  0.977823
            1250         0.449329  0.976188    0.067206  0.893962    0.000110  0.589719
10^-6         10         0.999360  0.999999    0.997842  0.999999    0.992737  0.999999
              50         0.996805  0.999999    0.989258  0.999998    0.964206  0.999991
             250         0.984127  0.999990    0.947432  0.999953    0.833393  0.999773
            1250         0.923116  0.999751    0.763379  0.998826    0.402021  0.994365
10^-7         10         0.999936  0.999999    0.999784  0.999999    0.999271  0.999999
              50         0.999680  0.999999    0.998921  0.999999    0.996362  0.999999
             250         0.998401  0.999999    0.994615  0.999999    0.981940  0.999998
            1250         0.992032  0.999998    0.973361  0.999988    0.912904  0.999943
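The reliability model behind Table 1 can be sketched in Python. The non-redundant figure follows directly from $p_0(t)=\exp(-\lambda t\, w_1 w_2 w_3)$. For the redundant figure the sketch assumes that each of the $w_2 w_3$ columns holds $w_1$ working cells plus one spare, survives as long as no more than one of its $w_1+1$ cells has failed, and fails independently of the other columns. This per-column reading is an assumption and not a formula stated in the text, although it reproduces several tabulated entries (for example 0.999840 for the 4*4*4 lattice at a failure rate of 10^-4 per hour over 10 hours). The binomial sum of Eq. (2) for a common pool of spares is included for comparison. All function names are hypothetical.

```python
import math


def cell_ok(rate, hours):
    """Probability that one cell is still workable: p = exp(-rate * hours)."""
    return math.exp(-rate * hours)


def non_redundant(w1, w2, w3, rate, hours):
    """All w1*w2*w3 cells must survive: p0(t) = exp(-rate * hours * w1*w2*w3)."""
    return cell_ok(rate, hours) ** (w1 * w2 * w3)


def one_spare_per_column(w1, w2, w3, rate, hours):
    """Assumed model: a column of w1 working cells plus one spare survives
    if at most one of its w1 + 1 cells has failed; columns are independent."""
    p = cell_ok(rate, hours)
    n = w1 + 1
    column = p ** n + n * p ** w1 * (1 - p)  # zero failures or exactly one
    return column ** (w2 * w3)


def pooled_spares(n_work, n_spare, p):
    """Binomial sum of Eq. (2): at least n_work of n_work + n_spare cells workable."""
    total = n_work + n_spare
    return sum(math.comb(total, j) * p ** j * (1 - p) ** (total - j)
               for j in range(n_work, total + 1))
```

Calling `non_redundant(4, 4, 4, 1e-4, 10)` and `one_spare_per_column(4, 4, 4, 1e-4, 10)` gives the first pair of entries in the table; other rows are obtained by changing the lattice dimensions, failure rate and time interval.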
Fig. 1. Hexagonal crystal lattice (a) and its group graph (b).
Fig. 2. Triangular pattern lattice.
Fig. 3. Dice packaging concept.
Fig. 4. W-SIDE -- a hexagonal dice cell.
Fig. 5. A W-SIDE system with an opened layer.
Fig. 6. A fragment of a special panel. (Plug holes not shown).
Figure labels (Figs. 3 to 6): intralayer light emitters and photoreceivers; top and bottom system photoreceivers; power plates; WSIs; interlayer light emitters, photoreceivers and lightguides; positive and negative TCS loops; power converter; plug; electrically driven screw; four-leg jack; jack hole; cell extractor hand.