Experimental results on the error detection capability of a concurrent test architecture for massively-parallel computers


Parallel Computing 18 (1992) 1079-1103, North-Holland. PARCO 717. Practical aspects and experiences.



Experimental results on the error detection capability of a concurrent test architecture for massively-parallel computers *

Marius V.A. Hancu a, Kazuhiko Iwasaki b, Yuji Sato c and Mamoru Sugie d

a CRIM (Centre de Recherche Informatique de Montreal), 3744 rue Jean Brillant, Bureau 500, Montreal, Quebec H3T 1P1, Canada
b Chiba University, Dept. of Information and Computer Sciences, Faculty of Engineering, Yayoi-cho, 260 Chiba, Japan
c Central Research Laboratory, Hitachi Ltd., Kokubunji, Tokyo 185, Japan
d Central Research Laboratory, Hitachi Ltd., Kokubunji, Tokyo 185, Japan

Received 15 August 1991
Revised 4 February 1992

Abstract

Hancu, M.V.A., K. Iwasaki, Y. Sato and M. Sugie, Experimental results on the error detection capability of a concurrent test architecture for massively-parallel computers, Parallel Computing 18 (1992) 1079-1103.

In a previous paper, we introduced a new concurrent testing (or on-line monitoring) architecture for Massively-Parallel Computers. In the proposed test architecture, on-line checks for both control flow and data routing are accomplished by enforcing the run-time test of compressed (signatured) versions of the control and data dependences of the algorithm executed in the parallel computer. This paper focuses on the results of simulation experiments on the error detection of the proposed test architecture as applied to the routing process. Four sets of experiments were executed, with two compressors or signature analyzers (an MISR and an LFSR) and two error models (the 2^m-ary and the Binary Symmetric Channel). Using a randomized routing process and a randomized fault insertion, we have obtained detailed figures for the undetected errors at all crucial detecting points of our proposed detection method: the source, the expected destination and the false destination of the messages. High detection ratios for multiple errors were obtained for compressors of only moderate size, supporting the use of this method in practical applications. The results are independent of the topology of the interconnection network and the detailed routing algorithm.

Keywords. Aliasing probability; concurrent testing; control flow checking; data dependence checking; massively-parallel computers; on-line error-detection; packet-switched routing; parallel computing; signature analysis; system-level error detection.

* This work was performed while Marius V.A. Hancu was a visiting researcher at the Central Research Laboratory, Hitachi Ltd.

Correspondence to: Marius Hancu, Centre de Recherche Informatique de Montreal (CRIM), 3744, rue Jean-Brillant, Bureau 500, Montreal, Quebec H3T 1P1, Canada, email: [email protected]

0167-8191/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved


1. Introduction

Massively-Parallel Computers (MPCs) are considered [10,21] a cost-effective alternative for implementing parallel supercomputers, especially in terms of hardware. One of the major issues to be faced in the course of their design and use is how to detect, as quickly as possible, errors in their behavior caused by temporary or permanent faults. In many applications, it is imperative to detect the errors in the system behavior on-line, without disturbing the execution of the user's programs.

There is a considerable body of work on concurrent testing as applied to uniprocessors [3,16,17,20,26,27,30-33]. However, the question of on-line testing of multiprocessor systems has been approached only sparsely in the literature. One of the best-known test architectures proposed is the 'roving monitor' [3,31]. In this case, the on-line behavior of a multiprocessor system is checked by a monitor (an auxiliary test processor). The program execution is checked by following exclusively the control flow, i.e. the sequencing of instructions executed by each processor. The program is first divided into program segments (blocks). For each such program segment, the instructions received by each processor are sequentially compressed by encoding circuits known as signature analyzers. The signatures corresponding to the instructions executed in each block by each processor are stored in the order of their execution in associated signature queues. The monitor reads all the signature queues in a certain known sequence and compares each individual signature with a reference signature generated at compilation time.

The roving monitor is one of the few on-line testing techniques for multiprocessor systems proposed in the literature. However, it has important disadvantages when applied to multiprocessors containing large numbers of processors (as are the MPCs), as a single roving monitor must check all the processors on a time-shared basis. As a consequence, for such multiprocessors a long latency (time from error occurrence to error detection) is obtained. Another disadvantage of the roving monitor approach as proposed in the prior work is that it does not check another characteristic of the program execution: the pattern of data flow from one processor to another. This is especially important in the context of multiprocessors, in which the interconnection network has a crucial role in the proper execution of the program.

In a previous paper [9], we introduced a new test architecture which we consider cost-effective in performing the required on-line system-level tests. The proposed method of achieving concurrent testing relies on compressing the data and control flows of the multiprocessor. The compression is achieved by using separate signatures for the routing and control flows. The run-time signatures are generated by using hardware signature analyzers. These signatures are compared with reference signatures created at compilation time. The proposed concurrent test architecture relies on the use of signature analysis in combination with new concepts concerning:
- the application: the interconnection network and the routing process (previous work has not tried, as far as we know, to approach explicitly the concurrent testing of the routing process, previous attempts focusing on the control flow only). We consider that in a multiprocessor (and especially in a massively-parallel one) the importance of the network and of the routing is equal to that of the processors and of the local computation. This was, in effect, the starting point and the motivation for our research.
- the packet structure (the inclusion of the source address in order to enable monitoring the sequence of sources for the packets incoming to a destination).
- the items to be compressed and the specific procedure (specifically, destination addresses at the sources and source addresses at the destinations, on a block-by-block basis) in order to create a complete and effective monitoring framework for the routing process at both ends of the routing path.
- the use of parallel comparison with the reference signature (the roving monitor [31], the best previous attempt, in our opinion, at concurrent monitoring of multiprocessors, uses a round-robin, serial type of comparison, which is extremely time-consuming, increasing the error detection latency, especially for large numbers of processors, as is the case of massively-parallel processors).

The path signaturing method, which is used in this paper, relies on checking the structural integrity of the program graph at execution time in comparison with a reference graph or its equivalents obtained at compilation. It is usually designed to be data-independent (e.g. [16,17,20,25,27]) for reasons of generality.

Our proposed method for concurrent monitoring of routing (which uses path signaturing in addition to new concepts on packet structure and compression) should not be confused with parity encoding. Parity could of course be used for checking the integrity of the messages, including their data fields (for example in parallel with our proposed methodology). However, parity is applied only to individual messages and does not check whether a sequence of routing cycles in a full computational/routing block is correct. In simple words, what we propose to check is whether each source sends messages in the correct sequence to a predefined sequence of destinations, and whether each destination receives messages in the correct sequence from a predefined sequence of sources. The data fields of the packets are not checked because the method is designed to be data-independent. Thus, at a new iteration, with new initial data but the same computational/routing program, the reference signatures will remain the same. Reference signatures which are data-dependent cannot be computed at compilation time.

The purpose of the present paper is to present simulation results on the error detection performance of the proposed on-line monitoring architecture. The work focuses on the routing process and its monitoring. The simulations were performed with such parameters as the total number of processors, the blocksize (total number of instructions or routing cycles in the block) and the total number of inserted faults. Two error models (the 2^m-ary error model and the Binary Symmetric Channel error model) were used, in order to verify the error detection under different error assumptions. Also, two types of compressors were used: an MISR (Multiple-Input Signature Register), more suitable for monitoring routing via parallel links, and a serial-input LFSR (Linear-Feedback Shift-Register), more suitable for monitoring serial links. The error insertion was done for all the major error cases identified in the previous paper in combination with the two error models mentioned above. The results obtained show that the error detection performance of the proposed test architecture is very good for all combinations of parameter values, even when using compressors of relatively small size (i.e. signature analyzers with a relatively small number of bits).

The paper is organized as follows:
- in Section 2, we briefly introduce our proposed concurrent test architecture;
- in Section 3, we present the organization of the simulation experiments and the simulation results;
- in Section 4, the simulation experiments on the error detection process are discussed.

2. The proposed test architecture

The proposed method of achieving concurrent testing relies on generating signature patterns for the routing and control flows of the multiprocessor in order to check their sequencing (history). The run-time signatures are generated by using distributed hardware signature analyzers. These signatures are then compared with reference signatures created at compilation time (and stored in the local memory of each processing element/router node) by using parallel compare operations.

Fig. 1. The compression of control flow information and routing information in the proposed on-line concurrent testing (on-line monitoring) architecture.

Figure 1 is a graphic representation of the basic principle of operation of the proposed on-line monitoring method of multiprocessor systems. In this figure, a 2-dimensional mesh interconnection network is shown, connecting 9 processing elements (PEs). Three signature analyzers (SAs) are introduced as testing (monitoring) devices at every PE/router node of the interconnection network: INSTR-SA, DA-SA and SA-SA. These SAs respectively compress the following information:
- INSTR-SA: the instructions (denoted I in Fig. 1) arriving at the PEs located at the respective processor/router nodes;
- DA-SA: the destination of the outgoing packets;
- SA-SA: the source of the incoming packets.

For packet-switched interconnection networks (which provide our main application case) we suggest in Fig. 2 a possible packet format that supports the proposed on-line test architecture. The packet consists of a destination address, a source address and data. The destination address is normally included [10] in the packet in any packet-switched interconnection network in order to direct the routing itself. In our proposal, the destination address is compressed at the beginning of the routing path by the DA-SAs in order to provide the part of the test response related to the data dependences of the program executed.

The source address field has no role in the routing process proper. We introduce it in the structure of the packet with the express purpose of performing the proposed on-line test of the program execution based on the program data dependences. The source address of the packets is compressed at the destination of the packet by the SA-SAs in order to enforce the test of the data dependences of the packet.

The packet also contains the data (resulting from the on-line program execution) that is sent to other processors in the course of the parallel execution of the program. This data is not used directly for test purposes in our proposed monitoring method. Our monitoring method is thus data-independent. In other words, what we propose to monitor is the compressed history of routing patterns (plainly speaking, where the data packets come from and where they go), not the data themselves.
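The packet format and the two address compressions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 4-bit register width, the feedback polynomial x^4 + x + 1, the seed, the routing tables and the names `misr_step`, `Packet` and `run_block` are all assumptions made for illustration. The sketch compresses destination addresses at the source and source addresses at the destination, and shows that changing the data fields leaves the signatures unchanged.

```python
from dataclasses import dataclass

M = 4                 # signature width in bits (assumed value)
POLY = 0b0011         # feedback taps for x^4 + x + 1 (assumed polynomial)
MASK = (1 << M) - 1


def misr_step(state: int, word: int) -> int:
    """Shift the register with feedback, then XOR in the parallel input word."""
    feedback = POLY if state & (1 << (M - 1)) else 0
    return (((state << 1) & MASK) ^ feedback) ^ (word & MASK)


@dataclass
class Packet:
    dest: int   # destination address: directs routing, compressed at the source
    src: int    # source address: carried only for monitoring, compressed at the destination
    data: int   # payload: never compressed


def run_block(routing, payloads, nprocs=4, seed=0b1001):
    """Compress one block; routing[c][i] is the destination of PE i in cycle c."""
    da_sig = [seed] * nprocs   # one DA-SA per PE
    sa_sig = [seed] * nprocs   # one SA-SA per PE
    for cycle, dests in enumerate(routing):
        for pe, dest in enumerate(dests):
            pkt = Packet(dest=dest, src=pe, data=payloads[cycle][pe])
            da_sig[pe] = misr_step(da_sig[pe], pkt.dest)              # at the source
            sa_sig[pkt.dest] = misr_step(sa_sig[pkt.dest], pkt.src)   # at the destination
    return da_sig, sa_sig


routing = [[2, 0, 3, 1], [1, 3, 0, 2]]   # hypothetical two-cycle block
first = run_block(routing, payloads=[[10, 11, 12, 13], [14, 15, 16, 17]])
second = run_block(routing, payloads=[[99, 98, 97, 96], [95, 94, 93, 92]])
print(first == second)   # True: the signatures do not depend on the data fields
```

Because only the address fields enter the compressors, the reference signatures for a block can be computed once, before any data is known.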

Fig. 2. Packet format and assembly during the proposed on-line concurrent testing (on-line monitoring) for packet-switched interconnection networks. The packet fields are Source Address, Destination Address and Data; the destination address is fed to the SA at the source and the source address to the SA at the destination.

The component of the test signatures based on the compression of instructions arriving at the PEs, generated by the INSTR-SAs, makes it possible to test the behavior of the computer system from the point of view of the control flow (instruction sequencing). This component is not analyzed in the present paper, as it is covered in the literature [3,16,17,20,26,27,30-33], although mainly for uniprocessors.

The component of the test signatures based on the compression of the destinations of the outgoing packets makes it possible to test the interconnection network from the point of view of the packet assembly blocks of the routers. This test component contributes to the on-line test of the program execution based on the data dependences of the program.

The component of the test signatures based on the compression of the source addresses of the incoming packets makes possible a more complete test of the interconnection network involving the routers and the links. This test component is the crucial component used for the on-line test of the program execution based on the data dependences of the program. An important advantage of compressing the source address is that this signature component is independent of the route followed by the packets.

In a packet-switched interconnection network, any given packet is routed to its expected destination only if it has a correct destination address and the routing circuitry operates properly. If this does not happen, the packet is routed to a false destination, as indicated in Fig. 3. Consequently, the signatures at both the expected destination and the false destination are affected, as the signature analyzer at the expected destination misses the compression of a packet while the signature analyzer at the false destination compresses one extra packet.
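The effect of a single misrouted packet can be sketched in Python as follows. This is an illustrative toy, not the simulator used in the paper: the 4-bit compressor, its polynomial, the seed, the four-PE permutation and the names `misr_step` and `run_cycle` are assumptions. One packet from PE 0 is given a false destination, and the signatures are then compared at every observation point.

```python
M = 4                 # signature width in bits (assumed value)
POLY = 0b0011         # feedback taps for x^4 + x + 1 (assumed polynomial)
MASK = (1 << M) - 1


def misr_step(state: int, word: int) -> int:
    """Shift the register with feedback, then XOR in the parallel input word."""
    feedback = POLY if state & (1 << (M - 1)) else 0
    return (((state << 1) & MASK) ^ feedback) ^ (word & MASK)


def run_cycle(dests, seed=0b1001):
    """One routing cycle: PE i sends one packet to dests[i]."""
    n = len(dests)
    da_sig = [seed] * n   # destination address signatures, kept at the sources
    sa_sig = [seed] * n   # source address signatures, kept at the destinations
    for src, dest in enumerate(dests):
        da_sig[src] = misr_step(da_sig[src], dest)
        # A destination that receives nothing keeps its old signature;
        # one that receives two packets is clocked twice.
        sa_sig[dest] = misr_step(sa_sig[dest], src)
    return da_sig, sa_sig


ref_da, ref_sa = run_cycle([2, 0, 3, 1])   # error-free permutation
bad_da, bad_sa = run_cycle([1, 0, 3, 1])   # PE 0's packet goes to PE 1 instead of PE 2

da_mismatch = [i for i in range(4) if ref_da[i] != bad_da[i]]
sa_mismatch = [i for i in range(4) if ref_sa[i] != bad_sa[i]]
print(da_mismatch, sa_mismatch)   # [0] [1, 2]
```

In this run the error is visible at the source (PE 0), at the false destination (PE 1) and at the expected destination (PE 2). With other seeds or addresses a mismatch can be masked by aliasing, which is why the seed here is nonzero.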
As a result of analyzing the possible errors that can affect the messages in a packet-switched interconnection network (we presently focus our study on such networks, even though our method can also be applied to circuit-switched networks), several major message error cases were identified:

1. Message with true destination address A_d and false source address A_s. In this case, the packet is routed to the expected destination, but a false source address signature will be generated at the expected destination as a result of the false source address.

2. Message with false destination address A_d and true source address A_s. In this case, as a consequence of having a false destination address, the destination address signature at the source will be in error. The packet is routed to a false destination. At the expected destination, the source address signature will be erroneous as a result of missing one packet. At the same time, at the false destination, the source address signature will be in error as a result of compressing one extra packet.

3. Message with false destination address A_d and false source address A_s. In this case, as a consequence of having a false destination address, the destination address signature at the source will be in error. The packet is routed to a false destination. At the expected destination, the source address signature will be erroneous as a result of missing one packet. At the same time, at the false destination, the source address signature will be in error both as a result of compressing one extra packet and because the source address of the extra packet is false.

4. No message. In this case, the destination address signature at the source where the error has occurred will be in error, as a result of missing the compression of a destination address. Also, at the expected destination, the source address signature will be in error, as a result of missing the compression of a packet.

(Note: in the context of this paper, 'false' means 'erroneous', while 'true' means 'correct'.)

Fig. 3. Routing and address compression in the presence of destination address errors.

Some of the expected advantages of the proposed test architecture in comparison with the previous ones are:
• it provides means for the on-line detection of errors in the patterns of data distribution, including: the routing process and interconnection network (physical layer), while using address signatures that are independent of the actual routes followed by the message packets; the mapping process; and the errors in the data dependences of the parallel algorithm (in the logical and software layers);
• it has a distributed form, appropriate for implementation of MPCs in the present VLSI technology;
• it can easily be complemented by related off-line BIST (Built-In Self-Test).

In a related paper [9], we have shown that, as a result of the nature of the compression process, it is possible that some of the errors will not be detected. In the test literature, the phenomenon called aliasing [18] is said to occur when correct signatures are mistakenly obtained for test response sequences that contain errors. The aliasing probability is the probability of occurrence of this phenomenon. In [9], we obtained expressions for the aliasing probability of our test architecture for several message error cases and error models. For example, in the case of the 2^m-ary error model, for the single-error case (Case 3), we have obtained the following expression for the aliasing probability at the expected destination [9]:

$$P_{al_s} = \sum_{j=2}^{L-l} \binom{L-l}{j} \sum_{h=0}^{j-2} (-1)^h \binom{j}{h} \left(2^{m(j-h-1)} - 1\right) \left(\frac{p}{2^m - 1}\right)^{j} (1-p)^{L-l-j},$$

while at the false destination, when the addresses of all received packets are compressed [9]:

$$P_{al_s} = \sum_{j=2}^{L-l+1} \binom{L-l+1}{j} \sum_{h=0}^{j-2} (-1)^h \binom{j}{h} \left(2^{m(j-h-1)} - 1\right) \left(\frac{p}{2^m - 1}\right)^{j} (1-p)^{L-l-j+1}.$$

In these expressions, L denotes BLOCKSIZE, l is the index of the routing cycle in which the error is generated, p is the error probability and m denotes SASIZE. For the multiple error case, we have obtained closed-form expressions for the aliasing at the source:

$$P_{al_d} = \sum_{j=2}^{t} \binom{t}{j} \sum_{h=0}^{j-2} (-1)^h \binom{j}{h} \left(2^{m(j-h-1)} - 1\right) \left(\frac{p}{2^m - 1}\right)^{j} (1-p)^{t-j}, \tag{1}$$

and bounds for the aliasing probability at the destinations:

$$P_{al_s} \le \sum_{j=2}^{L+N_{ex}} \binom{L+N_{ex}}{j} \sum_{h=0}^{j-2} (-1)^h \binom{j}{h} \left(2^{m(j-h-1)} - 1\right) \left(\frac{p}{2^m - 1}\right)^{j} (1-p)^{L+N_{ex}-j}. \tag{2}$$

Fig. 4. The flow chart of the simulation of the error-detection process.

In the last two expressions, t is the total number of routing cycles in which destination-address errors occur and N_ex is the total number of packets arriving erroneously at a destination in the course of all L routing cycles associated with a computational block.
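To give a feel for the magnitudes involved, the following Python sketch evaluates expression (1) numerically. It assumes the form P = sum over j = 2..t of C(t, j) * [sum over h = 0..j-2 of (-1)^h C(j, h) (2^(m(j-h-1)) - 1)] * (p / (2^m - 1))^j * (1 - p)^(t - j), where C denotes the binomial coefficient. The function name `aliasing_probability` and the parameter values are illustrative and are not taken from the paper.

```python
from math import comb


def aliasing_probability(t: int, m: int, p: float) -> float:
    """Evaluate expression (1) for t errored cycles, m-bit signatures, symbol error rate p."""
    q = 2 ** m
    total = 0.0
    for j in range(2, t + 1):
        # Inner alternating sum over h = 0 .. j-2 (exact integer arithmetic).
        patterns = sum(
            (-1) ** h * comb(j, h) * (q ** (j - h - 1) - 1)
            for h in range(j - 1)
        )
        total += comb(t, j) * patterns * (p / (q - 1)) ** j * (1 - p) ** (t - j)
    return total


for m in (4, 8, 16):
    print(m, aliasing_probability(t=8, m=m, p=0.1))
```

Under this form, when p = (2^m - 1)/2^m all symbol patterns are equally likely and the sum reduces to (2^(m(t-1)) - 1)/2^(mt), which is close to 2^(-m). This limiting case is a convenient sanity check on the implementation.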

3. Experiments on the error detection capability of the proposed concurrent (on-line) test architecture

3.1. Simulation organization

We first point out that the results to be presented are independent of the topology of the interconnection network and the detailed routing algorithm. The error detection process in our proposed test architecture is based on the comparison between the run-time signatures generated by the signature analyzers and the reference signatures computed at compilation time. This comparison is made at three observation points:
- at the source, where the destination address signatures are taken into account;
- at the expected destination, where the source address signatures are considered;
- at the false destination, where the source address signatures are considered.


The flow diagram of the simulation of the error detection process is presented in Fig. 4. The main segments of the simulation are the following:
1. Seed Initialization.
2. Error-Free Destination Address Generation.
3. Destination Address Error Insertion.
4. Source Address Error Insertion.
5. Destination Address Signature Generation.
6. Source Address Signature Generation.
7. Test of Block End.
8. Signature Comparison with the Reference Signatures.

The simulation starts (block 1 in Fig. 4) with the setup of the daseed and saseed arrays, which contain the initial seeds for the destination address signature and source address signature arrays. The error-free source address generation (block 2 in Fig. 4) is trivial, as the source addresses (stored in oksa) are taken to be the indexes of the array PEs.

The error-free destination address generation is achieved in our simulation based on the premise that only permutations will be used in the routing process (i.e. in the error-free case each source PE sends only one packet to a given destination PE, and all destinations are distinct from each other). This maximizes the number of packets to be routed in each routing cycle (under the permutation assumption, this number equals the number of processing elements). At the same time, permutations are general enough, as any routing can be partitioned into permutations, if some of the sources send virtual packets.

In order to resemble real programs, the destination address permutations are randomized differently in each routing cycle. To obtain this, in a first step, all the possible destinations are generated sequentially. Then, in a second step, each destination address is stored randomly in a distinct location in the okda array. Prior to the error insertion, the content of the faulty destination address array da is identical to the content of the okda array.
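The two-step generation of a randomized destination permutation can be sketched in Python as follows. The array names oksa, okda and da follow the text; the function name `generate_addresses`, the use of Python's `random.Random` generator and the seed value are assumptions made for the sake of a reproducible example.

```python
import random


def generate_addresses(nprocs: int, rng: random.Random):
    """Return (oksa, okda, da) for one routing cycle with nprocs PEs."""
    oksa = list(range(nprocs))   # error-free source addresses: the PE indexes
    okda = list(range(nprocs))   # step 1: all possible destinations, in sequence
    rng.shuffle(okda)            # step 2: each one stored in a random distinct location
    da = list(okda)              # before error insertion, da is a copy of okda
    return oksa, okda, da


rng = random.Random(1992)        # hypothetical seed, for reproducibility only
for cycle in range(3):
    oksa, okda, da = generate_addresses(8, rng)
    print(cycle, okda)           # a fresh permutation in every routing cycle
```

Shuffling a sequential list guarantees that every destination appears exactly once, which is the permutation premise stated above.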
In simulating the error detection performance of the proposed test architecture, artificial errors have to be inserted in the routed messages. The error insertion process modifies the error-free destination and source address arrays okda (block 3 in Fig. 4) and oksa (block 4 in Fig. 4) in order to generate the corresponding arrays containing faulty addresses, da and sa. The error insertion algorithm implements one of the two error models which are used here: the 2^m-ary and the Binary Symmetric Channel (BSC) model. This process is repeated in each routing cycle. In this work, we have inserted the same number of errors in each routing cycle. Also, the numbers of destination and source address errors ndaerr and nsaerr are equal to each other in the error cases (3 and 4) where both types of errors are introduced.

The error insertion is performed by changing selected bits in the destination and source address fields of the routed packets. The selection of the affected bits is made according to the two error models. The packet data fields are not affected or included in the simulation, as the proposed method of monitoring is data-independent, as mentioned in Section 2.

The 2^m-ary symmetric channel error model [14] is represented in Fig. 5. This model is used especially for data in bit-parallel (word-like) form, for example in the case that the words are seen as elements of a Galois field [22]. As such, in the test literature this model is the one used most frequently for describing the behavior of parallel-input signature analyzers, also known as MISRs (Multiple-Input Signature Registers).

Fig. 5. The 2^m-ary symmetric error model. Each of the 2^m symbols is received correctly with probability 1 - p and as any one of the other symbols with probability q = p/(2^m - 1).

When using the 2^m-ary symmetric error model, we have assumed that p (the error probability for a symbol, a word in this case) is a derived variable, dependent on the number of errors introduced in each routing cycle:

p = (NDAERR + NSAERR) / NPROCS

where NDAERR is the number of destination address errors inserted in a routing cycle, NSAERR is the number of source address errors inserted in a routing cycle and NPROCS is the total number of processing elements. This has allowed us to cover clearly the case of multiple errors. The number of errors is constant for all routing cycles in one simulation run. However, the errors differ from one routing cycle to another, for more generality. They thus represent transient errors of one cycle in duration (the shortest possible). The detection of transient errors is one of the main objectives of the proposed concurrent test architecture. Permanent errors can be represented in this approach by transient errors that are identical for all routing cycles. We have not dealt with the permanent errors explicitly in this work, as they represent a particular case of the transient errors.

In contrast to the 2^m-ary symmetric error model, the BSC error channel is mostly used to represent errors affecting data in serial form. A representation of the BSC [22] error model is given in Fig. 6, where p is the error probability for any bit of the serial sequence. This probability is independent of the value of the bit and independent of all the other bits in the sequence. In the test literature, this error model is used frequently when the performance of serial-input signature analyzers (serial-input LFSRs, Linear Feedback Signature Registers) is analyzed.

In the simulations, when using the BSC model, p is an independent variable defining implicitly the number of errors in each routing cycle. In this case, we do not use NDAERR and NSAERR as independent variables entered to the simulation, as we do in the case of the 2^m-ary error model (they are listed as N/A, non-applicable). This is motivated by the fact that in this case the number of errors in a routing cycle is constant only on average.

Fig. 6. The Binary Symmetric Channel (BSC) error model.

We would like to underline the fact that the error insertion process is randomized both in terms of:
- which addresses (i.e. which locations of the da and sa arrays) are affected by errors;
- how these errors affect individual bits of the respective addresses.

The randomized selection of addresses and individual address bits for error insertion is different for the two error models used. In the case of the 2^m-ary symmetric channel error model, the addresses are seen as bit-parallel words (symbols) which must be treated as a whole in terms of error insertion (Fig. 9). Thus, after the address to be affected by the errors has been selected by a first random draw, all its bits are substituted with the result of a second random draw. In this model (Fig. 5) p is the probability that a symbol (in this case, an address word) is in error.

In the case of the BSC error model, the addresses are seen as serial words (symbols) in which each bit must be treated individually in terms of error insertion (Fig. 10). This is because in this model the bits are seen as members of serial sequences, each bit having the same error probability p (Fig. 6), independent of the others. Thus, after the address to be affected by the errors has been selected by a first random draw, each of its bits is treated individually in terms of error insertion. A decision is taken to change the value of each bit from its error-free value based on the value of a binary random variable which takes the value 1 with the probability p (the error probability characterizing the channel). As a result of this process, some of the address words are left totally unaffected, while others have one or more bits in error.

The error-free and faulty signature arrays okdasig/dasig and oksasig/sasig are initialized at the beginning of each block with the values of the seeds contained in the daseed and saseed arrays. Each location of the signature arrays is updated at the end of each routing cycle.
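The difference between the two insertion procedures can be sketched in Python as follows. This is not the program segment of the paper: the function names `insert_2m_ary` and `insert_bsc` are hypothetical, the replacement symbol in the 2^m-ary model is assumed to be drawn uniformly from the other 2^m - 1 values, and the BSC routine is simplified to visit every address in the array.

```python
import random


def insert_2m_ary(addresses, nerr: int, m: int, rng: random.Random):
    """2^m-ary model: pick nerr words, replace each whole word by a different symbol."""
    faulty = list(addresses)
    for idx in rng.sample(range(len(faulty)), nerr):   # first draw: which addresses
        wrong = rng.randrange((1 << m) - 1)            # second draw: one of 2^m - 1 values
        if wrong >= faulty[idx]:
            wrong += 1                                 # skip over the error-free value
        faulty[idx] = wrong
    return faulty


def insert_bsc(addresses, p: float, m: int, rng: random.Random):
    """BSC model: flip every address bit independently with probability p."""
    faulty = []
    for addr in addresses:
        for bit in range(m):
            if rng.random() < p:
                addr ^= 1 << bit
        faulty.append(addr)
    return faulty


rng = random.Random(0)                   # hypothetical seed
okda = [3, 0, 2, 1, 7, 5, 6, 4]          # an error-free permutation of 8 PEs, m = 3
print(insert_2m_ary(okda, nerr=2, m=3, rng=rng))
print(insert_bsc(okda, p=0.05, m=3, rng=rng))
```

In the first routine the number of erroneous words per cycle is fixed, so p is derived from it. In the second routine p is the input, and the number of erroneous words varies from cycle to cycle. A replaced address may collide with another PE's destination, which is what produces a missing packet at one node and an extra packet at another.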
The error-free destination and source address signature arrays represent of course the compression of the error-free destination and source addresses, while the faulty destination and source address signature arrays represent the compression of the faulty destination and source addresses. The error insertion has been made in accordance not only with the error models, but also by following up on the definition of the 4 major cases listed above. Thus the error insertion affects independently the destination address A d and the source address A s fields of the messages (Table 1). We have to mention that the BSC error model cannot be applied in the error case 4. In the case 4, all the bits must be zero, this contradicting the definition of the BSC error model, in which each of the bits can be in error independent of the others. The program segment used for the error insertion in the case of the 2 m - ary channel error model is listed in Fig. 11, while the program segment used for the error insertion in the case of the BSC model is listed in Fig. 1.?, where bscrand( ) is a function able to insert randomized errors, using the BSC model, in the destination and source addresses. The error case in effect is indicated by the value of the err_case. Following the error insertion, the next step in the simulation is the generation of the destination address and source address signatures (blocks 5 and 6 in Fig. 4), for both the error-free and faulty routings. In the case of the destination address signatures, this is a Table 1 The error insertion with respect to the major error cases

         Case 1             Case 2             Case 3             Case 4
A_d      error-free         error insertion    error insertion    no message
A_s      error insertion    error-free         error insertion    no message

Fig. 7. A Multiple-Input Signature Register (MISR) with m inputs.

straightforward application of the compressor function to the last destination address issued at the respective source. The generation of the source address signatures is somewhat more complex. We have to consider all messages and compress them based on their destinations, true or false. Thus, as a result of erroneous destination addresses, the signature analyzers at some destinations may compress more than one message in a routing cycle (the pure permutation is valid only for the error-free routing, not for the faulty one). At the same time, the signature analyzers at other destinations may not compress any message in some routing cycles. As a consequence, the resulting signatures are distinct from the error-free signatures (if no aliasing occurs). This is the essence of the error detection based on signatured routing. Two types of signature analyzers have been used in our simulations as compressors. The first (Fig. 7) is an MISR (Multiple-Input Signature Register), which is recommended for monitoring MPCs using parallel links in the interconnection network. In Fig. 7, the g's are the feedback coefficients (0 if the corresponding feedback connection is off, 1 if it is on). The second signature analyzer used (Fig. 8) is a serial-input Linear-Feedback Signature Register (LFSR). This is recommended especially for monitoring serial links. Analytically, the signature analyzers can be described by their feedback (characteristic) polynomials or by their transition matrices. Which polynomials are best for such compressors is still an open question in the test literature. However, it is widely accepted that primitive polynomials generally have the best behavior. We chose all polynomials characterizing the MISR and serial-input LFSR compressors from the Peterson and Weldon [22] table of primitive polynomials so as to have the minimum number of feedback

Fig. 8. A serial-input Linear-Feedback Shift-Register (LFSR).

Fig. 9. A representation of the error insertion process using the 2^m-ary channel error model. (Figure labels: error-free address, faulty address, L-th routing cycle.)

connections (non-zero coefficients) for a given degree, which simplifies the implementation. They are:

$x^4 + x + 1$
$x^8 + x^4 + x^3 + x^2 + 1$
$x^{12} + x^6 + x^4 + x + 1$
$x^{16} + x^{12} + x^3 + x + 1$
$x^{32} + x^{23} + x^5 + x^2 + x + 1$
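To make the role of the feedback polynomial concrete, the following is a minimal sketch of one clock of a k-bit compressor and of serial compression with the same register. It assumes an internal-XOR realization in which the register shifts toward the high-order bit and the feedback taps are the low-order coefficients of the polynomial; the function names and this particular realization are illustrative assumptions, not the circuits of Figs. 7 and 8.

```c
#include <stdint.h>

/* Mask selecting the k low-order bits (1 <= k <= 32). */
static uint32_t low_mask(unsigned k)
{
    return (k >= 32) ? 0xFFFFFFFFu : ((1u << k) - 1u);
}

/* One clock of a k-bit MISR: shift, apply feedback, XOR in the parallel word.
   poly holds the coefficients g_0..g_(k-1); the x^k term is implicit. */
static uint32_t misr_step(uint32_t sig, uint32_t in, unsigned k, uint32_t poly)
{
    uint32_t msb = (sig >> (k - 1)) & 1u;
    sig = (sig << 1) & low_mask(k);
    if (msb)
        sig ^= poly;
    return sig ^ (in & low_mask(k));
}

/* Serial-input LFSR: compress the nbits low-order bits of word, MSB first. */
static uint32_t lfsr_compress(uint32_t sig, uint32_t word, unsigned nbits,
                              unsigned k, uint32_t poly)
{
    for (unsigned b = nbits; b-- > 0; )
        sig = misr_step(sig, (word >> b) & 1u, k, poly);
    return sig;
}
```

With k = 4 and poly = 0x3 (that is, x^4 + x + 1), the serial stream 10011, which is the polynomial itself, compresses from a zero state back to zero. An error pattern of that form leaves the signature unchanged and is therefore an instance of aliasing, while the neighbouring stream 10010 yields a non-zero signature.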

The coefficients defining the polynomials for the signature analyzer are stored in the safeed[SASIZE] array. The end of each block is determined in block 7 of Fig. 4 by a simple test against the value of blocksize. In the final stage of the simulation (block 8 in Fig. 4), all the signatures are compared with the reference values, which in the real system are computed at compilation time. In the simulation itself, the error-free signatures (which constitute the reference signatures) are computed along with the faulty signatures. In order to monitor the performance of the error detection process, an important issue is how the detection events are counted (recorded) at the critical observation points: the source,

Fig. 10. A representation of the error insertion process using the BSC error model. (Figure labels: error-free address, faulty address, erroneous bit, 1st routing cycle, L-th routing cycle.)


    if (err_case == 1) {
        for (j = 0; j ...
        }
    }
    else if (err_case == 2) {
        for (j = 0; j ...
        }
    }
    else if (err_case == 3) {
        for (j = 0; j ...
            da[r1][j] = r2; sa[r1][j] = r3;
        }
    }
    else if (err_case == 4) {
        for (j = 0; j ...
        }
    }

Fig. 11. The error insertion process using the 2^m-ary channel error model.

the expected destination and the false destination. In order to clarify the process, we present a listing of the corresponding C language program segment in Fig. 13. In this program segment, v is the expected destination, u is the faulty destination and diag is a detection flag (0 if no detection has been successful up to that point, 1 if any of the detection mechanisms has already been triggered). Several explanations are needed in order to help in the understanding of the program segment listed in Fig. 13:


    if (err_case == 1) {
        for (j = 0; j ...
        }
    }
    else if (err_case == 2) {
        for (j = 0; j ...
        }
    }
    else if (err_case == 3) {
        for (j = 0; j ...
        }
    }

Fig. 12. The error insertion process using the BSC error model.

• The first doubly-nested for loop corresponds to the implementation of the detection event counters triggered by destination address errors. The destination address errors can cause errors either at the destination or at the source. The error detection at the destination (be it expected or false) is based on the values of the source address signatures stored in the array sasig[NPROCS][BLOCKSIZE]. However, these signatures may be different from their reference values as a result of two different causes, each of these causes triggering distinct detection event counters. The same error can be detected by several mechanisms at the same time and thus the detection event can be recorded by several counters. The first cause for false source address signatures is the existence of false destination addresses leading to incorrect routing and thus to misdirection of the items to be compressed. In this case, at the false destination extra packets will be compressed, while at the expected destination packets will be missing from the compression, and thus a false source address signature will be generated. This detection event increments respectively the da_fdest and da_edest counters in the first double for loop. The second cause for false source address signatures is the existence of errors in the source addresses themselves, which are the items to be compressed at the destinations. Thus a false source address signature will be generated. This detection event increments the sa_edest and sa_fdest counters in the second double for loop, as mentioned in the following. The undetected errors are recorded in the da_ndet and sa_ndet counters. The error detection at the source is based on the compression of destination addresses in the signatures stored in the array dasig[NPROCS][BLOCKSIZE]. False destination address signatures trigger the da_sour counter in the first loop. This mechanism is straightforward.
• The second doubly-nested for loop corresponds to the implementation of the detection event counters triggered by source address errors.


    for (j = 0; j < blocksize; j++) {
        for (i = 0; i ...
            }
            if (sasig[u][blocksize-1] != oksasig[u][blocksize-1]) {
                diag = 1; da_fdest++;
            }
            if (dasig[i][blocksize-1] != okdasig[i][blocksize-1]) {
                diag = 1; da_sour++;
            }
            if (diag == 0) {
                da_ndet++;
            }
    ...
    for (j = 0; j < blocksize; j++) {
        for (i = 0; i ...
            }
            if (v != u && sasig[u][blocksize-1] != oksasig[u][blocksize-1]) {
                diag = 1; sa_fdest++;
            }
            if (v != u && sasig[v][blocksize-1] != oksasig[v][blocksize-1]) {
                diag = 1; sa_edest++;
            }
            if (diag == 0) {
                sa_ndet++;
            }
    ...

Fig. 13. Counting the detection events at the main observation points.
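The counting of destination address errors at the three observation points can be illustrated with a deliberately tiny, self-contained sketch. Everything configuration-specific in it is an assumption made for illustration: 4 PEs, a single routing cycle, a 4-bit MISR with polynomial x^4 + x + 1, seeds equal to the PE index plus one, arrivals at a destination compressed in order of source index, no register clock when no message arrives, and signatures kept only at the end of the block. The names Counters, run_block, count_da_events and demo_single_misroute are hypothetical and do not appear in the simulator.

```c
#define NP   4     /* processing elements (tiny, illustrative) */
#define NCYC 1     /* routing cycles per block */
#define K    4     /* signature width in bits */
#define POLY 0x3u  /* low-order coefficients of x^4 + x + 1 */

typedef struct { int da_edest, da_fdest, da_sour, da_ndet; } Counters;

/* One clock of a K-bit MISR in internal-XOR form (an assumed realization). */
static unsigned misr_step(unsigned sig, unsigned in)
{
    unsigned msb = (sig >> (K - 1)) & 1u;
    sig = (sig << 1) & ((1u << K) - 1u);
    if (msb)
        sig ^= POLY;
    return sig ^ in;
}

/* End-of-block signatures: each source compresses the destination address it
   issues; each destination compresses the source addresses of the messages
   that actually arrive there (none, one or several per cycle). */
static void run_block(unsigned da[NP][NCYC], unsigned sa[NP][NCYC],
                      unsigned dasig[NP], unsigned sasig[NP])
{
    for (int i = 0; i < NP; i++)
        dasig[i] = sasig[i] = (unsigned)i + 1u;      /* assumed seeds */
    for (int j = 0; j < NCYC; j++)
        for (int i = 0; i < NP; i++) {
            dasig[i] = misr_step(dasig[i], da[i][j]);
            sasig[da[i][j]] = misr_step(sasig[da[i][j]], sa[i][j]);
        }
}

/* Count where each inserted destination address error is detected. */
static Counters count_da_events(unsigned okda[NP][NCYC], unsigned da[NP][NCYC],
                                unsigned sa[NP][NCYC])
{
    unsigned okdasig[NP], oksasig[NP], dasig[NP], sasig[NP];
    Counters c = {0, 0, 0, 0};
    run_block(okda, sa, okdasig, oksasig);           /* reference signatures */
    run_block(da, sa, dasig, sasig);                 /* faulty signatures */
    for (int j = 0; j < NCYC; j++)
        for (int i = 0; i < NP; i++) {
            unsigned v = okda[i][j], u = da[i][j];   /* expected, actual */
            int diag = 0;
            if (u == v)
                continue;                            /* no error inserted here */
            if (sasig[v] != oksasig[v]) { diag = 1; c.da_edest++; }
            if (sasig[u] != oksasig[u]) { diag = 1; c.da_fdest++; }
            if (dasig[i] != okdasig[i]) { diag = 1; c.da_sour++; }
            if (diag == 0)
                c.da_ndet++;
        }
    return c;
}

/* Cyclic-shift permutation i -> i+1 with one misrouted message: 0 -> 3. */
static Counters demo_single_misroute(void)
{
    unsigned okda[NP][NCYC], da[NP][NCYC], sa[NP][NCYC];
    for (int i = 0; i < NP; i++) {
        okda[i][0] = da[i][0] = (unsigned)(i + 1) % NP;
        sa[i][0] = (unsigned)i;
    }
    da[0][0] = 3;                                    /* PE 0 should send to PE 1 */
    return count_da_events(okda, da, sa);
}
```

In this example the single misrouted message is seen at all three points: PE 1 misses a packet, PE 3 compresses an extra one, and PE 0 compresses a wrong destination address, so each of the three detection counters is incremented once and the undetected counter stays at zero.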


In this case, the error detection can be evidenced only at the destination. We consider that the detection of source address errors can be caused by false signatures at either the expected destination (error detection events recorded in the sa_edest counter) or the false destination (error detection events recorded in the sa_fdest counter). As can be seen from the listing, the sa_edest counter is triggered independently of whether a false destination is in effect (v different from u) or not. In these listings, da_ndet and sa_ndet are respectively the counters for the undetected destination address and source address errors.

3.2. Simulation results

Space does not allow us to present the results for all combinations of parameter values used in the course of the simulations. We display, however, the results corresponding to several representative configurations. In this section, we only introduce the main results of the simulations, while we comment on their significance in Section 4. In our simulations, the size of the blocks BLOCKSIZE has been limited to 100 instructions (a block is defined as an instruction sequence without any branches). Based on known program statistics [13], longer non-branch blocks are unlikely. From the same reference, the expected block length should be 4-20 instructions (corresponding to the same number of routing cycles). For the 2^m-ary error model and NPROCS ≤ 4096, it has been felt that inserting up to 16 destination and source address errors (NDAERR, NSAERR ≤ 16) in each routing cycle should be realistic. For the BSC error model, two values have been chosen for the error probability p: 1/16 and 1/256. All 4 combinations of the two compressors (MISR and serial-input LFSR) and the two error models are covered. A detailed picture of the error detection results at three critical observation points along the path of the messages (the message source, the expected destination and the false destination) is presented in Tables 2a-b. These detailed results are presented only for two particular experimental configurations. In the first configuration, described in Table 2a, we used MISR

Table 2(a) Error detection events at the main observation points (MISR, 2^m-ary model, Case 3, SASIZE = 12, NPROCS = 4096)

BLOCKSIZE NDAERR NSAERR DA_EDEST DA_FDEST DA_SOUR DA_NDET TOT_DAERR SA_EDEST SA_FDEST SA_NDET TOT_SAERR UNDETECTED ERRORS TOTAL ERRORS Pald3 Pals3

100 2 2

4 5 5

20 12 20


200 123 200 0

0


200 200 123

20 20 12


0i 64

0

0

200

20

0

0

400

20

128

1.08 E-6 8.62 E-6

8.71 E-9 3.88 E-7

8.85 E-8

2.42 E-05


Table 2(b) Error detection events at the main observation points (LFSR, BSC model, Case 3, SASIZE = 12, NPROCS = 4096, p = 1/15)

Table 2(b) (continued) (LFSR, BSC model, Case 3, SASIZE = 12, NPROCS = 4096, p = 1/255)

BLOCKSIZE            4
NDAERR               N/A
NSAERR               N/A
DA_EDEST             1532
DA_FDEST             1238
DA_SOUR              1533
DA_NDET              0
TOT_DAERR            1533
SA_EDEST             1422
SA_FDEST             97
SA_NDET              0
TOT_SAERR            1422
UNDETECTED ERRORS    0
TOTAL ERRORS         2955    ...    73980

distributed compressors of 12 bits in width, applied to an MPC containing 4096 PEs, with errors inserted according to the 2^m-ary model and to Case 3 of the message error cases (see Table 1). In the second configuration, described in Table 2b, we used LFSR distributed compressors of 12 bits in width, applied to an MPC containing 4096 PEs, with errors inserted according to the BSC model (with two values for the error probability, p = 1/15 and p = 1/255) and to Case 3 of the message error cases. While detailed data were obtained for many other configurations, they are consistent with the results displayed in these two tables, and for lack of space we cannot present them here. In the case of the detailed results displayed in Tables 2a-b, the following event counters for the detected errors are listed at three points:
- at the source (only for destination address errors): DA_SOUR;
- at the expected destination: DA_EDEST and SA_EDEST;
- at the false destination: DA_FDEST and SA_FDEST.
In the same tables, the total number of undetected errors at any of the three points is listed separately for the destination address errors (DA_NDET) and source address errors


(SA_NDET). The total number of undetected errors is listed for each case, together with the total number of errors inserted. In the last row of Table 2a, we have several values predicted by the theory in [9] for the aliasing probability. These values result from the application of the formulas (1)-(2) for the aliasing probability. The reader will observe the very small values predicted by the theory for the aliasing probability. The results of the simulation give the value zero for the ratio UNDETECTED ERRORS/TOTAL ERRORS in all corresponding simulation configurations. Each of these values is a sample which can be used (with an infinite number of experiments) to obtain the simulated value of the aliasing probability. We have not taken several samples in each configuration, as we consider that the randomization introduced in the simulation makes the results representative without many samples of the same configuration. Such samples could be obtained, for example, by varying the seeds of the pseudo-random generators used in the program. The computational costs for configurations containing many processing elements are quite prohibitive. While Tables 2a-b give a detailed image of the error detection efficiency at the main observation points, the potential user of the proposed test architecture is probably mainly concerned with the overall performance of the error detection scheme for a specific configuration, in terms of the total number of undetected errors vs. the total number of injected errors. Consequently, in Tables 3a-b and 4a-b, we offer exactly that information for many possible configurations. In Table 3a, we present the number of undetected errors vs. the total number of errors for the case in which the errors were inserted according to the 2^m-ary error model when using MISR signature analyzers, while in Table 3b the use of the LFSRs is assumed.
The upper figure in each cell of the table corresponds to the number of undetected errors while the

Table 3(a) The undetected errors vs. the total no. of errors for the 2^m-ary error model (MISR compressor)

Each cell: undetected errors / total errors.

NDAERR = NSAERR            1       1       1       2       2       2       5       5       5      16      16      16
BLOCKSIZE                  4      20     100       4      20     100       4      20     100       4      20     100
Case  SASIZE  NPROCS
1     4       16         0/4    0/19    6/95     0/8    0/37  36/186    0/14   20/88  28/415     N/A     N/A     N/A
2     4       16         0/4    0/19    0/95     0/8    0/40   0/183    0/13    0/90   0/404     N/A     N/A     N/A
3     4       16         0/8    1/38   0/193    0/14    0/75   1/369    1/29   1/163   0/817     N/A     N/A     N/A
4     4       16         0/7    0/37   0/190    0/11    0/71   0/364    0/30   0/163   0/852     N/A     N/A     N/A
1     8       256        0/4    0/20   0/100     0/8    0/40   4/200    0/20    0/99   0/496    0/63   0/313 24/1547
2     8       256        0/4    0/20   0/100     0/8    0/40   0/200    0/20   0/100   0/496    0/63   0/313  0/1546
3     8       256        0/8    0/40   0/200    0/16    0/80   2/299    0/38   0/196   0/995   0/122   0/624  0/3087
4     8       256        0/8    0/40   0/199    0/16    0/79   0/398    0/40   0/200   0/991   0/124   0/621  0/3086
1     12      4096       0/4    0/20   0/100    0/61    0/40   0/200    0/20   0/100   0/500    0/54   0/320  0/1595
2     12      4096       0/4    0/20   0/100     0/8    0/40   0/200    0/20   0/100   0/499    0/64   0/320  0/1595
3     12      4096       0/8    0/40   0/200    0/16    0/80   0/400    0/20   0/?00  0/1000   0/128   0/638  0/3191
4     12      4096       0/8    0/40   0/200    0/16    0/80   0/400    0/40   0/200  0/1000   0/320   0/640  0/3190


Table 3(b) The undetected errors vs. the total no. of errors for the 2^m-ary error model (LFSR compressor)

Each cell: undetected errors / total errors.

NDAERR = NSAERR           16      16      16
BLOCKSIZE                  4      20     100
Case  SASIZE  NPROCS
1     4       16         N/A     N/A     N/A
2     4       16         N/A     N/A     N/A
3     4       16         N/A     N/A     N/A
4     4       16         N/A     N/A     N/A
1     8       256       0/63   0/313  7/1547
2     8       256       0/63   0/313  0/1546
3     8       256      0/122   0/624  0/3087
4     8       256      0/124   0/621  0/3086
1     12      4096      0/64   0/320  0/1595
2     12      4096      0/64   0/320  0/1595
3     12      4096     0/128   0/638  0/3191

lower figure corresponds to the total number of inserted errors in that particular simulated configuration. Thus, for example, when inserting errors according to the 2^m-ary model and using MISRs (Table 3a), in the following conditions:
- Case 3 of the message error cases;
- SASIZE = 8 (i.e. the width of the MISR input was 8 bits);
- NPROCS = 256 (i.e. 256 PEs);
- BLOCKSIZE = 100 (i.e. the length of the signatured instruction blocks is 100 instructions);
- NDAERR = 2 (i.e. 2 destination address errors were randomly inserted in each routing cycle);
- NSAERR = 2 (i.e. 2 source address errors were randomly inserted in each routing cycle);
the number of undetected errors was 2, while the corresponding total number of errors was 299. Tables 4a-b are similar to Tables 3a-b, but in this case the BSC model is used. According to the explanations given in the previous section, in this case the numbers of destination and source address errors NDAERR and NSAERR are replaced by the error probability p.
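As a worked reading of the Table 3a example quoted above (2 undetected errors out of 299), the corresponding sample of the undetected-error ratio can be written out explicitly. The comparison value 2^{-k} used below is an assumption introduced only for orientation: it is the familiar asymptotic aliasing estimate for a k-bit signature register, not the formulas (1)-(2) of [9].

```latex
\frac{\text{UNDETECTED ERRORS}}{\text{TOTAL ERRORS}}
  = \frac{2}{299} \approx 6.7 \times 10^{-3},
\qquad
2^{-k}\Big|_{k = 8} = \frac{1}{256} \approx 3.9 \times 10^{-3}.
```

Being a single sample with only two undetected events, this ratio can indicate no more than the order of magnitude of the aliasing.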

4. Discussion of the simulation results

The ratio of the undetected errors to the total number of errors is an indication of the aliasing produced by the compression mechanism. The simulation results indicate that the numbers of undetected errors are significant only for the smallest signature analyzer size, of 4 bits, which of course is exceedingly low. The number of undetected errors decreases very quickly with the width of the compressors. It can be seen that when inserting errors according to the 2^m-ary error model, perfect detection is achieved already for signature analyzers of 12 bits in width (Tables 3a-b). Another observation is that for both the MISR and the LFSR cases, the error detection is better for errors inserted according to the 2^m-ary model than according to the BSC model (see Tables 3 and 4). The only explanation we have for this phenomenon is based on the error insertion process: in the case of the 2^m-ary model the whole address word is replaced by an erroneous one (Fig. 9), while in the case of the BSC model only some of the bits are in error. We conjecture that when the whole word to be compressed is in error, the aliasing probability is smaller (which leads to a better detection) than when only some of the bits of the word are erroneous. We have made the following general observations from comparing all the simulation results (not only those displayed in this paper).
• For the same error model, there is no real difference between the performance of the two compressors employed in the simulations. We thus conclude that the choice can be made based on ease of implementation in a given design. The MISR would thus fit better in an interconnection network using parallel links, while the serial-input LFSR is more appropriate for interconnection networks using serial links.

Table 4(a) The undetected errors vs. the total no. of errors for the BSC error model (MISR compressor)

Each cell: undetected errors / total errors.

BLOCKSIZE                     4          4           20          20          100          100
p                          1/15      1/255         1/15       1/255         1/15        1/255
Case  SASIZE  NPROCS
1     4       16           0/23        0/1       20/121         0/9        0/730         5/39
2     4       16           0/23        0/1        0/121         0/9        0/730         0/39
3     4       16           0/44        0/4        0/249        0/15      20/1381         7/86
1     8       256         0/715       6/53       0/3299       4/312     63/16713       8/1536
2     8       256         0/715       0/53       0/3299       0/312      0/16713       0/1536
3     8       256        0/1399      2/115       6/6685       2/611      0/33529       4/3129
1     12      4096     61/13064    24/1466     16/65292     88/7417   165/326851     22/36989
2     12      4096      0/13064     0/1466      0/65292      0/7417     0/326851      0/73908
3     12      4096      2/26149    11/2955     6/130712     6/14735     0/654133      0/73908
1     16      4096     52/13064    24/1466      0/65292     88/7417    7?/326851      0/36989
2     16      4096      0/13064     0/1466      0/65292      0/7417     0/326851      0/36989
3     16      4096      2/26149    11/2955     0/130712     6/14735     0/654133      0/73908


Table 4(b) The undetected errors vs. the total no. of errors for the BSC model (LFSR compressor)

Each cell: undetected errors / total errors.

BLOCKSIZE                     4          4           20          20          100          100
p                          1/15      1/255         1/15       1/255         1/15        1/255
Case  SASIZE  NPROCS
1     4       16           3/23        0/1       15/121         0/9       51/730         2/39
2     4       16           0/23        0/1        0/121         0/9        0/730         0/39
3     4       16           1/44        0/4       11/249        0/15       0/1381         0/86
1     8       256         0/715       0/53       0/3299       0/312    122/16713       0/1536
2     8       256         0/715       0/53       0/3299       0/312      0/16713       0/1536
3     8       256        0/1399      0/115       0/6685       0/611      0/33528       0/3129
1     12      4096      0/13064     0/1466      0/65292      3/7417     0/326851      0/36989
2     12      4096      0/13064     0/1466      0/65292      0/7417     0/326851      0/36989
3     12      4096      0/26149     0/2955     7/130712     5/14735    29/654133      0/73980
1     16      4096      0/13064     0/1466      0/65292      0/7417   157/326851      0/36989
2     16      4096      0/13064     0/1466      0/65292      0/7417     0/326851      0/36989
3     16      4096      0/26149     0/2955     0/130712     0/14735     0/654133      0/73908

• The errors associated with Case 1 are the most difficult to detect. The reason for this is that in this case, the destination addresses are all correct, and only the source addresses are affected by errors. This implies that the detection at the false destination is not in effect in this situation. The errors associated with Case 3 follow in the order of detection difficulty. In this case, both the destination and the source addresses are affected, which increases the potential for aliasing, especially for the case of small compressors. What happens in this case is the following: even if a packet reaches the false destination, the probability that it might bring in an erroneous source address for compression is non-zero. In comparison with this, in Case 2 the source addresses are all unaffected, so the aliasing probability at the false destination is smaller. However, we have to say that this effect is truly significant only for small compressors.
• The detection of destination address errors based on the signatures accumulated at the expected destination is more effective than that based on the signatures at the false destination. The reason for this seems to be the fact that, in case of an error, at the expected destination the compression of the expected packet is always missing, and thus will normally generate an erroneous signature. At the same time, at the false destination a source address will always be compressed, and with a non-zero probability its value could be such that it will generate a correct signature, and thus aliasing. We have to point out that the destination addresses are not compressed at the destinations (this can be done, but we felt the related overhead would not be justified). However, the effect of errors in their values is reflected indirectly, as mentioned above, in the source address signatures.


• The detection of source address errors based on the signatures accumulated at the expected destination is more effective than that based on the signatures at the false destination. The reason is the same as the one outlined in the previous paragraph.
• For large numbers of inserted errors and compressors of reasonable size (SASIZE ≥ 12), practically all the errors are detected (see, for example, Table 2b). We conjecture that the explanation for this is as follows. When a large number of errors occur, the probability that a signature will be affected by the compression of multiple errors is increased. In the presence of multiple unrelated errors (our error injection process produces only such errors) the probability of aliasing decreases.
• All three detection mechanisms (at the source, at the expected destination and at the false destination) are effective.
• In a separate series of simulations, the destination address and source address signature analyzers have been combined in a single one, timed appropriately. The error detection rate is slightly smaller. However, this is a more economical approach in terms of hardware implementation.

5. Conclusions

This paper presents results of simulation experiments concerning the error detection performance of a new concurrent test architecture for multiprocessors (especially massively-parallel processors). This architecture is able to check on-line both the data routing and the control flows by signaturing them and comparing the resulting signatures with reference signatures computed at compilation time. In this study, we focus on the detection of errors in the routing process only, as the detection of errors in the control flow is studied in the literature (even if not to a sufficient extent, especially in the context of multiprocessors). Four sets of experiments were executed, using two compressors (an MISR and an LFSR) and two error models (the 2^m-ary and the Binary Symmetric Channel). Using a randomized routing process and a randomized fault insertion, we have obtained detailed figures for the undetected errors at all three crucial detecting points of our proposed detection method: the source, the expected destination and the false destination. High detection ratios for multiple errors were obtained for compressors of only moderate size, supporting the use of this method in practical applications. The results are independent of the topology of the interconnection network and the detailed routing algorithm.

Appendix: Notations

The following notations are used throughout this work:
SASIZE - width of the compressor (signature analyzer)
NPROCS - total number of processing elements/router nodes
BLOCKSIZE - number of instructions and routing cycles in a block, in the error-free case
NDAERR - the number of destination address errors inserted in a routing cycle
NSAERR - the number of source address errors inserted in a routing cycle
DA_EDEST - the number of destination address errors detected at the expected destination
DA_FDEST - the number of destination address errors detected at the false destination
DA_SOUR - the number of destination address errors detected at the source
DA_NDET - the number of destination address errors not detected
TOT_DAERR - the total number of destination address errors
SA_EDEST - the number of source address errors detected at the expected destination
SA_FDEST - the number of source address errors detected at the false destination
SA_NDET - the number of source address errors not detected
TOT_SAERR - the total number of source address errors
UNDETECTED ERRORS - the number of undetected errors at the end of a computational block
TOTAL ERRORS - the total number of errors at the end of a computational block
N/A - not applicable

The 2-dimensional data arrays used in the course of simulating the error detection process and partially mentioned in the course of this work are:
okda[NPROCS][BLOCKSIZE] - error-free destination address array
da[NPROCS][BLOCKSIZE] - faulty destination address array
oksa[NPROCS][BLOCKSIZE] - error-free source address array
sa[NPROCS][BLOCKSIZE] - faulty source address array
okdasig[NPROCS][BLOCKSIZE] - error-free destination address signature array
dasig[NPROCS][BLOCKSIZE] - false destination address signature array
oksasig[NPROCS][BLOCKSIZE] - error-free source address signature array
sasig[NPROCS][BLOCKSIZE] - false source address signature array
daseed[NPROCS] - initial seed of the destination address signature array
saseed[NPROCS] - initial seed of the source address signature array

References

[1] D.P. Agrawal, Testing and fault-tolerance of multistage interconnection networks, IEEE Comput. (Apr. 1982) 41-52.
[2] G. Birkhoff and S. MacLane, A Survey of Modern Algebra (Macmillan, New York, 1977).
[3] M.A. Breuer and A.A. Ismaeel, Roving emulator as a fault detection mechanism, Proc. 13th Fault-Tolerant Computing Symp. (June 1983) 206-215.
[4] W.J. Dally and C.L. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Trans. Comput. C-36 (5) (May 1987) 547-553.
[5] N.J. Davis IV, W.T.-Y. Hsu and H.J. Siegel, Fault-location techniques for distributed control interconnection networks, IEEE Trans. Comput. C-34 (10) (Oct. 1985) 902-910.
[6] T.-Y. Feng and C.-H. Lin, Fault-diagnosis for a class of multistage interconnection networks, IEEE Trans. Comput. C-30 (10) (Oct. 1981) 743-758.
[7] J.A.B. Fortes and C.S. Raghavendra, Gracefully degradable processor arrays, IEEE Trans. Comput. C-34 (11) (Nov. 1985) 1033-1044.
[8] S.K. Gupta and D.K. Pradhan, A new framework for designing and analyzing BIST techniques: Computation of exact aliasing probability, Proc. Internat. Test Conf. (1988) 329-342.
[9] M.V.A. Hancu, K. Iwasaki, Y. Sato and M. Sugie, A concurrent test architecture for massively parallel computers and its error detection capability, Proc. Internat. Test Conf. (1991) 758-767.
[10] W. Hillis, The Connection Machine (MIT Press, Cambridge, MA, 1985).
[11] K. Iwasaki and F. Arakawa, An analysis of the aliasing probability of the multiple-input signature registers in the case of the 2^m-ary symmetric channel, IEEE Trans. Computer-Aided Design 9 (4) (Apr. 1990) 427-438.
[12] K. Iwasaki and N. Yamaguchi, Design of signature circuits based on weight distributions of error-correcting codes, Proc. Internat. Test Conf. (1990) 779-785.
[13] M. Kobayashi, Dynamic profile of instruction sequences for the IBM System/370, IEEE Trans. Comput. (Sep. 1983) 859-861.
[14] S. Lin and D.J. Costello, Jr., Error-Correcting Codes, 2nd ed. (Prentice-Hall, Englewood Cliffs, NJ, 1983).
[15] J.-C. Liu and K.G. Shin, Polynomial testing of packet switching networks, IEEE Trans. Comput. 38 (12) (Feb. 1989) 202-217.
[16] D.J. Lu, Watchdog processors and structural integrity checking, IEEE Trans. Comput. C-31 (7) (July 1982) 681-685.
[17] A. Mahmood and E.J. McCluskey, Concurrent error detection using watchdog processors - A survey, IEEE Trans. Comput. 37 (2) (Feb. 1988) 160-174.
[18] E.J. McCluskey, Built-in self-test techniques, IEEE Design Test Comput. 2 (2) (Apr. 1985) 21-28.
[19] D.I. Moldovan, On the design of algorithms for VLSI systolic arrays, Proc. IEEE 71 (1) (Jan. 1983) 113-120.
[20] M. Namjoo, Techniques for concurrent testing of VLSI processor operation, Proc. IEEE Test Conf. (1982) 461-468.
[21] J.R. Nickolls, The design of the MasPar MP-1: A cost-effective massively-parallel computer, Proc. Compcon Spring (1990) 25-28.
[22] W.W. Peterson and E.J. Weldon Jr., Error-Correcting Codes, 2nd ed. (MIT Press, Cambridge, MA, 1972).
[23] D.K. Pradhan, S.K. Gupta and M.G. Karpovsky, Aliasing probability for multiple input signature analyzer, IEEE Trans. Computer-Aided Design 39 (4) (Apr. 1990) 586-590.
[24] S.K. Rao and T. Kailath, Regular iterative algorithms and their implementation on processor arrays, Proc. IEEE 76 (3) (Mar. 1988) 259-269.
[25] N.R. Saxena and E.J. McCluskey, Control-flow checking using watchdog assists and extended-precision checksums, Proc. 19th Internat. Symp. on Fault-Tolerant Computing (1989) 428-435.
[26] M.E. Schmid, R.L. Trapp, A.E. Davidoff and G.M. Masson, Upset exposure by means of abstraction verification, Proc. 12th Fault-Tolerant Computing Symp. (1982) 237-244.
[27] M.A. Schuette and J.P. Shen, Processor control flow monitoring using signatured instruction streams, IEEE Trans. Comput. 36 (3) (March 1987) 264-276.
[28] D.P. Siewiorek and L.K. Lai, Testing of digital systems, Proc. IEEE 69 (Oct. 1981) 1321-1331.
[29] J.E. Smith, Measures of the effectiveness of fault signature analysis, IEEE Trans. Comput. C-29 (6) (June 1980) 510-514.
[30] T. Sridhar and S.M. Thatte, Concurrent checking of program flow in VLSI processors, Proc. IEEE Test Conf. (1982) 191-199.
[31] S.P. Tomas and J.P. Shen, A roving monitoring processor for detection of control flow errors in multiple processor systems, Proc. IEEE Internat. Conf. Comput. Design: VLSI Comput. (1985) 531-539.
[32] K.D. Wilken and J.P. Shen, Embedded signature monitoring, Proc. Internat. Test Conf. (1987) 324-333.
[33] K.D. Wilken and J.P. Shen, Continuous signature monitoring: Efficient concurrent-detection of processor control errors, Proc. Internat. Test Conf. (1988) 914-925.