Computer-communication network reliability: Trends and issues



Microelectron Reliab., Vol. 21, No. 1. pp. 79-95, 1981. Printed in Great Britain.

0026-2714/81/010079-17$02.00/0 © 1981 Pergamon Press Ltd.

COMPUTER-COMMUNICATION NETWORK RELIABILITY: TRENDS AND ISSUES

INDER M. SOI and K. K. AGGARWAL
Department of Electronics and Communication, Regional Engineering College, Kurukshetra-132119, India

(Received for publication 7 August 1980)

Abstract - People working in computer-communication networks frequently distinguish between the entire network and the communications subnetwork. The former usually includes the latter plus the terminals, devices and computers that are interconnected via the subnet. This logically includes the resident intercommunication processes that control or interface with the subnet. The network reliability can be improved by exploiting the redundancies that are either inherent due to network topology or can be provided in the network. Greater network reliability can obviously be achieved by maximizing the reliability of its constituent components, i.e. hardware, software, and communication system. This paper presents a comprehensive discussion of issues involved and trends prevailing in producing reliable hardware; reliable software; and reliable communication systems in a computer network while reviewing techniques available for preventing, detecting, diagnosing, correcting and recovering from malfunctions.

I. INTRODUCTION

The field of computer-communication networks, once simply a topic of interest to researchers, has now evolved into a strategy for supporting a range of end-user applications. Indeed, people working in computer communications frequently distinguish between the computer network and the communications subnet. The former includes the latter plus the terminals, devices, and computers that are interconnected via the subnet. This logically includes the resident intercommunication processes that control or interface with the subnet.

The major and widely recognized architectural problems in computer communications networks include topological optimization for cost, delay and throughput; routing techniques; flow control; queueing problems; the design of efficient protocols; and the establishment of effective interfacing of the network with a variety of terminals, computers, and other networks. Frequently these problems transcend the pure communications problem, i.e. that of ensuring the reliable flow of information from source to destination.

The complexity of computer communication networks has taken a dramatic upswing, following significant developments in electronic technology such as medium- and large-scale integrated circuits and microprocessors. As computer networks are increasingly put into use, the organizations they serve are becoming increasingly concerned about data and network reliability and availability, as they realize that more is accomplished on systems which seldom crash because of malfunctions than on systems which run very rapidly between frequent crashes.

Components of a computer-communication network are broadly classified into software, hardware and communications. To achieve greater computer network availability and reliability, one needs to maximize the availability and reliability of its components. This paper discusses at length the errors, failures, deadlocks, and other problems that can adversely affect the reliability of computer networks, and then discusses methods available for preventing, detecting, diagnosing, correcting, and recovering from these problems. Emphasis in this paper is on the importance of network designs that allow errors to be detected without overall knowledge of the network's status. We discuss in brief the methods and approaches available to produce reliable hardware and software, and the reliability problems in communication systems. A comprehensive review and discussion of techniques is presented to improve computer network reliability by the production of reliable hardware, software and communication systems. Costs increase quite rapidly as the ideal system is approached, especially when the last few tenths of a percent of unreliability are eliminated, so organizations should be satisfied with the level of reliability needed for satisfactory operation, depending on the needs of the organization, rather than attempt to achieve an eternal machine.

II. DEFINITIONS OF RELATED TERMS

A computer network is either an interconnection among several computers or of a set of terminals connected to one or more computers. A host is a computer whose function is separate from that of switching information; a node is a computer that primarily is only a switch. A single computer may be both a host and a node. Terminals are interfaces between the user and the computer or network. Hosts, nodes and terminals are joined together by transmission links to form a network.

A homogeneous network is one consisting of physically or logically identical processors which are capable of executing copies of the same software system. A heterogeneous network is one that is not homogeneous but may contain homogeneous subnetworks, e.g. ARPANET is a heterogeneous network while the PDP-10's that are on the ARPANET and use the Tenex operating system constitute a homogeneous subnetwork (7).

Reliability is the probability that the time between failures in the network (including software, hardware and communications) is greater than some positive value t. Availability is the probability that the network is capable of performing its assigned functions correctly at time T.

III. RELIABILITY PROBLEMS OF A COMPUTER NETWORK

Components of a computer network are broadly classified into software, hardware and communications. To achieve greater computer network availability and reliability, one needs to maximize the availability and reliability of its components by exploiting the redundancies that are either inherent or can be provided in these components.

Hardware reliability is improved by conservative design, using reliable components, careful implementation through initial and periodic testing, redundancy within units and possibly the use of redundant units and external observers (1). Keys to software reliability are not only structure and care in design, implementation and verification of software, but also effective use of redundancy in the form of robust data structures, and detailed information about what constitutes expected behaviour of software (4). Reliability of a communication system is maximized by the use of error detecting and correcting coding of the transmitted data; monitoring facilities to detect failures of communication equipment; and redundant facilities to provide backup.
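As an illustration of error-detecting coding of transmitted data, the following is a minimal sketch, not a method taken from the paper. It assumes a CRC-32 checksum (via Python's standard `zlib.crc32`) appended to each frame; the function names `add_checksum` and `verify_frame` are hypothetical. The sketch only detects corruption; correcting errors would need a different code.

```python
import zlib


def add_checksum(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def verify_frame(frame: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    if len(frame) < 4:
        return None  # too short to carry a checksum at all
    payload = frame[:-4]
    received = int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) & 0xFFFFFFFF == received:
        return payload
    return None


# Usage: a clean frame passes, a frame with one flipped bit is rejected.
frame = add_checksum(b"hello")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
clean_result = verify_frame(frame)
corrupted_result = verify_frame(corrupted)
```

A receiver that gets `None` would typically ask for retransmission, which is one simple way a link can recover from noise.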

In addition to the above techniques, which are also used for single computer systems, the following major reliability problems are faced by computer networks and warrant special treatment (7):-

i) Loss of data and control information as it passes through the range of network components;

ii) Propagation of an error at one node to other nodes of the network;

iii) The probability of failure of components is higher than for an isolated computer system due to many geographically distributed components, including communication components, being subjected to noise and other environmental problems;

iv) Complexity of network control algorithms and complex interdependencies characterizing networks add to the cost of detection, diagnosis and recovery operations as compared to single computer systems;

v) In large networks, conventional approaches to deadlock detection and prevention are not economically feasible due to the number of messages that are necessary to recover and resynchronize, introducing a large time delay;

vi) No doubt, computer network reliability can be increased by the presence of multiple, reasonably autonomous processors, but in a completely heterogeneous computer network it is usually very difficult for one processor to take over the functions of a dissimilar processor.

IV. NATURE OF MALFUNCTIONS

Three main types of error sources in a computer system are: a mistake in design or implementation; a failure in a component; and an error introduced by a human user or operator. The individual nodes of a computer network are subject to the same types of errors and failures as isolated computer systems. In addition, there are problems of particular concern in a computer network, as given below:

i) Noise on communication links;

ii) Failure of communication components;

iii) Incorrect design or implementation of communication protocols;

iv) Loss of data or control information as a result of network congestion or mistakes in routing algorithms;

v) Loss of synchronization between related processes distributed across the network;

vi) Network deadlock and lockup.

Noise on communication links is one of the most obvious sources of error, but error rates are considerably lowered by the use of digital transmission facilities in place of analog facilities. Link reliability is quite similar to processor reliability, but occasionally failure of communication links may not be as frequent as that of hardware or software component failures in a network.

Network performance, reliability, and availability are sensitive to design, implementation, and interaction of the various levels of protocols in a computer network (31). Routing is the decision process by which a path through the network from source to destination is determined. The objective is to find the best path keeping in view the reliability, availability and performance. Congestion, resulting from more traffic being offered to a node than it can handle, leads ultimately to degradation in performance in the form of increased delay, or data arriving out of sequence.

Loss of synchronization of activities distributed across the network forms a common source of error. Achieving proper synchronization of activities in a network and detecting synchronization errors are both more difficult than in a single isolated computer system, where it is possible to determine the status of all processes.

The occurrence of lock-up implies that nothing can proceed in the portion of the network affected, but the concept of deadlock is not as well defined and understood for computer networks as for single computer systems.
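The routing objective described above (finding the best path with reliability in view) can be sketched as a shortest-path search. The example below is illustrative only and is not taken from the paper: it assumes each link has a known reliability, that links fail independently (so a path's reliability is the product of its link reliabilities), and that a Dijkstra-style search maximizing that product is acceptable. The function name `most_reliable_path` and the sample topology are hypothetical.

```python
import heapq


def most_reliable_path(links, source, destination):
    """Find the path with the highest product of link reliabilities.

    links maps an undirected pair (u, v) to a reliability in (0, 1].
    Returns (path, reliability), or (None, 0.0) if no path exists.
    """
    graph = {}
    for (u, v), r in links.items():
        graph.setdefault(u, []).append((v, r))
        graph.setdefault(v, []).append((u, r))

    best = {source: 1.0}       # best known reliability to each node
    previous = {}              # back-pointers for path reconstruction
    heap = [(-1.0, source)]    # max-heap emulated with negated values
    while heap:
        neg_rel, node = heapq.heappop(heap)
        rel = -neg_rel
        if node == destination:
            break
        if rel < best.get(node, 0.0):
            continue           # stale heap entry
        for neighbour, r in graph.get(node, []):
            candidate = rel * r
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(heap, (-candidate, neighbour))

    if destination not in best:
        return None, 0.0
    path = [destination]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, best[destination]


# Usage on a small hypothetical network of four nodes.
links = {
    ("A", "B"): 0.9, ("B", "D"): 0.9,
    ("A", "C"): 0.99, ("C", "D"): 0.8,
    ("A", "D"): 0.7,
}
path, reliability = most_reliable_path(links, "A", "D")
```

Because every reliability is at most 1, extending a path can never increase its product, which is the property that makes the greedy Dijkstra-style search valid here. A practical router would weigh delay and load as well, as the text notes.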

Four forms of deadlock commonly identified are (7):-

i) Deadlock may occur when user's processes can request resources at distant hosts;

ii) If protocols are improperly defined, they may contain situations in which sender and receiver wait for each other to act before proceeding;

iii) Store-and-forward lockup occurs when two or more switches have all their buffers full of messages for other members of the deadlocked set of switches;

iv) Reassembly lockup occurs when all reassembly buffer storage is used by partially reassembled messages, resulting in messages being prevented from being received by their destinations.

V. HARDWARE RELIABILITY ISSUES

Hardware reliability, as considered from the designer's view, contributes in the fields of: defining, observing and recording failures; isolating failure mechanisms; characterizing quantitative failure data; determining the dependence of component failure rates on controllable variables; determining achievable limits of component reliability and survivability as a function of structures; developing theory to compound component reliability into subsystem and system reliability; optimizing the distribution of unreliability; etc. These points make hardware reliability theory a useful design tool.

System hardware implementation is such that integrated circuits, mostly MSI/LSI mixtures, are incorporated into circuit packs, usually on double-sided printed circuit boards. The design philosophy is based on careful and thorough engineering using proven materials and manufacturing processes, thus preventing devices of poor reliability getting into the design.

The advent of low-cost LSI devices is revolutionizing design, in that simple architectures realised through fairly unreliable components are giving way to highly complex designs implemented with very reliable devices. Hence, from a reliability engineer's viewpoint, it is now more important to study how a digital system fails rather than how frequently it may fail, i.e. failure modes and effects analysis (FMEA) is a more fruitful exercise than component failure rate calculations.

Modern digital devices are operating at such speeds that hardware implementation becomes not only a logic design, but an RF design as well. Hence careful attention has to be given to the circuitry layout to reduce reflections, crosstalk, coupling, etc., in the development; failure to do so would result in random system failures in the field due to RF phenomena.

In evaluating the hardware system reliability, the most significant parameters are: failure rate (assumed to be constant); fault coverage (probability of the system continuing to perform satisfactorily when a hardware fault has occurred); and repair time (exponential, log-normal, or Weibull distribution). These parameters have to be estimated early in the development cycle in order to predict system behaviour. In earlier hardware systems, the reliability engineer devoted a major part of his time to analyzing individual component stresses to arrive at failure rates, but the advent of LSI hardware has made such a rigorous exercise undesirable.

Hardware fault coverage, or the degree of automatic recovery capability, plays a very critical role in system performance but is also a very difficult parameter to estimate, even after the completion of design. Computations of hardware reliability indicate that an improvement in coverage from 90% to 99% increases the hardware MTBF by almost an order of magnitude in duplicated systems (27). Realization of high coverage calls for carefully designed fault-tolerant hardware and sophisticated diagnostics software. Since a substantial development effort is needed to achieve a very high coverage, the reliability engineer must set a realistic coverage target after a careful analysis of system requirements. The system hardware architecture and the fault detection and recovery mechanisms must then be properly engineered to meet the set objective.
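The sensitivity of a duplicated system to fault coverage can be checked numerically. The sketch below is an assumption-laden illustration, not the computation of ref. (27): it assumes a textbook three-state Markov model of a duplex system (both units up, one unit up, system failed) with constant per-unit failure rate, exponential repair, and coverage `c` applied to the first failure. The function name `duplex_mttf` and the numerical rates are hypothetical, and the quantity computed is the mean time to system failure.

```python
def duplex_mttf(failure_rate, repair_rate, coverage):
    """Mean time to system failure of a duplicated (two-unit) system.

    Assumed model: from 'both up', a covered failure (rate 2*lam*c)
    leads to 'one up', an uncovered failure (rate 2*lam*(1-c)) brings
    the system down. From 'one up', repair (rate mu) restores 'both up'
    and a second failure (rate lam) brings the system down.
    """
    lam, mu, c = failure_rate, repair_rate, coverage
    return (lam + mu + 2 * lam * c) / (2 * lam * (lam + mu * (1 - c)))


# Hypothetical rates: one failure per 10,000 h per unit, 1 h mean repair.
lam, mu = 1e-4, 1.0
mttf_90 = duplex_mttf(lam, mu, 0.90)
mttf_99 = duplex_mttf(lam, mu, 0.99)
improvement = mttf_99 / mttf_90  # close to 10 when repair is fast
```

Under these assumed rates the ratio comes out a little under ten, in line with the "almost an order of magnitude" statement above. When repair is much faster than failure, the expression is dominated by the uncovered-failure term 2·lam·(1−c), which is why coverage matters so much.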

VI. SOFTWARE RELIABILITY ISSUES

Software reliability can be improved by preventing errors from occurring; by detecting errors as soon as possible after they occur; and by designing and implementing the system so that it attempts to continue to provide service in spite of malfunctions while corrective and repair actions take place. Techniques employed for prevention of errors are: structured design and implementation of programs; proof of correctness of critical modules; and an organized procedure for testing.

The most commonly used technique for producing a reliable software system is that of structured programming, i.e. constraining the flow of control to eliminate errors resulting from the poor logical construction of the program. Structured programming is a concept that encompasses programming team management, design methods, and programming technology.

A good way of structuring the design of a system is as a hierarchy of "Levels of abstraction". Levels of abstraction provide a conceptual framework for achieving a clear and logical design for a system. A hierarchy is an effective means to decompose a system into a small number of pieces. In the design of a hierarchical system, both TOP DOWN and BOTTOM UP approaches are used.

In the top-down design, first the highest level, which provides the features of the system desired by the user, is designed. During its design, the need for lower levels is identified; these are then designed, and so on, until the lowest (and last) level requires only facilities that can always be provided by the hardware. "Top down" design thus proceeds by successive refinements of general functions into successively more detailed operations until arriving at the component modules or programs that must be implemented to build the system. In the BOTTOM UP approach, the first (lowest) level uses only the facilities provided by hardware, while each successive level is designed to provide added facilities using those provided by lower levels. The top-down approach is preferred by the designer who has faith in his ability to estimate the feasibility of constructing a component to match a set of specifications, while the bottom-up approach is preferred by the designer who prefers to estimate the utility of the component that he has decided he can construct (16).

Three approaches receiving popular attention in the literature are: stepwise refinement; functional decomposition; and programming by action clusters. In stepwise refinement the starting point is an abstract machine which, if implemented, would solve the whole problem. In functional decomposition the starting point is a dissection of the whole problem. In programming by action clusters, the starting point is recognition of clusters of actions that guarantee that associated requirements of the problem are met. Broadly, the first two approaches (stepwise refinement and functional decomposition) can be characterized as top-down, while the third as bottom-up.

Another approach, known as "The Jackson Design Methodology" (23), which is gaining acceptance in U.S. and, more recently, in Europe, is based on the idea that the structure of the program designed to accomplish the functional requirements of the system closely follows the structure of the problem domain. The Jackson design methodology appears to be applicable to a significant class of problems which are heavily oriented to input/output, such as many common commercial data processing applications.

Reliability can also be improved by designing the programs making use of the virtual machine (VM) concept. A "VM" is a hardware-software duplicate of a real existing computer system in which a statistically dominant subset of the virtual processor's instructions are executed directly on the host processor in native mode.

The virtual machines are created by a small "virtual machine monitor" which, because it is small, can be made more reliable than a large, general purpose operating system.

Establishing a "Proof of Correctness" is the only sure way to be certain about the correct functioning of software systems, but there is, at least for the present, no foolproof way of proving any program correct using mathematical logic. Debugging is not sufficient since it shows only the presence of errors and not their absence. A good approach is to establish a proof of correctness only for those parts of the system which are believed to be "critical", making use of the traditional debugging techniques for the remainder of the system. Direct application of the present correctness-proving techniques is quite difficult, but this can be overcome by properly modularizing the system first and then proving individual modules correct, which may then be used to establish the correctness of the complete system. An exhaustive set of test cases can be determined by making use of analytical methods. If this set of test cases can be proven to be exhaustive and the program processes them correctly, the program is then said to be correct. THE multiprogramming system for EL X8 (36) explains the use of the proof of correctness approach in the design of an operating system, while an interesting variation of the method in the design of large reliable programs is given by Mills (37). Extensive use of proofs of correctness is restricted on account of a number of other difficulties, as explained in ref. (16), in addition to the considerable effort required.

On account of the various difficulties encountered with establishing the proof of correctness, the program testing approach to increase the reliability of software systems is maturing rapidly. Program testing is defined as the process of executing programs with representative input data or conditions, for which the correct results are known, to determine whether any incorrect results occur. Whereas program proving is a reductive process, program testing is an affirmative process since everything done in testing can potentially contribute information about the quality of the program being tested. Program testing techniques are based on an amalgam of methods drawn from graph theory, programming languages, reliability assessment and reliable testing theory. Since a justified discussion of this fast developing technology will itself require a full-length paper, we make no attempt in this paper to give an exhaustive treatment of the program testing art and recommend the interested reader to ref. (11) by Miller, wherein an excellent presentation of the cross-section of program testing technology - ranging from philosophical issues to research and development concepts - is given by dividing program testing technology into six primary areas: Philosophy of testing; Theoretical foundations; Tools and Techniques; Measurement and Planning; Management and Control; Research and Development.

We emphasize that to improve software reliability, it is necessary to devise methods of planning and measurement that are appropriate to specific testing methods, and which are technically sound and economically viable. Use of structured testing, i.e. organizing a series of tests in a rational manner so that the testing activity runs smoothly and efficiently, is highly recommended. General guidelines to achieve structured testing can be summarized as: adopting specific criteria to govern unit testing of all programs; scheduling progressive tests to build up to a representative full system test; using program analyzers to assure that all program functions have been exercised; using a fault reporting process to manage debugging and testing; conducting regression tests after program rework has been done; supplementing integration testing with system validation reviews; and planning a shakedown period after delivered software is installed (3).
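The notion of an exhaustive set of test cases, mentioned in the discussion of proofs of correctness, is practical only when the input domain is very small. The following is a minimal sketch under that assumption and is not an example from the paper: a hypothetical bit-folding parity routine, `parity_fast`, is checked against a slower but obviously correct reference over all 256 possible byte values. If the reference is accepted as the specification, passing every case establishes correctness for this domain.

```python
def parity_fast(byte: int) -> int:
    """Parity (1 if an odd number of bits are set) by XOR folding."""
    byte ^= byte >> 4
    byte ^= byte >> 2
    byte ^= byte >> 1
    return byte & 1


def parity_reference(byte: int) -> int:
    """Slow but transparent reference: count the set bits directly."""
    return bin(byte).count("1") % 2


def exhaustive_check():
    """Return every input in 0..255 on which the two routines disagree."""
    return [b for b in range(256)
            if parity_fast(b) != parity_reference(b)]


# Usage: an empty list means the whole 8-bit input domain was covered
# without a single disagreement.
failures = exhaustive_check()
```

The same script, kept and rerun after every change to `parity_fast`, doubles as a regression test in the sense of the guidelines above. For realistic programs the input space is far too large for this, which is why representative rather than exhaustive test data are normally used.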

In spite of the care taken in designing, implementing, and testing a software system, errors do occur during execution, and as such an important way to improve the system reliability is to quickly detect a malfunction so as to minimize the damage caused and effect a rapid recovery. Detection of malfunctions is carried out by observing the behaviour of the computer system and comparing the same with the information that constitutes proper system behaviour. A sequence of states described by the computer system during its execution phase is used to characterize the system's behaviour. The state of the system is represented by state variables such as program status indicators; process status indicators; I/O status indicators; communication status indicators; and memory contents. Such observations may be made continuously, periodically, or only when trouble is suspected, by making use of techniques such as (16):- observation of system state to detect an invalid state or state sequence; observation of data and data structures; observation of characteristic performance measures (e.g. response time, time required to perform a standard function) and comparing these with already established threshold values; and use of software and hardware protection mechanisms, i.e. detecting when an attempt is made to execute an instruction that contains an invalid operation code or address, or that violates a protected portion of hardware or software.

The observations as discussed above may be classified as internal observations or external observations, depending on whether the observer and the redundant information used form a part of the system or are external to the system, respectively. External observation tools are: hardware monitors; software or firmware monitors; and hybrid monitoring systems consisting of both hardware and software. The idea is to set up a monitoring system to watch for telltale signs of system malfunctions, such as performance degradation and unexpected or invalid sequences of events or states.

Internal observation tools include "in-line checks", "audit programs" and "watchdog timers". "In-line checking" improves the software system reliability by including code in the system to check the validity of data structures each time these are processed by system routines. "Audit programs" sample rather than continuously observe the system's behaviour, and require less overhead than in-line checking. The idea with "watchdog timers" is that a timer is set to sound an alarm after a time sufficient for the system to perform its function, unless something goes wrong. Self-checking techniques (i.e. a software system is made to check its own operation to some extent by having two separate algorithms perform the same function and then comparing results) are associated with the drawbacks of doubling the size of the program and halving its execution speed, and thus are not of much practical value. The structured management technique (4) can be successfully used to reduce the amount of information that must be observed to detect malfunctions and to organize the observations. In short, redundancy is a key to error detection and is provided in the form of robust data structures and information about what constitutes expected behaviour of software.

Diagnosis is necessary to know the extent of the damage and the probable cause of the malfunction, to be able to carry out the appropriate repairs needed for an effective recovery. The generalized diagnostic approach involves collection of symptoms of the problem; clues to its origin; observation of damage caused; and then isolation of a set of probable causes by using a "maintenance dictionary". Diagnosis is usually carried out in three phases: survey of damage; study of the event sequence; and use of the maintenance dictionary. Diagnosis may be carried out manually, semiautomatically or automatically.
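Two of the internal observation tools described above, the watchdog timer and the in-line check, can be sketched in a few lines. This is an illustrative sketch only, not code from the paper: the class names `WatchdogTimer` and `CheckedQueue` are hypothetical, the clock is injectable so the behaviour can be exercised without real waiting, and the in-line check assumes a queue that keeps a redundant element count alongside its data.

```python
import time


class WatchdogTimer:
    """Raises an alarm condition unless it is 'kicked' often enough."""

    def __init__(self, timeout, now=time.monotonic):
        self.timeout = timeout
        self._now = now                      # injectable clock
        self._deadline = now() + timeout

    def kick(self):
        """Called by the monitored task each time it completes its work."""
        self._deadline = self._now() + self.timeout

    def expired(self):
        """True once the allowed time has passed without a kick."""
        return self._now() >= self._deadline


class CheckedQueue:
    """Queue with a redundant count, validated on every operation."""

    def __init__(self):
        self._items = []
        self._count = 0                      # redundant information

    def _inline_check(self):
        if self._count != len(self._items):
            raise RuntimeError("queue structure corrupted")

    def put(self, item):
        self._inline_check()
        self._items.append(item)
        self._count += 1

    def get(self):
        self._inline_check()
        self._count -= 1
        return self._items.pop(0)


# Usage with a fake clock: the alarm is raised only when kicks stop.
clock = [0.0]
watchdog = WatchdogTimer(5.0, now=lambda: clock[0])
clock[0] = 4.0
watchdog.kick()              # task reported in time; deadline moves to 9.0
clock[0] = 8.0
still_healthy = not watchdog.expired()
clock[0] = 9.0
alarm_raised = watchdog.expired()
```

The redundant count in `CheckedQueue` is a small instance of a robust data structure: the check costs a comparison per operation, which is the overhead that audit programs avoid by sampling instead.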

Automaticity adds to the complexity and cost, and introduces more sources of errors of its own, but is faster and more efficient as compared to the manual approach. The general tendency is to make a compromise and thus to have semiautomatic means.

With the malfunction having been detected, and the data necessary for diagnosis collected and analysed, an attempt is made at correction and recovery operations, which may be done manually, semiautomatically or completely automatically depending on the nature of the system where it is used, e.g. essential services may like to have automatic operation as far as possible to reduce the down time. More commonly used techniques may be:-

i) In case little or no damage results from the malfunction, the same is ignored and operation is allowed to continue while maintenance personnel work side by side to carry out diagnosis and take corrective actions;

ii) If the malfunction is the result of timing problems, then the same will be overcome by retrying the operation but with a changed order of events;

iii) Roll back to the most recent check point and restart from there;

iv) Redundancy is made use of to reconstruct or correct data structures, and then steps (i) to (iii) are applied;

v) The system may be re-initialized by either a "Warm" or a "Cold" restart.

Extent of damage, probable cause of malfunction, nature of malfunction and cost decide about which of the available [...]

[...] these tasks successfully, the reliability engineer will have to develop a new approach to his work; the classical "hardware" approach based on component complexity and stress analysis will not work in software engineering. The size of the software module has a dramatic effect on the design and verification effort necessary to assure reliability in large software systems. It is still very much an art to produce complete requirement specifications and partition the design into modules and subsystems which have minimal side effects and, therefore, reduced test requirements. A knowledge of software science (25) on the part of the reliability engineer can play a vital role.

Table I summarizes our review of techniques and tools which have been formulated to produce reliable software for computer networks. We strongly feel that if a few of the recommendations as given in ref. (3) are practised in the respective phases of the software development life cycle, then one can achieve the goal of producing reliable software.

VII. COMPUTER COMMUNICATION SYSTEM RELIABILITY ISSUES

Having discussed the issues of hardware and software reliability, which may be applicable to isolated computer systems, e.g. individual switches and hosts in a network, we discuss methods of improving the reliability of computer communication systems in computer networks. These methods can be discussed according to whether they prevent, detect, diagnose,

correct, or recover f r o m errors, failures,

corrective and recovery technique be

deadlocks,

a p p l i e d f o r a g i v e n t y p e of m a l f u n c t i o n . A r e l i a b i l i t y e n g i n e e r in his a t t e m p t

or lockups.

Failures are prevented in two w a y s : C a r e in design and implementation of

to produce reliable software must under-

c o m m u n i c a t i o n protocols;

take the key tasks such as : formulatinn

congestion.

of s p e c i f i c , m e a s u r e a b l e

reliability and

and controlling

K e y s to a sophisticated

design and implementation of c o m m u n i c a -

test objectives;

d e t a i l e d r e v i e w of t h e

tion protocols are s u m m a r i z e d

logic,

and fault-tolerance

i)

design;

structure

of t h e

and p a r t i c i p a t i o n in the v e r i f i c a -

tion and integration testing.

To execute

as below :-

Objective in design of a set of protocols should be to m i n i m i z e interactions between levels of protocols

which help in carrying out design, implementation, validation, testing, and recovery procedures independently for each level;

2) A protocol be designed to perform the functions of error control, flow control, multiplexing and synchronization separately, without deadlocks and any signs of undesired nondeterministic behaviour;

3) Protocols be specified in such a way as to allow easy implementation, and techniques similar to proof of correctness methods be used to verify their functioning;

4) Models of protocol function like the "Graph Model of Computation" (2) be used for formal specification, correct implementation and verification of protocols;

5) In order to implement a deadlock detection or avoidance algorithm, it is necessary to construct a global description of network resource requests and allocations, either at some central location or at the node attempting a resource request;

6) Deadlock prevention techniques are preferred over the deadlock avoidance and detection techniques in view of overhead costs. Typically, a communication subnet should use a mix of strategies for deadlock, using prevention wherever possible and other techniques where prevention can't be used. An optimal scheme for deadlock detection and recovery by pre-empting some deadlocked processes has been explained by Hutchinson et al. (6).

Prevention of failures by controlling congestion in the communication subnet is based on the idea of imposing a limit on the size of output queues in switch nodes. This limit depends on the level of traffic flow through the switch and is chosen to minimize the probability of having to drop packets, or is sometimes approximated by setting the output queue limit equal to the total number of packet buffers in the switch divided by the square-root of the number of output links.

Regardless of the care and sophisticated design techniques used for preventing failures in a communication network, failures do occur and thus a quick detection, diagnosis, correction and recovery from failures, wherever they occur, is needed to increase the reliability of the communication network. Methods used for detection of failure of software and hardware in individual switches and hosts of a computer communication network are the same as for isolated computer systems, as discussed under the sections on software reliability issues and hardware reliability issues. Failures in communication can be detected by implementing the following techniques :-

1) Detection of network malfunctioning may be carried out by observing periodically the state of all nodes and links in the system and reporting malfunctions to a network control center for diagnosis;

2) A few examples of this approach are the Engineering and Administrative Data Acquisition System of Bell Telephone Laboratories, used to collect and analyze data on telephone traffic (20); and Data Route, a nationwide digital data network implemented by the Trans-Canada Telephone System (33);

3) Distinction between failures caused by transmission errors or by malfunctions of the originating node is made by using error detecting and correcting codes like parity, cyclic redundancy check codes and Hamming Codes;

4) Nodes may check erroneous operation of other nodes by receiving an incorrect message from the node, or by failing to receive an expected message within a reasonable time, or by detecting the failure of a node to accept a message. This amounts to error detection by using redundancy provided by protocols and by time-out procedures;

5) When network control algorithm operations such as flow control, congestion

control, or routing, are distributed throughout the network, then failure detection is encountered with special problems. Adaptive routing in ARPANET (17) has associated with it the difficulty that an error in one node can cause propagation of invalid routing information to other nodes, and no effective way except for checks at the original node in error has been found out as yet. Another cost-effective technique, isarithmic flow control (in which packets not being used to transfer data may travel through the network as "empties"), is associated with the difficulty that the supply of packets is a network-wide resource which is not under the direct control of any node. A resource allocation algorithm (8) can be used to avoid this latter problem to some extent;

6) Designing and implementing data structures in a way to carry redundant structural information for computer networks with distributed data bases (7).

Diagnostic techniques needed to know the extent of damage and probable cause of malfunction are the same as for isolated computer systems, already considered, in case failure detection indicates an error in a node (individual switch, host) independent of the rest of the network. However, when a node detects an error in another node or in a communication line then the following situations arise :-

1) Error detected is a complete lack of communication with an adjacent node - diagnosis determines whether the distant node or the communication line has failed. Communication may be attempted by any alternate route;

2) Problem detected is with a specific line - diagnosis determines whether the line itself or some part of the line interface is at fault by making use of hardware which allows the line to be looped back into the modem;

3) Node is still communicating with other nodes but does not seem to be functioning properly - diagnosis may be carried out either by sending test messages and examining the responses or by sending a special message which will cause the receiving node to initiate its own diagnostic procedures.

Another course of action in case of a node detecting an error in another node may be to inform maintenance personnel of the problem and temporarily [...] to gather further data about the problem [...] maintenance personnel remove the problem. A maintenance dictionary can be used as a reference to find probable causes and cures which worked in the past and can be made available as a data base at a network maintenance center. This data base is updated whenever causes and cures are discovered. Pattern recognition techniques coupled with an on-line maintenance dictionary can provide a very sophisticated automatic diagnosis. Since automaticity adds to complexity and cost but is faster and efficient, careful consideration has to be given in new systems in choosing whether the diagnosis functions may be performed manually or automatically.

Once the detection and diagnosis stages are over, the following techniques, depending on the situation, may be employed for error correction and recovery in computer-communication networks :

1) Methods of recovering from software errors in single nodes be treated in a manner as that discussed earlier by treating the node as a single computer system. Recovery Blocks as studied by Randell (35) may also be used;

2) An important factor is the presence of multiple reasonably independent central processors in a network from the error recovery point of view. Since there is a very small probability of failure of all the nodes simultaneously, in the event of failure of a node either due to hardware or software problems, some other node is available to restart it. Special hardware in the communication interface for each processor enables the restarting by another processor.

For a processor to be restarted remotely, [...] is included to [...]. Restart of a completely failed processor may be accomplished by use of a message which would simply cause a processor to execute an initial load sequence from a locally attached storage device, or the processor may be reloaded by sending a special message whose text is a bootstrap routine as described by Binder (38) in a procedure for remotely reloading switch nodes on the ALOHANET. Reloading of the nucleus of a system is possible without disturbing the contents of a system's tables. The failed processor may be restarted by: ignoring the tables altogether; by resuming operation with the assumption that all the system's tables are correct; by running a complete series of data verification programs to check for errors in the system's tables before resuming operation and repairing the tables found to be in error. An interesting technique for making limited use of the system's tables in a restarted processor has been proposed for the Distributed Computing System by Farber (8);

3) In case a link or node fails which can't be recovered by automatic action, then the network is made to adapt the operation without the unusable components either by exploiting the topological design of the network to avoid disconnection or by modifying the routing to avoid failed components. Network topology analysis for a minimum cost configuration having the capability to remain connected in spite of link or node failures is carried out by assuming either the randomness of failures or failures being caused by an intelligent enemy who knows the structure of the network. Modifying the routing through the network to handle the failed components depends on the type of routing, e.g. random, adaptive or fixed;

4) A simplified recovery procedure is to make the network "forgiving" of erroneous or omitted actions by individual switch nodes by providing hardware redundancy in the design of a network allowing communication failures, as used in ARPANET (17);

5) In some cases, the impact of switch node failures can be decreased by the use of "bypass switches", i.e. switches which can be remotely activated to cause traffic from one line connected to a node to flow directly onto another line;

6) In case of loop networks, the communication on the loop depends on the correct functioning of all lines and all line interfaces, so a serious problem in any interface or link could disable the entire loop. Special recovery techniques such as use of bypass switches to carry data past disabled interfaces (this scheme fails in case of link failure); use of duplicate cabling; or a combination of these two techniques are applied in case of loop network failures.

Table II summarizes the above review of techniques and tools.

VIII. SUMMARY AND CONCLUSIONS

The problem of producing reliable computer-communication networks is considered as composed of three different parts: Hardware, Software and Communication network. The issues involved and some techniques for achieving reliable hardware, reliable software and reliable communication networks have been comprehensively reviewed and discussed by classifying these as prevention, detection, diagnosis, correction and recovery methods. The following observations can be drawn to conclude the discussions.

1) Hardware reliability is improved by conservative design; carefully implementing the design using reliable components; carrying out thorough initial

and periodic testing; redundancy within units and possibly through the use of redundant units and external observations.

2) Software reliability is achieved through structured and careful design, implementation and verification; effective use of redundancy in the form of robust data structures; and observing the expected behaviour of software through internal and external observation tools.

3) Communication system reliability is maximized by using error-detecting and correcting coding of the transmitted information; monitoring facilities to detect communication equipment failures; and using redundant facilities to provide backup.

4) Overall system reliability of a computer-communication network is maximized by placing emphasis on the network topology design that allows errors to be detected without the overall knowledge of the network's status during instantaneous failures;

5) Special reliability problems result due to some characteristics of networks while certain other characteristics are also helpful to make the networks more reliable;

6) Ability of computer networks to perform in a manner so as to minimize the effect of a single failure provides a major opportunity for its reliability improvement; Methods available are only ad hoc and have not yet been subjected to cost-effective analysis criterion; As costs increase rapidly as the ideal system is approached, especially when the last few tenths of a percent of unreliability are eliminated, organizations should be satisfied with the level of reliability needed for satisfactory operation depending on the needs of the organization rather than attempt to achieve an eternal machine; Detailed investigations are required to be carried out regarding: (i) Using of microprocessors for continuous monitoring and periodic reporting of the status of links or nodes, (ii) Data mangling to determine robustness of data structures and software system, (iii) Dividing large networks into mutually supportive groups of nodes and finding cost-effective ways of detection and correction of errors.

REFERENCES

1. A. Avizienis, "Fault-Tolerant Computing: An overview", IEEE Computer, Vol. 4, No. 1, Jan-Feb. 1971, pp. 5-8.

2. B.J. Postel, "A Graph Model Analysis of Computer Communication Protocols", Computer Science Department, University of California, Los Angeles, Jan. 1970.

3. D.W. Fife, "Computer Software Management: A Primer for Project Management and Quality Control", Computer Science and Technology, NBS Special Publication 500-11, U.S. Department of Commerce, 1977.

4. D.E. Morgan and D.J. Taylor, "A Survey of Methods of Achieving Reliable Software", IEEE Computer, Vol. 10 [...].

5. D.W. Davies, "The Control of Congestion in Packet-Switching Networks", IEEE Trans. on Communications, Vol. 20, No. 3, June 1972, pp. 546-550.

6. D.A. Hutchinson, S.A. Mahmoud and J.S. Riordon, "A Recursive Algorithm for Deadlock Pre-emption in Computer Networks", Information Processing 77, Proc. IFIP Congress '77, Toronto, Aug. 1977, pp. 241-246.

7. D.E. Morgan, D.J. Taylor and G. Custeau, "A Survey of Methods for Improving Computer Network Reliability and Availability", IEEE Computer, Vol. 10, Number 11, Nov. 1977, pp. 42-51.

8. D.J. Farber, L.C. Kenneth, "The Structure of a Distributed Computer System-Software", Presented at the Symposium on Computer Communication Networks and Teletraffic sponsored by the Polytechnic Institute of Brooklyn, Microwave Research Institute, 1972.

9. F. Boesch (ed.), "Large-scale Networks: Theory and Design", IEEE Press, 1976.

10. E. Hansler, G.K. McAuliffe, and R.S. Wilkov, "Optimizing the Reliability in Centralized Computer Networks", IEEE Trans. on Communications, Vol. 20, No. 3, June 1972, pp. 640-644.

11. E.F. Miller (editor), Program Testing Techniques. IEEE Computer Society publication, 1977. IEEE Catalog No. EHC 130-5.

12. G.L. Fultz and L. Kleinrock, "Adaptive Routing Techniques for Store-and-Forward Computer Communication Networks", NTIS, Report AD-727-989, July 1972.

13. H. Frank, I.T. Frisch, Communication, Transmission, and Transportation Networks; Addison Wesley, 1979.

14. H. Frank et al., "Topological Considerations in the Design of the ARPA Network", AFIPS Conf., Proc. 1970, SJCC, Vol. 36, pp. 581-587.

15. H. Frank, "Providing Reliable Networks with Unreliable Components", Data Networks: Analysis and Design, Proc. Third Data Communication Symposium, St. Petersburg, FL, Nov. 1973, pp. 161-164.

16. Inder M. Soi, "Some aspects of Reliable Software Packages", M.Sc. (Engg.) Thesis, 1978, Kurukshetra University, Kurukshetra, India.

17. J.M. McQuillan and D.C. Walden, "ARPANET Design Decisions", Computer Networks, Vol. 1, No. 5, September 1977.

18. J. Martin, Systems Analysis for Data Transmission, Englewood Cliffs, N.J., Prentice-Hall, 1972.

19. J.W. Suurballe, "Disjoint Paths in a Network", Networks, Vol. 4, No. 2, 1974, pp. 125-145.

20. J.A. Grandle, R.E. Machol, "EADAS A New Traffic Collection [...]", [...] Record, Dec. 1975, New Orleans, LA, Vol. 1, pp. 7-21 to 7-24.

21. L. Kleinrock, "Analytical and Simulation Methods in Computer Network Design", AFIPS SJCC, Proceedings, Vol. 36, (1970) p. 569.

22. L. Kleinrock, Communication Nets: Stochastic Message Flow and Delay. N.Y., McGraw-Hill, 1964.

23. M. Jackson, "The Jackson Design Methodology", Infotech State of the Art Report, Structured Programming, published in Infotech International, U.K.

24. M. Gerla, "The Design of Store-and-Forward (S/F) Networks for Computer Communications", NTIS Report AD-758-704, Jan. 1973.

25. M.H. Halstead, "Elements of Software Science", American Elsevier Publishing Co. Inc., 1977.

26. N. Wirth, "Program Development by Stepwise Refinement", CACM, Vol. 14, No. 4, April 1971, pp. 221-227.

27. O. Krten, D. Raha, "Application of new concept in Switching System Reliability", Proc. 1976 International Switching Symposium, 1976, pp. 443Z.

28. Peter Freeman, A.I. Wasserman (editors), Tutorial on Software Design Techniques, IEEE Computer Society publication, 1977, IEEE Catalog No. 76CH1145-26.

29. R. Boorstyn and H. Frank, "Large Scale Network Topological Optimization", IEEE Trans. Comm., Vol. COM-25, No. 1, Jan. 1977, pp. 29-47.

30. R.S. Wilkov, "Analysis and Design of Reliable Computer Networks", IEEE Trans. on Communications, Vol. 20, No. 3, pp. 660-678.

31. S.R. Kimbleton and G.M. Schneider, "Computer Communication Networks: Approaches, Objectives, and Performance Considerations", ACM Computing Surveys, Vol. 7, No. 3, Sept. 1975, pp. 129-172.

32. S. Lin, Introduction to Error-Correcting Codes, Englewood Cliffs, NJ: Prentice-Hall, 1970.

33. S. Frankel, O. Pearce, and W. Chan, "A Minicomputer based Performance Monitoring System for the Data Route", National Telecommunications Conf., Conference Record, Dec. 1975, New Orleans, LA, Vol. 1, pp. 7-21 to 7-24.

34. W. Wulf, "Reliable Hardware/Software Architecture", IEEE Trans. Software Engineering, Vol. 1, No. 2, June 1975, pp. 233-240.

35. B. Randell, "System Structure for Software Fault Tolerance", IEEE Trans. on Software Engineering, Vol. 1, No. 2, June 1975, pp. 220-232.

36. E.W. Dijkstra, "The Structure of THE multiprogramming system", CACM, Vol. 11, No. 5, pp. 341-346.

37. H.D. Mills, "On the Development of Large Reliable Programs", Record 1973 IEEE Symposium on Computer Software Reliability, N.Y., April 30 - May 2, 1973, pp. 155-159.

38. R. Binder, ALOHANET Protocols, The Aloha System, University of Hawaii, Sept. 1974 (Tech. Report B74-7).


TABLE I

Software System

Malfunction is prevented
  -- Structured design and implementation of programs
  -- Organised procedure for testing
  -- Proof of correctness of critical modules

Malfunction occurs

  DETECTION
  -- Observation of system state to detect invalid state or state sequence
  -- Observation of data and data structures
  -- Observation of characteristic performance measures
  -- Software & hardware protection mechanisms

  DIAGNOSIS
  -- Survey of damage
  -- Survey of event sequence
  -- Use of Maintenance Dictionary

  CORRECTION AND RECOVERY
  -- Ignore and continue operation
  -- Retry
  -- Roll back to recent check points & restart
  -- Reconstruct data and data structures then resume
  -- Re-initialize system

GOOD SYSTEM
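As an illustration of the correction and recovery entries of Table I (retry, roll back to a recent check point and restart, re-initialize), a minimal Python sketch follows. It is not taken from the paper: the class `CheckpointedSystem`, the function `run_with_recovery`, the use of `RuntimeError` to signal a malfunction, and the retry count are all assumptions made for the example.

```python
import copy


class CheckpointedSystem:
    """Hypothetical system whose state can be saved and restored."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Record the current state as the most recent check point.
        self._checkpoint = copy.deepcopy(self.state)

    def roll_back(self):
        # Restore the state saved at the most recent check point.
        self.state = copy.deepcopy(self._checkpoint)


def run_with_recovery(system, operation, max_retries=2):
    """Escalate through retry, roll back and restart, then re-initialize.

    Returns a (status, result) pair; a malfunction is modelled as
    RuntimeError raised by the operation (an assumption of this sketch).
    """
    # First attempt plus retries, e.g. for timing-related malfunctions.
    for _ in range(max_retries + 1):
        try:
            return "ok", operation(system)
        except RuntimeError:
            continue
    # Retrying did not help: roll back to the check point and restart.
    system.roll_back()
    try:
        return "rolled_back", operation(system)
    except RuntimeError:
        # Last resort: the caller should perform a warm or cold restart.
        return "reinitialize", None
```

The ordering from cheapest to most disruptive action is a design choice of the sketch; the paper states only that extent of damage, probable cause, nature of malfunction and cost decide which technique is applied.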

TABLE II (a)

Computer-Communication Network

Failure is prevented

  Care in design and implementation of communication protocols
  1. Minimized interaction between levels of protocols.
  2. Avoidance of deadlocks and undesired nondeterministic behaviour in protocol functioning.
  3. Proof-of-correctness like implementation.
  4. Use of "Graph Models" for specification, implementation, and verification of protocols.
  5. Constructing a global description of network resource requests and allocation at some central station.
  6. Deadlock prevention techniques to be preferred over deadlock avoidance techniques.

  By controlling congestion
  -- Technique of imposing a limit on the size of output queues in switch nodes

Failure occurs
  -- Go to Table II (b)
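The congestion-control entry above refers to the approximation given in Section VII: the output queue limit is set equal to the total number of packet buffers in the switch divided by the square root of the number of output links. A small Python sketch of that rule is given below. The function names and the rounding down to a whole number of packets are assumptions of this example, not part of the paper.

```python
import math


def output_queue_limit(total_packet_buffers, output_links):
    """Approximate per-link output queue limit for a switch node.

    Implements limit = B / sqrt(n), where B is the total number of
    packet buffers and n the number of output links. Rounding down
    to a whole packet count is an assumption of this sketch.
    """
    if total_packet_buffers < 0 or output_links < 1:
        raise ValueError("need B >= 0 and at least one output link")
    return math.floor(total_packet_buffers / math.sqrt(output_links))


def accept_packet(queue_length, limit):
    """Accept a packet only while the output queue is below its limit."""
    return queue_length < limit
```

For instance, under these assumptions a switch with 100 packet buffers and 4 output links would limit each output queue to 50 packets.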

TABLE II (b)

Failure occurs
  Yes -- Go to Table I
  No -- Failure is in network

  Detection
  -- Periodic observation of state of all nodes and communication links
  -- Use of error-detecting and correcting codes
  -- By using redundancy provided by protocols and by time-out procedures
  -- Use of adaptive routing and isarithmic flow control in distributed networks
  -- Robust data structures implementation

  Diagnosis
  -- Inform the maintenance personnel
  -- Use of maintenance dictionary
  -- Use of pattern recognition techniques in conjunction with an on-line maintenance dictionary
  -- Use of alternate route or routing
  -- By making use of hardware to loop the faulty line back to modem
  -- Use of special test messages

  Correction and Recovery
  -- See Table II (c)
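Two of the detection entries of Table II (b), error-detecting codes and time-out procedures, can be sketched in Python as follows. This is an illustrative sketch only: the 4-byte CRC-32 trailer, the function names `frame`, `check` and `silent_nodes`, and the representation of node reports as a dictionary of timestamps are assumptions of the example and are not described in the paper.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 check value (4 bytes, big-endian) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes):
    """Return the payload if its CRC matches, otherwise None."""
    if len(framed) < 4:
        return None  # too short to carry a check value
    payload = framed[:-4]
    received = int.from_bytes(framed[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def silent_nodes(last_heard, now, timeout):
    """Time-out procedure: list nodes not heard from within `timeout`.

    `last_heard` maps a node name to the time of its last message.
    """
    return sorted(n for n, t in last_heard.items() if now - t > timeout)
```

A receiving node could discard any frame for which `check` returns None, and report the nodes returned by `silent_nodes` to the network control center for diagnosis.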


TABLE II (c)

Correction and Recovery
  -- Use of recovery blocks
  -- Exploiting the presence of multiple reasonably independent central processors in the network
  -- Ignore and continue operation without the unusable components by employing topological design of network
  -- Making the network "forgiving" of erroneous or omitted actions by individual nodes by using hardware redundancy
  -- Use of "bypass switches"
  -- Use of special recovery techniques for loop networks:
       Use of bypass switches
       Use of duplicate cabling
       Combination of bypass switching and duplicate cabling
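The idea of continuing operation without the unusable components by modifying the routing can be illustrated with a short Python sketch. It is only one possible reading under stated assumptions: the network is modelled as an adjacency dictionary, a breadth-first search stands in for the routing procedure (the paper does not prescribe one), and the function name `route` and its parameters are hypothetical.

```python
from collections import deque


def route(links, src, dst, failed_nodes=frozenset(), failed_links=frozenset()):
    """Find a fewest-hop path from src to dst that avoids failed parts.

    `links` maps each node to a list of neighbouring nodes.
    `failed_links` holds frozenset({a, b}) pairs for failed lines.
    Returns the path as a list of nodes, or None if disconnected.
    """
    if src in failed_nodes or dst in failed_nodes:
        return None
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in links.get(node, []):
            if neighbour in previous or neighbour in failed_nodes:
                continue
            if frozenset((node, neighbour)) in failed_links:
                continue
            previous[neighbour] = node
            queue.append(neighbour)
    return None
```

In a four-node loop A-B-C-D-A, for example, traffic from A to C can still be carried through D when B fails, whereas failure of both B and D leaves A and C disconnected, which is the situation duplicate cabling or bypass switches are meant to guard against.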