Microelectron Reliab., Vol. 21, No. 1. pp. 79-95, 1981. Printed in Great Britain.
0026-2714/81/010079-17$02.00/0 © 1981 Pergamon Press Ltd.
COMPUTER-COMMUNICATION NETWORK RELIABILITY: TRENDS AND ISSUES
INDER M. SOI and K. K. AGGARWAL, Department of Electronics and Communication, Regional Engineering College, Kurukshetra-132119, India
(Received for publication 7 August 1980)
Abstract - People working in computer-communication networks frequently distinguish between the entire computer network and the communications subnet. The former includes the latter plus the terminals, devices, and computers that usually communicate via the subnet. This logically includes the resident intercommunication processes that control or interface with the subnet. This paper presents a comprehensive discussion of issues involved and trends prevailing in producing reliable hardware, reliable software, and reliable computer communication networks, while reviewing techniques available for preventing, detecting, diagnosing, correcting, and recovering from malfunctions.

I. INTRODUCTION

The field of computer-communication networks, once simply a topic of interest to researchers, has now evolved into an effective and widely used architectural strategy for supporting a range of end-user applications. Computer network problems transcend the pure communications problem. Frequently these problems include topological optimization for cost, delay and throughput, routing techniques, flow control, queueing problems, the design of efficient protocols, and interfacing the network with a variety of terminals, devices, and other networks. The complexity of computer communication networks has taken a dramatic upswing, following significant developments in electronic technology such as medium- and large-scale integrated circuits and microprocessors. Indeed, people working in computer communications recognize that the major problem is that of ensuring the reliable flow of information from source to destination. Greater network reliability can obviously be achieved by maximizing the reliability of its constituent components and by exploiting the redundancies that are either inherent due to network topology or can be provided in the network. As computer communication networks are increasingly put into use, the organizations they serve are becoming increasingly concerned about data and network
reliability and availability as they realize that more is accomplished on systems which seldom crash because of malfunctions than on systems which run very rapidly between frequent crashes. Components of a computer-communication network are broadly classified into software, hardware and communications. To achieve greater computer network availability and reliability, one needs to maximize the availability and reliability of its components. This paper discusses at length the errors, failures, deadlocks, and other problems that can adversely affect the reliability of computer networks, and the methods available to improve computer network reliability by preventing, detecting, diagnosing, correcting, and recovering from these problems. Emphasis in this paper is on the importance of network designs that allow errors to be detected without overall knowledge of the network's status. A comprehensive review and discussion of techniques is presented to achieve reliable hardware, software and communication systems. As costs increase quite rapidly as the ideal system is approached, especially when the last few tenths of a percent of unreliability are eliminated, organizations should be satisfied with the level of reliability needed for satisfactory operation depending on the needs of the organization rather than attempt to achieve an external machine.

II. DEFINITIONS OF RELATED TERMS

A computer network is either an interconnection among several computers or of a set of terminals connected to one or more computers. A host is a computer whose function is separate from that of switching data; a node is a computer that primarily is only a switch. A single computer may be both a host and a node. Terminals are interfaces between the user and the computer or network. Hosts, nodes and terminals are joined together by transmission links to form a network. A homogeneous network is one consisting of physically or logically identical processors which are capable of executing copies of the same software system. A heterogeneous network is one that is not homogeneous but may contain homogeneous subnetworks, e.g. ARPANET is a heterogeneous network while the PDP-10's that are on the ARPANET and use the Tenex operating system constitute a homogeneous subnetwork (7). Reliability is the probability that the time between failures in the network (including software, hardware and communications) is greater than some positive value t. Availability is the probability that the network is capable of performing its assigned functions correctly at time T.

III. RELIABILITY PROBLEMS OF A COMPUTER NETWORK

Components of a computer network are broadly classified into software, hardware and communications. To achieve greater computer network availability and reliability, one needs to maximize the availability and reliability of its components by exploiting the redundancies that are either inherent or can be provided in these components. We discuss in brief the methods and approaches available to produce reliable hardware and software and then discuss the reliability problems in communication systems. Hardware reliability is improved by conservative design, using reliable components, careful implementation through initial production of hardware, periodic testing, redundancy within units and possibly the use of redundant units and external observers (1). Keys to software reliability are not only structure and care in design, implementation and verification of software, but also effective use of redundancy in the form of robust data structures, and detailed information about what constitutes expected behaviour of software (4). Reliability of a communication system is maximized by the use of error detecting and correcting coding of the transmitted information, monitoring facilities to detect failures of communication equipment, and redundant facilities to provide backup.
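The definitions of reliability and availability given above, and the remark that redundancy raises the reliability of a network built from software, hardware and communications components, can be illustrated numerically. The following is a minimal sketch only: the constant failure rate (exponential model), the independence of component failures, the steady-state availability formula MTBF/(MTBF + MTTR) and all numeric values are assumptions made for illustration and are not taken from the paper.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """P(time between failures > t), assuming a constant failure rate."""
    return math.exp(-failure_rate * t)


def series(*component_reliabilities: float) -> float:
    """Reliability when every component is needed (no redundancy)."""
    return math.prod(component_reliabilities)


def parallel(*unit_reliabilities: float) -> float:
    """Reliability of redundant units: the group works if any one unit works.

    Assumes the units fail independently of each other.
    """
    return 1.0 - math.prod(1.0 - r for r in unit_reliabilities)


def steady_state_availability(mtbf: float, mttr: float) -> float:
    """Long-run fraction of time the system is able to perform its function."""
    return mtbf / (mtbf + mttr)


# Hypothetical component reliabilities over some mission time.
hardware, software, link = 0.99, 0.95, 0.90

no_redundancy = series(hardware, software, link)                     # about 0.846
with_spare_link = series(hardware, software, parallel(link, link))   # about 0.931
```

Under these assumed figures, duplicating only the weakest component (the communication link) lifts the reliability of the whole chain from roughly 0.85 to roughly 0.93, which is the sense in which redundancy "can be provided" in a component.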
In addition to above techniques used for single computer systems, the following major reliability problems are faced by computer networks which warrant special treatment (7):-
i) Complexity of network control algorithms and complex interdependencies characterizing networks add to the cost of detection, diagnosis and recovery operations as compared to single computer systems;
ii) Propagation of an error at one node to other nodes of the network;
iii) The probability of failure of components is higher than for an isolated computer system due to many geographically distributed components, including communication components being subjected to noise and other environmental problems;
iv) Loss of data and control information as it passes through the range of network components;
v) In large networks, conventional approaches to deadlock detection and prevention are not economically feasible due to the number of messages that are necessary to recover and resynchronize, introducing a large time delay;
vi) No doubt, computer network reliability can be increased by the presence of multiple, reasonably autonomous processors but in a completely heterogeneous computer network, it is usually very difficult for one processor to take over the functions of a dissimilar processor.

IV. NATURE OF MALFUNCTIONS

Three main types of error sources in a computer system are: a mistake in design or implementation; a failure in a component; and an error introduced by a human user or operator. The individual nodes of a computer network are subject to the same types of errors and failures as isolated computer systems. In addition, there are problems of particular concern in a computer network as given below:
i) Noise on communication links;
ii) Failure of communication components;
iii) Incorrect design or implementation of communication protocols;
iv) Loss of data or control information as a result of network congestion or mistakes in routing algorithms;
v) Loss of synchronization between related processes distributed across the network;
vi) Network deadlock and lockup.
Noise on communication links is one of the most obvious sources of error but error rates are considerably lowered by the use of digital transmission facilities in place of analog facilities. Link reliability is quite similar to processor reliability but occasionally failure of communication links may not be as frequent as that of hardware or software components failures in a network. Network performance, reliability, and availability are sensitive to design, implementation, and interaction of the various levels of protocols in a computer network (31). Routing is the decision process by which a path through the network from source to destination is determined. The objective is to find the best path keeping in view the reliability, availability and performance. Congestion resulting due to more traffic being offered to a node than it can handle leads ultimately to degradation in performance in the form of increased delay, or data arriving out of sequence. Loss of synchronization of activities distributed across the network forms a common source of error. Achieving proper synchronization of activities in a network and detecting synchronization errors are both more difficult than in a single isolated computer system, where it is possible to determine the status of all processes. The occurrence of lock-up implies that nothing can proceed in the portion of the network affected but the concept of dead-lock is not as well defined and understood for computer networks as for single computer systems.
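The lock-up condition described above, in which nothing can proceed in the affected portion of the network, can be pictured as a cycle of parties each waiting on the next. The following is a minimal sketch that detects such a cycle by depth-first search. The "wait-for graph" representation, the function name and the example node names are assumptions introduced here for illustration; the paper does not prescribe a detection algorithm. It also presumes that the full wait-for information is available in one place, which, as noted above, is exactly what is hard to obtain in a real network.

```python
from typing import Dict, Hashable, List, Optional


def find_wait_cycle(
    waits_for: Dict[Hashable, List[Hashable]]
) -> Optional[List[Hashable]]:
    """Return the nodes of one waiting cycle, or None if there is no cycle.

    waits_for maps each party (process, switch, host) to the parties it is
    currently waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    colour: Dict[Hashable, int] = {}
    path: List[Hashable] = []         # nodes on the current DFS path

    def visit(node: Hashable) -> Optional[List[Hashable]]:
        colour[node] = GREY
        path.append(node)
        for nxt in waits_for.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:         # back edge: nxt is already on the path
                return path[path.index(nxt):]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle is not None:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for start in list(waits_for):
        if colour.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle is not None:
                return cycle
    return None


# Sender and receiver each waiting for the other to act: a two-party deadlock.
protocol_deadlock = find_wait_cycle({"sender": ["receiver"], "receiver": ["sender"]})

# A simple chain of waits with no cycle can still make progress.
no_deadlock = find_wait_cycle({"A": ["B"], "B": ["C"], "C": []})
```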
Four forms of deadlock commonly identified are (7):-
i) Deadlock may occur when user's processes can request resources at distant hosts;
ii) If protocols are improperly defined, they may contain situations in which sender and receiver wait for each other to act before proceeding;
iii) Store-and-forward lockup occurs when two or more switches have all their buffers full of messages for other members of the deadlocked set of switches;
iv) Reassembly lockup occurs when all message reassembly buffer storage is used by partially reassembled messages, resulting in messages being prevented from being received by their destinations.

V. HARDWARE RELIABILITY ISSUES

The hardware reliability as considered from the designer's view contributes in the fields of: defining, observing and recording failures; characterizing quantitative failure data; isolating failure mechanisms; determining dependence of component failure rates as a function of controllable design variables; determining achievable limits of component reliability and survivability; developing theory to compound component reliability into subsystem and system reliability; optimizing the distribution of system unreliability; etc. These points make hardware reliability theory a useful design tool.
In evaluating the hardware system reliability, the most significant parameters are: Failure rate (assumed to be constant); Hardware fault coverage (probability of system continuing to perform satisfactorily when a hardware fault has occurred); Repair time (exponential, log-normal, Weibull distribution). These parameters have to be estimated early in the development cycle in order to predict system behaviour. In earlier systems, the reliability engineer devoted a major part of his time in analyzing individual component stresses to arrive at failure rates but the advent of LSI hardware has made such a rigorous exercise undesirable.
System hardware implementation is such that integrated circuits, mostly MSI/LSI mixtures, are incorporated into circuit packs, usually on double-sided printed circuit boards. The hardware design philosophy is based on careful and thorough engineering using proven materials and manufacturing processes, thus preventing devices of poor reliability getting into the design. The advent of low-cost LSI devices is revolutionizing design in the way that simple architectures realised through fairly unreliable components are giving way to highly complex structures implemented with very reliable devices. Hence from a reliability engineer's viewpoint, it is now more important to study how a digital system fails rather than how frequently it may fail, i.e. failure modes and effects analysis (FMEA) is a more fruitful exercise than component failure rate calculations. Modern digital devices are operating at such speeds that system hardware implementation becomes not only a logic design, but an RF design as well. Hence careful attention has to be given in the development to the circuitry layout to reduce crosstalk, coupling, reflections, etc. Failure to do so would result in random failures in the field due to RF phenomenon.
Hardware fault coverage or the degree of automatic recovery capability plays a very critical role in system performance but is also a very difficult parameter to be estimated even after the completion of design. Computations of hardware reliability indicate that an improvement in the hardware coverage from 90% to 99% increases the MTBF by almost an order of magnitude in duplicated systems (27). Realization of high coverage calls for carefully designed fault-tolerant hardware and sophisticated diagnostics software. Since a substantial development effort is needed to achieve a very high coverage, the reliability engineer must set a realistic coverage target after a careful analysis of system requirements. The system hardware architecture and the fault detection and recovery mechanisms must then be properly engineered to meet the set objective.
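The sensitivity of MTBF to fault coverage in duplicated systems, quoted above, can be checked with a small calculation. The following is a sketch under stated assumptions and is not the computation of ref. (27): it uses a common textbook Markov model of a duplex system (two identical units, constant failure rate per unit, exponential repair, a fault in the two-unit state is survived with probability equal to the coverage), and the failure and repair rates are hypothetical values chosen only for illustration.

```python
def duplex_mttf(failure_rate: float, repair_rate: float, coverage: float) -> float:
    """Mean time to system failure of a repairable duplicated system.

    States: both units up -> one unit up -> system failed.
    From "both up", a unit fails at rate 2*failure_rate; with probability
    `coverage` the fault is detected and the system carries on with one
    unit, otherwise the system fails at once. From "one up", repair
    (repair_rate) races against failure of the last unit (failure_rate).

    Solving the two mean-time equations
        T2 = 1/(2*lam) + c*T1
        T1 = 1/(lam + mu) + (mu/(lam + mu))*T2
    for T2 gives the closed form below.
    """
    lam, mu, c = failure_rate, repair_rate, coverage
    return (lam * (1.0 + 2.0 * c) + mu) / (2.0 * lam * (lam + mu * (1.0 - c)))


# Hypothetical figures: one failure per 10,000 hours per unit, one-hour repair.
LAMBDA = 1.0e-4   # failures per hour (assumed)
MU = 1.0          # repairs per hour (assumed)

mttf_90 = duplex_mttf(LAMBDA, MU, 0.90)
mttf_99 = duplex_mttf(LAMBDA, MU, 0.99)
improvement = mttf_99 / mttf_90   # close to 10 (about 9.9) for these figures
```

When repair is much faster than failure, the expression reduces to roughly 1/(2 x failure rate x (1 - coverage)), so the uncovered fraction of faults dominates: cutting it from 10% to 1% gives nearly a tenfold gain, consistent with the "almost an order of magnitude" statement in the text.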
VI. SOFTWARE RELIABILITY ISSUES

Software reliability can be improved by preventing errors from occurring; by detecting errors as soon as possible after they occur; and by designing and implementing the system so that it attempts to continue to provide service in spite of malfunctions while corrective and repair actions take place. Techniques employed for prevention of errors are: structured design and implementation of programs; proof of correctness of critical modules; and an organized procedure for testing. The most commonly used technique for producing a reliable software system is that of structured programming, i.e. constraining the flow of control to eliminate errors resulting from the poor logical construction of the program. Structured programming is a concept that encompasses programming team management, design methods, and programming technology. A good way of structuring the design of a system is as a hierarchy of "Levels of abstraction", which provides a conceptual framework for achieving a clear and logical design for a system. A hierarchy is an effective means to decompose a system into successively more detailed operations. Levels of abstraction provide a hierarchical structure. To accomplish the design, both TOP DOWN and BOTTOM UP approaches are used. In the top-down design, first the highest level, the features of the system desired by the user, is designed. During its design, the need for lower levels is identified; these are then designed, and so on, until the lowest (and last) level requires only facilities provided by the hardware. In the BOTTOM UP approach, the first (lowest) level uses only the facilities provided by hardware, while each successive level is designed to provide added facilities using these provided by lower levels. The top-down approach is preferred by the designer who has faith in his ability to estimate feasibility of constructing a component to match a set of specifications while the bottom-up approach is preferred by the designer who prefers to estimate the utility of the component that he has decided he can construct (16).
Three approaches receiving popular attention in literature are: step-wise refinement; functional decomposition; and programming by action clusters. In stepwise refinement the starting point is an abstract program, which if implemented, would solve the whole problem. In functional decomposition the starting point is a dissection of the whole problem into small number of pieces. In programming by action clusters, the starting point is recognition of clusters of actions that guarantee that the associated functional requirements can always be met. Broadly, the first two approaches can be characterized as top-down, while the third as bottom-up. "Top down" design thus proceeds by successive refinements of the problem (stepwise refinement and functional decomposition) from general functions, until arriving at the component modules or programs that must be implemented to build the system.
Another approach known as "The Jackson Design Methodology" (23), which is gaining acceptance in Europe and, more recently, in U.S., is based on the ideas that the structure of the system closely follows the structure of the problem domain. The Jackson design methodology appears to be applicable to a significant class of problems which are heavily oriented to input/output, such as many common commercial data processing applications.
Reliability can also be improved by designing the programs making use of virtual machine (VM) concept. A "VM" is a hardware-software duplicate of a real existing computer system in which statistically dominant subset of the virtual processor's instructions are executed directly on the host processor in native
mode. The virtual machines are created by a small "virtual machine monitor" which, because it is small, can be made more reliable than a large, general purpose operating system.
Establishing a "Proof of Correctness" is the only sure way to be certain about the correct functioning of the software systems but there is, at least for the present, no foolproof way of proving any program correct using mathematical logic. Debugging is not sufficient since it shows only the presence of errors and not their absence. A good approach is to establish a proof of correctness for those parts of the system only which are believed to be "critical", making use of the traditional debugging techniques to the remainder of the system. Direct application of the present correctness-proving techniques is quite difficult but this can be overcome by properly modularizing the system first and then proving individual modules correct, which may then be used to establish the correctness of the complete system. An exhaustive set of test cases can be determined by making use of analytical methods. If this set of test cases can be proven to be exhaustive and the program processes them correctly, the program is then said to be correct. THE multiprogramming system for EL X8 (36) explains the use of proof of correctness approach in the design of an operating system while an interesting variation of the method in the design of large reliable programs is given by Mills (37). Extensive use of proofs of correctness is restricted on account of a number of other difficulties as explained in ref. (16), in addition to the considerable effort required.
On account of various difficulties encountered with establishing the proof of correctness, the program testing approach to increase the reliability of software systems is maturing rapidly. Program testing is defined as the process of executing programs with representative input data or conditions, for which the correct results are known, to determine whether any incorrect results occur. Whereas program proving is a reductive process, program testing is an affirmative process since everything done in testing can potentially contribute information about the quality of the program being tested. Program testing techniques are based on an amalgam of methods drawn from graph theory, programming languages, reliability assessment and reliable testing theory. Since a justified discussion on this fast developing technology will itself require a full-length paper, we make no attempt in this paper to give an exhaustive treatment to program testing art and recommend the interested reader to ref. (11) by Miller wherein an excellent presentation of the cross-section of program testing technology - ranging from philosophical issues to research and development concepts - is given by dividing program testing technology into six primary areas: Philosophy of testing; Theoretical foundations; Tools and Techniques; Measurement and Planning; Management and Control; Research and Development. We emphasize that to improve software reliability, it is necessary to devise methods of planning and measurement that are appropriate to specific testing methods, and which are technically sound and economically viable. Use of structured testing, i.e. organizing a series of tests in a rational manner so that the testing activity runs smoothly and efficiently, is highly recommended. General guidelines to achieve structured testing can be summarized as: Adopting specific criteria to govern unit testing of all programs; scheduling progressive tests to build up to a representative full system test; using program analyzers to assure that all program functions have been exercised; using a fault reporting process to manage debugging and testing; Conducting regression tests after program rework has been done; Supplementing integration testing with system validation reviews; and planning a shakedown period after delivered software is installed (3).
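The definition of program testing given above (running a program on inputs whose correct results are known and looking for any incorrect result), together with the notions of an exhaustive test set and of regression tests after rework, can be shown on a toy case. The following is a minimal sketch: the parity routine, the deliberately faulty "reworked" version and all names are hypothetical and are used only because the input domain is small enough to enumerate completely.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Case = Tuple[Tuple[int, ...], int]   # (input bits, known correct result)


def run_known_answer_tests(
    program: Callable[[Sequence[int]], int], cases: List[Case]
) -> List[Tuple[Tuple[int, ...], int, int]]:
    """Run `program` on every case; return (input, expected, actual) for failures."""
    failures = []
    for bits, expected in cases:
        actual = program(bits)
        if actual != expected:
            failures.append((bits, expected, actual))
    return failures


def parity_bit(bits: Sequence[int]) -> int:
    """Program under test: even-parity bit for a short word."""
    return sum(bits) % 2


def parity_bit_reworked(bits: Sequence[int]) -> int:
    """A hypothetical faulty rework that only looks at the first and last bits."""
    return bits[0] ^ bits[-1]


# Exhaustive test set for 3-bit words: all 8 inputs, with results worked out
# independently of the program (1 when the number of 1-bits is odd).
EXHAUSTIVE_CASES: List[Case] = [
    (bits, 1 if bits.count(1) in (1, 3) else 0)
    for bits in product((0, 1), repeat=3)
]

original_failures = run_known_answer_tests(parity_bit, EXHAUSTIVE_CASES)
regression_failures = run_known_answer_tests(parity_bit_reworked, EXHAUSTIVE_CASES)
```

Re-running the same stored cases after the rework is what exposes the fault; for realistic programs the input domain cannot be enumerated like this, which is why the choice of representative cases matters.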
In spite of the care taken in designing, implementing, and testing a software system, errors do occur during execution and as such an important way to improve the system reliability is to quickly detect a malfunction to minimize the damage caused and effecting a rapid recovery. Detection of malfunctions is carried out by observing the behaviour of the computer system and comparing the same with the information that constitutes proper system behaviour. A sequence of states described by the computer system during its execution phase is used to characterize the system's behaviour. The state of the system is represented by the state variables such as program status indicators; process status indicators; I/O status indicators; communication status indicators; and memory contents. Such observations may be made continuously, periodically, or only when trouble is suspected by making use of the techniques such as (16):- Observation of system state to detect invalid state or state sequence; Observation of data and data structures; Observation of characteristic performance measures (e.g. response time, time required to perform a standard function) and comparing these with already established threshold value; and use of software and hardware protection mechanisms, i.e. an attempt is made to execute an instruction that contains an invalid operation code or address, or that violates a protected portion of hardware or software.
The observations as discussed above may be classified as internal observations or external observations depending on whether the observer and redundant information used forms a part of the system or is external to the system respectively. External observation tools are: Hardware monitors; Software or firmware monitors; and Hybrid monitoring systems consisting of both hardware and software. The idea is to set up a monitoring system to watch for telltale signs of system malfunctions such as performance degradation and unexpected or invalid sequence of events or states. Internal observation tools include "in line checks", "audit programs" and "watchdog timers". "In line checking" improves the software system reliability by including code in the system to check the validity of data structures each time these are processed by system routines. "Audit Programs" sample rather than continuously observe the system's behaviour, and require less overhead than in line checking. The idea with "Watchdog timers" is to set to sound an alarm after a time sufficient enough for the system to perform its function unless something goes wrong. Self-checking techniques (i.e. a software system is made to check its own operation to some extent by having two separate algorithms perform the same function and then comparing results) are associated with the drawbacks of doubling the size of the program and halving its execution speed and thus are not of much practical value. Structured management technique (4) can be successfully used to reduce the amount of information that must be observed to detect malfunctions and to organize the observations. In short, redundancy is a key to error detection and is provided in the form of robust data structures and information about what constitutes expected behaviour of software.
Diagnosis is necessary to know the extent of the damage and the probable cause of the malfunction to be able to carry out appropriate repairs needed for an effective recovery. The generalized diagnostic approach involves collection of symptoms of the problem; clues to its origin; observation of damage caused; and then to isolate a set of probable causes by using a "maintenance dictionary". Diagnosis is usually carried out in three phases: Survey of damage; Studying of event sequence; and use of maintenance dictionary. Diagnosis may be carried out manually, semiautomatically or automatically.
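Two of the internal observation tools described earlier, the watchdog timer and the in-line check on a data structure that carries redundant information, can be sketched in a few lines. The following is an illustrative sketch only: the class names, the redundant element count and the injectable clock are assumptions introduced here, not a design taken from the paper or its references.

```python
import time
from typing import Any, Callable, List


class Watchdog:
    """Raises the alarm if the supervised activity does not report in time."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic):
        self.timeout = timeout
        self.clock = clock                    # injectable so tests need not sleep
        self.deadline = clock() + timeout

    def kick(self) -> None:
        """Called by the supervised code each time it completes its function."""
        self.deadline = self.clock() + self.timeout

    def expired(self) -> bool:
        """True once more than `timeout` has passed since the last kick."""
        return self.clock() > self.deadline


class CheckedQueue:
    """FIFO queue with a redundant element count, verified on every operation."""

    def __init__(self) -> None:
        self.items: List[Any] = []
        self.count = 0                        # redundant copy of len(self.items)

    def _inline_check(self) -> None:
        if self.count != len(self.items):
            raise RuntimeError("queue corrupted: count does not match contents")

    def put(self, item: Any) -> None:
        self._inline_check()
        self.items.append(item)
        self.count += 1

    def get(self) -> Any:
        self._inline_check()
        if not self.items:
            raise IndexError("queue is empty")
        self.count -= 1
        return self.items.pop(0)
```

An audit program, in the sense used above, would call the same consistency check on a sample of structures at intervals instead of on every `put` and `get`, trading detection delay for lower overhead.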
86
INDER M. SOI and K. K. AGGARWAL
the complexity, sources
more
of e r r o r s
these tasks successfully,
cost, and more
of i t s o w n b u t i s f a s t e r
the reliability
e n g i n e e r will h a v e to d e v e l o p a n e w
and e f f i c i e n t a s c o m p a r e d to m a n u a l
a p p r o a c h to h i s w o r k , the c l a s s i c a l
approach.
w a r e " a p p r o a c h b a s e d on c o m p o n e n t c o m -
G e n e r a l t e n d e n c y is to m a k e a
compromise
and t h u s to h a v e s e m i a u t o m a t i c
plexity and stress
a n a l y s i s w i l l n o t w o r k in
software engineering.
means. With malfunction having been detected, data necessary analysed,
for diagnosis collected and
an attempt is made for correc-
tion and recovery
operations,
b e clone m a n u a l l y ,
which may
semiautomatically
or
"hard-
T h e s i z e of t h e
software module has a dramatic
e f f e c t on
the design and verification effort necessary to assure
r e l i a b i l i t y in l a r g e s o f t w a r e
systems.
It i s s t i l l v e r y m u c h a n a r t to
produce complete requirement
specifica-
completely automatically depending on the nature of the system where it is used, e.g. essential services may like to have automatic operation as far as possible to reduce the down time. More commonly used techniques may be :
i) In case little or no damage results from the malfunction, the same is ignored and operation is allowed to continue while maintenance personnel work side by side to carry out diagnosis and take corrective actions;
ii) If the malfunction is the result of timing problems then the same will be overcome by retrying the operation but with a changed order of events;
iii) Roll back to the most recent check point and restart evaluation from there;
iv) Redundancy is made use of to reconstruct or correct data structures and then apply steps (i) to (iii);
v) System may be re-initialized by either "Warm" or "Cold" restart.
Extent of damage, probable cause of malfunction, nature of malfunction and cost can decide which of the available corrective and recovery techniques should be applied for a given type of malfunction.
A reliability engineer in his attempt to produce reliable software must undertake key tasks such as : formulation of specific, measurable reliability and test objectives; detailed review of the logic, structure and fault-tolerance of the design; and participation in the verification and integration testing. To execute
tions and partition the design into modules and subsystems which have minimal side effects and, therefore, reduced test requirements. A knowledge of software science (25) on the part of the reliability engineer can play a vital role.
Table I summarizes our review of techniques and tools which have been formulated to produce reliable software for computer networks. We strongly feel that if a few of the recommendations given in ref. (3) are practised in the respective phases of the software development life cycle, then one can achieve the goal of producing reliable software.

VII. COMPUTER COMMUNICATION SYSTEM RELIABILITY ISSUES

Having discussed the issues of hardware and software reliability which may be applicable for isolated computer systems, e.g. individual switches and hosts in a network, we discuss methods of improving the reliability of computer communication systems in computer networks. These methods can be discussed according to whether they prevent, detect, diagnose, correct, or recover from errors, failures, deadlocks, or lockups.
Failures are prevented in two ways : care in design and implementation of communication protocols; and controlling congestion. Keys to a sophisticated design and implementation of communication protocols are summarized as below :-
i)
Objective in design of a set of protocols should be to minimize interactions between levels of protocols, which helps in carrying out design, implementation, validation, testing, and recovery procedures independently for each level;
2) A protocol be designed to perform the functions of error control, flow control, multiplexing and synchronization separately, without deadlocks and any signs of undesired nondeterministic behaviour;
3) Protocols be specified in such a way as to allow easy implementation and, to increase reliability, formal techniques similar to proof of correctness methods be used to verify its correct functioning;
4) Models of protocol function like "Graph Model of Computation" (2) be used for specification, implementation, and verification of protocols;
5) In order to implement a deadlock detection or avoidance algorithm, it is necessary to construct a global description of network resource requests and allocations, either at some central location or at the node attempting to process a resource request;
6) Deadlock prevention techniques be preferred over the deadlock avoidance and detection techniques in view of overhead costs. Typically, a communication subnet should use a mix of strategies for deadlock : prevention wherever possible and other techniques where prevention can't be used. An optimal scheme for deadlock detection and recovery by pre-empting some deadlocked processes has been explained by Hutchinson et. al. (6).
Prevention of failures by controlling congestion in communication subnet is based on the idea of imposing a limit on the size of output queues in switch nodes. This limit depends on the level of traffic flow through the switch and is chosen to minimize the probability of having to drop packets, or is sometimes approximated by setting the output queue limit equal to the total number of packet buffers in the switch divided by the square-root of the number of output links.
Regardless of care and sophisticated design techniques used for preventing failures in communication network, failures do occur and thus a quick detection, diagnosis, correction and recovery from failures, as they occur, is needed. Methods used for detection of failure in individual switches and hosts of a computer communication network are the same as for isolated computer systems discussed under the sections of software and hardware reliability issues; failures of communication in computer network can be detected by implementing the following techniques :-
1) Detection of network malfunctioning may be carried out by observing the state of all nodes and links in the system periodically and reporting malfunctions to a network control center for diagnosis. Few examples of this approach are the Engineering and Administrative Data Acquisition System of Bell Telephone Laboratories to collect and analyze data on telephone traffic (20); and Data Route - a nationwide digital data network implemented by the Trans-Canada Telephone System (33);
3) Distinction between failures caused by transmission errors or by malfunctions of the originating node is made by using error detecting and correcting codes like parity, cyclic redundancy check codes and Hamming Codes;
4) Nodes may check erroneous operation of other nodes by receiving an incorrect message from the node or by failing to receive an expected message within a reasonable time or by detecting the failure of a node to accept a message. This amounts to error detection by using redundancy provided by protocols and by time-out procedures;
5) When network control algorithm operations such as flow control, congestion
control, or routing, are distributed throughout the network, then failure detection is encountered with special problems. Adaptive routing in ARPANET (17) has associated with it the difficulty that an error in one node can cause further propagation of invalid routing information to other nodes, and no effective way except for checks at the original node in error has been found out as yet; cost-effective isarithmic flow control (in which packets not being used to transfer data may travel through the network as "empties") is associated with the difficulty that the supply of packets is a network-wide resource which is not under the direct control of any node. Resource allocation algorithm (8) can be used to avoid this latter problem to some extent;
6) Designing and implementing sophisticated data structures in a way to carry redundant structural information.
Diagnostic techniques needed to know the extent of damage and probable cause of malfunction, in case failure detection indicates an error in a node (individual switch, host) independent of the rest of the network, are the same as for isolated individual systems already considered. However, when a node detects an error in another node or in a communication line then the following situations arise :-
i) Error detected is a complete lack of communication with an adjacent node - diagnosis determines whether the distant node or the communication line has failed. Communication by any alternate route may be attempted;
2) Problem detected is with a specific line - diagnosis determines whether the line itself or some part of the line interface is at fault by making use of hardware which allows the line to be looped back into the modem;
3) Node is still communicating with other nodes but does not seem to be functioning properly - diagnosis may be carried out either by sending test messages and examining the responses or by sending a special message which will cause the receiving node to initiate its own diagnostic procedures.
Another course of action in case of a node detecting an error in another node may be to inform maintenance personnel to gather data about the problem and temporarily remove the problem. Maintenance dictionary can be used as references to find probable causes and cures of the problem which worked in past and can be made available as data base at a network maintenance center. This data base is updated whenever causes and cures are discovered. In new systems, pattern recognition techniques coupled with on-line maintenance dictionary can provide a very efficient, automatic diagnosis for computer networks with distributed data bases (7). Since automaticity adds to complexity and cost but is faster, careful consideration has to be given in choosing whether the diagnosis functions may be performed manually or automatically.
Once the detection and diagnosis stages are over, following techniques may be employed, depending on the situation, for error correction and recovery in computer-communication networks :
i) Methods of recovering from software errors in single nodes be treated in a manner as that discussed earlier by treating the node as a single computer system. Recovery Blocks as studied by Randell (35) may also be used;
2) An important factor is the presence of reasonably independent central processors in a network from error recovery point of view. Since there is very small probability of failure of all the nodes simultaneously, in the event of failure of a node either due to hardware or software problems, some other node is available to restart it. Special hardware in the communication interface for each processor enables the restarting by another processor.
For a processor to be restarted remotely, hardware is included so that restart of a completely failed processor may be accomplished by use of a message which would simply cause the processor to execute an initial load sequence from a locally attached storage device; or the processor may be reloaded by sending a special message whose text is a bootstrap routine, as described by Binder (38) in a procedure for remotely reloading switch nodes on the ALOHANET. Reloading of the nucleus of a system is possible without disturbing the contents of a system's tables. The failed processor may be restarted by : ignoring the tables altogether; resuming operation with the assumption that all the system's tables are correct; or by running a complete series of data verification programs to check for errors in system's tables before resuming operation and repairing the tables found to be in error. An interesting technique for making limited use of system's tables in a restarted processor has been proposed for the Distributed Computing System by Farber (8);
3) In case a link or node fails which can't be recovered by automatic action, then the network is made to adapt the operation without the unusable components either by exploiting the topological design of the network to avoid disconnection or by modifying the routing to avoid failed components. Network topology analysis for minimum cost configuration having capability to remain connected in spite of link or node failures is carried out by assuming either the randomness of failures or failures being caused by an intelligent enemy who knows the structure of the network. Modifying the routing through the network to handle the failed components depends on the type of routing, e.g. random, adaptive or fixed;
4) A simplified recovery procedure is to make the network "forgiving" of erroneous or omitted actions by individual switch nodes by providing hardware redundancy in the design of a network allowing communication failures, as used in ARPANET (17);
5) In some cases, the impact of switch node failures can be decreased by the use of "bypass switches", i.e. switches which can be remotely activated to cause traffic from one line connected to a node to flow directly onto another line;
6) In case of loop networks, the communication on loop depends on the correct functioning of all lines and all line interfaces, so a serious problem in any interface or link could disable the entire loop. Special recovery techniques such as use of bypass switches to carry data past disabled interfaces (this scheme fails in case of link failure); use of duplicate cabling; or a combination of these two techniques are applied in case of loop network failures.
Table II summarizes the above review of techniques and tools.

VIII. SUMMARY AND CONCLUSIONS

The problem of producing reliable computer-communication networks is considered as composed of three different parts : Hardware, Software and Communication network. The issues involved and some techniques for achieving reliable hardware, reliable software and reliable communication networks have been comprehensively reviewed and discussed by classifying these as prevention, detection, diagnosis, correction and recovery methods. Following observations can be drawn to conclude the discussions.
1) Hardware reliability is improved by conservative design; carefully implementing the design using reliable components; carrying out thorough initial
and periodic testing; redundancy within units and possibly through the use of redundant units and external observations.
2) Software reliability is achieved through structured and careful design, implementation and verification; effective use of redundancy in the form of robust data structures; observing the expected behaviour of software through internal and external observation tools.
3) Communication system reliability is maximized by using error-detecting and correcting coding of the transmitted information; monitoring facilities to detect communication equipment failures; and using redundant facilities to provide backup.
4) Overall system reliability of a computer-communication network is maximized by placing emphasis on the network topology design that allows errors to be detected without the overall knowledge of the network's status during instantaneous failures;
5) Special reliability problems result due to some characteristics of networks while certain other characteristics are also helpful to make the networks more reliable;
6) Ability of computer networks to perform in a manner so as to minimize the effect of a single failure provides a major opportunity for its reliability improvement; methods available are only adhoc and have not yet been subjected to cost-effective analysis criterion; as costs increase rapidly as the ideal system is approached, especially when the last few tenths of a percent of unreliability are eliminated, organizations should be satisfied with the level of reliability needed for satisfactory operation depending on the needs of the organization rather than attempt to achieve an eternal machine; detailed investigations are required to be carried out regarding : (i) Using of microprocessors for continuous monitoring and periodic reporting of the status of links or nodes, (ii) Data mangling to determine robustness of data structures and software system, (iii) Dividing large networks into mutually supportive groups of nodes and finding cost-effective ways of detection and correction of errors.

REFERENCES

1. A. Avizienis, "Fault-Tolerant Computing: An overview", IEEE Computer, Vol. 4, No. 1, Jan-Feb. 1971, pp. 5-8.
2. B.J. Postel, "A Graph Model Analysis of Computer Communication Protocols", Computer Science Department, University of California, Los Angeles, Jan. 1970.
3. D.W. Fife, "Computer Software Management: A Primer for Project Management and Quality Control", Computer Science and Technology, NBS Special Publication 500-11, U.S. Department of Commerce, 1977.
4. D.E. Morgan and D.J. Taylor, "A Survey of Methods of Achieving Reliable Software", IEEE Computer, Vol. 10.
5. D.W. Davies, "The Control of Congestion in Packet-Switching Networks", IEEE Trans. on Communications, Vol. 20, No. 3, June 1972, pp. 546-550.
6. D.A. Hutchinson, S.A. Mahmoud and J.S. Riordon, "A Recursive Algorithm for Deadlock Pre-emption in Computer Networks", Information Processing 77, Proc. IFIP Congress '77, Toronto, Aug. 1977, pp. 24/-246.
7. D.E. Morgan, D.J. Taylor and G. Custeau, "A Survey of Methods for Improving Computer Network Reliability and Availability", IEEE Computer, Vol. 10, Number 11, Nov. 1977, pp. 42-51.
8. D.J. Farber, L.C. Kenneth, "The Structure of a Distributed Computer System-Software", Presented at the symposium on Computer Communication Networks and Teletraffic sponsored by the Polytechnic Institute of Brooklyn, Microwave Research Institute, 1972.
9. F. Boesch (ed.), "Large-scale Networks: Theory and Design", IEEE Press, 1976.
10. E. Hansler, G.K. McAuliffe, and R.S. Wilkov, "Optimizing the Reliability in Centralized Computer Networks", IEEE Trans. on Communications, Vol. 20, No. 3, June 1972, pp. 640-644.
11. E.F. Miller (editor), Program Testing Techniques. IEEE Computer Society publication, 1977. IEEE Catalog No. EHC 130-5.
12. G.L. Fultz and L. Kleinrock, "Adaptive Routing Techniques for Store-and-Forward Computer Communication Networks", NTIS, Report AD-727-989, July 1972.
13. H. Frank, I.T. Frisch, Communication, Transmission, and Transportation Networks; Addison Wesley, 1979.
14. H. Frank et. al., "Topological Considerations in the Design of the ARPA Network", AFIPS Conf., Proc. 1970, SJCC, Vol. 36, pp. 581-587.
15. H. Frank, "Providing Reliable Networks with Unreliable Components", Data Networks: Analysis and Design, Proc. Third Data Communication Symposium, St. Petersburg, FL, Nov. 1973, pp. 161-164.
16. Inder M. Soi, "Some aspects of Reliable Software Packages", M.Sc. (Engg.) Thesis, 1978, Kurukshetra University, Kurukshetra, India.
17. J.M. McQuillan and D.C. Walden, "ARPANET Design Decisions", Computer Networks, Vol. 1, No. 5, September 1977.
18. J. Martin, Systems Analysis for Data Transmission, Englewood Cliffs, N.J., Prentice-Hall, 1972.
19. J.W. Suurballe, "Disjoint Paths in a Network", Networks, Vol. 4, No. 2, 1974, pp. 125-145.
20. J.A. Grandle, R.E. Machol, "EADAS A New Traffic Collection Record, Dec. 1975, New Orleans, LA, Vol. 1, pp. 7-21 to 7-24.
21. L. Kleinrock, "Analytical and Simulation Methods in Computer Network Design", AFIPS SJCC, Proceedings, Vol. 36, (1970) p. 569.
22. L. Kleinrock, Communication Nets: Stochastic Message Flow and Delay. N.Y., McGraw-Hill, 1964.
23. M. Jackson, "The Jackson Design Methodology", Infotech State of the art Report, Structured Programming, published in Infotech International, U.K.
24. M. Gerla, "The Design of Store-and-Forward (S/F) Networks for Computer Communications", NTIS Report AD-758-704, Jan. 1973.
25. M.H. Halstead, "Elements of Software Science", American Elsevier Publishing Co. Inc., 1977.
26. N. Wirth, "Program Development by Stepwise Refinement", CACM, Vol. 14, No. 4, April 1971, pp. 221-227.
27. O. Krten, D. Raha, "Application of new concept in Switching System Reliability", Proc. 1976 International Switching Symposium, 1976, pp. 443Z.
28. Peter Freeman, A.I. Wasserman (editors), Tutorial on Software Design Techniques, IEEE Computer Society publication, 1977, IEEE Catalog No. 76CH1145-26.
29. R. Boorstyn and H. Frank, "Large Scale Network Topological Optimization", IEEE Trans. Comm., Vol. COM-25, No. 1, Jan. 1977, pp. 29-47.
30. R.S. Wilkov, "Analysis and Design of Reliable Computer Networks", IEEE Trans. on Communications, Vol. 20, No. 3, pp. 660-678.
31. S.R. Kimbleton and G.M. Schneider, "Computer Communication Networks: Approaches, Objectives, and Performance Considerations", ACM Computing Surveys, Vol. 7, No. 3, Sept. 1975, pp. 129-172.
32. S. Lin, Introduction to Error-Correcting Codes, Englewood Cliffs, NJ: Prentice-Hall, 1970.
33. S. Frankel, O. Pearce, and W. Chan, "A Minicomputer based Performance Monitoring System for the Data Route", National Telecommunications Conf., Conference Record, Dec. 1975, New Orleans, LA, Vol. 1, pp. 7-21 to 7-24.
34. W. Wulf, "Reliable Hardware/Software Architecture", IEEE Trans. Software Engineering, Vol. 1, No. 2, June 1975, pp. 233-240.
35. B. Randell, "System Structure for Software Fault Tolerance", IEEE Trans. on Software Engineering, Vol. 1, No. 2, June 1975, pp. 220-232.
36. E.W. Dijkstra, "The Structure of THE multiprogramming system", CACM, Vol. 11, No. 5, pp. 341-346.
37. H.D. Mills, "On the Development of Large Reliable Programs", Record 1973 IEEE Symposium on Computer Software Reliability, N.Y., April 30 - May 2, 1973, pp. 155-159.
38. R. Binder, Alohanet Protocols, The Aloha System, University of Hawaii, Sept. 1974, (Tech. Report B74-7).
TABLE I. Software System

Malfunction is prevented by :
  -- Structured design and implementation of programs
  -- Organised procedure for testing
  -- Proof of correctness for critical modules

Malfunction occurs :
  DETECTION
  -- Observation of system state to detect invalid state or state sequence
  -- Observation of data and data structures
  -- Observation of characteristic performance measures
  -- Software & hardware protection mechanisms
  DIAGNOSIS
  -- Survey of damage
  -- Survey of event sequence
  -- Use of Maintenance Dictionary
  CORRECTION AND RECOVERY
  -- Ignore and continue operation
  -- Retry
  -- Roll back to recent check points & restart
  -- Reconstruct data and data structures then resume
  -- Re-initialize system

GOOD SYSTEM
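The correction and recovery branch of Table I orders its techniques roughly by cost: retry first, then roll back to a check point, then re-initialize. A minimal sketch of that escalation follows. The function name `recover` and its callback parameters are hypothetical, chosen only for illustration; the paper does not prescribe an implementation, and the "ignore and continue" option is left to the caller.

```python
from typing import Callable, List, Optional


def recover(operation: Callable[[], object],
            restore_checkpoint: Optional[Callable[[], None]] = None,
            reinitialize: Optional[Callable[[], None]] = None,
            retries: int = 2) -> List[str]:
    """Apply recovery steps in increasing order of cost; return the steps tried."""
    steps: List[str] = []

    def attempt(label: str) -> bool:
        # Record the step, run the operation, and report whether it succeeded.
        steps.append(label)
        try:
            operation()
            return True
        except Exception:
            return False

    if attempt("run"):
        return steps
    # Retry: may overcome timing problems (assumption: the caller's operation
    # is safe to repeat).
    for i in range(retries):
        if attempt(f"retry-{i + 1}"):
            return steps
    # Roll back to the most recent check point and restart from there.
    if restore_checkpoint is not None:
        restore_checkpoint()
        if attempt("rollback"):
            return steps
    # Last resort: warm or cold restart supplied by the caller.
    if reinitialize is not None:
        reinitialize()
        if attempt("reinitialize"):
            return steps
    steps.append("failed")
    return steps
```

The returned list doubles as a trace that maintenance personnel could log against a maintenance dictionary entry.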
TABLE II (a). Computer-Communication Network

Failure is prevented by :
  Care in design and implementation of communication protocols
    1. Minimized interaction between levels of protocols.
    2. Avoidance of deadlocks and undesired nondeterministic behaviour in protocol functioning.
    3. Proof-of-correctness like implementation.
    4. Use of "Graph Models" for specification, implementation, and verification of protocols.
    5. Constructing global description of network resource requests and allocation at some central station.
    6. Deadlock prevention techniques to be preferred over deadlock avoidance techniques.
  Controlling congestion
    -- Technique of imposing a limit on the size of output queues in switch nodes

Failure occurs : go to Table II (b)
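The congestion-control entry of Table II (a) corresponds to the approximation given in the text: the output queue limit is the total number of packet buffers in the switch divided by the square root of the number of output links. A small sketch of that rule is given below; the function names are hypothetical, and rounding the limit down to a whole packet is an assumption made here, not something the paper states.

```python
import math


def output_queue_limit(total_buffers: int, output_links: int) -> int:
    """Approximate per-link output queue limit: buffers / sqrt(links)."""
    if total_buffers < 0 or output_links < 1:
        raise ValueError("need total_buffers >= 0 and output_links >= 1")
    # Rounding down is an assumption; the text gives only the ratio.
    return math.floor(total_buffers / math.sqrt(output_links))


def accept_packet(queue_length: int, limit: int) -> bool:
    """Admit a packet to an output queue only while the queue is under its limit."""
    return queue_length < limit
```

For a switch with 100 buffers and 4 output links this gives a limit of 50 packets per queue, so no single congested link can absorb every buffer in the switch.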
TABLE II (b)

Failure occurs

Failure is in network ?
  No  : go to Table I
  Yes : continue below

Detection
  -- Periodic observation of state of all nodes and communication links
  -- Use of error-detecting and correcting codes
  -- By using redundancy provided by protocols and by time-out procedures
  -- Use of adaptive routing and isarithmic flow control in distributed networks
  -- Robust data structures implementation

Diagnosis
  -- Inform the maintenance personnel
  -- Use of maintenance dictionary
  -- Use of pattern recognition techniques in conjunction with an on-line maintenance dictionary
  -- Use of alternate route or routing
  -- By making use of hardware to loop the faulty line back to modem
  -- Use of special test messages

Correction and Recovery
  See Table II (c)
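The time-out entry in the detection list of Table II (b) can be illustrated with a small sketch: a node records when it last heard from each neighbour and suspects any neighbour that has been silent for longer than a chosen interval. The class name `TimeoutDetector` and its methods are hypothetical, and time is passed in explicitly so the example stays deterministic; a real switch would read a clock and choose the interval from its traffic pattern.

```python
from typing import Dict, List


class TimeoutDetector:
    """Suspect a neighbour when no message arrives within `timeout` time units."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_heard: Dict[str, float] = {}

    def heard_from(self, node: str, now: float) -> None:
        # Any message received from `node` counts as evidence that it is alive.
        self.last_heard[node] = now

    def suspects(self, now: float) -> List[str]:
        # Neighbours silent for longer than the timeout, in sorted order.
        return sorted(node for node, seen in self.last_heard.items()
                      if now - seen > self.timeout)
```

A suspected neighbour would then be handed to the diagnosis step, for example by sending it a test message over an alternate route.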
TABLE II (c)

Correction and Recovery
  -- Use of Recovery blocks
  -- Exploiting the presence of multiple reasonably independent central processors in network
  -- Ignore and continue operation without the unusable components by employing topological design of network
  -- Making network "forgiving" of erroneous or omitted actions by individual nodes by using hardware redundancy
  -- Use of "Bypass Switches"
  -- Use of special recovery techniques for loop networks :
       Use of bypass switches
       Use of duplicate cabling
       Combination of bypass switching and duplicate cabling
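The loop-network entries of Table II (c) can be compared with a toy model. The sketch below assumes a ring of `n_nodes` nodes in which link `i` joins node `i` to node `(i + 1) % n_nodes`; the function name, the parameters, and the (link, cable) encoding of cable failures are all hypothetical and serve only to show why bypass switches alone do not survive a link failure while duplicate cabling does.

```python
from typing import Set, Tuple


def loop_operational(n_nodes: int,
                     failed_interfaces: Set[int],
                     failed_cables: Set[Tuple[int, int]],
                     bypass_switches: bool = False,
                     cables_per_link: int = 1) -> bool:
    """Return True if data can still circulate round the loop."""
    # A link is usable while at least one of its cables is intact.
    for link in range(n_nodes):
        intact = [c for c in range(cables_per_link)
                  if (link, c) not in failed_cables]
        if not intact:
            return False
    # A disabled interface breaks the loop unless a bypass switch carries
    # traffic past it.
    if failed_interfaces and not bypass_switches:
        return False
    # At least one interface must remain in service.
    return len(failed_interfaces) < n_nodes
```

With `bypass_switches=True` and `cables_per_link=2` the model tolerates one interface failure and one cable failure at the same time, which corresponds to the combined technique in the table.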