Copyright © IFAC Power Systems and Power Plant Control, Seoul, Korea, 1989

NEW COMPUTER CONFIGURATION AND MAJOR SOFTWARE REDESIGN FOR ONTARIO HYDRO'S ENERGY MANAGEMENT SYSTEM

Hae Choon Chung

System Control Centre, System Operation Division, Ontario Hydro, Rexdale, Ontario, Canada
Abstract. Ontario Hydro is relocating its existing System Control Centre (SCC) and installing a new Energy Management System (EMS), which includes a distributed configuration for the computer system in order to meet more rigorous reliability requirements and utilize new technological advances. This paper discusses the rationale for the distributed system, the prioritization of the EMS application programs, and the acquisition/distribution of power system data. A failover strategy is provided to ensure high availability for the high priority application programs. An Integrated System Console, which provides easy control of the distributed systems, is discussed. Two essential software tools for the EMS environment, namely Alarm Management and Man/Machine software, are briefly discussed.

Keywords. Energy management system; distributed computer system design; software design for energy management system.

INTRODUCTION

Ontario Hydro is a Canadian electrical utility which is responsible for generating and transmitting power throughout the province of Ontario. The presently installed generating capacity is 26,000 MW, with a peak demand of 23,086 MW met in January 1989.
Ontario Hydro operates and manages its Bulk Electric System (BES) with the aid of a sophisticated EMS. The EMS monitors and displays the current status of the power system (generation, transmission, and demand) and executes a number of application programs which help dispatchers assess complex power system operating situations quickly. These applications are designed to ensure that the BES is operating securely and economically.

The EMS has been operational since 1976. Since that time, two major upgrades have been performed. The third and most significant upgrade was started in 1986, with an expected in-service date of December 1990. The driving force for this major upgrade is the fact that the BES operating requirements are becoming more stringent. In turn, this requires more sophisticated application functions with new and rigorous reliability requirements, which in turn require an expansion and radical design change of the computer system at the SCC. However, the existing SCC has space restrictions and other site considerations independent of any deficiency there might be in the existing EMS performance. So a decision was made to build a new SCC and install a new EMS using new computer hardware, based on a Unisys 2200/400 system. A major redesign of the system software was necessary to provide a higher level of reliability/availability for critical application functions. This paper describes the various areas in which Ontario Hydro's new EMS is being redesigned. Under each area, the existing and the new EMS designs are discussed.

OVERVIEW OF SYSTEM CONFIGURATION

Figure 1 shows the existing EMS configuration, which consists of a single Live computer, various subsystems, a backup computer and redundant subsystem interfaces. The backup computer is used as a development system when it is in the standby mode.
FIG. 1: CONFIGURATION FOR EXISTING EMS
(Legend: AGC - Automatic Generation Control; DAS - Data Acquisition Subsystem; RTU - Remote Terminal Unit; DCP - Distributed Communication Processor; L/G - Load and Generation; MMS - Man/Machine Subsystem; VAXLINK.)
Figure 2 shows the new EMS configuration, which consists of three clusters: Live, Backup and Development. Each cluster consists of three mainframe computers. The Live and Backup clusters form the on-line complex; any combination of three computers within this complex may be configured as the Live cluster, and the remaining three computers are used for quality assurance tests of new software. Details of the switching mechanism within the on-line complex are described in the Failover/Recovery section of this paper. A detailed explanation of the Live cluster is given in the next section. The Development cluster is used solely for program development and maintenance purposes.
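To make the cluster arrangement concrete, the following is a small illustrative sketch (machine names and the role assignment are invented, not part of Ontario Hydro's software) of the rule that any three machines of the on-line complex may be designated the Live cluster, with the remainder forming the Backup cluster:

```python
# Illustrative sketch only: model the rule that any three of the six
# on-line mainframes may form the Live cluster, the rest acting as Backup.
# Machine names and the alphabetical role assignment are hypothetical.

ONLINE_COMPLEX = {"M1", "M2", "M3", "M4", "M5", "M6"}   # six on-line mainframes
ROLES = ("real_time", "assessment", "study")            # one role per Live node

def configure_live_cluster(chosen):
    """Validate a proposed Live cluster and derive the Backup cluster."""
    chosen = set(chosen)
    if not chosen <= ONLINE_COMPLEX:
        raise ValueError("Live nodes must come from the on-line complex")
    if len(chosen) != len(ROLES):
        raise ValueError("Exactly three machines form the Live cluster")
    live = dict(zip(ROLES, sorted(chosen)))   # illustrative role assignment
    backup = ONLINE_COMPLEX - chosen          # the remainder forms the Backup cluster
    return live, backup

if __name__ == "__main__":
    live, backup = configure_live_cluster({"M2", "M4", "M5"})
    print("Live:", live)
    print("Backup:", sorted(backup))
```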
FIG. 2: CONFIGURATION FOR NEW EMS
(Live, Backup and Development clusters. Legend: DAS - Data Acquisition Subsystem; CAS - Control Action Subsystem; Node - Unisys 2200/402.)

FIG. 3: LIVE SYSTEM CLUSTER - NODE CONFIGURATION
(DAS/CAS, control room MMS and mimic board interfaces. Note: the Development cluster has a similar configuration to the Live cluster.)
RELIABILITY/AVAILABILITY

In the existing system, the display function and numerous application programs run in one mainframe computer. When the mainframe goes down, all application functions become unavailable, even though some application programs are more critical than others in terms of operating the power system securely and economically.

The current reliability requirement for Ontario Hydro's EMS has two levels, i.e. a high level (the total EMS as a whole) and a low level (individual constituent subsystems). The high-level standard for the EMS is 99.6% (percentage of time in the available state). However, a simple availability measure like this does not represent the actual performance requirements of the power system dispatchers. Power system conditions do not usually change much over a short period of time; therefore, one long outage of the EMS has more impact on the dispatchers than numerous short-duration outages.

An outage may be caused by a hardware or software failure. A hardware failure usually causes a short outage because the failed hardware can be replaced by redundant (standby) hardware. A software failure (program failure, data corruption), however, causes a longer outage, because it takes time to find the cause and the degree of damage. There is therefore a concern that a lengthy outage could be caused by the failure of a less important program, seriously affecting the reliability of the total EMS. There is also a concern that the performance of time-critical programs may be degraded by less critical programs when computer resources (CPU, memory, I/O) are short. In summary, the existing EMS is either in a "healthy", "failed", or "degraded" state, and the reliability standards for the most and the least important programs are inseparable.

Ontario Hydro is primarily a transmission-limited utility. Electricity demand has been rising and new generation is being commissioned to meet the demand, but Ontario Hydro has been restricted in building new transmission facilities. Therefore, many application functions have been added to manage the BES securely. For example, complex generation and load rejection schemes and extremely complex stability limit monitoring software have been added in recent years. As a result of the different levels of criticality of each application function, new and more rigorous reliability requirements for each function were defined. In order to support these requirements, a distributed system design has been adopted for the new computer configuration.
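To illustrate the point above that a simple availability percentage hides the outage pattern, the following sketch (with invented outage durations) shows two very different failure histories that both meet the 99.6% figure quoted above:

```python
# Minimal illustration: two very different outage patterns can yield the
# same simple availability figure. The outage durations are made up.

HOURS_PER_YEAR = 8760

def availability(outage_hours):
    return 100.0 * (1.0 - sum(outage_hours) / HOURS_PER_YEAR)

many_short = [0.5] * 70          # seventy half-hour outages
one_long   = [35.0]              # a single 35-hour outage

print(f"many short outages: {availability(many_short):.2f}% available")
print(f"one long outage:    {availability(one_long):.2f}% available")
# Both print 99.60%, yet a single 35-hour outage is far more disruptive
# to the dispatchers than many brief interruptions.
```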
Figure 3 shows the Live system cluster, which consists of three mainframe computers. Each mainframe computer is referred to as a "node" hereafter. The functional distribution of the nodes is:

- The Real Time node contains all the functions required to monitor and control the current state of the power system, that is, to detect, display, annunciate and log all status changes in the BES equipment and control aids. This node also issues the control signals for AGC and special protection schemes.

- The Assessment node contains all the functions required to analyze the real time conditions of the BES, that is, to determine, monitor and alarm violations of the operating security limits in real time. In addition, this node contains functions to perform economic assessments whose response is required in a timely manner to avoid significant economic consequences.

- The Study/Forecast node contains all the functions required to analyze, study or determine anticipated or scheduled conditions.

The expected reliability figures based on the above functional distribution are shown below:

Node         Forced outage   Planned outage   Total outage   Availability %   Standard %
Real Time    28              24               52             99.99            99.94
Assessment   56              88               144            99.92            99.90
Study        84              152              236            99.64            99.60
Each node is designed to be as self-contained as possible, to avoid failures affecting more than one node and to avoid performance degradation across nodes; e.g., a node should be able to determine itself what work is to be performed rather than relying on control requests from another node.

DATA BASE MANAGEMENT

There are two types of data base used by the EMS, namely the Memory Resident Data Base (MRDB) and the Disk Resident Data Base (DRDB). In the existing EMS, approximately 9000 data points are collected via the Data Acquisition Subsystem (DAS) from 125 Remote Terminal Units (RTU) every 2 seconds. The telemetered data is stored in the MRDB and accessed by all system and application programs. The DRDB contains traditional data bases, such as historical data, data required for system recovery, the data base directory, less frequently used data, etc. Because all data are stored in one centralized system, there are no restrictions in accessing data, and any data flow between different programs can be managed easily.
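As a quick back-of-the-envelope check of the data-acquisition load implied by the figures quoted above (a sketch only, not part of the EMS software):

```python
# Sustained MRDB update rate implied by the quoted scan figures:
# 9000 points from 125 RTUs every 2 seconds.

POINTS = 9000            # telemetered data points
RTUS = 125               # remote terminal units
SCAN_PERIOD_S = 2        # seconds per scan cycle

updates_per_second = POINTS / SCAN_PERIOD_S
points_per_rtu = POINTS / RTUS

print(f"{updates_per_second:.0f} MRDB point updates per second")   # 4500
print(f"{points_per_rtu:.0f} points per RTU on average")           # 72
```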
In the new EMS, based on the distributed system concept, a new approach is required for managing the data base. The following describes the major differences in terms of structure, limitations, and data flows between programs.
Memory Resident Data Base (MRDB)

The MRDB is a key component in providing the high performance necessary to meet the system performance standards. It provides a central data storage and retrieval mechanism for data which is used very frequently, where the overhead of performing a disc I/O request is not acceptable. The design ensures that the expected number of requests can be handled and that two or more programs (either on the same node or on different nodes) do not have conflicting update requests. In order to maintain consistent data, the data must be locked before updating, and a common data set updated by many programs may create a significant lock conflict. For the new EMS, a common solid state disk device would have been the simplest design, but the number of accesses and the number of locking requests could not be satisfied within the required time. Therefore, the MRDB is split into three areas and each area is designated as being "owned" by only one node. A node is able to lock and update only the area it owns, but can read all areas. This is the most effective way to guarantee that a stall condition due to lock conflicts between nodes is avoided. Furthermore, a basic requirement of the distributed system design is to make each node as free-standing as possible. Periodic transfer of the MRDB keeps the data sufficiently up to date in the other nodes (MRDB data is transferred among nodes every 2 seconds). Figure 4 shows the layout of the new MRDB.
FIG. 4: LAYOUT FOR NEW MRDB
(Each node - Real Time, Assessment, Study - writes and locks only its own MRDB area. Notes: table ownership, no cross-node locks/writes; data transfer every two seconds.)
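As a minimal sketch of the ownership rule illustrated in Figure 4 (the class and names are hypothetical, not the Unisys implementation), each node locks and writes only the MRDB area it owns, while reads and the periodic two-second snapshot transfer need no cross-node locking:

```python
# Hypothetical sketch of the MRDB ownership rule: a node may lock and
# write only the area it owns, but may read every area. Other nodes see
# updates through the periodic two-second transfer, not cross-node locks.

import threading

class MRDBArea:
    def __init__(self, owner):
        self.owner = owner              # "REAL_TIME", "ASSESSMENT" or "STUDY"
        self._lock = threading.Lock()
        self._data = {}

    def write(self, node, key, value):
        if node != self.owner:
            raise PermissionError(f"{node} does not own this area")
        with self._lock:                # lock is only ever contended locally
            self._data[key] = value

    def read(self, key):
        return self._data.get(key)      # any node may read any area

    def snapshot(self):
        with self._lock:
            return dict(self._data)     # copy shipped to other nodes every 2 s

areas = {n: MRDBArea(n) for n in ("REAL_TIME", "ASSESSMENT", "STUDY")}
areas["REAL_TIME"].write("REAL_TIME", "breaker_123", "CLOSED")   # allowed
try:
    areas["REAL_TIME"].write("STUDY", "breaker_123", "OPEN")     # rejected
except PermissionError as exc:
    print(exc)
```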
Disk Resident Data Base (DRDB)

The DRDB is shared between all nodes. A Multi-Host File Sharing (MHFS) facility provides the capability whereby the DRDB may be accessed from more than one node. The MHFS allows files to be defined either as shared between all nodes of the network or as local to an individual node. It controls the directory information to ensure co-ordination between all nodes during file assign, delete and modification, and during system recovery. An additional level of file recovery, beyond initialization and recovery, is used: re-employment. Re-employment enables the second and succeeding nodes to establish connections to files that are already being used by an operating node. Communication of locking information between nodes is either via special microcode in the designated disk controllers or via hyperchannel.

The shared files provide a common point of failure; however, by configuring the DRDB as "dual" copies, the impact of a hardware failure can be reduced. Two coincident hardware failures would be required to cause a system failure, although a corruption of data contents is still possible. Ideally, files should be allocated onto separate disk units for each node to reduce the likelihood of common failure; a unit failure would then affect only one node. It was decided not to undertake the large effort to split the applications DRDB between the nodes initially. This may be optimized at a later date.

Data Exchange Among Nodes

In the existing EMS, there are no problems in exchanging data among the various application programs because everything needed for the EMS is fully accessible by all application programs. In the distributed system design, each node performs a defined subset of the functional capabilities. With the current split of the application functions among nodes, there is a need to provide a capability to exchange data residing in the MRDB. Since the MRDB in one node is not directly accessible by other nodes, data transfer between nodes is performed periodically. However, data transfer between nodes must be controlled carefully and should be in a downward direction, i.e., from the Real Time node to the Assessment node and/or Study node, or from the Assessment node to the Study node. Thus, a failure of a lower priority function would not cause the failure of a higher priority function, nor would the performance of high priority functions be degraded by a low priority function. The nodes are connected via a high speed data transfer medium (hyperchannel). The following data is transferred via this medium (a sketch of the transaction-processing item follows this list):

- Memory resident data: The Real Time node scans 125 RTUs every 2 seconds. A quick analysis is performed to identify whether the power system configuration has changed since the last scan. The scan data are then transferred to the Assessment and Study nodes. Data processed by the Assessment and Study nodes are also transferred to the other nodes at this time.

- Transaction processing: Application programs are distributed over the three nodes. A means to control the efficient scheduling and execution of transactions among the three nodes is required, because execution of a program on one node may require execution of another program on another node. There must also be a means to pass data to a program on a node that is being scheduled by another node. The data may be created either as a result of input from a man/machine terminal or by the program that is initiating the schedule request.

- Miscellaneous data: There are many different types of data which are exchanged among the three nodes, such as the heartbeat of each node, locking information for the MRDB and DRDB, the status of the subsystems connected to each node, etc.
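As a concrete illustration of the transaction-processing item above, a cross-node schedule request might carry the following kind of information (a hedged sketch; the record layout and all names are invented):

```python
# Hypothetical sketch of a cross-node transaction request: a program on
# one node asks for a program on another node to be scheduled, with the
# input data carried along over the hyperchannel link.

from dataclasses import dataclass, field
import time

@dataclass
class TransactionRequest:
    origin_node: str          # node that issued the schedule request
    target_node: str          # node that should run the program
    program: str              # program to be scheduled on the target node
    payload: dict = field(default_factory=dict)   # data passed to that program
    issued_at: float = field(default_factory=time.time)

# e.g. a dispatcher request handled on the Real Time node triggers a study run:
request = TransactionRequest(
    origin_node="REAL_TIME",
    target_node="STUDY",
    program="load_flow_study",
    payload={"case": "next_hour_forecast"},
)
print(request)
```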
FAILOVER/RECOVERY
The existing EMS is based on a single mainframe computer which executes all the application programs. If the computer fails, the first recovery action is to try to re-establish the system on the failed computer, because the failure may be transient. If this automatic recovery fails, the peripherals are switched manually to the backup computer to recover the system. A system outage due to a hardware failure is recovered by switching to redundant hardware. However, an outage due to a program failure and/or data corruption requires a complex recovery process using the previous consistent power system data archived on disk. A number of recovery procedures are provided to the computer operators. While the recovery process is being performed, the total EMS (including the critical application functions) is not available to the dispatchers.

In the new EMS design, the application functions are split into three groups and each group is assigned to a node, i.e., the Real Time, Assessment or Study node, respectively. A different priority is given to each node. In order to shorten the outage duration for a high priority node, automatic detection of failure and an automatic failover scheme are necessary. A performance monitoring function provides the detection scheme by using a "heartbeat" message: when a node is healthy, it sends an indicator to the other two nodes via hyperchannel every two seconds.

Normally the Study node is designated as the lowest priority node. Therefore, if the Real Time node or the Assessment node fails, an automatic failover sequence begins. As an example, when the Real Time node fails, the Assessment and Study nodes will not receive a heartbeat message. The Assessment and Study nodes then exchange a detection indicator (Real Time failure) to ensure that the failure is not related to the Assessment or Study node itself. A failover will also not be attempted if the Real Time node has failed because of a failure in an attached subsystem; for example, a DAS failure to acquire telemetered data in the Real Time node would cause the same failure when the Study node takes over the Real Time node function. The status of the attached subsystems is passed as part of the heartbeat message. When the Assessment and Study nodes agree that the failure of the Real Time node was caused by the Real Time node itself, the Study node terminates all application programs performing study/forecast functions. The new Real Time node (previously the Study node) sends signals to the Block Multiplex Channel Switch to switch the necessary subsystems, and resumes the Real Time node functions. This automatic failover is achieved by using "re-employment" recovery: an already operational node or a standby node takes over a failed node's function.

Meanwhile, the failed node (previously the Real Time node) will automatically recover and assume the Study node functions if the failure was transitory, e.g., a failure of operating system software. However, if the failure is solid (e.g., a hard failure of CPU or memory), the failed node will not recover; it will be removed from the Live cluster and a node from the Backup cluster will be manually switched into the configuration to resume the Study node functions. The main objective of this automatic failover is to ensure that the different levels of availability required for the Real Time, Assessment and Study node functions in a distributed system are met.
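The following is a simplified sketch of the detection logic described above (all names are hypothetical and the real scheme runs over the hyperchannel): a node is suspected failed when its two-second heartbeat stops, both surviving nodes must agree on the diagnosis, and no failover is attempted if the last heartbeat blamed an attached subsystem.

```python
# Simplified, hypothetical sketch of heartbeat-based failure detection.

import time

HEARTBEAT_PERIOD = 2.0      # seconds between heartbeat messages
MISSED_LIMIT = 2            # heartbeats missed before a node is suspect

class HeartbeatMonitor:
    """Per-node view of the heartbeats received from the other nodes."""

    def __init__(self):
        self.last_seen = {}  # node name -> (timestamp, attached subsystems ok?)

    def record(self, node, subsystems_ok=True, now=None):
        self.last_seen[node] = (now if now is not None else time.time(),
                                subsystems_ok)

    def suspect_failed(self, node, now=None):
        now = now if now is not None else time.time()
        seen, _ = self.last_seen.get(node, (None, True))
        return seen is None or now - seen > MISSED_LIMIT * HEARTBEAT_PERIOD

    def failover_allowed(self, node):
        # No takeover if the failure originated in an attached subsystem,
        # since the new Real Time node would inherit the same problem.
        _, subsystems_ok = self.last_seen.get(node, (None, True))
        return subsystems_ok

def agree_on_failover(monitors, failed_node):
    """Both survivors must concur before the Study node assumes the role."""
    return all(m.suspect_failed(failed_node) and m.failover_allowed(failed_node)
               for m in monitors)

# e.g. the Assessment and Study nodes each hold a monitor; if both have
# stopped seeing the Real Time node's heartbeat, and its attached
# subsystems were reported healthy, the automatic failover may begin.
```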
INTEGRATED SYSTEM CONSOLE (ISC)

The ISC comprises integrated console capability, configuration control capability and integrated performance monitor capability. Each cluster requires an ISC subsystem.

Functional Requirement

In order to achieve a high availability requirement on a distributed system, a tool is essential for the computer operators to mitigate the increased complexity. The ISC provides the following functions:

- A single point of operational control for each cluster of nodes. An operator will always be able to operate a cluster (e.g., Live or Backup) from the same physical set of console terminals in the computer room, regardless of which physical computers that cluster is using.

- A centralized place to manually configure the various physical computer systems and peripherals.

- A single place to monitor, analyze and report the performance of a cluster.

It is important that ease of use and the presentation of data in a clean and easy to understand format are seriously considered, in order that the high level of availability may be achieved. The ISC does not replace the individual system consoles and switching mechanisms for the peripherals. It provides additional aids to make the task of controlling the computer complex easier and less error prone.
ISC Configuration

As shown in Figure 5, the major hardware components of the ISC are the SUN workstation, a configuration panel, and a digital input/output system. The digital input/output system is connected to the Local Area Network, disks, hyperchannel and configuration panel.
FIG. 5: ISC CONFIGURATION
(Live and Backup clusters.)
The ISC software operates under the UNIX operating system. It performs message parsing and handling, validates the configuration selections entered at the configuration panel, collates and analyzes the performance data transferred from the nodes, and displays and updates pictures on the cluster console. There are numerous software modules for maintenance and data base management.

Major Tasks of ISC

The ISC must be able to detect failover conditions from the received console messages and provide clear, concise, timely information from which the operator can make a decision. The ISC must also be able to perform automatically those functions that can reasonably be predefined for a failover situation when a failure occurs. As an example, when the ISC does not receive a "heartbeat" message, it will initiate a predefined failover procedure. The procedure involves raising audio and visual alarms, allowing the operator to monitor the failover process, setting up a window to bring up the appropriate "failover template" to be followed, watching for "checklist" messages to monitor milestones defined for the type of failover, matching operator actions to a predefined checklist and alarming possible missing or out-of-sequence steps to the operator, updating the operator's configuration panel for the computer systems and peripherals, and analyzing and reporting on the probable causes of the failover.

The analysis of hardware problems is also performed by the ISC. The standard operating system reports hardware failures in rather cryptic messages that usually do not show the real cause of the failure, so the operator cannot determine clearly what course of action to follow. The ISC will employ software which analyzes the hardware error messages, interpreting the abbreviations into plain English. A data base of recent hardware errors is also maintained, so that the software may analyze the underlying cause of the failure based on prior failures and failures reported by other nodes. The ISC will inform the operator of the results of the failure analysis and suggest an appropriate recovery action.

One of the major tasks is configuration control, which also involves peripheral switching. With multiple mainframes, communication computers and peripherals to be configured, it is too much to expect that this be done via the traditional method of configuring all possible combinations in the software and then manually downing those components not required. All switchable peripherals and communication computers are connected through the Block Multiplex Channel subsystem. This subsystem will be controlled by the ISC, which will issue the appropriate commands to cause the correct connections to be made. The operator will make the initial configuration selection through the Configuration Control Panel. The ISC will validate the selections, issue the switching commands to the Block Multiplex Switch and apply the correct operating system and communication computer commands. During normal operation of the system, the ISC will handle online configuration changes, automatically issuing peripheral switching commands and keeping the Configuration Control Panel up to date.
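As a small sketch of the checklist-matching idea described above (the step names are invented for illustration and are not the actual failover template):

```python
# Hypothetical sketch: compare observed milestone messages against a
# predefined failover checklist and flag missing or out-of-sequence steps
# so the ISC can alarm them to the operator.

FAILOVER_CHECKLIST = [
    "terminate_study_programs",
    "switch_block_mux_channels",
    "attach_das_subsystem",
    "resume_real_time_functions",
]

def audit_steps(observed, checklist=FAILOVER_CHECKLIST):
    """Return (missing_steps, out_of_sequence_steps) for operator alarming."""
    missing = [s for s in checklist if s not in observed]
    expected = [s for s in checklist if s in observed]   # checklist order
    actual = [s for s in observed if s in checklist]     # order actually seen
    out_of_sequence = [s for s, e in zip(actual, expected) if s != e]
    return missing, out_of_sequence

missing, out_of_seq = audit_steps(
    ["terminate_study_programs", "attach_das_subsystem"])
print("missing:", missing)             # steps not yet (or never) performed
print("out of sequence:", out_of_seq)  # steps done in the wrong order
```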
ALARM MANAGEMENT SOFTWARE

In the existing EMS, the processing of an alarm request is carried out on a first-in/first-out basis. There are several key functions in the alarm management software, i.e., create, delete, log, archive/retrieve, and volume control. An alarm may be created because the status of BES equipment has changed and/or because an alarm condition has been detected during security assessment of the power system by application programs. All alarms from the entire system are output on the dedicated alarm screens in the control room.

The alarm screen is split into two sections. The first section presents an overview of the critical alarms. It consists of many cells, each driven by the dynamic text capability, a man/machine software function which allows text to reflect status using different colours. Application or system programs are responsible for setting the correct status in the associated data base location. This section of the screen works somewhat like an annunciator: if no alarm exists, the cell is shown as blank; if an alarm condition exists, the appropriate text is displayed in the cell. The second section of the screen contains alarm text which is classified as low or high priority. Each text also shows its status, such as active, passive, or update. The alarm text is deleted from the screen automatically after the alarm condition is cleared.

In the existing system design, the potential exists for low priority functions to generate a large volume of alarms, either legitimately or, more likely, because of a software error, which could severely affect the performance of the total EMS. In this situation, high priority functions such as data acquisition, monitoring and display of power system statuses cannot be performed. In the new EMS design, to ensure that the high priority functions continue to operate under such circumstances, the alarm management software is redesigned to perform the processing of all alarms on the node from which the alarm has been generated. Each node has its own alarm file. Only the final merging of the alarms for display on the alarm screen is performed by the Real Time node. Since the control of the alarm display is performed by the Real Time node, the two lower priority nodes cannot affect the performance of the Real Time node.
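A minimal sketch of this split (a hypothetical structure, not the actual software): each node keeps its own alarm file, and only the final merge for the control-room display is done by the Real Time node, so a flood of low-priority alarms stays on the node that produced it.

```python
# Hypothetical sketch of per-node alarm files merged for display by the
# Real Time node only.

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Alarm:
    priority: int                        # 0 = high, 1 = low
    timestamp: float
    node: str = field(compare=False)
    text: str = field(compare=False)

class NodeAlarmFile:
    """Per-node alarm file; creation and logging stay on the owning node."""
    def __init__(self, node):
        self.node = node
        self.alarms = []

    def raise_alarm(self, priority, timestamp, text):
        heapq.heappush(self.alarms, Alarm(priority, timestamp, self.node, text))

def merge_for_display(node_files, limit=20):
    """Done by the Real Time node only: merge per-node files for the screen."""
    merged = heapq.merge(*(sorted(f.alarms) for f in node_files))
    return [a for _, a in zip(range(limit), merged)]

rt = NodeAlarmFile("REAL_TIME")
study = NodeAlarmFile("STUDY")
rt.raise_alarm(0, 1.0, "Breaker 123 opened")
study.raise_alarm(1, 0.5, "Study case completed with warnings")
for alarm in merge_for_display([rt, study]):
    print(alarm.node, alarm.text)
```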
MAN MACHINE SUBSYSTEM (MMS)

The EMS consists of a number of facilities, one of the most critical being the MMS, the "connection" between the dispatcher and the computer systems. The MMS allows the dispatcher to see the power system "in pictures". The MMS consists of the AYDIN hardware and the software to connect it to the computer system. The MMS provides fast response, high performance reliability and dynamic configuration capability.

There are two types of MMS, "local" and "remote". The "local" MMS is located immediately adjacent to the main computer and uses a high speed, directly connected I/O channel. The "remote" MMS is designed to meet a growing need to examine power system conditions from remote locations. The remote MMS is an attractive facility because it offers added flexibility and independence: it can be installed anywhere via normal data communication lines, and it offers a range of response times depending upon the speed of the communication link used. Built-in security ensures no interference with the dispatchers' live control of the power system. An additional component of the local MMS is the Mimic Board, which provides an overview of the power system status.

The software provides the capabilities necessary for generating and displaying pictures, updating the Mimic Board and controlling the various MMS hardware devices.
In the existing EMS, all MMS consoles are treated as equal priority, whether they are located in the control room or in the supporting area (back office). Hardware problems or excessive use of a console by the back office may therefore affect the performance of the control room consoles. At present, there are nearly as many consoles in the back office providing support functions as there are consoles in the control room. The number of consoles required for support functions is expected to continue to increase, whereas only a few additional consoles are predicted for the control room.

In the new EMS design, the control room consoles are driven by the Real Time node and the back office consoles are driven by the Study node, so the back office consoles cannot affect the performance of the control room consoles. However, the software must be designed such that the distributed configuration of the MMS consoles does not introduce inconsistent data displays, because many pictures contain data which may originate from two or three different nodes. The initial display software will therefore assemble the data to complete the dynamic portion of the picture, including data supplied by the requesting program if necessary. It will then convert the picture information into the appropriate display and control characters and output the picture to the MMS screen.

One of the most critical requirements of the MMS software in the new design is not to mislead the power system dispatchers. This could occur if one node goes down and the data base for that node is no longer being updated. As an example, when the Assessment node is down, the state estimated values would not be up to date, while other portions of the screen displaying telemetered values would be correct. Therefore, the display software (whether for initial display or dynamic update) must indicate whether dynamic entities are current.

CONCLUSION

The new EMS to be installed at the new System Control Centre, with an expected in-service date of December 1990, is based on the distributed system concept. The main thrusts of this concept are:

- Improved reliability at three different functional levels: monitoring/annunciation, security assessment and study/forecast.

- Improved consistency of performance at the three different levels, with consistent data display.

- Increased flexibility to add computer resources as needed. The distributed system design allows a new node to be added easily at minimum cost, whereas the centralized system concept often requires replacement of the total system because of the limitations on its expansion.

It was not possible to describe all the components which require software redesign for the distributed system concept. Also, the detailed designs of some software components described in this paper are not finalized; more optimization and adjustment will certainly be made as the development work progresses. Our initial objective for this project is to establish the distributed system environment without losing existing application functionality. The subsequent objective is to expand the EMS capability based on a fully distributed system.