Procedia Computer Science 101, 2016, Pages 341–350. YSC 2016: 5th International Young Scientist Conference on Computational Science
Distributed Monitoring System For Reconfigurable Computer Systems∗

I.G. Danilov1, A.I. Dordopulo1, Z.V. Kalyaev1, I.I. Levin1, V.A. Gudkov2, A.A. Gulenok2, and A.V. Bovkun2

1 Scientific Research Centre of Supercomputers and Neurocomputers Co Ltd, Taganrog, Russia, [email protected]
2 A.V. Kalyaev Scientific Research Institute of Multiprocessor Computer Systems at Southern Federal University, Taganrog, Russia, [email protected]
Abstract

The paper covers the history of development of reconfigurable computer systems, including a modern system with independent circulation of the cooling liquid that is currently under development. The authors introduce a distributed monitoring system for reconfigurable computer systems which provides continuous status diagnostics of the computational module components in order to reduce low-productive periods of the equipment and to minimize overhead when worst-case situations are detected. The distinctive features of the designed immersion liquid cooling system are high cooling efficiency with a power reserve for prospective FPGA families, resistance to leaks and their consequences, and compatibility with traditional water cooling systems based on industrial chillers.

Keywords: FPGA, reconfigurable computer systems, liquid cooling system, hardware monitoring system
1 Introduction
One promising approach to achieving high real performance of a computer system is adaptation of its architecture to the structure of the task being solved, i.e. creation of a special-purpose computer device which implements in hardware all computational operations of the task information graph with minimum delays. A natural requirement for a modern computer system is hardware support of modification of both the algorithm of the task being solved and the task itself; that is why FPGAs are used as the principal computational resource of reconfigurable computer systems [5].

∗ The project has been funded by the Ministry of Education and Science of the Russian Federation. Grant agreement 14.578.21.0006, unique identifier RFMEFI57814X0006
Peer-review under responsibility of the scientific committee of the 5th International Young Scientist Conference on Computational Science. © 2016 The Authors. Published by Elsevier B.V. doi:10.1016/j.procs.2016.11.040
The main advantages of programmable logic devices (PLDs) are the possibility of implementing complicated parallel algorithms, the availability of CAD-tools for complete system simulation, the possibility of in-system programming or modification of the configuration, and compatibility of various design projects when they are converted into VHDL, AHDL, Verilog or any other hardware description language. The history of PLD architectures started at the end of the 1970s, when the first PLDs with programmable-AND and programmable-OR arrays appeared. Such architectures were called FPLAs (Field Programmable Logic Arrays) and FPLSs (Field Programmable Logic Sequencers) [6]. Their main disadvantage was the weak utilization of the programmable-OR array.

The introduction of FPGAs (Field Programmable Gate Arrays) ignited a revolution in devices with programmable logic. The FPGA class includes the Xilinx XC2000, XC3000, XC4000 and Spartan families, Actel ACT1 and ACT2, the Altera FLEX8000 family and some Atmel and Vantis PLDs. The FPGA configurable logic blocks (CLBs) are connected by a programmable switch matrix. The logic blocks consist of one or several rather simple logic cells based on a 4-input look-up table (LUT), a program-controlled multiplexer and a D-flip-flop. Input/output blocks (IOBs) that provide bidirectional input/output, tri-state operation, etc., are typical for FPGA architectures. FPGA chips have the following advanced features: a JTAG port that supports all mandatory boundary-scan instructions specified by the IEEE 1149.1 standard, and a master configuration mode (which requires a built-in oscillator). FPGAs with dedicated block RAM were the result of further development of the FPGA architecture, owing to which FPGAs can be used without external memory devices. FPGAs have a high logic capacity, an easy-to-use architecture, quite high reliability and an optimal price/logic capacity ratio; therefore they meet the various requirements claimed by circuit engineers.
In 1998-1999 the growth of FPGA equivalent logic capacity changed the attitude of both software developers and users to CAD-tools. Until the end of the 1990s the main tools of project description and schematic entry were a graphic editor and libraries of standard primitives and macros such as logic elements, elementary combinational and sequential functional units, and analogues of standard small-scale and medium-scale integrated circuits. At present, circuit engineers widely use hardware description languages for FPGA-based implementation of algorithms. Besides, up-to-date CAD-tools support both standard hardware description languages (such as VHDL and Verilog) and specialized hardware description languages developed by FPGA vendors specially for their own needs, CAD-tools and FPGA families with special architecture features. One such example is AHDL (Altera Hardware Description Language), which is supported by the Altera CAD-tools MAX+PLUS II and Quartus. HDL-languages are very user-friendly tools for describing various interfaces, but when implementing complicated calculations the developer cannot influence the mapping of HDL code onto the hardware resources of the chip, and this has a negative influence on the efficiency of the implementation. Besides, a circuit description is more visual, so when we implement a complex computational algorithm we prefer to use a graphic editor.

FPGAs, like any other complex systems, need specialized tools for monitoring their condition. Starting with the Virtex-5 family, all Xilinx FPGAs contain a specialized monitoring unit called the System Monitor [13]. With the help of a JTAG interface it is possible to receive various information from the System Monitor concerning the FPGA condition, such as the temperature, the power supply voltages, and a warning that the critical temperature has been reached in the FPGA core.
Information received from the System Monitor, together with information received from other devices of a computational module, is the principal source of data in the designed distributed monitoring system for reconfigurable computer systems.
2 Reconfigurable computer systems
At present there are two kinds of high-performance computer systems which use FPGAs as principal components. The first kind is so-called hybrid computer systems, i.e. classic cluster computers which contain FPGAs in their microprocessor nodes and use them as accelerators of calculations. Examples of such hybrid supercomputers are the XT4 by Cray and RASC by Silicon Graphics. In these systems blocks of programmable co-processors are implemented in FPGAs, interconnected with one another and with the principal processors by high-speed buses. The second kind of computer systems which use FPGAs as principal components is reconfigurable computer systems (RCSs). In RCSs FPGAs are the principal computational components, whereas general-purpose processors are minor components which control the operation of the reconfigurable part of the system.

The acknowledged leader in the design of reconfigurable computer systems based on FPGA computational fields is the Taganrog scientific school, founded by academician A.V. Kalyaev. Today it is represented by various RCSs of the supercomputer class designed in the Scientific Research Institute of Multiprocessor Computer Systems at Taganrog State University of Radio-Engineering (now at Southern Federal University) and the Scientific Research Centre of Supercomputers and Neurocomputers. The principal computational resource of such systems is not microprocessors but a set of FPGA chips united into computational fields by high-speed data transfer channels. The spectrum of produced and designed products is rather wide: from completely stand-alone small-size reconfigurable accelerators (computational blocks) and computational modules of desktop and rack design (based on Xilinx Virtex-6, Virtex-7 and UltraScale FPGAs) to computer systems which consist of several computer racks placed in a specially equipped computer room.
Since 2001 four generations of FPGA-based reconfigurable computer systems [6] have succeeded one another, owing to the production of new FPGA families and the growth of the computational complexity of problems, which requires continuous increasing of RCS performance. RCSs with macroprocessor architecture (RCS MPA) were the first generation of RCSs. They consisted of a number of basic modules implemented on FPGAs and a personal computer. Each basic module (BM) is a reconfigurable computational device designed according to the same architectural principles as the whole system. Such an approach provided natural implementation of structural procedural parallel programs for different granularity of parallelism and pipelining of calculations. RCS MPAs gave way to reconfigurable computer systems of the second generation, such as RCSs with macroobject architecture. Using an RCS with macroobject architecture, the developer has two levels of architecture programming [5]. Owing to the design experience of the RCS MPA and the RCS “Bear”, it became possible to implement RCSs of the third generation. The first representatives were RCSs of the “Ursa Major” family, which were designed on the base of three types of basic modules: 16V5-75, 16V5-50 and 16S3-25. The basic module 16V5-75 (the most high-performance one) was used in such computer systems of the family as RCS-5, RCS-1R and RCS-0.2-WS. The basic modules 16V5-50 and 16S3-25 were the components of such personal computer accelerators as RASC-50 and RASC-25. Due to growing demands on RCS performance, the placement density of RCS components increased fourfold. In 2012-2014, on the base of Xilinx Virtex-7 FPGAs and the computational modules (CM) 24V7-750 (Pleiad) and Taygeta [6], the fourth generation of RCS was designed. Due to a number of problems with the cooling system, the fifth generation of RCS was designed. According to the obtained experimental data, conversion from the FPGA family
Virtex-6 to the next family Virtex-7 leads to growth of the FPGA maximum temperature by 11–15 ℃. Therefore further development of FPGA production technologies and conversion to the next FPGA family, Virtex UltraScale, will lead to growth of FPGA overheating by an additional 10…15 ℃. This will shift the range of their operating temperature to 80…85 ℃, which exceeds the permissible range of the FPGA operating temperature (65…70 ℃) and hence will have a negative influence on their reliability. That is why air cooling of the next generation of FPGAs, Virtex UltraScale, which contain about 100 million equivalent gates and have a power consumption of not less than 100 Watt per FPGA chip, will not provide stable and reliable operation of the RCS when its chips are filled up to 85-95% of the available hardware resource. This circumstance requires a quite different cooling method which maintains the growth rate of RCS performance for the prospective Xilinx FPGA families: Virtex UltraScale, Virtex UltraScale+, Virtex UltraScale 2, etc.

At present the technology of liquid cooling of servers and separate computational modules is developed by many vendors, and some of them have achieved success in this direction [4, 3, 7]. However, most of these technologies are intended for cooling computational modules which contain one or two microprocessors. All attempts to adapt them to cooling computational modules which contain a large number of heat-generating components (an FPGA field of 8 chips) have revealed a number of shortcomings of liquid cooling of RCS computational modules [6]. Since 2013 the scientific team of SRC SC and NC has actively developed next-generation RCSs on the base of their original liquid cooling system for printed circuit boards with a high density of placement and a large number of heat-generating electronic components.
The principal element of the modular implementation of the open-loop immersion liquid cooling system for electronic components of computer systems is a reconfigurable computational module of a new generation (see the design in Figure 1, a). The new-generation CM consists of a computational section, a heat exchange section, a casing, a pump, a heat exchanger and fittings. In the casing, which is the base of the computational section, a hermetic container with dielectric cooling liquid and heat-generating electronic components is placed. The electronic components can be as follows: computational modules (not less than 12-16), control boards, RAM, power supply blocks, storage devices, daughter boards, etc. The computational section is closed with a cover. The computational section adjoins the heat exchange section, which contains a pump and a heat exchanger. The pump provides circulation of the heat-transfer agent in the CM through the closed loop: from the computational module the heated heat-transfer agent passes into the heat exchanger and is cooled there. From the heat exchanger the cooled heat-transfer agent passes again into the computational module and cools the heated electronic components there. As a result of heat dissipation the agent becomes heated and again passes into the heat exchanger, and so on. The heat exchanger is connected to the external heat exchange loop via fittings and is intended for cooling the heat-transfer agent with the help of the secondary cooling liquid. As the heat exchanger it is possible to use a plate heat exchanger in which the first and the second loops are separated; so, as the secondary cooling liquid it is possible to use water cooled by an industrial chiller. The chiller can be placed outside the server room and connected to the reconfigurable computational modules by means of a stationary system of engineering services.
The design of the computer rack with the placed CMs is shown in Figure 1, b. The computational and the heat exchange sections are mechanically interconnected into a single reconfigurable computational module. Maintenance of the reconfigurable computational module requires its connection to the source of the secondary cooling liquid (by means of valves) and to the power supply and the hub (by means of electrical connectors).
Figure 1: The design of the computer system based on liquid cooling (a: the design of the new-generation CM; b: the design of the computer rack)
In the casing of the computer rack the CMs are placed one over another. Their number is limited by the dimensions of the rack, by the technical capabilities of the computer room and by the engineering services. Each CM of the computer rack is connected to the source of the secondary cooling liquid with the help of supply and return collectors through fittings (or balanced valves) and flexible pipes; connection to the power supply and the hub is performed via electric connectors. Supply of the cold secondary cooling liquid and extraction of the heated one into the stationary system of engineering services connected to the rack is performed via fittings (or balanced valves). A set of computer racks placed in one or several computer rooms forms a computer complex. To maintain the computer complex, it is connected to the source of the secondary cooling liquid, to the power supply, and to the host computer that controls the complex. For maintenance of such a complex computer system we need effective facilities of distributed monitoring and control with the following characteristics: scalability, fault-resistance, expandability (addition of new formats of monitoring data and new control operations), controllability
(the complexity of the facilities must not grow when the number of nodes grows) and portability to various platforms.
3 Distributed monitoring system for reconfigurable computer systems
Starting from the third generation, because of the considerable increase of RCS size and the transition from desk-size computer systems to complexes which consist of several computer racks, it became necessary to design and create means of monitoring the current condition of the principal nodes of a computer system (the temperature, the voltage, the current and power consumption of FPGA chips, the rotation speed of the cooling system fans, and other parameters) to provide uninterruptible and safe functioning of the equipment. A feature of FPGAs, in comparison with microprocessors, is a lower zone of operating temperatures (65…70 ℃). It is not reasonable to overrun the borders of this zone, because that badly influences their reliability. The hardware architecture of RCS computational nodes, i.e. CMs, considerably differs [5] from the microprocessor architecture of nodes of traditional cluster computer systems, since RCS computational modules contain a greater number of blocks of various types (several FPGA chips, fans, memory blocks, FPGA power supply blocks, etc.). The use of existing computational resource monitoring systems, such as Ganglia [8] and Nagios [9], for such kinds of computational nodes becomes more and more complicated.

The wide-spread distributed monitoring system Ganglia [8] is designed on the base of a mechanism of network sockets for data transfer, the open XML standard for data representation on the application level, and the XDR standard for data compression and representation on the transport level. The basis of Ganglia is a specialized “subscription-publication” protocol, which is used for interaction of the gmond monitoring services installed on all cluster nodes. The gmond service collects the node statistics at certain intervals and can transfer it in the XDR format to the other gmond services according to the hierarchy. The gmond services receive the statistics, aggregate it and transfer it further.
Finally, the collected statistics of the whole cluster is transferred on request, in the XML format, to the gmetad services of the federation level. These services aggregate it in a specialized database and transfer it to clients when necessary. The shortcoming of Ganglia is the lack of facilities which can detect exceptional situations, resolve them in an automatic or computer-aided mode, and inform interested users. The Nagios system [9] has such functions: it contains a comprehensive warning system and can be extended by means of plug-ins. The shortcoming of Nagios for multichip RCS monitoring is the exceptional complexity of plug-in adaptation for the great variety of CM components which require monitoring.

The principal component of the CM which receives low-level data on the condition of all units and devices is the CM microcontroller. First of all, this is information from the FPGAs, received via the JTAG interface from the System Monitor, and information from power supply devices, fan and indication control boards, and the power supply unit, received via combined interfaces (I2C and GPIO). Then, via a UART interface, the microcontroller transfers the collected data to the loading and control board (LCB), which maps the transferred data onto the PCI-Express bus, where they can be read by a user application with the help of a specialized driver. From the point of view of the end user as well as the system software developer, the principal problem of development of the RCS monitoring system is unification of interaction with various system blocks: interaction with the CM boards, with a block of united CM boards, with several CMs or with the whole RCS must be practically the same. The architecture of the designed system is developed and implemented with the help of service-oriented technologies
Figure 2: The architecture of the distributed monitoring and resource control system for RCS

and has a hierarchic structure, which is shown in Figure 2. The hierarchy consists of two kinds of network services of different levels:

1. the lowest level, the level of a computational node of the system, contains the service of a node controller (NDC), which is run on an LCB and has direct access to the CMs of the RCS computational block, presenting its access interface as a set of operations available through the network;

2. the metacluster controller service (MCC) is an end-point of access to the system; the client sends a request to perform a certain operation and can poll the MCC from time to time to check whether its status data has been updated.

Using such an approach we can create a multilevel system of monitoring and control with an arbitrary hierarchic structure. Any MCC is an end-point of access to the status data of its domain for the upper member of the hierarchy. The client sends a request to perform a certain operation on an NDC or on a group of NDCs, and receives asynchronous notifications about modification of its condition. The MCC redirects client requests to subordinate MCCs of the lower levels of the hierarchy, which have beforehand joined it using the “subscription-publication” protocol. Information about the status of a certain operation is aggregated by the MCC and transferred to the upper levels of the hierarchy. The NDC, placed on the lowest level, is run on the LCB and has direct access to the functional capabilities of the RCS computational block through an interface of dynamically loaded plug-ins. Each plug-in is implemented as a dynamic library of the OS and provides a certain diagnostic, control or monitoring function through a unified interface. The NDC, when it starts functioning, scans the prescribed folder and tries to find dynamic libraries which implement the interface described above. If a plug-in is found, it is added to the list of operations of the node.
Access to plug-ins is provided via the network SOAP operations of the MCC/NDC service:

• doRegistersTest, doMemTest, doLinksTest — start functions of plug-ins which perform testing of the registers, the CM memory, and the connections between FPGAs, respectively;
• doLoadFPGA — starts a function of a plug-in which performs FPGA firmware loading;

• doTMonitoring, doCMonitoring, doSMonitoring, doPUMonitoring — start functions of plug-ins which perform monitoring of the parameters of the FPGA System Monitor, of the CM power supply sources, of the fans and indication unit, and of the CM power supply unit, respectively;

• stopOperation — stops an operation which was started earlier on the given NDC;

• getOperationData — requests the data of a monitoring operation started earlier on the given NDC;

• setOperationData — sets the parameters of a monitoring operation started earlier on the given NDC;

• describeResources — requests the hardware and software configuration of the NDC (a list of available plug-ins, a list of performed operations).

The NDC automatically detects modification of the status and/or new data for each of the performed operations and transfers this information to the MCC or to the clients (if they are directly connected). Specifically for this purpose the following operations were added to the interface of the MCC service in addition to the above-mentioned network SOAP operations:

• operationStatusChanged — notification of a modification of the status of an operation started on the domain of resources controlled by the MCC;

• operationLogChanged — notification of new logging information for an operation started on the resource domain controlled by the MCC.

Information about modification of operation statuses and new information about the whole system is accumulated in the MCC and transferred further to other MCCs. If an MCC is the root of the tree, information about modification of an operation status is transferred to all clients that have requested the describeResources operation of the given MCC. To get logging information the client must call the additional getOperationLog operation of the service.
The operation status contains information about the general state of the operation (running, stopped or aborted) and about task-performance errors and their types. On the basis of the described principles of functioning, using the package Axis2/C [12, 1] and the modified module Savan/C [11], we developed a system of monitoring and control of RCS resources which provides:

• addition of new blocks of computational modules without stopping or restarting the rest of the computer system;

• diagnostics of the components of a selected group of CMs or of the CMs of the whole RCS (tests of registers, memory, and connections between the FPGAs of the computational field);

• programming of the FPGAs of a selected group of CMs or of the CMs of the whole RCS;

• status monitoring of the components of a selected group of CMs or of the CMs of the whole RCS (monitoring of temperature, FPGA power, etc.);

• tracking the system status, warning the user when errors are detected, and performing reset or turning the CM on/off.
The use of service-oriented technologies provides implementation of such distributed systems with minimum labour costs. An example is the cloud infrastructure Eucalyptus [2], which has a similar service architecture. A “subscription-publication” protocol is a standard in the domain of service-oriented architecture and is described, for example, in the web-service specification WS-Eventing. Owing to its use and to loosely coupled services, it is possible to obtain the necessary scalability of the system with the ability of “hot” addition of new services. The use of the XML language, which is the base of service-oriented technologies, provides the necessary extensibility of the system, in particular simple addition of new formats of monitoring data, and portability of the whole system. Addition of new service operations is performed by a simple modification of the contract of the corresponding service, with its further re-compilation and re-starting, which has no influence on the rest of the system. The requirement of overhead reduction determined the selection of the implementation framework, Apache Axis2/C: it is written in the C language and demonstrates the best performance of web-services in comparison with similar frameworks written in Java [1]. The framework supports various web-service standards with the help of ready-made modules, in particular the “subscription-publication” protocol (the web-service specification WS-Eventing). Operation of the designed monitoring and control system on working models of various RCSs (Taygeta, Rigel, etc. [10]) proves that the service-oriented approach used for development of the system provides scalability, fault-resistance, expandability, controllability and portability of the monitoring system to various RCS configurations.
4 Conclusion
The system of monitoring and resource control, developed with the help of service-oriented technologies, provides monitoring of RCSs of various architectures and configurations for systems with air and liquid cooling. Owing to the use of service-oriented technologies we can implement an expandable monitoring system with minimal costs. New operations can be added to such a system by simple modification of the contract of a corresponding service, with its recompilation and restarting, without influencing the rest of the system.
References

[1] Damitha Kumarage. Apache Axis2/C Performance Round #2. [online], 2008. http://wso2.com/library/3868/, last viewed July 2016.
[2] Hewlett Packard Enterprise Development LP. HPE Helion Eucalyptus. [online], 2016. http://www8.hp.com/us/en/cloud/helion-eucalyptus-overview.html, last viewed July 2016.
[3] Iceotope. PetaGen. [online], 2016. http://www.iceotope.com/product.php, last viewed July 2016.
[4] Immers. High Performance Computing System IMMERS 6 R4. [online], 2012–2016. http://immers.ru/sys/immers660/, last viewed July 2016.
[5] I.A. Kalyaev, I.I. Levin, E.A. Semernikov, and V.I. Shmoilov. Reconfigurable multipipeline computing structures. Nova Science Publishers, New York, USA, 2012.
[6] I.I. Levin, A.I. Dordopulo, A.M. Fedorov, and I.A. Kalyaev. Reconfigurable computer systems: from the first FPGAs towards liquid cooling systems. Supercomputing Frontiers and Innovations, 3(1), 2016. http://superfri.org/superfri/article/view/97, last viewed July 2016.
[7] LiquidCool Solutions. LiquidCool Solutions makes liquid cooling practical. [online], 2009. http://www.liquidcoolsolutions.com/, last viewed July 2016.
[8] Matthew L. Massie, Brent N. Chun, and David E. Culler. The Ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 30(5–6):817–840, 2004.
[9] Nagios. Nagios – The Industry Standard In IT Infrastructure Monitoring. [online], 2009–2016. https://www.nagios.org/, last viewed July 2016.
[10] Supercomputers and Neurocomputers Research Center. Computing units. [online], 2004–2016. http://superevm.ru/index.php?page=boxes, last viewed July 2016.
[11] The Apache Software Foundation. Apache Savan/C: The WS-Eventing Module for Apache Axis2/C. [online], 2005–2007. http://axis.apache.org/axis2/c/savan/, last viewed July 2016.
[12] The Apache Software Foundation. Apache Axis2/C: The Web Services Engine. [online], 2005–2009. http://axis.apache.org/axis2/c/core/, last viewed July 2016.
[13] Xilinx. Virtex-5 FPGA System Monitor User Guide. [online], 2011. http://www.xilinx.com/support/documentation/user_guides/ug192.pdf, last viewed July 2016.