Computers chem. Engng, Vol. 13, No. 7, pp. 855-857, 1989
Printed in Great Britain. All rights reserved
0098-1354/89 $3.00 + 0.00
Copyright © 1989 Pergamon Press plc

PERFORMANCE OF A PROCESS FLOWSHEETING SYSTEM ON A SUPERCOMPUTER

B. K. HARRISON
Department of Chemical Engineering, University of South Alabama, Mobile, AL 36688, U.S.A.

(Received 24 October 1988; final revision received 28 December 1988; received for publication 18 January 1989)
Abstract-A sequential modular flowsheet simulator, FLOWTRAN, was installed on the Cray XMP/24 supercomputer with minimal code modifications. Comparisons of performance for the flowsheet simulator were made between the supercomputer and a conventional mainframe computer, the IBM 4341-2. Test cases ran between 3.5 and 26 times faster on the supercomputer. These speedup numbers, although significant, are much less than speedups predicted based on peak computation rates. To achieve the potential performance improvements supercomputers offer, code changes would be required to take advantage of supercomputer architecture.
INTRODUCTION
Supercomputers have the potential to overcome the degradation in computational performance that common computers exhibit on complex flowsheeting problems. For instance, a Cray II supercomputer has a peak computational speed approx. 3000 times faster than that of a VAX 11/780, a historically popular computer for process simulation applications. Past experience with computers has shown that the performance achievable on today's supercomputers will be available in the common computers of tomorrow.

Speedup factors based on comparisons of peak speeds in MFLOPS (millions of floating point operations per second) give the maximum performance improvements obtainable in moving to more powerful computers. Actual speedups are usually lower than the peak speedup factors, as most codes do not take full advantage of the architecture of the various computers. Jordan (1987), in testing a scientific workload representative of the chemical and petroleum industries, found speedups on supercomputers to be about one-third of those predicted from a comparison of peak speeds.

There is clearly the potential for much more rapid flowsheet calculations, especially if advantage can be taken of computer architecture. However, the only reported attempt to install a flowsheeting system on a supercomputer showed little speedup. Duerre and Bumb (1982) found that when they installed the ASPEN flowsheeting system on a Cray-1, it ran only two or three times faster than on an IBM 370. Peak speed comparisons of the two computers would have indicated a speedup on the order of 50. As pointed out by Duerre and Bumb, one of their primary problems was the use of double precision on the Cray, when single precision on the Cray would have given the same number of significant figures as double precision on the IBM. The Cray architecture imposes severe penalties for the use of double precision. For instance, a multiply operation takes approx. 30 times as long in double precision as in single precision.

To investigate further the performance improvement of a sequential modular flowsheeting system on supercomputers, this study compares the performance of the academic version of the FLOWTRAN system (Seader et al., 1987) on a typical mainframe computer with that obtained on a Cray XMP/24 supercomputer. Major code modifications were avoided, but the Cray compiler does attempt some structuring of the code to take advantage of the supercomputer architecture. It is impossible to accurately predict performance improvements without actual experiments on the computers.
PROCEDURE
The FLOWTRAN system was installed on the Cray XMP/24 supercomputer with minimal code modifications. The operating system language procedures had to be rewritten in the UNICOS language used by the Cray. A few changes to the code had to be made because of machine-specific items. A Cray compiler option allowed a blanket change of double precision code to single precision at compile time. The Cray CFT77 compiler attempts both scalar optimization of the code and vectorization. No attempt was made to take advantage of multitasking computations on the Cray XMP/24's two parallel processors.

After installation a series of 27 flowsheet test cases was run on the Cray as well as on a popular mainframe computer, the IBM 4341-2. The test cases covered a broad scope of operations for flowsheeting systems including distillation systems, reactors, optimization and physical properties. CPU times were recorded for the test cases on the two computers. Clock times were not compared as they are very much a function of system load at the time of execution. In addition to total CPU time, statistics were taken concerning the portion of the CPU time that was devoted to system tasks as opposed to user tasks.

A study was also made of the time spent in the various parts of the sequential modular flowsheet system. A performance monitoring tool available on the Cray allows monitoring the amount of time spent in each routine and the number of times a routine is called. The Cray also provides for control of the automatic scalar optimization and vectorization features of its compiler. Tests were run with various settings of these features.

Table 1. Comparison of total CPU time on Cray XMP/24 and IBM 4341-2

                     Total CPU time (s)
Test number     IBM 4341-2     Cray XMP/24
1               [illegible]    2.0
2               8.7            1.8
3               10.4           1.8
4               7.9            1.7
5               10.0           1.9
6               8.5            1.7
7               10.2           1.8
8               21.9           2.2
9               17.8           1.9
10              12.9           2.0
11              17.5           2.0
12              7.7            1.5
13              8.9            1.6
14              9.9            1.7
15              4.9            1.3
16              5.2            1.4
17              5.8            1.4
18              7.2            [illegible]
[illegible]     5.3            1.4
[illegible]     5.3            1.5
23              5.3            1.4
61              335.5          12.8

RESULTS AND DISCUSSION
A comparison of total CPU times for each of 22 test cases for the IBM 4341-2 computer and the Cray XMP/24 computer is given in Table 1. The tests involve normal flowsheet simulations except for "Test 61", which is a flowsheet optimization problem. Table 2 provides a similar comparison for auxiliary features of a flowsheet system involving database retrieval, physical property correlation and VLE correlation.

The tests show speedup factors between the two computers in the range of 3.5-26. On average the normal flowsheet cases ran 6.2 times faster on the Cray. The calculation-intensive optimization test case ran 26 times faster on the Cray. In all cases the output from the two computers was identical to the limits of engineering accuracy. The same number of iterations was required to converge to solutions on both computers.

The speedup factors experienced are well below the factor of 333 predicted by a comparison of peak speeds of the two computers (400 MFLOPS for the Cray XMP/24 and 1.2 MFLOPS for the IBM 4341-2). This suggests that even with the Cray compiler that attempts scalar optimization and vectorization, much less than full advantage is being taken of the architecture of the supercomputer.

The effects of the automatic scalar optimization, which includes some generation of low-level parallel tasks, and of vectorization on the Cray were further studied. Results indicate that scalar optimization provides a performance improvement about one third better than that due to the Cray's fast clock speed alone. Vectorization, however, provides essentially no improvement. In some of the test cases up to a 5% improvement was observed with vectorization; in others a slight worsening of performance was observed. The implication is that the faster clock cycle on the Cray is the overwhelming factor for performance improvement.

A penalty for employing vectorization on certain codes has been observed before (Jordan, 1987). As setting up pipelined vector calculations involves overheads, long vectors are needed to gain the economies of the calculations. Apparently the FLOWTRAN flowsheet system has a sizeable number of short vectors for which the overhead costs of vectorization are not overcome. An examination of sample code reveals that about half the loops identified by the compiler were vectorized. The vectorization situation might be expected to be worse on many other supercomputers, which have been designed to favor long vector calculations even more than the Cray.

Selective code modification would be expected to give some benefits in terms of vectorization. An analysis of the CPU time spent in various routines, coupled with the number of times various routines were called, indicates that such efforts should be focused on low-level physical property routines, which are frequently utilized. In particular, vapor pressure routines and liquid fugacity routines consume relatively large amounts of CPU time.
No code modifications of this type were attempted in this study, however.

A sequential modular simulator spends time in a variety of support processing tasks in addition to execution of the main flowsheet problem. For instance, the FLOWTRAN system steps through several tasks under the control of an operating system program. First the FLOWTRAN system defines several files and then it retrieves needed information from a physical property database. Next a preprocessor program interprets the user inputs to structure a Fortran main program for the simulation. This program then has to be compiled and loaded, including the various subroutines from a library. Finally, the main simulation program is run and this is followed by some file cleanup duties.

Table 2. Auxiliary features performance

                                     Total CPU time (s)
Test                              IBM 4341-2     Cray XMP/24
Database retrieval                9.5            0.7
Physical property correlation     4.5            0.3
VLE correlation                   27.4           1.2

To investigate the effects these support processing tasks were having on the performance observed, timing statistics were taken relative to each task the FLOWTRAN system accomplishes. The results are shown for the optimization test case in Table 3.

Table 3. CPU time for various FLOWTRAN tasks (optimization test case)

                                     Total CPU time (s)
Task                              IBM 4341-2     Cray XMP/24
Setup and database                1.88           0.13
Pre-processor                     4.12           0.30
Compile and load                  2.67           1.12
Execute program                   323.70         11.18
Cleanup                           3.10           0.18

Experimentation demonstrates that certain tasks (program setup and database retrieval, preprocessor operations, compile and load, and file cleanup operations) take very similar times, no matter how complex or simple the simulation. As expected, the most important factor in terms of variation in CPU times between different test cases is the execution time for the main simulation program. However, one can expect the listed supporting tasks to take on the order of [illegible]-10 CPU s on the IBM 4341-2 and 1-1.7 CPU s on the Cray, regardless of the time the actual simulation program takes. The supporting tasks constitute a large part, if not a majority, of the CPU time for the relatively short flowsheet problems, but are less significant for long problems like optimization. These overhead tasks are system dependent (only the preprocessor is a Fortran program) and difficult to restructure to take advantage of any particular machine architecture.

To investigate the possibility that the performance of FLOWTRAN on the Cray may be somewhat inhibited by its frequent writing to a history file concerning the progress of its calculations, a special feature of the Cray was utilized. The Cray allows the option of file storage on a 32 megaword solid-state storage device (SSD) that is coupled to the central memory with high-speed communication devices. This provides faster access to files by more than two orders of magnitude as compared to a disk system.
When FLOWTRAN was run using all input and output files on the SSD, CPU time improved by less than 1% compared with using disks for files. There were noticeable improvements in clock time, however.
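The distinction drawn above between CPU time and clock time can be illustrated with a minimal sketch. It is written in Python, which was not used in the original study, and the helper names `measure`, `wait_like_io` and `compute_bound` are hypothetical. A short sleep stands in for a program waiting on a slow storage device: the wait adds to elapsed (clock) time but consumes almost no CPU time, which is consistent with faster file access improving clock time while leaving CPU time nearly unchanged.

```python
import time


def measure(task):
    """Run task() once and return (cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()   # CPU time of this process (user + system)
    wall_start = time.perf_counter()  # elapsed "clock" time
    task()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu, wall


def wait_like_io():
    """Stand-in (assumption) for a program blocked on a slow file device."""
    time.sleep(0.05)


def compute_bound():
    """Stand-in (assumption) for a calculation-intensive routine."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


if __name__ == "__main__":
    for name, task in (("waiting", wait_like_io), ("computing", compute_bound)):
        cpu, wall = measure(task)
        print(f"{name:10s} cpu = {cpu:.4f} s   clock = {wall:.4f} s")
```

For the waiting task the clock time should be dominated by the sleep while the CPU time stays close to zero; for the compute-bound task the two figures should be of similar size on a lightly loaded machine.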
As has been mentioned, no attempt was made to multitask FLOWTRAN on the two Cray processors. Presently, multitasking requires considerable intervention in the code. A Cray compiler to be released soon is capable of some automatic multitasking. However, the maximum increase in speed for two processors as opposed to one is a doubling of speed. Much smaller figures are anticipated due to system overheads and limits to what can be multitasked, although Chen (1984) reports a speedup of 1.9 in some cases.
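The two-processor bound stated above can be made concrete with Amdahl's law, a standard model that the original text does not invoke by name and that is introduced here as an assumption. In this minimal Python sketch (Python was not used in the study, and both function names are hypothetical), a fraction p of the work is assumed to be perfectly divisible among n processors while the remainder runs serially, giving a speedup of 1 / ((1 - p) + p / n).

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def implied_parallel_fraction(speedup, processors):
    """Invert Amdahl's law: p = (1 - 1/S) / (1 - 1/n)."""
    if processors < 2:
        raise ValueError("need at least 2 processors to invert the law")
    if not 1.0 <= speedup <= processors:
        raise ValueError("speedup must lie between 1 and the processor count")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)


if __name__ == "__main__":
    # Upper bound for two processors: everything parallel gives a doubling.
    print(amdahl_speedup(1.0, 2))
    # Parallel fraction that would be needed to see a speedup of 1.9.
    print(round(implied_parallel_fraction(1.9, 2), 3))
```

Under this simplified model, a speedup of 1.9 on two processors would require roughly 95% of the work to run in parallel, which suggests why much smaller figures are anticipated for a code that has not been restructured for multitasking.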
CONCLUSION

Generalizing the results of this study, supercomputers offer a performance improvement to existing sequential modular flowsheeting systems. The improvement, however, is much less than the speedup predicted based on peak speeds. Indications are that sequential modular simulators, even with optimizing compilers, do not take much advantage of supercomputer architectures. To gain much of the performance potential represented by supercomputers, a greater utilization of supercomputer architecture has to be achieved. Eventually supercomputer utility programs may perform or at least ease this task.

Even though much of the potential performance improvement has not been achieved, popular existing sequential modular simulators can run on current supercomputers and experience an immediate speedup in computations. This speedup can be significant for many purposes. Using the computers of this study as an example, a major plant optimization simulation that might take 2 h on a conventional mainframe computer can be accomplished in 5 min. Shorter simulations can be run quickly enough to be interactive.

Acknowledgements-This work was supported by a grant from Cray Research Inc. The Monsanto Company furnished their simulator FLOWTRAN for this study.
REFERENCES

Chen S., Large-scale and high-speed multiprocessor system for scientific applications. Supercomputers: Design and Application (Kai Hwang, Ed.), pp. 46-59. IEEE Computer Society Press, Silver Spring, Maryland (1984).
Duerre K. H. and A. C. Bumb, Implementing ASPEN on the Cray Computer. Report LA-UR-81-3528, Los Alamos National Laboratory, Los Alamos (1981).
Jordan K. E., Performance comparison of large-scale scientific computers. Computer 20, 10-23 (1987).
Seader J. D., W. D. Seider and A. C. Pauls, FLOWTRAN Simulation-An Introduction, 3rd Edn. CACHE, Austin, Texas (1987).