Microprocessing and Microprogramming 24 (1988) 17-20
North-Holland
MICROPROGRAMMING, MICROPROCESSING AND SUPERCOMPUTING
Joseph A. Fisher
Multiflow Computer, Inc., Branford, CT, USA
1. INTRODUCTION
In the mid 1970's, when I first heard about the Symposium on Microprocessing and Microprogramming, I wondered what microprogramming and microprocessing had to do with each other besides sharing the first three syllables. Indeed, I wondered if there were some fundamental confusion on the part of the organizers--perhaps they thought that microprogramming was programming microprocessors? (In fact, this very confusion once led the ACM to order SIGMICRO, its Special Interest Group On Microprogramming, to change its name. Too many people were joining it thinking they were going to explore the programming of their 8080's and 6800's! Fortunately the ACM eventually relented, though not before a new name had been selected and was ready to go.)
Instead it turns out that the organizers must have had a clear vision of the future. Many people anticipated, by the mid 1970's, some sort of revolution involving microprocessors and low-cost computing. But in the 1980's a very different revolution has occurred. Microprogramming and microprocessors, sometimes in combination, have caused a revolution in low cost, very high performance computers, particularly those aimed at scientists and engineers.
The enabling technologies for this revolution were, until recently, the exclusive domain of university researchers and symposia like Euromicro. Now they are not only among the hottest areas of architecture and systems research, but they are recognized as being among the most promising new areas of industrial technology.
2. WHAT'S HAPPENED TO HORIZONTAL MICROCODE COMPACTION?
At the time Euromicro started, there was an obscure research topic in the microcode community called "Microcode Compaction." This had two subtopics, "Vertical Microprogram Compaction" and the even more obscure "Horizontal Microprogram Compaction." They were generally thought of as ways to reduce the amount of read-only memory required to store a microprogram, the lowest-level control bits that cause the hardware to emulate the machine language of the CPU.
Vertical microcode compaction was what it seemed. Researchers concerned themselves with encoding the bits of a microprogram so that each instruction was as short as possible. This resembled gate minimization and other hardware design tool areas that became somewhat less important as hardware got cheaper, cooler, and denser.
Horizontal microcode was quite another matter. The topic was not well understood, sometimes not even by the research community looking into it. The goal was to take a microprogram and make its length shorter by compacting more operations into each line. Of course, each line of horizontal microcode executes in a single minor clock cycle, so microprograms got faster as they got smaller. This is certainly a desirable goal: a microprogram is always chugging away whenever the hardware is running, so the whole machine gets faster whenever the compaction succeeds.
What does it mean to learn how to do more operations at once in order to go faster? It means you are doing parallel processing. This is an example of "fine-grained parallel processing," and is a close relative of dataflow architectures, of the overlapped execution of the first supercomputers, and of the attached processors that started appearing in the mid-1970's. But the microcode compaction researchers by and large didn't recognize that they were doing parallel processing research, and the parallel processing community didn't know they existed.
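To make the connection concrete, here is a minimal sketch in Python of what a compactor does, given operations and the data dependences between them: pack as many ready operations as possible into each wide instruction word. The operation names and the four-slot word width are invented for illustration; this shows the general idea, not any particular published algorithm.

    # A toy "horizontal compactor": pack operations into as few wide
    # instruction words as possible, respecting data dependences.
    # Operation names and the 4-slot word width are invented.

    def compact(ops, deps, slots_per_word=4):
        """ops: list of operation names, in original order.
        deps: dict mapping an operation to the set of operations it needs.
        Returns a list of instruction words (lists of operations that can
        issue in the same cycle)."""
        placed, words = set(), []
        while len(placed) < len(ops):
            # Operations whose inputs were all produced in earlier words.
            ready = [op for op in ops
                     if op not in placed and deps.get(op, set()) <= placed]
            word = ready[:slots_per_word]      # fill up to the machine's width
            words.append(word)
            placed.update(word)
        return words

    # Example: a, b, c are independent; d needs a and b; e needs d.
    print(compact(["a", "b", "c", "d", "e"], {"d": {"a", "b"}, "e": {"d"}}))
    # [['a', 'b', 'c'], ['d'], ['e']]  -- three cycles instead of five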
The horizontal microcode problem turned out to be harder than people first realized. It was theoretically hard, being an NP-complete problem, so that there were unlikely to be practical, provably optimal solutions. (This didn't stop the appearance of solutions which claimed to be polynomial time and optimal. Some of these solutions were published in well-known journals. Had they been correct, the authors would have solved one of the most important problems in theoretical computer science.)
Of greater significance, the problem was very hard in practice. The optimality of solutions was unlikely to matter; in situations like this, reasonable heuristics usually do well enough. But horizontal microcode is very idiosyncratic. Moving around operations in order to pack more of them in a line is quite a challenge. Some functional elements are pipelined, and others throw results out into space (e.g. onto a bus) and expect that some other operation will be there to pick them up at just the right moment.
The worst part of the problem was the effect of conditional jumps in the microprogram. Moving operations past them brought uncertainty into the picture, a serious complication. Yet it appeared to be necessary to move operations past jumps for there to be much compaction at all. (Indeed, largely unknown to the microcode researchers were experiments done to measure the overlapped execution potential for the earliest supercomputers, machines like the CDC-6600 and the IBM Stretch. These experiments showed that the real potential of overlapped execution in general was in solving the conditional jump problem. If operations cannot be overlapped when there is a conditional jump between them, the potential speedup averages about a factor of 2-4x. If they can, the potential speedup can be 50-100x or more.)
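The flavor of those measurements can be captured with a toy calculation: compare the cycle count when operations may overlap only within their own basic block against the count when they may also overlap across the jumps. Everything below (the operations, dependences, and block boundaries) is invented purely to illustrate the shape of the argument.

    # Toy estimate of potential overlap.  Every operation takes one cycle
    # and functional units are unlimited; the program is invented.

    def critical_path(ops, deps):
        """Length of the longest dependence chain among 'ops'."""
        depth = {}
        def d(op):
            if op not in depth:
                depth[op] = 1 + max((d(p) for p in deps.get(op, ())), default=0)
            return depth[op]
        return max(d(op) for op in ops)

    ops    = ["a", "b", "c", "d", "e", "f", "g", "h"]
    deps   = {"b": ["a"], "d": ["c"], "f": ["e"], "h": ["g"]}
    blocks = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]  # a jump sits between them

    sequential    = len(ops)                                     # 8 cycles
    within_blocks = sum(critical_path(b, deps) for b in blocks)  # 2 + 2 = 4 cycles
    across_jumps  = critical_path(ops, deps)                     # 2 cycles
    print(sequential / within_blocks, sequential / across_jumps) # 2.0  4.0

With real programs and real branch behaviour the numbers are of course different, but the pattern (modest gains inside blocks, much larger gains when jumps can be crossed) is the one described above.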
3. FROM HORIZONTAL MICROCODE COMPACTION TO SUPERCOMPUTING
So what does all this have to do with Supercomputing? You can build a computer in which the machine language itself is horizontal microcode, stored in RAM, rather than ROM. Even using ordinary circuitry, you will get a supercomputer class machine if, and only if, you can find many things to do at the same time within the microcode words.
To make all of this work, you have to deal with the two very big issues mentioned above: the idiosyncratic nature of microcode, and conditional jumps. As it happens, at least one solution of the conditional jump problem works very well. (It's time for me to wrestle with modesty and crass commercialism here: I'm referring to my own "Trace Scheduling" algorithms and the systems built by Multiflow Computer.) To deal with the idiosyncratic nature of microcode, you build a machine with "normal" operations packed many to an instruction, not with horizontal microinstructions. You have the same kind of parallelism as does microcode, but the operations are considerably cleaner.
The supercomputer you're left with is an ordinary computer (in fact, a RISC machine, see below) in terms of its basic operations. But every cycle, a very long instruction, consisting of many of these basic operations bundled together, is fetched and executed. The operations are bundled together before the program runs. Machines like this are now called VLIW's (for Very Long Instruction Word architectures).
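Purely as a picture of the format (the functional unit names, the width, and the operation syntax below are invented, not the encoding of any real machine), a very long instruction word can be viewed as one slot per functional unit, filled in by the compiler:

    # Illustrative only: a very long instruction word with one slot per
    # functional unit.  Unit names, width and operation syntax are invented.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class VLIWWord:
        ialu0:  Optional[str] = None
        ialu1:  Optional[str] = None
        fadd:   Optional[str] = None
        fmul:   Optional[str] = None
        mem:    Optional[str] = None
        branch: Optional[str] = None

    # One cycle's worth of work: all filled slots issue together;
    # empty slots are effectively no-ops.
    word = VLIWWord(ialu0="add r1, r2, r3",
                    fmul="fmul f0, f1, f2",
                    mem="load f3, 0(r5)")

The point of the format is that the grouping is decided entirely at compile time; the hardware simply issues whatever the word contains.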
Although building these machines presents many difficult engineering problems, from an architectural point of view they could have been built decades ago. Why weren't they? Because the conditional jump problem would have stopped them from being any more useful than the attached processors. The conditional jump problem is solved using Trace Scheduling's "Escape Hatch Principle." Without going into detail, the basic idea is that operations can be moved up above or down below a jump, as long as their effect can be cancelled or ignored whenever the jump goes the opposite way. As long as jump directions are reasonably predictable, doing things well in advance will usually pay off, and the machine will go at supercomputer speeds even when built from minicomputer hardware components.
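The following sketch shows only the shape of one such code motion, not Multiflow's actual implementation: an operation is moved from above a conditional jump to below it, and a compensation copy is placed on the path taken when the jump goes the other way, so its effect is not lost there. The instruction strings, the trace representation and the helper function are all invented for illustration.

    # Schematic sketch of one Trace Scheduling-style code motion (not the
    # real compiler): move an operation below a conditional jump and leave
    # a compensation copy on the off-trace edge.  Everything is invented.

    def move_below_jump(trace, op_index, jump_index, off_trace):
        """Move trace[op_index] to just after trace[jump_index], and copy it
        onto the off-trace code for that jump so the other path still sees
        its effect (the 'escape hatch')."""
        op = trace[op_index]
        label = trace[jump_index][1].rsplit("goto ", 1)[1]        # e.g. "ELSE"
        new_trace = (trace[:op_index]
                     + trace[op_index + 1:jump_index + 1]         # up through the jump
                     + [op]                                       # now below the jump
                     + trace[jump_index + 1:])
        off_trace[label].insert(0, op)                            # compensation copy
        return new_trace

    trace = [("op",   "t1 = load a[i]"),
             ("op",   "t2 = t1 * c"),             # candidate to move down
             ("jump", "if t1 == 0 goto ELSE"),    # on-trace path falls through
             ("op",   "store t2 -> b[i]")]
    off_trace = {"ELSE": []}

    trace = move_below_jump(trace, 1, 2, off_trace)
    # On-trace now:   load; jump; t2 = t1 * c; store
    # Off-trace ELSE: t2 = t1 * c; ...   (needed only if t2 is live there)

Moving an operation the other way, up above a jump, is the speculative case: it is legal when the operation's effect can simply be ignored on the path that did not want it.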
The work that led to these VLIW supercomputers came directly from the research world of horizontal microcode. It was largely compiler work, using many of the standard tools of optimizing compilers to enable the code motions that Trace Scheduling required. (My first writeups of Trace Scheduling appeared in the late 1970's. They are not clearly written and are hard to muddle through. Far better descriptions exist today, if one is interested. The best are my student John Ellis' thesis, "BULLDOG: A Compiler For VLIW Architectures," which won the ACM Thesis Of The Year Award in 1985 and is a 1986 MIT Press book; and various technical documents available from Multiflow Computer, Inc., Branford, CT, USA.)
4. MICROPROCESSOR SUPERCOMPUTERS?
From a hardware point of view, long instruction word machines started appearing in the late 1970's. Called attached processors, they were grown-up signal processing machines in which the ROM microstore was replaced by RAM to allow greater flexibility. Built by manufacturers like Floating Point Systems, Numerix, CSPI, CDC and others, these had near-supercomputer power in a very low cost box. But the problems of horizontal microcode compaction made them unusable as general purpose computers. Efficient code had to be crafted by hand to use their great power. They solved neither of the limitations of horizontal microcode: not the idiosyncratic operations, nor the conditional jumps. (Though I'm told that one, the Numerix MARS-432, is somewhat less idiosyncratic than the others and would be less of a problem to write a good compiler for.)
Now there are genuine VLIW's in which both problems are solved. (The Multiflow Trace family is what I'm thinking of.) The only resemblance these machines have to the attached processors is that they both do overlapped execution. The horizontal instructions consist of many RISC operations packed together in advance. (Today's machines already issue up to 28 operations per cycle. With pipelines, they sometimes keep 50 or more operations in flight at any one time.)
The second enabling technology for low cost, very high performance computers aimed at scientists and engineers is the microprocessor. The microprocessor revolution has been upon us for nearly two decades, but it is now appearing at the very high performance, near-supercomputer level. The difference has been:
1. The appearance of RISC microprocessors, which have led to very simple, and thus fast, circuits.
2. Good compiler technology, which encourages the use of simple CPU's.
3. The overlapped execution available with RISC microprocessors.
4. The occasional success of multiprocessor parallel processing.
5. The market viability of very fast floating point circuits, allowing low cost, numerically oriented machines.
Let's briefly examine these, in reverse order. The late 1980's have seen the introduction of ~20 MHz mass-market microprocessors and "minisupercomputers" (e.g. Convex, Alliant and Multiflow). These products are all profitably augmented with pipelined floating point units (such as those from companies like Weitek and Bipolar Integrated Technology) which have now reached critical mass. That means that it is a straightforward matter to build a high peak-megaflop machine, a necessary component in a scientific and engineering supercomputer.
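For a rough sense of the arithmetic, with illustrative numbers only (the clock rate and the one-multiply-plus-one-add-per-cycle assumption are not any particular product's specification):

    # Back-of-the-envelope peak rate with illustrative numbers only.
    clock_hz        = 20e6    # a roughly 20 MHz part
    flops_per_cycle = 2       # assume one pipelined multiply and one add start each cycle
    peak_mflops     = clock_hz * flops_per_cycle / 1e6
    print(peak_mflops)        # 40.0 peak MFLOPS from a single such unit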
The second effect has been the success of multiprocessor parallel processing. Although in my opinion this success is sometimes overstated, there are many important algorithms that can be made to run significantly faster using several processors ganged together. This becomes far easier, from a hardware point of view, when using microprocessors. These algorithms include those at the heart of dense matrix solvers, signal processing, and other important tasks.
Finally, let's consider the first three points. RISC is a new word, dating from the late 1970's. The concept is certainly not new. (There is some disagreement about what makes a processor a RISC architecture. People have variously suggested that a RISC machine should include: simple operations, few operations, unencoded operations, no microcode, sliding register windows, etc. When I talk about RISC, I mean the first on that list: simple operations in which the only access to memory is through simple loads and stores. I believe this is what really makes RISC different from the VAX/IBM/68000 architectures.)
RISC machines have been around forever; some of the first architectures were load/store. In particular, vertical microcode is usually a RISC architecture, though that fact is sometimes obscured once again by the idiosyncratic nature of microcode. Indeed, one reasonable view of RISC is that it is simply the elimination of the machine language level. Programs are compiled directly into the (cleaned up) vertical microcode. Once again, as we saw above, a big savings comes from eliminating the machine language emulation level of a CPU. With good compilers, that level serves no purpose. (An interesting question arises. Does RISC mean the elimination of microcode, as some refer to it, or does it mean microcode always? Of course it means both: it is the elimination of the specific microprogram meant to emulate the machine language level, and the opening up of the microcode directly for all programs. Thus microcoders are probably correct in feeling threatened by the RISC style, since they are usually employed to write the machine language emulators.)
The late 1980's have seen the appearance of very fast RISC microprocessors (e.g. those from MIPS), and the promise of others, particularly ones built around ECL-based high-density processes. With the addition of the floating point chips mentioned above, and the parallel processing capabilities of a multiprocessor, a formidable scientific engine can be produced (one that has been promised for later this year is the Ardent Graphics Workstation, which is intended to integrate a MIPS microprocessor, BIT floating point parts, Alliant-style parallel processing, good compiler technology and Japanese manufacturing). These may approach the speeds of the minisupercomputers of just a couple of years ago, and will cost considerably less, even with the addition of high-powered, integrated graphics.
Finally, one of the interesting directions of the RISC microprocessors is that they promise a small degree of overlapped execution, in the style of horizontal microcode. Today's designs often use delayed jumps, in which operations can be placed after a jump, even though they will be executed whichever way the jump goes. Thus they are computed in parallel with the jump instruction. Similarly, some RISC designs allow other operations to proceed while a floating point operation or a memory access is in flight. They can even allow unrelated operations to share pipestages simultaneously. These are all small special cases of the VLIW parallelism referred to above, and they all have the effect of speeding up the computation.
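As a toy illustration of the delayed-jump idea (the instruction strings are invented and the dependence test is deliberately crude), a compiler can move an earlier operation that the jump does not depend on into the slot after the jump, where it executes whichever way the jump goes:

    # Toy delay-slot filler: move an instruction the jump does not depend on
    # into the slot after the jump.  Instruction syntax and the crude
    # string-based dependence check are invented for illustration.

    def fill_delay_slot(block):
        """block: instruction strings ending with a conditional jump and a
        'nop' delay slot.  Returns a new list with the slot filled if a safe
        earlier instruction exists."""
        jump, slot = block[-2], block[-1]
        assert slot == "nop"
        for i in range(len(block) - 3, -1, -1):
            dest = block[i].split()[1].rstrip(",")    # destination register
            if dest not in jump:                      # the jump never reads it
                return block[:i] + block[i + 1:-1] + [block[i]]
        return block                                  # nothing safe to move

    code = ["add r3, r1, r2",
            "sub r4, r5, r6",
            "beq r4, r0, done",
            "nop"]
    print(fill_delay_slot(code))
    # ['sub r4, r5, r6', 'beq r4, r0, done', 'add r3, r1, r2']
    # The add now sits in the delay slot and runs whichever way the beq goes.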
5. SUMMARY AND FUTURE PROSPECTS
Perhaps the organizers of the original Euromicro saw what most of us couldn't: the strong relationship between the apparently unrelated topics of microprogramming and microprocessing. I'm not sure that anyone could have predicted that the two would, thirteen years later, produce some of the highest-powered scientific processors. Yet by using the power of horizontal microcode compaction techniques, and by applying RISC principles in very fast microprocessors, that is just what we have.
It is a straightforward matter to see these two trends and predict that they will continue to flourish and join together.
It is reasonable to expect that over the next five to ten years, microprocessors will incorporate more of the VLIW style fine-grained parallelism to get speedups far beyond what the component technology allows. In time, we will probably see multiprocessors in which each node is a VLIW microprocessor, programmed using something equivalent to Trace Scheduling. At that point, Microprogramming, Multiprocessing and Supercomputing will all amount to the same thing.