Component test toward single-flux-quantum processors


Physica C 392–396 (2003) 1490–1494 www.elsevier.com/locate/physc

M. Tanaka a,*, T. Kondo a, A. Sekiya a, A. Fujimaki a, H. Hayakawa a, F. Matsuzaki b, N. Yoshikawa b, H. Terai c, S. Yorozu d

a Department of Electronics, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8603, Japan
b Yokohama National University, Hodogaya, Yokohama 240-8501, Japan
c Communications Research Laboratory, 588-2 Iwaoka, Nishi-ku, Kobe 651-2492, Japan
d Fundamental Research Laboratories, NEC Corporation, 34 Miyukigaoka, Tsukuba 305-8501, Japan

Received 13 November 2002; accepted 31 January 2003

Abstract

We have developed components that are essential for most microarchitectures based on single-flux-quantum (SFQ) logic. The circuit under test is composed of two registers, an ALU and a control unit, and is made up of about 540 cells and 1600 Josephson junctions in an area of 1.5 × 2.4 mm. We obtained correct results for a sequence of several instructions, in which two integers are written into the registers with the LOAD and COPY operations and the ADD operation is then executed. Some of these functions were tested at both low and high frequencies. We found that the designed components work at 15 GHz at the designed bias current and up to 18 GHz at higher bias currents. © 2003 Elsevier B.V. All rights reserved.

PACS: 85.25.Hv; 85.25.Na
Keywords: SFQ; Processor; Registers; ALU; Superconducting device

* Corresponding author. Tel.: +81-52-789-3324; fax: +81-52-789-3160. E-mail address: [email protected] (M. Tanaka).

0921-4534/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0921-4534(03)00749-4

1. Introduction

Future network servers are required to have not only high throughput but also low power consumption. Single-flux-quantum (SFQ) logic [1] has high potential to satisfy these requirements. In addition, apart from semiconductor devices, SFQ logic is the only device technology in which large-scale integration has been realized. For these reasons, we have started to develop microprocessors based on SFQ circuit technology, while the State University of New York at Stony Brook has already designed another microprocessor, referred to as the FLUX chip [2].

The design concept of our SFQ microprocessors is the complexity-reduced (CORE) architecture [3], in which the complexity of the system is reduced in exchange for using the high clock rate of SFQ circuits. In our first implementation, a microprocessor named CORE1, we employed a simple architecture similar to that of the TIPPY chip [4] developed at Yokohama National University. The CORE1 handles 8-bit-wide bit-serial data to reduce complexity and to avoid timing adjustment across many parallel data lines. The purposes of the CORE1 are as follows:

• To demonstrate key components.
• To search for a suitable architecture in which microprocessors operate at a target clock rate of a few tens of GHz under the CORE concept, by designing and implementing actual SFQ circuits.
• To identify key issues concerning high-speed operation and integration.

As a first step, we have developed a combination of registers and an ALU, because this combination is essential for most microarchitectures and because the operating speed of these parts directly determines the performance of the processor. In addition, their timing design is the most difficult, because the registers must send data to the ALU and receive other data from the ALU at the same time. In this paper, we describe the design concept of the combination of the registers and the ALU, and report experimental results for high-speed operation.

Fig. 1. The CORE1 processor's microarchitecture. Except for the 5-bit address bus, data lines are bit-serial. [Blocks labelled in the diagram: RAM (32 bytes), PC, IR, IRa, IRb, REG0, REG1, ALU and Controller, connected by bit-serial data buses (1b x 8), the 5-bit address bus and a 3-bit opcode line.]

2. Component design

Although the data-driven self-timing (DDST) technique was used in the TIPPY to ease the timing design, we designed the registers and the ALU with synchronous SFQ logic, because we considered that a synchronous design leads to a smaller circuit area and a small delay between neighboring components such as an ALU-register set.

The block diagram of the CORE1 microarchitecture is illustrated in Fig. 1. As shown there, register 0 (REG0) simultaneously sends data to and receives data from the ALU. The key issue in the data-path design is how to reduce the delay between the ALU and the register. To make the latency much smaller, we employed a new clock distribution called "branch", shown in Fig. 2, in the registers and the ALU. In branch clocking, the clocked gates are arranged in a ring shape, and the input and output ports are the first to be fed with clock pulses. About half of the circuit, on the output side, is driven by counter-flow clocking, and the other half by concurrent-flow clocking. As a result, the register can send and receive the first bit of a data stream immediately. Another benefit of the branch is a relatively simple layout for the clock distribution, as shown in Fig. 2, though precise timing estimation is required.

Fig. 2. The branch clock distribution. The circuit is driven by counter-flow and concurrent-flow clocking to reduce its latency.

Verilog-XL simulation based on the CONNECT cell-based design technique [5] made it much easier to design these components. All of our CONNECT cells carry bias-dependent timing information calculated by JSIM [6] and SCOPE [7]. These tools enable us to realize sophisticated timing designs such as the branch.

2.1. Registers

Fig. 3 shows the 8-bit register designed with the branch. To hold data in the register, a loop-back datapath is added. The loop-back path is composed of four D flip-flops, corresponding to the number of stages in the ALU. Thus, twelve clock pulses are always required for every register operation. Each 8-bit register has 257 Josephson junctions. To control the data stream, each register has two non-destructive read-out (NDRO) flip-flops, denoted by X in Fig. 3. One determines whether or not the data is sent to the ALU (the "enable" or "disable" state). The other connects or disconnects the loop-back datapath (the "through" or "clear" state).

Fig. 3. Block diagram of the designed 8-bit register with the branch clock distribution. The loop-back datapath with four D flip-flops, shown at the bottom, is used to hold the value. The X marks are NDROs that control the data stream. The black dots are splitters and the white circle represents a confluence buffer. [Labels in the diagram: 8-bit shift register, 4-bit loop back, DFF, clock, data_in, data_out, enable/disable, through/clear.]

2.2. ALU
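As a rough illustration of the bit-serial carry-save addition performed by the ALU of this section, the following Python sketch adds two LSB-first bit streams one bit per clock. It is a behavioural model only, not the gate-level SFQ circuit. The names `to_stream`, `from_stream` and `serial_add` are hypothetical, the pipeline latency is ignored, and the carry reset is modelled simply by clearing the carry after the eighth bit. The 12-bit stream format (eight data bits followed by four zero bits) follows the description of the test data in this paper.

```python
WORD_BITS = 8   # data bits per word
PAD_BITS = 4    # trailing zero bits, so one operation spans 12 clock pulses


def to_stream(value):
    """Return value as an LSB-first list of bits followed by zero padding."""
    return [(value >> i) & 1 for i in range(WORD_BITS)] + [0] * PAD_BITS


def from_stream(bits):
    """Rebuild the integer from the first WORD_BITS bits of an LSB-first stream."""
    return sum(bit << i for i, bit in enumerate(bits[:WORD_BITS]))


def serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams, one bit per clock, saving the carry."""
    carry = 0
    out = []
    for i, (a, b) in enumerate(zip(a_bits, b_bits)):
        if i == WORD_BITS:
            carry = 0                      # discard the carry out of the MSB (overflow)
        half = a ^ b                       # first XOR
        out.append(half ^ carry)           # second XOR gives the sum bit
        carry = (a & b) | (half & carry)   # two ANDs; the two terms never fire together
    return out


total = serial_add(to_stream(219), to_stream(236))
print(from_stream(total), total[WORD_BITS:])   # 199 [0, 0, 0, 0]
```

With the operands used in the test (219 and 236), the sketch returns 199 and leaves the four trailing bits at zero, which corresponds to the deleted carry pulse.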

Fig. 4 shows the designed ALU, which is made up of 202 Josephson junctions. Our first ALU is simply a full adder built from two AND gates and two XOR gates with carry saving, because the SKIP instruction was not yet supported at that time. The ALU has four pipeline stages: two for the addition, one for resetting the carry pulse when an overflow occurs, and one latch inserted for timing adjustment. Branch clocking is also used in the ALU.

Fig. 4. Block diagram of the designed ALU. It is composed of a 4-stage pipelined carry-save serial adder. The branch clocking is also applied. [Labels in the diagram: data_in_1, data_in_2, clock, reset, data_out, DFF, D3FF.]

3. Design of test circuit

The test circuit is composed of two registers (REG0 and REG1), an ALU and a control unit. A ladder circuit is included to test them with a high-speed clock. Fig. 5 shows a microphotograph of the test circuit, which was fabricated using the NEC 2.5 kA/cm² standard niobium process [8]. It is made up of about 540 cells and 1600 Josephson junctions in an area of 1.5 × 2.4 mm.

Fig. 5. Microphotograph of the test circuit. It is composed of two registers, an ALU and a control unit, and is made up of about 1600 Josephson junctions in an area of 1.5 × 2.4 mm.

The circuit can execute four functions that are required for the CORE1 instruction set: LOAD, STORE, ADD and COPY. These are selected by the states of the four NDROs, two in each register.

In the CORE1 architecture, data is always loaded into REG1. To load data, we set REG1 to the clear state so that it can be written, while REG0 is in the through state to keep its content. In addition, both REG0 and REG1 are set to the disable state so that no data is passed to the ALU.

Only data held in REG0 can be output by the STORE instruction. During this operation, the data stored in REG1 is kept by setting REG1 to the through and disable states. We chose the clear and enable states for REG0 in this operation, although the data of REG0 could also be kept by selecting through and disable. Since the content of REG0 is rewritten through the ALU after being added to zero, REG0 keeps its original value. This sequence is also valid for the SKIP instruction, in which REG0 is compared with zero.

To add REG1 to REG0, we set REG1 to the through/enable state and REG0 to the clear/enable state, so that the ALU adds their data. The result is written back to REG0. We can execute the COPY operation by changing REG0 to the disable state from the ADD settings. The sum of the data in REG1 and zero is written to REG0, so that REG1 is copied to REG0. Table 1 summarizes the NDRO states for each function.

Table 1
The NDRO states for the four functions of the test circuit

Function   REG0                  REG1
LOAD       Disable   Through     Disable   Clear
STORE      Enable    Clear       Disable   Through
ADD        Enable    Clear       Enable    Through
COPY       Disable   Clear       Enable    Through

The NDROs are set in the enable and through states.

In our test circuit, the control unit is a 4-bit shift register with taps. The 4-bit control code specifies the enable and through states of the two registers. When a control code is written, all NDROs are first reset, and then the corresponding NDROs are set according to the code. We can test the ADD and COPY operations at both low and high frequencies by using the on-chip test technology [9] included in the CONNECT cell library. The ladder circuit in the test system is designed to generate 16 GHz pulses, the target clock speed of the CORE1.

4. Test results

Because each register has an output port, at low frequencies we can observe the contents of the two registers during the next operation, independently of the NDRO states. Note that data always starts with the least significant bit (LSB); the first eight bits carry the content and the last four bits are always zero. One of the test results is shown in Fig. 6, where we used DC/SFQ converters to generate SFQ pulses and SFQ/DC converters to monitor them. A DC/SFQ converter generates an SFQ pulse at each rising edge of its input signal, whereas every transition of an SFQ/DC output signal, rising or falling, indicates the arrival of an SFQ pulse.

Fig. 6. One of the test results of the component test. The control codes were set according to Table 1. The COPY and ADD operations were tested with high-speed clock pulses generated by the ladder circuit. We put a reset code at the end of the sequence, instead of STORE, to clear the registers.

In the case of Fig. 6, two integers were added correctly as follows. First, we loaded 219₁₀ (1101 1011₂) into REG1 with twelve low-frequency clock pulses. Then we copied it to REG0. Next, we loaded 236₁₀ (1110 1100₂) into REG1 and added the two values. The COPY and ADD operations were performed with high-speed clock pulses. Finally, we read out and verified the content of REG0 with low-frequency clock pulses, setting all NDROs to "0" at that time to clear the data. The result should be 199₁₀ (1100 0111₂) rather than 455₁₀ because of the overflow: only the lower eight bits are kept, and 455 − 256 = 199. Fig. 6 shows that the last carry pulse is successfully deleted.

Fig. 7 shows the measured and cell-level-simulated operating frequency as a function of the bias current supplied to the test circuit, normalized by the designed value. The operating frequency was varied through the current supplied to the ladder circuit. The measured values agree with the simulated ones. The measured maximum operating frequencies are smaller than the simulated ones by about 1 GHz. We think that global spreads in the device parameters affected the maximum operating frequency. On the other hand, the minimum operating frequencies are almost the same, indicating that the ladder circuit could not generate pulses below 10 GHz. At the designed bias point, the maximum operating frequency was limited to 15 GHz. The carry path in the serial adder and the interconnections between the ALU and each register were the critical paths in the test circuit. To increase the clock frequency to the CORE1 target of 16 GHz, some new wiring cells are needed to optimize these paths.

Fig. 7. The operating frequency for the measurement and the cell-level simulation. The test circuit worked correctly at 15 GHz under the designed bias and operated up to 18 GHz. [Plot: frequency (GHz, 0 to 25) versus bias current normalized by the designed value (−30% to 30%); series: simulation, measured; annotations: maximum operating frequencies, minimum frequencies to be tested (limited by the ladder circuit), bias margin at low frequencies (−24% to 19%).]

5. Conclusions

We have designed the combination of 8-bit registers and an ALU for the CORE1 processor. We used synchronous logic with a new clock distribution named branch for these components in order to reduce their latency and area. We successfully tested the combination with a sequence of several instructions at both low and high frequencies. The components worked at 15 GHz at the designed bias point and operated up to 18 GHz. The results of the component test agree well with those of the simulation. This agreement supports the effectiveness of our cell-based design technique based on the CONNECT cell library.
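The NDRO settings of Table 1 and the instruction sequence used in the component test (LOAD, COPY, LOAD, ADD) can be sketched as a word-level behavioural model in Python. This is only an illustration under stated assumptions. The names `CONTROL`, `Datapath` and `step` are hypothetical, and each call to `step` stands for one complete twelve-clock operation, so bit-serial timing and the branch clocking are not modelled. Returning the ALU output from `step` as the value seen by STORE is also an assumption.

```python
from dataclasses import dataclass

WORD_MASK = 0xFF  # 8-bit data words

# Table 1 as (REG0 enable, REG0 through, REG1 enable, REG1 through).
# True means the NDRO is set (enable / through); False means disable / clear.
CONTROL = {
    "LOAD":  (False, True,  False, False),
    "STORE": (True,  False, False, True),
    "ADD":   (True,  False, True,  True),
    "COPY":  (False, False, True,  True),
}


@dataclass
class Datapath:
    reg0: int = 0
    reg1: int = 0

    def step(self, function, data_in=0):
        """Run one operation and return the ALU output (assumed STORE output)."""
        en0, thru0, en1, thru1 = CONTROL[function]
        # An enabled register feeds the ALU; a disabled one contributes zero.
        # Masking to 8 bits models the deleted carry pulse on overflow.
        alu_out = ((self.reg0 if en0 else 0) + (self.reg1 if en1 else 0)) & WORD_MASK
        # REG0 either keeps its value via the loop-back path or takes the ALU result.
        new_reg0 = self.reg0 if thru0 else alu_out
        # REG1 either keeps its value via the loop-back path or takes external data.
        new_reg1 = self.reg1 if thru1 else (data_in & WORD_MASK)
        self.reg0, self.reg1 = new_reg0, new_reg1
        return alu_out


dp = Datapath()
dp.step("LOAD", 219)   # REG1 <- 219
dp.step("COPY")        # REG0 <- REG1 + 0
dp.step("LOAD", 236)   # REG1 <- 236, REG0 kept through the loop-back path
dp.step("ADD")         # REG0 <- REG0 + REG1 (mod 256)
print(dp.reg0)         # 199
```

The model reproduces the tested result of 199 and shows why STORE with REG0 in the clear/enable state still preserves REG0: the value is rewritten through the ALU after being added to zero.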

Acknowledgements

The authors would like to thank all the CONNECT members consisting of Nagoya University, NEC, CRL and Yokohama National University.

References

[1] K.K. Likharev, V.K. Semenov, IEEE Trans. Appl. Supercond. 1 (1992) 3.
[2] M. Dorojevets, P. Bunyk, D. Zinoviev, IEEE Trans. Appl. Supercond. 11 (2001) 326.
[3] A. Fujimaki, Y. Takai, N. Yoshikawa, IEICE Trans. Electron. E85-C (2002) 612.
[4] N. Yoshikawa, F. Matsuzaki, N. Nakajima, K. Fujiwara, K. Yoda, K. Kawasaki, Applied Superconductivity Conference, Houston, August 2002.
[5] S. Yorozu, Y. Kameda, H. Terai, A. Fujimaki, T. Yamada, S. Tahara, 14th International Symposium of Superconductivity, Kobe, September 2001.
[6] E.S. Fang, T. Van Duzer, Ext. Abst. 1989 Int. Supercond. Electron. Conf., Japan, 1989, p. 407.
[7] N. Mori, A. Akahori, T. Sato, N. Takeuchi, A. Fujimaki, H. Hayakawa, Physica C 357–360 (2001) 1557.
[8] S. Nagasawa, Y. Hashimoto, H. Numata, S. Tahara, IEEE Trans. Appl. Supercond. 5 (1995) 2447.
[9] T. Yamada, A. Sekiya, A. Akahori, H. Akaike, A. Fujimaki, H. Hayakawa, Y. Kameda, S. Yorozu, H. Terai, Supercond. Sci. Technol. 14 (2001) 1071.