Asynchronous control of modules activity in integrated systems for reducing peak temperatures


INTEGRATION, the VLSI journal 41 (2008) 447–458 www.elsevier.com/locate/vlsi

Slawomir Mikula (a,*), Gilbert De Mey (b), Andrzej Kos (a)

(a) AGH University of Science and Technology, Cracow, Poland
(b) University of Ghent, Ghent, Belgium

Received 30 March 2007; received in revised form 4 November 2007; accepted 4 January 2008

Abstract

The paper describes a new method of controlling the activity of integrated circuit (IC) modules in a modern processor design. The control method improves the frequency capability of integrated systems. The proposed solution, based only on a modification of the computing flow, could easily be integrated into future designs, ranging from portable computing to multi-core computing. A new approach to thermal control is described along with simulation results. An example of its incorporation into current and future mainstream integrated circuit designs is presented, with exemplary algorithms and final simulation results.
© 2008 Elsevier B.V. All rights reserved.

Keywords: Asynchronous control; Temperature reduction; VLSI; Microprocessors

* Corresponding author: S. Mikula. doi:10.1016/j.vlsi.2008.01.003

1. Introduction

Current research emphasizes that mobile systems will be a major player in future production plans [1]. Most top-trend consumer electronic products are portable devices. Users expect new technologies to be commonly available, and the global Internet only strengthens this trend. Demand for computing power rises every year [2], for example in multimedia streaming delivery, cryptography, computer graphics (e.g. Transform & Lighting) and physics computing (e.g. Havok Physics/Complete) [3]. These premises have led to intensive research on power supply units [4] and on improvements in integrated circuit (IC) design.

This article presents a new method of controlling the activity of integrated circuit modules in a modern processor design. The control method improves the frequency capability of integrated systems. The proposed solution, based only on a modification of the computing flow, could easily be integrated into future designs, ranging from portable computing to multi-core computing.

Reliability depends exponentially on temperature, which is the reason for favoring lower operating temperatures [6]. Viswanath et al. noted that reducing the temperature by only 10–15 °C can extend the lifespan of a device by as much as a factor of two [5]. The method presented in this paper could help minimize the peak temperature of an integrated circuit.

1.1. Power consumption limiting

Mobile designs such as Intel Centrino and AMD Mobile aim to minimize power consumption [8,9]. This trend has a clear drawback: limiting power consumption decreases computation power. Current methods of limiting power consumption fall into three categories: dynamic frequency scaling (DFS), dynamic voltage scaling (DVS) and dynamic clock throttling (DCT). Variants of frequency and voltage scaling, along with clock throttling, can significantly reduce power consumption [16]. For example, Pentium M processors use two methods at the same time [22].


Voltage and frequency are scaled together to achieve reductions in energy per computation. Scaling the frequency alone is insufficient: reducing the clock frequency lowers the power, but the execution time is, to a first approximation, inversely proportional to the clock frequency, so the computation takes longer and consumes the same total energy. Because power consumption depends quadratically on the voltage level, scaling the voltage proportionally with the clock frequency offers a significant total energy reduction while the processor runs at a reduced performance level.

Many algorithms have been designed to control the voltage and the frequency in modern processors. Voltage scaling has been studied in many publications [10,15,17,20,23–26,30]. Frequency scaling has also been considered as a mechanism to minimize power consumption [13,21,27]. Research into extreme uses of electronics (e.g. extraterrestrial exploration) [7,29] and attempts at power control on the algorithm level [14,19,28] have also had a good influence on research in this field.

The power needed for a computing effort is tightly connected with the chip temperature. Controlling power dissipation is one way to achieve a proper working temperature. This article presents another method, based on a combination of voltage scaling and a modification of the scheduling algorithm.

1.2. Processors design trends

Nowadays, the designs of commonly used processors are similar. The frequency war no longer has a big impact on new processor releases. Current design concentrates on concurrent computing rather than on increasing computation power through a higher working frequency. Dynamic power is tightly coupled with the working frequency, and in portable designs it is the most obvious design show-stopper [18]. The same problem arises in a mainframe consisting of many multi-core processors. Power consumption is the main problem nowadays. Blade designs [11] consist of many small, independent computing processors incorporated into a small chassis. The power consumption of a normal consumer processor (Dual/Quad Core, Athlon, etc.) is too high for this solution. In these fields, thermal management plays the main role in industrial scale applications.

Based on the presented general-purpose processor specifications, it can be seen that the silicon structure is becoming more regular than ever before. The processor design trend is to isolate functional blocks. Computing modules which appear as general purpose processors are as universal as possible. There is no functional specialization across the blocks. The modules are placed regularly on the chip surface (see Figs. 1 and 2). With this design principle, a modification of the dynamic clock throttling method combined with dynamic voltage scaling can lead to interesting results.
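The energy argument of Section 1.1 can be checked numerically. The sketch below assumes the common first-order model of dynamic power, P = C·V²·f, which is not spelled out in the paper; the capacitance, cycle count, voltages and frequencies are hypothetical values chosen only for illustration, and the assumption that the voltage can be halved together with the frequency is an idealization.

```python
def dynamic_energy(c_eff, voltage, frequency, cycles):
    """Energy (J) needed to run `cycles` clock cycles, assuming P = C * V^2 * f."""
    power = c_eff * voltage ** 2 * frequency   # dynamic power in watts
    exec_time = cycles / frequency             # seconds, inversely proportional to f
    return power * exec_time                   # reduces to c_eff * V^2 * cycles

C_EFF = 1e-9    # hypothetical effective switched capacitance (F)
CYCLES = 1e9    # hypothetical workload length (clock cycles)

baseline = dynamic_energy(C_EFF, 1.2, 2.0e9, CYCLES)       # full speed
freq_only = dynamic_energy(C_EFF, 1.2, 1.0e9, CYCLES)      # frequency halved only
volt_and_freq = dynamic_energy(C_EFF, 0.6, 1.0e9, CYCLES)  # voltage scaled with frequency

print(f"baseline {baseline:.2f} J, frequency only {freq_only:.2f} J, "
      f"voltage and frequency {volt_and_freq:.2f} J")
```

Under this model the frequency cancels out of the energy, so only the voltage change reduces the energy per computation, which is the point made in the text.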

Fig. 1. Current CELL processor design.

Fig. 2. Future design of Intel 80 cores processor (proposed in production for 2009) [31]. [Figure labels: ROUTER, CORE.]

2. Asynchronous control of blocks activity

Asynchronous control of the activity of integrated circuit functional blocks is based on two power saving methods: dynamic clock throttling and dynamic voltage scaling. Let us make a few assumptions:

• New integrated circuits (e.g. processors) gain a universal applicability model.
• Integrated circuits have many (N) modules of the same function.
• It is possible to switch selectively between active modules by limiting the voltage and/or the active clock signals in unused modules.
• It is possible to control the module switching activity with specialized tracking software installed on the control level of the integrated circuit. The control level consists only of a simple control mechanism which activates/deactivates computation modules.

A functional block is an independent unit of the digital integrated system, e.g. memory, CPU, D/A/D converter, etc. A functional block is in the active state if it dissipates dynamic and static energy. The block is inactive if it dissipates only static energy. Switching activity is defined as a change from the active to the inactive state or vice versa.

Let us assume a simple model. A few modules are placed on a chip die (Fig. 3). The functions of all blocks are the same. Control modules, memory access, etc. are omitted in this example. A streaming computation (such as multimedia streaming or a T&L deformation) can be split into small chunks. During normal use of a processor, it is possible that the whole computing effort is concentrated in one module (or a few modules).

Fig. 3. Example placement of functional modules (based on the CELL processor architecture). [Figure labels: Control unit, SPE1–SPE10.]

According to Amdahl's law and benchmarks performed by Intel with its Dual Core and Quad Core processors [40], most common technical applications do not scale well beyond 4–8 cores. Intel benchmarks with the LS-DYNA application show that the results for Intel Dual Core and Quad Core processors are 1.08–1.32 (relative to the Dual Core AMD Opteron benchmark efficiency, marked as 1). Another example is a simulation of parallelism. For comparison purposes, let us assume a computation effort of 100 s on one processor, of which 88% can be parallelized. The resulting processing efficiency is shown in Table 1. Table 1 shows that the processing efficiency, after increasing the processor count, is not as high as expected. For this paper, we therefore assume a duty cycle of 0.5 for the normal workload of the cores.

The proposed method is based on maximizing the activity switching of the modules. The control level mechanism is able to divide the activity diagram into the smallest possible chunks and to control the execution with the selected algorithm (for example, based on a characteristic of the computing stream). The data stream is forwarded to different modules. An exemplary execution of this method is presented in Fig. 4 (before optimization) and Fig. 5 (after optimization).

Fig. 4. Activity control with the block usage of different modules. [Figure: activity of modules 1–6 versus time.]

Table 1
Parallel processing efficiency with multi-core environment

No. of processor cores   Parallel execution time (s)   Total time (s)   Speedup   Processing efficiency (%)
1                        88                            100              1.000     100
2                        44                            56               1.786     89.3
4                        22                            34               2.941     73.5
8                        11                            23               4.348     54.4
16                       5.5                           17.5             5.714     21.2
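The time and speedup columns of Table 1 follow from Amdahl's law with a 100 s workload of which 88% is parallelizable. The short sketch below recomputes them; the function name and the definition of efficiency as speedup divided by core count are assumptions made for this illustration, not taken from the paper.

```python
def amdahl_row(total_time, parallel_fraction, cores):
    """Return (parallel time, total time, speedup, efficiency in %) for one core count."""
    serial = total_time * (1.0 - parallel_fraction)      # part that cannot be split
    parallel = total_time * parallel_fraction / cores    # part shared by all cores
    total = serial + parallel
    speedup = total_time / total
    efficiency = 100.0 * speedup / cores                 # assumed definition of efficiency
    return parallel, total, speedup, efficiency

rows = {n: amdahl_row(100.0, 0.88, n) for n in (1, 2, 4, 8, 16)}
for n, (par, tot, spd, eff) in rows.items():
    print(f"{n:2d} cores: parallel {par:5.1f} s, total {tot:5.1f} s, "
          f"speedup {spd:.3f}, efficiency {eff:.1f}%")
```

With this definition the 2-, 4- and 8-core efficiencies come out close to the tabulated values, while the 16-core case gives about 35.7% rather than the 21.2% listed in Table 1; the sketch does not attempt to explain that difference.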

Fig. 5. Activity control with the maximum activity switching method.

Increasing the switching activity pulsation of the functional blocks reduces the maximum chip temperature. The concept is verified in the next section.

3. Thermal modeling

In order to verify the statements made in the foregoing sections, one has to evaluate the temperature distribution in the chip as a function of space and time, T(x, y, z, t). The time dependent heat conduction equation is given by [33]:

$$ k \nabla^2 T = c_v \frac{\partial T}{\partial t}, \tag{1} $$

where $k$ is the thermal conductivity and $c_v$ the thermal capacity per unit volume. Note that $c_v = \rho C_p$, where $C_p$ denotes the heat capacity per unit weight and $\rho$ the specific weight. In the literature, the thermal diffusivity $\kappa = k/c_v$ is also often used.

In order to determine the temperature distribution accurately, one should consider not only the silicon chip but also the package, the leads, the cooling fin, the convective cooling environment, etc. However, in this paper we are only interested in evaluating the influence of the switching activity on the (peak) temperature distribution. The total power dissipation, and hence the average chip temperature, is not influenced by the switching activities. Indeed, switching activity means that the position of the heat source changes in space, but due to the high thermal conductivity of silicon the influence on the average temperature is negligible.

In order to solve Eq. (1), the power dissipation P(t) can be considered as the superposition of a constant and uniform power dissipation $P_0$ and an alternating component $P_{AC}(t)$, as shown in Fig. 6.

Fig. 6. The inverse power module switching method. [Figure: P(t) = P0 + PAC(t), each plotted versus time t.]

Fig. 7. Model layout. [Figure: top view and side view of the chip with dimensions a, b and c, source limits x1, x2, y1, y2, and the bottom boundary condition TAC = 0.]

The temperature distribution T in the periodic regime is then:

$$ T(x, y, z, t) = T_0 + T_{AC}(x, y, z, t), \tag{2} $$

where $T_0$ is the steady state temperature distribution due to $P_0$ only. In order to find $T_0$, the package and all surrounding materials have to be taken into account. In this paper, we are only interested in the peak temperatures, because $T_0$ cannot be influenced by the switching activities. Hence, we can limit ourselves to the evaluation of $T_{AC}$ or its peak values. It has been proved that a temperature field due to an alternating heat source like $P_{AC}$ (Fig. 6) is limited to the silicon chip itself if the switching frequency is not too low [34,35]. In order to find $T_{AC}$, Eq. (1) has to be solved inside the silicon chip.

We now consider a chip with lateral dimensions a and b and thickness c, as shown in Fig. 7. At the top, z = c, and at the sides x = 0, x = a, y = 0 and y = b, the adiabatic boundary condition is applied. On the bottom side, the isothermal condition $T_{AC} = 0$ is used. All heat sources are considered to be infinitely thin and located on the top surface z = c.

In order to solve Eq. (1), the Green's function approach will be used. By definition, the Green's function is the temperature distribution due to a power pulse in the origin. In the infinite space the Green's function is found to be [36]:

$$ G(x, y, z, t) = \frac{\sqrt{c_v}}{8 \left( \sqrt{\pi k t} \right)^{3}} \exp\left( - \frac{x^2 + y^2 + z^2}{4 (k/c_v) t} \right). \tag{3} $$

If the power density inside the heat source (Fig. 7) is:

$$ p(t) = \frac{P_{AC}(t)}{(x_2 - x_1)(y_2 - y_1)}, \tag{4} $$

the temperature distribution is then found by superposition:

$$ T_{AC}(x, y, z, t) = \int_0^t \frac{\sqrt{c_v}\, p(t')\, dt'}{8 \left( \sqrt{\pi k (t - t')} \right)^{3}} \exp\left( - \frac{z^2}{4 (k/c_v)(t - t')} \right) \int_{x_1}^{x_2} \exp\left( - \frac{(x - x')^2}{4 (k/c_v)(t - t')} \right) dx' \int_{y_1}^{y_2} \exp\left( - \frac{(y - y')^2}{4 (k/c_v)(t - t')} \right) dy'. \tag{5} $$

In Eq. (5) it has been implicitly assumed that all temperatures are zero at the starting time t = 0. However, Eq. (5) is only correct for the infinite space. To take the boundary conditions into account, a series of image heat sources has to be added, as outlined in the literature [37]. The integrations in Eq. (5) with respect to x′ and y′ can be carried out analytically. The time integration with respect to t′ has to be done numerically. Note that Eq. (5) is valid for any point (x, y, z) at any time t.

For a chip with dimensions a = 5 mm, b = 5 mm and c = 0.3 mm and a rectangular heat source limited by x1 = 0.5 mm, x2 = 0.5 mm, y1 = 2.5 mm and y2 = 2.5 mm, the temperature at the center point has been evaluated versus time. Five to ten periods of simulation turned out to be enough to reach the periodic steady state. Fig. 8 shows the result for different values of the switching frequency.
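A minimal numerical sketch of Eq. (5) is given below, for the infinite-space case only (no image sources). It is not the authors' implementation. The x′ and y′ integrals are replaced by their closed-form error-function expressions, and the time integral uses a midpoint rule after the substitution τ = t − t′ = u². The silicon constants, the 1 mm × 1 mm source, the power density and all function names are assumptions made for this illustration.

```python
import math

K_SI = 150.0     # assumed thermal conductivity of silicon, W/(m K)
CV_SI = 1.63e6   # assumed volumetric heat capacity of silicon, J/(m^3 K)

def kernel(tau, x, y, z, x1, x2, y1, y2, k=K_SI, cv=CV_SI):
    """Green's function of Eq. (3) integrated analytically over the source rectangle."""
    s = 2.0 * math.sqrt(k / cv * tau)                    # sqrt(4 (k/cv) tau)
    ex = math.erf((x2 - x) / s) - math.erf((x1 - x) / s)
    ey = math.erf((y2 - y) / s) - math.erf((y1 - y) / s)
    gauss_z = math.exp(-(z / s) ** 2)
    return gauss_z * ex * ey / (8.0 * math.sqrt(math.pi * k * cv * tau))

def t_ac(t, p, x, y, z, x1, x2, y1, y2, steps=2000):
    """Midpoint-rule time integration of Eq. (5); p(t) is a power density in W/m^2."""
    du = math.sqrt(t) / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * du       # tau = u^2 removes the 1/sqrt(tau) singularity
        tau = u * u
        total += p(t - tau) * kernel(tau, x, y, z, x1, x2, y1, y2) * 2.0 * u * du
    return total

def square_wave(freq, amplitude):
    """Zero-mean alternating power density with a 0.5 duty cycle."""
    return lambda t: amplitude if (t * freq) % 1.0 < 0.5 else -amplitude

# Hypothetical 1 mm x 1 mm source centred on the origin, observed at its centre
# after one period of a 1 MHz square wave of amplitude 1e6 W/m^2.
half = 0.5e-3
rise = t_ac(1e-6, square_wave(1e6, 1e6), 0.0, 0.0, 0.0, -half, half, -half, half)
print(f"T_AC at the source centre after one period: {rise:.4f} K")
```

For a very large source and a constant power density p, the integral reduces to p·√t / √(π·k·c_v), which gives a convenient check of the quadrature.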

Fig. 8. Peak temperature with the 0.5 switching duty cycle. [Plot: relative temperature versus time (×100 µs); legend entries: f=10 MHz, f=500 MHz, f=100 MHz, f=50 MHz, f=10 MHz.]

Fig. 9. Peak temperature Tpp versus switching pulsation. [Log-log plot: relative peak temperature (pp) versus activity pulsation [1/Hz], from 10000 to 1e+10; curves: result and help line (slope = 1/2).]

The most obvious result is that the higher the switching frequency, the lower the peak values of $T_{AC}$. This is easily understood physically: the faster the switching, the less time the temperature has to reach a high peak value. Hence, fast switching reduces the peak temperatures on the chip surface. The average temperature remains unaltered; only the peak temperatures can be influenced. If one plots the peak-to-peak values observed in Fig. 8 as a function of the switching frequency, a remarkable relation is observed (Fig. 9):

$$ T \sim \frac{1}{\sqrt{\omega}}, \tag{6} $$

where $\omega$ is the activity pulsation.

According to research by Shekhar Y. Borkar at Intel Labs [41], the distribution of interconnect lengths varies across a modern processor chip design. The most common are short nets with a length of less than 100 µm. This leads to the assumption that most connections are made inside functional blocks. Interconnect nowadays consumes ca. 50% of dynamic power [41]. Our method promotes fast switching of module activity (Figs. 8 and 9). Thus, we can say that the power loss is at least 2–4 times smaller between modules than inside functional blocks.

The result (6) agrees very well with the concept. Increasing the switching frequency reduces the peak temperatures. This conclusion is used in the proposed implementation of asynchronous module activity control.
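The scaling of Eq. (6) can be illustrated with a few lines of code. The sketch below also evaluates the thermal diffusion length √(2κ/ω), a textbook result for one-dimensional periodic heating that is not derived in the paper; the silicon constants and function names are assumptions made for this illustration.

```python
import math

K_SI = 150.0     # assumed thermal conductivity of silicon, W/(m K)
CV_SI = 1.63e6   # assumed volumetric heat capacity of silicon, J/(m^3 K)

def diffusion_length(freq_hz, k=K_SI, cv=CV_SI):
    """Depth (m) over which a periodic temperature wave decays, sqrt(2 kappa / omega)."""
    omega = 2.0 * math.pi * freq_hz
    return math.sqrt(2.0 * (k / cv) / omega)

def relative_peak(freq_hz, ref_freq_hz):
    """Peak temperature at freq_hz relative to ref_freq_hz, using T ~ 1/sqrt(omega)."""
    return math.sqrt(ref_freq_hz / freq_hz)

for f in (1e4, 1e6, 1e8):
    print(f"{f:.0e} Hz: diffusion length {diffusion_length(f) * 1e6:.2f} um, "
          f"peak relative to 10 kHz {relative_peak(f, 1e4):.3f}")
```

Under these assumptions a hundredfold increase in switching frequency lowers the alternating peak by a factor of ten, and the diffusion length at 10 kHz is already well below the 0.3 mm chip thickness, in line with the statement that the alternating field stays inside the silicon.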


4. Asynchronous control: a real life case study

This real life case study is based on the CELL processor architecture (Fig. 10). CELL is a joint venture project of three companies: IBM, Sony and Toshiba. Its architecture promotes concurrent code execution. The Cell Broadband Engine Architecture (CBEA) consists of many universal processing units (SPE, Synergistic Processing Element). The current version has eight SPE modules. Both the size and the universality of these modules make them an excellent field of study for the proposed thermal management method.

Fig. 10. CELL processor architecture [32]. [Figure labels: SPU, SPE, EIB (96 B/sec), PPE, Dual XDR, FlexIO.]

As presented in the previous section, an increase in the switching activity of functional modules could lead to a decrease in the maximum working temperature. In our case study, we chose two methods of module activity switching. The first one is based on Eq. (6) and switches module activity with the maximum frequency; the other one, used for comparison, is the block control method. The block control method schedules the activity to one module for a given period of time; after this time the running process is switched to another module. The time for which a module is active is one order of magnitude longer than with the first method.

The maximum switching activity method is based on a weighted round robin (WRR) algorithm, which is a best-effort scheduling discipline [38,39]. It is one of the simplest forms of the generalized processor sharing discipline. The WRR discipline chooses the next module for execution after calculating the priority of each module, which is based on the total usage count and the priority level arbitrarily set by the designer for each module. The least used module is chosen as the next one in a stream/flow computation. The block control method moves computing to one dedicated module for a specified number of computing chunks; after that, computing is moved to another module.

Let us define four modules M1–M4 of the same functionality. The time period considered by the WRR weight computation algorithm is defined as the unit of time in which one core can compute 100 computation chunks. Exemplary flows are presented in Figs. 11 and 12.

Fig. 11. Maximum switching algorithm (the round robin method). [Figure: activity of modules 1–4 versus time t.]

Fig. 12. Maximum switching with the block control algorithm. [Figure: activity of modules 1–4 versus time t.]

The control unit works with a switching frequency one order of magnitude lower than the main working frequency of the active module. Thus, the active power dissipated in the control unit is much smaller than in the active modules and can be neglected in the proposed method.

The algorithm used in the decision block is presented in Figs. 13–15.

Fig. 13. (a) Main algorithm and (b) decode instruction algorithm. [Flowchart (a): fetch instruction; decode instruction; execute instruction; end of instruction queue? (No: repeat, Yes: end). Flowchart (b): get instruction table; add module queue.]

Fig. 14. Execute instruction algorithm. [Flowchart: get next module from module queue to execute; find least used; assign task to selected module; update usage counter for module, perform usage count check; last module to execute? (No: repeat, Yes: end).]

Fig. 15. Find least used module procedure. [Flowchart: get list of all modules with the same function; get usage counters for all modules; get module with smallest count number (list); get distance (geometrical) between current and available modules' list; return module with min (distance) && min (usage counter).]

Two tables are used during algorithm execution. The first one (Table 2) defines each instruction and the time which each consecutive module type needs in order to execute this type of instruction correctly. The table width depends on the number of different module types (N). The numbers in the rows describe the execution time in arbitrary units (computation chunks). The next table (Table 3) represents the execution stream for a defined instruction flow. It is used as the usage count table for the WRR algorithm and as the input data for the temperature computation kernel. The number of steps (O) depends on the instruction complexity and the given instruction flow.

Table 2
Instruction definition table

Instruction   Effort in module type 1   Effort in module type 2   Effort in module type 3   …   Effort in module N
1             2                         1                         0                         …   0
2             3                         1                         0                         …   1
…             …                         …                         …                         …   …
M             1                         0                         1                         …   3

Table 3
Module queue table

Module   Step 1   Step 2   …   Step O
1        2        3        …   0
2        0        1        …   2
…        …        …        …   …
N        1        0        …   3

In the block control method, the computation software [12] is able to define a maximum module usage (max_module_usage) and the time between assigning a new task to an inactive module (time_wait); see Fig. 16. Exemplary flows are presented in Tables 4 and 5. The computed temperature values are presented in Figs. 17 and 18. The temperature difference between these two methods is almost 6 °C.

Fig. 16. Possible variables of the algorithm. [Figure: activity of modules 1–4 versus time t, with the intervals max_module_usage and time_wait marked.]

Let us move to a more advanced case. The next test case describes a modern processor. The test object designed for temperature control is based on the module schematic shown in Fig. 19. Six universal processing units are placed over the chip. The control unit is used only for initial tasks, e.g. reading the code from an external memory. Because all of the processing units are of the same type, the related instructions do not differ in module type usage (as they do in normal CISC units, e.g. ALU/FPU). The computation time (in quantified units) is chosen as the difficulty factor of each task. A simple program flow based on the above considerations was created. Instruction numbers give the time needed to perform the operation, e.g. 1 means one time unit and 5 means five time units.
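The find-least-used procedure of Fig. 15 can be sketched as follows. This is an illustrative reading of the flowchart, not the authors' code: the `Module` class, the 2 × 2 grid positions and the rule that the currently active module is excluded from the candidates are assumptions, and the designer-set priority level of the WRR discipline is left out.

```python
import math
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    function: str
    x: float          # hypothetical position on the die (arbitrary units)
    y: float
    usage: int = 0    # accumulated computation chunks

def find_least_used(modules, function, current):
    """Pick the least used module of the given function, nearest to the current one on ties."""
    candidates = [m for m in modules if m.function == function and m is not current]
    if not candidates:
        raise ValueError("no other module provides this function")
    least = min(m.usage for m in candidates)
    least_used = [m for m in candidates if m.usage == least]
    if current is None:
        return least_used[0]
    return min(least_used, key=lambda m: math.hypot(m.x - current.x, m.y - current.y))

def execute(modules, function, chunks, current=None):
    """Assign a task to the selected module and update its usage counter."""
    module = find_least_used(modules, function, current)
    module.usage += chunks
    return module

modules = [Module("M1", "spe", 0, 0), Module("M2", "spe", 1, 0),
           Module("M3", "spe", 0, 1), Module("M4", "spe", 1, 1)]
order = []
current = None
for _ in range(4):                      # four single-chunk tasks
    current = execute(modules, "spe", 1, current)
    order.append(current.name)
print(order)
```

Replacing `min` with `max` in the distance step would give the furthest-module variant used later in this section.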

Table 4
Sample computation results

Cycle   M1   M2   M3   M4
0       1    1    0    0
1       1    1    0    0
2       1    1    0    0
3       1    1    0    0
4       1    1    0    0
5       1    1    0    0
6       0    1    0    0

Two of four modules in active state. Total 13 computation chunks.
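The difference between the block control flow (Table 4) and the maximum switching flow (Table 5) can be quantified directly from the activity tables. The sketch below uses the table values as listed in the paper; the helper names and the two metrics (longest uninterrupted active run and number of state changes) are assumptions chosen for this illustration.

```python
BLOCK_CONTROL = {   # Table 4: activity per cycle 0..6
    "M1": [1, 1, 1, 1, 1, 1, 0],
    "M2": [1, 1, 1, 1, 1, 1, 1],
    "M3": [0, 0, 0, 0, 0, 0, 0],
    "M4": [0, 0, 0, 0, 0, 0, 0],
}
MAX_SWITCHING = {   # Table 5: activity per cycle 0..6
    "M1": [1, 0, 1, 0, 1, 0, 0],
    "M2": [0, 1, 0, 1, 0, 1, 0],
    "M3": [1, 0, 1, 0, 1, 0, 1],
    "M4": [0, 1, 0, 1, 0, 1, 0],
}

def total_chunks(schedule):
    """Number of computation chunks executed by all modules."""
    return sum(sum(row) for row in schedule.values())

def longest_active_run(row):
    """Longest number of consecutive cycles a module stays active."""
    best = run = 0
    for state in row:
        run = run + 1 if state else 0
        best = max(best, run)
    return best

def switch_count(row):
    """Number of active/inactive state changes (the switching activity)."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)

for name, schedule in (("block control", BLOCK_CONTROL), ("maximum switching", MAX_SWITCHING)):
    runs = max(longest_active_run(r) for r in schedule.values())
    switches = sum(switch_count(r) for r in schedule.values())
    print(f"{name}: {total_chunks(schedule)} chunks, longest run {runs}, {switches} switches")
```

Both flows execute the same 13 chunks, but the maximum switching flow never keeps a module active for more than one cycle in a row.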


Fig. 17. Temperature versus execution time for two of four active modules. [Plot: full duty cycle; temperature above ambient (°C) versus time (units, 0–0.7); labelled values: 0.00, 7.49, 10.90, 13.27, 15.09, 16.56.]

Fig. 18. Temperature above ambient versus execution time for activity of all of modules with the maximum switching activity pulsation. [Plot: maximum switching; temperature above ambient (°C) versus time (units, 0–0.7); labelled values: 0.00, 0.00, 3.85, 5.44, 7.49, 10.14.]

Table 5
Exemplary switching activity: all modules were used

Cycle   M1   M2   M3   M4
0       1    0    1    0
1       0    1    0    1
2       1    0    1    0
3       0    1    0    1
4       1    0    1    0
5       0    1    0    1
6       0    0    1    0

Total 13 computation chunks.

Fig. 19. The test object. [Figure labels: Control Unit, SPE1–SPE6.]

All modules are running on a 1 MHz clock. The program must be executed in one direct flow: each successive instruction depends on the result of the current one. The exemplary flow is presented in Table 6. The first method uses an algorithm which executes each successive instruction on the module that is (geometrically) furthest from the currently active module. It is an extension of the WRR algorithm presented at the beginning of this section. The module arrangement (cf. Fig. 19) ensures that this is not an accidental choice. The final result from the flow generator is presented in Fig. 20.

The second method is based on assumptions derived from Eq. (6). Let us assume that the maximum thermodynamical endurance of every computing module (e.g. the heat generated during module activity) is measured as the number N of time chunks that can be run before physical damage. The unit reaches its boundary in two computing chunks (two units of execution time per processing module). This value could be measured in real time, during processing in the unit. For this paper, let us assume it is a static value.

The second flow generation algorithm is based on the first one, enhanced with an activity switching method. In this case, we have to assume that the control unit can switch processing from one unit to another between time ticks. The altered flow is presented in Fig. 21. When the execution time exceeds two computing chunks, the execution is rerouted to the other unit which has the smallest thermodynamical state (an accumulated sum of execution chunks). An exemplary flow with execution activity of all modules is presented in Table 7.

For the flows shown above, the temperature values were calculated. The respective results are presented in Fig. 22 as an aggregation of these two flows. There is an obvious temperature gain when using the proposed method. The maximum peak temperature and the execution time are shown in Table 8. The temperature is measured above the ambient level. The execution time is measured in units equal to one computing chunk. The switching technique leads to a significant temperature decrease. The implementation of this method could be possible even in commercial, already built designs, e.g. the CELL processor.
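The rerouting rule described above (no more than two consecutive chunks per module, then a move to the module with the smallest accumulated sum of chunks) can be sketched as follows. This is a simplified illustration, not the authors' flow generator: it ignores the geometric distance criterion, breaks ties by the lowest module index, and switches module at every instruction boundary; all names are assumptions.

```python
PROGRAM = [1, 2, 1, 3, 5, 3, 1, 1, 1, 4, 1, 2, 1]   # instruction durations from Table 6
MAX_RUN = 2                                          # chunks a module may run without a break

def schedule_with_switching(program, n_modules=6, max_run=MAX_RUN):
    """Return (timeline of module indices per chunk, accumulated chunks per module)."""
    load = [0] * n_modules
    timeline = []
    current = None
    for duration in program:
        remaining = duration
        while remaining > 0:
            segment = min(remaining, max_run)
            # choose the least loaded module other than the one that has just run
            candidates = [m for m in range(n_modules) if m != current]
            current = min(candidates, key=lambda m: (load[m], m))
            load[current] += segment
            timeline.extend([current] * segment)
            remaining -= segment
    return timeline, load

timeline, load = schedule_with_switching(PROGRAM)
print("chunks per module:", load)
print("timeline:", timeline)
```

Because every segment is placed on a different module than the previous one, no module is active for more than `MAX_RUN` consecutive chunks, while the total of 26 chunks is unchanged.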

Table 6
Program definition

Instructions: 1- 2- 1- 3- 5- 3- 1- 1- 1- 4- 1- 2- 1

5. Conclusion

In this paper, we have presented a new approach to controlling integrated circuit activity. The proposed concept of module switching activity fits well with modern processor architectures. The concept verification confirms the usability of this method. An increase in the switching frequency leads to a decrease in the working peak temperature of an integrated circuit. The method of asynchronous control of the thermal activity of modules, and the results based on the theoretical model, are promising for a successful implementation in a real system.

Table 7
The activity flow for maximum distance and maximum distance with switching

Maximum distance
Time         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Module 1     0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  1
Module 2     0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
Module 3     0  0  0  0  0  0  0  0  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0
Module 4     0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  0  0  0  0  0  0  0  0  1  1  0
Module 5     0  0  0  0  0  1  1  1  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  0  0  0  0
Module 6     0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0

Maximum distance + switching
Time         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Module 1     0  1  0  0  0  0  0  0  0  1  1  0  0  0  0  0  1  0  0  0  0  1  1  0  0  0  1
Module 2     0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0
Module 3     0  0  0  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
Module 4     0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  1  1  0
Module 5     0  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  0
Module 6     0  0  1  1  0  0  0  0  0  0  0  1  1  0  0  0  0  1  0  0  0  0  0  0  0  0  0
Unlabelled   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Fig. 20. Program flow with maximum distance method. [Figure: activity of modules 1–6 versus time.]

Fig. 21. Program flow with maximum distance and switching method. [Figure: activity of modules 1–6 versus time.]

Fig. 22. Temperature evaluation with two methods: maximum geometrical distance and maximum geometrical distance with switching method. [Plot: peak temperature above ambient (°C, 0–10) versus time (units, 0–3); curves: using maximum geometrical distance; maximum geometrical distance with switching method.]

Table 8
Final results comparison

                           Maximum distance method   Maximum distance + power switching
Execution time (units)     2.6                       2.6
Maximum temperature (°C)   9.13                      7.18

Acknowledgment This work was prepared as a part of project number N515425933 from the Polish Ministry of Science and Higher Education, Poland. References [1] R. Simpson, E.J. Zwar, User Survey Report: Enterprise Mobile Usage, South Korea, 2005, 12 September 2006, /Gartner.comS. [2] A. Cheng, Power Management: Unplugged Gartner Dataquest, SEMC-WW-DP-0058, September 2001. [3] M. Macedonia, Why graphics power is revolutionizing physics, Computer Magazine, IEEE Computer Society, August 2000, pp. 91–92. [4] Power sources: fuel cells, solar cells and batteries, TRN Magazine, September 2003. [5] R. Viswanath, W. Vijay, A. Watwe, V. Lebonheur, Thermal performance challenges from silicon to systems, Intel Technol. J. 4 (3) (2000) Microprocessor Packaging. [6] K. Skadron, M.R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, D. Tarjan, Temperature-aware microarchitecture: extended discussion and results, UVa CS Technical Reports, University of Virginia, USA, April 2003. [7] D. Keymeulen, X. Guo, R. Zebulum, I. Ferguson, V. Duong, A. Stoica, High temperature experiments for circuit self-recovery, in: Proceedings of the 2004 NASA/DoD Conference on Evolvable Hardware, IEEE Computer Press, June 2004. [8] H. Hanson, S.W. Keckler, Power and performance optimization: a case study with the Pentium M Processor, in: Proceedings of the Austin Center for Advanced Studies [IBM] Conference, February 2006. [9] M. Fleischmann, Longrun power management-dynamic power management for crusoe processors, Transmeta Corporation, January 17, 2001.

[10] A. Varma, B. Ganesh, M. Sen, S.R. Choudhury, L. Srinivasan, B. Jacob, A control-theoretic approach to dynamic voltage scheduling, in: Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, San Jose, CA, USA, 2003, pp. 255–266.
[11] Sun Fire B100x and B200x Server Blade Installation and Setup Guide, Estimating Power Consumption, http://docs.sun.com/, chapter 2.3.
[12] S. Mikuła, A. Kos, Using GNU Scientific Library for temperature computation in VLSI systems, in: Tools of Information Technology, Proceedings of the 1st Conference, Rzeszów, Poland, 15 September 2006, pp. 88–94.
[13] D. Grunwald, P. Levis, K.I. Farkas, C.B. Morrey III, M. Neufeld, Policies for dynamic clock scheduling, in: Proceedings of the 4th Symposium on Operating Systems Design and Implementation, October 2000, pp. 73–86.
[14] S. Ghiasi, D. Grunwald, Thermal management with asymmetric dual core designs, Technical Report CU-CS-965-03, Department of Computer Science, University of Colorado, Boulder, May, September, 2003.
[15] J. Pouwelse, K. Langendoen, H. Sips, Energy priority scheduling for variable voltage processors, in: International Symposium on Low-Power Electronics and Design (ISLPED), August 2001.
[16] W.L. Bircher, M. Valluri, J. Law, L.K. John, Runtime identification of microprocessor energy saving opportunities, in: Proceedings of the 2005 International Symposium on Low Power Electronics and Design, San Diego, CA, USA, 2005, pp. 275–280.
[17] Y. Hotta, M. Sato, H. Kimura, S. Matsuoka, T. Boku, D. Takahashi, Profile-based optimization of power performance by using dynamic voltage scaling on a PC cluster, in: 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 25–29 April 2006, p. 8.
[18] R.N. Mayo, P. Ranganathan, Energy consumption in mobile devices: why future systems need requirements-aware energy scale-down, HP Labs Technical Report, 12 August 2003.
[19] C.-H. Hsu, W.-C. Feng, Reducing overheating-induced failures via performance-aware CPU power management, in: The 6th International Conference on Linux Clusters: The HPC Revolution 2005, April 2005.
[20] T. Pering, T. Burd, R. Brodersen, The simulation and evaluation of dynamic voltage scaling algorithms, in: Proceedings of the 1998 International Symposium on Low Power Electronics and Design, Monterey, CA, United States, 1998, pp. 76–81.


[21] K. Govil, E. Chan, H. Wasserman, Comparing algorithms for dynamic speed-setting of a low-power CPU, in: Proceedings ACM Int'l Conference on Mobile Computing and Networking, November 1995, pp. 13–25.
[22] E. Rotem, A. Naveh, M. Moffie, A. Mendelson, Analysis of thermal monitor features of the Intel® Pentium® M Processor, in: Proceedings Temperature-Aware Computer Systems, June 20, 2004, Munich, Germany, June 19–23, 2004.
[23] W. Yuan, K. Nahrstedt, Practical voltage scaling for mobile multimedia devices, in: Proceedings of the 12th Annual ACM International Conference on Multimedia, New York, NY, USA, 2004, pp. 924–931.
[24] J. Sorber, N. Banerjee, M.D. Corner, S. Rollins, Turducken: hierarchical power management for mobile devices, in: Proceedings of the Third International Conference on Mobile Systems, Applications, and Services (MobiSys '05), Seattle, WA, June 2005.
[25] P. Pillai, K.G. Shin, Real-time dynamic voltage scaling for low-power embedded operating systems, in: Proceedings of the 18th ACM Symposium on Operating Systems Principles, Banff, Alberta, Canada, 2001, pp. 89–102, ISBN 1-58113-389-8.
[26] D. Biermann, E.G. Sirer, R. Manohar, A rate matching-based approach to dynamic voltage scaling, in: Proceedings of the First Watson Conference on the Interaction between Architecture, Circuits, and Compilers, October 2004.
[27] A. Sinha, A.P. Chandrakasan, Energy efficient real-time scheduling, in: Proceedings of the 2001 IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, 2001, pp. 458–463.
[28] J. Srinivasan, S.V. Adve, Predictive dynamic thermal management for multimedia applications, in: Proceedings of the 17th Annual ACM International Conference on Supercomputing (ICS'03), San Francisco, June 2003, pp. 109–120.
[29] A. Stoica, D. Keymeulen, R. Zebulum, Evolvable hardware solutions for extreme temperature electronics, Long Beach, CA, USA, pp. 93–97, ISBN 0-7695-1180-5.
[30] J. Pouwelse, K. Langendoen, H. Sips, Dynamic voltage scaling on a low-power microprocessor, in: 2nd International Symposium on Mobile Multimedia Systems & Applications (MMSA'2000), Delft, The Netherlands, November 2000, pp. 157–164.
[31] T. Krazit, Intel pledges 80 cores in five years, News Corporation website, http://news.com.com/2102-1006_3-6119618.html?tag=st.util.print.
[32] H. Peter Hofstee, Power efficient processor architecture and the CELL processor, in: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, IEEE Computer Society, Washington, DC, USA, 2005, pp. 258–262.
[33] A. Kos, G. De Mey, Thermal Modeling and Optimization of Power Microcircuits, Electrochemical Publications, 1997.
[34] B. Vermeersch, G. De Mey, Influence of substrate thickness on thermal impedance of microelectronic structures, Microelectron. Reliab. 47 (2007) 437–443.

[35] B. Vermeersch, G. De Mey, Influence of thermal contact resistance on thermal impedance of microelectronics structures, Microelectron. Reliab. 47 (2007) 1233–1238.
[36] H. Carslaw, J. Jaeger, Conduction of Heat in Solids, Clarendon Press, Oxford, 1959, pp. 353–386.
[37] D.J. Dean, Thermal Design of Electronic Circuit Boards and Packages, Electrochemical Publications, Ayr, Scotland, 1985, pp. 80–84.
[38] W. Stallings, Operating Systems Internals and Design Principles, fourth edition, Prentice Hall, ISBN 0-13-031999-6, 2004.
[39] J. Aas, Understanding the Linux 2.6.8.1 CPU Scheduler, Silicon Graphics, 2005, http://josh.trancesoftware.com/linux/.
[40] Intel® Xeon® Processor 5000 Sequence HPC Benchmarks: Finite Element Analysis, http://www.intel.com/performance/server/xeon/hpcapp3.htm.
[41] N. Magen, A. Kolodny, U. Weiser, N. Shamir, Interconnect-power dissipation in a microprocessor, in: Proceedings of the 2004 International Workshop on System Level Interconnect Prediction (Paris, France, February 14–15, 2004), SLIP '04, ACM Press, New York, 2004.

Sławomir Mikuła received his M.S. degree in electronics from the University of Science and Technology, Cracow, Poland, in 2004. He is currently a doctoral student at the University of Science and Technology in the Nano- and Microsystems Division. His research interests are in the thermal management of integrated circuits.

Gilbert De Mey received his engineering and Ph.D. degrees from the University of Ghent, Belgium, in 1970 and 1975, respectively. He is now a full-time professor at the same university. His major research area is heat transfer in microelectronics.

Andrzej Kos received his engineering and Ph.D. degrees from the AGH University of Science & Technology in 1978 and 1983, respectively, and his D.Sc. degree from the Technical University in Lodz in 1994. He is now a full-time professor at the AGH University. His major research area is thermal issues in microelectronics.