Parallelization of the stack
P A Shallow
This paper examines the problems encountered when allocating stack to processes within a multitasking environment. Although the concept of the proposed solution is applicable to other computer platforms, the paper uses the 80x86 to illustrate how the concept works. The paper looks at how the stack is currently implemented and describes how the implementation can easily be adapted to overcome stack allocation problems. The concept being proposed allocates stack to each procedure when it is called, as opposed to a complete process. The paper reviews the advantages and disadvantages of such a mechanism.
Keywords: stack allocation, threads, recursion

Although more and more computing platforms are supporting multitasking environments, their operating systems still have to provide a separate stack area to each process whenever it is created and executed. This is a limitation governed by the way processors have been designed to control the stack. Microprocessors are still using stack mechanisms designed for the single-user, single-process system, which have not been updated to a mechanism designed for multitasking purposes. As a result, operating systems are still having to execute what are basically coarse-grained autonomous sequential programs, and are unable to execute code that provides the true fine-grain parallelism required by threads*. Compiling code that utilizes a single-user, single-process stack mechanism relies on the stack having a contiguous address space. Consequently, when primary storage is allocated to more than one process (Figure 1)1,2, each process must be allocated its own stack area, otherwise the stack of one process will overwrite the data in another.

*The distinction between a thread and a process is very subtle. A thread is the smallest piece of code that can be dispatched and executed within a process, and is like a subroutine within a program. A process is like a program, consisting of at least one thread (itself) and possibly several, which can be executed in parallel. In terms of stack allocation, this paper considers a process and a thread to be the same.

Shallcode Limited, Danescourt, Pond Road, Woking, Surrey, GU22 0JT, UK
Paper received: 13 December 1994. Revised: 27 March 1995
THE PROBLEM

The problem with allocating stack space to a process is knowing how much stack the process requires. This is not easy to determine, as the stack required to execute a process is different for different processor types (680x0, 80x86, etc), and is probably different for the different members of a processor family (8086, 80286, 80386, etc). Ascertaining the amount of stack required when recursion takes place is also impossible. An underestimate of the stack requirement will result in stack overflow (i.e. memory is overwritten), whilst a gross overestimate will rapidly use up memory†. Once memory is full, processes need to be swapped in and out of secondary storage devices, which involves an additional high I/O overhead‡. In the case of the Unix 'thread' create system function fork3, the Unix kernel automatically allocates memory by duplicating the address space of the parent process.
Figure 1 Typical memory allocation to processes (the kernel at the base of memory, with processes A, B and C each occupying a contiguous region of stack, code and data up to the 20k, 63k, 98k and 116k boundaries)

†Take for example a simple C program which prints 'Hello World' and then calls itself using the 'system' function7. If the program is compiled as a 'tiny'2 model, then when it is executed on a 640k IBM PC it is only able to call itself eight times before the PC runs out of memory. If, however, the default stack and heap sizes are altered to be only 3k in total, then the program can call itself 38 times before the system runs out of memory. This is because the amount of system memory used by a 'tiny' model is 64k by default.

‡Take a fast SCSI disk with a 10 ms seek time and a data rate of 1 Mbyte s-1; the time taken to transfer 64k of data to or from disk will be no less than 10 ms + (64k/1M), which equals 72.5 ms.
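The experiment described in the first footnote can be written down in a few lines of C; the executable name passed to system() is an assumption, since the footnote does not name the program.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        printf("Hello World\n");
        /* Re-invoke this same program.  Each nested copy keeps its
           parent's 64k 'tiny' model image resident in memory, so the
           chain of calls stops once primary storage is exhausted. */
        system("hello.exe");          /* assumed executable name */
        return 0;
    }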
This is fine whilst there is sufficient space in the primary storage area to accommodate the new child process, but if there is not, the kernel has to start swapping processes in and out of secondary storage. With both of the thread creation functions in OS/2's Application Program Interface (API)4 and 3L's Parallel C5, the amount of stack needed by the threads has to be calculated by the user. OS/2's API DosCreateThread function requires a pointer to a region of memory large enough to accommodate the stack, and 3L's Parallel C thread_create function requires the stack size to be passed in as one of the call parameters.

The most accurate way a user can determine the stack requirement of a process is by looking at all of the possible logical programming paths that the process can take, calculating the amount of stack used for each of the paths, and taking the worst case. This is of course only practical for very small and simple programs. Alternatively, the user could evaluate the total stack requirement for a process from the stack requirements of the individual procedures that make up the process. This can be achieved by annotating the logical programming path in terms of procedural interdependencies (Figure 2), calculating the stack requirement for the different procedural hierarchy dependencies, and taking the worst case. The answer is not as accurate as the first method, but it will always be an overestimate. Unfortunately, in both cases the stack requirement for a process under recursive situations cannot be calculated, because the number of times a procedure is recursively called is unknown.
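As a rough illustration of the second method, the worst-case estimate can be computed by walking the procedural dependency graph of Figure 2 and taking, at each node, the procedure's own frame cost plus the most expensive callee. The structure and function names below are illustrative assumptions, not part of any particular compiler.

    #include <stddef.h>

    /* One node per procedure in the dependency graph of Figure 2. */
    struct proc {
        size_t frame_bytes;        /* stack used by the procedure itself */
        struct proc **callees;     /* procedures it may call             */
        size_t ncallees;
    };

    /* Worst-case stack for 'p' and everything it can call.  A recursive
       cycle in the graph would make this walk loop forever, which mirrors
       the paper's point that the requirement cannot be bounded under
       recursion. */
    static size_t worst_case_stack(const struct proc *p)
    {
        size_t deepest = 0;
        for (size_t i = 0; i < p->ncallees; i++) {
            size_t d = worst_case_stack(p->callees[i]);
            if (d > deepest)
                deepest = d;
        }
        return p->frame_bytes + deepest;
    }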
PROPOSED SOLUTION

Instead of attempting to provide a solution to the coarse-grained allocation of stack (CAST) to a process, this paper puts forward a much finer-grained solution of allocating stack to each procedure. The advantage of allocating stack to each procedure is that, for any particular procedure, the stack requirement is constant and can be calculated at compile time.

If we look at the stack usage for a high level language procedure (Figure 3), we can rapidly see that it can be broken down into a number of parts, each with a distinct usage. The first part is used to store local variables declared and temporarily used within the procedure. The second part is used for intra-routine 'housekeeping', such as preserving registers used by the procedure. The third part is used to pass parameters to and return parameters from called subroutines. The fourth part is used to save the caller's return address, and the last part provides space for inter-routine 'housekeeping' purposes, which typically involves preserving the 'environment' registers during a procedure call. The stack space equal to all five parts is, within this paper, called the stack frame.

Figure 3 Stack usage during a procedure call (caller stack frame: local variables, preserved registers, parameters passed to the subroutine, return address, environment registers and display, with the stack base pointer and stack pointer marked; the callee's frame follows as the stack grows towards decreasing memory addresses)

The stack usage for each part is determined by evaluating the number and types of variables declared in the procedure, the number of registers used by the routine, the number and types of variables passed to a called routine, the type of the call (NEAR or FAR)6, and finally adding the amount of stack used to preserve the environment, which is constant for all procedures. In the situation where a number of different subroutines are called, the worst case for the amount of stack required to carry out the calls must be used. The amount of stack space used to carry out a call is the space required to push the passed-in parameters, store the return address, and save the environment variables (Figure 3).
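To make the compile-time calculation concrete, a sketch of how the five parts might be summed is shown below; the structure, field names and the idea of a max_call_params field already holding the worst case over callees are illustrative assumptions, not the paper's own notation.

    #include <stddef.h>

    /* Sizes a compiler could gather for one procedure (illustrative). */
    struct frame_info {
        size_t locals;          /* part 1: local variables               */
        size_t saved_regs;      /* part 2: intra-routine housekeeping    */
        size_t max_call_params; /* part 3: worst case over all callees   */
        size_t return_addr;     /* part 4: e.g. 2 bytes NEAR, 4 bytes FAR */
        size_t environment;     /* part 5: inter-routine housekeeping,
                                   constant for every procedure          */
    };

    static size_t stack_frame_size(const struct frame_info *f)
    {
        /* The frame is the sum of the five parts; the call-related parts
           already hold the worst case over the subroutines called. */
        return f->locals + f->saved_regs + f->max_call_params
             + f->return_addr + f->environment;
    }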
IMPLEMENTATION

For the proposed fine-grained allocation of stack (FAST) model to work, the allocation of memory for a process's data, code and stack segments can no longer be in the same contiguous address space (Figure 1). Instead, the code and data segments are allocated together, and a common pool of memory is used to allocate stack (Figure 4). This pool is used to allocate all stack frames to all procedures, regardless of the parent process.
Figure 2 Procedure calling dependency (high level block structure call: main() calling enter(), print() and input())
Let us first look at the current mechanism by which a typical C function call is implemented on an 80x86 processor (Figure 5)2. Before a subroutine can be called, all of the parameters being passed to the subroutine must be pushed onto the stack. In calling the subroutine, the value of the program counter is pushed onto the stack as the return address, and on entry into the subroutine the stack base pointer (SBP) is saved onto the stack as the inter-routine 'housekeeping'. The SBP is then set to the stack pointer's value, and the stack pointer (SP) is adjusted by the amount of memory required by the local variables. Any registers used by the subroutine are then preserved by pushing them onto the stack (intra-routine 'housekeeping'). The stack is now 'free' to be used by the current subroutine to call its subroutines.

Figure 4 FAST memory allocation to processes (the code and data segments of processes A, B and C are allocated together, with a single common stack pool shared by all three)

procedure:
        push    bp              ; Preserve environment variables
        mov     bp, sp          ; and set up SBP (bp).
        sub     sp, LOCAL_SIZE  ; Allocate local variable space.
        push    ax              ; Preserve registers used by
        push    es              ; routine.
        ...
        mov     ax, [bp+04]     ; Access passed-in parameter.
        mov     [bp-02], ax     ; Access local variable.
        ...
        push    [bp+04]         ; Push parameters passed to
        push    [bp-02]         ; called subroutine onto stack.
        call    subroutine      ; Call subroutine.
        add     sp, 4           ; Re-adjust stack pointer.
        ...
        pop     es              ; Restore preserved registers.
        pop     ax              ;
        mov     sp, bp          ; Restore environment variables
        pop     bp              ; and reset SBP to caller's value.
        ret                     ; Return to caller.

Figure 5 Example of CAST implementation in 8086 assembler

Because the SBP value is never altered during the execution of a procedure, parameters being passed into the routine can always be accessed at a fixed positive offset relative to the SBP (Figure 6). These parameters are in fact always at the same offset relative to the SBP, regardless of when or from where the subroutine is called within the program. The local variables can be accessed at a fixed negative offset from the SBP, as they are all 'placed' at the top of the procedure's stack frame. On exit from a subroutine, the preserved registers are restored, the SP is reset to its entry value by setting it equal to the SBP, and the SBP is restored to its caller's value by popping it off the stack.

FAST involves the use of two stack pointers and two stack base pointers (Figure 7). One of the SPs is used as a stack pointer locally within the procedure's stack frame, and the other is used as the system's stack pointer. The SBPs provide access to the parameters passed in and to the local variables, and are not altered during the execution of the procedure. The parameters are accessed at a fixed positive offset from one of the SBPs (the parameter base pointer) and the local variables are accessed at a fixed negative offset from the other SBP (the variable base pointer). The way in which FAST calls a subroutine is very similar to the CAST model described above. As before, prior to a subroutine being called, the parameters being passed to it are pushed onto the stack. The return address is then saved onto the stack as the subroutine is called. The two stack base pointers are pushed onto the stack and the parameter base pointer (PBP) is then set to the value of the local stack pointer. A new stack frame is then acquired from the system's stack and the system's stack pointer (SSP) is
adjusted accordingly. The local stack pointer (LSP) and the variable base pointer (VBP) are now set to the top of the new stack frame. The LSP is then adjusted by the amount of space required by the local variables, and any registers that need preserving are pushed onto the stack. The LSP is now 'free' to be used by the subroutine to call its subroutines. On exit from a procedure, the registers temporarily preserved are restored, the stack frame is de-allocated, and the SSP is reset. The LSP is then reset back to its entry value (equal to the PBP), and both the base pointers are restored to their entry values (i.e. popped off the stack).

Figure 6 CAST stack frame allocation of memory to processes (the main(), display() and print() stack frames are adjacent, with parameters reached at positive offsets and local variables at negative offsets from the SBP)

So far the paper has only looked at the manipulation of the stack purely in terms of a sequential program. When a procedure is called within a sequential program, the memory allocated to the stack frame is always adjacent to that of the caller (Figure 6). However, this is not the case when multitasking processes in the FAST model. For example, if between starting a new process (MainA) (Figure 8) and calling its first subroutine (DisplayA) another process (MainB) was started, then, when the stack frame is allocated to the first process's subroutine, it will be adjacent to the second process's stack frame and not to the caller's stack frame. Consequently, two SBPs are required, as one can no longer act as a common stack base pointer to access both the parameters passed in and the local variables. The full stack frame requirement must also be allocated to each called subroutine, to ensure that the stack for one procedure does not overwrite that of another.

Adjusting the system's stack pointer by the size of a called procedure's stack frame, and back again when the subroutine exits, is not a sufficient algorithm to control the SSP. When a stack frame is de-allocated, the SSP does not necessarily return to its original value. Going back to the previous example, if the previous stack frame in memory (MainB) has been de-allocated before ProcessA's subroutine (DisplayA) has finished (Figure 9), then the SSP must reflect the double de-allocation when DisplayA finishes. Adjusting the SSP when MainB's stack frame was de-allocated would also be incorrect. Therefore, the allocation and de-allocation of the stack frames must be controlled by some form of linked list, where the SSP is used to gain fast access to available stack space. It is not the intention of this paper to describe, or to become involved in, the theory and methodology of how the stack allocation (STALLOC) and stack freeing (FREES) algorithms should be implemented, apart from saying that their functionality could be something loosely analogous to the concept of the 'malloc' and 'free' algorithms in C7.
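A minimal sketch of what linked-list STALLOC and FREES routines could look like is given below, written in C rather than assembler; the fixed frame size, pool capacity and function names are assumptions, since the paper deliberately leaves the actual algorithm open.

    #include <stddef.h>

    #define FRAME_WORDS 64            /* assumed frame size: 64 pointers  */
    #define POOL_FRAMES 256           /* assumed pool capacity            */

    /* Each frame is an array of pointer-sized words, so its first word can
       safely hold the free-list link.  'free_list' plays the role the
       paper gives to the SSP: fast access to the next available frame. */
    static void *pool[POOL_FRAMES][FRAME_WORDS];
    static void *free_list = NULL;

    static void pool_init(void)
    {
        for (int i = 0; i < POOL_FRAMES; i++) {
            pool[i][0] = free_list;      /* chain frame onto the free list */
            free_list = pool[i];
        }
    }

    static void *stalloc(void)           /* acquire one stack frame        */
    {
        void **frame = free_list;
        if (frame)
            free_list = frame[0];
        return frame;                     /* NULL when the pool is empty   */
    }

    static void frees(void *frame)        /* release a stack frame         */
    {
        ((void **)frame)[0] = free_list;
        free_list = frame;
    }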
STALLOC macro lsp, frame_size
        ...
        endm

FREES   macro lsp
        ...
        endm

procedure:
        push    bp              ; Preserve environment variables:
        push    bx              ; PBP (bp), VBP (bx) and LSP (sp).
        mov     bp, sp          ; Set up PBP (bp).
        STALLOC sp, SF_SIZE     ; Acquire stack frame of size SF_SIZE
        mov     bx, sp          ; and set up VBP (bx).
        sub     sp, LOCAL_SIZE  ; Allocate local variable space.
        push    ax              ; Preserve registers used by
        push    es              ; routine.
        ...
        mov     ax, [bp+04]     ; Access passed-in parameter.
        mov     [bx-02], ax     ; Access local variable.
        ...
        push    [bp+04]         ; Push parameters passed to
        push    [bx-02]         ; called subroutine onto stack.
        call    subroutine      ; Call subroutine.
        add     sp, 4           ; Re-adjust stack pointer.
        ...
        pop     es              ; Restore preserved registers.
        pop     ax              ;
        FREES   sp              ; De-allocate stack frame.
        mov     sp, bp          ; Restore environment variables
        pop     bx              ; and reset SBPs to caller's
        pop     bp              ; values.
        ret                     ; Return to caller.

Figure 7 Example of FAST implementation in 8086 assembler
Figure 8 FAST stack frame allocation of memory to processes (the mainA() stack frame is followed by the mainB() stack frame, so the displayA() stack frame allocated from the pool is not adjacent to its caller; the PBP, VBP and SSP are marked)

Implementing recursion

There are no limitations to implementing recursion in the FAST model, apart from running out of virtual address space. This is because stack is allocated to each procedure as it is called, which removes the need to know the number of times a procedure is recursively called.
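For illustration, consider a trivial recursive routine: under CAST the total stack it needs depends on the run-time value of n and so cannot be bounded at compile time, whereas under FAST each invocation simply draws one fixed-size frame from the pool and returns it on exit.

    /* Depth is unknown at compile time, so a CAST-style whole-process
       stack cannot be sized in advance; a FAST frame per call can. */
    unsigned long factorial(unsigned int n)
    {
        if (n <= 1)
            return 1;
        return n * factorial(n - 1);
    }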
Figure 9 Stack frame de-allocation (the mainB() stack frame between mainA() and displayA() has already been freed, so when displayA() finishes the SSP must reflect both de-allocations)

Executing threads
Creating and executing a thread within a FAST procedure no longer requires the user to calculate the stack size. When a thread create function is called, both it and the procedure that it spawns are allocated stack frames in the same way that they would be if they were subroutines called within a procedure. The only difference between calling a thread and calling a subroutine is that the part of the caller's stack used to carry out the call to the thread create function (Figure 3) cannot be reused until the thread itself has terminated. This means that the stack frame size of the calling procedure must be adjusted to include the amount of stack required to call the thread create function. For example, if a procedure calls a thread create function (Pa) and a subroutine (Pb), then the sum of the stack requirements to call Pa and Pb is used, and not the worst case that would apply if the procedure simply called subroutine Pa followed by subroutine Pb.
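The sizing rule in the last paragraph can be made concrete with a small sketch; call_cost() and its 2-byte constants are illustrative assumptions about a NEAR call with one saved environment register, not figures taken from the paper.

    #include <stddef.h>

    /* Assumed helper: stack needed to make one call, i.e. the pushed
       parameters plus a 2-byte NEAR return address and a 2-byte saved
       environment register, as in the CAST listing of Figure 5. */
    static size_t call_cost(size_t param_bytes)
    {
        return param_bytes + 2 + 2;
    }

    /* Extra frame space a procedure must reserve for its outgoing calls
       when Pa is a thread create function and Pb an ordinary subroutine. */
    static size_t outgoing_call_space(size_t pa_param_bytes,
                                      size_t pb_param_bytes)
    {
        size_t pa = call_cost(pa_param_bytes);  /* in use until the thread ends */
        size_t pb = call_cost(pb_param_bytes);  /* reusable once Pb returns     */

        /* The thread-create call cannot be reused while the thread runs,
           so its cost is added to, not maximized with, the ordinary call. */
        return pa + pb;
    }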
Nested procedures

In order to accommodate the scope rules of variables declared within nested procedures of high level block structured languages (Figure 10)8, the 80x86 processor has two assembler instructions, ENTER and LEAVE6. Each instruction carries out a number of 'bookkeeping' operations which simplify the assembler code generated for the entry and exit points of compiled procedures. One of ENTER's operations is to generate a 'display' of stack frame pointers6. These stack frame pointers (SFPs) point to the stack frames of calling procedures and are used to gain access to their local variables (Figure 11). By limiting the number of SFPs copied from the caller's 'display', the ENTER instruction ensures that a procedure can only access variables from higher lexical procedural levels and not from a procedure of the same lexical level. For example, when a recursive procedure calls itself, because it is at the same lexical level as itself, it is only able to 'see' the variables declared in the current call to that procedure and those variables declared in the procedure that originally called the recursive procedure in the first instance. It is unable to 'see' the variables declared in all of the intermediate calls to the recursive procedure.
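The display mechanism can be pictured with a small C model of ENTER's documented behaviour6; this is only a descriptive sketch under simplifying assumptions (16-bit registers and a word-addressed toy memory), not the processor's actual microcode.

    #include <stdint.h>

    /* Toy model of the 16-bit ENTER instruction: a procedure at lexical
       level 'level' copies level-1 frame pointers from its caller's
       display and then pushes its own, so it can reach the locals of
       every enclosing procedure but not those of a procedure at the
       same lexical level. */
    struct cpu16 {
        uint16_t sp, bp;
        uint16_t mem[32768];             /* 64 kbyte, word addressed */
    };

    static void push16(struct cpu16 *c, uint16_t v)
    {
        c->sp -= 2;
        c->mem[c->sp / 2] = v;
    }

    static void enter16(struct cpu16 *c, uint16_t locals, uint8_t level)
    {
        push16(c, c->bp);                 /* save caller's frame pointer     */
        uint16_t frame = c->sp;           /* base of the new frame           */

        level %= 32;                      /* nesting level is taken mod 32   */
        for (unsigned i = 1; i < level; i++) {
            c->bp -= 2;                   /* walk down the caller's display  */
            push16(c, c->mem[c->bp / 2]); /* copy one outer frame pointer    */
        }
        if (level > 0)
            push16(c, frame);             /* this frame's own display entry  */

        c->bp = frame;                    /* new frame pointer               */
        c->sp -= locals;                  /* space for the local variables   */
    }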
{{{  procedureA declaration
PROC procedureA (INT apa,        -- variables seen by
                 INT apb)        -- procedures B & C
  INT ala :
  {{{  procedureB declaration
  PROC procedureB (INT bpa)      -- variables seen by
    INT bla :                    -- procedure C
    {{{  procedureC declaration
    PROC procedureC (INT cpa)    -- procedureC variables
      INT cla :                  -- not seen by A or B
      SEQ
        ...  procedureC code
    :
    }}}
    -- procedureB code
    PAR
      SEQ
        procedureC (bpa)
        ...  code
  :
  }}}
  -- procedureA variable not seen by B or C
  INT ii :
  PAR ii = 0 FOR apa
    SEQ
      procedureB (apb)
      ...  code
:
}}}

Figure 10 Example of nested procedures in OCCAM2

To generate a 'display' of stack frame pointers in the FAST model is not difficult, as it can be implemented in the same way that ENTER generates a display of SFPs. This operation would be carried out between the new stack frame being allocated and the LSP being adjusted for the local variables (see above).

Figure 11 Accessing variables of calling procedures using a 'display' of stack pointers (procedureA at lexical level 1 and procedureC at lexical level 3, each with a pushed SBP, a display of SFPs and its local variables)

Accessing shared variables

As new threads are no longer allocated their own separate stack areas, access can be gained to variables declared within the process that spawned the thread, in the same way that the ENTER instruction can access variables declared in calling procedures of a higher lexical level. This means that the fork function no longer needs to duplicate the address space of the parent process, shmget no longer has to create regions of memory in order to share variables between processes, and the setting up and implementation of semaphores between processes4 is simplified.

Interfacing to high and low level routines

The FAST model and the CAST model are not really compatible. Whilst a CAST procedure can call a FAST procedure, a FAST procedure cannot call a CAST procedure without stack overflow errors occurring, as the CAST model uses the stack of the caller. The only way to overcome this problem would be to increase the size of the FAST stack frame by the amount of stack required by the procedure being called and by any subroutines that it calls in its turn. This amount of stack has to be calculated, which is the very situation the paper is trying to overcome! DOS hardware and software interrupts also rely on the stack of the caller. This means that all FAST procedures must also allow for the stack needs of all possible hardware interrupts, as they can occur at any time. Procedures must also cater for the stack requirements of any system interrupt called, an existing requirement of the OS/2 system if a thread calls any API services4. Ideally, from the BIOS upwards, all interrupts, system calls and procedure calls should therefore be written in the FAST style proposed, and to enforce this, ENTER and LEAVE should be an integral part of the CALL and RET instructions.

FURTHER WORK

So far the concept of FAST has been tried and tested on an IBM PC using 80x86 assembler. The next step is to evaluate how FAST could be implemented on other microprocessors, such as the 68xxx CISC processor and the Txxx transputer and PowerPC RISC processors.
ADVANTAGES AND DISADVANTAGES

Advantages

The advantages of using FAST are:
• The compiler only has to ascertain the stack used by each procedure, and not by the whole thread or process.
• The user no longer has to provide a stack or stack size when a thread or process is spawned within a process.
• It allows recursion to take place.
• It allows variables to be shared between processes without the need to duplicate the parent's address space or create memory regions to house shared variables.
• The system stack no longer has to be in a contiguous region of memory.
• System memory is more efficiently used, because memory is no longer wasted by over-allocating stack to processes. Consequently more processes can be maintained in system memory before the need arises to start swapping processes in and out of secondary storage, with its consequent overheads in I/O time.
• No run-time stack overflow monitoring is required.
• Large local variables can be declared within procedures without having to dynamically allocate them using malloc and free.
• There is no need to carry out stack switching when procedures call more privileged levels6.

Disadvantages

The disadvantages of such a system are:
• Allocating stack to each procedure when it is called adds an additional overhead in both execution time and CPU usage.
• Stacks cannot be included with the code and data segments when processes are swapped in and out of secondary storage.
• In order to obtain FAST's full potential, operating systems need to be rewritten to suit the new methodology. Integrating the proposed solution into existing operating systems is not easy: it does not readily interface with existing high level procedures and low level BIOS calls. Compilers also need to be rewritten to generate the appropriate assembler stack code.

CONCLUSION

Fine grain allocation of stack (FAST) to procedures overcomes the problems of determining the amount of stack required by threads and processes within a multitasking environment. FAST is a very simple and easy adaptation of the way the stack is currently implemented by the 80x86 processor, and requires only a nominal processing overhead in its implementation. It provides a more efficient use of stack, which in turn raises the threshold before processes have to start being swapped in and out of primary storage. By the nature of its implementation, FAST provides an inherently secure stack system, making continuous monitoring for stack overflow redundant; it allows recursion to take place and provides a means of accessing parental process variables. FAST is a small extension to the existing 80x86 ENTER and LEAVE microcoded instructions and, although quite possible to implement in assembler, it would be better implemented in microcode, with the addition of two extra registers.

REFERENCES

1 Keller, L S Operating Systems, Prentice Hall, Englewood Cliffs, NJ (1988)
2 Borland C++ Programmer's Guide V2.0, Borland International Inc (1991)
3 Mikes, S UNIX for MS-DOS Programmers, Addison-Wesley, Reading, MA (1989)
4 Schildt, H OS/2 Programming: An Introduction, Osborne McGraw-Hill, New York (1988)
5 Parallel C User's Guide, v2.0, 3L Ltd (1988)
6 386 SX Microprocessor Programmer's Reference Manual, Intel Corporation, Osborne McGraw-Hill, New York (1989)
7 Kernighan, B W and Ritchie, D M The C Programming Language (2nd edition), Prentice Hall, Englewood Cliffs, NJ (1988)
8 OCCAM2 Reference Manual, Inmos Ltd, Prentice Hall, Englewood Cliffs, NJ (1988)
Piers A Shallow graduated in civil engineering in 1979 at Surrey University, UK. In 1983 he moved into computing and obtained an MSc in telematics in 1987 at Surrey University. Since then, he has worked for Marconi Maritime Applied Research Laboratory, looking at the implementation of parallel systems (transputers); at Southampton University, as a Research Fellow on the GENESIS supernode project; and is currently working as a consultant.