A simulator for teaching the internal workings of an assembler
A SIMULATOR FOR TEACHING THE INTERNAL WORKINGS OF AN ASSEMBLER
M. H. WILLIAMS*, G. R. POTE and J. P. BROOKS
Rhodes University, Grahamstown, South Africa
Abstract-In many courses on systems programming students are taught about assemblers. However, due to constraints on the time available they seldom get the opportunity to construct one. This paper describes an assembler simulator which can be used as a tool to assist in teaching about assemblers. Using it students can design and simulate the running of their own assemblers in a very short time. This approach not only stimulates the student's interest but also gives him a deeper insight into assemblers.
INTRODUCTION
Many courses on systems programming or compiler construction devote some attention to the subject of assemblers and assembly languages [1-3]. However, since assemblers are generally fairly large programs and since the amount of time available is usually limited, there is seldom sufficient time for the student to obtain any practical experience in the construction of an assembler. For this reason courses sometimes concentrate on what an assembler should do rather than providing an understanding of how it does it. We have attempted to solve this problem by creating an assembler simulator and using it in such a course.

An assembler simulator is a program which simulates the actions of an assembler. In order to use it, the user must decide on the details of the particular assembly language which he wants to simulate (the assembly language instruction and pseudo-op formats, the machine code produced, the actions to be performed, etc.). These details are then encoded in a suitable notation and are referred to as the assembler specification. The notation used for the specification is a special-purpose high level language. Using this notation one can describe the operation of a wide range of different assemblers with a variety of different properties.

Once an assembler specification for a particular assembler has been drawn up, it is read in by the simulator, whereupon the latter will adopt the characteristics of this assembler and go through the motions of the assembler, accepting as input source lines of a program written in the particular assembly language, extracting the information from these, assembling machine code and printing the listings expected as assembler output. At any point in the assembly language source code one may include directives to print out the contents of the assembler's tables, variables or the machine code produced, or to switch the tracing mechanism on or off. The whole process is illustrated diagrammatically in Fig. 1.

Thus the simulator serves a dual purpose in teaching students. In the first place it provides a model which enables one to "see" the inner workings of an assembler; in the second place, in formulating the assembler specification one obtains an insight into the way in which different facilities may be implemented.

Besides its use as an educational device the simulator can also be used as a design tool in designing assemblers, especially for microprocessors. The mechanism behind the desired assembler can be specified and its operation debugged and tested on the simulator in a relatively short space of time, before proceeding with an actual implementation.

In the following sections the notation used for the assembler specification is described briefly and then applied to an example.

THE ASSEMBLER SPECIFICATION
The assembler specification describes the operation of the assembler to be simulated. It is divided into two parts:
(a) The Declaration Section: This consists of a sequence of declarations describing the formats of tables used by the assembler, the variables used by the assembler, etc.
(b) The Assembler Description: This is an algorithm in a high level notation which specifies the operation of the assembler.

Data types and identifiers
There are two basic types in this system: numbers and strings. In the declaration section one may declare variables of either type: a numeric variable is denoted by a numeric identifier (which consists of a letter followed by zero or more letters and digits), and a string variable is denoted by a string identifier (which is similar except that it is preceded by a "$" symbol). In other declarations the format of the input record is described in terms of named fields, the format of a machine code instruction is described in terms of the fields which comprise the machine code word, and the entries of a table are described in terms of the fields of which they are composed. These named fields may be either string or numeric and the same naming convention applies.

The Declaration Section
The declaration section is divided into eight subsections, each preceded by an asterisk followed by a two-letter qualifier code, viz. *CO, *IF, *OF, *MF, *TF, *EF, *EM, *VA.
Constant section
The constant section is used to set up certain constants for the simulator. The most important of these is the word length of the machine for which code is being generated. This is set up by means of a simple assignment statement. For example

    *CO
    WORDLENGTH = 16

will set up a word length of 16 for the machine code produced.
Input formats
The input format specifications describe the different forms which input records read by the assembler can have. Each specification indicates how a line of source code may be broken into fields. There can be any number of different input format specifications, each consisting of a sequence of field specifications separated by commas and terminated by a semi-colon. Each field specification consists of one of the following:

A fixed length field. This is specified by writing the name of the field followed by its length in parentheses, e.g. $LABEL(9).
A variable length field. This is written as a string identifier followed by a delimiter character in quotation marks, e.g. $LAB".". Under certain circumstances the delimiter character may be omitted from the assembly language source which is to be assembled.
A constant field. This is indicated by a numeric identifier followed by a string constant, e.g. ASTERISK"*". If the string is recognised, the numeric variable is set to 1, otherwise it is set to zero.
A dummy field, written as (0). This indicates the presence of zero or more spaces.

For example

    *IF
    $LABEL(9), $OP(6), $OPERAND(60);
    $LAB" ", (0), $POP" ", (0), $ARGUMENT" ";
    ASTERISK"*", $ADDRESS" ";
defines three input formats: the first consists of three fixed length fields $LABEL, $OP and $OPERAND of length 9, 6 and 60 characters respectively; the second has three variable length fields $LAB, $POP and $ARGUMENT, each terminated by one or more spaces; the third indicates that if the line begins with an asterisk the remainder of the line is to be taken as a variable length field $ADDRESS, terminated by a space.

Output formats
The output format specifications indicate how a line of output is to be assembled. Again there can be any number of different output format specifications, each consisting of a sequence of field specifications separated by commas and terminated by a semi-colon. Field specifications are the same as fixed length input field specifications. For example

    *OF
    $SPACE(8), $CARD(80);

Machine formats
The machine format specifications are similar to the output format specifications except that they are used to describe the formats of machine code instructions. There can be any number of machine format specifications, each consisting of a sequence of fixed length numeric fields. For example

    *MF
    F(4), N(12);
    NUM(16);

Table formats
In this section the user defines the tables (Symbol Table, Machine Opcode Table, etc.) which his assembler will use. Each table format specification has the form

    Table name (max entries) = field1, field2, ..., fieldn;

followed by zero or more entries of the form

    @dat1, dat2, ..., datn;

where max entries is an integer specifying the maximum number of entries in the table for which space is to be reserved; field1, ..., fieldn are string or numeric field names referring to the different fields within each entry of the table; and @dat1, dat2, ..., datn; initializes an entry in the table. In the case of a string field name, the name must be followed by an integer in square brackets, representing the maximum length which a string in this field may have. This is necessary in order for the system to print the contents of tables neatly. Each dati is a string constant (enclosed in quotation marks), a binary integer (preceded by #) or a decimal integer. For example

    *TF
    ST(20) = $SYMBOL[8], $TYPE[1], ADDRESS, VAL;
    MOT(3) = $OPCODE[6], MCOP, BRN;
    @"LOAD", #0000, 0;
    @"ADD", #0001, 0;
    @"JMP", #0010, 1;

defines a table ST with a maximum of 20 entries, each of which consists of four fields: a string field $SYMBOL whose maximum size is 8 characters, a string field $TYPE whose maximum size is 1 character and two numeric fields ADDRESS and VAL; and a table MOT consisting of 3 entries (each with three fields: a string field $OPCODE and two numeric fields MCOP and BRN). The entries in MOT are initialized with the data given.

Expression formats
Since an operand field may contain an expression which may have one of a number of different possible forms, this set of declarations defines the set of possible expression forms which might occur. Each expression format is written in a form similar to a BNF production, with a string variable followed by an equal sign followed by one or more alternatives separated by exclamation marks and
terminated by a semi-colon. Each alternative consists of a sequence of string constants, string variables or pseudo-terminals (e.g. $LETTER, $DIGIT, $LAMBDA) separated by commas. If the left hand variable is separated from the equal sign by a slash symbol and an integer i, this means that the token i will be returned whenever this symbol is recognized. For example

    *EF
    $ADDR
    $EXP
    $TERM
    $ID/1
    $RESTID
    $INT/2
    $RESTINT
    $PLUS/3
Thus an expression of form $ADDR consists of one or more terms separated by the symbol "+", where each term is either an identifier or an integer. If an expression is recognized, the tokens associated with the nonterminals will be used by the assembler in analyzing the expression and generating code.

Error messages
If the user wants his assembler to print error messages at certain points, he specifies these by number in the Assembler Description. This section relates the numbers to actual error messages. Each entry has the form

    Error number, error message text;

For example

    *EM
    1, "INVALID OPCODE";
    2, "INVALID OPERAND";

Variables
The variables required in the Assembler Description are declared by writing them out as a list separated by semicolons. For example

    *VA
    LC; FLAG; ENDFLAG;

Comments
Any line which has asterisks in the first two character positions is treated as a comment.
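The four kinds of input field described under Input formats can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the function name match_format, the tuple encoding of a specification, the trimming of trailing blanks in fixed length fields and the treatment of an unrecognised constant field as a failed match are all illustrative choices, not details taken from the simulator.

```python
# Hypothetical sketch of matching one source line against one input format.
# The tuple encoding and all names here are assumptions for illustration.

def match_format(line, spec):
    """Return a dict of field values, or None if a constant field is absent.

    spec is a list of tuples:
      ("fixed", name, length)   fixed length field, e.g. $LABEL(9)
      ("var", name, delimiter)  variable length field, e.g. $LAB" "
      ("const", name, text)     constant field, e.g. ASTERISK"*"
      ("skip",)                 dummy field (0): zero or more spaces
    """
    fields = {}
    pos = 0
    for item in spec:
        kind = item[0]
        if kind == "fixed":
            _, name, length = item
            # Assumption: trailing blanks of a fixed field are dropped.
            fields[name] = line[pos:pos + length].rstrip()
            pos += length
        elif kind == "var":
            _, name, delim = item
            end = line.find(delim, pos)
            if end == -1:          # delimiter omitted: take the rest of the line
                end = len(line)
            fields[name] = line[pos:end]
            pos = min(end + len(delim), len(line))
        elif kind == "const":
            _, name, text = item
            if not line.startswith(text, pos):
                return None        # assumption: this format does not apply
            fields[name] = 1
            pos += len(text)
        elif kind == "skip":
            while pos < len(line) and line[pos] == " ":
                pos += 1
    return fields


# The three formats of the *IF example, in the assumed tuple encoding.
FIXED = [("fixed", "$LABEL", 9), ("fixed", "$OP", 6), ("fixed", "$OPERAND", 60)]
FREE = [("var", "$LAB", " "), ("skip",), ("var", "$POP", " "), ("skip",),
        ("var", "$ARGUMENT", " ")]
STAR = [("const", "ASTERISK", "*"), ("var", "$ADDRESS", " ")]

fixed_line = "LOOP".ljust(9) + "LOAD".ljust(6) + "X"
print(match_format(fixed_line, FIXED))
print(match_format("LOOP  DC   15", FREE))
print(match_format("*100 ", STAR))
```

Keeping each specification as plain data mirrors the way the declaration section separates the description of a format from the code that interprets it.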
The Assembler Description
This section (preceded by the characters *AD) contains a description of the action of the assembler written as a program in a high level notation. It consists of a single compound statement (i.e. a sequence of labelled or unlabelled statements enclosed between BEGIN and END) followed by a full stop. A label has the form $6
identifier)
(c) A simple conditional statement:
    IF <condition> THEN <statement sequence> ELSE <statement sequence> FI
or
    IF <condition> THEN <statement sequence> FI
(d) A multi-way branch:
    SWITCHON <var> INTO <case sequence> ENDCASE
where <case sequence> is two or more cases having the format
    CASE <const>: <statement sequence> ESAC
(e) A repeat loop:
    REPEAT <statement sequence> UNTIL <condition>
(f) A while loop:
    WHILE <condition> DO <statement>
(g) A foreach loop:
    FOREACH <table name> DO <statement>
This loop steps through each entry of the given table and performs the statement specified.
(h) A compound statement:
    BEGIN <statement sequence> END
(i) A call statement:
    CALL <label identifier>
(j) A return statement:
    RETURN
(k) A stop statement:
    STOP
(l) An error statement:
    ERROR(n)
causes the error message text corresponding to the error number n under *EM to be printed.
(m) A print statement:
    PRINT
causes whatever is in the output buffer of the simulated assembler to be printed.
(n) A newcard statement:
    NEWCARD
causes the assembler to read in a new line of input and, provided it is not a simulator directive, it scans each input format specification in turn and matches it against the input line. As it matches each field, the contents of the input line are stored in the field. If it encounters a mismatch, all remaining fields of that particular input format specification are set to null.
(o) A table insert statement:
    INSERTIN <table name> ENTRY (<list>)
This creates a new entry in the table specified and assigns the data items in the list to the fields of the table entry. Each item in the list may be a string or numeric constant, a variable or an input field. The Current Entry Pointer for the table is set to point to this new entry.
(p) A table unstack statement:
    UNSTACK <table name>
removes the last entry of the table specified and sets the Current Entry Pointer to the entry preceding the last one.
(q) A table position statement:
    POSITION <table name> TO n
where n is a numeric variable or constant, moves the Current Entry Pointer to the nth entry in the table.
(r) A next entry statement:
    NEXTENTRY <table name>
moves the Current Entry Pointer to the next entry in the table.
(s) A monitor statement:
    MONITOR(<list>)
where list is a list of names of tables, fields or variables, causes the contents of the specified tables, fields and variables to be printed.
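The table statements above all act on a per-table Current Entry Pointer. The Python sketch below models that behaviour as one possible reading of the descriptions; the class name Table, its method names and the choice of 1-based entry numbers (with 0 meaning that no entry is current) are assumptions made for illustration.

```python
# Hypothetical model of a simulator table with a Current Entry Pointer.
# Entry numbers are assumed to be 1-based; 0 means "no current entry".

class Table:
    def __init__(self, name, fields, max_entries):
        self.name = name
        self.fields = list(fields)
        self.max_entries = max_entries
        self.entries = []
        self.current = 0

    def insert(self, *values):
        """INSERTIN <table name> ENTRY (<list>): append and point at the new entry."""
        if len(self.entries) >= self.max_entries:
            raise OverflowError(self.name + " is full")
        if len(values) != len(self.fields):
            raise ValueError("wrong number of data items")
        self.entries.append(dict(zip(self.fields, values)))
        self.current = len(self.entries)

    def unstack(self):
        """UNSTACK <table name>: drop the last entry, point at the one before it."""
        self.entries.pop()
        self.current = len(self.entries)

    def position(self, n):
        """POSITION <table name> TO n: point at the nth entry."""
        if not 1 <= n <= len(self.entries):
            raise IndexError("no entry " + str(n) + " in " + self.name)
        self.current = n

    def next_entry(self):
        """NEXTENTRY <table name>: advance the pointer by one entry."""
        self.current += 1

    def get(self, field):
        """Read a field of the current entry."""
        return self.entries[self.current - 1][field]

    def set(self, field, value):
        """Assign to a field of the current entry."""
        self.entries[self.current - 1][field] = value


st = Table("ST", ["$SYMBOL", "$TYPE", "ADDRESS", "VAL"], 20)
st.insert("A", "L", 0, 0)
st.insert("B", "U", 1, 0)
st.position(1)
print(st.get("$SYMBOL"))
```

In this reading, a field name used in the Assembler Description always refers to the entry under the pointer, which is why every statement that adds, removes or selects an entry also has to define where the pointer ends up.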
In addition to these statements there are a number of special functions available for use in expressions. These include:

(1) FIND(<sv or if>, <table name>, n): this function searches the table concerned for an entry whose nth component matches the string in the string variable or input field (<sv or if>). If such an entry is found, the Current Entry Pointer is set to point to it and the function returns the value true; if the search fails, the Current Entry Pointer is left unchanged and the function returns the value false.
(2) FORMAT(<sv or if>, <expression var>): this function checks the string contained in the string variable or input field to see whether it matches the format defined by the expression variable (one of the variables defined under *EF). If the match succeeds, the function returns the value true, otherwise the function returns false. On a successful match two lists are set up: a token list (which contains a sequence of tokens corresponding to the nonterminals recognized) and a pointer list (which contains a sequence of pointers pointing to the places in the string where the nonterminals begin).

Once a successful match has been obtained, the following three functions may be used:
(a) NEXTT fetches the next token from the token list and returns the value of this token;
(b) $NEXTS yields the string corresponding to the current token;
(c) NEXTV(b) treats the current token as an integer with base b, converts it to a binary number and returns the value of this number.

For example,
if one had the string "* + 10" in the field $OPERAND and one encounters the function call

    FORMAT($OPERAND, $ADDR)

the function will return the value true and will set up the token list as shown in Fig. 2. Thus the first call to NEXTT will yield the value 5; a reference to $NEXTS at this stage will yield the symbol "*" while a reference to NEXTV is meaningless. The next call to NEXTT will yield 2 ($NEXTS will yield "+", a call to NEXTV is meaningless). The next call to NEXTT will yield the value 4 ($NEXTS will yield the string "10", NEXTV(10) will yield the value 10).

Fig. 2. The token list and pointer list set up for "* + 10".
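The interplay of FORMAT, NEXTT, $NEXTS and NEXTV can be sketched in Python. Because the expression format used in the example above is not fully legible, the sketch uses the simpler $ADDNT format of Fig. 4 instead (token 1 for a symbol made of letters, 2 for a decimal integer, 3 for an octal integer preceded by "&"). The class Matcher, its use of regular expressions and the decision to drop the "&" before conversion are assumptions, not the simulator's own mechanism.

```python
import re

# Token numbers follow the $ADDNT expression format of Fig. 4.
# Regular expressions stand in for the BNF-like productions (an assumption).
_ADDNT = [
    (1, re.compile(r"[A-Za-z]+")),   # $SYMNT/1: letters only
    (2, re.compile(r"[0-9]+")),      # $DECNT/2: decimal integer
    (3, re.compile(r"&[0-7]+")),     # $OCTNT/3: "&" followed by octal digits
]


class Matcher:
    """Hypothetical model of FORMAT and the three token functions."""

    def __init__(self):
        self.text = ""
        self.tokens = []     # token list
        self.pointers = []   # pointer list: where each nonterminal begins
        self.cur = -1        # index of the current token

    def format(self, text):
        """Model FORMAT(text, $ADDNT): build both lists on a full match."""
        for token, pattern in _ADDNT:
            if pattern.fullmatch(text):
                self.text = text
                self.tokens = [token]
                self.pointers = [0]
                self.cur = -1
                return True
        return False

    def nextt(self):
        """Model NEXTT: advance to the next token and return its number."""
        self.cur += 1
        return self.tokens[self.cur]

    def nexts(self):
        """Model $NEXTS: the substring belonging to the current token."""
        start = self.pointers[self.cur]
        if self.cur + 1 < len(self.pointers):
            end = self.pointers[self.cur + 1]
        else:
            end = len(self.text)
        return self.text[start:end]

    def nextv(self, base):
        """Model NEXTV(b): current token as an integer in base b."""
        return int(self.nexts().lstrip("&"), base)


m = Matcher()
if m.format("&52") and m.nextt() == 3:
    print(m.nextv(8))
```

The calling pattern is the same as in the specification of Fig. 4: test FORMAT first, then branch on the value of NEXTT to decide whether $NEXTS or NEXTV is the meaningful way to read the token.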
Other numeric functions include:
VALUE(<string exp>): treats a string as an ASCII representation of a decimal number and converts it to the equivalent binary representation of that number;
INDEX(<table name>): yields the index of the current entry in the table.

Fig. 3. (a) Machine code format; (b) Assembly language format (column positions 1, 9, 10, 15, 16 and 75 are marked).

EXECUTION OF THE SIMULATED ASSEMBLER
As the assembler specification is read in and verified, tables are set up for the second part of the simulator. The second part acts as an interpreter, using the tables to simulate the effect of the assembler. The interpreter steps through the tables and, whenever it encounters the statement NEWCARD, a new line of source is read. However, if this line contains a control character in the first column, the line is taken to be a directive to the interpreter and is read and acted upon. The possible directives are:

    #MONITOR(<list>)  where list is a list of tables, variables, and/or "MC": this prints the contents of the specified tables/variables/machine code area
    #TRACE            causes tracing to be switched on
    #UNTRACE          causes tracing to be switched off
    #DUMP             gives a trace of the last 100 steps executed
    #FINISH           terminates the simulator run.
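A minimal Python sketch of how an interpreter might act on these directives is given here. The class name Monitor, its attributes and the text it produces are hypothetical; the only details taken from the description are the "#" in the first column, the five directive names and the 100-step history kept for #DUMP.

```python
from collections import deque


class Monitor:
    """Hypothetical handler for directive lines met by NEWCARD."""

    def __init__(self, state):
        self.state = state                  # name -> value (tables, variables, "MC")
        self.tracing = False
        self.history = deque(maxlen=100)    # only the last 100 steps are kept
        self.finished = False
        self.output = []

    def step(self, description):
        """Record one interpreted step; echo it if tracing is switched on."""
        self.history.append(description)
        if self.tracing:
            self.output.append("TRACE " + description)

    def directive(self, line):
        """Act on a directive line; return False for ordinary source lines."""
        if not line.startswith("#"):
            return False
        text = line[1:].strip()
        if text.startswith("MONITOR(") and text.endswith(")"):
            for name in text[len("MONITOR("):-1].split(","):
                name = name.strip()
                self.output.append(f"{name} = {self.state.get(name)!r}")
        elif text == "TRACE":
            self.tracing = True
        elif text == "UNTRACE":
            self.tracing = False
        elif text == "DUMP":
            self.output.extend(self.history)
        elif text == "FINISH":
            self.finished = True
        else:
            raise ValueError("unknown directive: " + line)
        return True


mon = Monitor({"LC": 3})
mon.directive("#TRACE")
mon.step("LC := LC + 1")
mon.directive("#MONITOR(LC)")
print(mon.output)
```

A bounded deque is one simple way to keep the most recent 100 steps without letting the history grow for the whole run.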
Thus the user is not only able to simulate the given assembler but is also able to obtain a picture of what is happening inside the assembler at any particular instant.

A SIMPLE EXAMPLE
Consider a simple machine with a word length of 16 bits and a single accumulator which has a machine code format as shown in Fig. 3(a). The assembly language for this machine consists of instructions and pseudo-ops, each of which is written according to the format given in Fig. 3(b). The OP/POP field contains a machine op mnemonic or a pseudo-op. The OPERAND/ARGUMENT field contains a symbol, a decimal integer, an octal integer or nothing. The machine opcodes included in the specification are: LOAD, ADD, STORE, JUMP, HALT and LDI. The pseudo-ops include:

    START  announces the start of the program
    END    marks the end of the program
    DC     define constant (OPERAND = value of constant)
    DS     define storage (OPERAND = length of storage block)
    COM    comment.
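The specification in Fig. 4 assembles this language in one pass, so a symbol may be used before it is defined; pending references are linked into a back-chain and filled in when the symbol is finally defined. The Python sketch below illustrates that idea under several assumptions: a 4-bit opcode placed above a 12-bit address field, a single "defined" symbol type instead of the separate label and variable types of Fig. 4, and (label, op, operand) triples in place of parsed source lines. The opcode values are those listed in the OT table of Fig. 4; the function assemble and its data layout are hypothetical.

```python
# Opcode values as listed in the OT table of Fig. 4.
OPCODES = {"LOAD": 0b0001, "ADD": 0b0010, "STORE": 0b0011,
           "JUMP": 0b0100, "HALT": 0b0101, "LDI": 0b0110}


def assemble(lines):
    """One-pass assembly of (label, op, operand) triples with back-chaining.

    Assumes a 16-bit word: 4-bit opcode above a 12-bit address field.
    Returns (memory, undefined symbols).
    """
    memory = {}       # location -> 16-bit word
    symbols = {}      # name -> ["L" (defined) or "U" (undefined), value]
    chain = [None]    # back-chain entries (location, link); index 0 unused
    lc = 0            # location counter

    def define(name):
        entry = symbols.get(name)
        if entry is None:
            symbols[name] = ["L", lc]
        elif entry[0] == "U":
            link = entry[1]           # head of the chain of pending references
            while link != 0:
                loc, link = chain[link]
                memory[loc] = (memory[loc] & 0xF000) | (lc & 0x0FFF)
            symbols[name] = ["L", lc]
        else:
            raise ValueError("symbol defined more than once: " + name)

    for label, op, operand in lines:
        if label:
            define(label)
        if op == "DC":                # define constant
            memory[lc] = int(operand) & 0xFFFF
            lc += 1
            continue
        if op == "DS":                # reserve a block of storage
            lc += int(operand)
            continue
        word = OPCODES[op] << 12
        if operand.isalpha():         # symbolic operand
            entry = symbols.get(operand)
            if entry is None:         # first forward reference: start a chain
                chain.append((lc, 0))
                symbols[operand] = ["U", len(chain) - 1]
            elif entry[0] == "U":     # later forward reference: extend the chain
                chain.append((lc, entry[1]))
                entry[1] = len(chain) - 1
            else:
                word |= entry[1] & 0x0FFF
        elif operand.startswith("&"): # octal integer
            word |= int(operand[1:], 8) & 0x0FFF
        elif operand:                 # decimal integer
            word |= int(operand) & 0x0FFF
        memory[lc] = word
        lc += 1

    undefined = sorted(n for n, (kind, _) in symbols.items() if kind == "U")
    return memory, undefined


program = [("", "LOAD", "A"), ("", "ADD", "B"), ("", "STORE", "C"),
           ("", "HALT", ""), ("A", "DC", "15"), ("B", "DC", "20"),
           ("C", "DS", "1")]
memory, undefined = assemble(program)
print([hex(memory[loc]) for loc in sorted(memory)], undefined)
```

Storing the chain in a separate list, with link 0 marking the end, follows the shape of the BA table in Fig. 4; an alternative design threads the chain through the address fields of the unresolved instructions themselves.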
An assembler specification for a one-pass assembler for this language is given in Fig. 4, together with a typical program in the assembly language to be used as data for the simulated assembler.

    ST(10)=$SYMBOL[9],$STYPE[2],ADDR;
    BA(10) = P1,LINK;
    OT(11)=$MNEM[5],OPTYPE,BIN;
    @"START",1,0;
    @"END",1,0;
    @"DC",1,0;
    @"DS",1,0;
    @"COM",1,0;
    @"LOAD",2,#0001;
    @"ADD",2,#0010;
    @"STORE",2,#0011;
    @"JUMP",2,#0100;
    @"HALT",2,#0101;
    @"LDI",2,#0110;
    *EF
    **
    **ADDRESS FIELD CONSISTS OF ONE OF:
    **SINGLE SYMBOL-CONSISTING ONLY OF LETTERS,
    **DECIMAL INTEGER, OR
    **OCTAL INTEGER-PRECEDED BY &
    **
    $ADDNT = $SYMNT!$DECNT!$OCTNT;
    $SYMNT/1 = $LETTER,$RESTSYM;
    $RESTSYM = $LETTER,$RESTSYM!$LAMBDA;
    $DECNT/2 = $DEC;
    $OCTNT/3 = "&",$ODIG,$RESTOCT;
    $RESTOCT = $ODIG,$RESTOCT!$LAMBDA;
    $ODIG = "0"!"1"!"2"!"3"!"4"!"5"!"6"!"7";
    *EM
    1,"OPCODE NOT RECOGNISED";
    2,"LABEL DEFINED MORE THAN ONCE";
    3,"VARIABLE DEFINED MORE THAN ONCE";
    4,"UNDEFINED SYMBOL";
    5,"INVALID OPERAND";
    *VA
    LC;FLAG;ENDFLAG;P;Q;$I;$TEMPSYM;TEMPTOKN;
    *AD
    BEGIN
    LC := 0; ENDFLAG := 0;
    REPEAT
      FLAG := 0;
      REPEAT
        NEWCARD; $OCARD:=$CARD; PRINT;
        IF FIND($OP,OT,1) THEN FLAG := 1 ELSE ERROR(1) FI
      UNTIL FLAG = 1;
      IF OPTYPE = 2 THEN
    **
    **MACHINE OP
    **
        F(LC) := BIN;
        IF $LABEL <> "" THEN
          IF FIND($LABEL,ST,1) THEN
            IF $STYPE = "U" THEN $STYPE := "L";CALL PROCESSBACHAIN
            ELSE ERROR(2) FI
          ELSE INSERTIN ST ENTRY ($LABEL,"L",LC) FI
        FI;
        IF $OPERAND <> "" THEN
          IF FORMAT($OPERAND,$ADDNT) THEN
            TEMPTOKN:=NEXTT;
            IF TEMPTOKN=1 THEN
              $TEMPSYM:=$NEXTS;
              IF NOT FIND($TEMPSYM,ST,1) THEN
                INSERTIN BA ENTRY (LC,0);
                INSERTIN ST ENTRY ($TEMPSYM,"U",INDEX(BA));N(LC) := 0
              ELSE
                IF $STYPE = "U" THEN
                  INSERTIN BA ENTRY (LC,ADDR);
                  ADDR := INDEX(BA);N(LC) := 0
                ELSE N(LC) := ADDR FI
              FI
            ELSE
              IF TEMPTOKN=2 THEN N(LC):=NEXTV(10)
              ELSE N(LC):=NEXTV(8) FI
            FI
          ELSE ERROR(5) FI
        ELSE N(LC):=0 FI;
        LC := LC + 1;MONITOR("MC")
      ELSE
    **
    ** PSEUDO-OP
    **
        SWITCHON $POP INTO
          CASE"DC":CALL PROCESSVAR;
            NUM(LC) := VALUE($ARGUMENT); LC := LC + 1 ESAC
          CASE"DS":CALL PROCESSVAR;
            LC := LC + VALUE($ARGUMENT) ESAC
          CASE"END":ENDFLAG := 1 ESAC
        ENDCASE
      FI;MONITOR("MC")
    UNTIL ENDFLAG = 1;
    $I := "U";
    IF FIND($I,ST,2) THEN ERROR(4) FI;STOP;
    **
    ** SUBROUTINES
    **
    PROCESSVAR:
      IF $LAB = "" THEN RETURN FI;
      IF FIND($LAB,ST,1) THEN
        IF $STYPE = "U" THEN $STYPE := "V";CALL PROCESSBACHAIN
        ELSE ERROR(3) FI
      ELSE INSERTIN ST ENTRY ($LAB,"V",LC) FI;
      RETURN;
    PROCESSBACHAIN:
      Q := ADDR; ADDR := LC;
      REPEAT
        P := Q;POSITION BA TO P;Q := LINK;N(P1) := LC
      UNTIL Q = 0;
      RETURN
    END.
    START A BEGIN A 1 :5 : LOAD f ADD B STORE C
    #MONITOR(ST,BA,LC)
    #TRACE
    JUMP A
    #UNTRACE
    LDI 101 LDI &52 LDI B HALT 20 C
    #MONITOR(ST,BA,LC)
    #DUMP
    #FINISH

Fig. 4. An assembler specification for a one-pass assembler.

CONCLUSIONS
The notation described here enables one to specify the operation of a simple assembler in a reasonably concise manner. The notation is flexible and enables one to illustrate the important
features of a simple assembler (e.g. one-pass or two-pass assembly, simple macro facilities, a macro assembler with multilevel definition and expansion, assemblers producing binary object code or relocatable binary, assemblers producing a single block of code and data or producing separate blocks of code and data, free format or fixed format input, complex operand fields, assembly time constants, etc.).

A simulator which accepts a specification of an assembler in this notation and simulates its behaviour serves a dual purpose in teaching about assemblers. In the first place the design of the specification gives the student an insight into how the assembler operates; in the second place the monitor and trace facilities enable him to form a picture of the data structures within the assembler and the actions performed. In a relatively short period of time a student can obtain a reasonable depth of understanding of the properties of an assembler and how these might be implemented.

Such a simulator has been implemented by us on an ICL 19031 and has been used as a teaching aid to supplement part of the course on systems programming in the Computer Science undergraduate programme at Rhodes University. The most important result from our point of view was the extent to which it stimulated students to think about the implementation of different facets of assembly languages and the novel ideas with which students came forward.

REFERENCES
1. Seegmüller G., Systems programming as an emerging discipline. In Information Processing 74 (Edited by Rosenfeld J. L.). North Holland, Amsterdam (1974).
2. Curriculum 68. Commun. Ass. Comput. Mach. 11, 151-197