Enhancing Maintainability Through Disabbreviation of Source Programs
Kari Laitinen, Jorma Taramaa, Markku Heikkilä
VTT Electronics/Embedded Software, Oulu, Finland
Neil C. Rowe
Naval Postgraduate School, Department of Computer Science, Monterey, California
It is common to use abbreviations as names for different source program elements such as variables, constants, tables, and functions. In most cases, however, abbreviations make source programs difficult to understand and maintain. Disabbreviation means replacing abbreviated names with more informative natural names which consist of natural words. This paper presents an experimental tool to help software maintainers to disabbreviate existing source programs. The tool, which is implemented using Prolog, is an interactive and intelligent system which can suggest name replacements to its users. Common abbreviation patterns, a specialized dictionary, and comment information are used to deduce name replacements. The tool has been evaluated by using it to disabbreviate the source programs of several existing applications. About 40% of the name substitutions suggested by the tool were acceptable in the tests. Learning to use the tool does not require much effort, and one application can be disabbreviated within a few days. © 1997 by Elsevier Science Inc.
Address correspondence to Dr. Kari Laitinen, VTT Electronics/Embedded Software, P.O. Box 1100, 90571 Oulu, Finland. J. SYSTEMS SOFTWARE 1997; 37:117-128. © 1997 by Elsevier Science Inc., 655 Avenue of the Americas, New York, NY 10010.
1. INTRODUCTION
Software maintainers often face the situation that a change has to be made to an existing software system, but they do not know which source program files have to be modified. Sometimes they may know which files to modify, but they have difficulties in
deciding which program structures to touch, or they may be afraid of making modifications because they do not know how the modifications will affect the behavior of the entire software system. To overcome these problems, software maintainers usually need to study several source programs and other documents describing the software system under maintenance. Studying involves understanding, and therefore the understandability of source programs and other software documents is important to the success of software maintenance. In many cases, source programs are the only documents which reliably describe a software system (Bennett et al., 1991). For this reason, the understandability of source programs is extremely important. Understandability depends on several programming style factors which include, for example, the overall program file documentation and commenting style, the use of braces and other special characters, indentation, and alignment (Oman and Cook, 1991). Naming is yet another programming style factor which can strongly influence understandability. It has been, and it still is, common to use short and abbreviated names for different source program elements such as variables, tables, constants, procedures, and functions. The tradition of using abbreviated names is, at least partly, caused by the fact that, unlike modern compilers, early compilers restricted name lengths to a certain small number of characters. For this reason, it is usual that older programs contain more abbreviated names than programs
made today. There exist software systems in active use which are decades old and need to be modified from time to time (AMES, 1993) and systems which are expected to be maintained until the end of this decade (Taramaa and Oivo, 1993).
A nontraditional approach to naming is natural naming, which means that all names in source programs should be constructed using preferably several natural words of a natural language while respecting the grammatical rules of the natural language. The natural names must also describe the functionality of the program. We claim that by using natural names we can increase the understandability of source programs. We will justify this claim in the second section of the paper. Guidelines for using natural naming have been published by Keller (1990) and Laitinen and Seppanen (1990). More information about natural naming can be found in (Laitinen and Mukari, 1992; Laitinen, 1994; Laitinen, 1995).
In the case of existing source programs we often have the problem that many of the names in those programs are abbreviated, which causes understanding problems for software maintainers. As a solution to this problem, we have developed an experimental tool which examines existing source programs, checks the naturalness of all names by using electronic dictionaries, asks the user to give replacements for those names which are abbreviated or contain unknown words, and finally, outputs a new program version in which abbreviated names are replaced with natural ones. The tool also suggests replacements for abbreviated names whenever possible. The suggested replacements are generated according to disabbreviation methods which are based on commonly used abbreviations and textual information extracted from comments. The tool is able to gradually learn new ways of disabbreviating on the basis of those name replacements which the user gives during the processing of the source program.
We use the word “disabbreviation” to denote the process of replacing abbreviations with sequences of natural words. This process resembles the techniques for correcting spelling errors in text (Kukich, 1992), and a disabbreviation tool thus resembles a spelling checker. There are, however, some fundamental differences between correcting spelling errors and disabbreviation. Spelling errors are usually small accidental mistakes such as a missing or duplicated letter, or a letter being replaced with a letter which is close to it on the keyboard. Usually, it is easy to see what the correct replacement for a misspelled word is, but for abbreviated names, it is much harder to invent natural replacements. Abbreviations in source programs can be thought of as intentional “severe spelling errors.” In abbreviated names, words may be shortened by leaving out several letters and the shortened words may be combined together. Although a program contains abbreviations, it is considered to be correct, as it can be compiled.
Disabbreviation is also related to many natural language processing techniques (Smeaton, 1992; Rowe and Guglielmo, 1993) as names in source programs contain elements of natural languages. According to the taxonomy given by Chikofsky and Cross (1990) a disabbreviation tool can be considered a redocumentation tool which produces a better version of an existing source program. Disabbreviation can be seen as preventive maintenance (Bennett et al., 1991).
The disabbreviation tool which we describe in this paper has been developed in a project called AMES which is producing several other tools to help software maintenance (AMES, 1993). The other types of tools include reverse engineering tools which aim at producing higher-level descriptions from existing programs, application understanding tools which aim at providing additional descriptions to help in understanding existing applications, and impact analysis and navigation tools which should help to detect impacts of certain software modifications and connections among different software constructs.
We introduce a disabbreviation tool, which we call InName, in Section 3 of this paper after giving justifications for the use of natural naming in Section 2. In Section 3, we also discuss a more general disabbreviation tool (Rowe and Laitinen, 1995) on which InName is partly based. In Section 4, we evaluate the tool.
2. JUSTIFYING THE USE OF NATURAL NAMING
Intuitively, a natural name like “customer_number” is more understandable than an abbreviated name like “cnumbr” or “cusnum”. As illustrated in Figure 1, a naturally named program looks more understandable than the same program with abbreviations. In our opinion, the problem with names is that no name is fully understandable alone and the meaning of every name is bound to program context. Usually, no abbreviation is completely meaningless either, but in a natural name, the meaning has been made more obvious. Sometimes, natural names may be only slightly more understandable than abbreviated ones, but in these cases we can view the use of natural names as an approach to minimize the risks
for misunderstanding. Regardless of what kind of names are used, every source program is a complex construct. All conceivable means should be exploited to eliminate complexity.
is_string_a_palindrome( char given_string[], int *string_inspection_result )
{
   int string_start_index, string_end_index ;
   int string_length ;
   string_length = strlen( given_string ) ;
   string_start_index = 0 ;
   string_end_index = string_length - 1 ;
   *string_inspection_result = STRING_INSPECTION_NOT_READY ;
   while ( *string_inspection_result == STRING_INSPECTION_NOT_READY )
   {
      if ( given_string[ string_start_index ] == given_string[ string_end_index ] )
      {
         if ( string_start_index == string_length - 1 )
         {
            *string_inspection_result = STRING_IS_A_PALINDROME ;
         }
         else
         {
            string_start_index ++ ;
            string_end_index -- ;
         }
      }
      else
      {
         *string_inspection_result = STRING_IS_NOT_A_PALINDROME ;
      }
   }
}
ispdrome( char str[], int *presult )
{
   int i, j, len ;
   len = strlen( str ) ;
   i = 0 ; j = len - 1 ; *presult = NREADY ;
   while ( *presult == NREADY )
   {
      if ( str[ i ] == str[ j ] )
      {
         if ( i == len - 1 )
         {  *presult = YES ;  }
         else
         {  i ++ ; j -- ;  }
      }
      else
      {  *presult = NO ;  }
   }
}
Figure 1. The same source program with natural names (first listing) and with abbreviated names (second listing).
The effect of naming has been tested, among other programming style factors, by using groups of people to study source programs written with different naming styles (Weissman, 1974; Sheppard et al., 1979; Shneiderman, 1980; Teasley, 1994). Understandability of source programs has been measured by asking questions about the programs, or asking the people to modify or memorize the programs. Usually programs containing mnemonic or natural names have been found easier to understand than programs with abbreviated or randomly chosen names. However, these understandability tests have not always produced statistically significant results. Therefore, we cannot say that the usefulness of informative naming styles has been fully proven in human experiments.
The understandability tests have been carried out using students to study small programs, because it is difficult to organize controlled experiments in actual software development situations. Teasley (1994) points out that, although small experiments have not always produced convincing results, the use of informative naming styles could still be useful in industrial settings where people need to work with applications containing hundreds or even thousands of names.
The complexity of natural languages can be one reason why the effect of naming is hard to measure in understandability tests. Names in source programs are constructed of words and letters of natural languages. Linguists note that natural languages are complex and they are not yet fully understood (Fromkin and Rodman, 1988). Philosophical studies by Wittgenstein (1953) suggest that using a language is a complex activity because the meanings of some symbols become evident only in the context in which a language is used.
Although we agree that software-engineering research should seek empirical validation of development methods and practices (Fenton, 1993; Glass, 1994), we cannot unambiguously prove the advantages of natural naming. However, we consider that the following points support the use of natural naming in software development and thereby disabbreviation of source programs in software maintenance.
• Natural names are generally used in graphical-textual descriptions belonging to software development methods (e.g., Yourdon, 1989; Coad and Yourdon, 1990). We can assume that natural naming is one reason graphical-textual descriptions are considered useful in software development.
• Some software development methods (e.g., Page-Jones, 1988; Yourdon, 1989) recommend the use of so-called pseudo-coding, which means describing programs with a language that is somewhere in between a natural language and a programming language. The use of natural naming brings source programs closer to natural language.
• The use of abbreviations has been criticized in other contexts of technical documentation (Logsdon and Logsdon, 1986; Ibrahim, 1989).
• Software developers who have been given courses on natural naming generally agree that natural names make source programs more understandable. People working in industrial organizations in particular support the use of natural naming (Laitinen, 1995).
• It is possible to formulate a philosophic-linguistic theory for software development to support the use of natural naming (Laitinen and Taramaa, 1994).
These are good reasons to believe that it is relevant to try to “naturalize” abbreviated names in existing source programs. Searching for names in source programs and checking whether they are natural or not is a mechanical process. For this reason, naturalization can be done with a computerized tool.
3. INNAME: A DISABBREVIATION TOOL
3.1 Background
The InName disabbreviation tool has been implemented using Quintus Prolog in UNIX. The tool checks whether the names found in a program are acceptable when compared to the words stored in its dictionaries. The names which are not acceptable are called unknown names. The “In” in the name of the tool denotes “intelligent” and “interactive”. InName is intelligent in the sense that whenever it finds an unknown name in a program it tries to suggest appropriate replacements for the unknown name. InName is interactive, as all names must be accepted by the user of the tool and sometimes the user gives a replacement from which the tool can learn.
InName is partly based on a more general disabbreviation tool (Rowe and Laitinen, 1995), which has been used to disabbreviate source programs and other technical texts. Source programs are a distinctive kind of technical text, and therefore, their proper disabbreviation needs a special tool which is tailored for a specific programming language. InName is made to disabbreviate programs written in C, because there exist many C applications which need maintenance. The more general disabbreviation tool is inadequate for this purpose for several reasons: it does not split names systematically, and therefore it can sometimes disabbreviate only parts of names; it cannot adequately use comment information in the disabbreviation process; its dictionary is not tailored to the software domain; and it lacks a proper user interface. A special disabbreviation tool for source programs can also be designed to be integrated with other software development tools.
3.2 A Grammar for Splitting Names
To check which names need to be disabbreviated, the names found in the program under disabbreviation are first split into separate words. Then, each word is compared to the words in dictionaries. Programmers have certain common means for separating different words in names. In C programming these means include using underscore characters as word separators or using capitalized words, i.e., words with an initial uppercase letter and the rest of the letters lowercase. For example, in names like prev_pos and DispBuff the programmer has obviously thought that the letter combinations “prev”, “pos”, “disp”, and “buff” should be considered as distinct words, although they are not words in the same sense as natural words.
Splitting names into separate words is a somewhat complex process and is best done with grammar rules which can conveniently be implemented with Prolog. The grammar for splitting names is illustrated in Figure 2. This grammar is able to split most names in which typical word separation conventions are used. There are, however, names which this grammar cannot decompose into words. In these cases, the entire name is treated as a single word. The grammar in Figure 2 can perform, for example, the following name decompositions.
boOffsetMeasDone → bo, Offset, Meas, Done
TIME_UpdateRealTime → TIME, Update, Real, Time
Without being able to decompose names properly we could not find out whether they are natural or not. The latter name above would be found acceptable, as the name can be split into natural words, but the first name would not be acceptable because “bo” and “Meas” are not words found in InName dictionaries.
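The splitting rules can also be sketched outside Prolog. The following Python fragment is only an illustration, not the InName implementation: it assumes that the alternatives of the grammar in Figure 2 can be expressed as one regular expression per underscore-separated part, and the names PART, CAPITALIZED, and split_name are hypothetical.

```python
import re

# A capitalized word: one uppercase letter, one lowercase letter,
# then any lowercase letters or digits.
CAPITALIZED = r"[A-Z][a-z][a-z0-9]*"

# One underscore-less part of a name, following the grammar alternatives:
#   [lowercase_word] [capitalized words] [uppercase_word]
#   | uppercase_word, capitalized words
#   | word starting with a number
PART = re.compile(
    rf"""
      (?P<lower>[a-z][a-z0-9]*)? (?P<caps>(?:{CAPITALIZED})*) (?P<upper>[A-Z][A-Z0-9]*)?
    | (?P<upper_first>[A-Z][A-Z0-9]*) (?P<caps_after>(?:{CAPITALIZED})+)
    | (?P<number>[0-9][^_]*)
    """,
    re.VERBOSE,
)


def split_name(name):
    """Split a name into words; an undecomposable name is returned whole."""
    words = []
    for part in name.split("_"):
        match = PART.fullmatch(part) if part else None
        if match is None:
            return [name]  # treat the entire name as a single word
        if match["number"]:
            words.append(match["number"])
        elif match["upper_first"]:
            words.append(match["upper_first"])
            words += re.findall(CAPITALIZED, match["caps_after"])
        else:
            if match["lower"]:
                words.append(match["lower"])
            words += re.findall(CAPITALIZED, match["caps"])
            if match["upper"]:
                words.append(match["upper"])
    return words


print(split_name("boOffsetMeasDone"))     # ['bo', 'Offset', 'Meas', 'Done']
print(split_name("TIME_UpdateRealTime"))  # ['TIME', 'Update', 'Real', 'Time']
```

The regular expression engine backtracks over the alternatives much as Prolog would over grammar rules, which is why a name such as TIMEUpdate can still be cut after TIME rather than after TIMEU.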
grammatical_name -->
    underscoreless_word_sequence
  | underscoreless_word_sequence, "_", grammatical_name.
underscoreless_word_sequence -->
    lowercase_word
  | uppercase_word
  | list_of_capitalized_words
  | lowercase_word, list_of_capitalized_words
  | uppercase_word, list_of_capitalized_words
  | list_of_capitalized_words, uppercase_word
  | lowercase_word, list_of_capitalized_words, uppercase_word
  | lowercase_word, uppercase_word
  | word_starting_with_a_number.
list_of_capitalized_words -->
    capitalized_word
  | capitalized_word, list_of_capitalized_words.
capitalized_word -->
    uppercase_letter, lowercase_letter
  | uppercase_letter, lowercase_letter, list_of_lowercase_letters_or_numbers.
lowercase_word -->
    lowercase_letter
  | lowercase_letter, list_of_lowercase_letters_or_numbers.
uppercase_word -->
    uppercase_letter
  | uppercase_letter, list_of_uppercase_letters_or_numbers.
word_starting_with_a_number -->
    a_number
  | a_number, list_of_any_characters_except_underscores.
Figure 2. A grammar for decomposing names.
3.3 Dictionaries
InName has two types of electronic dictionaries. One contains the words which are generally used in names in source programs. The other is a user-defined domain dictionary where words specific to a certain application domain are stored. The domain dictionary is updated during the disabbreviation process whenever the user wants to use words not found in the general dictionary. The user can have different domain dictionaries for different applications.
The general dictionary of the InName tool was constructed by first taking all the words used in the guideline for natural naming (Laitinen and Seppanen, 1990) and in the related name generation method (Laitinen and Mukari, 1992). More words were added gradually as suitable words were found, especially while InName was being tested with different source programs. A vocabulary analysis was performed on the entire documentation of one existing software system and the most common words were placed in the general dictionary of InName.
When developing the InName tool we determined that we do not need a large general dictionary, such as the 29,000-word dictionary constructed for the general disabbreviation tool (Rowe and Laitinen, 1995). Presently, the general dictionary of InName contains only about 1300 words which we consider the most common words in names in source programs. We have found this dictionary to be sufficient for the programs we have disabbreviated. About 30% of the words in the general dictionary are different forms of the most common root words. The core vocabulary of the dictionary is thus about 900 words. This number is in harmony with the estimate by Carter (1982) that even for large systems the number of words needed in names is likely to be calculated in the hundreds, not the thousands. The dictionary size can also be justified by noting that a list of 850 words has been considered to form the basic English vocabulary (Carter, 1987).
With a rather small dictionary, the tool is faster and less likely to propose silly disabbreviations. By having different forms of words (e.g., open, opens, opened, and opening for verbs; small, smaller, and smallest for adjectives; and buffer and buffers for nouns) stored in the dictionary, we do not need morphological rules in the system, which in turn simplifies the disabbreviation process substantially. Natural language processing systems with large dictionaries (e.g., Uthurusamy et al., 1993; Rowe and Guglielmo, 1993) usually have morphological rules to detect different forms of root words. Morphological rules are not necessary in the disabbreviation of names because the vocabulary is small, and the English language used in names in source programs is rather simple and has less morphological variation than common spoken or written English.
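As a rough illustration of how the two dictionaries can be consulted, the sketch below checks the words of an already split name against a general and a domain dictionary. It is written in Python rather than Prolog, the dictionary contents are a tiny invented sample, the case-insensitive lookup is an assumption, and the name unknown_words is hypothetical.

```python
# A tiny invented sample of a general dictionary. Word forms are stored
# explicitly, so no morphological rules are needed for lookup.
GENERAL_DICTIONARY = {
    "time", "update", "real", "offset", "done",
    "open", "opens", "opened", "opening",
    "buffer", "buffers",
}


def unknown_words(words, general_dictionary, domain_dictionary=frozenset()):
    """Return the words that neither dictionary contains (lookup ignores case)."""
    return [
        word
        for word in words
        if word.lower() not in general_dictionary
        and word.lower() not in domain_dictionary
    ]


print(unknown_words(["TIME", "Update", "Real", "Time"], GENERAL_DICTIONARY))  # []
print(unknown_words(["bo", "Offset", "Meas", "Done"], GENERAL_DICTIONARY))    # ['bo', 'Meas']

# A domain dictionary grows when the user accepts an application-specific word.
domain_dictionary = set()
domain_dictionary.add("modem")  # hypothetical domain word
print(unknown_words(["modem", "buffers"], GENERAL_DICTIONARY, domain_dictionary))  # []
```

A name is acceptable when the returned list is empty; otherwise it is an unknown name and is passed on to the disabbreviation methods.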
3.4 Disabbreviation Methods
InName tries to disabbreviate every separate word in an unknown name, and, if it succeeds, it puts the disabbreviations, the results of the disabbreviation process, back together and produces a possible replacement for the unknown name. The possible replacements are called name candidates, and the disabbreviation process can produce several name candidates for a single unknown name.
3.4.1 Common abbreviations. The most basic disabbreviation method is to check whether an unknown name contains commonly used abbreviations and words listed in InName dictionaries. The tool has a repository of more than 300 common abbreviations for this purpose. The repository has been collected by inspecting existing source programs and program examples in textbooks. We suppose that people have learned to use some abbreviations by studying programs written by others. On this basis, it is relevant to use textbooks for finding common abbreviations. We have been gradually extending the repository of common abbreviations while testing the InName tool. We could also have added abbreviations invented by ourselves, but, to keep the abbreviation repository small and realistic, we have wanted to use only those which we have seen in real programs. The repository of common abbreviations does not need to be very large, because, as we discussed earlier, the number of words used in names is not very large. Disabbreviation using common abbreviations and dictionary words proceeds, for example, as follows.
• The name “tmpnamelen” is disabbreviated to “temporary_name_length”.
• The name “currwinheight” is disabbreviated to “current_window_height”.
These disabbreviations can take place when the following Prolog facts are defined:
common_abbreviation(tmp, temporary).
common_abbreviation(len, length).
common_abbreviation(curr, current).
common_abbreviation(win, window).
dictionary_word(name).
dictionary_word(height).
The backtracking mechanism of Prolog is utilized in the disabbreviation process. Backtracking is particularly useful when the tool has to try many combinations of different words and abbreviations. The tool tries to split every unknown word in every unknown name into up to four pieces, trying to find a combination in which every substring of the unknown word is either a word in the dictionaries or can be explained with the repository of known abbreviations. Sometimes the tool can find many combinations to disabbreviate an unknown word. In these cases, the tool proposes several name candidates to the user.
3.4.2 Learning from user-given substitutions. The tool can also learn during the disabbreviation process. When the user gives a replacement for an unknown name, the tool deduces domain-specific word abbreviations from the user-given substitution. These word abbreviations are then exploited when subsequent unknown names are disabbreviated. Domain-specific word abbreviations are used similarly to the standard common abbreviations. For example, suppose that the tool cannot propose any name candidates for the name “cnumbrlen”, and the user then gives the substitution “customer_number_length”. From this the tool can deduce domain-specific word substitutions as illustrated in Figure 3. The deduction is based on the general abbreviation rule which is formally defined in Figure 4.
Figure 3. Word substitutions deduced from a name substitution.
Let $A = a_1 a_2 \dots a_m$ be a candidate abbreviation and $W = w_1 w_2 \dots w_n$ a word. If $0 \le m \le n$, $a_1 = w_1$, and $\exists\,(i_2, i_3, i_4, \dots, i_m) : 1 < i_2 < i_3 < \dots < i_m \le n$, $a_2 = w_{i_2}$, $a_3 = w_{i_3}$, $a_4 = w_{i_4}$, $\dots$, $a_m = w_{i_m}$, then $A$ is an abbreviation of the word $W$.
Figure 4. The general abbreviation rule.
The general abbreviation rule says that a word and its abbreviation must start with the same letter and the second through last letters of the abbreviation must be found in the same order among the second through last letters of the word. This rule applies in many cases, since it is almost always true that a word and its abbreviation start with the same letter. According to this rule, the first letter of a word is always its abbreviation and an empty string is an abbreviation for any word.
The algorithm which deduces domain-specific word abbreviations splits an unknown name into as many pieces as there are words in the user-given substitution. Then it applies the general abbreviation rule to the pieces of the unknown name and the words of the user-given substitution. Possible combinations of the unknown name are tried by using backtracking, until the domain-specific word abbreviations are found or all possibilities have been tried. When we allow empty strings while deducing domain-specific word abbreviations, we can handle cases when the user-given substitution has more words than can be recognized in the unknown name. For example, the tool produces the above-mentioned deductions also in the case when the user gives the name “customer_number_length_in_bytes” as a substitution for “cnumbrlen”. The words “in” and “bytes” are abbreviated as empty strings.
Having deduced domain-specific word abbreviations, the tool becomes more clever when these abbreviations are used in other unknown names. For example, when the tool has made the deductions described above it can automatically generate the following name candidates:
cnumbr → customer_number
productnumbrlen → product_number_length
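The interplay of the two methods above can be sketched as follows. This is an illustrative Python approximation, not the Prolog code of InName: recursion stands in for Prolog backtracking, the function names are hypothetical, and “product” is assumed to be a dictionary word so that the last example can be reproduced.

```python
COMMON_ABBREVIATIONS = {"tmp": "temporary", "len": "length",
                        "curr": "current", "win": "window"}
DICTIONARY = {"name", "height", "product"}  # "product" is an assumed entry


def is_abbreviation(abbreviation, word):
    """General abbreviation rule: same first letter, and the remaining letters
    of the abbreviation occur in the same order later in the word."""
    if not abbreviation:
        return True  # the empty string abbreviates any word
    if not word or abbreviation[0] != word[0]:
        return False
    remaining = iter(word[1:])
    # "letter in remaining" consumes the iterator, which preserves the order.
    return all(letter in remaining for letter in abbreviation[1:])


def deduce_abbreviations(unknown_name, substitution_words):
    """Cut unknown_name into one (possibly empty) piece per substitution word
    so that each piece abbreviates its word; return the learned pairs."""
    def search(rest, words):
        if not words:
            return [] if not rest else None
        for cut in range(len(rest) + 1):
            piece = rest[:cut]
            if is_abbreviation(piece, words[0]):
                tail = search(rest[cut:], words[1:])
                if tail is not None:
                    return [(piece, words[0])] + tail
        return None  # dead end: the caller tries its next cut

    pairs = search(unknown_name, substitution_words) or []
    return {piece: word for piece, word in pairs if piece and piece != word}


def name_candidates(unknown_word, abbreviations, dictionary, max_pieces=4):
    """Cut unknown_word into at most max_pieces pieces, each a dictionary word
    or a known abbreviation, and join the expansions with underscores."""
    def expand(rest, pieces_left):
        if not rest:
            yield []
            return
        if pieces_left == 0:
            return
        for cut in range(1, len(rest) + 1):
            piece = rest[:cut]
            options = [piece] if piece in dictionary else []
            if piece in abbreviations:
                options.append(abbreviations[piece])
            for option in options:
                for tail in expand(rest[cut:], pieces_left - 1):
                    yield [option] + tail

    return ["_".join(words) for words in expand(unknown_word, max_pieces)]


learned = deduce_abbreviations("cnumbrlen", ["customer", "number", "length"])
known = {**COMMON_ABBREVIATIONS, **learned}
print(learned)  # {'c': 'customer', 'numbr': 'number', 'len': 'length'}
print(name_candidates("tmpnamelen", known, DICTIONARY))       # ['temporary_name_length']
print(name_candidates("cnumbr", known, DICTIONARY))           # ['customer_number']
print(name_candidates("productnumbrlen", known, DICTIONARY))  # ['product_number_length']
```

Returning the first consistent split is a simplification; when several splits satisfy the rule, a tool could instead present the alternatives to the user.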
3.4.3 Using comment information. Both names and comments are similar textual information in that their purpose is to explain how a program works. The use of natural naming makes some comments superfluous (Keller, 1990; Laitinen and Seppanen, 1990). On this basis we can suppose that in the case of abbreviated names comment information can be used to generate name candidates which then replace the abbreviated names. It is common in source programs that variables and other data structures are defined with an accompanying comment that occupies the rest of the line, for example, in the following way.
(a) char buff[256]; /* disk file buffer */
(b) int nbytes; /* number of bytes in buffer */
Also, there are comments which explain the functionality of a program on separate lines, for example, as follows.
(c) /* The following function is used to calculate mean value of the numbers in the integer list, which is given as an input argument. */
(d) float calcmean( int ilist[], int lien )
    { ....
By studying these examples we can see that the words in comments can be used to produce natural name candidates for the abbreviated names. The following name replacements seem plausible.
(e) buff → disk_file_buffer
(f) nbytes → number_of_bytes_in_buffer
(g) calcmean → calculate_mean_value
(h) ilist → integer_list
In the cases (e) and (f) the name substitution is formed by joining all the words which are in the comment after the definitions (a) and (b). In cases [...] comment (c) above would yield the following potential name candidates.
following_function, calculate_mean, calculate_mean_value, mean_value, integer_list, input_argument.
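The way such potential name candidates can be derived from a comment, and then matched against an unknown name, is sketched below in Python. The sketch rests on several assumptions: discarded words and punctuation are taken to break a run of usable words (which reproduces the list above), the stop-word set is a small invented sample of prepositions, pronouns, and auxiliary verb forms, matching uses the general abbreviation rule of Section 3.4.2, and all function names are hypothetical.

```python
import re

# Invented sample of words to discard in addition to one- and two-letter words.
STOP_WORDS = {"the", "which", "who", "that", "they", "nothing",
              "had", "have", "been", "with", "from", "into", "for"}


def potential_name_candidates(comment_text):
    """Form two- and three-word combinations of adjacent usable words."""
    tokens = re.findall(r"[a-z]+|[^a-z\s]", comment_text.lower())
    runs, current = [], []
    for token in tokens:
        if token.isalpha() and len(token) > 2 and token not in STOP_WORDS:
            current.append(token)
        else:  # punctuation or a discarded word ends the current run
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)
    candidates = []
    for run in runs:
        for start in range(len(run)):
            for size in (2, 3):
                if start + size <= len(run):
                    candidates.append("_".join(run[start:start + size]))
    return candidates


def is_abbreviation(abbreviation, word):
    """General abbreviation rule, applied here to whole names."""
    if not abbreviation:
        return True
    if not word or abbreviation[0] != word[0]:
        return False
    remaining = iter(word[1:])
    return all(letter in remaining for letter in abbreviation[1:])


def suggest_from_comments(unknown_name, candidates):
    """Suggest stored candidates that the unknown name could abbreviate."""
    if len(unknown_name) <= 2:
        return []  # very short names would match too many candidates
    return [c for c in candidates if is_abbreviation(unknown_name, c)]


comment = ("The following function is used to calculate mean value of the "
           "numbers in the integer list, which is given as an input argument.")
candidates = potential_name_candidates(comment)
print(candidates)
print(suggest_from_comments("calcmean", candidates))  # ['calculate_mean', 'calculate_mean_value']
print(suggest_from_comments("ilist", candidates))     # ['integer_list']
```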
When forming the potential name candidates, the tool makes two- or three-word combinations. The words which are accepted as potential name candidates must follow each other in the comment text and not be separated with punctuation symbols. Words containing only one or two letters are discarded together with all longer prepositions and the article “the”. Also, all pronouns (e.g., “who”, “nothing”, “that”, and “they”) and auxiliary verb forms (e.g., “had”, “have”, “been”) are discarded. When selecting potential name candidates the tool thus discards words that are unlikely to be parts of names. It is necessary to limit the number of potential name candidates because that lessens the possibility that the tool suggests silly names, and also because processing the potential name candidates is quite time consuming.
Potential name candidates are put into the database when a source program is read in. When the tool generates name candidates for an unknown name, it decides whether some potential name candidates in the database should be suggested to the user. The decision is made by using the general abbreviation rule which we introduced in the previous subsection. According to this rule, for example, the potential name candidates “calculate_mean” and “calculate_mean_value” would be suggested as replacements for the unknown name “calcmean”, because they start with the same letter and the character sequence “alcmean” is found in both of the candidates. On the same basis, the name “integer_list” is suggested to replace “ilist”. It must be noted that the tool does not use the potential name candidates in the cases when the unknown name is only one or two characters long because this would often result in silly name candidates.
3.5 The Three Phases of the Disabbreviation Process
Disabbreviation is carried out in three distinct phases which we call passes. In the first pass the tool reads in the source program and checks whether its names are acceptable or not. The tool performs a simplified lexical analysis for C programs. This analysis can detect names invented by the programmer without mixing them with comments, include commands, reserved words, standard library functions, and string, character, and numerical values. Simplified lexical analysis is sufficient for disabbreviation purposes, but complete lexical analysis and parsing could increase the usability of the tool. During the first pass, comment information is used to produce potential name substitutions and potential name candidates. Every found name is split into separate words according to the name grammar, and the validity of each word is checked by comparing it with the words in the dictionaries. The first pass results in a list of unknown names together with information about line numbers.
In the second pass, the actual disabbreviation takes place. The tool tries to disabbreviate every unknown word in the unknown names found in the first pass. On the basis of the disabbreviation process the tool displays a list of name candidates. The user can select one of them, give an alternative substitution when the name candidates proposed by the tool are not acceptable, or leave the name as it is by using the skipping facility. When the name substitution is invented by the user, the tool asks for confirmation before adding new words to the domain dictionary. The tool also checks that selected name substitutions are valid in comparison to other names. Two different names may not always be replaced with the same substitution and a given name substitution may not usually be an existing name in the program.
The third pass produces a new version of the source program. The unknown names which were given an acceptable substitution are replaced with their substitutions. The third pass also updates a file containing all name changes made in the source programs of the application currently being processed.
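The validity checks of the second pass and the replacement step of the third pass can be illustrated with a short Python sketch. It is an assumption-laden approximation rather than InName's own analysis: one regular expression plays the role of the simplified lexical analysis by consuming comments, string and character literals, and numeric values as whole tokens, and the names TOKEN, substitution_problems, and apply_substitutions are hypothetical.

```python
import re

# Comments, literals, and numbers are matched first, so that identifiers
# inside them are never seen as separate tokens.
TOKEN = re.compile(
    r"""
      /\*.*?\*/                 # comment
    | "(?:\\.|[^"\\])*"         # string literal
    | '(?:\\.|[^'\\])*'         # character literal
    | [0-9][A-Za-z0-9_.]*       # numeric value
    | [A-Za-z_][A-Za-z0-9_]*    # identifier
    """,
    re.VERBOSE | re.DOTALL,
)


def substitution_problems(old_name, new_name, accepted, program_names):
    """List reasons why new_name may be an invalid replacement for old_name."""
    problems = []
    if new_name != old_name and new_name in program_names:
        problems.append(f"{new_name} is already a name in the program")
    for other_name, chosen in accepted.items():
        if other_name != old_name and chosen == new_name:
            problems.append(f"{new_name} already replaces {other_name}")
    return problems


def apply_substitutions(source, substitutions):
    """Replace whole identifiers; other tokens pass through unchanged."""
    def replace(match):
        token = match.group(0)
        return substitutions.get(token, token)
    return TOKEN.sub(replace, source)


source = 'int nbytes; /* nbytes in buffer */\nprintf("nbytes=%d", nbytes);\n'
print(apply_substitutions(source, {"nbytes": "number_of_bytes"}))
print(substitution_problems("nb", "number_of_bytes",
                            {"cnt": "number_of_bytes"}, {"nb", "cnt", "buffer"}))
```

Because a comment or a string literal is matched as one token and never equals a key of the substitution table, the text inside it is left as it was.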
The changes file is used to ensure that public names used in several source programs have the same substitutions. The file is updated when a program file has been disabbreviated, and it is read in again when the next source program file is loaded into InName. Using the changes file, the tool is able to tell the user which names are public and to suggest the same replacements that were previously used for the same public names.
3.6 InName User Interface
Figure 5 illustrates the user interface of the InName tool. It consists of two windows. The source program being processed is shown in one window, the program window, together with line numbers. The other window is a command window which contains subwindows for showing the unknown name and the name candidates which the tool has generated. The unknown name which is shown in the command window is simultaneously highlighted in the program window. The command window also has subwindows for selecting new files for processing and for choosing the domain dictionary file. The user can select a name candidate from the list with the mouse, and the candidate is copied to the “Change to:” subwindow. The user can also enter a better replacement in the “Change to:” subwindow. The command window has buttons for changing and skipping unknown names as well as for adding the words of an unknown name to the domain dictionary. The list of name candidates can be scrolled using the “More” and “Previous” buttons. The user can convert uppercase words to lowercase and vice versa by using the middle mouse button.
Figure 5. The user interface of the InName tool.
4. EVALUATING THE INNAME TOOL
We have tested the InName tool by disabbreviating the source programs of five different applications with the tool. Data related to these applications are presented in Table 1. Each application was disabbreviated by a different person. Application A is a software system for controlling a space instrument. It was disabbreviated by an expert who is one of the original developers of the application. The disabbreviated source programs of Application A are now being maintained and developed further until the space instrument leaves Earth in 1996. Application B is simulation software to be run on workstations and applications C, D, and E are UNIX software
obtained from the Internet. Applications B through E were disabbreviated just to test the InName tool, and the disabbreviators were not previously familiar with these applications.
Table 1. Statistics of disabbreviated applications
APPLICATION                                                                       A               B               C               D               E
Disabbreviated by                                                                 K. Kumpulainen  K. Laitinen     H. Puustinen    J. Taramaa      M. Vierimaa
Number of .c and .h files in the application                                      56              17              15              12              7
Total number of source code lines in the entire application (all lines counted)   12075           6114            3874            6420            3331
Total number of names in the application                                          1410            927             439             740             272
Number of acceptable names                                                        79              29              38              69              30
Number of names needing disabbreviation                                           1331 (100%)     898 (100%)      401 (100%)      671 (100%)      242 (100%)
Number of accepted name substitutions generated by the tool                       (45%)           414 (46%)       155 (39%)       346 (51%)       113 (47%)
Number of name substitutions given by the user                                    (33%)           181 (20%)       (14%)           198 (30%)       (36%)
Number of names added to the domain dictionary with the ADD button                118 (9%)        (10%)           (2%)            (4%)            (4%)
Number of skipped unknown names                                                   172 (13%)       212 (24%)       180 (45%)       (15%)           (13%)
Total number of words added to domain dictionary in added and user given names    246             210             38              122             73
Approximate time spent in disabbreviation                                         11 hours        5 hours         4 hours         4 hours         3 hours
Considering the data in Table 1, we can make the following remarks. In most cases, more than 90% of the names were not accepted by the tool; less than 10% of the original names consisted of words stored in the InName dictionaries. In every application, more than half of the originally unacceptable names were replaced by new ones, either by names proposed by the tool or by names given by the user. In most cases, more than 40% of the names proposed by the tool were acceptable. There are no remarkable differences between the results of Application A and the other applications. This indicates that nonexperts can successfully use the tool. InName can thus be useful when a person has to start learning and maintaining a new application. The performance of the tool was worst in the case of Application C (greatest percentage of skipped names). Application C contained many very short names such as “i”, “m”, “k”, “r1”, “r2”, “m1”, and “hh” which are hard to disabbreviate. In the worst case (Application A), the size of the specialized domain dictionary was only about 20% (246/1300) of the InName general dictionary. This indicates that the size of the general dictionary is sufficient, i.e., most of the words needed can be found there. Even the largest application could be disabbreviated in less than two days.
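The first remark and the dictionary ratio can be rechecked with a few lines of arithmetic. The following Python sketch uses only the total and acceptable name counts of Table 1 and the 246/1300 ratio quoted above; the names `TABLE_1` and `unaccepted_share` are introduced here for illustration only.

```python
# (total names, acceptable names) per application, copied from Table 1.
TABLE_1 = {
    "A": (1410, 79),
    "B": (927, 29),
    "C": (439, 38),
    "D": (740, 69),
    "E": (272, 30),
}


def unaccepted_share(total, acceptable):
    """Percentage of names that the dictionaries did not accept."""
    return 100.0 * (total - acceptable) / total


shares = {app: round(unaccepted_share(*counts), 1)
          for app, counts in TABLE_1.items()}

# Size of the Application A domain dictionary relative to the general
# dictionary (246 added words against about 1300 words).
domain_ratio = round(100.0 * 246 / 1300, 1)
```

Under these figures, four of the five applications exceed the 90% mark and Application E falls just below it, which agrees with the wording "in most cases". The dictionary ratio comes out slightly under the "about 20%" stated in the text.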
We consider that the results of these experiments support the use of a disabbreviation tool in software maintenance organizations. Considering that learning to use the tool is easy, and that the disabbreviation process itself takes only a few days, disabbreviating an application is not an expensive investment for a maintenance organization. (We exclude the cost of the disabbreviation tool here.) A small investment in disabbreviation could be made, although it is not possible to tell exactly how much the maintainability of an application is enhanced by disabbreviation. As we explained in the second section of this paper, it is hard to demonstrate the effects of different naming styles, but the use of natural naming can still be justified. Disabbreviating source programs increases the naturalness of their names. In the case of Application A in Table 1, the implementation and maintenance of the application has already taken about three staff years. The company developing Application A was willing to invest a little more in the disabbreviation of the application because they believe that enhanced naming will help in further maintenance activities.
Considering the practical use of InName, we think that other maintenance activities should be suspended while an application is being disabbreviated. Because disabbreviation produces new versions of existing source programs, the situation might become too complex if staff members tried to make other program modifications simultaneously.
InName could be improved by having it process all source programs of an application simultaneously. It currently inputs one source program file at a time and maintains a list of name modifications made in previously disabbreviated files. This way it can tell the user to make the same modifications to
all public names. However, it is possible to make errors in name modifications by skipping a name in one file and modifying the same name in another file. If all files of an application were processed at the same time, modifications to public names could be made simultaneously in every file. If InName were improved in this way, its usability would increase, but this would not change the disabbreviation methods. It could be a problem to simultaneously process all the files of a really large application.
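The kind of error described above, where a public name is skipped in one file and modified in another, could be detected by comparing the decisions recorded for each file. The following Python sketch is only an illustration under stated assumptions: the function `merge_decisions`, the file names, and the representation of a skipped name as `None` are hypothetical and do not describe the format of the InName changes file.

```python
def merge_decisions(per_file_decisions):
    """Merge per-file decisions on public names.

    per_file_decisions maps a file name to {old_name: new_name}, where
    new_name is None when the user skipped the name (an assumed encoding).
    Returns (merged, conflicts): the first decision recorded for each
    name, and every name whose decisions differ between files.
    """
    first_seen = {}
    conflicts = {}
    for file_name, decisions in per_file_decisions.items():
        for old, new in decisions.items():
            if old not in first_seen:
                first_seen[old] = (new, file_name)
            elif first_seen[old][0] != new:
                conflicts.setdefault(old, [first_seen[old]]).append(
                    (new, file_name))
    merged = {old: new for old, (new, _) in first_seen.items()}
    return merged, conflicts


# Hypothetical decisions made while processing two files one at a time.
decisions = {
    "main.c": {"pgrp": "process_group", "maxt": None},
    "util.c": {"pgrp": None, "maxt": None},
}
merged, conflicts = merge_decisions(decisions)
```

Here `pgrp` is reported as a conflict because it was renamed in one file and skipped in the other, while `maxt` is consistent. Processing all files at once, as suggested above, would make such a check unnecessary because each public name would be decided only once.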
5. CONCLUSIONS
InName is an experimental tool with which we have demonstrated that disabbreviation is possible in practical software maintenance work. We believe that we have found the essential disabbreviation methods, though it may still be possible to improve these methods, for example, by adding more common abbreviations to the database. Because disabbreviation is a semiautomatic process which involves user participation, a tool that can learn from the names given by the user is the most appropriate. For this reason, Prolog is a suitable language for the implementation of tools of this kind.
Giving names to various source program elements is a human activity. It has turned out to be difficult to measure in exact terms how different naming styles affect the understandability of source programs, or how understandability affects the maintainability of applications. We have, however, justified the natural naming approach. InName has been used to disabbreviate the source programs of existing applications. These experiments show that disabbreviation does not require much investment in staff hours. On this basis, disabbreviation is a promising approach to alleviating understanding problems in software maintenance.
ACKNOWLEDGMENT
This work has been supported by the Technology Development Centre of Finland (TEKES) and the Technical Research Centre of Finland (VTT). It has been carried out in the European Strategic Program for Research in Information Technology (ESPRIT) in the project Application Management Environments and Support (AMES, Project Number 8158). Our collaborators in the AMES project are Cap Gemini Innovation from France, Cap Programator from Sweden, Intecs Sistemi from Italy, Matra Marconi Space from France, Space Systems Finland, and the University of Durham from the United Kingdom. Mr. Kari Kumpulainen, Mrs. Heli Puustinen, and Mr. Matias Vierimaa helped to test the InName tool. For valuable comments the authors wish to thank the anonymous referees, Mr. Douglas Foxvog, Ms. Minna Mäkäräinen, Mr. Hannu Rytilä, and Prof. Veikko Seppänen.
REFERENCES
AMES, ESPRIT III Project no. 8156: Application Management Environments and Support, Technical Annex, AMES Consortium - Cap Gemini Innovation, Grenoble, France, 1993.
Bennett, K., Cornelius, B., Munro, M., and Robson, D., Software Maintenance, in: Software Engineer’s Reference Book, Chapter 20 (J. A. McDermid, ed.), Butterworth-Heinemann, Oxford, England, 1991.
Carter, B., On Choosing Identifiers, ACM SIGPLAN Notices 17(5), 54-59 (1982).
Carter, R., Vocabulary: Applied Linguistic Perspectives, Unwin Hyman, London, 1987.
Chikofsky, E. J., and Cross, J. H., Reverse Engineering and Design Recovery: A Taxonomy, IEEE Software 7(1), 13-17 (1990).
Coad, P., and Yourdon, E., Object Oriented Analysis, Prentice-Hall, Englewood Cliffs, New Jersey, 1990.
Fenton, N., How Effective are Software Engineering Methods?, Journal of Systems and Software 22, 141-146 (1993).
Fromkin, V., and Rodman, R., An Introduction to Language, Fourth Edition, Holt, Rinehart and Winston, New York, 1988.
Glass, R. L., The Software-Research Crisis, IEEE Software 11(6), 42-47 (1994).
Ibrahim, A. M., Acronyms Observed, IEEE Transactions on Professional Communication 32, 27-28 (1989).
Keller, D., A Guide to Natural Naming, ACM SIGPLAN Notices 25(5), 95-102 (1990).
Kukich, K., Techniques for Automatically Correcting Words in Text, ACM Computing Surveys 24, 377-439 (1992).
Laitinen, K., and Seppänen, V., Principles for naming program elements, a practical approach to raise informativity of programming, in: Part I of Proceedings of Info Japan ’90 International Conference, Information Processing Society of Japan, Tokyo, 1990, pp. 79-86.
Laitinen, K., and Mukari, T., DNN - Disciplined Natural Naming, A Method for Systematic Name Creation in Software Development, in: Proceedings of the 25th Hawaii International Conference on System Sciences, Vol. II: Software Technology, IEEE Computer Society Press, Los Alamitos, California, 1992, pp. 91-100.
Laitinen, K., Pacific: A Programming Language Based on the Idea of Natural Naming, in: Computer Science 2: Research and Applications (R. Baeza-Yates, ed.), Plenum Press, New York, 1994, pp. 529-540.
Laitinen, K., and Taramaa, J., A Theory to Support the Use of Natural Naming in Software Documentation, Working papers series B33, ISBN 951-42-3967-9, Department of Information Processing Science, University of Oulu, Finland, 1994.
Laitinen, K., Natural Naming in Software Development: Feedback from Practitioners, in: Proceedings of 7th Conference on Advanced Information Systems Engineering (CAiSE), Lecture Notes in Computer Science 932, Springer-Verlag, Berlin, 1995, pp. 475-488.
Logsdon, D., and Logsdon, T., The Curse of the Acronym, in: Proceedings of the International Professional Communications Conference, IEEE, 1986, pp. 145-152.
Oman, P. W., and Cook, C. R., A Programming Style Taxonomy, The Journal of Systems and Software 15, 287-301 (1991).
Page-Jones, M., The Practical Guide to Structured Systems Design, Second Edition, Prentice-Hall, Englewood Cliffs, New Jersey, 1988.
Rowe, N. C., and Guglielmo, E. J., Exploiting Captions in Retrieval of Multimedia Data, Information Processing and Management 29, 453-461 (1993).
Rowe, N. C., and Laitinen, K., Semiautomatic Disabbreviation of Technical Text, Information Processing and Management 31, 851-857 (1995).
Sheppard, S. B., Curtis, B., Milliman, P., and Love, T., Modern Coding Practices and Programmer Performance, Computer 12(12), 41-49 (1979).
Shneiderman, B., Software Psychology: Human Factors in Computer and Information Systems, Winthrop Publishers, Cambridge, Massachusetts, 1980.
Smeaton, A. F., Progress in the Application of Natural Language Processing to Information Retrieval Tasks, The Computer Journal 35, 268-278 (1992).
Taramaa, J., and Oivo, M., Evaluation of Software Maintenance of Embedded Computer Systems, in: International Symposium on Engineered Software Systems, World Scientific Publishing Company, Singapore, 1993, pp. 193-203.
Teasley, B. E., The Effects of Naming Style and Expertise on Program Comprehension, International Journal of Human-Computer Studies 40, 757-770 (1994).
Uthurusamy, R., Means, L. G., Godden, K. S., and Lytinen, S. L., Extracting Knowledge from Diagnostic Databases, IEEE Expert 8(6), 27-38 (1993).
Weissman, L. M., A Methodology for Studying the Psychological Complexity of Computer Programs, Ph.D. Thesis, Department of Computer Science, University of Toronto, Toronto, Canada, 1974.
Wittgenstein, L., Philosophical Investigations, Basil Blackwell, Oxford, 1953.
Yourdon, E., Modern Structured Analysis, Prentice-Hall, Englewood Cliffs, New Jersey, 1989.