ELSEVIER
Computer Methods and Programs in Biomedicine 52 (1997) 129-138
Automated generation of a World Wide Web-based data entry and check program for medical applications
T. Kiuchi a,*, S. Kaihara b
a Department of Epidemiology and Biostatistics, Faculty of Medicine, University of Tokyo, Tokyo, Japan
b Hospital Computer Center, University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan
Received 15 February 1996; revised 1 October 1996; accepted 7 October 1996
Abstract
The World Wide Web-based form is a promising method for the construction of an on-line data collection system for clinical and epidemiological research. It is, however, laborious to prepare a common gateway interface (CGI) program for each project, which the World Wide Web server needs to handle the submitted data. In medicine, it is even more laborious because the CGI program must check the entered data for deficits, types, ranges, and logical errors (bad combinations of data) for quality assurance, and must also check data length and meta-characters to enhance the security of the server. We have extended the specification of the hypertext markup language (HTML) form to accommodate the information necessary for such data checking, and we have developed software named AUTOFORM for this purpose. The software automatically analyzes the extended HTML form and generates the corresponding ordinary HTML form, a "Makefile", and the C source of CGI programs. The resultant CGI program checks the data entered through the HTML form, records them in a computer, and returns them to the end-user. AUTOFORM drastically reduces the burden of developing a World Wide Web-based data entry system and allows the CGI programs to be prepared more securely and reliably than had they been written from scratch. Copyright © 1997 Elsevier Science Ireland Ltd.
Keywords: Hypertext transfer protocol; Hypertext markup language; Data management; World Wide Web
1. Introduction
The World Wide Web (WWW), which provides a single user interface to a variety of services and data formats on the Internet, was developed a few years ago and is now very popular [1]. Its native and powerful feature is a multimedia, distributed information service using hypertext data formats, based on the hypertext transfer protocol (HTTP) and the hypertext markup language (HTML), which was developed from the standard generalized markup language (SGML) [2,3]. HTTP/HTML was originally developed for the purpose of providing information to network users. However, thanks to the recent revision of the HTTP and HTML specifications, data submission from clients to the server has been made possible. HTTP/HTML now provides the following functions: (1) The server can send a data entry form to a client. (2) Using this form, the client can submit data to the server. (3) The server can process the submitted data. (4) The server can return the results of processing to the client.

We have developed an automated patient registration and random allocation system [4] and a total data management system for a clinical trial using HTTP/HTML [5]. The most popular HTTP server software, such as NCSA httpd, CERN httpd, and NetScape Commerce Server, has an interface for running external programs or gateways, called the common gateway interface (CGI) [6]. We have developed CGI programs in the C language for three projects in order to process data entered using an HTML form. However, it was laborious to develop CGI programs for each project. One major reason is that, for medical use, we have to write routines that check all items for deficits, data types, range errors, and logical errors in order to assure data quality, and that check the length of the entered data and detect illegal meta-characters in order to assure server security.

This paper presents a software tool, AUTOFORM, which analyzes the HTML form and automatically generates a CGI program that handles incoming data reliably and securely.

* Corresponding author. Present address: Hospital Computer Center, University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan. Tel.: +81 3 38122111, ext. 3523.
0169-2607/97/$17.00 Copyright © 1997 Elsevier Science Ireland Ltd. All rights reserved
PII S0169-2607(96)01793-2

2. Method
2.1. Hardware and operating system
The system was developed using a PC with a BSD/OS 1.1 (BSDI, USA) operating system.
2.2. Software
The main program of AUTOFORM was written in Perl (Perl 4.109 with Japanese patch). For the compilation of generated C programs, gcc version 2.5.8 and GNU make version 3.00 were used. The HTTP server software used was NCSA httpd version 1.4, and the clients used for testing were NCSA Mosaic version 2.4 on a UNIX-based workstation and NetScape 1.1 on MS-Windows NT 3.5 [7].
2.3. Extension of the specification of HTML form for AUTOFORM
AUTOFORM was designed to recognize all tags in the HTML form supported by NCSA Mosaic 2.4 [8]. In order to make the generated CGI programs check deficits, data type, data length, range errors and logical errors, and detect illegal meta-characters, information beyond that supported by Mosaic is necessary. Thus, we developed an "extended HTML form" for AUTOFORM, in which such information is described in HTML tags using additional parameters.

Table 1
Necessary, optional and extended parameters for each combination of a tag, type and data type
Tag         Type      Data type(a)  Necessary parameters  Option parameters          Extended parameters for AUTOFORM
INPUT       RADIO     -             NAME, VALUE           CHECKED                    ALIAS, REQUIRED
INPUT       CHECKBOX  -             NAME                  VALUE, CHECKED             ALIAS
INPUT       TEXT      INT           NAME                  VALUE, SIZE, MAXLENGTH     ALIAS, REQUIRED, MAXVALUE, MINVALUE
INPUT       TEXT      DECIMAL       NAME                  VALUE, SIZE, MAXLENGTH     ALIAS, REQUIRED, MAXVALUE, MINVALUE
INPUT       TEXT      CHAR          NAME                  VALUE, SIZE, MAXLENGTH     ALIAS, REQUIRED, SECURITY
INPUT       PASSWORD  -             NAME                  VALUE, SIZE, MAXLENGTH     ALIAS, REQUIRED
SELECT      -         -             NAME, OPTION          SIZE, SELECTED, MULTIPLE   ALIAS, REQUIRED
TEXTAREA    -         -             NAME, COLS, ROWS      -                          ALIAS, REQUIRED
LCHECK(b)   -         -             IF, THEN              -                          -
(a) Data type is specified using a DATATYPE parameter, which is itself an extended one.
(b) The LCHECK tag is an extended tag.

2.4. Security consideration
AUTOFORM is designed to provide protection against the following known CGI security problems [9].

(1) Memory overwrite. One weak point of many CGI programs is that, if such a program receives data that are longer than expected and contain executable code, the data will overwrite its memory. If the CGI program then switches its instruction pointer to that code, potentially hazardous extraneous code may be executed within the server. Note that data can be sent directly to a server without using the form-based interface whose output the CGI program handles. AUTOFORM avoids these problems, as it is designed to check the length of each string submitted to the server, using parameters in the ordinary and extended HTML form.

(2) Insecure data passed to a shell. There is another security hole in the way the shell interprets data. Some CGI programs receive data and pass it to a shell. This can produce potentially dangerous results. For example, when sending e-mail using an HTML-based form, the address is passed on to a C function such as "system(commands)", where "commands" is a char type array whose contents are given as "sprintf(commands, "mail %s", form_data)" and the mail address string is held in the "form_data" char type array. If the contents of "form_data" were "someone@local.edu; mail cracker@elsewhere.edu < /etc/passwd", then the server's user information would become available to the cracker. The best way to avoid this security hole is not to use functions which invoke a shell. However, such functions are convenient and will probably be used. Thus, AUTOFORM was designed to detect potentially hazardous meta-characters in a string, on the assumption that the string may be passed to a shell, if an additional parameter for security protection is specified in the extended HTML form.
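The meta-character detection described in (2) can be illustrated with a short C sketch. This is not the code that AUTOFORM generates: the function name has_shell_metachar and the character set in SHELL_METACHARS are assumptions chosen for illustration, since the source does not list the exact characters that are rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical set of characters that have a special meaning to common
 * UNIX shells. The exact set used by AUTOFORM is not given in the text. */
static const char SHELL_METACHARS[] = ";&|<>`'\"$\\*?!()[]{}~\n\r";

/* Returns 1 if s contains any character from SHELL_METACHARS, 0 otherwise. */
int has_shell_metachar(const char *s)
{
    assert(s != NULL);
    return strpbrk(s, SHELL_METACHARS) != NULL;
}
```

With the string from the example above, "someone@local.edu; mail cracker@elsewhere.edu < /etc/passwd", the function reports a hazard because of ";" and "<", whereas a plain address such as "someone@local.edu" passes. The sketch rejects suspicious input rather than trying to escape it, which keeps the check simple to audit.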
3. Result
The parameters used in the extended HTML form are summarized in Table 1 and examples are shown in Fig. 1. The meanings of the extended parameters are described in this section.

Fig. 1. Examples of extended HTML form for AUTOFORM. Note that all parts of the document below are written in extended HTML. Assuming that the file name of this extended HTML form is "form.html", the file names of the ordinary HTML form and the CGI source generated from it by AUTOFORM are "o_form.html" and "regist.c", respectively.

An ALIAS parameter indicates an alias of the variable name that is specified in the NAME parameter and used as a variable name in the resultant CGI program in C language. The alias is used to present the item name of the entered data to the user when the submitted data are successfully accepted, or when they are rejected because of some deficit or other error. This parameter is especially important in non-English speaking countries, because it is convenient for end-users to be able to read item names in their native language, while English must be used for variable names in C language. An ALIAS parameter is used for all combinations of a tag, type and data type.

A REQUIRED parameter indicates whether or not the specified item permits a deficit. If REQUIRED = "yes" is specified, the resultant CGI program checks whether or not a value has been entered for the corresponding item. If REQUIRED = "no" is specified, the resultant CGI program does not check it. If a REQUIRED parameter is not specified, REQUIRED = "no" is assumed by default. For the RADIO type of the INPUT tag, more than one tag corresponds to a given variable; a REQUIRED parameter should be specified in at least one of these tags. For the CHECKBOX type of the INPUT tag, a REQUIRED parameter is meaningless and is ignored by AUTOFORM even if it is specified.

There are four additional parameters: DATATYPE, SECURE, MAXVALUE and MINVALUE. These parameters are used only for the TEXT type of the INPUT tag. For a DATATYPE parameter, three data types, "INT", "DECIMAL" and "CHAR", can be specified; these correspond to an integer, a decimal number and printable ASCII characters (a string), respectively. The default value is "CHAR". The MAXVALUE and MINVALUE parameters specify the maximum and minimum limits of an entered numerical value, respectively. When "CHAR" is specified in DATATYPE, the MAXVALUE and MINVALUE parameters are ignored. On the other hand, the SECURE parameter is evaluated only when the "CHAR" type is specified in DATATYPE. If SECURE = "yes" is specified, the resultant program checks the entered string to see whether or not it contains potentially hazardous meta-characters, such as ";" or "'", assuming that the string might be passed to a shell.

A new tag, LCHECK, was created to check for logical errors in the entered data (bad or unexpected combinations of data). In the LCHECK tag, two parameters are defined, namely IF and THEN. Unlike in other tags, the order of the IF and THEN parameters is pre-determined (IF precedes THEN) and the values for these parameters are not specified using "=".
An example of the LCHECK tag is presented in Fig. 2.
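The three DATATYPE values can be illustrated by a small C sketch of a type check applied to a submitted string. The enum and function names are hypothetical, and the accepted syntax (an optional sign, at most one decimal point, and printable ASCII taken as the range 0x20 to 0x7E) is an assumption, since the text does not define the exact grammar that the generated programs accept.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical codes mirroring DATATYPE = "INT", "DECIMAL" and "CHAR". */
enum datatype { DT_INT, DT_DECIMAL, DT_CHAR };

/* Returns 1 if s is an optionally signed, non-empty run of digits. */
int is_int_string(const char *s)
{
    assert(s != NULL);
    if (*s == '+' || *s == '-')
        s++;
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++)
        if (!isdigit((unsigned char)*s))
            return 0;
    return 1;
}

/* Returns 1 if s has at least one digit and at most one decimal point. */
int is_decimal_string(const char *s)
{
    int digits = 0, points = 0;

    assert(s != NULL);
    if (*s == '+' || *s == '-')
        s++;
    for (; *s != '\0'; s++) {
        if (isdigit((unsigned char)*s))
            digits++;
        else if (*s == '.' && points == 0)
            points++;
        else
            return 0;
    }
    return digits > 0;
}

/* Returns 1 if every character of s is printable ASCII (0x20 to 0x7E). */
int is_printable_string(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if ((unsigned char)*s < 0x20 || (unsigned char)*s > 0x7e)
            return 0;
    return 1;
}

/* Dispatches to the check that matches the declared data type. */
int type_ok(enum datatype t, const char *s)
{
    switch (t) {
    case DT_INT:     return is_int_string(s);
    case DT_DECIMAL: return is_decimal_string(s);
    case DT_CHAR:    return is_printable_string(s);
    }
    return 0;
}
```

Because every form value arrives as a string, a character-level check of this kind can run before any numeric conversion, so that a value such as "4.2" is rejected for an INT item instead of being silently truncated.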
3.2. AUTOFORM
3.2.1. Overview
AUTOFORM is a Perl script named "autoform.pl". It analyzes an extended HTML form (Fig. 1) and produces: (1) the corresponding ordinary HTML form file, which can be used on a WWW server (Figs. 2 and 3); (2) one or more CGI programs in C language, which handle the data sent from the user agent; (3) a "Makefile" for compilation; (4) "af_util.c", which contains C functions; and (5) a log file describing the analysis process of "autoform.pl". The resultant ordinary HTML form is named after the original extended HTML form ("o_" is added to the beginning of the original file name). As for the file name of the CGI source, the file name specified in the URL in the ACTION parameter of the FORM tag is used. If more than one FORM tag is specified in a given extended HTML form, the corresponding number of CGI programs is generated.
3.2.2. Usage
As a first step, users have to prepare a form in extended HTML form. Next, they have to set appropriate values for some variables in "autoform.pl", which is then executed with the extended HTML document file name as an argument. This will result in the generation of the above-mentioned four files. Typing "make" compiles the CGI program source and generates the CGI program executable. Typing "make install" will install both the HTML form and the CGI program.

Fig. 2. The HTML form produced from the extended HTML form in Fig. 1 by AUTOFORM. The file name is "o_form.html". Note that all of the document below is written in HTML.

Fig. 3. Representation of the HTML form in Fig. 2 in a user agent. Sex and age are required items and the others are not. The value of age should be an integer between 18 and 60; otherwise this form will not be accepted and the users are warned of illegal data entry. The value of hemoglobin should be between 11.0 and 16.0; otherwise this form will be treated likewise.

Table 2
Specifications of the HTML form-compliant CGI generators
Program      Generator platform            Generator source codes  Data check                               Data check for security  Source codes of resultant CGI programs
AUTOFORM     Any platform where Perl runs  Available               Deficit, type, range, and logical check  Available                Available
Polyform     MS-Windows                    Not available           Deficit                                  Not available            Not available
Un-CGI       UNIX                          Available               No                                       Not available            Available
HTML Wizard  MS-Windows                    Not available           No                                       Not available            Available

3.2.3. Data checking
In reality, the values of HTML form variables sent to a CGI program via httpd using CGI are "strings". A generated CGI program checks deficit, data length, data type, data range and/or the existence of meta-characters according to the parameters specified in the original extended HTML form. When at least one problem is found during the checking process of the CGI program, the user is notified of the erroneous item's name (the alias of the variable's name) if an ALIAS parameter is specified, or of the variable's name if one is not specified. In addition, the kinds of errors are presented and the user is requested to re-enter the data. The data checking proceeds in the following order:

(1) Deficit check. This type of check is performed when a REQUIRED parameter is specified, except for the RADIO type of the INPUT tag.

(2) Data length check. The lengths of all the entered data are checked to see whether they exceed their expected lengths. The expected data length is determined as follows:
(a) INPUT tag, RADIO type: the length of the string specified in the VALUE parameter. There is no default value.
(b) INPUT tag, CHECKBOX type: the length of the string specified in the VALUE parameter. If not specified, the default maximum length is two, which is the length of the default value, "on".
(c) INPUT tag, TEXT type and PASSWORD type: the value specified in MAXLENGTH is used. If not specified, the default value is 20.
(d) SELECT tag: the largest value among the lengths of the strings specified in the OPTION tags.
(e) TEXTAREA tag: the maximum length of the entered data is calculated using the COLS and ROWS parameters. If the COLS and/or ROWS parameters are not specified in the extended HTML form, AUTOFORM assumes that the values of COLS and ROWS are 20 and 4, respectively, and specifies those parameters in the resultant HTML form accordingly.

(3) Check for the possible data. For the RADIO and CHECKBOX types of the INPUT tag and for the SELECT tag, the possible data which can be sent from clients are pre-determined. In these cases, AUTOFORM checks whether or not the entered string is one of the possible data.

(4) Range check for numerical values. For the INT and DECIMAL data types of the TEXT type of the INPUT tag, AUTOFORM checks whether or not the entered numerical value is smaller than the maximum value specified using a MAXVALUE parameter and larger than the minimum value specified using a MINVALUE parameter.

(5) Logical check of possible combinations of data. For all kinds of variables, logical checks can be performed according to the formulae specified in LCHECK tags.

(6) Security check of a string. For the CHAR data type of the TEXT type of the INPUT tag, AUTOFORM checks whether or not the entered string contains potentially hazardous meta-characters which may have a special meaning for shells, if SECURITY = "secure" is specified. By default, AUTOFORM does not check it.
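The ordering of the checks listed above can be sketched in C for a single integer item, such as the age field of Fig. 3 (required, between 18 and 60, with the default MAXLENGTH of 20). This is a simplified illustration and not AUTOFORM's generated source: the names check_int_field and lcheck_ok and the result codes are hypothetical, the bounds are treated as inclusive by assumption, and the possible-data and security checks are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result codes, one per kind of error reported to the user. */
enum check_result { CHECK_OK = 0, ERR_DEFICIT, ERR_LENGTH, ERR_TYPE, ERR_RANGE };

/* Checks one submitted string as an integer item, in the order
 * deficit -> length -> type -> range. Bounds are assumed inclusive. */
enum check_result check_int_field(const char *value, int required,
                                  size_t max_length,
                                  long min_value, long max_value)
{
    char *end;
    long n;

    assert(value != NULL);

    /* (1) Deficit check: an empty value is an error only if required. */
    if (value[0] == '\0')
        return required ? ERR_DEFICIT : CHECK_OK;

    /* (2) Data length check, performed before any parsing of the value. */
    if (strlen(value) > max_length)
        return ERR_LENGTH;

    /* Type check: the whole string must parse as a base-10 integer.
     * Note that strtol skips leading white space; overflow is reported
     * here as a type error. */
    errno = 0;
    n = strtol(value, &end, 10);
    if (end == value || *end != '\0' || errno == ERANGE)
        return ERR_TYPE;

    /* (4) Range check against MINVALUE and MAXVALUE. */
    if (n < min_value || n > max_value)
        return ERR_RANGE;

    return CHECK_OK;
}

/* (5) A logical check modeled as an implication: the combination is bad
 * only when the IF condition holds and the THEN condition does not. */
int lcheck_ok(int if_holds, int then_holds)
{
    return !if_holds || then_holds;
}
```

For the age item, check_int_field("25", 1, 20, 18, 60) yields CHECK_OK, while "17" and "61" yield ERR_RANGE and an empty string yields ERR_DEFICIT. Running the length check before parsing reflects the security concern of Section 2.4, where over-long input should be rejected before it is processed further.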
4. Discussion
The CGI programs generated by AUTOFORM may not be usable as they are, and further customization may be necessary for some purposes. However, by using AUTOFORM, the time and labor usually necessary for preparing CGI programs can be dramatically reduced. In addition, the resultant CGI programs are more reliable than those written from scratch. Programmers can save the time usually spent on programming and testing the data entry and checking parts of the CGI programs, and can concentrate on other things such as data calculation and processing. Although the functions of the generated CGI program are limited to checking the entered data, recording them, and returning the entered values to users in a simple format, this may be sufficient for the collection of data. Only a minimum knowledge of the C language is necessary to alter the contents of the HTML document returned to users.

There are a few programs which can generate CGI programs that handle the data entered using an HTML form, such as Polyform [10], Un-CGI [11] and HTML Wizard [12]. Their characteristics are listed in Table 2. These programs, however, do not meet our three requirements concerning data collection in the medical field: (1) data checks, including deficits, types and ranges of data and logical errors, for data quality; (2) a data length check and detection of meta-characters for server security; and (3) the availability of the source codes of both the CGI generator and the resultant CGI programs for further customization.
4.2. Security considerations
By using AUTOFORM, secure CGI programs can be generated even if one does not know how to write secure CGI programs. We believe the generated CGI programs are usually more secure than those presently available on the Internet. It would be much easier to develop a program which generates CGI programs in Perl. However, a Perl script needs a Perl interpreter ready to run on the WWW server. It would be desirable, from the viewpoint of security, to avoid putting a powerful interpreter on the server. This is the reason why we adopted the C language for the CGI programs.

As the Internet is universally accessible, the possibility of data tampering limits its use for research by the medical community. In the HTTP-based communication currently used, all data are transferred through the Internet in unencrypted (plain) form under the current HTTP/1.0 specification. Recently, some mechanisms for secure HTTP communication using cipher technology were proposed and, among them, the Secure Sockets Layer (SSL) developed by NetScape Communications is available at present [13-16]. We think that the security problem concerning data tampering has now been reduced and will be solved in the near future. This will result in the WWW-based form having an important role in network-based data collection for medical research. Accordingly, we think that the importance of our program will increase.

It should be noted that the WWW-based form can also be used via privately leased circuits and public telephone lines using the TCP/IP protocol, and these are more secure than the Internet. However, they are more costly and less convenient. The TCP/IP connection method should be chosen after considering the level of security required and the costs.
References
[1] T. Boutell (Maintainer in August, 1996), World Wide Web Frequently Asked Questions, available on the World Wide Web as "http://www.boutell.com/boutell/faq/" (1996).
[2] T. Berners-Lee, R.T. Fielding and H.F. Nielsen, Hypertext Transfer Protocol -- HTTP/1.0, RFC 1945, available on the World Wide Web as "http://ds.internic.net/rfc/rfc1945.txt" (1996).
[3] D. Raggett, HyperText Markup Language Specification Version 3.0, available on the World Wide Web as "http://www.w3.org/pub/WWW/MarkUp/html3/CoverPage.html" (1995).
[4] T. Kiuchi, Y. Ohashi, M. Konishi, T. Kosuge, Y. Bandai and T. Kakizoe, A World Wide Web-based User Interface for a Data Management System for Use in Multi-institutional Clinical Trials - Development and Experimental Operation of an Automated Patient Registration and Random Allocation System, Controlled Clinical Trials (1996) in press.
[5] A Randomized trial of the effect of Bonnarine on Type C hepatitis (a project now under way).
[6] National Center for Supercomputing Applications, The Common Gateway Interface, available on the World Wide Web as "http://hoohoo.ncsa.uiuc.edu/cgi/" (1996).
[7] National Center for Supercomputing Applications, httpd version 1.4, public domain software with documentation, available from an anonymous FTP server, "ftp.ncsa.uiuc.edu" (1995).
[8] National Center for Supercomputing Applications, NCSA Mosaic for X version 2.0 Fill-Out Form Support, an HTML document included in NCSA httpd version 1.4 (1995).
[9] P. Phillips, CGI security, on-line HTML document, available on the World Wide Web as "http://www.primus.com/staff/paulp/cgi-security/" (1995).
[10] Willow Glen Graphics, Polyform version 2.3, software and information available on the World Wide Web as "http://wgg.com/" (1995).
[11] S. Grimm, Un-CGI version 1.7, software and information available on the World Wide Web as "http://www.hyperion.com/~koreth/uncgi.html" (1996).
[12] HTML Wizard, Web Developers Warehouse, software and information available on the World Wide Web as "http://htechno.com/wdw/index.html" (1995).
[13] E. Rescorla and A. Schiffman, The Secure Hypertext Transfer Protocol, Internet Draft, expired, available on the World Wide Web as "http://ftp.umin.u-tokyo.ac.jp/internet/draft-rescorla-shttp-00.txt" (1994).
[14] P.M. Hallam-Baker and A. Shen, Security Scheme for the World Wide Web, available on the World Wide Web as "http://www.w3.org/hypertext/WWW/Shen/ref/security_spec.html" (1995).
[15] K.E.B. Hickman and T. Elgamal, The SSL Protocol, Internet Draft, work in progress, available on the World Wide Web as "http://ds.internic.net/internet-drafts/draft-hickman-netscape-ssl-01.txt" (1995).
[16] S. Spero, Progress on HTTP-NG, available on the World Wide Web as "http://www.w3.org/hypertext/WWW/Protocols/HTTP-NG/http-ng-status.html".