Computer Programs in Biomedicine 11 (1980) 43-47
© Elsevier/North-Holland Biomedical Press
A DATA VALIDATION PROGRAM NUCLEUS
Dean S. MacLAUGHLIN
Boston Collaborative Drug Surveillance Program, Boston University School of Medicine, 400 Totten Pond Road, Waltham, MA 02154, USA
A data validation program designed for flexibility and user modification is presented. It is assumed that the data to be validated consist of packets, i.e., groups of records with a common value in a record-linking field. Acceptance or rejection of data is on a packet basis. Individual field validation is specified by user-supplied tables. Multiple field checking and field interaction checking, on both an intra-record and an inter-record basis, may be specified by user-supplied subprograms. The program is written in PL/I using structured programming techniques. Further modifications, such as record size and key positions, are possible at preprocessor time.
Data validation
Error checking
Structured programming
1. Introduction
Crucial to any study is the reliability of the data on which it is based. It is now common, particularly if the amount of data is large, for input data to be checked by a validation program. Special-purpose validation programs may be costly. Changes in specifications, which are likely in any data-collection system with an extended lifetime, may lead to considerable program maintenance with its associated difficulties. Entirely new data-collection systems may involve starting over from the beginning.
One solution to this problem is the table-driven validation program. Data defining the checks to be performed are read at execution time. Specification changes are accomplished by changing the tables rather than by the more difficult process of changing the program. New data-collection systems require only a new set of tables, rather than a new program.
Table-driven data validation systems, however, lack certain capabilities:
(a) New applications may require validation methods not currently provided.
(b) Validation may require lookup in extensive dictionaries peculiar to the application.
(c) Validation may require the checking of complex inter-field relationships.
Capabilities (b) and (c) are likely to be needed for the validation of medical data. Dictionaries of drugs and diagnoses may easily contain several thousand entries. Other multi-valued fields, in which the number of legal values may exceed 100, are common. Age- and sex-linked data lead to inter-field checks. Time-sequenced data may need exceedingly complex inter-field checking.
The Data Validation Program Nucleus was developed at the Boston Collaborative Drug Surveillance Program (BCDSP) to provide a flexible yet thorough data validation capability. At the simplest level, it may be used as a table-driven validation program. Should more complex checks be required, the program is designed so that they may be implemented easily. Utility subprograms are provided for the more common tasks. The nucleus is written in PL/I and structured programming [1,2] is used throughout.
2. Assumptions on the data to be validated
The nucleus is designed to process large amounts of data in batch mode. Typical runs have involved (1-2) × 10^3 packets averaging 20 unit records each. Packets are accepted or rejected on an all-or-nothing basis. Detailed error messages are issued for rejected packets.
A packet consists of a set of unit records linked by a common value in a record-linking field. Each record's content and format may be specified by a type field and a subtype field (useful for multiple versions of the same basic type). The location and length of the record-linking field, the type field, and the subtype field may be user-specified, but must be common to all records. The input data must be sorted by the record-linking field into packets. These locations and lengths, as well as record length and page width and length, are specified at compile time. Any user-defined checks are also specified at compile time.
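As an illustration of the packet assumption, the following Python sketch groups an already sorted stream of fixed-width records into packets by their record-linking field. It is not the PL/I code of the nucleus; the column positions and the names `link_key` and `packets` are hypothetical, chosen only for the example.

```python
from itertools import groupby

# Hypothetical layout: 1-based, inclusive columns of the record-linking field.
LINK_START, LINK_END = 1, 11


def link_key(record):
    """Return the record-linking field of one fixed-width record."""
    return record[LINK_START - 1:LINK_END]


def packets(records):
    """Yield (key, list_of_records) for each run of records sharing a key.

    groupby only merges consecutive items, so this relies on the input
    having been sorted by the record-linking field beforehand.
    """
    for key, group in groupby(records, key=link_key):
        yield key, list(group)


sample = [
    "00000000001 first record of packet 1",
    "00000000001 second record of packet 1",
    "00000000002 only record of packet 2",
]
for key, recs in packets(sample):
    print(key, len(recs))
```

Because grouping is done on consecutive records only, an unsorted input would split one logical packet into several, which is one reason the sort requirement matters.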
3. Program input
The program input, in addition to the data to be validated, consists of data describing packet composition and data describing the validations to be performed.
3.1. Data describing packet composition
There may be only one packet description in force for a given validation run, although some variation in packet composition is possible. The packet is described in terms of its component record types. Associated with each record type is a unique key. A record of a given type may be either required or optional in a packet. A packet may be allowed to have more than one record of a given type or only one. A valid packet, then, consists of a set of recognized records, such that all required record types are present and all multiple occurrences are permitted. In addition, each record must have passed the validation checks peculiar to its type.
3.2. Data describing the validation
For each record type there may be defined a number of subtype keys. Each record type must have at least one subtype key. Associated with each subtype key is a pointer to a validation table. Each validation table consists of a set of lines, each associated with a given field to be validated. Each line contains the following:
  A print name for the field (used in error messages)
  The starting column in the record for the field
  The ending column in the record for the field
  The type of data expected in the field
  The type of validation to be performed
  Three validation parameters
The data expected in a given field may involve an implied conversion to allow missing numeric values to be represented by blanks. Table 1 summarizes the data types currently provided. Table 2 summarizes the validation types. Also associated with each subtype key is a value specifying whether there are any user checks and which check is to be performed.
4. Program output
4.1. Printed output
The program produces two reports. The first is a packet error listing. Missing record and duplicate record messages are issued where appropriate. Errors in individual data fields cause messages giving the field name, the field position, and the erroneous value. Records in error are printed with erroneous positions marked. Figure 1 is an example of the error listing produced. The message-issuing subprogram, which automatically handles record and packet demarcations, may also be used to issue user-defined error messages. The second report provides a listing of all packets processed, giving the record-linking field, an error flag, and the number of records of each type.
4.2. Error output
Each record of a packet in error is output to a separate file. Conversions made by data type codes 5 and 6 are included. If desired, fields in error may be replaced by any character.
Table 1
Data types
Code  Type of data expected
1     Numeric - each column a digit 0-9
2     Alphabetic - each column A-Z or space
3     Mixed - anything accepted
4     Spaces - each column blank
5     Convert all spaces to all zeroes - then numeric
6     Convert all spaces to all nines - then numeric
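The data types of Table 1 can be sketched in Python as follows. This is an illustration rather than the nucleus's own PL/I code. It assumes that the codes run 1 to 6 in the order of the table rows (the text confirms only that codes 5 and 6 are the two conversions) and that a conversion applies only when the whole field is blank. The names `FieldError` and `apply_data_type` are hypothetical.

```python
import re


class FieldError(ValueError):
    """Raised when a field does not match its expected data type."""


def apply_data_type(code, field):
    """Check field against a data type code; return the field text,
    converted if the code calls for it.

    Assumption: codes 1-6 follow the row order of Table 1.
    """
    if code in (5, 6) and field != "" and field.strip(" ") == "":
        # Implied conversion: an all-blank field becomes all zeroes (5)
        # or all nines (6) before the numeric check.
        field = ("0" if code == 5 else "9") * len(field)
    if code in (1, 5, 6):
        ok = re.fullmatch(r"[0-9]+", field) is not None
    elif code == 2:
        ok = re.fullmatch(r"[A-Z ]+", field) is not None
    elif code == 3:
        ok = True  # mixed: anything accepted
    elif code == 4:
        ok = field != "" and field.strip(" ") == ""
    else:
        raise ValueError(f"unknown data type code {code}")
    if not ok:
        raise FieldError(f"field {field!r} fails data type {code}")
    return field
```

Returning the converted text mirrors the statement that conversions made by codes 5 and 6 are carried into the output records.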
Fig. 1. Example of the packet error listing. Messages legible in the listing include SMOKING DETAILS FOR NON-SMOKER, SEX (46-46) = 0, MISSING DIAGNOSIS RECORD, and MISSING MEDICATION HISTORY RECORD(S).
4.3. Valid packet output
The data in valid packets are written to two files. All record types which may occur more than once are output to a file whose record size is the same as that of the input records. The remainder of the data are output to a file with one fixed-length record per packet. Each component record is assigned a specific place in this output record.
5. Program description
The main program, VALMAIN, does all packet input-output. It reads and identifies all input records. It calls VALIDAT to validate each input record, with the appropriate validation table as a parameter. All packet composition checks (missing or duplicate records) are made by VALMAIN. Then INTER is called to perform any user-defined global checks. Finally, VALMAIN outputs each packet to either the error file or the valid output files.
The record validation subprogram, VALIDAT, scans the appropriate validation table and performs the checks indicated by each line. All errors are handled by a PL/I ON CONVERSION module. Thus, actual conversion errors are handled automatically. Non-conversion errors, such as range or data checks, use the same code by means of a SIGNAL CONVERSION statement. After all table-defined checks, VALIDAT optionally calls INTRA, a user-provided record checking subprogram.
The remaining modules are primarily support modules. REPORT prints a packet report line. MESS issues an error message and may print packet and record demarcations depending on whether the error was the first detected for a packet or record. DICT controls the input of special dictionaries and performs validations using them. It calls SEARCH to perform binary searches of dictionaries. The relationship between the various modules is illustrated in fig. 2.
Fig. 2. Relationship between the modules, divided into user-defined subprograms and the nucleus. Labels legible in the diagram: VALMAIN (input of validation tables, packet input-output, summary report), validation tables, record processing and field validation, dictionary input and lookup, packet summaries, MESS (reporting), inter-record checks, and intra-record checks.
Table 2
Validation types
Code  Validation
      p1 < v < p2
      p1 < v < p2 or v = p3
      No check performed
      Dictionary lookup (details may be specified by p1, p2 and p3)
      Date check - field must contain valid date
      Date check - field must contain valid date or be all 9s
      Date check - field must contain valid date but any portion may be 9
v is the field value; p1, p2, and p3 are the validation parameters. Dates are either 6-column mmddyy or 5-column mmddy.
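Several of the validation types in Table 2, together with the idea of routing every failure through one error path, can be sketched in Python. This is a loose analogue, not a translation of VALIDAT: a single exception class stands in for the ON CONVERSION handling. All function names are hypothetical, the range check uses the strict inequalities as printed in Table 2, and the date check covers only 6-column mmddyy dates with the century assumed to be 19yy. The "any portion may be 9" variant is not covered.

```python
from bisect import bisect_left
from datetime import date


class ValidationError(Exception):
    """One error type for every failed check (illustrative analogue of
    handling all failures through a single ON CONVERSION handler)."""


def check_range(field, p1, p2, p3=None):
    """Pass if p1 < v < p2, or if v equals the optional extra value p3."""
    v = int(field)  # non-numeric text raises ValueError: a true conversion error
    if not (p1 < v < p2 or (p3 is not None and v == p3)):
        raise ValidationError(f"value {v} not allowed")


def check_dictionary(field, sorted_entries):
    """Pass if field occurs in a sorted list, located by binary search."""
    i = bisect_left(sorted_entries, field)
    if i == len(sorted_entries) or sorted_entries[i] != field:
        raise ValidationError("not in dictionary")


def check_date(field, allow_all_nines=False):
    """Pass if field is a valid 6-column mmddyy date (19yy is an assumption)."""
    if allow_all_nines and field and set(field) == {"9"}:
        return
    if len(field) != 6 or not (field.isascii() and field.isdigit()):
        raise ValidationError("date is not six digits")
    try:
        date(1900 + int(field[4:6]), int(field[0:2]), int(field[2:4]))
    except ValueError as exc:
        raise ValidationError("no such calendar date") from exc


def validate_field(record, name, start, end, check, *params):
    """Apply one validation-table line; return an error message or None."""
    field = record[start - 1:end]  # columns are 1-based and inclusive
    try:
        check(field, *params)
    except (ValidationError, ValueError):
        # Conversion and non-conversion errors share one reporting path.
        return f"{name} ({start}-{end}) = {field}"
    return None
```

A call such as `validate_field(record, "SEX", 46, 46, check_range, 0, 3)` returns either `None` or a message carrying the field name, position, and erroneous value, which are the three items the printed error report is described as giving.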
6. "Custom tailoring" the nucleus
The nucleus may be specialized to suit user requirements in two ways.
6.1. Modification by preprocessor variables
The following may be specified by PL/I preprocessor statements:
  Number of record types
  Number of validation tables
  Number of lines in validation table
  Record-linking field length and location
  Record-type field length and location
  Record-subtype field length and location
  Number of subtypes allowed per type
  Maximum number of records per packet
  Field name length in error messages
  Input record length
  Output record length (record/packet file)
  Lines/page
  Print line length
6.2. Aids to user programming
An intra-record checking exit is provided to the user by a call to subprogram INTRA. Parameters passed in the call are the input record, indications of errors already found, the record type, the record-linking field, and a user-supplied argument from the validation table specifying the checks desired. If the user program detects an error in the data, it may alter an input parameter to indicate the location of the error(s) and issue appropriate messages by a call to MESS, which, in addition to printing the message, also sets error flags and maintains record and packet error message demarcation.
The user must write his own dictionary input and lookup subprogram, although a model is provided. Input parameters are the value to be tested and the three validation parameters, which may be used to specify the dictionary to be searched, valid default values, etc. A simple output parameter indicates the detection of an error.
A global checking exit is provided to the user by a call to subprogram INTER. Parameters passed are a vector containing all input records, a vector giving the error status of each, a vector giving the count of the number of records of each type, and the packet error flag. The user may issue error messages through the subroutine MESS, as in intra-record checking.
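The two user exits might look roughly like the following Python sketch. It only mirrors the parameter lists described above; the record layout (smoker flag in column 12, cigarettes per day in columns 13-14, a yymmdd date in columns 15-20), the two rules, and the simplified `mess` stand-in are all invented for illustration and are not taken from the BCDSP programs.

```python
messages = []


def mess(text):
    """Stand-in for MESS: here it only collects the message text."""
    messages.append(text)


def intra(record, error_marks, record_type, link, user_arg):
    """Hypothetical intra-record exit; returns True if an error was found.

    error_marks is a list with one flag per column, altered in place to
    show where the error lies. user_arg selects the check to run.
    """
    if user_arg == 1 and record[11:12] == "N" and record[12:14].strip():
        error_marks[12] = error_marks[13] = True  # columns 13-14
        mess(f"{link}: SMOKING DETAILS FOR NON-SMOKER")
        return True
    return False


def inter(records, in_error, counts, packet_error):
    """Hypothetical global exit; returns the updated packet error flag.

    Invented rule: dates in columns 15-20 (yymmdd) must not go backwards
    from one record to the next. Records already in error are skipped.
    counts is accepted to match the described interface but unused here.
    """
    dates = [r[14:20] for r, bad in zip(records, in_error) if not bad]
    if any(later < earlier for earlier, later in zip(dates, dates[1:])):
        mess("RECORD DATES OUT OF SEQUENCE")
        return True
    return packet_error
```

Passing the error vector into `inter` lets a global check ignore records whose fields have already failed, so that one bad field does not also trigger a misleading sequence message.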
7. Hardware, software and mode of availability
The program consists of approximately 1200 PL/I statements. Program size is dependent on compile-time specialization. A recent run in which each packet consisted of four 80-byte records required 134 k-bytes and 0.08 s/packet on an IBM 370/158. Machine-readable source code and/or listings are available at reproduction cost. Interested persons should contact the author.
8. Conclusion
The validation program nucleus has been found to be an effective and flexible tool. Validation may be as complex as the user desires. The intervention of a programmer is still required, yet a simple application will involve little programming effort. In the past year at the BCDSP, four substantially different data bases of varying complexity have been subjected to validation by variants of this nucleus. The programming has involved only the unique aspects of each data base.
Acknowledgements
Supported by grants from the National Institute of General Medical Sciences (no. GM23430) and the Food and Drug Administration (no. FD00920).
References
[1] B.W. Kernighan and P.J. Plauger, Software Tools (Addison-Wesley, Reading, MA, 1976).
[2] E. Yourdon, Techniques of Program Structure and Design (Prentice-Hall, Englewood Cliffs, NJ, 1975).