Quality control methods for data entry in pathology using a computerized data management system based on an extended data dictionary

Pathology Education

J. C. FLEEGE, MD, PHD, P. J. VAN DIEST, MD, PHD, AND J. P. A. BAAK, MD, PHD, FRCPATH

In pathology, computerized data management systems have been used increasingly to facilitate a more efficient supply of information. Since data entry precedes data utilization, the reliability of the information stored strongly depends on the quality of data input. Despite its potential capability, most personal computer-based database software does not provide versatile and user-friendly data validation procedures. Therefore, we developed a data dictionary driven data management system that enables the user to perform extensive validation routines without the need for hard programming. Using examples from an existing database for endometrial carcinomas, different types of data errors and their error traps are explained. It is pointed out that data type definitions, defaults, templates, or picture clauses are suitable means to avoid formal errors. Validations on data domains and ranges test whether data fall into a predefined scope. Relational checks control data validity within a context of different data items, whereas process routines provide automatic data computation, thereby circumventing user input. By exploiting the facilities of an extended data dictionary, a powerful tool is made available to secure various aspects of data integrity simultaneously with input. In this way, computerized data quality control can improve the efficiency and reliability of data management tasks in pathology. HUM PATHOL 23:91-97. Copyright © 1992 by W.B. Saunders Company

In the past decade, computerized data management systems have increasingly come to play a role in data registration and supply tasks. The major reason for their widespread use is their capability for strongly improving the efficiency with which information can be gained. At pathology institutes, dBase-like software packages have become especially popular because they can run on ordinary microcomputers under favorable cost-benefit conditions. In The Netherlands, a national pathology database network is operative. Generally, computerized data collection in pathology requires manual data entry into the system prior to data utilization. The reliability of information therefore depends on the care with which data entry is performed.

The entry procedure often completes a labor-intensive period of data acquisition. It is hence necessary to undertake every possible step to avoid data error and loss by mistake. However, monitoring data entry interactively is very tedious. Automation of data entry control is much more attractive since it can save manpower and preserve data. Unfortunately, appropriate data management software that is able to subject data to minute and extensive error trapping routines simultaneously with input is rare. Most personal computer-based database software is largely based on the premise that data that enter the system are valid. A result of this assumption is that data manipulation and retrieval modules constitute the largest portion of such programs, whereas control routines on data entry are rudimentarily developed and most often confined to plain data type checks. For databases serving research or diagnostics in pathology, this restraint is unacceptable because checks on data codes, ranges, and relationships are of particular interest.

Therefore, we have developed a prototype data management system that renders custom-designed validation routines on incoming data. The program is based on an extended data dictionary that enables the user to define a variety of data control functions. Since the requirement for hard programming has thereby disappeared, a convincing argument to renounce detailed data quality assurance has become inapplicable. From this point of view it is now worthwhile to reconsider issues of quality control on data entry.

This article aims to provide an overview of error conditions that can erode data integrity when data are entered into a database system. Assuming the availability of a data management system with powerful data dictionary facilities, different approaches to accomplish immediate control on data input will be explained. After introductory remarks concerning data organization in databases, scales, types, and data dictionaries, error conditions and their corresponding automatic error traps will be discussed. All topics will be illustrated by examples of endometrial carcinoma cases from a database currently in use.

HUMAN PATHOLOGY Volume 23, No. 2 (February 1992)

DATA ORGANIZATION

Databases

Data are structured attributes of an object of interest (eg, cases or patients) obtained by carefully directed observations. The degree of structure determines the datum's ability to be operative and, consequently, the efficiency with which information can be gained. A database refers to an ordered collection of data stored in a computer. Assuming a standard, tabular database, three levels of data organization can be discerned: files, records, and fields (Fig 1). A data field represents the smallest unit of a database and accommodates a single piece of data. A record is the entire series of data fields associated with one case or patient, whereas a file is a list of all records present. Within a file there is a key that uniquely identifies one particular record. A database with such an architecture is called a "relational database."

FIGURE 1. Sketch of the architecture of a relational database. Note the concordance with a table; a file is a complete table, a record is a row, a field is a column, and the key, for instance, is an order number for each row.

Data Scales and Types

The characteristics of data depend on the purpose of observation (eg, typing, grading, measurement, etc) and, at the technical level, on the format required for suitable data recording (eg, character strings or numbers). These two factors determine the data scale and the data type.

A data scale is a graduated system that defines the scope and relationship of all possible data of a special item, for instance, a system for the registration of all types of endometrial carcinoma or all possible shortest nuclear axis lengths (µm). Basically, four data scales can be distinguished: nominal, ordinal, interval, and ratio scales (Table 1). Nominal data can designate different types (eg, adenocarcinoma or clear cell carcinoma). Although verbal terms are used most often, nominal data can be enciphered, for instance, according to the Systematized Nomenclature of Pathology (SNOP) convention (eg, M8143 for adenocarcinoma). The nature of nominal data is qualitative; there is no way to identify measurable relationships among nominal data. Ordinal data can be useful to represent different categories of one type in order to indicate a defined, often semiquantitative relationship between them (eg, grades of differentiation: well differentiated [WDC], moderately differentiated [MDC], and poorly differentiated [PDC]; or, for convenience: 1, 2, and 3). Interval and ratio scaled data express quantities of a property by a continuous unit of measurement in digital values. Interval scaled data, however, have no absolute zero point. The chronology of survival time, for instance, starts at a rather arbitrary point, the date of diagnosis. Therefore, a comparison of the survival time of two patients without assumptions regarding the equivalence of the moments of diagnosis does not yield conclusive information. Ratio scaled data (eg, length of the shortest nuclear axis [µm]) do have an absolute zero point, which is a great advantage (eg, for calibration purposes).

The data type definition is strongly system dependent. It prescribes the format by which data have to be registered in a data field. Frequently, data types are of an alphabetic (eg, WDC) or numeric (eg, 1) type, or a combination of both (for instance, alphanumeric: M8143). (Since the computer's internal representation of the alphabetic and alphanumeric data type is identical, we will use the term alphanumeric only.) The length of data can vary as well. Within a data field one to several hundred spaces can be reserved for alphanumeric data and one to several tens for numeric data. Other data types, such as time, date, logical (ie, true or false), and memo (ie, free text) also can be made available, but their application areas are limited. The data field is commonly named after the data item (eg, TUMOR_TYPE).

TABLE 1. The Relationship Between Data Scales, Items, Contents, Types, and Nature Exemplified by Data From an Endometrial Carcinoma Database

Scale      Data Item                  Content                            Type           Nature
Nominal    Type of carcinoma          Adenocarcinoma (M8143)*            Alphanumeric   Qualitative
Ordinal    Grade of differentiation   WDC, MDC, PDC, or (1), (2), (3)†   Alphanumeric   Semiquantitative
Interval   Survival time              93 (mo)                            Numeric        Quantitative
Ratio      Shortest nuclear axis      7.63 (µm)                          Numeric        Quantitative

Abbreviations: WDC, well-differentiated carcinoma; MDC, moderately differentiated carcinoma; PDC, poorly differentiated carcinoma.
* Systematized Nomenclature of Pathology code for adenocarcinoma.
† Convenient codes for grade of differentiation.

Data Dictionaries

Data that describe and determine properties of a database, especially of a data item, are called metadata. They are recorded in a special database file reserved for information about the data: the data dictionary. Every property of an item of a database (for example, the data scale, type, length, name, and description) as well as its validation rules, in a broader sense, can be defined here (Fig 2). From these metadata, a capable data management system is able to create an application database. With most personal computer-based database software, the capacity of data dictionaries comprises data type and description declarations solely, which is insufficient for many purposes, particularly with respect to data validation facilities.

FIGURE 2. The concept of a data dictionary. In a separate database, the data dictionary, information about data items of the application database is recorded. The example illustrates that the properties of data item C of the application database are fully specified by one data record of the data dictionary.
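The idea of a data dictionary can be made concrete with a minimal Python sketch. It is an illustration only and not the system described in this article: the class DataItem, the functions fits_definition and check_record, and all field names except TUMOR_TYPE are hypothetical. Each dictionary record carries the scale, type, length, and description of one data item, and a generic routine reads these metadata to perform the plain type and length check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    """One data dictionary record describing one field of the application database."""
    name: str         # field name, e.g. TUMOR_TYPE
    scale: str        # "nominal", "ordinal", "interval" or "ratio"
    dtype: str        # "alphanumeric" or "numeric"
    length: int       # number of character positions reserved for the field
    description: str


# Hypothetical dictionary content modelled on Table 1.
DATA_DICTIONARY = [
    DataItem("TUMOR_TYPE", "nominal", "alphanumeric", 5, "SNOP code of carcinoma type"),
    DataItem("GRADE", "ordinal", "alphanumeric", 3, "grade of differentiation"),
    DataItem("SURVIVAL", "interval", "numeric", 3, "survival time (mo)"),
    DataItem("NUC_AXIS", "ratio", "numeric", 5, "shortest nuclear axis (um)"),
]


def fits_definition(item, value):
    """Formal check only: data type and length as declared in the dictionary."""
    text = str(value)
    if len(text) > item.length:
        return False
    if item.dtype == "numeric":
        try:
            float(text)
        except ValueError:
            return False
    return True


def check_record(record):
    """Return the names of all fields whose content violates its dictionary entry."""
    return [item.name for item in DATA_DICTIONARY
            if not fits_definition(item, record[item.name])]
```

Because the checking routine is driven entirely by DATA_DICTIONARY, adding or changing a data item in this sketch means editing a dictionary record rather than the program, which is the property the text attributes to dictionary-driven systems.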

DATA ERROR CONDITIONS

Errors in data cause inconsistency in meaning. Erroneous data are either in conflict with a format definition or they do not fit in a context with other facts. In the hypothetical case that only one single data item is to be registered per record or that data items are totally unrelated to one another, the validity of an individual piece of data depends on its error status alone, which can simply be twofold: true or false. That is, a piece of data does or does not fall into a predefined scope of a data item. Within these restrictions and with respect to, for instance, grade of differentiation, "WDC" is true in itself and hence valid data.

In this case, formal errors may occur. With alphanumeric data, errors can arise from several areas, such as incorrect or different spelling (eg, WDV or well-DC instead of WDC), omission of characters or numbers (eg, 8143 instead of M8143), incorrect format (eg, 8143M instead of M8143), ambiguous data type definition (eg, 1 instead of WDC), and incorrect alignment within a data field (eg, 1_ _ instead of _ _1). These data errors can be detected relatively easily. For numeric data, errors can develop by omitting or adding digits or changing their correct order (eg, 6% or 676% or 76% instead of 67% volume percentage epithelium). These numeric data are more troublesome because a typing error does not lead to an obviously wrong notation but merely to a different number. It becomes even more complicated when mistakes produce eligible data that nonetheless may be false (eg, PDC instead of WDC).

The situation becomes dramatically more complicated when the above restriction is lifted and multiple data items are registered per record. Concerning two data items, "A" and "B," which contain data that are related to each other, the following error conditions can emerge. First, the error status of the individual data is subject to the same rules as single data items. Data A as well as data B can be true or false. Second, however, the overall validity is determined by the contextual meaning of A and B. Consequently, a combination of an individually valid data A with a valid data B can yet be false. For instance, a combination of A = 99999 (the data field tag for unknown tumor type) with B = WDC does not make sense. The eligibility of such a data combination is a question of definition or plausibility. Figure 3 provides examples of valid combinations (indicated by solid lines) for some types of endometrial carcinoma (A) with grade of differentiation (B). Moreover, the scheme points out that all assignments interconnected by dashed lines cause contradictions that could not be detected if the data were tested separately. Data A can be linked correctly to B in 1 to n ways (A → B) and there are again 1 to m possibilities to connect B with A in reverse direction (B → A). The more data items incorporated, the more the network of possible connections expands and the more its complexity increases.

FIGURE 3. Relationships among different types of endometrial adenocarcinoma, A, and grades of differentiation, B. Data are represented by SNOP codes as well as by terms. Solid lines denote valid combinations and dashed lines denote invalid combinations. (Legible labels for A: 88888, not applicable; 99999, unknown; M8143, adeno_carc.; M8573, adeno_acanthoma; M8563, adeno_squam_carc. Legible label for B: MDC, moderately diff.)

Relationships among data from nominal or ordinal scales incline to be deterministic. For example, in the case of A being adenoacanthoma, the grade of differentiation, B, is unequivocally WDC; no or only few other options are valid. On the other hand, relationships among interval or ratio scaled data, also in combination with nominal or ordinal scaled data, tend to be probabilistic. For example, in the case of A being adenoacanthoma and B being WDC, C, the shortest nuclear axis length, is expected to lie between 2 and 12 µm with a certain level of confidence. Data within the range are probable and data below the lower limit are impossible, whereas data above the upper limit are unlikely.

DATA VALIDATIONS

It should be taken as a rule that no data field in a database is to be occupied by a piece of data that was not precisely defined beforehand. This problem is promptly encountered when a new (blank) record is appended to a database. Without being asked, the system inserts empty spaces into alphanumeric data fields and zeros into numeric data fields. Although meaningless, these dummy data can interfere with data from scales actually defined by the user since, for example, a zero may fall into the eligible scope of a numeric data field. In addition, the data field status is uncertain; a field may still be empty deliberately or may have been left empty accidentally or intentionally (such as in the case of missing data). A straightforward solution is to declare unambiguous default data that are automatically entered when a blank record is added; for instance, asterisks (*...*) for alphanumeric and nines (9...9) for numeric data. Such default data clearly signal that the content of this data field is currently unknown, leaving no space for a different interpretation. On no account, however, may defaults represent possible data. Therefore, the field length of numeric data should be sized one space longer than is strictly necessary, and for both data types the fields should always be completely filled in (eg, 999.99 as default for the shortest nuclear axis length using a data field length of six spaces with two decimal positions).
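The default rule can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names default_for, blank_record, and unentered_fields and the field names GRADE and NUC_AXIS are hypothetical. Only the convention itself (asterisks for alphanumeric fields, nines for numeric fields, fields completely filled, numeric fields one position longer than real data require) is taken from the text.

```python
def default_for(dtype, length, decimals=0):
    """Build a default that fills the field completely and cannot be real data."""
    if dtype == "alphanumeric":
        return "*" * length
    if decimals:
        # One of the positions is taken by the decimal point.
        return "9" * (length - decimals - 1) + "." + "9" * decimals
    return "9" * length


# Hypothetical field declarations: (data type, field length, decimal positions).
FIELDS = {
    "GRADE": ("alphanumeric", 3, 0),
    # Real axis lengths need at most five positions (eg, 12.50); the field is
    # declared one position longer so that the default can never be a measurement.
    "NUC_AXIS": ("numeric", 6, 2),
}


def blank_record():
    """A new record is filled with unambiguous defaults instead of spaces and zeros."""
    return {name: default_for(*spec) for name, spec in FIELDS.items()}


def unentered_fields(record):
    """List the fields that still hold their default, ie, have not been entered yet."""
    blank = blank_record()
    return [name for name in FIELDS if record[name] == blank[name]]
```

With these declarations, blank_record() yields "***" for GRADE and "999.99" for NUC_AXIS, the latter being the default quoted in the text for a six-position field with two decimals.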
Simple checks on data type, including length, are normally done by the system for technical reasons, but tests on advanced format specifications may not be available. Template and picture clauses force incoming data into a user-specified shape. When templates are used, only certain spaces within an alphanumeric data field are accessible; the others are locked and reserved for predefined system input. For instance, a template for M8143 can look like M_ _ _ _. In no way can M be overwritten by a different character (Fig 4). Picture clauses control the character appearance of alphanumeric as well as numeric data. By means of a picture, input of, for example, "wdc" could be capitalized automatically to WDC (Fig 4). With more sophisticated measures, alignment within a data field can be achieved or, in combination with a template, M_ _ _ _ can be defined such that only numbers are allowed in the second to fifth data field positions.

FIGURE 4. The operation modes of a template and a picture clause. With the template M_ _ _ _, only 8143 of the input can pass through this filter, while N is overwritten by M automatically. The picture clause capitalizes wdc to WDC. (Diagram: input [N8143] passes the template [M____] to give the result [M8143]; input [wdc] passes the picture [Caps(wdc)] to give the result [WDC].)

In many instances, medical data are ordinally scaled and of an alphanumeric type. The method of observation can directly yield data that represent categories (eg, grades of differentiation) or, secondarily to registrations, data may need to be encoded or classified for some reason. In both cases, per data item, a defined set of data is eligible, delimiting a data domain (eg, 888, 999, WDC, MDC, PDC, where 888 is the data field tag for "not applicable" and 999 is the data field tag for "unknown"). For the purpose of validation on formal criteria, the data domain for a certain data item can be incorporated into a separate database, a consultation database. Each time a piece of data is to be entered into the application database, it is first searched for in the consultation database (Fig 5). If a piece of data in the consultation database is present that completely matches input, the entered data is accepted. Nominal scaled data (eg, addresses of hospitals) can be verified likewise, as can numeric data, if a few distinct values are used (eg, codes).

FIGURE 5. Principle of the domain validation routine. The valid data for the item "grade of differentiation" are all incorporated into the consultation database constituting the eligible data domain. The entered data "MDC" is first searched for in the consultation database. If a complete match is found, the data entry is valid. The numbers 888 and 999 are tags for "not applicable" and "unknown," whereas .T. and .F. denote true and false, respectively. (Diagram: the entered data [MDC] is compared in turn with the consultation database entries 888 (.F.), 999 (.F.), WDC (.F.), and MDC (.T.); on the match, MDC is stored in the database field.)
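The three formal error traps just described (template, picture clause, and domain lookup) can be sketched as follows. This is a hedged illustration in Python, not the syntax of the data management system discussed here: the function names apply_template, picture_caps, and in_domain and the digits_only parameter are hypothetical, and a Python set stands in for the consultation database.

```python
def apply_template(template, text, digits_only=False):
    """Force input through a template: '_' positions accept input, all others are locked."""
    if len(text) != len(template):
        raise ValueError("input does not fit the template length")
    result = []
    for slot, ch in zip(template, text):
        if slot != "_":
            result.append(slot)  # locked position: the predefined character wins
        elif digits_only and not ch.isdigit():
            raise ValueError(f"only digits are allowed here, got {ch!r}")
        else:
            result.append(ch)
    return "".join(result)


def picture_caps(text):
    """Picture clause that controls character appearance by capitalizing the input."""
    return text.upper()


# Stand-in for the consultation database of the item "grade of differentiation".
GRADE_DOMAIN = {"888", "999", "WDC", "MDC", "PDC"}


def in_domain(value, domain):
    """Accept the entry only if it completely matches an entry of the domain."""
    return value in domain
```

As in Figure 4, apply_template("M____", "N8143") returns "M8143" because the first position is locked, and picture_caps("wdc") returns "WDC". Applying the picture clause before the domain lookup, in_domain(picture_caps("mdc"), GRADE_DOMAIN), lets a lower-case entry pass, which shows how the validation types can be combined.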
The domain routine can be designed such that if no match is found a list of all possibilities for correct input becomes available, giving the user the opportunity to choose the appropriate data. This function can also be evoked intentionally by committing an entry error. When, for example, tumor types are to be encoded, a list of terms is provided on typing a false entry (such as a question mark), which would be comparable to an online help function. After selection of a tumor type, the routine finds the proper code and inserts it into the proper data field.

Corresponding to domain tests for discrete alphanumeric or numeric data, range checks are used for continuous data. A minimum and maximum value must be declared to delimit the probable data range. Optionally, margins can be specified that represent zones for improbable data. For instance, the minimum and maximum zones for the shortest nuclear axis can be 3 µm and 12 µm, respectively, whereas two further values, one below 3 µm and one above 12 µm, mark the outermost limits, including the probable as well as the improbable data. Outside this interval, data are regarded as invalid.

The above data controls all apply irrespective of the degree of association of one data item with another. Approaches suitable to evaluate the validity of interrelated alphanumeric or numeric data can be based on IF...THEN statements. If A represents adenoacanthoma, then B represents WDC; that is, data other than B = WDC conflict with A and are therefore invalid. Evidently, for the development of relational checks, expertise or at least assumptions regarding the implications of each single piece of data for others are required. Relational (eg, =, ≠, <, >, etc), mathematical (eg, +, -, ×, ÷, etc), and logical (eg, "and," "or," "not," etc) operators can be used to specify the type of condition.

The highest level of data security can be attained if data are generated from already existing and validated data by process routines. Given certain field entries, calculations by the data management system make user input superfluous for other fields. Data fields supplied by these routines can be kept strictly inaccessible to the user. For example, the computation of the endometrial carcinoma prognostic index (ECPI) score can be defined as (a × data field A) + (b × data field B) + (c × data field C) - d, causing the result to be inserted into the data field ECPI_SCORE. In fact, data quality assurance is ensured here by complete automation. It is important to mention that the spectrum of process routines is not confined to arithmetic with numeric data. Alphanumeric data can also be handled. To compile a unique patient identification number, for instance, a process routine can concatenate the first three letters of the patient's hospital (substring[A, 1, 3]) + the patient's birth date (B) + the patient's first initial (substring[C, 1, 1]) + the first three letters of the patient's birth name (substring[D, 1, 3]), inserting the result into PATIENT_ID, the patient identification number field.

Finally, validation routines should be designed as dynamic tests; that is, they should be sensitive to data manipulation in the whole course of data input. Irrespective of time and location within the database, the entire spectrum of data control mechanisms must remain fully operative for each piece of data entered or edited.
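Range checks, relational checks, and process routines can be sketched together in Python. The sketch is illustrative and rests on several assumptions: all function names are hypothetical, the outer limits and the ECPI coefficients are placeholders supplied by the caller (no values for them are taken from this article), and the code M8573 for adenoacanthoma is the one shown in Figure 3.

```python
def range_check(value, probable, outer):
    """Classify a continuous value as 'probable', 'improbable' or 'invalid'."""
    low, high = probable
    outer_low, outer_high = outer
    if low <= value <= high:
        return "probable"
    if outer_low <= value <= outer_high:
        return "improbable"  # inside the margins: a warning rather than a rejection
    return "invalid"


def relational_check(record):
    """IF tumor type is adenoacanthoma (M8573) THEN the grade must be WDC."""
    if record["TUMOR_TYPE"] == "M8573":
        return record["GRADE"] == "WDC"
    return True  # this rule states nothing about other tumor types


def ecpi_score(record, a, b, c, d):
    """Process routine: (a x field A) + (b x field B) + (c x field C) - d."""
    return a * record["A"] + b * record["B"] + c * record["C"] - d


def patient_id(hospital, birth_date, first_name, birth_name):
    """Process routine: concatenate substrings into a patient identification number."""
    return hospital[:3] + birth_date + first_name[:1] + birth_name[:3]
```

A value of 7.63 with a probable range of (3, 12) is classified as "probable" for any enclosing outer limits. A record combining M8573 with PDC fails relational_check although both entries are valid on their own, which is the contextual error that separate field tests cannot detect. Because ecpi_score and patient_id derive their results from fields that have already been validated, the target fields never need to be typed in.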

Figure 6 displays the palette of data validations discussed. Each type can be combined with another and optimally adapted to suit specific demands. If defined, attempts to enter incorrect data cause user-defined error messages, or warnings to indicate improbable data. The functionality of each validation routine is automatically tested immediately after declaration.

FIGURE 6. Data validation types, in order of sophistication and versatility, that can be specified in the data dictionary used. Combinations of different test routines are possible. (Diagram labels: types, defaults, templates, pictures, domains/ranges, relations, processes.)

DISCUSSION

In this report, principles of computerized databases and data characteristics have been reviewed concisely from the pathologist's point of view. Major attention has been devoted to data errors and their correction simultaneously with data input, assuming the availability of extended data dictionary facilities. Since a 100% error-free database will hardly be attainable, the objective of data entry control is to minimize the error rate to an acceptable level. Nevertheless, the total expenditure of time and effort for data control tasks will depend on the purpose of investigation. In addition, the ease with which data validation procedures can be performed will largely determine the willingness to use them. While there is agreement about the requirement to ensure the highest data quality possible, there have been many ways described to reach this goal.

What can be done to prevent data errors from occurring is optimizing the data entry conditions. Generally, data should have a convenient format (free text should be avoided). Several investigators suggest collecting data directly on those data forms that will serve as the source for data input later on, thereby avoiding simple transcription errors due to oversight. In this respect it is emphasized that the computer screen layout should strictly resemble the data sheet design to facilitate finding the respective data fields. Data input should be done as soon as data are collected, preferably by one person only. With manual data entry, typing errors are common. This source of error can be eliminated by reading the data sheets with an optical scanner instead of keying in the data. Typing errors can also be circumvented by mapping the data forms onto a graphic tablet. By means of an electronic pencil, fields on the mapped data form can then be activated, causing the choices to be directly stored in the database. Physiologic and psychologic factors also play a role during data entry, as the length of breaks and mood or stress can affect the error rate.

Other approaches propagate data screening for errors after the data have been entered into the database system. Some investigators suggest cross-checking of randomly chosen data from the database against the original data on the forms. A data error percentage less than 1% is regarded as tolerable. Attentive proofreading of a computer printout of the contents of the database may also suffice in some instances, especially if the database is small. Although cumbersome, duplicate data entry of the same data into two separate database files has been found to be a very precise method that facilitates uncovering of inconsistencies by means of a computational comparison of these two data files. All data that do not match in this comparison are suspect for error. Statistical methods, such as exploratory data analysis, provide more options to clear a database, at least, of out-of-range data.

These two concepts resort to individual measures of data assurance that have to be applied before or after data entry, thereby mainly aiming at the elimination of formal data errors. The methods need special devices, special software, or require extra execution and/or input time. They have a purely preventive character, do not perform data checking actively, or are unrelated to the database system used. This may necessitate additional data manipulation or organizational tasks, leaving space for new sources of data error. Above all, none of these processes can cope with contextual data errors comprehensively.

From our experience with the Multicenter Morphometric Mammary Carcinoma Project (MMMCP) database (>3,000 breast cancer patients and >50 data fields), an integrated approach that combines different, automated methods of data validation can add to security and efficiency in data management. Using modern computing technology, a far-reaching automation of data quality tests is feasible. Automated validation routines are fast and can provide extensive data checking simultaneously with input. The user is thereby enabled to correct a data error immediately, avoiding the uncertainties accompanying a postponed data clearance. In addition, the prospect of automation of data entry control exerts a positive effect on database design. This is understandable because a system can only be automated if it is highly structured and consists of elements that are easy to access and process. To fulfill such requirements, the data format needs to be standardized and simple, preferring codes instead of terms; moreover, the relationships between data items need to be defined exactly. Both points improve the functionality of a database that, in turn, can also enhance data reliability.

Due to its invariance in time, automated input control largely warrants constancy of data fidelity. Neither fluctuations in data entry personnel nor their varying affinity for computers or familiarity with the meaning of data to be entered can dramatically affect data quality. This is an important issue for long-term or multicenter studies in which the collected data may be entered by different collaborators, at remote sites, lacking the possibility of supervision.

To accomplish data quality assurance with the aid of computer assistance, the utilization of a data management system that is based on an extended data dictionary provides important advantages. The data dictionary used at our laboratory can be edited like an ordinary database. This system is capable of supplying the appropriate flexibility needed to design and define data validation instruction code. The user is enabled to tailor strict control routines to suit multiple purposes. While most data tests can be defined easily, the development of relational checks is comparatively difficult. Although basic knowledge about the syntax needed to produce a proper validation instruction code is indispensable, this is not a major obstacle. The extra time required is chiefly absorbed by conceptual problems, particularly if several data items are involved. Considerable effort is necessary to obtain a full survey of multiple data relationships because efficient tests can be developed only if all possible data combinations have been taken into account. Nevertheless, tracing inconsistencies in advance not only allows for the setting of appropriate filter conditions, but also positively stimulates the discussion of data item definition in general. The developer is forced to substantiate the eligibility of each piece of data explicitly, which, in fact, is the basic prerequisite of any serious endeavour to control the quality of data.

Typically, once the development phase is thought to be finished, thorough testing will detect leaks in data validation. However, since direct access to the validation instruction code is provided through the data dictionary, faults can be corrected immediately. This is a decisive advantage compared with systems in which hard programming is required to eliminate "bugs" because, in this way, a lot of time can be saved. Since the validation code is not incorporated into the program code, but written in a separate file, the test specifications also can be used for data documentation. It has been mentioned that data dictionaries also are useful for many other purposes, for instance, the definition of screens, colors, indexes, or relationships between databases.

With an optimally secured data entry, nothing is accepted but valid data. Nevertheless, we observed that high rigidity can provoke offence if help is not available. Informative error messages or warnings that explain the rejection of data are important to keep up motivation. Indeed, when the system enters into a dialogue with the user, the data entry task is experienced as more challenging, and a system that acts like an expert rather than a noncommunicative machine will be appreciated. The use of a database system that is able to reflect on input contributes to more care in data preparation and collection, which perhaps may be explained by the competition with the computer or by conditioning. Moreover, users report that it is reassuring to know that even if they fail to notice the incorrectness of data, the system's internal database manager will not.

In conclusion, automated checks on data entry performed concurrently with input can serve as a powerful tool to warrant the integrity of data. Exploiting the facilities provided by an extended data dictionary, user-defined validation routines can rapidly verify incoming data on a large scale of error conditions, which enables direct correction of data faults. In this way, computerized data quality control can improve to a considerable degree the efficiency and reliability of data management tasks in pathology.
