Data quality in a distributed data processing system: The SHEP pilot study


Anna Bagniewska, Dennis Black, Kjeld Molvig, Cary Fox, Christine Ireland, Jacqueline Smith, and Stephen Hulley, for the SHEP Research Group

Department of Epidemiology and International Health, University of California, San Francisco

Abstract: The Systolic Hypertension in the Elderly Program (SHEP) Pilot was a collaborative clinical trial that distributed to the clinics all data processing tasks except for randomization assignment codes and morbidity and mortality data. The clinics used customized programs to enter and verify data interactively, to maintain their own local master files, and to transmit the data electronically to the Coordinating Center. We measured quality control based on criteria from centralized as well as distributed models: the error rate for baseline forms was 0.5 per 1000 items. Ninety-eight percent of the forms were query-free, and a central reentry of the data in a 5% sample yielded a miskey rate of 2 per 1000 items. The potential problems of distributed data processing are vulnerability of the local master files and the time demands on Coordinating Center programmers for maintaining clinic computer systems. The advantages are the active involvement of clinic staff in their own quality control, the functional accessibility of the clinics to the Coordinating Center in controlling protocol decisions and data monitoring, and the level of accuracy, completeness, and timeliness of the data that can be achieved.

KEY WORDS: data systems in collaborative clinical trials, distributed data processing, data quality control

Address reprint requests to: Dennis Black, Clinical Epidemiology Program, UCSF/SFGH, Building 1, Room 201, San Francisco, CA 94110. Received September 24, 1984; revised November 13, 1985. Controlled Clinical Trials 7:27-37 (1986). © Elsevier Science Publishing Co., Inc. 1986.

INTRODUCTION

Data processing systems for most collaborative clinical trials have delegated all data processing functions except for data collection to the Coordinating Center. In particular, data quality control has been the responsibility of the Coordinating Center [1,2]. Recently, several collaborative clinical trials have used a distributed data processing approach, already common in business and industry [3-6]. The Systolic Hypertension in the Elderly Program (SHEP) Pilot Study distributed to the clinics a broad range of data processing, from entry and editing to file maintenance and data analysis. This increased the quality control responsibilities at the clinics and facilitated the protocol monitoring ability of the Coordinating Center. It also provided each clinic with access to its own data files, which could be used for scheduling clinical activities, for guiding clinicians through complex treatment protocols, and for analyzing local results. General aspects of the SHEP and its distributed data system have been described in previous reports [7,8]. In this article we present a detailed description of the data quality control systems used in the SHEP.

METHODS

The SHEP was a pilot study for a clinical trial to test the effects of treating isolated systolic hypertension in the elderly [7]. The University of California, San Francisco housed the Coordinating Center, and there were clinics in Birmingham, Alabama; Chicago, Illinois; Pittsburgh, Pennsylvania; Portland, Oregon; and St. Louis, Missouri.

Computer Hardware

Each of the clinics and the Coordinating Center had a Wang minicomputer (2200 series LVP/SVP). The specific system configurations at the five SHEP clinics varied from 64 to 128 kilobytes of memory and were capable of supporting from one to three simultaneous users. Each clinic system had one 1.2-megabyte floppy disk storage unit and between 1 and 8 megabytes of additional storage. The Coordinating Center had a total of 18 megabytes of disk storage. Each computer in this system was used for data entry and editing, word processing, data analysis, and telecommunications. Both data and text files were exchanged over the telecommunications link between the clinics, the Coordinating Center, and other participating units. There was a direct link with the University of California, San Francisco IBM 4341 mainframe, where the more complex analyses were done. The clinic-entered data totaled 3.5 million data fields, requiring 13 megabytes of storage centrally. The data entered at the Coordinating Center added a further 1.4 million data points, requiring another 5 megabytes of storage.

The SHEP Distributed Data Processing System

The data flow and quality control functions of the SHEP data system are shown in Figure 1. Data forms were reviewed by the clinician and forwarded to the clinic data manager, who was instructed to enter the data into the minicomputer within 24 hours of data collection. The clinic retained a copy of the paper data form, and an additional copy was forwarded to the Coordinating Center. During a data entry session, interactive range and consistency checking was performed, and the operator was notified by an audible "beep" of any out-of-range or inconsistent values. The operator could then "force" entry of these values, which were then automatically flagged for Coordinating Center review. The data were stored in temporary transaction files after being entered. Daily batch reports summarized data entry and editing activity, listing out-of-range and inconsistent values.

Figure 1. Data flow and steps in quality control in the SHEP distributed data system. [Flow diagram: data entry and local posting to the clinic master file; raw data files transmitted to the Coordinating Center by polling; Coordinating Center audit review with pass/fail branches.]

Each week, transaction files were transferred to the Coordinating Center by telecommunication, where data were merged into the Coordinating Center master files. Simultaneously with transmission to the Coordinating Center, the transaction files were merged with the local clinic master files. Thus, each clinic maintained its own local master file as well as transmitting data to the central master file. Any change to a clinic master file automatically generated a replacement record in the transaction file that could then be remerged into the master files. This replacement record was used by the Coordinating Center to monitor any data changes at the clinics.

After receipt at the Coordinating Center, an audit program checked the data. New data were merged into the existing Coordinating Center master files. Any duplicate or altered records were stored in a temporary audit file for individual review by the Coordinating Center data manager. This process both protected and updated the central master file while monitoring the clinic master file, and it increased Coordinating Center knowledge of and control over data quality. A sample of the Coordinating Center audit file contents is shown in Figure 2; problems are identified by participant ID, sequence number of visit, data file name, and a brief message regarding the problem. Sample problems include the attempted use of a nonexistent identifier (participant "12345"), local changes to the clinic master file (participant "54321"), and deletion of a form from the clinic master file (participant "22222"). The problem record was either merged into the master file or resolved with the clinic data manager.

Figure 2. An example of contents of the audit file to be reviewed by the Coordinating Center data manager.

Participant ID   File name                Message                                    Status (from Coordinating
& visit #                                                                            Center review)
12345  07        Blood Pressure           Invalid identifier in day file             Not merged
54321  12        Treatment & Scheduling   Change at clinic, appointment date:        OK to merge
                                          from 22/22/83 to 12/22/83
22222  11        Side Effects             Form recalled from clinic master file      Monitor for clinic's reposting
                                          but not reposted at clinic                 the form to its master file
33333  12        Annual Medication        Hospitalization data inconsistent          Waiting for corrected data
                                          with morbidity record*                     transmission (delete request
                                                                                     when correction arrives)

*Inconsistencies of this sort are not detected on entry at the clinic because the morbidity records are centrally entered.
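The audit logic described above, in which new records merge directly into the central master file while invalid identifiers, duplicates, and clinic-side changes are queued for data-manager review, can be sketched as follows. This is an illustrative sketch only; the record layout, field names, and the `audit_transactions` helper are hypothetical, not the actual SHEP Wang implementation.

```python
# Sketch of the Coordinating Center audit step: apply incoming clinic
# transaction records against the central master file, routing problem
# records to an audit file for review. Record layout is hypothetical.

def audit_transactions(transactions, master, valid_ids):
    """Merge clinic transaction records into the central master file.

    New records for known participants merge directly; invalid identifiers,
    duplicates, and records altered at the clinic go to the audit file.
    """
    audit_file = []
    for rec in transactions:
        key = (rec["id"], rec["visit"], rec["form"])
        if rec["id"] not in valid_ids:
            audit_file.append(("invalid identifier", rec))  # e.g. participant "12345"
        elif key not in master:
            master[key] = rec                               # new data: merge
        elif master[key] == rec:
            audit_file.append(("duplicate record", rec))
        else:
            audit_file.append(("changed at clinic", rec))   # hold for review
    return audit_file

# A small worked example modeled on the Figure 2 sample problems.
master = {
    ("54321", 12, "Treatment & Scheduling"):
        {"id": "54321", "visit": 12, "form": "Treatment & Scheduling",
         "appt_date": "22/22/83"},
}
incoming = [
    {"id": "12345", "visit": 7, "form": "Blood Pressure", "sbp": 168},
    {"id": "54321", "visit": 12, "form": "Treatment & Scheduling",
     "appt_date": "12/22/83"},
]
queue = audit_transactions(incoming, master, valid_ids={"54321", "22222"})
```

Note that the changed record is not merged until the data manager has reviewed it, which is how the central master file is protected from unauthorized clinic-side alterations.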


As a check of the quality of the data, baseline and early-treatment visit data for a 5% random sample of participants were reentered at the Coordinating Center. The identity of the sampled participants was not known to the clinics. All data were compared field by field with what was received from the clinic data entry. A list of discrepancies was generated and compared with the data form. Any nonsubstantive discrepancy between the clinic entry and the data form, such as a differential use of an ambiguous missing value code, was discarded and the remaining errors classified as miskeys. Miskey rates per 1000 data fields were calculated by form and by clinic.
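The reentry comparison amounts to a field-by-field diff that discards nonsubstantive differences before counting miskeys. A minimal sketch follows; the field values and the set of ambiguous missing-value codes are hypothetical, not the SHEP coding scheme.

```python
# Sketch of the 5% reentry check: count substantive discrepancies (miskeys)
# between clinic-entered and centrally reentered fields, ignoring
# nonsubstantive differences such as differing missing-value codes.

MISSING_CODES = {"", "9", "99"}  # assumed ambiguous missing-value codes

def miskey_rate(clinic_fields, center_fields):
    """Return miskeys per 1000 data fields for one form."""
    miskeys = 0
    for clinic_value, center_value in zip(clinic_fields, center_fields):
        if clinic_value == center_value:
            continue
        if clinic_value in MISSING_CODES and center_value in MISSING_CODES:
            continue  # nonsubstantive: both sides used a missing-value code
        miskeys += 1  # substantive discrepancy: classify as a miskey
    return 1000 * miskeys / len(clinic_fields)

# Hypothetical example: one substantive discrepancy ("88" vs "98") in 5
# fields; the "9" vs "99" difference is a nonsubstantive missing-code clash.
rate = miskey_rate(["142", "88", "9", "1", "2"],
                   ["142", "98", "99", "1", "2"])
```

In practice the discrepancy list would also be compared against the paper form before classifying an error as a miskey, as described above.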

RESULTS

Table 1 presents the results from the Coordinating Center batch edit of four baseline forms. Because this is a reedit of clinic-entered data, wider range checks and more complex logic loops were used than in the interactive or daily clinic edits. The central batch edit passed 98% of the forms; only 2% had out-of-range, inconsistent, or missing values. The 2% of forms (120) with some sort of query carried a total of 213 queries, an average of 1.8 queries per flagged form. The right side of Table 1 shows rates of queries by type of query and form. The rate of range or consistency errors is very low (mean = 0.5 per 1000 data fields), varying from a low of 0.1 per thousand on the medical exclusion form to a high of 1.0 per thousand on the first-visit blood pressure form. Almost all (97%) of the range or consistency errors were due to out-of-range values. Note that forms are present for all 551 participants who were eventually randomized, highlighting the effectiveness of the safeguard that required a computer check of the data prior to the randomization of a participant into the study.

The results of the 5% random reentry are shown in Table 2. Once nonsubstantive discrepancies between clinic-entered and Coordinating Center-entered data had been set aside, the overall miskey rate at the five clinics was only 2 per 1000 items. The miskey rate per clinic ranged from 1 to 3 per 1000 items; the miskey rate per form ranged from 0 (on the Visit 2 Blood Pressure Form) to 17 (on the Medical History Take-Home Form). The average miskey rate was 3 per 1000 fields for baseline forms and 2 per 1000 for treatment visit forms.
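The summary rates quoted above follow directly from the reported counts. As a quick arithmetic check (counts taken from the text and the baseline-form totals):

```python
# Recomputing the summary figures quoted in the text from reported counts.
total_forms = 4987          # baseline forms received
forms_with_queries = 120    # forms with at least one query
total_queries = 213
total_fields = 127_473      # total entry fields on baseline forms
range_errors = 70           # range or consistency errors

pct_query_free = 100 * (total_forms - forms_with_queries) / total_forms  # ~98%
queries_per_flagged_form = total_queries / forms_with_queries            # ~1.8
error_rate_per_1000 = 1000 * range_errors / total_fields                 # ~0.5
```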

Table 1. Batch-Generated Data Quality Report of SHEP Baseline Forms

                                        Forms (N)                                        Errors by data fields
Form and visit type        Randomized  Not randomized  Total   Query-free(a)  Total entry  Range or consistency      Irretrievable values(c)
                                                               N (%)          fields       errors(b), N (rate/1000)  N (rate/1000)
BP Form Visit 1               551          1579        2130    2062 (97%)      38,340      39 (1.0)                  67 (1.6)
BP Form Visit 2               551           705        1256    1241 (99%)      22,608      14 (0.6)                  10 (0.4)
Medical Exclusion Visit 2     551           363         914     887 (97%)      52,098       6 (0.1)                  66 (1.3)
BP Form Visit 3               551           136         687     677 (99%)      14,427      11 (0.7)                   0 (0)
Total                                                  4987    4867 (98%)     127,473      70 (0.5)                 143 (1.1)

(a) "Query-free" forms are those without a single outlier, inconsistency, or irretrievable value.
(b) All of these were flagged at the time of entry at the clinic, but the operator overrode the computer message and entered the value.
(c) Irretrievable values: all missing values are reviewed and either completed or identified as irretrievable.

Table 2. Summary Results of Coordinating Center Validation of Clinic Data Entry; 5% Sample (N = 25 Participants)

                                       Miskey rate/1000 data fields                        Total N of entry
Visit and form type           Clinic A  Clinic B  Clinic C  Clinic D  Clinic E  All clinics  fields, all clinics
Baseline Visits
  Visit 1 BP Form                 0         0        28         0        22         9             450
  Visit 2 BP Form                 0         0         0         0         0         0             425
  Medical Exclusions              0         3         0         0         4         2            1320
  Medical History                 7        24         0        16        38        17             525
  Clinician Review                0         0         0         0         0         0             550
  Visit 3 BP Form                 0         0        26         0         0         0             475
  B.L. Side Effects               1         1         0         0         0         1            3400
  B.L. Compliance                 7         0         0       N/A        12         7             600
  All Baseline Visits             1         3         3         1         5         3            7745
Treatment Visits
  Participant Contact             0         0         3         0         5         1            2074
  BP Form                         2         4         0         0         0         2            2412
  Compliance                      0         0         1         0         0         0.5          3741
  Local Lab                       0         0         0         8         0         2             506
  Med/Scheduling                 15        15         7         0         0         8            1352
  3-Month and Annual Compliance   0         4         5         0         0         2            1320
  All Treatment Visits            2         3         2         0.4       1         2           11405
Overall Rate                      2         3         3         1         3         2           19285

DISCUSSION

Little has been written in the clinical trials literature about established standards for data quality control. There is general agreement that quality control is important to assure accurate data [9], but no systematic, clear description or standard cost-effectiveness analysis is available [10]. The question is whether a distributed data processing system is as good as or better than its centralized predecessor for collaborative clinical trial research. Distributed data processing means that more quality control measures can be carried out at the clinic, which brings quality control much closer to the source of the data. A heavier load of data clean-up activity at the clinic level is balanced by a lighter load of data processing activity at the central level. This shift is reflected in the data clerical staffing pattern: more staff were required at the clinics (a 0.75 full-time-equivalent data manager at each clinic) and fewer at the Coordinating Center (1.0 full-time equivalent). At the same time, central monitoring and supervision of data quality are enhanced by the Coordinating Center's ability to design the automatic editing programs and by its access to the data shortly after they are entered. The distributed data processing system encourages a more distributed approach to all aspects of collaborative research. The increased responsibility of the clinics yields a heightened awareness of the importance of the clinic staff in maintaining data quality and a partnership with the Coordinating Center in this endeavor. In the area of data management, and during interactive data entry in particular, the clinic data manager functions as a representative of the Coordinating Center.

Problems Created by a Distributed Data System and Some Suggested Solutions

Implementation of a distributed data system in a clinical trial is a complex endeavor that creates some special problems. The local clinic's master file is vulnerable to local alterations. Our solution, an audit system with Coordinating Center review of any changed items, provides a remedy by preventing unauthorized changes from being transferred into the trial-wide master files. A comparison at the Coordinating Center of a random sample of forms with the computer files provides a further means of monitoring data integrity. The need to develop and maintain each of the clinic computer systems, in addition to that of the Coordinating Center, is programmer-intensive. Small changes in data forms or files must be implemented at multiple remote sites. To minimize the programmer time required, it is important to allow adequate lead time for development, to anticipate and allow for changes in the system, and to recognize the need to program for at least three levels of users of distributed systems: clinician, data manager, and statistician. One specific technical recommendation for developers of distributed data systems for clinical trials is to use software that allows the Coordinating Center programmers to fully operate each remote computer via telephone link. Another is to require that each remote site have identical hardware and software configurations. Finally, we recommend that certain data forms and fields be identified as high priority for the development of editing programs, and that only those data items necessary for local editing or reports be maintained on the clinic master files.

Advantages of Distributed Data Processing

The advantages of a distributed system stem from the functional proximity that is created between the Coordinating Center and the clinics. The data are always rapidly available to the clinics and the Coordinating Center. If necessary, the capability exists for immediate access to the data. The hypothetical time comparison of distributed with centralized data entry in Figure 3 shows the number of steps and the approximate time involved in data collection and editing under the two systems. The total time needed for data availability is decreased in a distributed system through the smaller number of data editing steps and the decreased time required for each step.

Figure 3. A hypothetical comparison of the steps in data collection in a distributed versus a centralized data processing system. [Flow diagrams: in the centralized system, forms are mailed to the center (1-7 days), entered and edited centrally, and queries are processed by mail (1-7 days each way), for a total of 2 1/2 weeks or more before the master file is updated and reports are produced; in the distributed system, data are entered and edited locally and posted to the master file, for a total of 2-8 days.]

The Coordinating Center can achieve total control over the degree of completeness of the data. For example, we required 100% complete baseline/eligibility data for all randomized participants. Requiring completed entries on forms, completed sets of forms, and uninterrupted sequences of sets is a distributed data processing feature that allows the Coordinating Center to control data acquisition as stringently as is considered necessary, valuable, or efficient.

Accuracy of data entry can be enhanced in a distributed data processing system by the computerized editing procedures performed first at the clinic and then at the Coordinating Center. Accuracy could be further increased by requiring duplicate entry (verification), perhaps for a subset of variables considered of primary importance or at clinics not maintaining a preset standard of data quality.

Another potential advantage of a distributed data system is the ability to use the data maintained at the clinic to guide clinicians through complex study protocols (e.g., assessing eligibility or determining treatment decisions). Also, an electronic mail system used in conjunction with the distributed system can increase communication among the investigators, clinicians, and data staff of all study units. Electronic communication also facilitates the preparation of reports, presentations, and publications.

A distributed data system differs from the traditional, centralized system in ways that can enhance the quality of data collected in a clinical trial. The rapid growth of microcomputer technology and the availability of sophisticated data management software have greatly decreased the staff and hardware required for a distributed system's development and maintenance, making it more practical and efficient. One possibility for future development is the direct entry of data by clinicians, or direct input from machines (such as an automatic blood pressure device), and the elimination of paper forms. Another possibility is direct input by participants themselves. The introduction and development of distributed data processing systems for collaborative research constitutes a significant advance, producing data of high quality in clinical trials.

The Collaborating Centers and investigators in this study were:

Clinical Centers
University of Alabama in Birmingham, Birmingham, AL: H. Schnaper, MD (principal investigator); G. Hughes, PhD (co-principal investigator); P. Johnson (data manager); D. Parker (data manager)
Rush-Presbyterian-St. Luke's Medical Center, Chicago, IL: J. Schoenberger, MD (principal investigator); G. Neri, MD (co-principal investigator); T. Remijas (data manager)
University of Pittsburgh, Pittsburgh, PA: L. Kuller, MD, DrPH (principal investigator); R. McDonald, MD (co-principal investigator); K. Martz (data manager); K. Sutton (data manager)
Kaiser Foundation Hospitals, Center for Health Research, Portland, OR: M. Greenlick, PhD (principal investigator); T. Vogt, MD (co-principal investigator); J. Downing (data manager)
Washington University, St. Louis, MO: H. Perry Jr., MD (principal investigator); G. Camel, MD (co-principal investigator); D. Howard (data manager); H. Jaeger (data manager); B. Perry (data manager)

Coordinating Center
University of California, San Francisco, San Francisco, CA: S. Hulley, MD, MPH (principal investigator); W. Smith, MD, MPH (former principal investigator); S. Edlavitch, PhD (former co-principal investigator); A. Bagniewska, MA (data group coordinator); D. Black, MA (senior biostatistician); C. Fox, MA (programmer)
Vogt, MD (co-principal investigator) J. Downing (data manager) Washington University, St. Louis, MO H. Perry Jr., MD (principal investigator) G. Camel, MD (co-principal investigator D. Howard (data manager) H. Jaeger (data manager) B. Perry (data manager) Coordinating Center University of California, San Francisco, San Francisco, CA S. Hulley, MD MPH (principal investigator) W. Smith, MD MPH (former principal investigator) S. Edlavitch, PhD (former co-principal investigator) A. Bagniewska, MA (data group coordinator) D. Black, MA (senior biostatistician) C. Fox, MA (programmer)


S. Harvey (data assistant); K. Molvig (senior programmer); S. Shepard, MA (assistant programmer)

Behavioral Evaluation Laboratory
Center for Geriatrics and Gerontology, Columbia University, New York, NY: B. Gurland, MD (principal investigator); J. Challop-Luhr, PhD

National Institutes of Health Project Offices
National Heart, Lung, and Blood Institute, Bethesda, MD: C. Furberg, MD (project officer); T. Blaszkowski, PhD
National Institute on Aging, Bethesda, MD: G. Steinberg, PhD
National Institute of Mental Health, Bethesda, MD: N. Miller, PhD

Supported by grants HL23914, HL23913, HL23924, HL23916, HL23917, and HL23919 from the National Heart, Lung, and Blood Institute, the National Institute on Aging, and the National Institute of Mental Health, National Institutes of Health, Bethesda, Maryland.

REFERENCES

1. McDill MS, Coordinating Center Models Project Research Group: Activity analysis of data coordinating centers. XVI: CCMP manuscripts presented at annual symposia on coordinating clinical trials, June 1979, pp 105-125
2. Hawkins B, Coordinating Center Models Project Research Group: RFPs as planning guides to the initial organization of coordinating centers. XVI: CCMP manuscripts presented at annual symposia on coordinating clinical trials, June 1979, pp 41-51
3. Rasmussen W, Neaton J: Design, implementation and field experience with the use of intelligent terminals in clinical centers in the Multiple Risk Factor Intervention Trial. Presented at the 5th annual symposium on coordinating clinical trials, May 1978
4. Kronmal RA, Davis K, Fisher L, Jones R, Gillespie MJ: Data management for a large collaborative clinical trial (CASS: Coronary Artery Surgery Study). Comput Biomed Res 11:553-566, 1978
5. Jeffreys J, for the HPT Investigative Group: Performance characteristics of the Hypertension Prevention Trial distributed data system. Controlled Clin Trials 4:148, 1983
6. Karrison T, Meier P: Watching the watchers: Data quality control in the PARIS study. Presented at the 4th annual meeting of personnel involved in coordinating collaborative clinical trials, May 1977
7. Hulley SB, Furberg CD, Gurland B, McDonald R, Perry HM, Schnaper HW, Schoenberger JA, Smith WM, Vogt TM: The systolic hypertension in the elderly program (SHEP): antihypertensive efficacy of chlorthalidone. J Cardiol (in press)
8. Black D, Molvig K, Bagniewska A, Edlavitch S, Fox C, Hulley S, Smith W: A distributed data processing system for a multicenter clinical trial. Drug Inf J (in press)
9. Knatterud G: Methods of quality control and of continuous audit procedures for controlled clinical trials. Controlled Clin Trials 1:327-332, 1981
10. Meinert C, Coordinating Center Models Research Group: Cost profiles of data coordinating centers. XIV: Enhancement of methodological research in the field of clinical trials, September 1979, pp 127-153