Design and Evolution of the Data Management Systems in the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial

Marsha A. Hasson, MS, Richard M. Fagerstrom, PhD, Dalia C. Kahane, PhD, Judith H. Walsh, BA, Max H. Myers, PhD, Clifford Caughman, MS, Blaine Wenzel, MS, Juline C. Haralson, Lynn M. Flickinger, MA, and Louisa M. Turner, BA, for the PLCO Project Team

Westat, Inc., Rockville, Maryland (M.A.H., D.C.K., J.H.W., C.C., B.W.); Biometry Research Group, Division of Cancer Prevention, National Cancer Institute, Bethesda, Maryland (R.M.F.); NOVA Research Co., Bethesda, Maryland (M.H.M.); Marshfield Medical Research and Education Foundation, Marshfield, Wisconsin (J.C.H.); Henry Ford Health System, Detroit, Michigan (L.M.F.); and Pacific Health Research Institute, Honolulu, Hawaii (L.M.T.)
ABSTRACT: This paper describes the design and evolution of the data management systems developed in support of the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial. These systems span platforms from stand-alone computers to distributed systems on local area networks to mainframes. Allowing all of these systems to share appropriate information electronically introduces integration, synchronization, testing, and support challenges. For each platform, applications were developed to handle data entry, editing, trial management, reporting, telecommunications, and data sharing. Approaches to issues such as level of data access, integration with other, existing applications, and handling the expansion of the protocol are discussed. Control Clin Trials 2000;21:329S–348S © Elsevier Science Inc. 2000

KEY WORDS:
Software design, database management systems, automated data processing, data collection
BACKGROUND

The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial is being conducted by the National Cancer Institute (NCI) at ten screening
Address reprint requests to: Dorothy Sullivan, Early Detection Research Group, Division of Cancer Prevention, National Cancer Institute, EPN 330, 6130 Executive Blvd., Bethesda, MD 20892-7346 (E-mail: ds255j@nih.gov). Received March 27, 2000; accepted May 31, 2000.
Controlled Clinical Trials 21:329S–348S (2000) © Elsevier Science Inc. 2000, 655 Avenue of the Americas, New York, NY 10010
0197-2456/00/$–see front matter PII S0197-2456(00)00100-8
centers (SCs) nationwide, with Westat, Inc. acting as the coordinating center (CC) [1]. Ultimately, the PLCO will enroll 148,000 men and women, with half undergoing the experimental screening examinations over 5 years and the other half serving as a control arm [2]. Both groups will be followed for a minimum of 13 years from randomization to ascertain mortality from the subject cancers. Procedures and systems to support the trial were implemented and tested in the second year of the pilot phase, beginning September 30, 1993 and concluding September 29, 1994.
APPROACH

Before the initiation of the trial, some decisions regarding the support systems had already been made, as represented in the CC statement of work. A distributed system was envisioned, to be put in place at each SC, supporting full entry and editing of data, as well as management of trial activities. The design had to address both the immediate and long-term needs of trial operations and the strategic information requirements to monitor activities and evaluate the underlying scientific hypotheses. System design was guided by Federal Information Processing Standard (FIPS) 38 [3]. The CC contract was awarded at the same time as the SCs, leaving a very short design and development period. The systems were developed centrally as a joint effort between the NCI and the CC. Once the basic systems were implemented, SCs were involved in expansion of the systems to support common operational needs. Design and programming of systems was accomplished concurrent with operations protocol development. Various subcommittees of the steering committee provided input ranging from detailed specifications for screening test reports to primary care physicians and participants to general protocol requirements that must be addressed in functions, edits, and logics [4]. These inputs were combined with the operations protocol, examinations protocols, specifications for forms completion, and requirements in the manual of operations and procedures (MOOP). The functional requirements document was completed in phases, with each module approved by the NCI prior to programming [5,6]. The PLCO is a research trial with detailed requirements for data collection and validation that are not usually required in daily clinical operations.
Quality control procedures must address both the research requirement for accurate and complete data, collected across time, and the needs of the operations staff, whose interest in data tends to dissipate as soon as the data have served the immediate operational purposes [7, 8]. The NCI and the CC developed data collection and verification routines to ensure that entered data would be consistent with the structure of related database entries and that the data collected were reliable. Using standard approaches, this was addressed by predefining database retrieval routines and reports by which data would be verified at their source and by giving SCs the wherewithal for verifying the accuracy and completeness of data. Operational incentives were added to assist in maintaining the quality of the data and to reduce the effort of duplicative entry in internal clinic management systems. Reports listing upcoming trial activities (“expectations”) and merge
files used with SC-developed form letters helped reduce the time spent reentering or manually tracking data for many PLCO-wide clinic management activities. Two functional areas had to be supported: management of trial activities using the study management system (SMS) and entry and editing of the full complement of research data with the data entry and editing system (DEES). Initial design efforts concentrated on developing the SMS. The DEES would be provided in phases following the initial implementation of the SMS. Since data collection was to be accomplished in large part by optical scanning, it was felt that the backlog could be addressed expeditiously. Furthermore, a number of forms (e.g., medical record abstracts of diagnostic findings) would not be needed for quite some time after the initiation of the pilot phase, thus their design could be delayed. A centrally developed and maintained system does not mitigate the need for specialized local functions at the SCs. A central system institutes standards across SCs for collection and entry of PLCO data and provides support for basic trial functions, reducing or eliminating the need for programmers at individual SCs and automating and controlling certain protocol elements [9, 10].

Design Considerations

From its inception, the PLCO was to be supported through distributed systems. This would be a replicated database model, in which each SC had the same software locally, but stored and accessed only its data. The central database at the NCI would be a superstructure of the databases of SCs (minus any data that would identify participants) and would also include data from the central laboratory (LAB) and the biorepository. From an operational perspective, the SC database is a repository of participants’ information, randomization assignments, and clinical and laboratory results that are most useful when they are readily available for day-to-day operations [7].
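The replicated database model described above can be sketched in a few lines: each SC holds only its own participants, and the central database is the union of the SC databases with identifying fields stripped. This is an illustrative sketch only; the field names and table layout are assumptions, not the trial's actual schema.

```python
# Hypothetical sketch of the replicated-database model: the central database
# is the union of the SC databases minus any identifying data. All field
# names here are illustrative assumptions.

IDENTIFYING_FIELDS = {"name", "address", "ssn"}

def strip_identifiers(record):
    """Drop fields that could identify a participant."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}

def build_central(sc_databases):
    """Combine per-SC databases into a de-identified central file."""
    central = []
    for sc_id, records in sc_databases.items():
        for rec in records:
            row = strip_identifiers(rec)
            row["sc"] = sc_id  # retain the originating screening center
            central.append(row)
    return central

sc_dbs = {
    "SC-01": [{"participant_id": "P-1001", "name": "Jane Doe", "arm": "screened"}],
    "SC-02": [{"participant_id": "P-2001", "name": "John Roe", "arm": "control"}],
}
print(build_central(sc_dbs))
```

In practice the central superstructure also carried LAB and biorepository data, which this sketch omits.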
The systems support entry and editing of data and automate specified protocol functions to support adherence to research protocol specifications [11], ensuring that all SCs are applying standardized rules and controls, thus reducing variability among SCs and among staff within SCs [9,12]. These automated controls have study impacts, however, including delays in implementing revisions until the systems can be modified, tested, documented, and distributed. For functions such as data entry, receipt control, and generation and mailing of forms, a form-oriented interface was designed. For other functions, a participant-oriented interface was incorporated. Because the database is so large, many of the participant-oriented views contain calculated and/or redundant variables to reduce response time. Keying of data at SCs was initially considered (including 100% key verification or double data entry) [8, 13]. This approach is often recommended for distributed processing when the entry staff at the SCs are not the appropriate decision makers for interactive edits [14]. The complexity of the protocol and the enormity of the data management (more so than data forms) required shifting to optical-mark scanning, facilitating comprehensive edits in the field, decreasing time from data collection to availability in the central database,
minimizing key entry and verification, and reducing staffing requirements. Appropriate approaches to marking data edits on forms and procedures for reviewing forms prior to scanning were established [5]. Data edits for consistency, logical completion (skips), and reasonableness would continue to be performed through batch reports. Remote access to SC computers was also factored into the design to facilitate resolution of software and hardware issues. CC programmers examine and resolve problems most expeditiously when they are able to follow the exact sequence of commands employed by users on site. Databases at the SCs, the LAB, the biorepository, and the NCI must be kept synchronized. To ensure that the data correspond among systems, the design employs a number of quality assurance techniques, including database reduction comparisons and periodic full-file comparisons. Analysis datasets are made available only as directed by the NCI from central files, so that investigators work from the same data and analyses accurately represent the entire trial [15].

Configuration/Capacity Considerations
For the pilot phase, the NCI initially provided funds for each SC to acquire two stand-alone microcomputers, a hand-held bar code reader (wand), a laser printer, modems for each computer, and a Bernoulli drive for backups. Additional equipment was purchased to support an ink-read optical-mark reader. Software included DOS 5.0, WordPerfect, PC Anywhere (PC-to-PC communications), Kermit (data transmission), Xcellenet, Norton Utilities (disk administration, virus checking), and PKZip/Unzip (data compression). From capacity calculations performed by CC programmers early in the pilot phase and updated as the protocol and system requirements expanded, it was clear that the two computers would not be sufficient to support the main phase. The NCI therefore decided to develop a network capability (“PLCONet”) at each SC, as shown in Figure 1. The CC had developed flexible applications to allow future revisions to support multiuser access in the face of rapidly evolving protocol requirements. Other aspects, such as local versus network table locations, could be designed and reprogrammed only when the network architecture was finalized. Development of new applications continued concurrent with the conversion to a network environment. The network architecture is a star topology, in which all information goes through a central hub, simplifying the process of troubleshooting (among other advantages) as compared to a linear bus. The networks are implemented under Novell 3.12 and are remotely administered by the CC. At a minimum, each network has six workstations, a dedicated server with mirrored drives, two laser printers, an optical-mark reader, and a tape drive. Data transmission among SCs, the CC, the LAB, the biorepository, and the NCI is accomplished through modems using common carrier lines. Some consideration was given to using the Internet for data sharing, but a number of concerns were raised and the NCI decided not to do so.
Among the concerns at the time were reports by other projects of significant performance degradation, disconnects, and data corruptions on the Internet [16]. An independent study conducted by Keynote Systems, Inc. in September 1997 found that the average speed on the Internet was just 5000 characters per second (40 Kbps),
Figure 1  The PLCO systems configuration.
despite wide increases in bandwidth by providers [17]. The problems were attributed to the network, not the individual Web sites. Other studies had reported an inability to transmit at all for significant periods of time as portions of the Internet failed [16,18]. These concerns, combined with the rapidly changing Internet landscape in which the ultimate direction and impact on communications were uncertain, led to a decision to use common carrier service. To date, no significant problems have been encountered, and the need for user support for data transmission has been minimal.

IMPLEMENTING THE REQUIREMENTS
To support the PLCO data management needs, the CC developed a web of communicating systems to share information among collaborators. The heart of the design is the SC, the originating source of most data. From the SCs, data are sent to and/or received from the LAB (PSA and CA125 results), the biorepository, contractors who process specified forms centrally, the CC, and the NCI.

Screening Center Systems
The SC systems are a combination of commercial, off-the-shelf software and applications developed specifically for the trial. Initially developed in Paradox 3.5 for DOS (with some programs in C++), the applications are conceptually separated into two main program areas: the SMS and the DEES. Commercial software includes word processing, spreadsheets, telecommunications, and utilities (e.g., virus checking, disk problem diagnosis, and network monitoring). SCs are responsible for their own software to support physician scheduling and clinic management.
Figure 2  Program modules to support screening centers.
Study Management System
Nine fully integrated modules, each of which supports a functional operation needed to manage the trial, were developed for the SMS, as shown in Figure 2. The modules are:

• TASR (Tracking and Summarizing Recruitment). Initially designed to simply summarize recruitment activities for progress monitoring by the NCI, this module was expanded to meet analysis needs. Cross-tabs of recruitment information based on eligibility status, reason for ineligibility, gender, race, and age are available to evaluate recruitment efforts. Since recruitment activities are coordinated independently at SCs, this module is designed to import information from a variety of sources in a standard format.
• RAND (Randomization and Enrollment). This module randomizes eligible participants into the trial’s intervention or control arm, stratified by age and gender.
• FAST (Forms and Specimens Tracking). The heart of the SMS, this module is used for receipt control of all forms and specimens, as well as entry of forms that are not optically scanned. Once a participant is randomized into the trial, most input to the system is accomplished through either this module or the DEES.
• XPORT. This module exports selected data for use in SC-specific applications.
• LABELS. Many of the labels used in the trial include bar codes to reduce data entry errors and facilitate receipt control. This module prints those labels, as well as shipping labels, box labels, and address labels.
• SHIPPING. This module tracks and coordinates trial materials that are due to be mailed and for which no response is expected. In addition to forms and specimens sent to collaborators, this module generates screening test results forms and customized cover letters to be sent to participants and primary care physicians.
• REQUESTS. This module tracks and coordinates PLCO materials that are due to be mailed (or remailed) and for which a response is required from the recipient. In some cases, the system generates the form itself, adding known information such as name, identifier (ID), address, and information for participant confirmation. Cover letters for materials may also be generated.
• STATUS. This module tracks participant status in the trial, including vital status and cancer status.
• REPORTS. This module provides numerous standard reports within and among these modules, as well as a facility for users to define individual ad hoc reports.

Each module has various reports associated with it for management summaries and overviews, as well as detailed production support. Most include options to produce merge files for use with WordPerfect to generate cover letters, standard trial questionnaires with some prefilled information, and other individualized SC materials.
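The "expectations" reporting mentioned earlier, in which upcoming trial activities are listed for clinic staff, can be sketched as a simple scheduling query. This is a hypothetical illustration: the one-year interval, 30-day window, and field names are assumptions, not the trial's actual scheduling rules.

```python
from datetime import date, timedelta

# Hypothetical sketch of an "expectations" report: list participants whose
# next annual screening falls due within a reporting window. The schedule
# and field names are illustrative assumptions.

WINDOW = timedelta(days=30)

def expectations(participants, today):
    """Return (participant ID, due date) pairs for exams due within WINDOW."""
    due = []
    for p in participants:
        next_exam = p["last_exam"] + timedelta(days=365)
        if next_exam - today <= WINDOW:
            due.append((p["pid"], next_exam.isoformat()))
    return sorted(due, key=lambda item: item[1])

roster = [
    {"pid": "P-1001", "last_exam": date(1999, 7, 20)},
    {"pid": "P-1002", "last_exam": date(1999, 12, 1)},
]
print(expectations(roster, date(2000, 7, 1)))  # → [('P-1001', '2000-07-19')]
```

A report like this, joined to a merge file of names and addresses, is the kind of output SCs fed into their form letters.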
A number of SCs identified a need for operational support for tracking medical record abstracting requirements, assignments, and progress, activities required for diagnostic confirmation of positive screens and self-reported cancers, as well as for treatment information for all confirmed cancers. The CC adopted and expanded an approach already in use at one SC. While not specifically designed for this purpose, export of appropriate data from the SMS enables SCs to use EpiInfo to meet these needs [19].
Randomizing Participants
Randomization and enrollment into the trial is performed by each SC using software provided by the CC. Software is provided only as a run-time version, and confidential data, including blocking factors and personal identifiers, are stored in physically separate, encrypted, password-protected tables. The randomization is stratified by gender and four age groups, with an equal number of participants in each cell [2]. The program provides a blinded, blocked randomization with a randomly selected variable block size. The participant ID and trial arm are not assigned until the SC receives a signed consent, eligibility is confirmed, and the participant has passed the automatic duplicate checking process. Auditing information retained in the SMS includes the person conducting the randomization and the person who verified the participant’s eligibility. In the original system design, the date of birth, gender, and an exact match on name (first, middle, last, suffix) were used to check for potential duplicate participants before randomization. This approach would not identify variations
Table 1  Partial List of Data Forms for the PLCO

Prerandomization data
  Eligibility screener: Age, relevant medical history

Baseline data
  Baseline questionnaire(a): Demographic and medical data
  Baseline locator form(a): Contact information
  Dietary questionnaire: Basic eating habits, food frequency

Periodic screening data (including baseline)
  Annual study update(b): Current health status
  Screening examination forms: Results of each specific test
  Adverse experience report: As needed for both screening and treatment

Other data
  Diet history questionnaire(b): Detailed food frequency data
  Blood collection forms: Data for biorepository and special studies
  Screening exam quality assurance forms: Repeat data on screening tests
  Medical record abstract forms: Diagnostic evaluation and treatment information
  Missing data form: Internal data processing information
  Nonresponse form: Internal data processing information
  Death documentation sheet: Fact of and cause of death
  Death review form: Confirmation of cause of death

(a) Some screening centers opt to administer this form prior to randomization.
(b) Collected from both screened and control participants.
of a name (e.g., F. Scott Fitzgerald and Franklin Scott Fitzgerald) as the same person, however, as happened at one center. The matching algorithm was modified during the pilot phase to match on date of birth, gender, and similar first and last names only, in an effort to capture such duplicates. Two critical items used in the stratification scheme (gender and date of birth) are entered into RAND. To minimize entry errors during interactive randomizations, the user is asked to enter information twice. The RAND module is also designed to allow a “batch” randomization, which allows randomization of many participants at one time.
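The revised duplicate check can be sketched as follows: candidates must match on date of birth and gender, and first and last names are compared for similarity rather than exact equality. The similarity rule and threshold below are illustrative assumptions; the trial's actual matching algorithm is not documented here.

```python
from difflib import SequenceMatcher

# Hypothetical sketch of the revised duplicate check: exact match on date
# of birth and gender, similarity match on first and last names. The
# initial-handling rule and 0.8 threshold are illustrative assumptions.

def similar(a, b, threshold=0.8):
    """True if two name parts are plausibly the same."""
    a, b = a.lower(), b.lower()
    # Treat an initial ("F.") as matching any name with the same first letter.
    if len(a.rstrip(".")) == 1:
        return b.startswith(a[0])
    if len(b.rstrip(".")) == 1:
        return a.startswith(b[0])
    return SequenceMatcher(None, a, b).ratio() >= threshold

def possible_duplicate(new, existing):
    return (new["dob"] == existing["dob"]
            and new["gender"] == existing["gender"]
            and similar(new["first"], existing["first"])
            and similar(new["last"], existing["last"]))

on_file = {"first": "F.", "last": "Fitzgerald", "dob": "1896-09-24", "gender": "M"}
candidate = {"first": "Franklin", "last": "Fitzgerald", "dob": "1896-09-24", "gender": "M"}
print(possible_duplicate(candidate, on_file))  # → True
```

Under the original exact-match rule this pair would have slipped through; the similarity rule flags it for review before a participant ID is assigned.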
Data Entry and Editing System

Entry and editing of the full complement of trial data are supported in the DEES (Figure 2). Due to the volume and need for timely information, data are generally collected on forms designed for an optical-mark reader (National Computer Systems Op-scan 5, model 30 equipped with a bar code reader and an ink-read head). Forms are scanned, automatically loaded into a Paradox database, and preliminary edits on key fields (participant ID, form type, serial number, study year, and sample ID) are automatically produced. Over 20,000 edits are then applied to the data. In addition to edits for valid ranges, logics, consistency, and skip patterns, edits also confirm medical consistency (e.g., diagnosis consistent with reported findings). A listing of some of the PLCO data forms is shown in Table 1. Each data form can be designated “final complete” (passed all edits), “final incomplete” (did not pass edits but no further action can be taken on issues),
or “pending” (has not been edited or failed edits are being resolved). A “final complete” status can be automatically assigned by the computer when the edit program is run if no edit errors are identified. The SMS and DEES systems are not fully integrated even though they are coresident on the same server. Data are scanned, edited, and revised as appropriate in the DEES. A program then transfers clean data (data marked either “final complete” or “final incomplete”) to the SMS, where they are available for expectations reporting and use in trial management monitoring and scheduling. Reports are available to ensure that data are synchronized between the two systems and that outstanding edits are addressed in a timely manner. This process ensures the accuracy of the data in the SMS, which drives subsequent trial activities and reports diagnostic results to participants and primary care physicians outside the trial. This precept also ensures that data shared with the biorepository and the LAB from the SMS are not revised after transmission, which would cause data synchronization problems among collaborators.
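The edit-and-transfer flow can be sketched as follows: range, skip-pattern, and consistency edits are applied to a scanned record, a status is assigned, and only "final" records move to the SMS. The individual rules and field names are illustrative assumptions, not drawn from the trial's actual edit set.

```python
# Hypothetical sketch of the edit-and-transfer flow. All field names and
# edit rules here are illustrative assumptions.

def edit_record(rec):
    """Return a list of edit failures (empty means the record passes)."""
    errors = []
    # Range edit
    if not 55 <= rec.get("age", -1) <= 74:
        errors.append("age out of eligible range (55-74)")
    # Skip-pattern edit: follow-up field must be blank unless lead-in is "yes"
    if rec.get("ever_smoked") == "no" and rec.get("cigarettes_per_day") is not None:
        errors.append("cigarettes_per_day answered but ever_smoked is 'no'")
    # Consistency edit: a positive screen requires a recorded result value
    if rec.get("screen_result") == "positive" and rec.get("psa_value") is None:
        errors.append("positive screen without a result value")
    return errors

def status(rec, no_further_action=False):
    errors = edit_record(rec)
    if not errors:
        return "final complete"  # assigned automatically when all edits pass
    return "final incomplete" if no_further_action else "pending"

def transfer_to_sms(records):
    # Only clean data ("final complete" or "final incomplete") move to the SMS.
    return [r for r in records if status(r, r.get("closed", False)).startswith("final")]

good = {"age": 62, "ever_smoked": "no", "cigarettes_per_day": None,
        "screen_result": "negative", "psa_value": None}
bad = {"age": 62, "ever_smoked": "no", "cigarettes_per_day": 20,
       "screen_result": "negative", "psa_value": None}
print(status(good))                       # → final complete
print(status(bad))                        # → pending
print(len(transfer_to_sms([good, bad])))  # → 1
```

Keeping "pending" records out of the SMS is what prevents unresolved edits from driving downstream scheduling and result reporting.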
Security in Screening Center Systems
The screening center network is designed to ensure the security and integrity of the trial data and is compliant with FIPS 41 and 73 [20, 21]. To reduce vulnerabilities, a number of variables must be well defined and controlled [22]. Areas that require protective safeguards include:

• personnel (controlled entry or movement in the computer area);
• physical objects (logging and cataloging of diskettes, destruction of hard copy containing individual-identifying information);
• procedures (granting access to systems, assigning and changing passwords);
• management oversight (periodic review of safeguards, policy guidance, staff training, unannounced system audits);
• communications (outside incursion, interactions with other systems, networks, and applications);
• software (audit trails, logon procedures, data integrity, viruses);
• hardware (network connections, memory protection); and
• disaster preparedness (sprinklers, off-site backups).
Security and data integrity issues are addressed through the joint efforts of the CC and individual SCs. Physical security for the network is the responsibility of each SC. The servers are maintained in locked areas that are accessible only by authorized personnel. Each SC has developed an internal security plan for the protection of computer systems, including storage of backup tapes in an off-site location. When an SC creates a physical object from either CC-provided applications or from another source to be used with those applications, it assumes control for protecting that object in an appropriate manner. This includes responsibility for the appropriate distribution, storage, and destruction of all reports generated by the systems that carry confidential and/or sensitive data, as well as protected storage and handling of backup tapes that contain complete trial data. The trial MOOP indicates that all of these objects should be logged.
Following use, they should be stored in a locked, restricted-access area or destroyed. One of the most important responsibilities of the SC is control of data access through the assignment of identifiers, passwords, and rights to staff members for access to the computer itself, as well as to applications and backup facilities. Procedures are in place at every SC that specify guidelines for determining staff access rights, including how and when passwords and access rights are assigned and deleted as staff join and leave the project. Guidelines for data access also include training of staff (e.g., do not leave the computer unattended without logging out, do not leave workstations in host mode) and establishing controls on any software that is not provided by the CC. Training also includes invoking virus checks before using foreign disks in software outside of the CC-controlled applications. Data specifically protected from disclosure by the Privacy Act of 1974 are being collected by the SCs [23-25]. As previously noted, identifying information, including name and social security number, is stored in physically separate tables that are encrypted and password-protected. No protected data are stored on the hard drives of workstations. Both workstations and printers are located in areas restricted to authorized personnel. In the event that this information is used in a system report, the temporary report files are similarly protected. At no time is this information transmitted to the NCI, the CC, or other trial collaborators. All reports that include sensitive information carry a warning at the top and bottom of each page cautioning the user regarding distribution and disposal of the hard copy. SCs are responsible for ensuring appropriate distribution of output containing protected data, as well as appropriate storage and disposal of forms and reports containing information linking participants and trial data.
Information, such as date of last update and user ID, used in auditing data is automatically updated but is not visible to users or documented in user materials. This information can be accessed by the CC network administrator and is transmitted to the CC and the NCI along with the associated data each month. A continuous virus scan is enforced against the server and weekly against each workstation to protect against the introduction of viruses, worms, and Trojan horses [15,26]. In addition, the scan is automatically activated whenever a file is imported into the CC-provided applications from another source or whenever a file is created on a floppy disk. Viruses incurred from a variety of sources have been reported by SCs. In each case, the source of the virus has been identified and the system disinfected with no loss of data. Sources of viruses have been files generated from outside systems for use with the SMS, home computer systems where users developed WordPerfect trial documents and memos, and in one case, a laboratory. PLCONet is based on a layered approach to access approvals. In the highest layer, controlled by the CC network administrator, a user is asked for an ID and password during log-in to the network. The system enforces a password change every 6 months for each user, with three grace log-ins after notification. Each ID is associated with access to predesignated group(s), such as the SMS (which carries data protected under the Privacy Act), the DEES (which does not carry any identifiers), an exchange directory that could carry sensitive data
for use in merge files, and a private directory. A “guest” group is also available that allows access only to SC-specific software (such as appointment management systems or applications installed on workstation hard drives) and the printer. When a user selects an individual SMS module, the system requires a second ID and password and may further restrict a user from various sensitive functions (such as randomization, write-access to data, and addition of new users and passwords) or from viewing identifying information. This level of access is under the control of the SC data manager and/or the SC coordinator. For additional security, passwords at this level are alphanumeric and case sensitive. Regular system backups are critical [22,27]. Randomization data are backed up immediately following each randomization session, and automatic backup to tape of all data is enforced daily. One tape is assigned for each day of the week, and each Friday backup is retained for a month. SCs are directed to ensure that backup tapes are stored in a secure, off-site location to protect data in the event of a disaster while maintaining protection of confidential information. In addition, users are provided with a menu option to perform backups on an ad hoc basis to support equipment relocation, concentrated scanning efforts, etc. All data on the backup tapes are protected in the same manner as in the general database. Because the SC networks are administered remotely by the CC and to further protect the integrity of the database, software, and items to which the SC should be blinded, certain functions are not accessible by anyone at the SC. These include most network administration functions, access to the directories carrying the PLCO applications and data structures, and access to areas carrying other software, such as backup and virus-checking programs. Only the network administrator has the authority to add new software to the server or upgrade any of the CC-provided packages. 
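The tape rotation described above, one tape per day of the week with each Friday tape retained for a month, can be sketched as a labeling scheme over a four-week Friday cycle. The label format and seven-day rotation are illustrative assumptions; the trial's actual tape-naming convention is not documented here.

```python
from datetime import date, timedelta

# Hypothetical sketch of the backup tape rotation: daily tapes are reused
# weekly, while Friday tapes cycle over four weeks so each is retained for
# a month. Labels are illustrative assumptions.

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def tape_label(d):
    """Return the label of the tape to use for a given calendar date."""
    day = WEEKDAYS[d.weekday()]
    if day == "Fri":
        # Friday tapes cycle over four weeks instead of being reused weekly.
        week_of_cycle = (d.toordinal() // 7) % 4 + 1
        return f"Fri-{week_of_cycle}"
    return day

start = date(2000, 7, 3)  # a Monday
labels = [tape_label(start + timedelta(days=i)) for i in range(7)]
print(labels)
```

Consecutive Fridays map to different tapes, and the same Friday tape comes back into rotation after 28 days, matching the one-month retention rule.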
While there have been some incidents that have necessitated relaxation of controls, all have been addressed immediately. For example, in relocating the network at one SC, it was necessary to give the data manager the network administration password to shut down the server. The move took place 3 days earlier than planned, without notice, and by the time SC staff called the CC, the telephone lines were inoperable. Immediately following the relocation, passwords were changed and logs monitored for inappropriate access.
Direct Access to the Database

As noted in numerous studies on data quality, allowing SC staff direct access to data tables could potentially lead to security breaches, discernment of table contents (including hidden audits and factors blinded to SCs), and uncontrolled changes to or corruption of data [15, 18, 22]. In fact, while the DEES was in development (it was implemented in phases), the tables were not password-protected. After numerous table corruptions at one SC, it was determined that some staff members were directly accessing the tables using a database management package. When protections were put in place, the frequency of record corruptions was significantly decreased and the source of the corruptions more easily identified.
SCs have two read-only approaches to selecting data for export. The first allows the user to select specific data elements to be exported, as approved by the NCI. Variables that allow discernment of end points (either directly or through inference) are excluded from the candidate fields. Similarly, auditing and other information to which the SC should be blinded (e.g., randomization blocking factors) is also excluded. SCs then use a package of choice to manipulate the exported data for reports or to import the information into other applications for appointment scheduling, etc. The second, more sophisticated approach allows the user to define an ad hoc query, using the Paradox syntax, to retrieve appropriate records from multiple tables, which are then available for reporting or export. The query system allows users to save queries for ongoing use and to tailor reports for presentation. This approach allows users to search and report on data using logical and statistical arguments. Once data are moved to an export file, security becomes the responsibility of SC staff, who must address both access to individual information that is considered confidential and/or sensitive and access to aggregate information that may have an adverse effect on the trial or may be prematurely or inappropriately released outside of the immediate trial collaborators. Periodically, the NCI reviews the security in place for these corollary systems to ensure compliance with federal standards. Because the NCI has instituted a method for accessing data in the distributed environment, SCs are prohibited from copying the data tables and are asked to access information only using the NCI-provided SMS and DEES applications.
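The first export approach, selecting from an approved candidate list with blinded variables excluded, amounts to a field allowlist. The sketch below is hypothetical; the field names and the specific allowed and blinded sets are illustrative assumptions.

```python
# Hypothetical sketch of the field-restricted export: users may request
# only approved fields, and anything revealing end points or blinded
# design factors is rejected. All field names are illustrative assumptions.

EXPORTABLE = {"participant_id", "appointment_date", "exam_year", "form_status"}
BLINDED = {"randomization_block", "cause_of_death", "audit_user", "audit_date"}

def export(records, requested_fields):
    """Return records restricted to approved fields, or refuse the request."""
    allowed = (set(requested_fields) & EXPORTABLE) - BLINDED
    denied = set(requested_fields) - allowed
    if denied:
        raise ValueError(f"fields not available for export: {sorted(denied)}")
    return [{f: r[f] for f in allowed} for r in records]

rows = [{"participant_id": "P-1001", "exam_year": 3,
         "randomization_block": 7, "appointment_date": "2000-06-01",
         "form_status": "final complete", "cause_of_death": None,
         "audit_user": "jdoe", "audit_date": "2000-05-30"}]
print(export(rows, ["participant_id", "exam_year"]))
```

Refusing the request outright, rather than silently dropping blinded fields, makes it obvious to the user which variables are off limits.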
Supporting the Screening Center Systems

A local area network requires expert support to ensure system integrity, routine software and data maintenance, troubleshooting, user support, and diagnostics. The CC and the NCI developed a plan that would substitute CC remote administration for on-site network administration. At the time the plan was developed, remote administration facilities were not included as network functions in either Novell or NT. For administrative activities that required on-site service, the CC negotiated a maintenance contract with a national provider that guaranteed same-day response/24-hour resolution for server difficulties and next-day response/48-hour resolution for workstation and printer problems. SC staff call all problems into the CC user support office, where the issues are investigated and the need for service confirmed. This approach has worked fairly well, although the quality of service varies geographically by SC. In only one case has the server been down for more than 1 day. Most PLCONet administration functions are accomplished by CC staff dialing via modem into one of the SC workstations set into host mode through PC Anywhere (later upgraded to a server link through Wanderlink). This does not require the network to be open to outside access without a deliberate action by SC staff. All network facilities normally available to administrators can be accessed in this way. The administrator can also examine and act on individual workstations using Wanderlink Proxy-Host, which replaced the original package, Closeup.
Design of PLCO Data Management Systems
User guides are available for the SMS, the DEES, and PLCONet through the CC user support office. These documents are supplemented at the time of each quarterly system upgrade. A cumulative index shows all system modifications and enhancements by version. The history of the PLCO is one of constant development to support an expanding protocol. A full training session was held shortly after version 1 of the system was implemented, and tutorials are provided at each PLCO system user group meeting. A prototype for a computer-based training system was installed at each SC to allow the staff to tailor their training programs for new staff by concentrating on specific modules corresponding to staff duties. This training is multiuser, and each staff member can exit at any point, with the option of returning to the same place or starting over. The subject area of the prototype was selected by the systems user group and will be expanded to other functions based on priorities set by the SCs. A resource for real-time questions regarding the use of trial-developed applications is provided through the CC user support office, which is accessible via telephone or e-mail. On any given day, questions may range from how to interpret a report to advice on formulating a query in response to management questions. As the primary liaison with users, the user support staff at the CC respond to and document all questions or system problems that are reported and assist in investigating and resolving issues. Each contact is logged using an internal system available to all CC staff, including programmers, coordinators, and analysts. CC staff prepare requests for modifications and enhancements to the systems through discussions with the SCs. The user support staff may want a user to repeat the steps that seem to be causing a problem while directly observing the system. These needs are addressed using a connection to the workstation through a modem. 
When the user’s workstation is in host mode, the user support staff can demonstrate or provide short training exercises on the functions in question. Similarly, the SC staff can use the system while under the observation of user support staff. To accommodate potential questions involving data transmission, each SC provides two data lines to the network, allowing user support staff to directly observe the process from the second line.
Configuration Management
Regardless of preparation for such a complex trial, changes can be expected both in the protocol and in the operational systems as the trial matures. Configuration management is the methodology for controlling and monitoring changes to all aspects of the systems: commercial software, trial-developed software, hardware, firmware, and telecommunications [28, 29]. Structuring the approach to change ensures that the most important changes are made first and that enhancements and upgrades that are desirable but not required can be considered in turn. It is important that all changes go through the same process, so that each request is addressed in the total context of priority of need and all impacts are considered when planning an upgrade. Revising a single data element can
impact every report using that information, edits and logic on that and associated fields, help messages, selection tables, user manuals, and system documentation. In a distributed environment like the PLCO, the "ripple" effect can extend to communication systems, collaborator systems, and the central analysis file (including update programs, monitoring reports, and interim analysis programs). As the system grows in complexity and involves more configuration items, change can have significant cost impacts. Requests for changes are assigned a tracking number and entered into the CC configuration management system for follow-up from initiation through prioritization, programming, testing, and documentation, ensuring that all steps are completed. Outstanding requests are distributed quarterly to SCs and the CC to gather input on priorities for planning the next regularly scheduled software upgrade. SCs differ, and a change sought by one may have a negative impact on another. Once a plan for what is achievable is in place, it is presented to the NCI for approval. Hardware and software changes are tested prior to implementation on a test system at the CC, according to standard testing practices [30, 31]. The testing analysts also perform version control prior to release of upgrades. Hardware and system configuration issues were somewhat more difficult to test, as the initial hardware was purchased by individual SCs and varied significantly, creating function and support problems. The original computers have now been replaced by equipment purchased and installed by the CC. Updates to the trial applications are generally performed by distributing diskettes or CDs to the SCs, along with associated user documentation. The upgrade itself is usually performed by the network administrator remotely, with an SC staff member inserting diskettes under the guidance of the administrator.
This approach was taken to reduce the on-line time that would be required to download upgrades, which can be quite large and can involve restructuring of tables and generation of automated reports listing data conflicts or other information about changes made to the system. Commercial software is upgraded on a less frequent basis, as approved by the NCI. Software licenses are maintained by the CC, and upgrades are tested in the PLCONet environment. Most hardware upgrades are accomplished yearly, based on the budget for the current contract year. Prior to each upgrade, a detailed implementation plan is developed, including checklists, diagnostics that will be performed on-site, and a schedule for each SC. A preinstallation checklist is particularly important, since SCs must address such issues as space allocation, furniture, and cabling prior to the installation date. SCs are required to use internal facilities support for cabling and to ensure that all local building and fire codes are met, obviating support by the CC.
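The request life cycle described above (initiation through prioritization, programming, testing, and documentation) can be sketched as follows. The stage names follow the text; the data structure and tracking-number scheme are assumptions for illustration, not the CC's actual system.

```python
# Hedged sketch of change-request tracking: every request receives a
# tracking number and must pass through each stage in order, so no step
# (e.g., testing or documentation) can be skipped.

from dataclasses import dataclass, field
from itertools import count

STAGES = ("initiated", "prioritized", "programmed", "tested", "documented")
_numbers = count(1)  # assumed tracking-number generator

@dataclass
class ChangeRequest:
    description: str
    number: int = field(default_factory=lambda: next(_numbers))
    stage: int = 0  # index into STAGES

    def advance(self):
        """Move the request to its next stage; stages cannot be skipped."""
        if self.stage == len(STAGES) - 1:
            raise ValueError(f"request {self.number} is already documented")
        self.stage += 1
        return STAGES[self.stage]
```

Forcing every request through the same sequence is what lets each one be weighed "in the total context of priority of need" before an upgrade is planned.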
Biorepository Automation
Data associated with biosamples retained for long-term storage at the biorepository in Frederick, Maryland originate at the SCs [32]. Ultimately, these data are supplemented with inventory control information and loaded into the biological specimen inventory system (BSI), a system used to track and report on specimens stored at several central repositories associated with hundreds
of NCI studies. The CC developed a receipt and inventory control system for biorepository use to provide the link between the SMS and the BSI. A blood collection form is completed, scanned, and edited at the SC prior to the monthly shipment of samples. The SMS prints a shipping directive, which lists all samples to be shipped to the biorepository. SC staff confirm the list by comparing it to the actual vials and transmit the associated file to the NIH mainframe at the time the vials are shipped. The hard copy is included in the shipping box. Each sample vial is bar-code labeled to facilitate receipt and reduce data entry errors. On receipt of the vials, biorepository staff download the associated file to a stand-alone PC and scan the bar code on each vial. The system automatically identifies any discrepancies for resolution. As vials are scanned, the system automatically assigns inventory locations: freezer, rack, box, and slot. Each storage box is prelabeled with the freezer number, rack number, and box number. The assigned location is prominently displayed on the screen and color-coded to ensure that vials are placed in the correct storage box. Samples from each participant are split among freezers to mitigate the effect of any potential freezer malfunction.
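The location-assignment and freezer-splitting logic might look like the following sketch. The capacities and the round-robin split are assumptions for illustration, not the actual BSI or receipt-system code.

```python
# Assumed sketch: assign freezer/rack/box/slot sequentially within each
# freezer while alternating a participant's vials across freezers, so a
# single freezer failure cannot destroy all of one participant's samples.
# Capacity figures are illustrative.

BOXES_PER_RACK = 10
SLOTS_PER_BOX = 81  # e.g., a 9 x 9 storage box

def assign_locations(vial_ids, n_freezers=2):
    """Return {vial_id: (freezer, rack, box, slot)}, all 1-based."""
    next_slot = [0] * n_freezers  # running slot counter per freezer
    locations = {}
    for i, vial_id in enumerate(vial_ids):
        f = i % n_freezers  # round-robin split across freezers
        s = next_slot[f]
        next_slot[f] += 1
        rack, rem = divmod(s, BOXES_PER_RACK * SLOTS_PER_BOX)
        box, slot = divmod(rem, SLOTS_PER_BOX)
        locations[vial_id] = (f + 1, rack + 1, box + 1, slot + 1)
    return locations
```

With two freezers, a participant's first and third vials land in freezer 1 and the second in freezer 2, so no single malfunction removes all samples for that participant.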
Laboratory Automation
The central LAB, where CA125 and PSA assays are done for all SCs, is a segregated element of the UCLA Tissue Typing Laboratory. It has separate, secured facilities and does PLCO work only. Similar in function to the biorepository receipt system, the LAB receipt system was developed by the CC to facilitate receipt of blood samples from SCs and downloading of test results (PSA and CA125) to SCs for subsequent reporting to participants and primary care physicians. It is installed on a small local area network (LAN) used by the LAB for the PLCO only. This network also supports a LAB analysis system, which carries all data associated with the test analyses. Security on the network is established by the LAB computing staff and includes administrator-controlled restriction of access to specific directories and password protections. A communications server on the network is accessible by SCs, who use PC Anywhere Remote to connect. The server itself uses PC Anywhere LAN-Host and carries only files ready for download. No confidential information is available from the LAB; all data are identified through a vial ID number only. Once it is confirmed that all data associated with the vials are present, an export file is generated from the receipt system for use with the LAB analysis program. A minimum of two analyses per sample are performed, and the values are averaged to obtain a final result. Based on the results of the first two tests and quality control standards established for the tests, additional analyses may be run and incorporated in the final result. The LAB analysis system retains all results, including quality control runs and calculated final test values. CA125 and PSA assay values are exported to a file for download to the associated SC. SCs access the LAB network weekly through a series of passwords and select a menu item to download results. SCs are restricted to accessing only their own results.
Results files are loaded into the SC SMS, with any discrepancies flagged for resolution. Participant ID, draw date, trial year,
visit number, and sample ID for each record are matched between the LAB file and the SC database. If any one item does not match, the record is not updated. Mismatches can result from data errors or corruptions in either the LAB or SC database or from revisions to the SC database that took place after transmission to the LAB. Errors are investigated, and the appropriate dataset is revised. The downloading process is repeated, ensuring that corresponding information is present in all systems. Errors and revisions are documented on discrepancy notification forms, which are reviewed to identify processing, protocol, or systems issues. These forms also serve as the formal documentation for data changes made by the LAB or the SCs.
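The all-or-nothing matching rule can be sketched as below; the field names and result key are illustrative, not the actual SMS schema.

```python
# Sketch of the five-item match: a LAB result is applied to the SC record
# only if every identifying field agrees; otherwise the record is left
# untouched and routed to discrepancy resolution. Field names are assumed.

MATCH_KEYS = ("participant_id", "draw_date", "trial_year",
              "visit_number", "sample_id")

def apply_result(sc_record, lab_record):
    """Update sc_record in place on a full key match; return True on success."""
    if all(sc_record.get(k) == lab_record.get(k) for k in MATCH_KEYS):
        sc_record["assay_result"] = lab_record["assay_result"]
        return True
    return False  # any single mismatch blocks the update
```

Requiring all five items to agree is what allows the download to be repeated safely after either side corrects its data: only fully reconciled records are ever updated.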
Processing Forms Centrally
Although most forms are processed at the SCs, the dietary questionnaire and the health status questionnaire are distributed and collected by the SCs but sent to the CC for entry and analysis. Shipment and receipt of these forms are similar to those for biological specimens sent to the LAB or the biorepository. Upon receipt at the data entry location, dietary questionnaires are scanned using an optical-mark reader. Only minimal edits are performed centrally at this stage, including verification of key fields, checks for duplicate entry, and a check that no more than ten items are missing per page. Health status questionnaires are keyed, verified, and edited using the CC's proprietary codebook editing system. Periodically, all files are compared to ensure that data at the SCs correspond to centrally stored files.
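The missing-data check reduces to a small predicate, shown here as an assumed sketch (the page representation, a sequence with None standing for a missing item, is an illustration only):

```python
# Sketch of the central missing-data edit: a scanned page is flagged when
# more than ten of its items are missing (represented here as None).

MAX_MISSING_PER_PAGE = 10

def page_acceptable(items):
    """True when the page has at most ten missing items."""
    missing = sum(1 for v in items if v is None)
    return missing <= MAX_MISSING_PER_PAGE
```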
INCORPORATING QUALITY ASSURANCE
Data integrity refers to the accuracy of the data in the computer, as addressed through error prevention and detection [9]. In addressing data integrity, it is important to recognize that each piece of data has only one "owner" (source); that is, only one organization is responsible for the accuracy of that information. Throughout the PLCO systems, no one other than the data owner is allowed to add, modify, or delete any information. For example, the test results from PSA and CA125 are downloaded from the LAB to each SC. SCs have read-only access to these data, allowing them to view or print but not modify or delete data. As the data source, only the LAB can modify the information. In the event that an error is detected, the LAB must correct its database and follow the usual download procedures to correct the information in the SC system. The CC and the NCI do not have the right to change any data. As issues arise during quality assurance reviews, the discrepancies are resolved by the data owner and retransmitted to the NCI. Similarly, the biorepository notes whether a sample was received or not, but cannot modify sample information in the NCI applications. There are rare occasions when a data owner may formally request that a correction be made by the CC. These situations may pertain to protocol violations, as when randomization is based on an incorrect date of birth or gender. The system will not allow SCs to make these needed corrections. A data owner
must request the CC to make the needed change, and each revision is documented with a formal request that includes the date of revision and the names of the persons requesting and making the change. Sources are responsible for validation and certification of the data they generate. Error prevention is designed into the data collection, management, and dissemination procedures and includes:

• Standardized procedures. When procedures for collecting, editing, and reporting data are clearly understood and universally followed, variability in the data is reduced.

• Optical scanning/double entry. Use of bar-code identifiers for participant IDs and specimen IDs in conjunction with check digits significantly reduces errors at the point of entry. Optical-mark scanning introduces a lower entry error rate than traditional keying and mitigates the need for double entry of data [33]. Data that are not scanned and are designated as "critical" by the NCI are double keyed, using traditional approaches.

• Interactive edits. Edits conducted while the user is entering data can help to identify keying errors for variables that are not double entered. While most of these types of errors could be identified later (error detection), timeliness of response can be an issue. In addition, error detection is sometimes conducted much later in the process. If a long time has elapsed since data collection, the quality of the data obtained in recontacting the respondent becomes questionable [34].

• Security. Password control, read-only access to files, function checks, and use of software executables rather than source code at the SC level ensure that data are entered into the system only under the specified and controlled conditions of the protocol.

• Administration. Backups, virus checks, disk checks, and network diagnostics help to identify problems before data can be affected. Even when data have already been impacted, these procedures minimize problems and allow early correction.
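The check-digit idea above can be illustrated with a simple weighted mod-10 scheme. The text does not specify the scheme actually used on PLCO bar-code labels, so the weighting here is purely an assumption; the point is that a single mistyped digit invalidates the ID at entry.

```python
# Illustrative weighted mod-10 check digit (not the actual PLCO scheme):
# the final digit makes the weighted sum divisible by 10, so any single
# digit error is detected when the ID is keyed or scanned.

def check_digit(id_digits):
    """Compute the digit that completes id_digits to a valid ID."""
    total = sum((i % 2 + 1) * int(d) for i, d in enumerate(id_digits))
    return str((10 - total % 10) % 10)

def is_valid(full_id):
    """Check a full ID: last character must equal the computed check digit."""
    return full_id[-1] == check_digit(full_id[:-1])
```

For example, the six-digit body "123456" would be labeled "1234567"; mistyping any one digit produces an ID that fails validation before the record is ever stored.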
Confirmation of a number of PLCO protocol requirements is automated. These include confirmation of informed consent prior to receipt of data forms, confirmation of screening examination final results based on detailed findings, and rejection of screening exam data for control participants. An attempt to process a duplicate form for a participant results in errors that cannot be corrected by the SC. This safeguard ensures that the source of labeling errors is explicitly identified before any new receipts or changes to existing data are made. In the event that the correct ID cannot be unambiguously determined, all impacted data are removed from the analysis datasets. Ensuring that the data are used on a regular basis by the people responsible for maintaining the information in the database is essential to error detection. PLCO applications encouraging data review include reports designed by SCs for monitoring, expectations reports detailing upcoming events for individual participants, delinquency reports, the ability to query the database in support of clinical operations (reducing the need to pull manual files), and a way to export information for use with SC-specific clinic management systems. Traditional detection measures are implemented throughout the PLCO systems. Edit reports, the cornerstone of detection techniques, check both intra-
and interform errors. Over 15,000 edits in the DEES check for consistency, reasonableness, and questionnaire logic. Other reports cross-check data (e.g., between the LAB, DEES, SMS, biorepository, and central forms files) and identify inconsistencies and discrepancies. No assumption is made concerning which system carries the "correct" data; all parties are notified and asked to confirm their information. The PLCO systems rely on parameter-driven database management systems that use a data dictionary to control the attributes of each data element. Interactive edits are controlled through this data dictionary, as well as through individually programmed checks. Batch edits include both interform edits (those that check for consistency among different forms for an individual) and intraform edits (those that ensure that a form is internally consistent). Edits include both errors (inappropriate skips, impossible values, range errors, between-field errors, etc.) and warnings (values that are unlikely, but not impossible). The results of edits are listed in hardcopy reports for review by the SC. An audit is kept, showing when the record was subjected to an edit and whether it failed to pass all edits. Once a record is found to be "clean" with regard to system edits, it is not reviewed again by the program unless the record is updated. If it has been updated, the indicator showing that it passed all edits is automatically removed and the record is once again included in the edit cycle. Further error detection programs include:

• Data sent monthly from the LAB to the NCI are cross-checked automatically against data sent to the NCI from SCs.

• Files from SCs showing which blood specimens have been shipped to the biorepository and the LAB are compared automatically during receipt of specimens. Discrepancies are accumulated in a separate file for resolution.

• Files from SCs showing which forms have been sent for central processing are compared automatically during receipt of the forms. Discrepancies are accumulated in a separate file for resolution.

• A number of reports are available, identifying forms and activities that have not been completed for individual participants.
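The "clean" indicator behavior, where a record that has passed all edits is skipped until it is next updated, can be sketched as follows. The record structure and the example checks are assumptions for illustration, not DEES code.

```python
# Sketch of the edit cycle: records carry a "clean" indicator that is set
# when all edits pass and automatically cleared on any update, returning
# the record to the edit cycle.

class Record:
    def __init__(self, data):
        self.data = dict(data)
        self.clean = False  # has not yet passed all edits

    def update(self, **changes):
        self.data.update(changes)
        self.clean = False  # updated records re-enter the edit cycle

def run_edits(records, edit_checks):
    """Apply (check, message) pairs to every non-clean record; return failures."""
    failures = []
    for rec in records:
        if rec.clean:
            continue  # passed all edits previously and unchanged since
        errors = [msg for check, msg in edit_checks if not check(rec.data)]
        if errors:
            failures.append((rec, errors))
        else:
            rec.clean = True
    return failures
```

Skipping clean, unchanged records keeps repeated edit runs cheap even with tens of thousands of checks, while guaranteeing that every modification is re-edited.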
Another approach taken in the PLCO to error detection involves both system and procedure audits. Audits objectively measure and report compliance with the protocol, standards, and requirements [35]. Audits are also performed to ensure that computer systems are operating as required and indicate whether data integrity is compromised through system or security breaches, inadequate error detection and prevention, or programming errors.

CENTRAL ANALYSIS FILES
All of the data, regardless of the source, are transmitted to the NIH mainframe computer. The CC accesses these files for monthly monitoring, progress reports, and quality assurance monitoring. SAS files correspond to each data table from every source. All data from every source, including auditing variables, are available on the mainframe, with the exception of personal identifiers from SC systems. These documented files provide the basis for generation of internally consistent reports used in the PLCO's Monitoring and Advisory Panel reviews and for all analyses of PLCO progress and publications.
STATUS

The PLCO systems continue to expand. New modules are under development to support pathology collections, death data review and submission of records to the National Death Index, and etiologic and early biomarker studies.
REFERENCES

1. Gohagan JK, Prorok PC, Hayes RB, et al. The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial of the National Cancer Institute: History, organization, and status. Control Clin Trials 2000;21:251S-272S.
2. Prorok PC, Andriole GL, Bresalier RS, et al. Design of the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. Control Clin Trials 2000;21:273S-309S.
3. Guidelines for Documentation of Computer Programs and Automated Data Systems. Federal Information Processing Standards Publication 38. U.S. Department of Commerce, National Bureau of Standards; 1976.
4. O'Brien B, Nichaman L, Browne JEH, et al. Coordination and management of a large multicenter screening trial: The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. Control Clin Trials 2000;21:310S-328S.
5. Westat, Inc. Functional Requirements for the Study Management System Documentation: PLCO Cancer Screening Trial; 1993.
6. DeMarco T. Structured Analysis and System Specification. New York: Yourdon; 1978.
7. Flournoy N, Hearne LB. Quality control for a shared multidisciplinary database. In: Liepins GE, Uppuluri VR, eds. Data Quality Control: Theory and Pragmatics. New York: Marcel Dekker; 1990.
8. Hosking JD, Newhouse MM, Bagniewska A, et al. Data collection and transcription. Control Clin Trials 1995;16(Suppl):66S-103S.
9. Morganstein DR, Hansen MH. Survey operations processes: The key to quality improvement. In: Liepins GE, Uppuluri VR, eds. Data Quality Control: Theory and Pragmatics. New York: Marcel Dekker; 1990.
10. Anderson G, Hosking J, Borchers B, et al. Distributed data management systems: Making them do more of the work for everyone. Presentation at the Society for Clinical Trials Pre-Conference Workshop. Control Clin Trials 1996;17(Suppl):5S.
11. Black D, Molvig K, Bagniewska A, et al. A distributed data processing system for a multicenter clinical trial. Drug Info J 1986;20:83-92.
12. Gassman JJ, Owen WW, Kuntz TE, et al. Data quality assurance, monitoring, and reporting. Control Clin Trials 1995;16(Suppl):104S-136S.
13. Blumenstein BA. Verifying keyed medical research data. Stat Med 1993;12:1535-1542.
14. Hosking JD. Distributed data management systems: Approaches to data capture and entry. Presentation at the Society for Clinical Trials Pre-Conference Workshop, Pittsburgh, PA; 1996.
15. McFadden ET, LoPresti F, Bailey LR, et al. Approaches to data management. Control Clin Trials 1995;16(Suppl):30S-65S.
16. Drabik M. Use of the Internet in conducting clinical trials: AASK. Society for Clinical Trials Annual Meeting, May 1996.
17. Keynote Systems Inc. Keynote Systems clocks true speed on the Internet highway. 1997. http://www.keynote.com/press/html/97oct21.html.
18. Gassman J. Advantages gained from distributed data management in the AASK Blood Pressure Trial. Presentation at the Society for Clinical Trials Pre-Conference Workshop, Pittsburgh, PA; 1996.
19. Centers for Disease Control, World Health Organization. Epi Info: Software for Word Processing, Database, and Statistics Work in Public Health. v. 6.04; 1996.
20. Computer Security Guidelines for Implementing the Privacy Act of 1974. Federal Information Processing Standards Publication 41. U.S. Department of Commerce, National Bureau of Standards; 1975.
21. Guidelines for Security of Computer Applications. Federal Information Processing Standards Publication 73. U.S. Department of Commerce, National Bureau of Standards; 1975.
22. Wood CC, Banks WW, Guarro SB, et al. Computer Security: A Comprehensive Controls Checklist. New York: John Wiley & Sons; 1987.
23. Privacy Act of 1974, P.L. 93-579; 5 U.S.C. §552a.
24. Donaldson MS, Lohr KN, eds. Health Data in the Information Age: Use, Disclosure, and Privacy. Washington, DC: National Academy Press; 1994.
25. Duncan GT, Jabine TB, de Wolf VA, eds. Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. Washington, DC: National Academy Press; 1993.
26. Cady GH, McGregor P. Mastering the Internet. Alameda, California: Sybex; 1994.
27. Meinert CL, Tonascia S. Clinical Trials: Design, Conduct, and Analysis. New York: Oxford University Press; 1986.
28. Royer TC. Software Testing Management: Life on the Critical Path. New Jersey: Prentice Hall; 1993.
29. Boehm BW. Software Engineering Economics. New Jersey: Prentice Hall; 1981.
30. Myers GJ. The Art of Software Testing. New York: John Wiley & Sons; 1979.
31. Beizer B. Software Testing Techniques. New York: Van Nostrand Reinhold; 1990.
32. Hayes RB, Reding D, Kopp W, et al. Etiologic and early marker studies in the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. Control Clin Trials 2000;21:349S-355S.
33. Employee Benefit Scanning. Minneapolis, Minnesota: National Computer Systems; 1995.
34. Groves RM. Survey Errors and Survey Costs. New York: John Wiley & Sons; 1989.
35. LePage NJ. Data quality control at United States Fidelity and Guaranty Company. In: Liepins GE, Uppuluri VR, eds. Data Quality Control: Theory and Pragmatics. New York: Marcel Dekker; 1990.