COMPUTERS AND BIOMEDICAL RESEARCH 8, 222-231 (1975)

A Method of Updating a Cancer-Treatment Data Base System

G. MCCLATCHIE

Cancer Registry, Royal Prince Alfred Hospital, Camperdown, N.S.W. 2050, Australia
Received April 1, 1974
A means of updating a statistical cancer-treatment data base is presented. The data base has been programmed to record valid survival durations and recurrence patterns for multiple treatment courses, accentuating surgical and radiotherapeutic techniques. The development of the method from that used in typical medical records is outlined, emphasising relevant data processing principles. The logical conventions used are presented in detail.
Copyright © 1975 by Academic Press, Inc. All rights of reproduction in any form reserved. Printed in Great Britain.

INTRODUCTION

This system records data at multiple points of contact for each person. The system therefore has an important update phase in which existing records in the file are updated and in which the records of those persons who have died or been lost are transferred to an inactive (nonupdateable) master file.

The development of the system begins with the specification of the record structure. From this, the entry of data from initial contacts follows fairly routinely. The main problem that arises is the reciprocal effect on system efficiency of the update and output phases of the system. To illustrate this relationship, the following breakdown of the output (or data retrieval) procedure is used:

(1) Identification: Find the record-location containing the required type of data item, e.g., “Adjacent involvement.”
(2) Relevancy: Is the value of the data item stored at a location relevant? For example, does it refer to the correct treatment?
(3) Priority: If multiple relevant stored values exist with different a priori error probabilities, what is the correct error probability to be assigned to the stored value at output?
(4) Validity: Does the stored value satisfy system constraints on data validity? For example, do the number and sequence of the recorded treatments fall within defined limits?
(5) Logical Testing: Does the stored value have the required relation to a specified value, say equal to “skin”?
(6) Entry: If yes to the above, update the appropriate element of the output matrix.

The existence of an update phase in a system complicates (1), (2), and (3). System types will be characterized, in part, by how these three operations are affected by the mode of data input and by the system constraints placed on data validity. To illustrate these ideas, two system types are briefly discussed to provide an introduction to this system.

(1) HOSPITAL UNIT MEDICAL RECORD

Here, the file consists of a number of physically separate, loosely structured, open-ended records. A person retrieving data identifies data items partly from a knowledge of the relative record-location of the data items, partly by recognizing “identity” markers such as the colour and shape of the report sheets, and partly from an inspection of the values stored. To determine the relevancy of the data item, use is again made of the record-location of the data item and the values stored there. In addition, use is made of (usually nonspecific) “linkage” markers, such as dates, which do not, alone and immediately, give the relevancy of the data item. Data item priority involves the retrievalist in such judgements as giving priority to pathology tests over cytology tests, or priority to bone scans over conventional X-rays. Such a system, if it could be programmed, which is doubtful, could properly be termed an input-dumping system.

NONINTERACTIVE UPDATE-SEQUENCING SYSTEMS
In this common type of system, the update increments are entered so that the record consists of the initial data plus an ordered sequence of the update increments. It is noninteractive in the sense that the structure and content of the data added to the record are independent of the data already stored in that record. In the present situation, with its multiple update contacts, this would result in the storage of much redundant data with increased storage space requirements. Data item identification would present little difficulty. However, the relevancy of the data item values is not necessarily sequential; a response parameter may, for example, refer to some treatment other than the last given. Further, data item priorities are again not necessarily sequential, in that the last value entered for a particular data item may not have the smallest prior error. Hence, at output, it would be necessary to do a pair-wise comparison of all stored values of a data item. It would be easy to visualize situations where even this type of search would not provide the desired output. In general, the output-optimizing sequence is not of the update increments, but of the multiple values of a single data item (if such exists) and of similar and linked data items. Lastly, the record is open-ended, resulting in a number of programming complications. It is contended that this type of system is inapplicable to the present data base.
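The priority problem described for this type of system can be illustrated with a minimal sketch. It assumes a record held as an ordered list of stored values, each carrying a hypothetical a priori error probability; the class, the field names and the sample values are illustrative only and are not taken from the system described in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StoredValue:
    item: str           # data item name (hypothetical)
    value: str          # coded value as entered
    prior_error: float  # assumed a priori error probability
    increment: int      # 0 = initial data, 1, 2, ... = update increments


def last_entered(record: List[StoredValue], item: str) -> Optional[StoredValue]:
    """Return the most recently entered value of a data item."""
    matches = [v for v in record if v.item == item]
    return max(matches, key=lambda v: v.increment) if matches else None


def lowest_error(record: List[StoredValue], item: str) -> Optional[StoredValue]:
    """Scan every stored value of a data item and keep the smallest prior error."""
    matches = [v for v in record if v.item == item]
    return min(matches, key=lambda v: v.prior_error) if matches else None


# An append-only record: the later increment is the less reliable one.
record = [
    StoredValue("cell type", "SQU", 0.10, 0),
    StoredValue("liver involvement", "positive", 0.05, 1),
    StoredValue("liver involvement", "query", 0.40, 2),
]
```

Here `last_entered(record, "liver involvement")` returns the value from increment 2, whereas `lowest_error` returns the value from increment 1, so every output request has to rescan all stored values of the item.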
AN INTERACTIVE INTEGRATED OUTPUT-OPTIMIZING SYSTEM

Here, the update data is assessed in the context of the data already stored, appropriately modified and then entered in the record, all being performed with the aim of optimizing output. Data item identification at output is now solely location-dependent, consisting, for a particular data item, of a search of the storage area allocated to it. Data item-value determination is now dependent solely on the recognition of specific “linkage” markers at adjacent locations; these markers are added at update. For neither of these two output operations is an inspection of the actual stored data item-values needed. Data item-value priority determination is eliminated from output by having it performed at update.

The record now consists of a fixed number of alpha-numeric data items, 400 in number, with an absolutely defined structure. This record structure has little resemblance to that of either the initial input or the update increments. The relevant aspects of this system are the subject of the remainder of this paper.

EXTERNAL FORMAT OF UPDATE INCREMENTS
It is estimated that there will be, for each person, an average of 15 significant contacts, from each of which data will be read into the file. This phase of the system must:

(1) Provide the current record status for reference by the person writing the update increment. This is often helpful and occasionally necessary.
(2) Use an increment format that:
    (a) Is easily learned by the coder who has not programmed the system.
    (b) Allows efficient card punching of the increment.
    (c) Allows the efficient quantitative recording of the clinical data.
(3) Produce an increment that is compatible with the logic of the system.

Once the logic of the system is fixed, the mode of entry of the update increments involves a detailed analysis of the effect of the arrangement of the data fields on the above. These considerations are, in a sense, routine, but are nevertheless of critical importance in the efficiency of the system. There are many different possible approaches. The one at present adopted in this system has each update increment consisting of:

Identity data             Obligatory, same for all increments for same person.
Date of contact           Obligatory.
Current disease status    Obligatory.
Current treatment         If and only if signalled.
Error correction          If and only if signalled.
Post mortem data          If and only if signalled.

Figure 1 shows the recording of typical follow up data for a hypothetical person. Codes where possible are mnemonic. Here, 2 = “nil”, Y = “unknown”, W following blanks = “nil”, X following blanks = “unknown”. P in the last four columns means
that the signal for one of the last three items in the above listing is Positive.

FIG. 1. External format for a hypothetical person. For discussion, see text.

The data therefore records:

Line 1. There is a clinically Certain Presence of the primary lesion. Surgical Excision of the primary Tumour was done, with nil post operative complications and with Query completeness of excision of the primary. There is also an error correction for position 025 in the file, at which the correct entry for the cell type is “SQU”.
Line 2. No disease at any site is present on this occasion.
Line 3. Multiple adjacent involvement and metastatic CNS involvement are recorded, both regarded as clinically certain (“A”). The patient died and post mortem data is recorded.
P.M. Data. The site of involvement is mnemonically coded. The preceding single letter code is a disease-location parameter: M, Multiple; E, Extensive; B, Bilateral; and so on.

This coding is performed by a person familiar with the system; hence the “open” nature of some of the data, though inelegant, causes no trouble. Most data, except
for the occasional situation where the data spills over onto another line, can be recorded in one line. This allows data from 19 update contacts to be recorded on one form, on the other side of which is recorded the data from initial contacts. This is the only permanent external file in the system.

PROCESSING AND ENTRY OF UPDATE INCREMENTS
Figure 2 is a teletype print out of the record after the update data has been entered; the boxed items are derived from the update increments. These data items have been extensively relocated. For example, the second treatment, a surgical excision of a recurrence of the primary lesion, is relocated to the first of two data-item groupings on line 14 of the print out. The determination of data item identification is thus limited to a scanning of the identity location (here the second location in the group) for each of these two groups. The data item value relevancy (to the treatment history) is indicated immediately by the value stored at the preceding location.

This print out gives an indication of the amount of space necessary to record update data. The zeroes and colons represent blank, unused locations, most of which are allocated to the recording of update data. Note also that alphabetically coded data is stored as such and is not converted to numeric values. The system could be extended to accept data in a different format, provided it can be operated on by the same logic.

TREATMENT MODALITIES
The system recognizes three treatment modes: surgery, radiotherapy and chemotherapy. These modes have different parameterizations. On any one occasion, each mode may or may not occur; if occurring, it may be with any of the other two modes. Further, the modes are, at times, treated differently. For example, after a curative surgical treatment, any positive observation of the lesion at the site treated is regarded as a recurrence; however, for either curative radiotherapy or chemotherapy, such an observation is not regarded as a recurrence unless a prior absence (after the treatment) has been observed.

RESPONSE PARAMETERS RECORDED
In this system, these are:

(1) Remote responses.
    Life survival.
    First recurrence pattern.
(2) Immediate responses.
    Statim responses (e.g., recovery of function).
    Toxicity.
    Palliation.
(3) Nil.

Some of these responses are merely the direct entry of the appropriate data into
the record. Examples are immediate palliative and toxic effects. Others are calculated by appropriate data comparisons. For example, life survival for a given curative treatment is calculated by a programmed comparison of the relevant dates, the survival interval being progressively increased at each update. Another example is the first recurrence pattern, determined by a programmed comparison of the current disease (or post mortem) status with that at the preceding update; this consists of the following parameters:

(1) Site of first recurrence for the xth curative treatment.
(2) Grading of first recurrence for the xth curative treatment.
(3) Laterality of the first recurrence for the xth curative treatment.
(4) Duration of known absence of the recurrence.
(5) Duration to first recorded appearance of the recurrence.
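The date comparisons behind life survival and behind parameters (4) and (5) can be sketched as follows. This is only an illustration under stated assumptions: intervals are counted in days, each contact is reduced to a date and a status of "zero", "positive" or "unknown", and the function names are hypothetical. It also ignores the mode-dependent recurrence rule described under TREATMENT MODALITIES.

```python
from datetime import date
from typing import List, Optional, Tuple

Observation = Tuple[date, str]  # (date of contact, "zero" | "positive" | "unknown")


def survival_days(treatment: date, contacts: List[date]) -> int:
    """Survival interval, extended to the latest contact on or after treatment."""
    later = [d for d in contacts if d >= treatment]
    return (max(later) - treatment).days if later else 0


def recurrence_durations(
    treatment: date, observations: List[Observation]
) -> Tuple[Optional[int], Optional[int]]:
    """Return (duration of known absence, duration to first recorded appearance)."""
    last_absent: Optional[date] = None
    first_seen: Optional[date] = None
    for when, status in sorted(observations):
        if when < treatment:
            continue
        if status == "zero":
            last_absent = when      # latest contact at which absence was observed
        elif status == "positive":
            first_seen = when       # first contact at which recurrence was recorded
            break
        # an "unknown" status changes neither point

    def to_days(d: Optional[date]) -> Optional[int]:
        return None if d is None else (d - treatment).days

    return to_days(last_absent), to_days(first_seen)


treated = date(1974, 1, 1)
history = [
    (date(1974, 3, 1), "zero"),
    (date(1974, 6, 1), "unknown"),
    (date(1974, 9, 1), "positive"),
]
```

With this history, `recurrence_durations(treated, history)` gives 59 days of known absence and 243 days to first recorded appearance, and the survival interval grows each time a later contact date is supplied.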
These parameters are recorded for the primary site, adjacent organ, regional nodes and metastatic sites. Their manner of determination is indicated in Table I.

TABLE I
DETERMINATION OF FIRST RECURRENCE RESPONSE PARAMETERS (a)

New site involvement    Old site of involvement    Response signal
Unknown                 Unknown                    00001
Unknown                 Zero                       00001
Unknown                 Positive                   00000
Zero                    Unknown                    11111
Zero                    Zero                       00011
Zero                    Positive                   11111
Positive                Zero                       11101
Positive                Unknown                    11101
Positive                Positive, diff. site       00000
Positive                Positive, same site        01000

(a) Response signal, Column 1: If 1, Update site of recurrence. Response signal, Column 2: If 1, Update clinical grade. Response signal, Column 3: If 1, Update laterality. Response signal, Column 4: If 1, Update duration of known absence. Response signal, Column 5: If 1, Update duration to first appearance.
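The way a five-column response signal selects which parameters to update can be sketched as below. The column meanings follow the footnote to Table I; the function names, the dictionary representation of a record and the sample values are assumptions made for illustration.

```python
from typing import Dict, List

# Column order as given in the footnote to Table I.
PARAMETERS = (
    "site of recurrence",
    "clinical grade",
    "laterality",
    "duration of known absence",
    "duration to first appearance",
)


def decode_signal(signal: str) -> List[str]:
    """List the parameters that a five-column response signal says to update."""
    if len(signal) != len(PARAMETERS) or set(signal) - {"0", "1"}:
        raise ValueError(f"expected five binary columns, got {signal!r}")
    return [name for bit, name in zip(signal, PARAMETERS) if bit == "1"]


def apply_signal(stored: Dict[str, str], increment: Dict[str, str], signal: str) -> Dict[str, str]:
    """Copy only the signalled parameters from the increment into the stored record."""
    updated = dict(stored)
    for name in decode_signal(signal):
        updated[name] = increment[name]
    return updated


stored = {name: "old" for name in PARAMETERS}
incoming = {name: "new" for name in PARAMETERS}
```

For example, `decode_signal("01000")` yields only the clinical grade, and `apply_signal(stored, incoming, "00000")` leaves the stored record unchanged.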
When there is a recurrence of the disease at some site, the addition of the data to the existing record proceeds in a straightforward manner except in two respects. The first is the manner in which the clinical grade (or, equivalently, the a priori error probability) is updated. In a typical situation where there is, say, metastatic liver involvement, it commonly happens that the update increments will vary as to the clinical grading of such metastatic involvement. This system adopts the convention of recording, as current, the highest clinical grade recorded, no matter when
that recording occurred; no priority is given to the clinical grade in the last update increment. The second problem, which occurs infrequently, is to change the site of involvement or to lower the clinical grade. This can only be done by first “clearing the registers” with a statement of zero involvement and then entering the new data about the current disease-recurrence status. It is for this reason that a statement of zero involvement is given the highest priority in this system. Lastly, it should be noted that an update increment which contains an unknown current disease status does not change the existing record of the current disease-recurrence status.

As data are only recorded at discrete points in time, the exact point at which a recurrence could have been first observed is usually not recorded. This system therefore records two points. The first is the point at which the system first records the recurrence, and the second is the point at which the system last observed an absence of involvement.

CATEGORIZATION OF TREATMENT MODES
The categorization of any treatment mode is determined by which set of responses is to be recorded, as indicated in Table 2. In general, a curative treatment is one for which remote responses (plus treatment toxicity) are to be recorded; palliative treatments are those for which immediate responses (only) are to be recorded; while a maintenance treatment is one which “maintains” the previous curative treatment, itself having no response parameters recorded. The categorization “curative, multiple” is used in such situations where a treatment is repeated on a number of occasions, such as T.U.R.'s or T.U.D.'s for bladder cancers. On any one occasion when a treatment is given, the only allowed heterogeneity of treatment categories is that with a type 1 response occurring with that with a type 3 response; in other words, that of a curative single treatment with a maintenance treatment.

TABLE 2
TREATMENT DEFINITION

                        Response recorded
Treatment type          Survival duration    First recurrence pattern    Once-only responses    Immediate effects
Curative, single        Yes                  Yes                         Yes                    Yes
Curative, multiple      Yes                  No                          No                     No
Palliative              No                   No                          No                     Yes
Maintenance             No                   No                          No                     No
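The treatment definitions of Table 2 and the heterogeneity rule can be expressed as a small lookup, sketched below. The Yes/No values are read from Table 2 in its printed column order; the dictionary layout and the function names are assumptions for illustration only.

```python
from typing import Dict, Iterable, List

# Response sets per treatment type, as read from Table 2.
RESPONSES_RECORDED: Dict[str, Dict[str, bool]] = {
    "curative, single": {
        "survival duration": True, "first recurrence pattern": True,
        "once-only responses": True, "immediate effects": True,
    },
    "curative, multiple": {
        "survival duration": True, "first recurrence pattern": False,
        "once-only responses": False, "immediate effects": False,
    },
    "palliative": {
        "survival duration": False, "first recurrence pattern": False,
        "once-only responses": False, "immediate effects": True,
    },
    "maintenance": {
        "survival duration": False, "first recurrence pattern": False,
        "once-only responses": False, "immediate effects": False,
    },
}


def recorded_responses(treatment_type: str) -> List[str]:
    """Names of the responses recorded for one treatment type."""
    return [name for name, yes in RESPONSES_RECORDED[treatment_type].items() if yes]


def valid_combination(types_on_one_occasion: Iterable[str]) -> bool:
    """Homogeneous categories, or curative single together with maintenance."""
    kinds = set(types_on_one_occasion)
    if not kinds or not kinds <= set(RESPONSES_RECORDED):
        return False
    return len(kinds) == 1 or kinds == {"curative, single", "maintenance"}
```

Under these assumptions, `recorded_responses("palliative")` returns only the immediate effects, and `valid_combination(["curative, single", "palliative"])` is rejected.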
VALID TREATMENT-RESPONSE SEQUENCES

To be accepted by this system, any treatment must consist of a homogeneous combination of treatment modes, as described earlier. Any treatment type may follow a curative treatment; however, once a palliative treatment is given, no subsequent curative treatment can be recorded. A maintenance treatment can only occur with a preceding or concurrent curative treatment which it “maintains”. A curative treatment may apply to any or all of the primary, adjacent, regional node or metastatic sites of involvement. In such an event, the response parameters at these sites are reset, leaving intact the response parameters at the untreated sites. A palliative treatment is regarded as having no significant effect on the set of response parameters being recorded for the previous curative treatment (if any). Finally, an overflow limit exists; this is up to five curative treatments, five palliative treatments and three maintenance treatments for any one person. The situation is summarized in Table 3.

TABLE 3
VALID TREATMENT-RESPONSE SEQUENCES (a)

Current treatment type    Valid immediate     Valid next          Effect on existing response patterns
(as per Table 2)          past treatment      future treatment
Curative, single          1101                1111                Treated sites: all parameters reset. Untreated sites: no effect
Curative, multiple        1100                1110                As above
Palliative                1111                0010                All sites: no effect
Maintenance               1001                1011                No effect

(a) Treatment legality signal: First column: If 1, then curative single treatment is valid. Second column: If 1, then curative multiple treatment is valid. Third column: If 1, then palliative treatment is valid. Fourth column: If 1, then maintenance treatment is valid.

PROGRAMMING CONSIDERATIONS

In this system, programming complexity is not taken as a limiting factor in the development of the system. Two examples are given of how additional “nonessential” programming can improve the efficiency of the system.

Firstly, all operations at programme execution are usually carried out sequentially in the central processing unit (referred to as “core”). However, it is possible to allow the operations of reading in from the storage area, the processing of the data in core, and the writing out onto tape, to occur simultaneously in three different areas (or buffers) in core. This decreases the execution time from the sum of all three operations to that only of the longest operation of the three. In some installations, such multiple buffering may be part of the system software.

Secondly, each data-item value, while being processed in core, has 60 bits allocated
to it, of which only a small number are actually used. However, when the data item-value is being written out onto the storage area, the full 60 bits are stored. It is possible, at the time of writing the data onto tape, to compact the data by discarding unused bits, resulting, in this system, in a five-fold saving of storage space on magnetic tape. This is, therefore, used when writing onto the inactive master tape, as the only occasion where data is accessed from this file is at output, in which case only selected data items need be decompacted.

DISCUSSION
The present status of this project is that of a study of the feasibility of the output-optimization of a statistical data base system of this type. The first cost of this optimization is to considerably increase the time required to develop the system, though not to an extent that would be regarded as excessive outside medical contexts. It may be maintained that a further cost is the loss of a significant amount of information, as the system now places certain constraints on the data. It is contended that this is not so as, firstly, the data is not otherwise accessible and, secondly, in a statistical data base, the system is directed towards answering specific questions rather than aimlessly scanning a body of data to get “significant” results (with undefined significance levels).

It may be further maintained that the input of data, particularly that of the update increments, is complicated to such an extent that its routine widespread application is jeopardized. No practical comment can be made on this at present, although it should be repeated that, at a later stage, it would be possible to offer a range of modes of inputting data into the system.

It is therefore maintained that the present method of updating a cancer-treatment data base is feasible. However, a statistical data base system is a large scale undertaking; therefore such a statement of feasibility should always be interpreted with the possibility of unanticipated large scale effects emerging.

ACKNOWLEDGMENT

The author gratefully acknowledges the assistance of Mrs. J. Bubb, programmer with the C.S.I.R.O. The project was financed with a grant from the Royal Prince Alfred Hospital Cancer Research Fund.

REFERENCES

1. MCCLATCHIE, G. An information retrieval procedure for a cancer-treatment data base system. Comput. Biomed. Res. 7, 1-7 (1974).
2. CODASYL Systems Committee Report, Feature analysis of generalized data base management systems. 1971.
3. MCCLATCHIE, G. A large scale, prospective, computer-dependent survey into the cancer-treatment interaction: A feasibility study. J. Nat. Cancer Inst., August 1974, to appear.
4. “E.D.P. Analyzer”. Vol. 11, June 1973.