Coding Data From Child Health Records: The Relationship Between Interrater Agreement and Interpretive Burden

Elisabet M.H. Hagelin, RN, PhD

The relationship between interrater reliability and interpretive burden when coding information from Swedish Child Health records was studied. Information on preschool children's living conditions, health, and different aspects of care was sorted into one of four groups according to degree of interpretive burden. Two interrater assessments were conducted and compared. The results showed that a low degree of interpretive burden correlated with high interrater agreement. It was possible to increase concordance by coder training, clarifying the definitions of the variables, and refining the coding instructions, but not enough to eliminate entirely the general differences between groups of variables that differed in interpretive burden. Copyright © 1999 by W.B. Saunders Company

From the Department of Women's and Children's Health, Unit of Pediatrics, Uppsala University, Sweden. Supported by grants from the Dalarna Research Fund, the Department of Pediatrics at Uppsala University, the May Flower Foundation, and the Gillbergska Research Fund. Address reprint requests to Elisabet M.H. Hagelin, RN, PhD, Central Child Health Unit, Entrance 92, Uppsala University Children's Hospital, S-751 85 Uppsala, Sweden. 0882-5963/99/1405-0005$10.00/0. Journal of Pediatric Nursing, Vol 14, No 5 (October), 1999.

ASSESSMENT, MEASUREMENT, and documentation of different aspects of a client's health, in clinical work as well as in research, involve the transformation of reality into text and numbers. When gathering data in a clinical setting, the professional has to have the competence and skills to retrieve relevant facts from the client, to observe important signs or symptoms, to measure different health aspects, to give adequate care, and, finally, to transform the information into text, numbers, or codes in the client's record according to documentation instructions. When records are the sources of data, the researcher has to interpret and code the available information onto a protocol and, finally, transfer the data into the computer for further analysis. In the situations described in the present study, there is a risk of errors affecting both the validity and the reliability of the information in the client's record as well as in the research protocol; hence, the quality of data depends both on the instruments used to collect data and on the data collector (Aaronson & Burman, 1994). Validity and reliability are central concepts both in clinical work and in research. The higher the levels of validity and reliability, the greater confidence one may have in the information gathered. In brief, three concepts are involved in the process of translating reality into numbers and text: the construct, ie, the operational definition of the variable; the true score; and the score obtained on the variable. Validity describes the fit between the defined variable and the true score; reliability describes the fit between the true score and the obtained score (Knapp, 1985).

Because information in clinical records is increasingly being used to evaluate the quality of the health care process and for research purposes, there is a need for a better understanding of the magnitude and types of validity and reliability problems that might arise when record information is transformed and transferred to other sources for further analysis (Aaronson & Burman, 1994; Brennan & Hays, 1992; Brown & Semradek, 1992; Polit & Hungler, 1996; Reed, 1992; vonKoss Krowchuk, Moore, & Richardson, 1995). Two aspects of reliability to be considered when constructing a protocol for coding qualitative data are unitizing and categorizing reliability (Bakeman & Gottman, 1992; Guetzkow, 1950). Unitizing reliability is defined as the consistency in identifying what is to be categorized, ie, the amount of information, eg, text, to be included in each unit. The most important factors reported to influence unitizing reliability are the degree of observer inference needed to identify and delimit the unit to be coded, the degree of exhaustiveness of the coding system, and the type or form of data.



Categorizing reliability is the consistency with which the coder handles and labels each unit identified. One factor that may influence categorizing reliability is the clarity with which the definitions of the variables and the interpretation rules are stated (Aaronson & Burman, 1994; Bakeman & Gottman, 1992; Garvin, Kennedy, & Cissna, 1988; Guetzkow, 1950). To test the accuracy of a coding procedure, ie, the similarity of the coders' judgments, interrater judgments are usually made (Bakeman & Gottman, 1992; Garvin et al., 1988; Topf, 1986). Percentage agreement and Cohen's kappa coefficient (K) are the measures most frequently used to compute interrater agreement for nominal and ordinal data (Cohen, 1960; Soeken & Prescott, 1986). Percentage agreement is easy to compute and produces useful information, but it is biased in favor of variables with a small number of categories (Scott, 1955) and does not take the influence of chance into consideration. Cohen's kappa coefficient gives a precise estimate of agreement beyond chance. However, factors such as the number of observations, the number of categories, and the distribution of the data influence the kappa values in such a way that the interrater agreement may be difficult to interpret. On some occasions, percentage agreement may be high although kappa is undefined or rather low (Banerjee & Fielding, 1997; Brennan & Hays, 1992; Feinstein & Cicchetti, 1990; Garvin et al., 1988; Topf, 1986). To be able to interpret and decide whether the interrater reliability level is good enough for the purpose, both kappa and percentage agreement ought to be reported (Brennan & Hays, 1992; Topf, 1986).
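To make the two measures concrete, here is a minimal sketch (in Python, with made-up labels rather than the study's data) of how percentage agreement and Cohen's kappa are computed for two coders' nominal ratings:

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of units the two coders coded identically (Po)."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa, (Po - Pc) / (1 - Pc): agreement corrected for chance."""
    n = len(r1)
    po = percent_agreement(r1, r2)
    m1, m2 = Counter(r1), Counter(r2)
    # Chance agreement Pc: sum over categories of the product of the marginals.
    pc = sum(m1[c] * m2[c] for c in set(r1) | set(r2)) / n ** 2
    return (po - pc) / (1 - pc)

# Hypothetical codes assigned to 10 units by two coders.
coder1 = ["low", "low", "medium", "high", "low", "medium", "low", "high", "medium", "low"]
coder2 = ["low", "low", "medium", "high", "medium", "medium", "low", "high", "high", "low"]
print(percent_agreement(coder1, coder2))       # 0.8
print(round(cohens_kappa(coder1, coder2), 2))  # 0.69
```

With these labels, Po = .80 while kappa is approximately .69, because kappa discounts the agreement expected from the coders' marginal distributions alone.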

Swedish Health Records

In Sweden, as in other countries (Hall, 1996), a Child Health record (CH record) is kept for every preschool child (age 0 to 6 years) attending the preventive Child Health Services (CHS) (National Board of Health and Welfare, 1994). The Swedish record form is used nationally, and the different health professionals, mainly nurses and physicians, all make their notes in the same record. If a child moves from one health district to another, the CH record is mailed to the child health clinic currently responsible for the health care of the child. The record areas are strictly defined and designed to facilitate documentation of demographic information, the child's growth, feeding patterns, development, occurrence of minor and major health problems, and care given within the CHS. The National Board of Health and Welfare (Socialstyrelsen, 1981, 1993, 1994) recommends that information in CH records be used not only in clinical work and quality assessment but also for epidemiological analyses and other research purposes. Hence, it would be of great importance to analyze to what extent different types of information are available and valid and, in addition, whether new judgments made on the recorded information can be transferred reliably to the research protocol. A deeper knowledge might help to improve documentation in clinical practice as well as in research. The aim of the current study was to examine the relationship between the interpretive burden on the coders and interrater reliability when different forms and types of information are transferred from CH records to a research protocol. The aim was also to analyze to what extent it was possible to improve interrater reliability by training the coders, clarifying the definitions of the variables, and elucidating the coding instructions.

METHODS

Participants/Records

Coders

Pediatric nurses with experience in research and in clinical work within the CHS coded the records.

Records

After approval from the Research Ethical Committee at the University Hospital, the names and addresses of 240 preschool children were randomly drawn from a national register of the Swedish population. The children were born between 1982 and 1988. After information that could lead to the identification of the child and its family had been removed, 223 copies (93%) of the children's CH records were sent to the author on request. The sample was used by the author in earlier studies of the completeness and availability of data in CH records from two local Child Health Clinics and a national sample (Hagelin, Lagerberg, & Sundelin, 1991; Hagelin, 1992) and of different aspects of health surveillance (Hagelin, Jackson, & Wikblad, 1998). For the interrater assessments presented here, smaller samples were drawn from the national sample. There are no general recommendations for what proportion of a sample is to be used to assess reliability. For studying behavior, Cordyack-Washington and Moss (1988) suggest a minimum of 10 subjects to be rated to capture variability in data. In the present study, to judge demographic information, health services, and documentation quality, two samples (n1 = 30, n2 = 30) were randomly drawn from the 223 records, ie, 27% of the total sample. To judge health problems (f = 360), which resemble behavior, two new samples (n3 = 10, n4 = 10) were used. The first sample (n3) was systematically chosen to find records containing large amounts of text, including problems. Sample n4 was randomly drawn. The samples (n1, n2, n3, and n4) contained different records, and the samplings were made without replacement.

Instruments and Measures

Two protocols were developed and used in this study: protocol A, to study variables with low and medium interpretive burden (ie, demographic information and health services) and documentation quality, which carried a high interpretive burden; and protocol B, to study variables with high and extrahigh interpretive burden, eg, identifying health problems in large amounts of text and classifying and assessing each problem as to duration, severity, and need for treatment.

Interpretive Burden

Interpretive burden (IB) was defined as the degree of observer inference needed to code information onto the protocols. The grading of IB was roughly comparable to the three levels of coder inference described by Campbell (1958), ie, from duplicating information, through translation or transformation, to sophisticated expert decision-making. The different variables were assigned to one of four main groups according to degree of interpretive burden (ordinal scale): low (copying, counting, and simple coding), medium (gathering and comparing information from different record areas, and complex coding using more extensive coding alternatives), high (scoring of documentation quality; categorization of type, duration, need for treatment, and severity of health problems), and extrahigh (identifying and delimiting health problems in text and other record areas). The types of information, the record areas in which to find the information, the interpretive burden, and the coding procedures used to code data in the two protocols are shown in Table 1.
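Restating the grouping as a lookup (an illustrative sketch; the procedure names follow Table 1, but the code itself is not part of the study protocol):

```python
# Ordinal IB scale (0 = low ... 3 = extrahigh), matching the four groups above.
IB_LEVELS = ("low", "medium", "high", "extrahigh")

# Coding procedures (as named in Table 1) mapped to their degree of IB.
PROCEDURE_TO_IB = {
    "copy": "low",
    "count": "low",
    "simple coding": "low",
    "complex coding": "medium",
    "gather data": "medium",
    "compare information": "medium",
    "personal appraisal": "high",
    "categorization of health problems": "high",
    "assessment of health problems": "high",
    "identification of health problems": "extrahigh",
}

def ib_rank(procedure: str) -> int:
    """Ordinal rank of a procedure's interpretive burden, for rank correlations."""
    return IB_LEVELS.index(PROCEDURE_TO_IB[procedure])

print(ib_rank("personal appraisal"))  # 2
```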

Interrater Agreement

Interrater agreement (IR) was defined as the extent to which the coders had identified the same units (unitizing reliability) and used the same coding alternatives in the coding system (categorizing reliability). It was deemed important to code all coding alternatives of a variable similarly.

Documentation Quality

Documentation quality was appraised from the nurse's subjective standpoint with respect to whether the record contained enough information to, first, make a quick and global assessment of the child's health, development, growth, and feeding and, second, continue the assessment when meeting the child without re-asking about the child's previous history. The scoring instructions for documentation quality were: 0 points = not possible to judge because of missing pages or bad copies; 1 point = poor; 2 points = manageable (good enough); and 3 points = very satisfactory. When health problems were judged, the units of analysis were not records but single health problems, defined here as documented deviations from health, ranging from minor signs or symptoms to clinical diagnoses. All contacts and record areas were reviewed to detect health problems. The numbers of reviewed health visits from which the total number of health problems was derived in the two samples (n3 and n4) were 311 (mean [M] = 30.5; range, 23 to 45) and 315 (M = 27.5; range, 20 to 58), respectively. After a problem was identified in open-ended text and, when appropriate, verified in other record areas, it was marked and numbered to enable comparisons of how the nurses had delimited each unit. After this procedure, each problem was categorized by type and further analyzed in three aspects: (1) duration: <1 month, ≥1 month to ≤1 year, or >1 year; (2) need for treatment: not necessary, recommended, or absolutely necessary; and (3) severity: minor problem = ailment not requiring treatment; moderate problem = disease requiring treatment to avoid future impairment or suffering; severe problem = manifest disability or disease (Magnusson, 1997).
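A sketch of how one coded problem might be represented as a record, using the ordinal categories just defined; the class and field names, and the example category "respiratory", are hypothetical rather than taken from the study protocol:

```python
from dataclasses import dataclass

# Ordinal coding alternatives as defined in the text (severity per Magnusson, 1997).
DURATION = ("<1 month", ">=1 month to <=1 year", ">1 year")
NEED_FOR_TREATMENT = ("not necessary", "recommended", "absolutely necessary")
SEVERITY = ("minor", "moderate", "severe")

@dataclass
class CodedHealthProblem:
    """One unit of analysis: a documented deviation from health."""
    unit_number: int         # the number marking the delimited unit in the record
    category: str            # a category from the classification system in use
    duration: str            # one of DURATION
    need_for_treatment: str  # one of NEED_FOR_TREATMENT
    severity: str            # one of SEVERITY

problem = CodedHealthProblem(
    unit_number=1, category="respiratory",  # hypothetical category label
    duration="<1 month", need_for_treatment="recommended", severity="minor",
)
```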

Table 1. Research Protocol, Types of Information, Coding Procedure, Degree of Interpretive Burden, and Record Area to Find the Information

Type of Information | Coding Procedure | Degree of IB | Record Area

Protocol A
Basic information about the child
  Age | Copying | Low | Column
  Birthweight and length | Copying | Low | Column
  Neonatal complications | Simple coding | Low | Column
  Day care | Simple coding | Low | Column
  Birth order | Count | Low | Column
Basic information about the family
  Parents' age | Copy | Low | Column
  Nationality | Complex coding | Medium | Column
  Occupation | Complex coding | Medium | Column
  Participation in parental education | Simple coding | Low | Column + text
  No. of parental education sessions | Count | Low | Column + text
  Number of siblings | Count | Low | Column
  Psychosocial data | Simple coding | Low | Column
  Number of new addresses | Count | Low | Column
Health services
  Immunizations* | Copy | Low | Column
  Weight and height measurements | Count | Low | Columns
  Health information: nutrition and prevention of accidents | Gather data | Medium | Column + text
  Breast-feeding documented in text | Gather data | Medium | Text
  Breast-feeding documented in column | Simple coding | Low | Column
  Participation in parental education | Simple coding | Low | Column
Documentation quality
  Personal appraisal of documentation quality | Personal appraisal | High | Column + text + chart
  Weight and height measurements correctly plotted in growth chart | Comparing information | Medium | Column + chart

Protocol B
Health problems
  Identification of problems | Interpretation | Extrahigh | Columns + text + chart
  Classification of problems | Interpretation | High | Columns + text + chart
  Assessment of duration, need for treatment, and severity | Interpretation | High | Columns + text + chart

*Diphtheria and tetanus; polio; and measles, mumps, and rubella (MMR).

Procedure

Before the first interrater session, the coders discussed the underlying theme of the study, the information to be coded, and the interpretation rules. The coders tested the protocol by coding and discussing three records in joint collaboration. Then the coders rated the first sample independently. The results of the interrater judgments were computed, analyzed, and discussed by the coders. The interpretation rules were further modified, and the definitions were clarified for variables with low K and Po values. Apart from general clarifications, the following main changes were undertaken to enhance reliability. (1) At the first judgment, the global appraisal of documentation quality was made on the entire record; at the second judgment, the information to be considered for the rating was restricted to three main areas: growth, development, and feeding. (2) At the first rating, each health problem was sorted into one of 36 categories developed by the author and based on the location of symptoms or the function affected. At the second rating, a system for classifying illnesses and health problems in use among Swedish community pediatricians (Svenska Barnläkareföreningen, 1997) was used. This system contained 19 categories. Before the second coding session, the revised protocol and coding instructions were pretested. A sample of new records was rated, interrater agreement was calculated, and the results from the second judgment were compared with those from the first and the differences quantified (K-dif).

Analysis

Degree of agreement for categorizing reliability was measured by calculating Cohen's K, (Po - Pc)/(1 - Pc), and the total proportion of agreement (Po), which equals the percentage agreement divided by 100. Unitizing reliability was measured by calculating the total proportion of agreement. There are different opinions about what levels of K values should be considered acceptable agreement (Knapp & Brown, 1995). To judge the strength of agreement for Cohen's K, the following limits were used: K < .00 = poor; K .00 to .20 = slight; K .21 to .40 = fair; K .41 to .60 = moderate; K .61 to .80 = substantial; and K .81 to 1.00 = almost perfect (Landis & Koch, 1977). To judge the level of agreement for Po, .70 to .79 was deemed necessary, .80 to .89 adequate, and ≥.90 good (Topf, 1986). Changes in agreement between the judgments were quantified by subtracting the K coefficient from the first judgment from the K coefficient from the second (K-dif). To calculate changes between the ratings, 95% confidence intervals for K mean values were used as a significance test. Total proportion agreement and the χ2 test were used in the calculations regarding unitizing reliability and the categorization of health problems, where only one variable was assessed. To test the association between IB and IR, the Spearman rank correlation coefficient (rs) was calculated. An alpha level of .05 was used for all statistical tests.
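Under these definitions, the core computations could be sketched as follows; the kappa arrays and IB ranks are placeholder values, and the normal-approximation confidence interval is an assumption, because the paper does not specify how the 95% intervals for the K means were computed:

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder per-variable results: kappa at each judgment and an IB rank (0-3).
kappa_first = np.array([0.97, 0.88, 0.82, 0.75, 0.48, 0.44])
kappa_second = np.array([1.00, 0.97, 0.93, 0.90, 0.72, 0.87])
ib_rank = np.array([0, 0, 0, 1, 2, 2])

# K-dif: change in agreement between the two judgments, per variable.
k_dif = kappa_second - kappa_first

# 95% confidence interval for the mean K-dif (normal approximation; an
# assumption, since the paper does not state its exact interval method).
m, se = k_dif.mean(), k_dif.std(ddof=1) / np.sqrt(k_dif.size)
ci = (m - 1.96 * se, m + 1.96 * se)

# Association between interpretive burden and agreement (rs and its P value).
rs, p = spearmanr(ib_rank, kappa_first)
print(ci, round(rs, 2), round(p, 3))
```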

RESULTS

The degree of IB correlated with the degree of IR, yielding significant coefficients for both K and Po at the first judgment, rs(n = 102) = -.43, P = .0001, and rs(n = 103) = -.41, P = .0001, respectively, and at the second judgment, rs(n = 102) = -.32, P = .001, and rs(n = 103) = -.30, P = .002, respectively. The distributions of K and Po values for the variables with low, medium, high, and extrahigh IB obtained at the two ratings are shown in Figure 1.

Low, Medium, and High Interpretive Burden

However, this was not the case for the assessments of duration, need for treatment, and severity of health problems; the K values were generally low, whereas Po reached the level necessary in two of the three aspects rated: duration (K = .47, Po = .76), need for treatment (K -----.42, Po = .62), and severity (K = .53, Po = .79). However, the agreement for these variables together reached the levels moderate (K) and necessary (Po). The improvement in coding variables with high IB was statistically significant (confidence interval, .01 to .47). The greatest improvement in terms of high K-dif was in this group. The greatest variations in K values occurred among variables with medium and high 113. This was because of the judgments of whether the weights were correctly plotted in the growth graphs and to the assessments of health problems. Although the agreement increased, the variation remained throughout the judgments (Figure 1).

Extrahigh Interpretive Burden In the two samples, the judges identified 222 (Po = .72) and 138 (Po = .76) health problems, respectively, and the unitizing reliability reached the level of necessary agreement. The slight improvement between the ratings was not statistically significant, • (1, n = 360) = .87, P = .35. The K, Po and K-dif values for the different variables with low, medium, and high IB are presented in Table 2. Statistically significant changes in agreement be-

[Figure 1. Distribution of K and Po values obtained at the first judgment (K1, P1) and the second judgment (K2, P2) for variables with low to extrahigh IB. The horizontal line in the middle of a box marks the median, or 50th percentile. The top and bottom edges of a box mark the quartiles, or the 25th and 75th percentiles. The whiskers extend from the quartiles to the farthest observation not farther than 1.5 times the distance between the quartiles. Extreme values are plotted with individual markers (SAS Institute Inc, 1993). X axis: interpretive burden (low, medium, high, extrahigh); Y axis: K and Po values, 0 to 1.0.]

Table 2. Interrater Reliability (Means, Standard Deviations, K-dif, Minimum and Maximum Values) for Variables With Low to Extrahigh Interpretive Burden

Degree of Interpretive Burden (f of Variables) | Kappa M (SD) | Po M (SD) | K-dif | Kappa Min-Max | Po Min-Max

Low-total group (f = 45)
  First judgment | .87 (.12) | .92 (.07) | | .56-1.0 | .77-1.0
  Second judgment | .96 S (.06) | .97 (.04) | .09 | .73-1.0 | .77-1.0
Low: copy (f = 11)
  First judgment | .97 (.04) | .97 (.03) | | .91-1.0 | .93-1.0
  Second judgment | 1.0 (-*) | 1.0 (-*) | .03 | 1.0-1.0 | 1.0-1.0
Low: simple coding (f = 13)
  First judgment | .88 (.14) | .92 (.08) | | .56-1.0 | .77-1.0
  Second judgment | .97 S (.03) | .99 (.02) | .10 | .91-1.0 | .97-1.0
Low: count (f = 21)
  First judgment | .82 (.09) | .89 (.06) | | .68-1.0 | .80-1.0
  Second judgment | .93 S (.07) | .95 (.06) | .11 | .73-1.0 | .77-1.0
Medium-total group (f = 45)
  First judgment | .75 (.21) | .85 (.13) | | .15-1.0 | .53-1.0
  Second judgment | .90 S (.16) | .93 (.13) | .15 | .30-1.0 | .40-1.0
Medium: complex coding (f = 4)
  First judgment | .99 (.02) | .99 (.02) | | .96-1.0 | .97-1.0
  Second judgment | 1.0 (-*) | 1.0 (-*) | .01 | 1.0-1.0 | 1.0-1.0
Medium: compare (f = 41)
  First judgment | .72 (.21) | .84 (.12) | | .15-1.0 | .53-1.0
  Second judgment | .89 S (.16) | .92 (.13) | .16 | .30-1.0 | .40-1.0
High-total group (f = 8)
  First judgment | .48 (.20) | .71 (.17) | | .16-.78 | .33-.88
  Second judgment | .72 S (.22) | .84 (.12) | .24 | .42-1.0 | .62-1.0
High: personal appraisal (f = 4)
  First judgment | .44 (.03) | .74 (.07) | | .40-.46 | .67-.83
  Second judgment | .87 S (.11) | .92 (.07) | .44 | .74-1.0 | .83-1.0
High: categorization of health problems (f = 1)
  First judgment | .78 (-†) | .81 (-†) | | .78 | .81
  Second judgment | .85 (-†) | .88 S (-†) | .07 | .85 | .88
High: assessment of health problems (f = 3)
  First judgment | .43 (.29) | .62 (.28) | | .16-.74 | .33-.88
  Second judgment | .47 (.06) | .72 (.09) | .05 | .42-.53 | .62-.79
Extrahigh-total group: identification of health problems (f = 360)
  First judgment | | .72 | | |
  Second judgment | | .75 | | |

Note: Statistically significant differences between the two judgments, calculated for K (95% confidence intervals) and for Po (χ2), are marked with S. *No SD because of total agreement. †No SD; only one variable, χ2(1, n = 264) = 17.8, P < .001.

DISCUSSION

The results show that interrater agreement is associated with IB; the higher the IB, the lower the K and Po values in general. The correlation coefficients for K and Po obtained at the first rating were slightly higher than those at the second rating. This indicates that it was possible to decrease the IB on the coders, but not to the extent of totally eliminating the differences between the groups (Figure 1). Factors reported to influence unitizing reliability are the degree of observer inference, the degree of exhaustiveness of the coding system, and the type and form of data; additional factors influencing categorizing reliability are the clarity with which the definitions and coding instructions are described (Aaronson & Burman, 1994; Bakeman & Gottman, 1992; Brennan & Hays, 1992; Garvin et al., 1988). In this study, the categorizing reliability could in general be increased, whereas the unitizing reliability remained unchanged. The coders gained experience between the ratings. In addition, the clarification of the definitions of the variables and the coding instructions, the reduction of the record areas to be appraised, and the reduction of the number of categories to choose from at the second rating all decreased the IB on the coders. However, it is not possible to say to what extent each of these factors contributed to the increased reliability.

Most variables with low and medium IB were of a routine character and were located in specified columns or record areas; a minimum of interpretation was needed. Variables with high and extrahigh IB were of a nonroutine character and had to be searched for in open-ended text and other record areas. They often had to be interpreted in several steps, which required high clinical and research competence. In the case of extrahigh IB, units also had to be delimited and defined by the coders themselves. The ratings of variables with low IB resulted in high agreement at the first assessment. When the IB was medium, the interrater agreement at the first judgment resulted in lower K and Po values than for variables with low IB, but after the second judgment almost all variables reached the levels of almost perfect (K) and good (Po) agreement. Even variables with high IB, ie, appraisal of documentation quality and categorization of health problems, finally reached almost perfect and good agreement. Differences between the judgments may perhaps have been caused by differences between the samples rather than by improved coding instructions. This would mainly have affected the results of the judgments of health problems and documentation quality; the number of problems, the degree of problematic situations, and the documentation quality may have differed. On the other hand, the fact that the number of health problem categories was reduced may have improved agreement, which would probably not have been possible if the original 36 categories had been used. However, at the second rating the health problems were more complicated to categorize, so the influence of sample differences on the improvements was probably not large. In certain cases, ie, duration, need for treatment, and severity of health problems, there was a major discrepancy between the K and Po values. This was mainly caused by large numbers of observations clustering in one category; for example, when severity was assessed, most of the health problems were minor; hence, the clustering was adequate. When observations are unevenly distributed, K values may be unjustifiably low, which has also been reported by other researchers (Banerjee & Fielding, 1997; Brennan & Hays, 1992; Feinstein & Cicchetti, 1990; Garvin et al., 1988; Mezzich, Kraemer, Worthington, & Coffman, 1981; Soeken & Prescott, 1986; Topf, 1986). Because kappa reduces the influence of chance, the K values were generally lower than the Po values, within a range of .10 (Figure 1). To evaluate the strength of agreement, a comparison between K and Po values is recommended (Brennan & Hays, 1992; Topf, 1986). When the discrepancy between the two measures is large (>.10), a check of whether the values are adequately though unevenly distributed is recommended here; if this is the case, one should rely more on the Po values.
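A worked example of this behavior, assuming a hypothetical 2 x 2 table in which both coders place most observations in one category (as when most health problems are minor):

```python
# Two coders rate 100 health problems as "minor" or "not minor"; both place
# 94 of their ratings in "minor", so the marginals are heavily skewed.
#                 coder 2: minor, not minor
table = [[90, 4],   # coder 1: minor
         [4, 2]]    # coder 1: not minor

n = 100
po = (table[0][0] + table[1][1]) / n     # 0.92 -> "good" agreement
p1 = (table[0][0] + table[0][1]) / n     # coder 1 marginal: 0.94
p2 = (table[0][0] + table[1][0]) / n     # coder 2 marginal: 0.94
pc = p1 * p2 + (1 - p1) * (1 - p2)       # chance agreement: 0.8872
kappa = (po - pc) / (1 - pc)
print(round(kappa, 2))                   # 0.29 -> only "fair"
```

Here Po = .92 would be rated good, while K is only about .29 (fair), purely because the marginal distributions are so skewed.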

It was not possible to increase the unitizing reliability between the judgments of health problems with extrahigh IB. Approximately one fourth of the problems were missed, differently delimited, or not interpreted as problems by one coder or the other. Probable reasons for missing problems were the large amount of text to be read and the large number of problems to be identified. Differences in delimiting the description of a problem resulted in one coder breaking a description of symptoms into more items than the other coder and ending up with more problems coded for the same situation. Too wide a definition of the concept "health problem" may have increased the IB; the interpretations may also have differed because the nurses had different coding or clinical experience. This highlights two integrated problems that need a joint solution: first, the difficulty of interpreting problems because they were nonsystematically or inconsistently documented in the records; and second, the difficulty of finding an exhaustive and logical classification system that produces meaningful information. The two classification systems used here were either too detailed or had an inconsistent structure. Reliable interpretation and aggregation of data from health records require the information to be recorded in standard ways (Cimino, 1996). Because one goal of preventive health care is the early detection of risks and health problems, and the children are followed over time, a multilevel classification system allowing for safe recording and coding of children's health status, risk factors, signs, symptoms, and diagnoses of health problems and of the health care process is needed. The records studied were smaller samples drawn from a random sample of CH records from the Swedish population. The relatively large number of records (80/223) used to assess interrater agreement probably led to the detection of the most frequent and common coding problems that occur in a study of health records. The variables studied included many important aspects of child health care: health problems in preschool children, different aspects of care given according to the national surveillance program, and documentation quality. Although the IB and reliability examined in the present study concerned types of information relevant in Child Health Care, the findings may be generalized to other health care systems using similar data and similar coding procedures.


CONCLUSIONS AND RECOMMENDATIONS

The results and practical experience from this study indicate that the degree of IB is related to the coding procedure used, and the factors reported to influence unitizing and categorizing reliability (Aaronson & Burman, 1994; Bakeman & Gottman, 1992; Brennan & Hays, 1992; Garvin et al., 1988) influence both types of reliability. Hence, it is suggested that IB be regarded as the main factor to consider when designing a study based on health records, whereas the other factors mentioned should be considered in relation to IB. As noted by Garvin et al. (1988), unitizing reliability is often taken for granted and therefore not reported. As the results of this study show, both nurses failed to identify a large proportion of problems. Although categorizing reliability may be high, unitizing reliability may still be unsatisfactory. This emphasizes the need for unitizing reliability to be reported. From an organizational and epidemiological perspective, there is a need for knowledge to provide community diagnoses, to follow and describe child health problems and risks, to identify causes and risk factors, and to evaluate health services (Hall, 1996; Kurtz & Stanley, 1995). When preparing for a national CH record, it will be important to decide what information should be used for such purposes and to make certain that this information can be recorded clearly and simply to facilitate the coding procedure. Specified columns and precoded variables should be used as much as possible. Even with more complicated information, it would be valuable if data could be recorded in a systematic and exhaustive way. The structure of the record, the variable definitions, and the recording rules should be agreed on and applied on a national basis. This is also worth aiming at in the case of open-ended text and complex information that is not appropriate for specified areas. A general national model for structured documentation will benefit the children and their parents as well as health personnel and researchers. Further research is needed to evaluate the validity of data in health records and whether it is possible to use an international classification system for nursing practice, eg, the ICNP (Clark, 1996), within child health services.

ACKNOWLEDGMENT

Special thanks go to Dagmar Lagerberg and Claes Sundelin for advice during the research process. The author also thanks Karin Jackson and Margaretha Magnusson for assessing records in their spare time.

REFERENCES

Aaronson, L.S., & Burman, M.E. (1994). Use of health records in research: Reliability and validity issues. Research in Nursing and Health, 17, 67-73.

Bakeman, R., & Gottman, J.M. (1992). Assessing observer agreement. In Observing interaction: An introduction to sequential analysis (first published 1986) (pp. 70-99). Cambridge: Cambridge University Press.

Banerjee, M., & Fielding, J. (1997). Interpreting kappa values for two-observer nursing diagnosis data. Research in Nursing and Health, 20, 465-470.

Brennan, P.F., & Hays, B.J. (1992). The kappa statistic for establishing interrater reliability in the secondary analysis of qualitative clinical data. Research in Nursing and Health, 15, 153-158.

Brown, J.S., & Semradek, J. (1992). Secondary data on health-related subjects: Major sources, uses, and limitations. Public Health Nursing, 9, 162-171.

Campbell, D.T. (1958). Systematic error on the part of human links in communication systems. Information and Control, 1, 334-369.

Cimino, J.J. (1996). Coding systems in health care. Methods of Information in Medicine, 35, 273-284.

Clark, J. (1996). How can nurses participate in the development of an ICNP? International Nursing Review, 43, 171-173.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Cordyack-Washington, C., & Moss, M. (1988). Pragmatic aspects of establishing interrater reliability in research. Nursing Research, 37, 190-191.

Feinstein, A.R., & Cicchetti, D.V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43, 543-549.

Garvin, B.J., Kennedy, C.W., & Cissna, K.N. (1988). Reliability in category coding systems. Nursing Research, 37, 52-55.

Guetzkow, H. (1950). Unitizing and categorizing problems in coding qualitative data. Journal of Clinical Psychology, 6, 47-58.

Hagelin, E., Lagerberg, D., & Sundelin, C. (1991). Child health records as a database for clinical practice, research and community planning. Journal of Advanced Nursing, 16, 15-23.

Hagelin, E. (1992). Record keeping and health services mirrored by data from Swedish child health records. Scandinavian Journal of Caring Sciences, 6, 201-210.

Hagelin, E., Jackson, K., & Wikblad, K. (1998). Utilization of child health services during the first 18 months of life: Aspects of health surveillance in Swedish pre-school children based on health records. Acta Paediatrica, 87, 996-1002.

Hall, D.M.B. (Ed.). (1996). Health for all children (3rd ed.). Oxford: Oxford Medical Publications.

Knapp, T.R. (1985). Validity, reliability, and neither. Nursing Research, 34, 189-192.

Knapp, T.R., & Brown, J.K. (1995). Ten measurement commandments that often should be broken. Research in Nursing and Health, 18, 465-469.

Kurtz, Z., & Stanley, F. (1995). Epidemiology. In D. Harvey, M. Miles, & D. Smyth (Eds.), Community child health and paediatrics (pp. 3-22). Oxford: Butterworth-Heinemann.

Landis, J.R., & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.

Magnusson, M. (1997). Rationality of routine health examinations by physicians of 18-month-old children: Experiences based on data from a Swedish county. Acta Paediatrica, 86, 881-887.

Mezzich, J.E., Kraemer, H.C., Worthington, D.L., & Coffman, G.A. (1981). Assessment of agreement among several raters formulating multiple diagnoses. Journal of Psychiatric Research, 16, 29-39.

National Board of Health and Welfare. (1994). Child health care services in Sweden: Aims, surveillance program, and directions for the future [Booklet]. Stockholm: Socialstyrelsen.

Polit, D., & Hungler, B. (1996). Nursing research: Principles and methods (5th ed.). Philadelphia: Lippincott.

Reed, J. (1992). Secondary data in nursing research. Journal of Advanced Nursing, 17, 877-883.

SAS Institute Inc. (1993). SAS/INSIGHT user's guide, version 6 (2nd ed.). North Carolina: SAS Institute Inc.

Socialstyrelsen. (1981). Anvisningar och kommentar till journal inom barnhälsovården [National Board of Health and Welfare. Manual for notification in Child Health records]. Stockholm: Author.

Socialstyrelsen. (1993). Kvalitetssäkring i hälso- och sjukvården inklusive tandhälsovården [National Board of Health and Welfare. Quality assurance within the National Health Services and the Dental Services]. Stockholm: Author.

Socialstyrelsen. (1994). Kvalitetssäkring inom barnhälsovården: Att skydda skyddsnätet [National Board of Health and Welfare. Quality assurance in Child Health Services: Protecting the safety-net]. Stockholm: Author.

Scott, W.A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321-325.

Soeken, K.L., & Prescott, P.A. (1986). Issues in the use of kappa to estimate reliability. Medical Care, 24, 733-741.

Svenska Barnläkareföreningen. (1997). BAD 97. Svenska Barnläkareföreningens specialistversion av ICD-10: Klassifikation av sjukdomar och hälsoproblem [BAD 97. The Swedish Pediatric Association's specialist version of ICD-10: Classification of illnesses and health problems]. Stockholm, November 1996.

Topf, M. (1986). Three estimates of interrater reliability for nominal data. Nursing Research, 35, 253-255.

vonKoss Krowchuk, H., Moore, M.L., & Richardson, L. (1995). Using health records as sources of data for research. Journal of Nursing Measurement, 3, 3-12.