Reliability and Predictive Value of the National Institute of Diabetes and Digestive and Kidney Diseases Liver Transplantation Database Nomenclature and Grading System for Cellular Rejection of Liver Allografts

A. JAKE DEMETRIS,1,2 ERIC C. SEABERG,1,6 KENNETH P. BATTS,3 LINDA D. FERRELL,4 JURGEN LUDWIG,3 RODNEY S. MARKIN,5 STEVEN H. BELLE,1,6 AND KATHERINE DETRE1,6
Pathologists participating in the National Institute of Diabetes and Digestive and Kidney Diseases Liver Transplantation Database (LTD) created a histopathological grading system for acute liver allograft rejection, and tested it first for inter- and intrarater reliability among a group of five pathologists experienced in liver and transplantation pathology. Specimens from post-transplantation biopsies from 48 patients with rejection, hepatitis, or other diagnosis(es) were reviewed. There was moderate to good (kappa = 0.40 to 0.55) inter-rater and good (kappa = 0.55 to 0.58) intrarater agreement for the diagnosis and exact grading of mild, moderate, or severe acute rejection, which improved when a short clinical history was provided. Thus, the scheme was reproducible, and few of the disagreements among the pathologists would have affected treatment decisions. Secondly, the ability of the grading system to predict an unfavorable short- or long-term outcome from the initial histopathological diagnosis of cellular rejection was tested on groups of 168 and 133 patients, respectively, from the three LTD clinical centers, who were followed up for at least 6 months after the first onset of rejection. This analysis showed that a higher histopathological grade of acute rejection on first biopsy diagnosis was significantly associated (P ≤ .006) with both an unfavorable short-term outcome, defined by failure of the episode to resolve within 21 days or the need for aggressive immunosuppressive treatment, and a long-term outcome defined by death or retransplantation from rejection within 6 months of onset. Lastly, an analysis was performed to determine whether subjective rejection grading by the pathologist or certain "objective" histopathological features identified by logistic regression modeling were more accurate in predicting an unfavorable outcome. The sensitivity (.86 vs. .71), specificity (.68 vs. .75), positive predictive value (.13 vs. .14), and negative predictive value (.99 vs. .98) for predicting an unfavorable long-term outcome (allograft loss from rejection within 6 months of onset) were similar for both prediction methods, although the comparison favored the logistic regression model. The low positive predictive value of both methods was attributed to the current immunosuppressive agents, which are highly effective in the prevention of liver allograft failure from acute rejection, and the difficulty in separating rejection grading from staging. To our knowledge, this study represents the first attempt to evaluate both the reproducibility and predictive value of a histopathological grading system for allograft rejection, using multiple pathologists and patients from more than one center. (HEPATOLOGY 1995;21:408-416.)

Abbreviations: NIDDK, National Institute of Diabetes and Digestive and Kidney Diseases; LTD, Liver Transplantation Database; LFT, liver function test; ALG, anti-lymphocyte globulin.
From the 1NIDDK Liver Transplantation Database; the 2Department of Pathology at the University of Pittsburgh, PA; the 3Mayo Clinic, Rochester, MN; the 4University of California at San Francisco; the 5University of Nebraska; and the 6Graduate School of Public Health, University of Pittsburgh, PA.
Received May 9, 1994; accepted September 8, 1994.
Address reprint requests to: A. Jake Demetris, MD, Department of Pathology, Division of Transplantation, Pittsburgh Transplant Institute, E1548 Biomedical Science Tower, University of Pittsburgh, Pittsburgh, PA 15213.
Copyright © 1995 by the American Association for the Study of Liver Diseases.
0270-9139/95/2102-0022$3.00/0

HEPATOLOGY Vol. 21, No. 2, 1995

Physicians in the field of transplantation generally agree that a histopathological grading system for solid organ allograft rejection should fulfill three broad objectives: it should be simple, have some predictive value useful in patient management, and be reproducible. Although most published schemes have been locally tested for predictive value and simplicity, few if any have been tested for reliability or predictive value in more than one center. Establishment of the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Liver Transplantation Database (LTD) in 1985 provided an opportunity to determine the intraobserver and interobserver reproducibility of histopathological findings, which were important in the diagnosis of rejection and other causes of hepatic allograft dysfunction.1,2 Using those features, a consensus grading system for acute liver allograft rejection was constructed by the LTD pathologist group, based on previous studies that focused on the histopathology and pathophysiology of graft injury and failure.3-13 The purpose of the current study is to evaluate a grading system for acute rejection, using a multistep approach: the reliability of the system having been tested in the first step, the accuracy of the rejection grading system in predicting a subsequent unfavorable rejection-related clinical outcome when an initial liver biopsy showed rejection was then assessed using a larger patient population. Lastly, an analysis was performed to determine whether rejection grading or certain histological feature(s) identified by logistic regression modeling was more accurate in predicting an unfavorable clinical outcome.

MATERIALS AND METHODS
Study 1: Reproducibility of the Grading System

Overview of Study Design. In this study, five pathologists (A.J.D., K.P.B., L.F., J.L., and R.M.) interpreted a set of 50 preselected liver biopsy slides in a blinded fashion on three separate occasions. The first two readings were performed with no clinical information. For the third reading, a brief history was provided.

Case Selection. Archival post-transplantation liver biopsy specimens from 50 patients were selected from among those already interpreted and entered into the LTD before October 23, 1991 from the three participating institutions in the LTD: University of Nebraska Medical Center, University of California at San Francisco, and the Rochester Mayo Clinic. Stratified random sampling was used to ensure a mix of specimens showing rejection with varying grades of severity, hepatitis, other diseases, and normal histology. Specifically, 25 carried a diagnosis of rejection, 12 had been diagnosed as viral hepatitis (5 type B, 2 type C, and 5 possibly or probably viral), and 13 were free of both rejection and hepatitis. Because grades of rejection had not been assigned by the interpreting pathologists before October 23, 1991, a putative spectrum of rejection grades based on the degree and distribution of inflammation and lobular necrosis recorded by the original interpreting pathologist was assigned using a computer algorithm. All biopsies were performed within 0 to 384 days after transplantation, although the pathologists were not provided this information. Two unstained slides were requested from the appropriate institution for each of the selected cases and stained with hematoxylin-eosin at a single location (University of Pittsburgh) to standardize the preparation. Two slides were subsequently eliminated from the pool to be reviewed before any data analysis because they were technically inadequate, thus resulting in a final pool of 48 slides.

Histological Interpretation.
Each participating pathologist (A.J.D., K.P.B., L.F., J.L., and R.M.) reviewed all slides in a blinded fashion on two occasions. For the third reading, a brief clinical history was also provided, consisting of the original pretransplantation diagnosis, the dates of transplantation and the biopsy, and the results of the liver function tests (LFT) nearest in time to the date of the specimen. Six specimens had no LFT data; the remaining 42 had data obtained within 7 days of the biopsy, with data for 32 being obtained within 2 days of the biopsy. The identification numbers on each slide were randomly scrambled between readings. There was an interval of approximately 6 months between the first two readings and 4 months between the second and third readings.

Data Recording. For each slide reading, each pathologist both recorded an assessment of individual histological features and rendered a final diagnosis. There were 15 nongraded features recorded: more than four portal tracts, specimen adequacy, bile duct inflammation/damage, bile duct loss, bile duct/cholangiolar proliferation, inflammatory or necrotizing arteritis, obliterative arteriopathy, endotheliitis/subendothelial inflammation, piecemeal or bridging necrosis, hilar necrosis, infarct necrosis, centrilobular necrosis, other necrosis, cholestasis, and granulomatous lobular inflammation. An additional six features were graded on a none/mild/moderate/severe scale: overall portal tract inflammation/intensity, portal fibrosis, central fibrosis, lobular disarray/ballooning, fat, and lobular inflammation. A final diagnosis was rendered based on standard histological criteria. The diagnosis of acute rejection was based on the presence of at least two of the following three findings: (1) predominantly mononuclear but mixed portal inflammation; (2) bile duct inflammation/damage; and (3) subendothelial localization of mononuclear cells in the portal and central veins. In the third reading, the interpretations were likely to be influenced by the provided clinical information. If a diagnosis of acute rejection was rendered, the severity was graded using the system shown in Table 1, which was derived by consensus based on a simplification of previously published schemes and histopathological findings shown to be reproducible in an LTD Pilot Study.1 This study only addresses acute rejection. (Chronic or ductopenic rejection will be the focus of a future study.)

Statistical Analysis. Inter-rater and intrarater agreement were assessed for both the individual histological features and for the final diagnosis using a multirater kappa, an agreement index corrected for chance.2 Briefly, a kappa less than 0 indicates that agreement was poorer than that expected by chance, and perfect agreement would be reflected by a kappa equal to 1. The following scale is used to interpret this measure of agreement:

    κ ≤ 0.10            Poor agreement
    0.11 ≤ κ ≤ 0.30     Fair agreement
    0.31 ≤ κ ≤ 0.50     Moderate agreement
    0.51 ≤ κ ≤ 0.70     Good agreement
    0.71 ≤ κ            Excellent agreement
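As an illustration of how a chance-corrected multirater agreement index can be computed and mapped onto the scale above, the following is a minimal sketch. It assumes Fleiss' kappa as the multirater statistic; the source states only that "a multirater kappa" was used, so the exact estimator is an assumption. The function names `fleiss_kappa` and `interpret_kappa` and the toy ratings are hypothetical and are not study data.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list with one label per rater.

    Assumption: every item is rated by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the pooled category proportions.
    pooled = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_expected = sum(p * p for p in pooled)

    if math.isclose(p_expected, 1.0):
        return float("nan")  # kappa is undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


def interpret_kappa(kappa):
    """Map a kappa value (rounded to two decimals) onto the scale in the text."""
    k = round(kappa, 2)
    if k <= 0.10:
        return "poor"
    if k <= 0.30:
        return "fair"
    if k <= 0.50:
        return "moderate"
    if k <= 0.70:
        return "good"
    return "excellent"


# Toy example (hypothetical): two slides, three raters.
toy = [["rejection", "rejection", "hepatitis"],
       ["hepatitis", "hepatitis", "hepatitis"]]
toy_kappa = fleiss_kappa(toy)
toy_label = interpret_kappa(toy_kappa)
```

Rounding to two decimals before applying the cutoffs mirrors the way the scale is stated (0.10, 0.11, and so on); this is a design choice of the sketch rather than something specified in the text.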
Study 2: Prognostic Utility of the Grading System

Overview of Study Design. The predictive value of a diagnosis of acute rejection, the individual histological features, and the logistic regression modeling of the histological features were evaluated. The first part evaluated the ability of the grading system for acute rejection, as validated in study 1 and currently used by the LTD pathologists, to predict an unfavorable outcome. Specifically, we first wished to determine whether the histological severity of rejection correlates with clinical outcome. This was done by prospectively evaluating the clinical outcomes of individuals who had originally been diagnosed as having acute rejection by one of the LTD pathologists at their respective institutions. The biopsy slides were not re-reviewed, but rather the diagnosis of acute rejection and its grading (Table 1) was determined from the LTD computer records. The second part of this study used the same outcome criteria and statistical analysis as in part 1. However, individual histological features rather than grading of acute rejection were used as potential predictors of unfavorable outcome. These histological features had been recorded previously for all cases and entered into the LTD computer records. The histological features are identical to those listed for reliability study 1. The third part of this study was a logistic regression analysis, which attempted to determine whether there were any individual histological features that could more accurately predict an unfavorable clinical outcome than the grading of acute rejection.

TABLE 1. NIDDK-LTD Nomenclature and Grading of Liver Allograft Rejection

Acute Rejection*
    Grade            Histopathological Findings
    A0 (None)        No rejection
    A1 (Mild)        Rejection infiltrate in some, but not most, of the triads, confined within the portal spaces
    A2 (Moderate)    Rejection infiltrate involving most or all of the triads, with or without spillover into lobule. No evidence of centrilobular hepatocyte necrosis, or dropout
    A3 (Severe)      Infiltrate in some or all of the triads, with or without spillover into the lobule, with or without inflammatory cell linkage of the triads, associated with moderate-severe lobular inflammation and lobular necrosis and dropout

Chronic (Ductopenic) Rejection†
    Grade                         Histopathological Findings
    B1 (Early or mild)            Bile duct loss, without centrilobular cholestasis, perivenular sclerosis, or hepatocyte ballooning or necrosis and dropout
    B2 (Intermediate/moderate)    Bile duct loss, with one of the following four findings: centrilobular cholestasis, perivenular sclerosis, hepatocellular ballooning, necrosis and dropout
    B3 (Late or severe)           Bile duct loss, with at least two of the following four findings: centrilobular cholestasis, perivenular sclerosis, hepatocellular ballooning, or centrilobular necrosis and dropout

* The diagnosis of acute rejection is based on the presence of at least two of the following three findings: (1) predominantly mononuclear but mixed portal inflammation; (2) bile duct inflammation/damage; and (3) subendothelial localization of mononuclear cells in the portal and central veins. Thereafter, the severity of rejection was graded on the above findings.
† Bile duct loss in >50% of triads must be present for the diagnosis.

Case Selection. The patient population for this study consisted of LTD patients who had undergone liver transplantation between April 15, 1990 and April 15, 1993 at one of the participating institutions and had an unambiguous acute rejection episode(s) diagnosed. Among the three institutions, 312 of 668 patients met these criteria. Exclusion criteria consisted of subsequently determined misdiagnosis(es) (n = 4), reception of OKT3 or Minnesota anti-lymphocyte globulin (ALG) at the time of original diagnosis (n = 11), and incomplete computer data (n = 2). For the first and third parts of this study, a grade of acute rejection was necessary. Because this was not done before October 23, 1991, 127 of the remaining 295 patients were excluded from this arm. Thus, the first and third parts included 168 patients; the second part included all 295.

Definitions. For this study, we wished to use unambiguous cases of acute rejection. Accordingly, a biopsy was defined as representing an acute rejection episode if the biopsy specimen was the first one to be histologically diagnosed as acute rejection, and occurred in conjunction with a clinically acknowledged rejection episode within the 6-week period immediately after the first transplantation. For endpoint analysis, patients were divided into those with favorable or unfavorable outcomes.
The objective, long-term measure of an unfavorable outcome was whether a rejection-related death or retransplantation occurred within 6 months of the rejection onset. Because this long-term endpoint was uncommon, a broader short-term measure of unfavorable outcome was also used, consisting of the presence of any one of the following four findings: (1) the rejection resulted in graft failure (whether death or retransplantation) before resolution; (2) OKT3 or ALG was used to treat the rejection; (3) any secondary treatment was required; and (4) complete resolution of the episode failed to occur within 21 days.

Statistical Analyses. The prognostic ability of the grading system to predict an unfavorable outcome was assessed by the χ² test for trend, using both the long-term and short-term definitions of unfavorable outcome. For the long-term analysis, 6-month follow-up was required: 133 (79.2%) of the 168 patients were available for evaluation; for the short-term unfavorable outcome, all 168 patients were included.

For the second part of the prediction analysis, study 2, contingency table analyses were used to examine the association between the two measures of unfavorable outcome and each histological feature recorded by the original interpreting pathologist. These features are identical to those listed for study 1. χ² tests for association and trend were used to test statistical significance where appropriate.

Logistic regression analysis was used to identify a group of histological features that "best" predict the long- and short-term outcomes. The regression models were constructed from all histological features from all specimens that were assigned rejection grades (n = 133 for long-term unfavorable outcome and n = 168 for short-term unfavorable outcome). The models reported below were derived using stepwise logistic regression with a significance level of .20 for a feature to enter the model and a significance level of .05 for a feature to remain in the model. Additional models generated but not reported here allowed features to remain if the significance level was less than .20. With respect to the correct classification of outcome status, the performance of these models was similar to the models reported below.

The next step was to determine whether the logistic regression models performed better than the grading system. To compare the prognostic ability of the grading scheme to that of the logistic regression models, we adopted the following strategy: for the grading scheme, we used the simple rule of predicting an unfavorable outcome for grades A2 and A3, and a favorable outcome for grade A1. For the logistic regression models, we selected a probability cutpoint that maximized the average of sensitivity and specificity. The respective prognostic abilities of the regression models as measured by sensitivity, specificity, positive predictive value, and negative predictive value were then compared with those of the rejection grading system for each outcome. The regression models were constructed using all available data. Thus, the prognostic ability of the models is maximized for these data. In an independent sample, the regression models would not be expected to perform as well. However, the grading system was not developed using these data. Thus, the grading system would perform approximately the same in an independent sample. In short, the regression models were given every chance to outperform the grading system.
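The comparison strategy described above can be sketched in a few lines of code. This is an illustrative sketch only: the function names (`predictive_metrics`, `best_cutpoint`) are hypothetical, and the per-grade counts for the long-term outcome are back-calculated from the percentages in Table 3 (1 of 87 for A1, 4 of 32 for A2, 2 of 14 for A3), which is an assumption rather than data taken directly from the LTD records. The predicted probabilities passed to `best_cutpoint` are invented toy values.

```python
def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def predictive_metrics(predicted, actual):
    """Sensitivity, specificity, PPV, and NPV for boolean predictions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def best_cutpoint(probabilities, actual):
    """Cutpoint maximizing the average of sensitivity and specificity.

    A case is predicted unfavorable when its probability is >= the cutpoint.
    """
    best_cut, best_score = None, -1.0
    for cut in sorted(set(probabilities)):
        m = predictive_metrics([p >= cut for p in probabilities], actual)
        score = (m["sensitivity"] + m["specificity"]) / 2
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score


# Grading rule: predict an unfavorable outcome for A2 and A3, favorable for A1.
# Counts per grade are (patients, unfavorable long-term outcomes), back-calculated
# from Table 3 and therefore an assumption.
assumed_counts = {"A1": (87, 1), "A2": (32, 4), "A3": (14, 2)}
grades, outcomes = [], []
for grade, (n_patients, n_unfavorable) in assumed_counts.items():
    grades += [grade] * n_patients
    outcomes += [True] * n_unfavorable + [False] * (n_patients - n_unfavorable)

grading_metrics = predictive_metrics([g in ("A2", "A3") for g in grades], outcomes)

# Toy probabilities (hypothetical) to show the cutpoint search.
toy_cut, toy_score = best_cutpoint([0.1, 0.2, 0.8, 0.9], [False, False, True, True])
```

Under the assumed counts, the grading rule yields values that round to the figures quoted in the abstract for one of the two methods (sensitivity .86, specificity .68, positive predictive value .13, negative predictive value .99).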
RESULTS

Study 1: Reliability

The distributions of the presence and severity of the 19 histological features and three diagnoses (rejection, hepatitis, and other) as identified by all five pathologists collectively are presented in Table 2. The proportions correspond to readings of all 48 slides by all five pathologists and thus are based on a total of 240 readings. For example, during the third reading, the pathologists documented mild portal tract inflammation on 60% (n = 144) of the readings. Note that the distributions were very similar for the three readings. Also included in Table 2 are the intrarater kappa values for two pairwise comparisons. This is an overall measure of how well each of the five pathologists agreed with themselves for the given histological features. Reading 1 versus reading 2 provides the ability to evaluate the agreement on two readings of the same slide without additional patient information, and reading 1 versus reading 3 enables a comparison of a reading of a slide without additional information to a reading of the same slide with additional information. Finally, inter-rater kappas reflecting the agreement about a given feature among the five pathologists are reported for each of the three readings.

There is "good" overall intrarater agreement between the first and second readings for the presence of more than four portal tracts, bile duct inflammation or damage, bile duct loss, endotheliitis, piecemeal or bridging necrosis, and cholestasis, and for the severity of portal tract inflammation, lobular disarray/ballooning, and fat. The chance-corrected intrarater agreement for all other histological features, with the exception of specimen adequacy, arteritis, arteriopathy, and central fibrosis, is moderate. These latter features had either fair or poor agreement attributable in part to the extremely high or low frequency of the presence of these features. These features were either very common (present on at least 95% of the assessments) or rare (present on less than 5% of the assessments). Thus, it was very difficult for agreement to be much greater than that expected by chance. For instance, when a feature is very common, indicated in, say, 95% of the assessments, one would expect agreement 90.25% of the time simply by chance.

The intrarater agreement between the first and third readings was very similar to the agreement between the first and second readings for most of the histological features. Thus, the incorporation of a brief patient history for the third reading does not seem to greatly affect the intrarater agreement. However, if the history were known for both readings, the agreement would be expected to be better. The features for which the largest discrepancy exists between the two intrarater kappas are those found to be either very rare or very common in this sample. For these, a relatively large change in the chance-corrected kappa value does not necessarily indicate a great change in overall agreement attributable to the variability of the estimate of the chance-corrected agreement.

The intrarater agreement for one of the pathologists on readings 2 and 3 differed only by two choices in more than 1,000 options, which was much better than the intrarater agreement for any of the other pathologists for any two readings. The high rate of concurrence in this one instance was attributed to a "learning" of the slides. Nevertheless, analyses of the data excluding this pathologist did not change the conclusions of the study.

Table 2 also presents the intrarater agreement for the presence of viral hepatitis and the presence and severity of acute rejection. The intrarater reliabilities of the three measurements listed are all "good." Specifically, the grading of acute liver allograft rejection was found to have good intrarater agreement (kappa = 0.55 to 0.58) across the three readings and five pathologists.

The inter-rater reliability among the five pathologists is moderate or good across all three readings for more than four portal tracts, portal tract inflammation intensity, bile duct inflammation/damage, endotheliitis, piecemeal/bridging necrosis, fat, and severity of lobular inflammation. Fair to moderate reliability among the pathologists was observed for bile duct/cholangiolar proliferation, portal fibrosis, lobular disarray/ballooning, other necrosis, centrilobular necrosis, and cholestasis. The remaining histological features, which had poor agreement on at least one of the three readings, were again those features that were either very rare or very common. In general, the additional clinical information on reading 3 did not improve the inter-rater agreement for these histological features.

The agreement among the pathologists for the diagnoses ranges from moderate (without additional patient information) to good (when additional information is available). The kappa values for the third reading are those that are most likely to be observed in actual practice because pathologists would have access to patient information. This suggests that the more information available, the more likely the pathologists would agree on a given diagnosis. The increase in reliability for the grading system across the three readings also might reflect an increased familiarity of the pathologists with the grading system. Thus, the kappa of 0.55 achieved with a minimum of additional information might be considered to be a lower bound for the reliability of the grading system among pathologists in a "real" setting.
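The effect of very common or very rare features on chance-corrected agreement can be made concrete with a small worked sketch. It assumes the simplest case of two readings of a present/absent feature with the same prevalence on both readings; the function names and the 92% observed-agreement figure are hypothetical. For a feature recorded on 95% of assessments, both readings record it by chance 0.95 × 0.95 = 90.25% of the time, and both record it as absent a further 0.05 × 0.05 = 0.25% of the time, so total chance agreement is 90.5%.

```python
def chance_agreement(prevalence):
    """Agreement expected by chance for two readings of a binary feature.

    Assumption: both readings record the feature with the same prevalence.
    """
    return prevalence ** 2 + (1 - prevalence) ** 2


def kappa_from_agreement(observed, prevalence):
    """Chance-corrected agreement for a given raw agreement and prevalence."""
    expected = chance_agreement(prevalence)
    return (observed - expected) / (1 - expected)


# Same raw agreement (92%, hypothetical), very different kappas.
kappa_common_feature = kappa_from_agreement(0.92, 0.95)    # feature on 95% of assessments
kappa_balanced_feature = kappa_from_agreement(0.92, 0.50)  # feature on 50% of assessments
```

With identical raw agreement, the very common feature gets a kappa in the "fair" range while the balanced feature gets one in the "excellent" range, which is the pattern described for specimen adequacy, arteritis, arteriopathy, and central fibrosis.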
TABLE 2. Pathology Reliability Study: Intra- and Interrater Agreement

                                        Proportion of Assessments   Intrarater κ*       Interrater κ*
                                        With Feature (by Reading)   (Readings)          (Reading)
Histology (5 Pathologists, 48 Slides;
240 Assessments)                        1      2      3      1 and 2  1 and 3    1      2      3

>4 portal tracts                        0.83   0.80   0.83     0.61     0.64     0.51   0.48   0.40
Specimen adequate                       0.96   0.96   0.95     0.29     0.15     0.08   0.17   0.09
Portal tract inflammation                                      0.62     0.63     0.52   0.53   0.48
  Mild                                  0.63   0.60   0.60
  Moderate                              0.14   0.18   0.17
  Severe                                0.00   0.00   0.004
Bile duct inflammation/damage           0.53   0.52   0.48     0.63     0.59     0.49   0.50   0.54
Bile duct loss                          0.08   0.08   0.06     0.55     0.49     0.34   0.21   0.05
Bile duct/cholangiolar proliferation    0.42   0.40   0.48     0.38     0.49     0.21   0.31   0.30
Arteritis                               0.00   0.004  0.01     0.00     0.00     1.00   0.00   -0.01
Obliterative arteriopathy               0.02   0.01   0.01     -0.01    -0.01    -0.01  0.00   0.00
Endotheliitis                           0.23   0.23   0.23     0.67     0.69     0.61   0.56   0.56
Portal fibrosis                                                0.49     0.51     0.34   0.27   0.35
  Mild                                  0.20   0.17   0.19
  Moderate                              0.01   0.03   0.02
  Severe                                0.00   0.01   0.01
Central fibrosis                                               0.24     0.12     0.26   0.15   0.06
  Mild                                  0.08   0.05   0.03
  Moderate                              0.004  0.004  0.004
  Severe                                0.00   0.00   0.00
Lobular disarray/ballooning                                    0.59     0.54     0.29   0.33   0.32
  Mild                                  0.37   0.43   0.45
  Moderate                              0.15   0.16   0.18
  Severe                                0.004  0.00   0.00
Piecemeal/bridging necrosis             0.09   0.11   0.13     0.60     0.50     0.48   0.39   0.38
Other necrosis                          0.38   0.46   0.51     0.48     0.45     0.20   0.34   0.28
Centrilobular necrosis                  0.10   0.09   0.09     0.45     0.53     0.38   0.23   0.21
Cholestasis                             0.33   0.30   0.26     0.57     0.61     0.32   0.32   0.30
Fat                                                            0.58     0.62     0.44   0.39   0.45
  Mild                                  0.17   0.20   0.19
  Moderate                              0.01   0.03   0.02
  Severe                                0.00   0.00   0.00
Severity of lobular inflammation                               0.46     0.47     0.34   0.45   0.42
  Mild                                  0.52   0.53   0.58
  Moderate                              0.09   0.16   0.16
  Severe                                0.00   0.00   0.004
Granulomatous lobular inflammation      0.01   0.03   0.03     0.35     0.39     -0.01  -0.03  0.12

Diagnoses
  Acute rejection                       0.33   0.33   0.33     0.69     0.66     0.41   0.52   0.56
  Viral hepatitis                       0.18   0.24   0.24     0.61     0.61     0.41   0.47   0.56
  "Acute" rejection grade                                      0.55     0.58     0.40   0.45   0.55
    A1                                  0.20   0.21   0.22
    A2                                  0.09   0.07   0.08
    A3                                  0.04   0.05   0.03

* Kappa: <0.11 poor, 0.11-0.30 fair, 0.31-0.50 moderate, 0.51-0.70 good, >0.70 excellent.1
Practical Significance of Disagreement in Pathological Diagnosis

The practical significance of disagreement on any particular case was assessed on a case-by-case basis for reading 3, because this slide evaluation most closely approached routine clinical practice. This exercise was performed to determine what impact the "statistical" differences would have in the everyday world of patient management. For example, if one pathologist's primary diagnosis was "mild preservation injury" and another's was "essentially normal" for the same case, the potential impact of the discrepancy on patient management was considered negligible. However, if one pathologist diagnosed "rejection" while another diagnosed "hepatitis," a substantial difference was recorded. In total, there were 7 of 48 (15%) cases when the pathological diagnoses given by at least two of the pathologists on the same case were substantially different. Three of these cases were attributable to a single pathologist who repeatedly made a diagnosis of "chronic vascular rejection"; none of the other pathologists ever made that diagnosis. Instead, the other pathologists believed that the changes represented acute or chronic active hepatitis. In two other cases, diagnosis of Epstein-Barr virus hepatitis by one pathologist was in direct conflict with a diagnosis of moderate acute rejection by another, a discrepancy that is easily resolved in routine practice with the use of in situ hybridization for Epstein-Barr virus. In one case, cytomegalovirus hepatitis was diagnosed by one pathologist, whereas three others thought the biopsy sample was essentially normal. In the final case, almost all of the diagnoses differed; severe acute rejection, sepsis, drug reaction, and humoral rejection were diagnosed by different pathologists on the same biopsy specimen. This particular case was re-reviewed by the group at a multiheaded microscope after closure of the study, and still no consensus was achieved. Thus, in 4 of 48 (8%) of the cases, different readings potentially could have substantially impacted patient care.

TABLE 3. Association Between the Histopathological Acute Rejection Grade as Determined by the Pathologists and the Two Outcomes Considered

                      Short-Term                          Long-Term
Subjective Grade      No.    Unfavorable Outcome (%)      No.    Unfavorable Outcome (%)
A1                    108    37.0                         87     1.2
A2                    44     47.7                         32     12.5
A3                    16     75.0                         14     14.3
Total                 168    43.5                         133    5.3
                             P = .005*                           P = .006*

* χ² test for trend.
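The χ² test for trend reported with Table 3 can be sketched as follows. This is a minimal sketch, assuming a Cochran-Armitage form of the trend test with equally spaced scores (0, 1, 2) for grades A1, A2, and A3; the source does not state which variant or scoring was used. The event counts are back-calculated from the percentages in Table 3 (short-term: 40, 21, 12; long-term: 1, 4, 2) and are therefore assumptions, and the function name `chi2_trend` is hypothetical.

```python
import math


def chi2_trend(events, totals, scores=None):
    """Cochran-Armitage chi-square test for trend (1 degree of freedom).

    events: unfavorable outcomes per ordered group
    totals: patients per ordered group
    scores: numeric group scores; defaults to 0, 1, 2, ...
    """
    if scores is None:
        scores = list(range(len(totals)))
    n = sum(totals)
    p = sum(events) / n  # pooled proportion with an unfavorable outcome

    # Score-weighted difference between observed and expected events.
    t = sum(s * (e - m * p) for s, e, m in zip(scores, events, totals))

    # Variance of t under the null hypothesis of no trend.
    sum_s = sum(s * m for s, m in zip(scores, totals))
    sum_s2 = sum(s * s * m for s, m in zip(scores, totals))
    variance = p * (1 - p) * (sum_s2 - sum_s ** 2 / n)

    chi2 = t * t / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value


# Assumed counts, back-calculated from the percentages in Table 3.
chi2_short, p_short = chi2_trend(events=[40, 21, 12], totals=[108, 44, 16])
chi2_long, p_long = chi2_trend(events=[1, 4, 2], totals=[87, 32, 14])
```

Under these assumptions both statistics fall just under 8 on one degree of freedom, giving P values close to those reported in Table 3. Small differences could arise from the exact counts, the scoring of the grades, or the variant of the test used.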
Study 2: Prognostic Ability

Table 3 describes the association between the rejection grade as provided by the pathologists, and the two outcomes considered. A statistically significant trend was observed between the grading system and each outcome separately. In each case, a grade indicating a more severe rejection episode was associated with a greater probability of an unfavorable outcome.

The association between each individual histological feature assessed in the LTD and the two outcomes also was examined. Histological features associated (P < .10) with a long-term unfavorable outcome include portal tract inflammation, lobular disarray/ballooning, piecemeal/bridging necrosis, and centrilobular necrosis. Histological features associated (P < .10) with a short-term unfavorable outcome include portal tract inflammation, portal fibrosis, central fibrosis, and centrilobular necrosis. For many of these features, very few individuals were found to be positive.

The logistic regression modeling with respect to the long-term outcome was straightforward and used all 133 eligible specimens. However, the relationship between the short-term outcome and both central fibrosis and severe portal tract inflammation precluded a straightforward application of the logistic regression model. All five patients with mild central fibrosis and all three patients with severe portal tract inflammation intensity had a bad short-term outcome. Consequently, the iterative method for estimating parameters could not converge when central fibrosis was included in the model. To account for this problem, these 8 specimens of the 168 eligible specimens were excluded from the modeling process. The effective model for the short-term outcome was then a two-step model: first, if central fibrosis was mild or portal tract inflammation was severe, then the outcome was predicted to be an unfavorable outcome; second, if central fibrosis did not exist and portal tract inflammation was either mild or moderate, then the logistic regression model was used to calculate the probability of an unfavorable outcome. The final logistic regression models are presented below, where P is the estimated probability of an unfavorable outcome:

Long-Term:

$$P = \frac{e^{-3.85 + 1.52x_1 + 4.54x_2}}{1 + e^{-3.85 + 1.52x_1 + 4.54x_2}}$$

where
$x_1$ = 1 if portal tract inflammation intensity is moderate, 0 otherwise
$x_2$ = 1 if portal tract inflammation intensity is severe, 0 otherwise

Short-Term:

$$P = \frac{e^{-0.31 + 0.78x_1 + 0.71x_2 - 0.58x_3 - 0.86x_4}}{1 + e^{-0.31 + 0.78x_1 + 0.71x_2 - 0.58x_3 - 0.86x_4}}$$

where
$x_1$ = 1 if portal tract inflammation intensity is moderate, 0 otherwise
$x_2$ = 1 if lobular disarray/ballooning is mild, 0 otherwise
HEPATOLOGY February 1995
TABLE 4. Long-Term Outcome: Rejection-Related Graft Loss Within 6 Months of Biopsy at Onset of Rejection

                             Grading Scheme            Logistic Regression Model
                             (Observed UO)             (Observed UO)
Predicted UO                 Yes     No   Total        Yes     No   Total
Yes                            6     40      46          5     32      37
No                             1     86      87          2     94      96
Total                          7    126     133          7    126     133

                             Grading Scheme            Logistic Regression Model
Sensitivity                  6/7 = .86                 5/7 = .71
Specificity                  86/126 = .68              94/126 = .75
Positive predictive value    6/46 = .13                5/37 = .14
Negative predictive value    86/87 = .99               94/96 = .98

Abbreviation: UO, unfavorable outcome.
$x_3 = 1$ if lobular disarray/ballooning is moderate, 0 otherwise
$x_4 = 1$ if cholestasis is present, 0 otherwise
For example, for a liver allograft recipient with the following histology on the first post-transplantation biopsy to be diagnosed with rejection (severe portal tract inflammation, mild lobular disarray/ballooning, cholestasis present, and mild steatosis), the estimated probabilities of the two unfavorable outcomes are:

Long-Term:

$$P = \frac{e^{-3.85 + 4.54}}{1 + e^{-3.85 + 4.54}} = \frac{1.9937}{2.9937} = 0.67$$

Short-Term:

$$P = 1.0$$

Because this biopsy sample shows severe portal tract inflammation, the two-step model estimates the probability of a short-term unfavorable outcome to be 1.0.
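As an illustrative sketch only (not the authors' software), the two models above can be evaluated in a few lines of Python. The function and argument names are hypothetical, feature levels are passed as plain strings, and the short-term coefficients are transcribed from the printed equation, so they should be checked against the original article before any reuse.

```python
import math


def logistic(z: float) -> float:
    """Return e^z / (1 + e^z), the form used in both printed models."""
    return math.exp(z) / (1.0 + math.exp(z))


def long_term_probability(portal_inflammation: str) -> float:
    """Estimated probability of the long-term unfavorable outcome.

    portal_inflammation is assumed to be "mild", "moderate", or "severe"
    (hypothetical encoding chosen for this sketch).
    """
    x1 = 1 if portal_inflammation == "moderate" else 0
    x2 = 1 if portal_inflammation == "severe" else 0
    return logistic(-3.85 + 1.52 * x1 + 4.54 * x2)


def short_term_probability(
    portal_inflammation: str,
    lobular_disarray: str,
    cholestasis: bool,
    central_fibrosis: str = "none",
) -> float:
    """Two-step estimate of the short-term unfavorable outcome."""
    # Step 1: mild central fibrosis or severe portal tract inflammation
    # is predicted to be an unfavorable outcome outright.
    if central_fibrosis == "mild" or portal_inflammation == "severe":
        return 1.0
    # Step 2: otherwise apply the logistic regression model.
    x1 = 1 if portal_inflammation == "moderate" else 0
    x2 = 1 if lobular_disarray == "mild" else 0
    x3 = 1 if lobular_disarray == "moderate" else 0
    x4 = 1 if cholestasis else 0
    # Coefficients as read from the printed short-term equation (assumption).
    return logistic(-0.31 + 0.78 * x1 + 0.71 * x2 - 0.58 * x3 - 0.86 * x4)


if __name__ == "__main__":
    # Worked example from the text: severe portal tract inflammation,
    # mild lobular disarray/ballooning, cholestasis present.
    print(round(long_term_probability("severe"), 2))
    print(short_term_probability("severe", "mild", True))
```

For the worked example, the sketch gives about 0.67 for the long-term outcome and 1.0 for the short-term outcome, in agreement with the text. Steatosis does not enter either model, so it is not an argument.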
Tables 4 and 5 show the sensitivity and specificity for both the grading scheme and logistic model with respect to both long- and short-term outcomes, respectively. For the long-term outcome, the sensitivity is greater for the grading system and the specificity is
greater for the regression model. The positive and negative predictive values are virtually identical. For the short-term outcome, the specificity and positive predictive value are greater for the regression model. The sensitivity and negative predictive values are virtually identical. Overall then, the grading scheme and the logistic model have comparable degrees of prediction. One should keep in mind, however, that the prognostic ability of the logistic regression model is optimal for these data, because the model was fitted to them.

DISCUSSION
The good reproducibility of the diagnoses of rejection and hepatitis, even under the artificial conditions imposed by the study design, is probably the most important finding in the current study. This result should further bolster confidence in the usefulness of liver biopsy histopathology in the diagnosis of these two very common causes of liver allograft dysfunction. The good agreement on the subjective grading of acute rejection was also very encouraging, although testing by a group of pathologists not as experienced in liver or transplantation pathology would provide even stronger support for the proposed system. The consensus grading system proposed and used in
TABLE 5. Short-Term Outcome: Rejection-Related Graft Loss or Use of OKT3/ALG or Use of Any Secondary Treatment or Unresolved Rejection for the First Post-Transplantation Rejection Episode

                             Grading Scheme            Logistic Regression Model
                             (Observed UO)             (Observed UO)
Predicted UO                 Yes     No   Total        Yes     No   Total
Yes                           33     27      60         32     17      49
No                            40     68     108         41     78     119
Total                         73     95     168         73     95     168

                             Grading Scheme            Logistic Regression Model
Sensitivity                  33/73 = .45               32/73 = .44
Specificity                  68/95 = .72               78/95 = .82
Positive predictive value    33/60 = .55               32/49 = .65
Negative predictive value    68/108 = .63              78/119 = .66
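The sensitivity, specificity, and predictive values in Tables 4 and 5 follow directly from the 2x2 counts. The short Python sketch below recomputes them; the class name, field names, and dictionary keys are hypothetical labels introduced here for illustration, and only the counts come from the tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionTable:
    """2x2 counts of predicted versus observed unfavorable outcome (UO)."""

    pred_yes_obs_yes: int  # predicted UO, observed UO
    pred_yes_obs_no: int   # predicted UO, no observed UO
    pred_no_obs_yes: int   # no predicted UO, observed UO
    pred_no_obs_no: int    # no predicted UO, no observed UO

    @property
    def sensitivity(self) -> float:
        return self.pred_yes_obs_yes / (self.pred_yes_obs_yes + self.pred_no_obs_yes)

    @property
    def specificity(self) -> float:
        return self.pred_no_obs_no / (self.pred_no_obs_no + self.pred_yes_obs_no)

    @property
    def positive_predictive_value(self) -> float:
        return self.pred_yes_obs_yes / (self.pred_yes_obs_yes + self.pred_yes_obs_no)

    @property
    def negative_predictive_value(self) -> float:
        return self.pred_no_obs_no / (self.pred_no_obs_no + self.pred_no_obs_yes)


# Counts copied from Tables 4 (long-term) and 5 (short-term).
TABLES = {
    "long-term, grading scheme": PredictionTable(6, 40, 1, 86),
    "long-term, logistic model": PredictionTable(5, 32, 2, 94),
    "short-term, grading scheme": PredictionTable(33, 27, 40, 68),
    "short-term, logistic model": PredictionTable(32, 17, 41, 78),
}

if __name__ == "__main__":
    for name, table in TABLES.items():
        print(
            f"{name}: sens={table.sensitivity:.2f} spec={table.specificity:.2f} "
            f"ppv={table.positive_predictive_value:.2f} "
            f"npv={table.negative_predictive_value:.2f}"
        )
```

Rounded to two decimals, the recomputed values agree with those printed in the tables.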
this study is fundamentally a simplification of previously published schemes for liver allografts, sl~ which, in turn, are based on the paradigm first developed for kidney allografts, 14,15 and then later adapted to other organs. 16-19 Consistent with this approach, an attempt was made to segregate the acute, common, and mostly reversible form of rejection from the chronic, less common, and less easily reversible form. However, for any form of rejection, interference with arterial blood flow is the most serious manifestation, which histopathologically appears as direct arterial damage such as inflammatory arteritis, thrombosis, fibrinoid necrosis, or obliterative arteriopathy. Indirect evidence of arterial compromise such as confluent necrosis and interstitial hemorrhage can also be used but is less specific. Therefore, we attempted to "fit" arteritis (inflammatory or necrotizing) into the grading scheme, even though it was an uncommon and poorly reproducible finding. Unfortunately, arteritis was diagnosed by the pathologist(s) in only 3 of 295 (1%) first episodes of rejection, and all were independently assessed as rejection with favorable short- and long-term outcomes. This lack of association with poor outcome is probably more reflective of the lack of certainty in the diagnosis than of irrelevance of the finding. In any event, we thus chose not to include arteritis in the scheme, which is in contrast to other histopathological grading schemes for liver 10,11 or renal allograft rejection. 14,15,20 The physioanatomic differences between the liver and kidney, which result in the underrepresentation in liver biopsy samples of arteries affected by rejection, probably account for this difference.
Several liver allograft rejection grading schemes, including the one proposed herein, have attempted to circumvent the rare occurrence (and unreliability) of observing arteritis by relying on indirect or "surrogate" markers of rejection-related microvascular collapse and ischemia, such as bridging or lobular necrosis. Unfortunately, relying on indirect evidence of ischemia decreases specificity because causes other than rejection can result in ischemic injury to a graft. To overcome this shortcoming, perivenular and lobular inflammation were included in the scheme, along with the subjective judgment of the pathologist (which is difficult to avoid) that the ischemia is rejection related. Kemnitz et al 11 separate moderate from severe acute liver allograft rejection on the basis of the percent of lobular necrosis, a finding that may be difficult to reproduce. In this study, the dichotomous variable of centrilobular necrosis was found to be only fairly to moderately reliable. In general, increased demand for histopathological detail appears to occur at the expense of reproducibility.

The results of the analysis for predictive value of the grading system are encouraging, although they have to be interpreted with appropriate caution until more patients and a longer follow-up are accrued. In this
series of patients, the histopathological acute rejection grade at initial onset could, with reasonable certainty, predict the need for more aggressive therapy and allograft failure from rejection within 6 months. Frankly, the result is surprising because liver allograft failure from rejection is uncommon with the current arsenal of immunosuppressive drugs and usually occurs only after several treatment failures.

The many advantages of a standardized, reproducible, and predictive grading system are obvious. These include evaluation and tailoring of treatment protocols, and information exchange within and between transplantation centers about the influence of various donor and recipient factors on rejection and graft outcome. In contrast, a dangerous byproduct of standardization is that a proposed system could become a self-fulfilling prophecy, contrary to a patient's best interests. For example, if the presence and severity of rejection are dependent on the presence of an inflammatory infiltrate, a patient may receive additional immunosuppressive therapy until all traces of an infiltrate have disappeared, whether needed or not. Innocuous, interstitial lymphocytic infiltrates have been repeatedly described in well-functioning, long-term kidney and cardiac allografts. 16-18,21-23 The converse may also be true; a patient may not receive therapy when needed because prerequisite biopsy findings were not present.

The low positive predictive value of the grading scheme and the logistic regression model for the unfavorable long-term outcome is likely related to two factors. First, as already mentioned, most episodes of acute rejection resolve with the current highly effective arsenal of immunosuppressive agents. Therefore, the "hard" endpoint of allograft failure from acute rejection is avoided. Whether patients needlessly suffer from increased infections as a consequence is a difficult question to address in humans, but should be kept in mind.
Second, the separation between grading and staging of a rejection reaction is not a clear one. It is likely that the positive predictive value could be substantially increased by including more than one of a series of biopsies from a patient, as suggested by Snover et al. 10 Unfortunately, the small number of patients in this study with either allograft failure from rejection or serial biopsies with more severe grades of rejection precluded a rigorous test of this hypothesis.

For this study, rejection was defined by a combination of clinical evidence of graft dysfunction and pathological evidence of tissue damage attributable to rejection-related inflammatory infiltrate. No attempt was made to define a single or even a set of histopathological changes for separating therapy-requiring rejection from episodes not requiring treatment. In the setting of liver transplantation, defining such histopathological parameters may be an extremely difficult task. For example, spontaneous reversal of grade A1 (mild) rejection without additional immunosuppression was observed in approximately 3% of patients in this study, even though they fulfilled all of the clinical and histopathological criteria for rejection. Dousset et al 24 have also documented the "spontaneous" reversal of even moderate acute rejection, and Hubscher et al 25 and Snover et al 10 have shown that even loss of bile ducts, presumably from chronic rejection, can involute. In some experimental animal models, histologically "severe" acute rejection associated with liver allograft dysfunction always spontaneously resolves without any immunosuppressive therapy. 26 Subsequently, the liver allografts are permanently accepted and induce immunologic tolerance in the recipient. 26 Therefore, concentration on strictly morphological findings confined to the allograft focuses attention away from broader, and possibly more important, immunologic events occurring in the recipient. 27

The proposed grading system for acute liver allograft rejection is not unlike classification and grading schemes already in existence. Therefore, in practice it is likely that there will be a high rate of concordance for the various grades, even though different wording is used. There are drawbacks to all grading systems, including this one, but it has the advantage of rigorous testing for both reproducibility and predictive value among five different pathologists and patients from three centers. In addition, logistic regression modeling using the histopathological features recorded was unable to produce a significantly better scheme when the outcome was known. A future challenge will be to determine if a worldwide consensus can be achieved by combining this and other currently available schemes into one grading system, so that problems associated with many different systems in concurrent use 28 can be overcome.

REFERENCES

1. Demetris AJ, Belle SH, Hart J, Lewin K, Ludwig J, Snover D, Tillery GW, et al. Intraobserver and interobserver variation in the histopathologic assessment of liver allograft rejection. HEPATOLOGY 1991;14:751-755.
2. O'Connell D, Dobson AJ.
General observer-agreement measures on individual subjects and groups of subjects. Biometrics 1984;40:973-983.
3. Porter KA. Pathology of liver transplantation. Transplant Rev 1969;2:129-170.
4. Portmann B, Neuberger J, Williams R. Intrahepatic bile duct lesions. In: Calne RY, ed. Liver transplantation. The Cambridge-King's College hospital experience. London: Grune & Stratton, 1983:279-287.
5. Wight DGD. Pathology of rejection. In: Calne RY, ed. Liver transplantation. The Cambridge-King's College hospital experience. London: Grune & Stratton, 1983:247-277.
6. Eggink HF, Hofstee N, Gips CH, Krom RAF, Houthoff HJ. Histopathology of serial graft biopsies from liver transplant recipients: liver homograft pathology. Am J Pathol 1983;114:18-31.
7. Vierling JM, Fennell RH Jr. Histopathology of early and late human hepatic allograft rejection: evidence of progressive destruction of interlobular bile ducts. HEPATOLOGY 1985;4:1076-1082.
8. Williams JW, Peters TG, Vera SR, Britt LG, Voorst SJV, Haggitt RC. Biopsy-directed immunosuppression following hepatic transplantation in man. Transplantation 1985;39:589-596.
9. Hubscher SG, Clements D, Elias E, McMaster P. Biopsy findings in cases of rejection of liver allograft. J Clin Pathol 1985;38:1366-1373.
10. Snover DC, Freese DK, Sharp HL, Bloomer JR, Najarian JS, Ascher NL. Liver rejection: an analysis of the use of biopsy in determining outcome rejection. Am J Surg Pathol 1987;11:1-10.
11. Kemnitz J, Ringe B, Cohnert TR, Gubernatis G, Choritz H, Georgii A. Bile duct injury as part of diagnostic criteria for liver allograft rejection. Hum Pathol 1988;20:132-143.
12. Ludwig J, Wiesner RH, Batts K, Perkins JD, Krom RAF. The acute vanishing bile duct syndrome (acute irreversible rejection) after orthotopic liver transplantation. HEPATOLOGY 1987;7:476-483.
13. Demetris AJ, Qian S, Sun H, Fung JJ. Liver allograft rejection: an overview of morphologic findings. Am J Surg Pathol 1990;14(1):49-63.
14. Porter KA. Pathological changes in transplanted kidneys. In: Starzl TE, ed. Experience in renal transplantation. Philadelphia: Saunders, 1964:299.
15. Porter KA. Renal transplantation. In: Heptinstall RH, ed. Pathology of the Kidney. Boston: Little, Brown and Co., 1992:1799-1933.
16. Billingham ME. Some recent advances in cardiac pathology. Hum Pathol 1979;10:367-386.
17. Kemnitz J, Cohnert T, Schafters H, Helmke M, Wahlers T, Herrmann G, Schmidt RM, et al. A classification of cardiac allograft rejection. Am J Surg Pathol 1987;11:503-515.
18. Billingham ME, Cary NRB, Hammond ME, Kemnitz J, Marboe C, McCallister HA, Snover DC, et al. A working formulation for the standardization of nomenclature in the diagnosis of heart and lung rejection: heart rejection study group. J Heart Transpl 1990;9:587-593.
19. Yousem SA, Berry GJ, Brunt EM, Chamberlain D, Hruban RH, Sibley RK, Stewart S, et al. A working formulation for the standardization of nomenclature in the diagnosis of heart and lung rejection: lung rejection study group. J Heart Transpl 1990;9:593-601.
20. Solez K, Axelsen RA, Benediktsson B, Burdick JF, Cohen AH, Colvin RB, Croker BP. International standardization of nomenclature and criteria for the histologic diagnosis of renal allograft rejection: the Banff working classification of kidney transplant pathology. Kidney Int 1993;44:411-422.
21. d'Ardenne AJ, Dunnill MS, Thompson JF, McWhinnie D, Wood RFM, Morris PJ. Cyclosporine and renal graft histology. J Clin Pathol 1986;39:145-151.
22. Herbertson BM, Evans DB, Calne RY, Banerjee AK. Percutaneous needle biopsies of renal allografts: the relationship between morphological changes present in biopsies and subsequent allograft function. Histopathology 1977;1:161-178.
23. Kiaer H, Hansen HE, Olsen S. The predictive value of percutaneous biopsies from human renal allografts with early impaired function. Clin Nephrol 1980;13:58-63.
24. Dousset B, Hubscher SG, Padbury RT, Gunson BK, Buckels JA, Mayer AD, Elias E, et al. Acute liver allograft rejection: is treatment always necessary? Transplantation 1993;55:529-534.
25. Hubscher SG, Buckels JAC, Elias E, McMaster P, Neuberger J. Vanishing bile-duct syndrome following liver transplantation: is it reversible? Transplantation 1991;51:1004-1010.
26. Kamada N. The immunology of experimental liver transplantation in the rat. Immunology 1985;55:369-389.
27. Demetris AJ, Murase N, Rao AS, Starzl TE. The role of passenger leukocytes in rejection and "tolerance" after solid organ transplantation: a potential explanation of a paradox. In: Touraine JL, Traeger J, Betuel H, Dubernard JM, Revillard JP, Dupuy C, eds. Rejection and tolerance. Proceedings of the 25th International Conference on Transplantation and Clinical Immunology. Boston: Kluwer Academic Publishers, 1994:325-392.
28. Billingham ME. Dilemma of variety of histopathologic grading systems for acute cardiac allograft rejection by endomyocardial biopsy. J Heart Transpl 1989;9:272-276.