Neurotoxicology and Teratology, Vol. 16, No. 1, pp. 55-63, 1994
Copyright © 1994 Elsevier Science Ltd
Printed in the USA. All rights reserved
0892-0362/94 $6.00 + .00
Pergamon
Behavioral Evaluations in Developmental Toxicity Testing: MARTA Survey Results
ELIZABETH A. LOCHRY,*1 CONNIE JOHNSON† AND PATRICK J. WIER†
*Bristol-Myers Squibb Pharmaceutical Research Institute, New Brunswick, NJ 08903-0191
†SmithKline Beecham Research and Developmental Division, King of Prussia, PA 19406-0939

Received 28 December 1992; Accepted 20 August 1993

LOCHRY, E. A., C. JOHNSON AND P. J. WIER. Behavioral evaluations in developmental toxicity testing: MARTA survey results. NEUROTOXICOL TERATOL 16(1) 55-63, 1994.
During 1991, the Middle Atlantic Reproduction and Teratology Association (MARTA) conducted a survey of laboratories performing behavioral evaluations as part of GLP developmental toxicity studies. This survey was conducted to determine the extent to which an "industry standard" had evolved for behavioral test batteries. The most commonly used developmental parameters were eye opening, pinna unfolding, and sexual maturation (physical landmarks); surface righting, pupil constriction, and nonautomated acoustic startle (reflexive landmarks). Locomotor activity was used by 80% of the laboratories. The majority (76%) of laboratories conducted at least one learning and/or retention evaluation; nearly one-fourth of the laboratories routinely performed two. The most common learning tests were watermaze (primarily a simple two-choice discrimination task) and passive avoidance. Automated startle paradigms (habituation, prepulse modification, and/or startle elicitation) were evaluated by 28% of the laboratories. This survey showed a remarkable similarity in methodology across laboratories and a progressive increase in the number of GLP studies that included behavioral assessments. The results indicate that behavioral tests have become a common component of developmental toxicity assessments of pharmaceuticals.

Behavioral testing; Developmental landmarks; Reproductive toxicity; Developmental toxicity; Learning; Retention; Motor activity; Startle; Survey
1 Requests for reprints should be sent to Elizabeth A. Lochry, Bristol-Myers Squibb, Pharmaceutical Research Institute, 1 Squibb Drive, P.O. Box 191, New Brunswick, NJ 08903-0191.

PURPOSE OF SURVEY
Behavioral evaluations have been a required or recommended component of developmental toxicity assessments of pharmaceuticals in Japan and Great Britain for well over a decade. In 1991, the Middle Atlantic Reproduction and Teratology Association (MARTA) conducted a survey of laboratories performing behavioral evaluations in safety assessments of pharmaceuticals. This survey was conducted to determine the extent to which an "industry standard" had evolved for behavioral test batteries included in developmental toxicity evaluations. In attempting to answer this question, it was determined that a survey of the laboratories that were doing the safety testing would supply the most accurate and relevant information.

CONSTRUCTION OF THE QUESTIONNAIRE
The survey included questions designed to identify: (a) specific behavioral tests that were consistently being used to satisfy regulatory guideline requirements; (b) the reasons why certain behavioral tests had survived within the various test batteries to become those tests most routinely used; (c) the overall perceived importance of the information obtained when behavioral testing was included in safety assessments.

RESPONDING LABORATORIES
Questionnaires were mailed to 51 companies that performed developmental toxicity evaluations of pharmaceuticals
in the United States and Europe. Companies and individuals within the companies were identified through membership lists of MARTA, the Neurobehavioral Teratology Society, and Teratology Society, as well as published lists of pharmaceutical companies supporting various behavioral seminars and meetings over the past 5-year period. Japanese pharmaceutical companies were excluded because they had been surveyed just prior to the present survey (7).

Laboratory identity was kept strictly confidential. All questionnaires were assigned a coded number on receipt. The code number was then used for purposes of distinguishing the different laboratories while the survey results were tabulated. An alphabetized list of the respondents without code numbers was maintained only for purposes of forwarding a copy of the survey results to the participating laboratories.

Of the 51 companies that were mailed questionnaires, 26 companies (22 private industrial laboratories; 4 contract laboratories) returned completed questionnaires. Twenty-five (18 from the USA; 7 from Europe) of the 26 laboratories indicated that most or all studies were conducted according to Good Laboratory Practice (GLP) Standards (e.g., 5,8,10). One laboratory that did not do any studies in accordance with GLPs was excluded from the tabulations. Companies that returned completed questionnaires tended to be those with active members in behavioral societies.

On average, 86.5% of the studies performed by the 25 laboratories included in the survey were designed and conducted for purposes of meeting regulatory agency guidelines. Twenty-one of the 25 laboratories devoted an average of 12.4% of their time to methods development and 7 laboratories allocated an average of 9.3% of their time to basic research.

A variety of test agents were evaluated by the responding laboratories and are summarized in Fig. 1. The majority of laboratories in the survey evaluated pharmaceuticals, although approximately 25% to 30% of the laboratories tested pesticides and industrial chemicals instead of or in addition to pharmaceuticals.

[FIG. 1. Test agents evaluated by responding laboratories. Horizontal axis: Percentage of Companies Evaluating Agents (0 to 100). Bar labels legible in the source: Environ. Agents (n = 2); Cons. Products (n = 2); Cosmetics.]

STUDY DESIGNS THAT INCLUDE BEHAVIORAL EVALUATIONS
Developmental toxicity studies that included behavioral endpoints were most often designed on the basis of British guidelines (4) (22 laboratories) or Japanese guidelines (6) (19 laboratories) for reproductive safety testing of pharmaceuticals. Behavioral evaluations were most commonly included in Segment I fertility and general reproduction studies or Segment III perinatal and postnatal evaluations. All laboratories used rats and one laboratory also used mice.

There was a definite trend toward increasing numbers of studies that included behavioral endpoints from 1986 to 1990 as shown in Table 1. This observation most likely reflects the increasing number of studies that were based on European and/or Japanese guidelines during this same period. This may have reflected the trend toward individual companies trying to devise their own international protocols for submission to different countries which predated the official efforts to harmonize international regulatory requirements.

Over the 5-year period covered by the survey, 40% of the U.S. companies and 9% of European companies which evaluated test articles for conventional morphological teratology endpoints also evaluated behavioral endpoints in the offspring (developmental neurotoxicity parameters). However, these behavioral and morphological evaluations were not necessarily conducted in the same study. On average, 14% of U.S. companies and 10% of European companies conducted morphological assessments and behavioral tests in the same study, presumably Japanese Segment II studies.

Thirteen of the 20 companies stated a preference for treatment periods covering at least some portion of both gestation and lactation when studies included behavioral evaluations of
TABLE 1
STUDIES WITH BEHAVIORAL EVALUATIONS 1986 TO 1990

        USA Laboratories                   European Laboratories              Total
Year    Labs  Studies  Average*  Range     Labs  Studies  Average*  Range     Labs  Studies  Average*  Range
1986       9       22       2.4   1-5         5       19       3.8   1-11       14       41       2.9   1-11
1987      13       36       2.8   1-7         5       17       3.4   1-8        18       53       2.9   1-8
1988      16       55       3.4   1-9         5       23       4.6   1-9        21       78       3.7   1-9
1989      14       48       3.4   1-16        5       35       7.0   1-14       19       83       4.4   1-16
1990      16       65       4.1   1-16        6       29       4.8   1-11       22       94       4.3   1-16

*Average number of studies per laboratory.
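As a cross-check on Table 1, the Total columns and the per-laboratory averages can be recomputed from the lab and study counts. The short Python sketch below is an illustration only and is not part of the survey's analysis; the names `USA`, `EUROPE`, `combine`, and `studies_per_lab` are hypothetical, and the counts are transcribed from the table.

```python
# Table 1 counts as transcribed from the survey: year -> (labs, studies).
USA = {1986: (9, 22), 1987: (13, 36), 1988: (16, 55), 1989: (14, 48), 1990: (16, 65)}
EUROPE = {1986: (5, 19), 1987: (5, 17), 1988: (5, 23), 1989: (5, 35), 1990: (6, 29)}


def combine(a, b):
    """Add the lab and study counts of two regions, year by year."""
    return {year: (a[year][0] + b[year][0], a[year][1] + b[year][1]) for year in a}


def studies_per_lab(counts):
    """Average number of studies per laboratory, rounded to one decimal."""
    return {year: round(studies / labs, 1) for year, (labs, studies) in counts.items()}


TOTAL = combine(USA, EUROPE)

for year in sorted(TOTAL):
    labs, studies = TOTAL[year]
    print(year, labs, studies, studies_per_lab(TOTAL)[year])
```

With the transcribed counts, the recomputed totals and averages agree with the tabulated values to one decimal place.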
the offspring. The remaining seven companies preferred to restrict treatment to gestation.

SURVEY RESULTS
Most of the laboratories surveyed had sampled a variety of tests and retained or discarded specific tests as their individual behavioral batteries evolved. The questionnaire asked the laboratories to evaluate the tests they had sampled on the basis of practicality, variability of the data, and the perceived sensitivity of a given test, regardless of whether or not they retained the test in their behavioral battery. The laboratories were asked to use a 4-point scale to evaluate the behavioral tests. The ratings associated with this scale were as follows: 1 = poor, 2 = acceptable, 3 = good, and 4 = excellent.
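The survey does not state how the overall rating reported in the tables was derived from the three category ratings. The values tabulated for the physical landmarks (Table 2) are consistent with an unweighted mean of the practicality, variability, and sensitivity means, rounded to one decimal. The Python sketch below checks that assumption; the rule and the name `overall_rating` are assumptions made for illustration, and the numbers are transcribed from Table 2.

```python
def overall_rating(practicality, variability, sensitivity):
    """Unweighted mean of the three category means, to one decimal (assumed rule)."""
    return round((practicality + variability + sensitivity) / 3, 1)


# landmark -> (practicality, variability, sensitivity, reported overall rating)
TABLE_2 = {
    "eye opening": (3.2, 2.9, 2.4, 2.8),
    "pinna unfolding": (3.2, 2.8, 2.2, 2.7),
    "incisor eruption": (3.0, 2.8, 2.5, 2.8),
    "hair growth": (2.8, 2.3, 2.2, 2.4),
    "balano-preputial skinfold separation": (3.0, 3.2, 3.2, 3.1),
    "vaginal patency": (3.1, 2.8, 2.7, 2.9),
    "testes descent": (2.4, 2.4, 2.1, 2.3),
}

# Landmarks whose reported overall rating differs from the assumed rule.
mismatches = [
    name
    for name, (p, v, s, reported) in TABLE_2.items()
    if overall_rating(p, v, s) != reported
]
print(mismatches)
```

For the Table 2 values the list of mismatches is empty, so the assumed rule reproduces every reported overall rating for the physical landmarks.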
Physical Developmental Parameters
As shown in Table 2, the most commonly used physical developmental landmarks were eye opening, pinna unfolding, and sexual maturation. Each of the 25 laboratories (100%) included in the survey had sampled both eye opening and pinna unfolding and 88% and 84% of the laboratories retained these respective endpoints in their test batteries. Sexual developmental landmarks (vaginal patency and balano-preputial skinfold separation or testes descent) were sampled by 80% and retained by 85%-100% of the laboratories that sampled these tests.

Balano-preputial skinfold separation and vaginal patency were rated best in terms of variability and sensitivity and also received the highest overall rating of any physical developmental parameter. Vaginal patency was used to monitor female sexual maturation by all 19 laboratories that regularly included this endpoint in their test battery and was given an overall mean rating of 2.9. Although testes descent was the most commonly used method of assessing male sexual maturation, this test received the lowest rating of all physical landmarks for practicality and sensitivity, and received the lowest overall rating of any physical landmark. Five laboratories used balano-preputial skinfold separation instead of, or in addition to, testes descent and gave clearly higher ratings to balano-preputial skinfold separation in all three categories. Based on the relatively higher ratings given to balano-preputial skinfold separation and its requisite use in the Environmental Protection Agency's Developmental Neurotoxicity
TABLE 2
PHYSICAL DEVELOPMENTAL LANDMARKS

                                       Labs That Ever   Labs That Still   Ratings‡ (Mean ± SD)                        Overall
                                       Used Test        Use Test                                                      Rating
Physical Landmark                      N/N* (%)         N/N† (%)          Practicality   Variability   Sensitivity    (Mean)
Eye opening                            25/25 (100)      22/25 (88.0)      3.2 ± 0.9      2.9 ± 0.6     2.4 ± 0.8      2.8
Pinna unfolding                        25/25 (100)      21/25 (84.0)      3.2 ± 0.9      2.8 ± 0.8     2.2 ± 1.0      2.7
Incisor eruption                       20/25 (80.0)     13/20 (65.0)      3.0 ± 1.2      2.8 ± 0.9     2.5 ± 0.8      2.8
Hair growth                            17/25 (68.0)     9/17 (52.9)       2.8 ± 1.1      2.3 ± 1.0     2.2 ± 2.3      2.4
Balano-preputial skinfold separation   5/25 (20.0)      5/5 (100)         3.0 ± 0.7      3.2 ± 0.8     3.2 ± 0.5      3.1
Vaginal patency                        20/25 (80.0)     19/20 (95.0)      3.1 ± 0.7      2.8 ± 0.8     2.7 ± 1.0      2.9
Testes descent                         20/25 (80.0)     17/20 (85.0)      2.4 ± 1.0      2.4 ± 0.9     2.1 ± 1.0      2.3

*Based on the total number of labs included in the survey; †Based on the number of labs that ever used the test; ‡Based on the rating scores from labs that maintained the test in their battery (1 = Poor; 2 = Acceptable; 3 = Good; 4 = Excellent).
Study Design (9), this evaluation may soon become the predominant endpoint for monitoring male sexual maturation.

In general, all of the physical developmental landmarks fared well when rated for practicality. Only hair growth and testes descent failed to receive an average rating of good or better in the area of practicality when rated by the laboratories that regularly used these tests. Of the laboratories that originally sampled incisor eruption and hair growth, only 65% and 53% retained these respective tests in their behavioral batteries. The primary reason cited by the 7 laboratories that dropped incisor eruption and 8 laboratories that discarded hair growth was poor sensitivity (average ratings of 1.3 and 1.1, respectively).

Most laboratories routinely test all of the pups in each litter for each developmental landmark. The exception to this practice is the evaluation of sexual maturation which is conducted after weaning of the F1 generation when specific pups have been selected from each litter for postweaning evaluation. In terms of frequency of testing, the predominant practice is to test the pups on multiple days, rather than just once.

Reflex Developmental Parameters
The reflex measures most commonly used on a routine basis were pupil constriction, acoustic startle (performed manually), and surface righting (Table 3). Of the laboratories that sampled these tests, 87.5% to 94.1% retained these respective tests in their behavioral batteries. Although surface righting was the reflex test most often sampled (96.9%), it was retained in relatively fewer (87.5%) test batteries than pupil constriction (94.1%) and acoustic startle (90.5%). Air righting, negative geotaxis, and grip strength were sampled by only about half of the laboratories surveyed. Of these, air righting was retained by 71.4% of the laboratories that sampled it and negative geotaxis and grasping reflex by about half of the laboratories that had originally sampled these tests.
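The retention figures quoted above are simple ratios: the number of laboratories that still use a test divided by the number that ever sampled it. A minimal Python sketch, using counts transcribed from Tables 2 and 3 and a hypothetical helper named `retention_percent`, reproduces them:

```python
def retention_percent(still_use, ever_used):
    """Percentage of sampling laboratories that kept the test, to one decimal."""
    return round(100 * still_use / ever_used, 1)


# test -> (labs that ever used the test, labs that still use it)
COUNTS = {
    "incisor eruption": (20, 13),
    "hair growth": (17, 9),
    "pupil constriction": (17, 16),
    "acoustic startle (nonautomated)": (21, 19),
    "surface righting": (24, 21),
    "air righting": (14, 10),
}

RETENTION = {
    name: retention_percent(still, ever) for name, (ever, still) in COUNTS.items()
}

for name, pct in RETENTION.items():
    print(f"{name}: {pct}%")
```

Expressing retention relative to the laboratories that sampled a test, rather than to all 25 respondents, is what allows a rarely sampled test to show a high retention rate.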
Although the grasping reflex was sampled and retained by the fewest number of laboratories, this test received the highest or second highest ratings for practicality, variability, and sensitivity (as evaluated by the laboratories that routinely include it in their test batteries), and scored the highest overall rating as the best reflexive landmark. Four of the five laboratories that had sampled the grasping reflex and discarded it from their behavioral batteries rated this test as "less than acceptable" (ratings of 1.3 to 1.7) in all three categories. These divergent opinions may reflect markedly different methods in conducting the test, possibly even the difference between the use of a quantified grip strength device as opposed to gross observation.

An interesting finding was that the manually performed acoustic startle test was retained in the behavioral batteries of 90.5% of the laboratories that originally sampled it, even though it had a relatively low degree of satisfaction in terms of perceived sensitivity (average rating of 2.1, just above acceptable). This finding may indicate an awareness that this endpoint can now be measured electronically (startle elicitation) on a much more precise scale than manual conduct allows.

As with the physical developmental parameters, the reflexive developmental landmarks fared well for practicality, receiving marks of 2.7 to 2.9. Parameters found to have the least variability were pupil constriction and grasping reflex as mentioned previously.

The survey indicated that the reflexive parameters (like the physical landmarks) are usually tested in all pups in the litter. With the exceptions of pupil constriction and grasping reflex, which were most often tested only once, the majority of laboratories evaluated the reflexive landmarks on multiple days.

Motor Activity
Automated activity testing was the most frequently used method of assessing motor behavior (Table 4). Sixty-eight percent of the laboratories participating in the survey had sampled automated activity testing and all (100%) of these laboratories retained this test in their behavioral batteries.
TABLE 3
PREWEANING REFLEX DEVELOPMENTAL LANDMARKS

                                  Labs That Ever   Labs That Still   Ratings‡ (Mean ± SD)                        Overall
                                  Used Test        Use Test                                                      Rating
Reflex                            N/N* (%)         N/N† (%)          Practicality   Variability   Sensitivity    (Mean)
Pupil constriction                17/25 (68.0)     16/17 (94.1)      2.8 ± 0.8      3.2 ± 0.9     2.4 ± 0.9      2.8
Acoustic startle (nonautomated)   21/25 (84.0)     19/21 (90.5)      2.7 ± 1.1      2.5 ± 0.9     2.1 ± 1.0      2.4
Surface righting                  24/25 (96.9)     21/24 (87.5)      2.9 ± 0.8      2.6 ± 0.6     2.2 ± 0.8      2.6
Air righting                      14/25 (56.0)     10/14 (71.4)      2.8 ± 0.9      2.8 ± 0.6     2.5 ± 0.9      2.7
Negative geotaxis                 14/25 (56.0)     8/14 (57.1)       2.8 ± 0.5      2.6 ± 0.5     2.1 ± 0.8      2.5
Grasping reflex                   11/25 (44.0)     6/11 (54.6)       2.8 ± 0.8      3.2 ± 0.8     3.0 ± 0.9      3.0

*Based on the total number of labs included in the survey; †Based on the number of labs that ever used the test; ‡Based on the rating scores from labs that maintained the test in their battery (1 = Poor; 2 = Acceptable; 3 = Good; 4 = Excellent).

TABLE 4
MOTOR ACTIVITY AND MOTOR COORDINATION EVALUATIONS

                              Labs That Ever   Labs That Still   Ratings‡ (Mean ± SD)                        Overall
                              Used Test        Use Test                                                      Rating
Test                          N/N* (%)         N/N† (%)          Practicality   Variability   Sensitivity    (Mean)
Automated activity monitors   17/25 (68.0)     17/17 (100)       2.8 ± 0.7      2.5 ± 0.8     2.6 ± 0.7      2.6
Nonautomated activity         3/25 (12.0)      3/3 (100)         2.3 ± 1.2      2.3 ± 0.6     2.3 ± 0.6      2.3
Rotating rods/drums           12/25 (48.0)     6/12 (50.0)       2.7 ± 0.5      2.7 ± 1.0     1.8 ± 0.4      2.4
Running wheels                2/25 (8.0)       0/2 (0)           not rated      not rated     not rated      -

*Based on the total number of labs included in the survey; †Based on the number of labs that ever used the test; ‡Based on the rating scores from labs that maintained the test in their battery (1 = Poor; 2 = Acceptable; 3 = Good; 4 = Excellent).

Activity testing monitored by manually counting or observing motor behavior was sampled by three laboratories, each of which (100%) retained this evaluation in their test batteries. As expected, higher, though not substantially higher, ratings for practicality, variability, and sensitivity were given for automated testing when compared with nonautomated (manual) testing. Two companies sampled the running wheel as an alternate motor activity evaluation but neither retained it as a routine test, and declined to comment on the practicality, variability, or sensitivity of this evaluation.

Automated motor activity testing was most often accomplished by devices equipped with photocell beams (Table 5). Photocell beams have been the most readily available commercial device for measuring motor activity for at least a decade. However, ratings for variability and sensitivity given by the users of infrared heat sensor devices were markedly higher (average rating of 3.0 for each category) than those given
by the users of photocells (average ratings of 2.5 and 2.6, respectively). Open fields were the most often used test field configuration (65% of laboratories) with the remaining laboratories (35%) using novel cages, home cages or residential mazes as testing chambers.

Sixteen (84.2%) of the 19 laboratories that provided descriptions of their motor activity testing methodology routinely assessed motor activity in the offspring once, while eight laboratories evaluated the offspring at least twice (Tables 5 and 6). Five laboratories occasionally monitored motor activity at different ages in different studies. The earliest evaluation most often occurred on postnatal days (PNDs) 30 to 35 or
TABLE 5
METHODS OF MONITORING MOTOR ACTIVITY

Apparatus/Equipment
Mechanism of Motion Detection      N (%)*
Photocell beam                     11 (55.0)
Manual (investigator)               3 (15.0)
Infrared heat sensor                2 (10.0)
Video scan                          2 (10.0)
Magnetic field or microwave         2 (10.0)

Test Chamber/Field Configuration   N (%)*
Open field                         13 (65.0)
Novel cage                          3 (15.0)
Home cage                           2 (10.0)
Residential maze                    2 (10.0)

Age(s) of Offspring When Tested for Motor Activity (Automated and Nonautomated Testing)
Age (Postnatal Day)                First Test N (%)
21                                  3 (15.8)
25-28                               3 (15.8)
30-35                               5 (26.3)
42-50                               4 (21.1)
60-63                               3 (15.8)
70-77                               1 (5.3)
(Responding laboratories)          (N = 19†)

Second Test N (%): 2 (25.0), 2 (25.0), 2 (25.0), 2 (25.0) (N = 8); the age rows for these four entries are not recoverable from the source layout.

*Based on the 20 laboratories which routinely include motor activity evaluations in their test batteries. †Excludes one of 20 laboratories which routinely tests motor activity but did not report test methodology (age, number of pups tested, or testing frequency).
TABLE 6
POSTWEANING BEHAVIORAL TESTS: FREQUENCY OF ROUTINE USE AND METHOD OF EVALUATION

                                                 Extent of Testing by Laboratories Routinely Using Test†
                             Frequency of Use*   All Pups in Litter Tested   Pups Tested Multiple Days
Parameter                    N (%)               N (%)                       N (%)
Motor assessments
  Automated motor activity   17 (68.0)           3 (18.8)‡                   3 (18.8)‡
  Manual motor activity       3 (12.0)           1 (33.3)                    0 (0.0)
  Rotating rod/drum           6 (24.0)           1 (16.7)                    1 (16.7)
Learning/retention
  Water maze                 15 (60.0)           2 (13.3)                    9 (60.0)
  Passive avoidance           9 (36.0)           1 (11.1)                    6 (66.7)
  Active avoidance            3 (12.0)           0 (0)                       3 (100)
Startle evaluations
  Startle habituation         4 (16.0)           0 (0.0)                     1 (25.0)
  Prepulse modification       2 (8.0)            1 (50.0)                    0 (0.0)
  Startle elicitation         1 (4.0)            1 (100)                     0 (0.0)

*Calculated as the number of laboratories routinely using the test in their behavioral batteries/total number of labs surveyed (25); †Based on the number of laboratories that retained the test in their behavioral batteries; ‡Excludes one of 17 laboratories that use automated activity testing but did not report testing methodology.
postnatal days 42 to 50. There was no clear agreement on the age of the second motor activity evaluation.

Motor Coordination
The rotating rod/drum test was sampled by only 48% of the laboratories and retained by only half of the laboratories which originally evaluated it (Table 4). Laboratories that retained the rotating rod in their test batteries as well as those that did not rated this test as "poor" for sensitivity.

Learning and Retention
Nineteen (76%) of 25 laboratories that responded to the survey routinely conducted learning and/or retention tests. Approximately half of the laboratories (52%) surveyed conducted one learning evaluation and nearly a quarter (24%) performed two learning tests as part of their routine behavioral batteries. Of these, nearly two-thirds routinely include retention testing. The most common retention interval was 24 h (50% of the laboratories), although approximately one-third of the laboratories (33%) used a 1-week retention interval. There was no clear agreement on the age of testing, regardless of whether the various types of learning tests were considered separately or together.

The watermaze and passive avoidance tests were most often used to assess learning and retention (Tables 6 and 7). Approximately 60% of the responding laboratories used a watermaze test, although nearly half of those using watermaze to assess learning did not include a retention (memory) test. Passive avoidance was used by 36% of the laboratories, all of which evaluated retention (memory). Three laboratories used active avoidance and no companies used schedule-controlled operant behavior. The six laboratories that performed two learning and retention tests all used the watermaze and passive avoidance tests.

Although the watermaze paradigm was more often used
TABLE 7
LEARNING/RETENTION EVALUATIONS

                     Labs That Ever   Labs That Still   Ratings‡ (Mean ± SD)                        Overall
                     Used Test        Use Test                                                      Rating
Test                 N/N* (%)         N/N† (%)          Practicality   Variability   Sensitivity    (Mean)
Watermaze            16/25 (64.0)     15/16 (93.8)      2.4 ± 0.9      2.4 ± 0.6     2.3 ± 0.7      2.4
Passive avoidance    11/25 (44.0)     9/11 (81.8)       2.6 ± 0.5      3.2 ± 1.6     2.7 ± 0.9      2.8
Active avoidance     4/25 (16.0)      3/4 (75.0)        2.3 ± 1.2      2.0 ± 1.0     1.7 ± 0.6      2.4
Operant behavior     0/25 (0)         -                 -              -             -              -

*Based on the total number of labs included in the survey; †Based on the number of labs that ever used the test; ‡Based on the rating scores from labs that maintained the test in their battery (1 = Poor; 2 = Acceptable; 3 = Good; 4 = Excellent).
than passive avoidance to assess learning, passive avoidance clearly received a higher degree of user satisfaction as indicated by the highest overall rating for the learning and retention evaluations, as well as the highest ratings in the individual categories of practicality, variability, and sensitivity. Active avoidance was rated lowest by the laboratories that routinely use it in all three categories, particularly for sensitivity, where it received a "less than acceptable" average rating.

Little information was offered on the specific methodology or equipment used in testing passive avoidance learning and retention. Watermaze learning was most often evaluated in simple two-choice mazes. The simple two-choice maze was used more than twice as often (53.3%) as the more complex Biel maze (20%). Four laboratories which routinely included a watermaze evaluation in their test batteries did not describe their apparatus.
Automated Startle Paradigms
Seven of 25 laboratories (28%) had sampled automated startle testing and an additional two laboratories stated their intention of obtaining the equipment to begin methods development of this test in the near future. At the time of the survey, automated startle paradigms were used exclusively by U.S. companies. Six of the seven (85.7%) laboratories that had sampled one or more automated startle paradigms retained at least one as a regular component of their behavioral batteries. The automated startle test received relatively high ratings (average rating of better than good) in the category of sensitivity, most likely reflecting the exceptional precision of the equipment used to measure it. However, the ratings it received for variability and practicality were more moderate.

Auditory startle habituation was the most frequently used startle paradigm (66.6% of laboratories). Prepulse modification was evaluated by two (33.3%) laboratories and simple startle elicitation was evaluated by one (16.7%). Two laboratories evaluated more than one type of startle paradigm. The majority of laboratories which routinely evaluated the startle response utilized force transducers to measure the magnitude of the response. One laboratory used an accelerometer, while another laboratory did not specify the device for gauging the startle response. The six laboratories that regularly tested startle paradigms in their test batteries utilized an auditory stimulus, while two of these also occasionally used a tactile stimulus for startle testing. Startle testing was conducted as often as three times during a single study, although most laboratories tested startle twice within a given study. No clear trend toward a standard age of testing was identified. The earliest age of testing was PND 19 and the latest was PND 63.
Relationship of Behavioral Effects With Other Endpoint Results
Behavioral measures were often found to be associated with other measures of developmental toxicity (Table 8). Seventy-seven percent of the responding laboratories found behavioral effects to be associated with pup body weight effects. Forty-four percent specifically identified physical developmental landmarks as being associated with pup body weight effects. Two of the laboratories stated that they had found no relationship between behavioral effects and other measures of developmental toxicity. Twenty percent of the laboratories had encountered a test
TABLE 8
EXPERIENCES WITH INTERPRETATION OF BEHAVIORAL DATA

Relationship of Behavioral Endpoints to Other Developmental Toxicity Endpoints
Behavioral endpoint               Other endpoints   N (%)
Behavioral measures in general    Pup weight        14 (77.8)
Development landmarks             Pup weight         8 (44.4)
Behavioral measures in general    None               2 (11.1)
(Responding laboratories = 18)

Behavioral Effects in Relation to Other Effects
Behavioral Outcomes in Safety Studies                                 N (%)
Offspring behavioral effects at doses < maternal NOEL                 9 (36.0)
Offspring behavioral effects determined the developmental NOEL        5 (20.0)
Offspring behavioral effects used to decide dose levels in studies    6 (25.0)
(Responding laboratories = 24 to 25 per question)

Developmental Endpoint Most Often Affected by Test Agent
Test or Endpoint      N (%)
Motor activity        5 (25.0)
Body weight           4 (20.0)
Startle reflex        3 (15.0)
Offspring survival    3 (15.0)
Passive avoidance     3 (15.0)
Grasping reflex       2 (10.0)
10 others             1 (5.0)
(Responding laboratories = 20)
agent that produced behavioral effects in the offspring at doses lower than those that produced any other type of developmental toxicity. Twenty-five percent (25%) had used behavioral effects found in one study to set dose levels in other studies. The developmental endpoints that most often identified a test article-related developmental change were motor activity and body weights.

DISCUSSION
Over the past 14 years, there have been at least three other surveys conducted that assessed the use and interpretation of behavioral evaluations and postnatal test batteries. These include a survey of academic, industry, and government institutions conducted in 1978 by Buelke-Sam and Kimmel (1), a survey of pharmaceutical companies and contract laboratories conducted in 1985 by Chester et al. (2), and surveys conducted in 1988 and 1989 by the Japanese Behavioral Teratology Meeting Group (7). Although this 1991 survey specifically targeted laboratories primarily devoted to conducting GLP reproductive safety studies of pharmaceuticals in contrast to previous surveys that involved a larger number of more diverse laboratories, several comparisons can be made.

In all three surveys, growth parameters were consistently found to be the most common endpoints in behavioral test batteries, with an increasing number of laboratories opting to include these evaluations over the past decade. Use of physical developmental landmarks increased from about 22% of the laboratories in the 1978 survey to about 50% of the laboratories in the 1985 survey. In the current survey, at least 88% of laboratories routinely included one or more physical developmental landmarks.

Surface righting has been the most frequently used endpoint among U.S. and European investigators conducting reflexive tests on an ongoing basis (17 of 34 in 1978; 17 of 25 in 1985; 21 of 25 in 1990).
Auditory startle response (manual or automated) has also been a favored reflex test in these surveys (15 of 34 in 1978; 21 of 24 in 1985; 22 of 25 in 1990). In the 1988 Japanese survey, the surface righting reflex was evaluated by 13 of 19 laboratories and Preyer reflex was assessed by 16 of 19 laboratories. Collectively, the surveys indicate that tests of reflex functions continue to be the primary method for evaluating sensory function.

Motor activity assessments were most often conducted using an open-field apparatus but details of the test methods varied considerably across laboratories. Motor activity was evaluated in an open field by 59% (1978 survey), 75% (1985 survey), and 65% (the current survey) of the laboratories included in the three surveys. However, the current survey indicated that the method of detection and field configuration were the same in only about half of the laboratories. A similar finding was noted in the Japanese survey which found that although all 26 laboratories (100%) categorically used an open field to assess motor activity, there were fairly wide discrepancies in test field configurations (e.g., 15 of the laboratories used a circular field while 11 used a square-shaped field).
While the shape or type of test apparatus may not necessarily be critical factors, other test design variables such as the duration of the test session and the age of the animals at the time of testing have been considered significant variables which require standardization in other regulatory guidelines (9). Currently, it is unlikely that any one experimental protocol for motor activity testing is used by more than a minority of investigators.

A similar situation was apparent for tests of learning and memory, which have been primarily evaluated in watermaze escape paradigms (43% in the 1978 survey, 65% in the 1985 survey, 81% in the 1989 Japanese survey, and 60% in the current survey). However, given a multitude of test design variables (maze configuration, trial type, age at the time of testing, inclusion of a retention phase), it is likely that few laboratories conducted watermaze testing in exactly the same way.

At the time of the 1978 survey, operant testing procedures were being used for assessment of prenatal and/or neonatal effects on behavior by nearly 25% of the laboratories surveyed. However, subsequent surveys noted a progressive decline in the use of operant testing to only 15% of laboratories surveyed in 1985, 2% of laboratories surveyed in 1989 (Japanese survey), and none in this survey.

Relative to past surveys, the current survey noted an increased use of reproductive performance (mating and fertility), with more than 80% of laboratories including these evaluations in their postnatal test battery. A concomitant increase in the assessment of sexual maturation as a standard component of behavioral test batteries (vaginal patency, balanopreputial skinfold separation) was also noted, indicating an increased recognition that these endpoints can be used to detect potential changes in neuroendocrine function (3).
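The reflex-test counts quoted earlier in this discussion (surface righting and auditory startle) are reported as "n of N" laboratories, while most other comparisons are reported as percentages. As an aid to comparison, the following minimal sketch converts those counts to percentages. The counts are copied from the text; the variable and function names are illustrative assumptions and are not part of the survey, and the year labels follow the text as written.

```python
# Counts as quoted in the discussion: (laboratories using the test, laboratories counted).
# Names below are hypothetical; only the numbers come from the text.
reflex_counts = {
    "surface righting": {"1978": (17, 34), "1985": (17, 25), "1990": (21, 25)},
    "auditory startle": {"1978": (15, 34), "1985": (21, 24), "1990": (22, 25)},
}


def percent(used: int, total: int) -> float:
    """Return the share of laboratories using a test, as a percentage to one decimal."""
    return round(100.0 * used / total, 1)


# Build a test -> year -> percentage table from the raw counts.
reflex_usage = {
    test: {year: percent(used, total) for year, (used, total) in by_year.items()}
    for test, by_year in reflex_counts.items()
}

for test, by_year in reflex_usage.items():
    row = ", ".join(f"{year}: {value}%" for year, value in by_year.items())
    print(f"{test}: {row}")
```

Because the number of laboratories counted differs between surveys (and, for 1985, between the two tests), these percentages are only a rough guide to the trend.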
CONCLUSIONS

During the 5-year period covered by the survey, there was a substantial increase in the number of GLP reproductive toxicity studies which included behavioral assessments. Nearly twice as many reproductive toxicity evaluations were conducted for submission to regulatory agencies during 1990 as in 1986. The results indicate that behavioral tests have become a common component of Segment I and III tests of pharmaceuticals. The survey also found that behavioral tests are not limited to simple, manually performed reflex or physical developmental milestones but now include learning and retention evaluations as well as automated locomotor activity assessments on a regular basis. Overall, most companies have kept relatively current with the advances in available equipment and techniques with which to assess behavior.

ACKNOWLEDGEMENTS

We thank the following individuals for their assistance in coordinating and distributing the survey and in compiling the survey results: Judy Hash, Erica Hellman, Lynn Rapson, and Marge Vargo.
REFERENCES

1. Buelke-Sam, J.; Kimmel, C. A. Development and standardization of screening methods for behavioral teratology. Teratol. 20:17-30; 1979.
2. Chester, A.; Hallesy, D.; Andrew, F. Behavioral methods in reproductive and developmental toxicology. Neurobehav. Toxicol. Teratol. 7:669-673; 1985.
3. Clark, R. L.; Anderson, C. A.; Prahalada, S.; Robertson, R. T.; Lochry, E. A.; Leonard, Y. M.; Stevens, J. L.; Hoberman, A. M. Critical developmental periods for effects on male rat genitalia induced by Finasteride, a 5α-reductase inhibitor. Toxicol. Appl. Pharmac. 119:34-40; 1993.
4. Committee on Safety of Medicines (1974). Notes for guidance on reproduction studies. Department of Health and Social Security, Great Britain.
5. Japanese Ministry of Health and Welfare. Good laboratory practice standards for safety studies on drugs. Notification No. 313 of the Pharmaceutical Affairs Bureau, 1982.
6. Japanese Ministry of Health and Welfare (1984). Toxicity test and guidelines. Notification No. 118, Pharmaceutical Affairs Bureau, February 15, 1984.
7. Tanimura, T. Japanese perspectives on the reproductive and developmental toxicity evaluation of pharmaceuticals. J. Amer. Coll. Toxicol. 9:27-37; 1990.
8. U. S. Environmental Protection Agency. Toxic Substances Control Act (TSCA); Good Laboratory Practice Standards: Final Rule. Federal Register, Part III, Vol. 54, No. 158; 1989.
9. U. S. Environmental Protection Agency. Pesticide assessment guidelines, subdivision F - Hazard evaluation: Human and domestic animals, Addendum 10, Neurotoxicity. Health Effects Division, Office of Pesticide Programs, 1991.
10. U. S. Food and Drug Administration. Good laboratory practice regulations: Final rule. Federal Register, Part VI, Vol. 52, No. 172, 1987.