Recurrent problems in the behavioral assessment of social skill


Behav. Res. Ther. Vol. 21, No. 1, pp. 29-41, 1983
Printed in Great Britain. All rights reserved

0005-7967/83/010029-13$03.00/0
Copyright © 1983 Pergamon Press Ltd

ALAN S. BELLACK

Medical College of Pennsylvania at EPPI, 3200 Henry Avenue, Philadelphia, PA 19129, U.S.A.

Summary: Behavioral strategies for assessing social skill have been subjected to extensive analysis and criticism in the past few years. Many problems with earlier strategies have been corrected, yet the literature continues to be marked by a number of serious errors and invalid procedures. The purpose of this paper is to identify some of these recurrent difficulties, and recommend alternatives where possible. The discussion covers three general topics: the measures employed, the assessment format and conceptual issues which bear on assessment. Specific issues addressed include the respective value of molecular and molar ratings, the extent to which subjects actually engage in role playing, the utility of single prompt role-play tests, the selection and form of role-play scenarios and the role of skill deficits in interpersonal dysfunction. Some of the major assumptions and persistent dilemmas in the existing literature are examined in the light of questionable and often invalid assessment procedures.

INTRODUCTION

Progress in science tends to be agonizingly slow. It often takes 1 or 2 years to translate an innovative idea into analyzed data. Editorial and publication delays typically add another 1-2 years before the data actually appear. Research ongoing in other laboratories at the time of publication cannot be modified in mid-stream to accommodate the new findings. Hence, the impact of the innovation may not be felt or apparent in the literature for several more years. In the interim, additional research is published and new projects are initiated which may be invalid or conceptually obsolete. The resulting impact on theory and clinical practice can be awesome.

This problem is especially serious in popular, highly-researched areas, such as social-skills training and assessment. A quick review of my reference file uncovered 35 articles dealing with the assessment of social skills in adults published in behavioral journals and the Journal of Consulting and Clinical Psychology in 1980 and 1981. Almost undoubtedly, a like number appeared on social-skills training and on work with children and special populations. As an Editor and Editorial Board member of many of these journals, I know that many more articles on these topics were submitted and rejected. Many of the published articles (and some of the rejected ones) are conceptually and methodologically sound; several are seminal and may eventually redirect the field. Yet, a great many (the majority?) of studies are plagued by the information lag described above. They employ assessment procedures which are outmoded and, at times, either ungeneralizable or flatly invalid.

Early (i.e. mid-1970s) behavioral strategies for assessing social skill were over-simplistic. They relied on face validity and measurement precision, and failed to adequately consider psychometric factors, generalizability or social validity.
Many of the problems with early approaches have been elucidated in a number of review papers and book chapters published in the past several years (e.g. Arkowitz, 1981; Bellack, 1979; Conger and Conger, 1982; Curran, 1979; Curran and Mariotto, 1981; Hersen and Bellack, 1977). However, several pitfalls and problems addressed in those papers continue to plague the literature. Moreover, there are a number of important issues which have not yet been adequately addressed.

The purpose of this paper is to highlight some of the recurrent problems involved in the assessment of social skill. The emphasis will be on direct observation procedures, but some of the issues to be considered have broader implications. The reader is referred to existing reviews for a general background. The focus here will be on misunderstood or rarely-considered issues and procedures. Many points are not currently documented by data. Rather, they are based on my experience with over 1500 subjects and conversations with many of the leading social-skill researchers. Moreover, a number of the factors to be addressed may not individually account for enough variance to be easily verifiable. That does not discount their cumulative importance. The discussion will cover: (1) measures, (2) the observation format, (3) subject factors and (4) conceptual issues.

MEASURES

The core of any direct observational assessment is the specific set of behaviors observed. It is vital that the target behaviors adequately represent the criterion (i.e. ‘social skill’), that they are precisely defined and that they are reliably and accurately measured. Unfortunately, this has been a controversial and problematic area in the social-skills literature. There is tremendous variability in the selection and definition of target behaviors, specific measurement procedures and the level of observation (e.g. molar vs molecular). The major problem with this variability per se is that it makes interstudy comparisons difficult. More importantly, many otherwise sound studies have employed faulty and invalid measures.

Medium of assessment

There is no one set of behaviors and definitions which is appropriate for all circumstances. The medium of observation (e.g. live vs videotaped), the particular skill area assessed (e.g. heterosocial vs assertion), the S population (e.g. schizophrenic vs college student) and the purpose of the assessment (e.g. screening vs clinical appraisal) all affect the choice of measures. The last three of these factors are widely recognized. Less attention has been paid to the medium of observation.

Videotaping is generally the most adaptable and desirable medium. In contrast to audiotape, it allows for the assessment of visual response characteristics: the use of hand gestures, smiles, head nods etc. Because the data records are permanent, many more behaviors can be rated than is possible with live observations. Raters can play tapes over and over again, each time scoring a different response component. Reliability is thereby enhanced as well.

However, most videotaping systems available to researchers yield pictures of only modest quality, and there are marked limitations to what can be validly rated. Subtle response elements such as postural rigidity and muscle tension, which are apparent to live observers, often cannot be detected. For example, raters frequently cannot determine whether an S’s hands are comfortably folded or nervously grasping one another, or whether a slouched posture is rigid (tense) or relaxed. Loose and bulky clothing make rating especially difficult. Another problem concerns the zone of focus. Close-up pictures, which give adequate views for rating facial expression, do not permit ratings of hand gestures, foot tapping etc. Conversely, long-range, full-body views rarely offer sufficient facial detail to detect expression, eye contact etc.

The difficulty in observing and rating these features affects the selection of measures. It also has important implications for overall (i.e. molar) ratings of skill and anxiety. Judges are limited in the cues they can use for making such ratings, and thus (as will be discussed below), may place excessive reliance on verbal content in making judgments. Overall ratings of anxiety may be especially subject to distortion if judges cannot discern motoric, tension-related cues. In any case, live observations may be essential when it is important to assess such response elements.

Selecting and defining targets

The choice of appropriate target behaviors is essential if the assessment is to be meaningful. The Method sections of recent articles in this area suggest that researchers have begun to give adequate consideration to this issue. There is growing awareness that targets should have a documented relationship to the criterion, and must be elicited by the testing situation (e.g. waiting-room situations elicit initiation behaviors but not affection or assertion). However, much less attention has been paid to how target behaviors are defined and scored. Definitions and rating/scoring procedures vary widely across studies. The same response labels may be used (e.g. eye contact, anxiety, posture), but definitional/procedural differences can yield dramatically different data.


It is not feasible to catalogue all such variations here. But the issue is clearly illustrated by one of the most widely-measured responses: eye contact. First of all, the very term ‘eye contact’ is a misnomer. Neither live observers nor videotape raters can determine whether the S and confederate are actually looking into one another’s eyes. At best, raters can tell only if the S is looking at the confederate’s head and face. The more appropriate response label is, thus, gaze. Even that response can only be approximated, and only when the camera (or observer) is situated behind the confederate’s head. The focus of Ss’ eyes cannot be accurately judged from side views or distant (i.e. full-body) shots.

Following the procedures established by Eisler et al. (1975), gaze (eye contact) is typically scored as the number of seconds the S looks while speaking. This might have some legitimacy in the context of assertiveness. But, during ordinary conversations the speaker looks away from the listener (Duncan and Fiske, 1977; Trower, 1980). The listener looks at the speaker’s face, and both shift focus when they exchange roles. Therefore, in assessing conversational or heterosocial skill, gaze should be scored when the S listens, or separately during listening and speaking.

Another factor concerns just how the S looks. For example, even as a listener, a fixed gaze (i.e. stare) is not socially appropriate: most individuals gaze at the speaker’s face intermittently. Therefore, raters can be directed to score cumulative gaze time, number of gazes or duration of each gaze. These measures are not equivalent (Waxer, 1977); they reflect different affective states and parameters of social skill. The distinction is well-illustrated by contrasting a S who makes frequent ‘furtive’ glances, with one who stares fixedly for part of the time, with one who gazes for several short periods. Similar distinctions and definitional problems can be identified for other behaviors (e.g. autistic gestures vs illustrators; nervous vs responsive smiles). Much greater care is needed in determining precisely what raters should observe and score.

Scoring procedures

The selection and definition of targets is closely intertwined with the manner of scoring: e.g. defining the critical feature of a response as ‘duration’ indicates that it will be timed rather than counted. However, there generally are several alternative scoring strategies which may be employed. The alternatives are not necessarily equivalent, and some may even yield misleading or invalid data.

The most widely-employed model for scoring responses was developed by Eisler et al. (1975). In keeping with the behavioral goals of objectivity and operationalism, Eisler et al. placed excessive reliance on simple, clear-cut measures: frequency, duration and determination of occurrence or non-occurrence. While such measures are highly objective and reliable, it now appears that they have major shortcomings for assessing the most critical features of social performance. First, most responses of interest occur on continua, in which optimum performance is at an intermediary level. Voice volume can be too loud or too soft. Speech duration can be too brief or too long. Gaze can be excessive as well as deficient. Response latency can be too long or too short (e.g. the person talks over his/her partner). Frequency and duration measures are unidirectional, and cannot account for the fact that too much may be as bad as too little (and vice versa). Most unskilled Ss speak too little, avoid gaze, have long latencies etc. Thus, Ss who speak too much, look too much (i.e. stare), respond too quickly, or use too many gestures appear as ‘highly skilled’ when entered into data analyses.

This problem is magnified in treatment studies. Deficient Ss improve by increasing their responses, while the over-responsive group must decrease. When the data are combined, the latter group appears to have gotten worse. If there are few of these ‘opposite’ scorers, they may be excluded and frequency or duration will suffice. But, such Ss are not at all unusual in clinical populations.
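To make the frequency and duration measures concrete, and to show why the gaze measures distinguished earlier (cumulative gaze time, number of gazes, duration of each gaze) are not interchangeable, the following is a minimal sketch in Python. The function name `gaze_measures`, the interval format and all data values are assumptions introduced purely for illustration; they are not drawn from any study cited here.

```python
from statistics import mean

def gaze_measures(intervals):
    """Summarise a gaze record given as (start, end) times in seconds.

    Returns cumulative gaze time, number of gazes and the mean
    duration of a single gaze.
    """
    durations = [end - start for start, end in intervals]
    return {
        "cumulative_s": sum(durations),
        "n_gazes": len(durations),
        "mean_gaze_s": mean(durations) if durations else 0.0,
    }

# Hypothetical 60-s listening periods for three Ss (illustrative values only).
furtive = [(t, t + 1) for t in range(0, 60, 3)]        # twenty 1-s glances
starer = [(0, 20)]                                     # one fixed 20-s stare
intermittent = [(0, 5), (15, 20), (30, 35), (45, 50)]  # four 5-s gazes

for label, record in [("furtive", furtive),
                      ("starer", starer),
                      ("intermittent", intermittent)]:
    print(label, gaze_measures(record))
```

All three hypothetical Ss accumulate the same 20 s of gaze, so a cumulative-duration score alone would treat them as identical; only the count and the per-gaze duration separate the furtive glancer from the starer.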
In a recently completed outcome project on women with affective disorders, my colleagues (M. Hersen and J. Himmelhoch) and I found that while under-assertiveness predominated, over-assertiveness (marked by the type of behavioral excesses noted above) was a primary problem for a large segment of our sample. Similarly, a project on wife abuse (with R. Morrison and V. Van Hasselt) demonstrated that both under- and over-assertiveness were common among abusive husbands.

The bidirectionality of responses also causes problems for the occurrence/non-occurrence scoring typically employed for verbal content measures (e.g. compliance, refusal of unreasonable requests). Simple refusals cannot be distinguished from hostile refusals or apologetic refusals with this type of scoring. Similarly, submissive compliance is not distinguished from compliant reactions which include an expression of dissatisfaction or attempts at compromise. The importance and appropriateness of such modified responses has been documented in several recent studies (Pitcher and Meikle, 1980; Romano and Bellack, 1980; Woolfolk and Dever, 1979). These variations can be accounted for by scoring additional response categories (e.g. compromise, hostile remarks). But, there are statistical problems associated with the use of a series of highly-intercorrelated variables. Also, some potentially important qualifiers (e.g. hostile remarks) occur with sufficiently low frequency that they are not apt to be significant in overall analyses.

Table 1. Qualitative rating scales developed by Trower et al. (1978, pp. 148-150)

VOLUME
  0     Normal volume
  1(a)  Quiet but can be heard without difficulty
  1(b)  Rather loud but not unpleasant
  2(a)  Too quiet and difficult to hear
  2(b)  Too loud and rather unpleasant
  3(a)  Abnormally quiet and often inaudible
  3(b)  Abnormally loud and unpleasant
  4(a)  Inaudible
  4(b)  Extremely loud (shouting)

GAZE
  0     Normal gaze frequency and pattern
  1(a)  Tends to avoid looking, but no negative impression
  1(b)  Tends to look too much, but no negative impression
  2(a)  Looks too little. Negative impression
  2(b)  Looks too much. Negative impression
  3(a)  Abnormally infrequent looking. Unrewarding
  3(b)  Abnormally frequent looking. Unpleasant
  4(a)  Completely avoids looking. Very unrewarding
  4(b)  Stares continually. Very unpleasant

One solution to the difficulties posed above is the bidirectional scoring system proposed by Trower et al. (1978). An example is presented in Table 1. Each behavior is scored on either of two parallel scales, depending upon the idiosyncratic response style of each S. The two scales may be collapsed for data analysis (i.e. 2A and 2B are both entered simply as a score of 2). This system has the advantage of allowing raters to make qualitative judgments, rather than being locked into totally objectified systems that fail to reflect the impact of the S on others. Some of the specific scales described by Trower et al. are too unstructured to permit reliable ratings. But, behaviorally-anchored definitions can be written for each level. My raters have achieved reliabilities above 0.90 in several studies with scales such as those presented in Table 2. It should also be noted that these scales can reflect response timing and sequence, which are often more critical than the simple occurrence or duration of a response (Fischetti et al., 1977; Peterson et al., 1981).
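The collapsing step described above, and the way it avoids the cancellation problem that arises when deficient and over-responsive Ss are pooled, can be sketched as follows. This is an illustrative sketch only: the rating strings (e.g. `"3a"`), the helper names `collapse` and `direction`, and the pre/post values are all hypothetical and are not taken from Trower et al. (1978) or from any data set.

```python
from statistics import mean

def collapse(rating):
    """Collapse a bidirectional rating such as '2a' or '2b' to its severity (2).

    '0' means normal; the letter only records the direction of the
    deviation and is dropped for data analysis.
    """
    return int(rating.rstrip("abAB"))

def direction(rating):
    """Return 'deficit', 'excess' or 'normal' for a bidirectional rating."""
    suffix = rating[-1].lower()
    return {"a": "deficit", "b": "excess"}.get(suffix, "normal")

# Hypothetical pre/post treatment data for four Ss:
# (pre speech duration in s, post duration in s, pre rating, post rating)
subjects = [
    (2, 8, "3a", "1a"),    # under-responsive S speaks more after treatment
    (3, 9, "3a", "0"),
    (20, 14, "3b", "1b"),  # over-responsive S speaks less after treatment
    (22, 16, "3b", "0"),
]

# Unidirectional measure: gains and reductions cancel in the pooled mean.
mean_duration_change = mean(post - pre for pre, post, _, _ in subjects)

# Collapsed bidirectional measure: every S moves toward 0 (normal).
mean_severity_change = mean(
    collapse(post) - collapse(pre) for _, _, pre, post in subjects
)

print("mean change in duration (s):", mean_duration_change)
print("mean change in severity:", mean_severity_change)
```

In this constructed example the pooled duration change is zero although every S moved toward the normal range, whereas the collapsed severity score falls for all four. The direction of each deviation remains available through `direction` for clinical description.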
This system is not without flaws (see below), and is definitely not the ‘final word’ on how social behavior should be rated. But it is a viable alternative to current strategies, which have marked limitations.

Table 2. Behaviorally-anchored qualitative rating scales*

LENGTH
  1.  Not only acknowledges prompter’s remark, but adds own remarks spontaneously. This occurs in both [responses].
  A
  2.  One response is good; other response is short. (One may have no response and others good length)
  3.  Both responses are short. (Just answers, adds nothing)
  4.  Both responses are short; unpleasant.
  5.  Monosyllabic, very unpleasant. (Hmm, yeh, ok)
  B
  2.  One response is long; other is good.
  3.  Both responses are long.
  4.  Both responses are too long.
  5.  Responses are so long, narrator must interrupt.

REQUEST/COMPLIANCE
  1.  Non-compliant. Both responses are good.
  A
  2.  Non-compliant. First response is good; second response is short.
  3.  Non-compliant. First response is inadequate; second is good.
  4.  Non-compliant. No expression of dissatisfaction/both responses are short/some hostility.
  5.  Non-compliant/considerable hostility/inappropriate or excessive expression of dissatisfaction.
  B
  2.  Compliant. Appropriate expression of dissatisfaction when non-compliance would not be effective.
  3.  Compliant. First response is good; second response is inadequate.
  4.  Compliant. Both responses are inadequate.
  5.  Compliant. Both responses are extremely inadequate.

* For role-play scenarios with two confederate prompts and two S replies.

A final scoring issue which warrants comment concerns the range and variability of responses. The scoring system should be devised to achieve at least moderate variance and use a large segment of the possible range. A system which produces a highly-restricted range of scores is not functional, and may even be misleading. Lack of variability may result from a number of factors, including: (1) the sample is actually homogeneous on the response dimension;* (2) the assessment task does not permit an adequate range of responses; (3) raters are unable to make adequate discriminations; and (4) the scoring system ‘forces’ responses into too few categories by not reflecting important factors. The last three situations reflect procedural errors rather than the true nature of the response. It is possible to get acceptable interrater agreement despite a highly-restricted range. But, that is not a sufficient justification for conducting data analyses and drawing meaningful conclusions. The raw data must be examined and restricted or non-normal distributions interpreted and/or corrected. This often will entail re-rating with revised scoring procedures. Few published studies report basic descriptive statistics (e.g. means, standard deviations), which reflect response distributions and levels. Fewer still report dropping measures because of range problems. Yet this is undoubtedly a confounding factor in many studies. It requires greater consideration in the future.

* Homogeneity of the sample does not necessarily imply that the entire population is homogeneous. Most studies employ highly-skewed samples. Analogue studies typically select mildly-dysfunctional or ‘normal’ samples, which exclude highly-deficient Ss. Conversely, ‘clinical’ studies draw samples primarily from the low-skill end of the distribution. In either case, conclusions are limited to the particular end of the distribution which was studied.
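The examination of raw data recommended above can be reduced to a few descriptive checks run before any inferential analysis. The sketch below is one possible way of doing so, under assumptions: the function names `range_report` and `exact_agreement`, the choice of indices (proportion of the scale range used, share of scores in the modal category) and the two hypothetical rater records are illustrative, not an established procedure.

```python
from collections import Counter
from statistics import mean, stdev

def range_report(scores, scale_min, scale_max):
    """Basic descriptive statistics plus two indices of range restriction."""
    counts = Counter(scores)
    return {
        "mean": mean(scores),
        "sd": stdev(scores),
        # Fraction of the possible scale range actually covered by the data.
        "range_used": (max(scores) - min(scores)) / (scale_max - scale_min),
        # Fraction of Ss falling in the single most common category.
        "modal_share": counts.most_common(1)[0][1] / len(scores),
    }

def exact_agreement(rater_1, rater_2):
    """Proportion of Ss given identical scores by two raters."""
    matches = sum(a == b for a, b in zip(rater_1, rater_2))
    return matches / len(rater_1)

# Hypothetical ratings of ten Ss on a 1-5 scale (illustrative values only).
rater_a = [2, 2, 2, 2, 2, 2, 2, 3, 2, 2]
rater_b = [2, 2, 2, 2, 2, 2, 2, 3, 2, 3]

report = range_report(rater_a, scale_min=1, scale_max=5)
agreement = exact_agreement(rater_a, rater_b)
print(report)
print("exact agreement:", agreement)
```

Here the two hypothetical raters agree on nine of ten Ss, yet nine of ten scores fall in one category and only a quarter of the scale is used. High agreement of this kind says little about whether the measure discriminates among Ss.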

Molar vs molecular ratings

There has been a persistent conflict in the literature about the appropriate level of assessment. Molecular measures target specific response characteristics, such as gaze and speech duration. These variables are presumed to be the basic elements of interpersonal communication, which together comprise the social-skill construct (Bellack, 1979; Eisler et al., 1975). Molar ratings consist of global, qualitative judgments. Judges are asked to appraise the total impact of the S and provide a single rating of overall social skill or anxiety. Both types of measures have advantages and disadvantages (cf. Bellack, 1979; Curran, 1979). Many researchers have wisely employed both levels of measurement in their work. However, it is not at all uncommon for molecular measures to be omitted or given scant attention in relation to the overall ratings. This approach is acceptable in special circumstances, such as screening or gross categorization of Ss, but it is not appropriate for most assessment or treatment studies.

Molecular measures have been discredited, in part, because they have not been useful in validating assessment methods or hypotheses about social-behavior problems. Yet they are clearly important. Social skill is not a concrete entity. It is a construct or a summary label which represents the specific things people do in social encounters: what they say, how they speak, what they do with their faces and bodies etc. There is extensive research which shows that variations in these factors play a critical role in determining the meaning and impact of interpersonal stimuli (Harper et al., 1978; Trower et al., 1978). Other research documents that interpersonal partners (or observers) attend to these factors and use them in making judgments about others (Conger et al., 1980; Romano and Bellack, 1980; Rose and Tryon, 1979). The problem is that the elements combine together to form a gestalt. The contribution of any one element varies across respondents (i.e. Ss), observers, behaviors and situations (Ekman et al., 1980; Krauss et al., 1981). It also varies across response levels, and as a function of the consistency/discordance among elements (Jacob and Lessin, 1982; Krauss et al., 1981). Intermediate levels of many responses may play little role in forming the gestalt, while extremes may have dramatic impact. Similarly, non-content elements (e.g. posture, inflection) may be of secondary importance when consistent with verbal content, but they may dramatically alter the meaning of a response when they are discordant (e.g. sincerity becomes sarcasm).

Given the variable contribution made by specific molecular components, it has been difficult to find consistent response deficits among criterion groups (e.g. low dating males), or relationships across situations (e.g. role-play and waiting-room tasks). But the fault does not lie with the components so much as the scoring procedures, tasks and Ss employed. (The first of these factors has already been discussed: the latter two will be addressed in other sections.) The most appropriate course of action is not to ignore response components, but to develop more effective strategies for assessing them and analyzing the resultant data. One step in this direction is the development of intermediary levels of evaluation, such as the qualitative scales described above. Other strategies have been proposed by Conger and Conger (1982) and Trower (1980). McFall and his colleagues (Freedman et al., 1978; Gaffney and McFall, 1981) have developed criterion-referenced scales which also promise to avoid some of the pitfalls of earlier molecular procedures.

The intermediary approaches also tend to circumvent many of the problems which characterize molar measures, the foremost of which concerns the uncertainty about what overall scores actually represent.
Previous articles (e.g. Bellack, 1979; Curran, 1979) have indicated that these measures fail to identify specific strengths and deficits: information that is vital for planning and evaluating treatments, as well as elucidating the precise nature of social dysfunctions. This difficulty has been somewhat vitiated by the presumption that judges attend to the most important response features in making overall ratings, and that scores simply do not specify those features. Yet this is a questionable assumption. The literature indicates that overall scores are primarily determined by verbal content, with gaze (eye contact) playing a secondary, albeit consistent, role (Conger and Farrell, 1981; Galassi et al., 1976; Romano and Bellack, 1980; Trower, 1980). These findings have been interpreted to mean that molecular components are not particularly important. But another conclusion is equally parsimonious and supportable: namely, that judges base their ratings on the most noticeable and easily-categorized response characteristic (Trower, 1980) in order to simplify their task and achieve adequate reliability.

It is extremely difficult to secure reliable ratings of overall skill, anxiety etc. Some judges are inconsistent and/or have idiosyncratic criteria, and simply cannot produce usable data. Others produce ratings which match some of their colleagues, but not others. Reliable judges gradually develop internal norms for ranking Ss. But, they do not really learn to integrate diverse cues. Rather, they select particular reference points: a limited set of elements and variations upon which to base their judgments. These reference points are primarily content oriented: the easiest features to categorize and discuss with other judges. This may be one reason why reliability ratings are notoriously lower for overall anxiety ratings than for overall skill. In contrast to assertiveness or conversational ability, anxiety is not characterized by specific verbal content.
A related problem is that different pairs of raters, within and between laboratories, develop different covert scoring systems. This point is demonstrated in a recent study by Curran et al. (1982). They had raters from six different laboratories rate a standard set of role-play scenes and brief conversations. The authors concluded that there was an acceptable degree of consistency across laboratories. Inspection of the data belies this interpretation. For overall anxiety, only two laboratories produced reliability ratings above 0.60 (a rather modest criterion). Three (of the six) had reliabilities above 0.60 for ratings of overall skill. Intercorrelations between these more reliable sets of raters averaged -0.16 for anxiety and 0.60 for skill. These low levels of agreement do not result simply from the use of different scoring standards (i.e. one group scoring higher than another): correlations are, essentially, ranking statistics and similar rankings would yield high correlations regardless of level differences. The raters apparently employed entirely different criteria.

This pattern of results raises serious questions about the utility and generalizability of overall ratings. Without clear referents, it is impossible to determine either what is being rated or the comparability of ratings across studies. This may partially account for the many inconsistencies found in the social-skills literature. The unique contribution of unreferenced molar ratings is also unclear. Direct observation is generally seen as more desirable than self-report scales and interview ratings of social skill because of the increased objectivity and specificity. Molar ratings do not clearly offer these advantages, despite the fact they are secured via observation. In fact, they are more difficult and expensive to secure, but as now employed they may be less useful.

Current practice is generally to give raters an overview of the social-skill concept, some limited information about molecular components, and little or no specific rating guidelines. This fosters the development of ad-hoc criteria. Two alternatives seem more desirable: (1) Employ untrained consumers or peers to provide ratings. Their referents would be unclear as well, but they would be a viable criterion, in the nature of social validation. High interrater agreement could be sacrificed for an average measure of true social impact; (2) Provide trained raters with clear guidelines and at least partially-defined criteria. This would enhance reliability, specificity and generalizability.
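The argument that low between-laboratory correlations cannot be explained by one group simply scoring higher than another can be checked numerically. The sketch below assumes a plain Pearson product-moment correlation; the function `pearson_r` and the three sets of ratings are hypothetical illustrations and do not reproduce the data of Curran et al. (1982).

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of ratings."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical overall-skill ratings of the same six Ss by three laboratories.
lab_a = [1, 2, 3, 4, 5, 6]
lab_b = [score + 2 for score in lab_a]  # stricter standard, same ordering of Ss
lab_c = [4, 1, 6, 2, 5, 3]              # similar level, different ordering of Ss

r_level_shift = pearson_r(lab_a, lab_b)
r_different_criteria = pearson_r(lab_a, lab_c)

print("mean difference, lab_b - lab_a:", mean(lab_b) - mean(lab_a))
print("r with a pure level difference:", round(r_level_shift, 2))
print("r with a different ordering of Ss:", round(r_different_criteria, 2))
```

A constant difference in scoring standard leaves the correlation untouched, because each score is expressed as a deviation from its own group mean. Only a change in the relative ordering of Ss, as would follow from different rating criteria, lowers it.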

ASSESSMENT FORMAT

The most popular strategy for assessing social skill continues to be role-play tests. Numerous studies have been conducted to evaluate the validity and utility of role playing, with mixed results (Bellack et al., 1978, 1979a,b; Green et al., 1979; Kern and McDonald, 1980; Wessberg et al., 1979). The data are sufficiently positive to justify continued use of this strategy, especially in the absence of viable alternatives. But there are major limitations, and external validity cannot be automatically assumed. Moreover, validity is very much affected by the precise form and content of the role-play procedure. Some procedures, including several of the more widely-used variants, have marked limitations and should no longer be employed. Others should only be used in special circumstances. The following discussion will consider instructions, situational content and confederate behavior. While the focus is on brief role-play tests, many of the issues have relevance for extended interactions (e.g. waiting-room tasks) as well.

Instructions

The instructions given to Ss obviously play a critical role in any assessment procedure. A wide variety of instructional sets have been employed in social-skill assessments, often without sufficient consideration or rationale. Subjects have been directed to perform as they usually do, as well as they can, in what they believe to be the most appropriate manner or as they think skillful people respond. These variations can produce radically different responses. Instructional demand and reinforcement for effective responses have been shown to elicit more skillful performance than occurs with a neutral or ‘usual’ instructional set (Kazdin et al., 1981; Nietzel and Bernstein, 1976). There is some suggestion that verbal content is more subject to such manipulations than non-verbal and paralinguistic parameters (Vincent et al., 1979).
If so, overall skill ratings (which are predominantly based on content) must be interpreted in the light of the instructional set provided. Subjects under high demand may appear skillful only because they can think of appropriate content. Yet they may not be able to marshal requisite non-content response facets or perform effectively in the environment. Several studies have demonstrated that knowledge of appropriate content is not synonymous with functional skill (Bordewick and Bornstein, 1980; Pitcher and Meikle, 1980; Schwartz and Gottman, 1976). This issue has been considered conceptually in regard to the nature of skill deficits, but has not received sufficient attention as a fundamental factor in assessment.

A related issue pertains to the ‘as if’ aspect of role-play procedures. Role playing is presumed to be a useful vehicle because it simulates the natural environment. Subjects are expected to ‘get into’ role and imagine that they are actually engaged in the respective situations. The more vivid and realistic the interaction, the more Ss are likely to experience affect and respond in a natural fashion. This is especially important in regard to non-content response elements, which are the primary markers of affect (Harper et al., 1978; Trower et al., 1978). Role playing may not be critical for assessment of response content; most Ss can report what they might say in an encounter in response to direct questions or paper-and-pencil inventories. But unless they are trained actors or unusually sensitive self-observers, they cannot portray their usual non-content responses on demand or describe them accurately. It is essential that Ss ‘get into’ role if these features are to be adequately assessed. Unfortunately, experience conducting assessments and anecdotal data from other laboratories (e.g. Wessberg et al., 1979) indicate that this requirement often is not met. Subjects frequently indicate that they could not imagine that the interactions were real, and/or that they did not respond as they would in the natural environment. This is a fundamental source of invalidity, which is accentuated by some of the most widely-employed procedures.

Factors pertaining to scene descriptions and confederate behavior will be discussed in subsequent sections. A more basic issue is the general context and format of the role-play procedure. Subjects are often presented with a shotgun-type of experience. They receive brief instructions and get little or no chance to orient themselves to the laboratory situation or the task itself. They then are presented with a rapid-fire series of scenarios or a stilted encounter with a stranger (i.e. confederate). It is unclear how much initial performance is confounded by anxiety and confusion: data on test-retest reliability are inconsistent (Bellack et al., 1980; Borkovec et al., 1974; Mungas and Walters, 1979).
But it seems clear that Ss are not given sufficient time, input or assistance to imagine that they really are in the enacted situations. It may not be possible or necessary for most Ss to become actively enmeshed in the various roles. They may simply need to visualize the scenario, and determine how they would react if it were really happening. For this to occur, they must at least be given several practice scenes and a clear orientation to the task (e.g. the confederate will only be able to respond twice in each scene), as well as an opportunity to think about each situation.

I have employed a procedural modification which seems to facilitate this process. Subjects preview each scenario (on typed cards), and are asked whether or not they can imagine themselves in the situation. If not, slight variations are made to enhance relevance. For example, an auto mechanic might be replaced by a television repairman for a S who does not have an automobile. When the S reports that he/she can imagine himself/herself in the situation, the narrator reads it aloud and the interaction proceeds in standard fashion. Alterations are limited to features which will not alter the difficulty or context of the situation (e.g. names, places, chores). If the S cannot imagine ever being in such a situation, it is dropped and scored as missing data.

Anecdotal data from debriefings suggest that this procedure yields more natural responses. Far fewer Ss report that their responses were atypical or distorted. There are also fewer blocks or refusals (i.e. Ss unable to generate a response). Conversely, there does not appear to be a loss in spontaneity. In the natural environment, people generally are not thrust into social situations without warning or anticipation. Much of our behavior is preceded by cognitive rehearsal. There is little reason to test Ss’ ability to respond without any forethought.

Situational content

It is now well known that social skills are highly situation specific. Clearly, heterosocial interactions require different responses than assertion situations. There are also substantial differences within these broad categories. Structural characteristics, such as sex of partner, status, race, familiarity and age, all play a role in shaping behavior. These factors are now commonly controlled, either by counterbalancing or by restricting the set of scenarios to one level (e.g. males only). Several other important factors have not been given sufficient attention, including: relevance of scenarios to the target population, difficulty level and the descriptions provided to Ss.

As indicated above, it is vital that Ss be able to visualize themselves in the test situations (or their equivalents). Ergo, the situations must be relevant to their lives and the problems they typically encounter. Socio-economic level, cultural background, age, race, sex and level of (psychological) functioning are common factors which affect relevance. For example, the Behavioral Assertiveness Test-Revised (BAT-R) (Eisler et al., 1975) is the most common source of situations for assessing assertiveness. It was developed for use with male V.A. psychiatric patients in Mississippi. The BAT-R does not portray many of the most common and important assertion problems faced by women, urban residents or younger and less-disturbed populations. Conversely, many of the items are distinctly inappropriate for those populations.

A related problem pertains to item difficulty (Kolotkin, 1980). Some situations are easy to handle, even for comparatively unassertive individuals. Others are so difficult that even assertive Ss are unable to respond effectively. Difficulty level depends on the skill level of the target population as well as a number of task factors, ranging from physical and status characteristics of the interpersonal partner to how fairly and reasonably the partner behaves (Epstein, 1980). Either excessively-difficult or excessively-easy situations are inappropriate because they do not allow for sufficient variability across Ss.* In order to yield responses across the entire range, items should be of intermediate difficulty levels.

Generally, it is more desirable to employ established procedures and test items than to develop new measures for each study on an ad-hoc basis. But that is only true when existing measures are empirically sound and suited to the population of interest. If not, new methods must be developed by sampling the target population to insure appropriate relevance and difficulty levels.

The third context factor is the background information provided to Ss in the narration of each scenario. Following the model established by the BAT-R, most studies present brief narrations. The problem situation is identified, but little qualifying or background information is provided. The S knows only that he/she is being thwarted, or asked an unreasonable request, or conversing with an opposite sex peer.
Much of the information available in real-life interactions and needed for shaping naturalistic responses is not provided. For example, the potential consequences of various response alternatives are a vital factor in response selection (Fiedler and Beach, 1978): is there a risk of being hit? fired? rejected? This information cannot be discerned in most role-play tests. Attitudes about others, including degree of concern about their feelings and predictions of their future behavior, are also important determinants about which Ss do not receive adequate information. The same is true of the history of interactions with the individual: is this a recurring conflict with the person? Has he/she been helpful in the past? Compromise is reasonable with someone who has demonstrated ‘good faith’ efforts in the past, while a non-negotiable posture is appropriate with someone who has previously been unresponsive. The same basic problem situation (e.g. refusing a specific request) can result in dramatically different responses as a function of the particular qualifying and background information provided (Hopkins et al., 1981). The brief narrations typically employed do not contain sufficient cues for the S to visualize a real-life counterpart and develop a reasoned reaction; hence, he/she cannot respond in a truly representative manner. Lack of generalizability should not be surprising. It is much easier to develop and standardize brief items than the expanded descriptions recommended here, but the advantages of the longer items warrant the effort.

Confederate behavior

All of the commonly-used observation strategies employ confederates to simulate a real-life interpersonal partner. The S is expected to interact with the confederate as if he/she were a peer, stranger, employer, store clerk etc. This simulation is thought to elicit more naturalistic responses than would be possible simply by asking the S to respond in his/her typical manner (e.g. “Show me what you would say and do if someone cut in front of you in line.”). Curiously, this presumption has not been subjected to empirical test. However, if it is valid (and I believe it is), the confederate’s behavior must play a significant role in determining the form and content of the S’s response. The confederate may elicit dramatically different responses, depending on precisely what is said, how much is said, the non-content characteristics of speech etc.: i.e. the precise features that determine the S’s skillfulness and impact on others.

* It is also essential to allow for variability within Ss. Data suggest that psychiatric patients are more consistent across situations than non-patients (Snyder, 1974; Trower, 1980). Either they do not adequately attend to situational factors in selecting responses or they lack adequate skills to respond differentially. In either case, important population differences might only be apparent if the set of scenarios allows for differential responding.
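The reasoning behind the recommendation of intermediate item difficulty, and the within-S variability raised in the footnote, can be sketched numerically. What follows is a minimal illustration in Python, not a procedure taken from any of the studies cited. The function names (`screen_items`, `within_subject_sd`), the 1-5 rating scale on which higher means more skillful, and the 15% floor/ceiling margin are all assumptions made for the example. The sketch takes an overall skill rating for each S on each scenario, flags scenarios whose mean rating sits near the floor or ceiling of the scale, and reports how much each S varies across scenarios.

```python
from statistics import mean, pstdev


def screen_items(ratings, scale_min=1.0, scale_max=5.0, margin=0.15):
    """Label each scenario by where its mean rating falls on the scale.

    ratings: dict mapping a subject id to a list of skill ratings,
             one per scenario, in the same scenario order for every S.
    margin:  hypothetical cut-off, as a fraction of the scale range,
             below/above which an item is treated as floor/ceiling.
    """
    n_items = len(next(iter(ratings.values())))
    span = scale_max - scale_min
    report = []
    for j in range(n_items):
        column = [scores[j] for scores in ratings.values()]
        item_mean = mean(column)
        position = (item_mean - scale_min) / span  # 0 = floor, 1 = ceiling
        if position < margin:
            label = "too difficult"   # nearly all Ss rated as unskilled
        elif position > 1 - margin:
            label = "too easy"        # nearly all Ss rated as skilled
        else:
            label = "intermediate"
        report.append({
            "item": j,
            "mean": item_mean,
            "sd_across_ss": pstdev(column),  # variability across Ss
            "label": label,
        })
    return report


def within_subject_sd(ratings):
    """Spread of each S's ratings across scenarios (variability within Ss)."""
    return {s: pstdev(scores) for s, scores in ratings.items()}
```

With pilot ratings from the target population, scenarios labelled as too difficult or too easy would be candidates for replacement, and a group whose within-S standard deviations cluster near zero would be responding uniformly across situations. The margin is arbitrary here and would need to be set empirically.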


Moreover, contextual features of the subject-confederate interaction (e.g. seating arrangement, live vs videotaped confederates) may also affect how life-like and representative the interaction seems to the S. This issue is not widely considered, and many studies have employed procedures and formats which appear to have minimized realism.

The BAT-R has served as a prototype for confederate behavior, as well as scenario content. The narrator first describes the situation, after which the confederate delivers a brief prompt which serves as a cue for the S. The interaction ends after this single interchange. This procedure facilitates standardization, but it certainly is not naturalistic. Few social interactions are so limited, especially when conflict (e.g. assertiveness) is involved or when someone wishes to establish a relationship (e.g. heterosocial skill). Many Ss are clearly uncomfortable with the pregnant pause which occurs when confederates fail to respond a second time. Subsequent replies are often geared to this anomalous pattern rather than the real-life version of the scenarios.

The brief interchange is marked by two other problems as well. First, raters are not provided with a sufficient sample of S behavior to make sound judgments. This is especially problematic for paralinguistic and non-verbal elements, which are difficult to rate even with larger samples. For example, response durations of 1 sec or less make it just about impossible to appraise intonation or time gaze, and Ss hardly have the opportunity to employ gestures or alter their posture. These limitations probably contribute to the content emphasis in overall skill ratings: judges have little else to go on. Second, many people exhibit skill deficits only when thwarted or faced with an uncooperative environment. Their initial responses are adequate, but they cannot follow up effectively when the partner is not helpful (e.g. by facilitating conversation) or compliant.
For example, Kirchner et al. (1979) found adult offenders to be more aggressive than non-offenders on a role-play test, but only after the third and fourth confederate prompts. There were no differences between groups on the first or second prompts. My work with depressed women suggests that many are initially assertive, but they submit and become apologetic when resisted. Similarly, many shy males can make an adequate introductory comment, but are unable to follow through effectively. In sum, the single prompt role play yields extremely limited and stilted responses. Whether two, three or four prompts offer a sufficient increment must be empirically determined, but use of the single prompt format is no longer justified. By implication, this conclusion also militates against the use of taped role-play tests, in which the confederate prompt is presented by audio- or videotape. The taped format suffers from all of the limitations of live single prompt tests, and is even less naturalistic (Galassi and Galassi, 1976).

Several other issues regarding confederate behavior require brief mention. First, multiple prompt formats raise questions about whether confederates should be locked into specific responses no matter what the S says, or given the option of varying their response as a function of the S’s reaction. Standardized prompts insure consistency, but they also guarantee that some Ss will receive totally inappropriate prompts (e.g. the confederate continues to make a request after the S has already complied). There simply is too much intersubject variability to permit inflexible responses. Confederates can be trained to respond within a narrow range, preventing excessive variability while insuring relevance.

Another controversial issue concerns the affective quality (i.e. non-verbal and paralinguistic features) of confederate behavior. Should the confederate deliver prompts in a neutral manner, or with dramatic affect (e.g. angry, apologetic, enthusiastic)?
These variations can elicit markedly different reactions from Ss. There is no one correct affect level; the choice depends on the purpose of the assessment. However, in reporting results, the investigator must clearly describe how confederates responded as well as what they said. In addition, care must be taken to avoid drift: gradual changes in the manner of responding. When studies are initiated, confederates tend to be anxious, enthusiastic and histrionic. With experience, they generally relax and develop a subdued style. Continued repetition sometimes leads to boredom and a lackadaisical manner. Clearly, these stylistic changes threaten the internal validity of the research. Confederates must periodically be matched to criterion. If possible, their behavior at different points in the study should be objectively assessed and compared, in a manner analogous to reliability checks for raters. If limited resources preclude this systematic analysis, ad-hoc comparisons

of videotapes should be conducted. In either case, retraining may be required to insure consistency.

CONCEPTUAL ISSUES

The superordinate guiding force in the development of any assessment strategy is one’s conceptualization of the problem. Psychodynamic and trait theory conceptions of interpersonal dysfunction would lead to entirely different approaches than a behavioral perspective. Even from within a generic behavioral view, there are decidedly different opinions about the ‘real’ nature of interpersonal problems, with associated recommendations about what should be assessed, and how the assessment should be conducted. In fact, the behavioral literature has been plagued by a series of overly-simplistic models, in which social problems have been artificially dichotomized. The so-called ‘skills’ model has frequently been set up as a straw man, and contrasted with a variety of alternative ‘causes’ of social failure. Inhibition due to anxiety and/or negative self-evaluation has been one of the most popular alternatives (Arkowitz, 1977; Twentyman and Zimering, 1979). This particular debate is characterized by the title of a recent article (Alden and Cappe, 1981), Nonassertiveness: Skill Deficit or Selective Self-evaluation? Currently, the most popular dichotomy appears to be skills deficit vs cognitive/informational problem (e.g. Bordewick and Bornstein, 1980; Bruch, 1981; Schwartz and Gottman, 1976). These various minidebates have generally fostered two types of conclusions: (a) all Ss in the diagnostic category (e.g. unassertiveness, shyness) have one particular problem (e.g. skill deficit); or (b) Ss in the category all have either one or the other problem (or, sometimes, both). Assessment strategies are recommended accordingly: behavioral observation for ‘skills’ deficits, self-concept and expectancy measures for faulty self-evaluation, anxiety measures for social anxiety etc. The second of these conclusions is an advance over the first, but both are overly simplistic and lead to restricted assessment strategies.
The focus of this paper precludes an extensive discussion of the nature of social dysfunction. Two points require some mention. First, the so-called ‘skills model’ has been prematurely and inappropriately reified. There is an identifiable position about the importance of verbal and motoric response characteristics, but there is no agreement as to a definition of social skill, let alone an integrated model. For example, Curran (1979) argues that the definition should exclude cognitive and perceptual factors, while McFall (1982) advocates the inclusion of everything but the proverbial kitchen sink. My own work has emphasized the central role of molecular elements in combination with a limited set of cognitive and social perception factors. In regard to assessment, Curran employs a somewhat restricted observational focus, but eschews molecular measures. McFall suggests a multifaceted, multistage approach, while I advocate an intermediary strategy. In no case is the entire spectrum of social dysfunction tied to the narrow set of molecular components by which the skills perspective is often characterized. Moreover, given the preliminary state of our knowledge, it is presumptuous and unjustified to conduct either-or comparisons between so-called ‘models’.

Second, interpersonal behavior is a highly-complex phenomenon, with multiple interlocking determinants. A social dysfunction may develop for a multitude of reasons, only some of which have been addressed in the behavioral literature. Furthermore, maintaining factors may be different from the factors which precipitated the difficulty. A person with a long history of avoidance stimulated by social anxiety characteristically develops performance deficits, as with any unpracticed skill. Conversely, repeated social failures precipitated by performance deficits can be expected to lead to anxiety and avoidance. There are few pure clinical cases, in which only one problem prevails.
It is fruitless to develop simple dichotomous categories, and try to shoehorn people and generic problems (e.g. unassertiveness) into one or the other. Much of the current criticism of molecular measures stems precisely from this process. For example, not every unassertive (or shy etc.) person has poor gaze; thus, there may not be large and consistent differences in gaze between unassertive and assertive groups. That is not to say that gaze is not an important factor. Our conceptions, assessment strategies, and methods of data analysis must be adjusted to account for this phenomenon.

Acknowledgement-Preparation of this manuscript was supported, in part, by a grant from the National Institute of Mental Health, MH 32182.

REFERENCES

ALDEN L. and CAPPE R. (1981) Nonassertiveness: skill deficit or selective self-evaluation? Behav. Ther. 12, 107-114.
ARKOWITZ H. (1981) The assessment of social skills. In Behavioral Assessment: A Practical Handbook, 2nd edn (Edited by HERSEN M. and BELLACK A. S.). Pergamon Press, New York.
ARKOWITZ H. (1977) Measurement and modification of minimal dating behavior. In Progress in Behavior Modification, Vol. 5 (Edited by HERSEN M., EISLER R. M. and MILLER P. M.). Academic Press, New York.
BELLACK A. S. (1979) A critical appraisal of strategies for assessing social skill. J. behav. Assess. 1, 157-176.
BELLACK A. S., HERSEN M. and TURNER S. M. (1978) Role-play tests for assessing social skills: are they valid? Behav. Ther. 9, 448-461.
BELLACK A. S., HERSEN M. and LAMPARSKI D. (1979a) Role-play tests for assessing social skills: are they valid? Are they useful? J. consult. clin. Psychol. 47, 335-342.
BELLACK A. S., HERSEN M. and TURNER S. M. (1979b) The relationship of role playing and knowledge of appropriate behavior to assertion in the natural environment. J. consult. clin. Psychol. 47, 670-678.
BELLACK A. S., TURNER S. M., HERSEN M. and LUBER R. (1980) Effects of stress and retesting on role-play tests of social skill. J. behav. Assess. 2, 99-104.
BORDEWICK M. C. and BORNSTEIN P. H. (1980) Examination of multiple cognitive response dimensions among differentially assertive individuals. Behav. Ther. 11, 440-448.
BORKOVEC T. D., STONE N. M., O’BRIEN G. T. and KALOUPEK D. G. (1974) Evaluation of a clinically relevant target behavior for analog outcome research. Behav. Ther. 5, 503-513.
BRUCH M. A. (1981) A task analysis of assertive behavior revisited: replication and extension. Behav. Ther. 12, 217-230.
CONGER J. C. and CONGER A. J. (1982) Components of heterosocial competence. In Social Competence and Psychiatric Disorders: Theory and Practice (Edited by CURRAN J. P. and MONTI P.). Guilford Press, New York.
CONGER J. C. and FARRELL A. D. (1981) Behavioral components of heterosocial skills. Behav. Ther. 12, 41-55.
CONGER A. J., WALLANDER J. L., MARIOTTO M. J. and WARD D. (1980) Peer judgements of heterosexual-social anxiety and skill: what do they pay attention to anyhow? J. behav. Assess. 2, 243-259.
CURRAN J. P. (1979) Pandora’s Box reopened? The assessment of social skills. J. behav. Assess. 1, 55-72.
CURRAN J. P. and MARIOTTO M. J. (1981) A conceptual structure for the assessment of social skills. In Progress in Behavior Modification, Vol. 10 (Edited by HERSEN M., EISLER R. M. and MILLER P. M.). Academic Press, New York.
CURRAN J. P., WESSBERG H. W., FARRELL A. D., MONTI P. M., CORRIVEAU D. P. and COYNE N. A. (1982) Social skills and social anxiety: are different laboratories measuring the same constructs? J. consult. clin. Psychol. In press.
DUNCAN S. JR and FISKE D. (1977) Face to Face Interaction: Research, Methods, and Theory. Lawrence Erlbaum, New York.
EISLER R. M., HERSEN M., MILLER P. M. and BLANCHARD E. B. (1975) Situational determinants of assertive behaviors. J. consult. clin. Psychol. 43, 330-340.
EKMAN P., FRIESEN W. V., O’SULLIVAN M. and SCHERER K. (1980) Relative importance of face, body, and speech in judgements of personality and affect. J. Person. soc. Psychol. 38, 270-277.
EPSTEIN N. (1980) Social consequences of assertion, aggression, passive aggression, and submission: situational and dispositional determinants. Behav. Ther. 11, 662-669.
FIEDLER D. and BEACH L. R. (1978) On the decision to be assertive. J. consult. clin. Psychol. 46, 537-546.
FISCHETTI M., CURRAN J. P. and WESSBERG H. W. (1977) Sense of timing: a skill deficit in heterosexual-socially anxious males. Behav. Modif. 1, 179-195.
FREEDMAN B. J., ROSENTHAL L., DONAHOE C. P., JR, SCHLUNDT D. G. and MCFALL R. M. (1978) A social-behavioral analysis of skill deficits in delinquent and nondelinquent adolescent boys. J. consult. clin. Psychol. 46, 1448-1462.
GAFFNEY L. R. and MCFALL R. M. (1981) A comparison of social skills in delinquent and nondelinquent adolescent girls using a behavioral role-playing inventory. J. consult. clin. Psychol. 49, 959-967.
GALASSI M. D. and GALASSI J. P. (1976) The effects of role-playing variations on the assessment of assertive behavior. Behav. Ther. 7, 343-347.
GALASSI J. P., HOLLANDSWORTH J. G., RADECKI C., GAY M. L., HOWE M. R. and EVANS C. L. (1976) Behavioral performance in the validation of an assertiveness scale. Behav. Ther. 7, 447-452.
GREEN S. B., BURKHART B. R. and HARRISON W. H. (1979) Personality correlates of self-report, role-playing, and in vivo measures of assertiveness. J. consult. clin. Psychol. 47, 16-24.
HARPER R. G., WIENS A. N. and MATARAZZO J. D. (1978) Nonverbal Communication: The State of the Art. Wiley, New York.
HERSEN M. and BELLACK A. S. (1977) Assessment of social skills. In Handbook for Behavioral Assessment (Edited by CIMINERO A. R., CALHOUN K. S. and ADAMS H. E.). Wiley, New York.
HOPKINS J., KRAWITZ G. and BELLACK A. S. (1981) The effects of situational variations in role-play scenes on assertive behavior. J. behav. Assess. 3, 271-280.
JACOB T. and LESSIN S. (1982) Inconsistent communication in family interaction. Clin. Psychol. Rev. In press.
KAZDIN A. E., MATSON J. L. and ESVELDT-DAWSON K. (1981) Social-skill performance among normal and psychiatric inpatient children as a function of assessment conditions. Behav. Res. Ther. 19, 145-152.
KERN J. M. and MACDONALD M. L. (1980) Assessing assertion: an investigation of construct validity and reliability. J. consult. clin. Psychol. 48, 532-534.
KIRCHNER E. P., KENNEDY R. E. and DRAGUNS J. G. (1979) Assertion and aggression in adult offenders. Behav. Ther. 10, 452-471.
KOLOTKIN R. A. (1980) Situation specificity in the assessment of assertion: considerations for the measurement of training and transfer. Behav. Ther. 11, 651-661.
KRAUSS R. M., APPLE W., MORENCY N., WENZEL C. and WINTON W. (1981) Verbal, vocal, and visible factors in judgements of another’s affect. J. Person. soc. Psychol. 40, 312-320.
MCFALL R. M. (1982) A review and reformulation of the concept of social skills. J. behav. Assess. 4, 1-33.
MUNGAS D. M. and WALTERS H. A. (1979) Pretesting effects in the evaluation of social skills training. J. consult. clin. Psychol. 47, 216-218.
NIETZEL M. T. and BERNSTEIN D. A. (1976) Effects of instructionally mediated demand on the behavioral assessment of assertiveness. J. consult. clin. Psychol. 44, 500.
PETERSON J., FISCHETTI M., CURRAN J. P. and ARLAND S. (1981) Sense of timing: a skill deficit in heterosocially anxious women. Behav. Ther. 12, 195-201.
PITCHER S. W. and MEIKLE S. (1980) The topography of assertive behavior in positive and negative situations. Behav. Ther. 11, 532-547.
ROMANO J. M. and BELLACK A. S. (1980) Social validation of a component model of assertive behavior. J. consult. clin. Psychol. 48, 478-490.
ROSE Y. J. and TRYON W. W. (1979) Judgements of assertive behavior as a function of speech loudness, latency, content, gestures, inflection, and sex. Behav. Modif. 3, 112-123.
SCHWARTZ R. M. and GOTTMAN J. M. (1976) Toward a task analysis of assertive behavior. J. consult. clin. Psychol. 44, 910-920.
SNYDER M. (1974) Self-monitoring of expressive behavior. J. Person. soc. Psychol. 30, 526-537.
TROWER P. (1980) Situational analysis of the components and processes of behavior of socially skilled and unskilled patients. J. consult. clin. Psychol. 48, 327-339.
TROWER P., BRYANT B. and ARGYLE M. (1978) Social Skills and Mental Health. University of Pittsburgh Press, Pittsburgh.
TWENTYMAN C. T. and ZIMERING R. T. (1979) Behavioral training of social skills: a critical review. In Progress in Behavior Modification, Vol. 7 (Edited by HERSEN M., EISLER R. M. and MILLER P. M.). Academic Press, New York.
VINCENT J. P., FRIEDMAN L. C., NUGENT J. and MESSERLY L. (1979) Demand characteristics in observations of marital interaction. J. consult. clin. Psychol. 47, 557-566.
WAXER P. H. (1977) Nonverbal cues for anxiety: an examination of emotional leakage. J. abnorm. Psychol. 86, 306-314.
WESSBERG H. W., MARIOTTO M. J., CONGER A. J., CONGER J. C. and FARRELL A. D. (1979) The ecological validity of role plays for assessing heterosocial anxiety and skill of male college students. J. consult. clin. Psychol. 47, 525-535.
WOOLFOLK R. L. and DEVER S. (1979) Perceptions of assertion: an empirical analysis. Behav. Ther. 10, 404-411.