A Fuzzy Expert System for Evaluating University Teaching Efficiency

ELSEVIER

IFAC PUBLICATIONS
www.elsevier.com/locate/ifac

A FUZZY EXPERT SYSTEM FOR EVALUATING UNIVERSITY TEACHING EFFICIENCY

Michele Lalla, Gisella Facchinetti, Giovanni Mastroleo

Dipartimento di Economia Politica, Università di Modena e Reggio Emilia, 41100 Modena, Italy
[email protected], [email protected], [email protected]

Abstract: Student evaluations of university efficiency are compulsory for universities. The Italian Ministry of University and Scientific and Technological Research proposed a questionnaire with items based on the four-point Likert scale and a traditional item-by-item analysis for the evaluation of teaching staff activity. In this study, three split-ballot experiments were carried out to test the differences between the four-point and five-point Likert scale. Furthermore, the traditional analysis was compared with the results of the fuzzy expert system set up to achieve the same purpose. The fuzzy expert system yielded scores that proved to be generally higher but sometimes also lower than those obtained using the five/four-point Likert scale. Copyright © 2001 IFAC

Keywords: Fuzzy expert systems, teaching, evaluation, statistical analysis, samples.

1. INTRODUCTION¹

The evaluation of university activities, including research and teaching, is compulsory in practice, as exclusion from some grants is foreseen for those who do not comply (Law no. 370 of 19/10/1999, Official Gazette, G.S., no. 252 of 26/10/1999). The Italian Ministry of University and Scientific and Technological Research (MURST) recently founded an Observatory (now, a Committee) for University System Evaluation,² which, in turn, created several research groups on various topics. One of the groups (Chiandotto and Gola, 1999) proposed a course-evaluation questionnaire with items using a four-point Likert scale: ①Definitely no, ②No, rather than yes, ③Yes, rather than no, ④Definitely yes (MURST scale). They also suggested using means and variances to analyze the data, translating the categories (or labels) into a ten-point scale as follows: ①=2, ②=5, ③=7, ④=10. On the one hand, for this ordinal scale, the absence of the middle position could violate the linearity assumption. On the other hand, means and variances cannot be used validly. Furthermore, the meaning of the labels was not entirely clear to all students and their intensities could reflect a high level of uncertainty.

A set of categorical alternatives could be: ①Very insufficient, ②Insufficient, ③Sufficient, ④Good, ⑤Very good (marking scale), because it seems more suited to the evaluation procedure, as it is fairly similar to the score system used at previous school levels. The score of each item could be translated into a ten-point scale by multiplying the numerical label of the category by two, so that any evaluation becomes more interpretable.

In the MURST scale, the elimination of the middle position was based on the assumptions that: (1) it attracts people who are careless or lazy or have no opinion, (2) respondents tend toward one of the two nearest alternatives, (3) respondents really in the neutral position randomly choose a polar alternative (Ray, 1990; Schuman and Presser, 1996). The marking scale would avoid these issues as there is no truly neutral category.

¹ This study is a part of the (Local) Project "Metodi e tecnologie per innovare e riorganizzare la didattica" supported by the University of Modena and Reggio Emilia.

² It was established by a Ministerial Decree of 22.02.1996, but its tasks were established by art. 5 of Law no. 537 of 24/12/1993 (S.O. no. 121, Official Gazette no. 303 of 28/12/1993) and subsequently by articles 9, 15, and 19 of the Decree of the President of the Republic (D.P.R.) of 30/12/1995 (Official Gazette no. 50 of 29/02/1996).
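The two translations into a ten-point scale described in the Introduction can be sketched in a few lines of Python. This is only an illustration: the names `MURST_TO_TEN`, `marking_to_ten`, and `mean_score` are hypothetical, and the sample answers are invented.

```python
# Translation of category labels into a ten-point scale (illustrative sketch).

# MURST scale: the four categories map to 2, 5, 7, 10, as stated in the text.
MURST_TO_TEN = {1: 2, 2: 5, 3: 7, 4: 10}


def marking_to_ten(label):
    """Marking scale: the numerical label (1..5) is multiplied by two."""
    if label not in (1, 2, 3, 4, 5):
        raise ValueError("marking-scale labels run from 1 to 5")
    return 2 * label


def mean_score(labels, translate):
    """Mean of the translated answers for one item."""
    scores = [translate(label) for label in labels]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Invented answers of three students to one item.
    print(mean_score([3, 4, 5], marking_to_ten))
    print(mean_score([2, 3, 4], MURST_TO_TEN.__getitem__))
```

The marking-scale values are evenly spaced (2, 4, 6, 8, 10), whereas the MURST values (2, 5, 7, 10) are not, which is one way to see the concern about linearity raised above.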


A survey was conducted among students (the target population), who assigned a rating (on a ten-point scale) to the labels used in the MURST and marking scales. The paper presents some results about the differences between the MURST and the marking scale, and a fuzzy system to evaluate teaching activity, showing that it is more flexible at handling data and presents no problems as to measurement methodology. In fact, both scales could be used without difficulty. Section 2 describes the questionnaires and the structure of the data, reporting some empirical results. Section 3 illustrates the fuzzy model built up to generate the numerical evaluation on a ten-point scale for some conceptual domains of the teaching activity. Section 4 concludes with some comments and remarks.

2. THE EVALUATION QUESTIONNAIRE: PROBLEMS AND CONTENTS

The attribution of numerical values to the alternatives of the four/five-point scales is questionable. However, three split-ballot experiments were carried out to obtain the rating of students for each labeled category and the performance of the marking/MURST scales. Eight key items were selected from the proposed MURST questionnaire: Adequacy of the Lecture Room (ALR), Adequacy of the Work Load requested (AWL), Correspondence between Actual and Planned lectures (CAP), Adequacy of the Teaching Materials - course books, handouts, etc. - (ATM), Clarity of the Teacher's Presentations (CTP), Teacher Availability during Office hours (TAO), Usefulness of Teaching-Support Activities - workshops, seminars, etc. - (UTS), Level of Interest in the Subject matter (LIS). Three different questionnaires were constructed: Q1, Q2, Q3. Each one contained the eight items, but the answers were graded differently: Q1 used the marking scale, Q2 used the MURST scale, and Q3 used both scales. To vary the response formats, the marking/MURST scales were alternated with a Likert and a self-anchoring scale. The three questionnaires were administered to students in the first, second, and third years, respectively. The respondents had to specify their ratings of each categorical alternative for each item, using a ten-point scale, ranging from zero to ten. The results of the surveys are reported in Table 1 and the type of errors in Table 2.

Table 1 Number of cases and percentages by type of questionnaire, course, and consistency of answers (cons.: consistent; inc.: inconsistent) with respect to the evaluation of the categories

| Course | Cons. | % | Inc. | % | Tot. |
|---|---|---|---|---|---|
| Questionnaire 1 (Q1) | | | | | |
| Mathematics A (M - A) | 98 | 72.6 | 37 | 27.4 | 135 |
| Mathematics B (M - B) | 74 | 74.7 | 25 | 25.3 | 99 |
| Mathematics C (M - C) | 73 | 64.6 | 40 | 35.4 | 113 |
| Total Q1 | 245 | 70.6 | 102 | 29.4 | 347 |
| Questionnaire 2 (Q2) | | | | | |
| Political Economy 2 A (PE2 - A) | 20 | 36.4 | 35 | 63.6 | 55 |
| Political Economy 2 B (PE2 - B) | 21 | 80.8 | 5 | 19.2 | 26 |
| Math for Financial Markets A (MFM - A) | 45 | 60.0 | 30 | 40.0 | 75 |
| Math for Financial Markets B (MFM - B) | 24 | 28.2 | 61 | 71.8 | 85 |
| Total Q2 | 110 | 45.6 | 131 | 54.4 | 241 |
| Questionnaire 3 (Q3) | | | | | |
| Public Economics A (PbE - A) | 33 | 56.9 | 25 | 43.1 | 58 |
| Public Economics B (PbE - B) | 37 | 68.5 | 17 | 31.5 | 54 |
| Public Economics C (PbE - C) | 34 | 87.2 | 5 | 12.8 | 39 |
| Banking and Finance A (BF - A) | 14 | 87.5 | 2 | 12.5 | 16 |
| Banking and Finance B (BF - B) | 28 | 77.8 | 8 | 22.2 | 36 |
| Total Q3 | 146 | 72.0 | 57 | 28.0 | 203 |
| Total valid questionnaires | 501 | 63.3 | 290 | 36.7 | 791 |

Table 2 Number of cases and row percentages by type of questionnaire and error

| | IE | % | SAS | % | EOOC | % | Inc. | % | Tot. | % |
|---|---|---|---|---|---|---|---|---|---|---|
| Q.1 | 26 | 25.5 | 8 | 7.8 | 45 | 44.1 | 23 | 22.6 | 102 | 100 |
| Q.2 | 44 | 33.6 | 22 | 16.8 | 24 | 18.3 | 41 | 31.3 | 131 | 100 |
| Q.3 (a), 5 Cat. | 18 | 31.6 | 8 | 14.0 | 1 | 1.8 | 30 | 52.6 | 57 | 100 |
| Q.3 (a), 4 Cat. | 22 | 38.6 | 8 | 14.0 | 1 | 1.8 | 26 | 45.6 | 57 | 100 |
| TE | 110 | 31.7 | 46 | 13.3 | 71 | 20.5 | 120 | 34.5 | 347 | 100 |

(a) Only one considerable error per scale was counted.

Legend: IE = Incomplete Evaluation of the categorical alternatives. SAS = Self-Anchoring Scale, like semantic differential scale, for both even and odd numbers of categories. EOOC = Evaluation of Only One Category in the set of alternatives, while the others were missing. TE = Total Errors.

The figures reported in Table 1 reveal a better performance of the marking scale, which shows a low error rate. The types of errors suggest that the MURST scale could be misleading for respondents. Furthermore, the Kolmogorov-Smirnov tests showed that the distributions of the attributed values for the marking scale surveyed by Q1 were equal to those surveyed by Q3, while the distributions of the attributed values for the MURST scale collected by Q2 were completely different from those collected by Q3. Both scales showed differences between the means relative to factors such as how the question was worded and which terms defined the five/four categories, the question itself, the kind of course, and the intensity level of the Likert scale.
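The percentages in Table 1 follow directly from the counts of consistent and inconsistent questionnaires. A minimal Python sketch, using the Questionnaire 1 counts from Table 1 and hypothetical helper names (`percentage`, `consistency_table`), recomputes the rows for the three Mathematics courses and their total:

```python
# Counts of (consistent, inconsistent) questionnaires, copied from Table 1 (Q1).
Q1_COUNTS = {
    "M - A": (98, 37),
    "M - B": (74, 25),
    "M - C": (73, 40),
}


def percentage(part, total):
    """Share of `part` in `total`, as a percentage rounded to one decimal."""
    return round(100 * part / total, 1)


def consistency_table(counts):
    """Return {course: (cons, cons %, inc, inc %, total)} plus a 'Total' row."""
    rows = {}
    total_cons = total_inc = 0
    for course, (cons, inc) in counts.items():
        n = cons + inc
        rows[course] = (cons, percentage(cons, n), inc, percentage(inc, n), n)
        total_cons += cons
        total_inc += inc
    n = total_cons + total_inc
    rows["Total"] = (
        total_cons, percentage(total_cons, n),
        total_inc, percentage(total_inc, n),
        n,
    )
    return rows


if __name__ == "__main__":
    for course, row in consistency_table(Q1_COUNTS).items():
        print(course, row)
```

The same function can be applied to the Q2 and Q3 counts to compare the share of inconsistent answers between the marking and the MURST scale.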

The score expressed by the $i$-th individual ($i = 1, \ldots, n_j$) could depend on the course $j$ ($j = 1, \ldots, J$), on the intensity level $k$ ($k = 1, \ldots, K$) of the Likert scale, and on the statement $l$ ($l = 1, \ldots, L = 8$) under evaluation. Models for repeated measurements (Hand and Crowder, 1996) were used to ascertain the effect of factors such as type of scale and gender:

$$Y = XB \otimes \Gamma + E \qquad (1)$$

where $Y$ is an $n \times (KL)$ matrix containing the $K$ evaluations of the categories for each one of the $L$ items, $X$ is an $n \times (JKL)$ matrix representing the individuals-courses-labels-items design, $B$ is a $(JK) \times L$ matrix describing the mean responses to a different $j$ course and $k$ label, $\Gamma$ is an $L \times K$ matrix of the changes over the $L$ evaluations at a different $j$ course and $k$ label, and $E$ is an $n \times (KL)$ matrix of random variation. Each row vector of residuals, $E_i$, of the $i$-th individual has a multivariate normal distribution, $N(0, \Sigma)$, with a null mean vector and an unknown covariance matrix, $\Sigma$.

Table 3 reports the means and the standard deviations of the scores, expressed through the ten-point scale and assigned to the marking scale categories by students attending the three Mathematics courses in the first year. The observed means were higher than the value label of the categories for those categories below sufficiency. They were lower than the value label of the categories for those categories above sufficiency. There was also heteroscedasticity. In fact, the standard deviations decreased for increasing levels of the scale up to sufficiency, and slowly increased for increasing levels above sufficiency. The values of the standard deviations in the first level were generally higher than others.

Table 4 reports the means and the standard deviations of the scores, expressed through the ten-point scale and assigned to the MURST scale categories, by students attending four courses: two in Political Economy 2 and two in Mathematics for Financial Markets in the second year. The observed means were markedly lower than the value label of the categories. The differences proved to be less marked just for «Definitely no» and «Yes, rather than no». There was heteroscedasticity without a specific pattern. The standard deviations of the MURST scale were higher than those of the marking scale and this could denote a better performance of the latter with respect to the former. The MURST scale revealed more uncertainty in the definition of value label among the target population.
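The matrix dimensions quoted for model (1) can be checked mechanically. The sketch below reads the product as X(B ⊗ Γ); this grouping is an assumption, chosen because it is the one under which the stated dimensions conform. The helper names and the values of n, J, and K are purely illustrative (only L = 8 comes from the text), and the matrices are filled with placeholder numbers.

```python
def shape(m):
    """(rows, columns) of a matrix stored as a list of rows."""
    return (len(m), len(m[0]))


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def matmul(a, b):
    """Ordinary matrix product of two conformable matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def filled(rows, cols, value=1.0):
    """Placeholder matrix with every entry equal to `value`."""
    return [[value] * cols for _ in range(rows)]


# Illustrative sizes: n respondents, J courses, K categories, L = 8 items.
n, J, K, L = 4, 3, 5, 8

X = filled(n, J * K * L)       # n x (JKL) design matrix
B = filled(J * K, L)           # (JK) x L matrix of mean responses
G = filled(L, K)               # L x K matrix (Gamma in the text)

BG = kron(B, G)                # (JKL) x (LK)
fitted = matmul(X, BG)         # n x (KL), the same shape as Y and E

if __name__ == "__main__":
    print(shape(BG), shape(fitted))
```

With these shapes the systematic part has n rows and KL columns, matching the dimensions given for Y and E.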

Table 3 Means and standard deviations of the scores assigned to the different marking scale categories by course and by item (Q1)

Layout: Likert categories 2, 4, 6, 8, and 10 by course (M - A; M - B, N=74; M - C, N=73); columns ALR, AWL, CAP, ATM, CTP, TAO, UTS, LIS; standard deviations in parentheses.
(a) The sampling sizes are the same in categories 4, 6, 8, and 10, as the respondents are the same.

Table 4 Means and standard deviations of the scores assigned to the different MURST scale categories by course and by item (Q2)

Layout: Likert categories 2, 5, 7, and 10 by course (PE2 - A, n=20; PE2 - B, n=21; MFM - A, n=45; MFM - B, n=24); columns ALR, AWL, CAP, ATM, CTP, TAO, UTS, LIS; standard deviations in parentheses.
(a) The sampling sizes are the same in categories 5, 7, and 10, as the respondents are the same.

Table 5 reports the means and the standard deviations of the scores, expressed through the ten-point scale and assigned by students attending three Public Economics courses and two Banking and Finance courses in the third year, to the categories of the marking scale and the MURST scale. The observed marking scale means proved to be statistically equal to their corresponding values appearing in Table 3, while the observed MURST scale means proved to be statistically different from their corresponding values in Table 4. Furthermore, in Table 5 the figures for the marking scale categories seemed quite similar to those for the MURST scale. Therefore, it could be argued that the marking scale operated as a guideline for respondents, to evaluate the intensity level of the MURST scale categories. The means of the marking scale were slightly higher than those of the MURST scale, while the heteroscedasticity persisted in the data. However, the differences proved to be just slightly lower than those observed between the values in Table 3 and Table 4. These simple remarks could lead one to believe that the most suitable and reliable scale is the one already well known to the target population, as the marking scale was used in secondary schools to evaluate the learning performance of the respondents.

The analysis carried out through equation (1) showed that the values for the categories of the scales depended on item wording and which terms defined the categories (labels) or the item itself, the nature of the question itself, the kind of course, the teacher, and the intensity level of the item.

Table 5 Means and standard deviations of the scores assigned to the different marking and MURST scales categories by course and by item (Q3)

Layout: Likert categories 2, 4/5, 6, 8/7, and 10 by course (PbE - A, n=33; PbE - B, n=37; PbE - C, n=34; BF - A, n=14; BF - B, n=28), each course with a marking scale row and a MURST scale (MURST S.) row; columns ALR, AWL, CAP, ATM, CTP, TAO, UTS, LIS; standard deviations in parentheses.
(a) The sampling sizes are the same in categories 4/5, 6, 8/7, and 10, as the respondents are the same.

3. THE FUZZY SYSTEM FOR TEACHING EVALUATION

The categories of the adopted scales were labeled by linguistic terms and a statistical analysis generally requires their translation into corresponding numeric values, which is completely arbitrary. The fuzzy set theory could represent an ideal tool to handle this kind of evaluation data, as it was originally proposed as a means for representing indeterminacy and formalizing qualitative concepts that generally have no precise boundaries. In many situations, it is difficult to describe phenomena simply in terms of black and white distinctions. Language, our primary means of communication, is anything but precise. The fuzzy set theory supports reasoning about these kinds of situations because it is based on gradation rather than sharp distinction. The essential steps for a fuzzy system design (Kasabov, 1996; von Altrock, 1997) are: (i) identification of the problem and the selection of the type of fuzzy system that best suits the characteristics of the problem, (ii) definition of the input and output variables, their fuzzy values, and their membership function (fuzzification of input and output), (iii) construction of blocks of control rules and the translation of the latter into a fuzzy relation, (iv) treatment of any input information to select the fuzzy inference method, (v) translation of the fuzzy output into a crisp (numerical) value (defuzzification methods).

Fuzzification and the construction of blocks of fuzzy rules are the main problems in building a fuzzy expert system. These two steps could be obtained in several ways. One approach is to interview experts on the problem. Another is the use of the methods of machine-learning, neural networks and genetic algorithms to determine membership functions and fuzzy rules. The two approaches are quite different. The first does not use the past history of the problem and permits real contact with the experts, which may allow experience matured in years of work in that field to enter into the study. The second is based only on past data and transfers the same structure of the past to the future. The first approach seemed to be suitable to achieve the purpose of creating a different method to process the evaluation data, but not to set up the mathematical model underlying the procedures to yield the traditional scores according to standard methodology. This type of problem could not be tackled by using the second approach because past data on the numerical values for the marking scale categories are not available. Moreover, previous scores for courses have no meaning in constructing a system to determine the intensity of the categories in a given scale.

The identification of the problem, (i), led to a modular system, as reported in Figure 1, whose structure consisted of several fuzzy modules linked together, enabling the variables to be introduced at different levels of importance, while in the traditional approach they were equally weighted. Each single aggregation produced intermediate variables that had a particular meaning. For example, the aggregation of "Clarity of the Teacher's Presentations" and "Teacher Availability during Office hours" generated the new variable, "Teacher", which summarized all the information about the teacher.

The fuzzification of input variables, (ii), was carried out by using the answers collected by the corresponding questionnaire (Q1 or Q2). For example, "Your Level of Interest in the Subject matter" had alternatives: ①Very insufficient, ②Insufficient, ③Sufficient, ④Good, and ⑤Very good. The students numerically evaluated these labels on a ten-point scale. The scores collected for each term in each question were analyzed through a histogram in order to identify the form, peak, and amplitude of the corresponding fuzzy number. The relative frequency distributions were normalized to one to be comparable and to derive the membership function of the relative linguistic attributes. Almost all the terms were represented by piecewise-linear functions. Figure 2, as an example, illustrates the fuzzification of the variable "Level of Interest in the Subject matter". This procedure was repeated for each of the seven remaining items.

The rule-blocks, (iii), were set up through the experts' opinions, i.e. the experience of teachers and students. The aggregation operator selected for the precondition, (iv), was the MIN operator. The resulting linguistic output of each module (intermediate inputs) was also the (linguistic) input of the next module. Its function, therefore, permitted the connection between the rule-blocks, as illustrated in Figure 1. At the last stage of the procedure, to obtain the output (termed "Evaluation" in Figure 1), defuzzification, (v), was carried out for each respondent after having inserted the data in the system. The crisp value, corresponding to the best representation of the fuzzy value of the linguistic output, was obtained by the "Center Of Maximum" (COM) method. It determined the most typical value for each term and then COM computed the best compromise for the fuzzy inference resulting from a weighted average of them. The weights were the levels of activation of the geometric figure resulting from the union of the output fuzzy sets.

Fig. 1. Fuzzy inference system for teacher and course evaluation

Fig. 2. Fuzzification of the five-point mark scale
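The fuzzification step described in this section (a histogram of the scores given to one label, normalized so that its peak equals one, then read as a piecewise-linear membership function) can be sketched as follows. The function names and the sample scores are hypothetical, and joining the integer points by straight segments is an assumption about how the piecewise-linear form was obtained.

```python
from collections import Counter


def membership_points(scores, lo=0, hi=10):
    """Relative frequencies of integer scores, rescaled so the peak equals 1.

    Returns a list of (score, membership) pairs for lo..hi.
    """
    if not scores:
        raise ValueError("at least one score is required")
    counts = Counter(scores)
    peak = max(counts.values())
    return [(x, counts.get(x, 0) / peak) for x in range(lo, hi + 1)]


def membership(points, x):
    """Piecewise-linear interpolation between neighbouring points."""
    if x <= points[0][0]:
        return points[0][1]
    if x >= points[-1][0]:
        return points[-1][1]
    for (x0, m0), (x1, m1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return m0 + (m1 - m0) * (x - x0) / (x1 - x0)
    raise ValueError("points must be sorted by score")


if __name__ == "__main__":
    # Invented ten-point ratings given by six students to the label "Sufficient".
    sufficient = membership_points([5, 6, 6, 6, 7, 7])
    for x in (4, 5, 6, 6.5, 7, 8):
        print(x, round(membership(sufficient, x), 3))
```

Repeating this for each label of each item would give one membership function per linguistic term, in the spirit of Figure 2.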

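The inference in one module can be illustrated with a toy version of the "Teacher" block, which combines CTP and TAO. Everything specific below is an assumption made for the sketch: the three triangular terms, the rule table, the typical values, and the use of MAX to combine rules that share an output term. Only the MIN operator for the precondition and the Center of Maximum defuzzification come from the text.

```python
# Hypothetical terms on the ten-point scale: (left foot, peak, right foot).
TERMS = {"low": (-2, 2, 6), "medium": (2, 6, 10), "high": (6, 10, 14)}

# Most typical value of each output term (here simply the peak).
TYPICAL = {"low": 2.0, "medium": 6.0, "high": 10.0}

# Hypothetical rule block: (CTP term, TAO term) -> "Teacher" term.
RULES = {
    ("low", "low"): "low",
    ("low", "medium"): "low",
    ("low", "high"): "medium",
    ("medium", "low"): "low",
    ("medium", "medium"): "medium",
    ("medium", "high"): "medium",
    ("high", "low"): "medium",
    ("high", "medium"): "high",
    ("high", "high"): "high",
}


def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if a < x <= b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0


def fuzzify(x):
    """Degree of membership of a crisp score in each term."""
    return {term: tri(x, *abc) for term, abc in TERMS.items()}


def evaluate_teacher(ctp, tao):
    """MIN for each rule precondition, then Center of Maximum defuzzification."""
    mu_ctp, mu_tao = fuzzify(ctp), fuzzify(tao)
    activation = {term: 0.0 for term in TYPICAL}
    for (t_ctp, t_tao), out in RULES.items():
        degree = min(mu_ctp[t_ctp], mu_tao[t_tao])        # MIN operator
        activation[out] = max(activation[out], degree)    # assumed MAX aggregation
    total = sum(activation.values())
    if total == 0:
        raise ValueError("no rule fired for these inputs")
    # COM: weighted average of the typical values, weighted by activation.
    return sum(activation[t] * TYPICAL[t] for t in TYPICAL) / total


if __name__ == "__main__":
    print(evaluate_teacher(8, 8))
    print(evaluate_teacher(2, 10))
```

In a modular system of the kind shown in Figure 1, the output of such a block would feed the next rule block rather than being defuzzified immediately; the crisp value is computed here only to keep the example short.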

As an example, some records containing data gathered in December 2000 at the University of Modena and Reggio Emilia and fuzzy evaluations are reported in Table 6. The first row of Table 6 reports the variable names, the traditional, and the fuzzy results. The other rows show the responses of students attending a Mathematics course, translated into the double of the numerical label.

Table 6 Evaluation of the eight items for a Mathematics course, means, and fuzzy responses

| Mean | Fuzzy |
|---|---|
| 7.75 | 6.88 |
| 7.50 | 8.12 |
| 8.75 | 10.00 |
| 7.50 | 8.13 |
| 7.75 | 8.12 |
| 7.25 | 8.13 |
| 8.25 | 9.37 |
| 7.25 | 8.13 |
| 7.00 | 6.87 |
| 5.75 | 3.91 |
| 7.00 | 6.87 |
| 7.50 | 6.25 |
| 7.25 | 6.88 |
| 8.25 | 7.50 |
| 8.25 | 9.37 |
| 7.25 | 8.12 |
| 4.25 | 2.50 |
| 6.75 | 6.25 |
| 8.75 | 9.37 |
| 5.50 | 2.50 |
| 8.75 | 9.37 |
| 7.50 | 7.50 |
| 7.75 | 8.75 |
| 5.00 | 2.50 |
| 2.00 | 0.00 |
| 5.75 | 3.12 |
| 7.75 | 8.12 |
| 8.00 | 10.00 |
| Average (825 cases): 7.29 | 7.07 |

On average, a rating of «sufficient» corresponded to a rating greater than or equal to six. Both means of the ten-point scale and of the fuzzy evaluations were dichotomized between sufficient (SUF) and insufficient (INS). There was "coherence" when both mean scores yielded the same judgment, be it sufficient or insufficient, i.e. all the answers corresponding to sufficient on the ten-point scale were sufficient using the fuzzy score, and all the answers corresponding to insufficient on the ten-point scale were insufficient using the fuzzy score. There was coherence in 80% of cases, while there was incoherence in the remaining 20%. In the case of coherence, 70% of the total was sufficient, while only 10% was insufficient, but this only indicated generalized satisfaction among the students. However, in the case of incoherence, the means of the ten-point scale all proved to be sufficient and the means of fuzzy scores all proved to be insufficient, which indicated a systematic underestimation of the numerical values in the fuzzy evaluation. Therefore, differences arose about the discrimination value, equal to six. The comparison of the two scores showed that the fuzzy scores were higher than the traditional ones in 45% of the cases, but none of them yielded a change between the SUF and INS classes. The correlation coefficients were stable over courses and about 0.85, although the fuzzy scores overestimated high means and underestimated low means.

4. CONCLUSIONS

The questionnaires administered to the students showed that the values of the marking scale were affected by rigidity and did not correspond exactly to the values intended by the members (students) of the target population. They could have been represented by the means, but the latter were affected by several factors (such as gender, course, teacher). The fuzzy approach made it possible for the marking scale to use "values" closer to those that the students really wanted to attribute to its categories. The rule-blocks set up accounted for links between the inputs and the importance that teachers and students attributed to the single input (item).

Another finding concerned the performance of the marking scale versus the MURST scale. The former showed a better performance than the latter, i.e. the marking scale, being more popular among the students, received mean values closer to the double of the numerical label than the values of the ten-point scales proposed by the MURST scale. Furthermore, the standard deviations of the scores attributed to the marking scale categories proved to be lower than those for the MURST scale categories.

REFERENCES

Chiandotto, B. and M.M. Gola (1999). Questionario di base da utilizzare per l'attuazione di un programma per la valutazione della didattica da parte degli studenti. Internet site of the MURST: http://www.murst.it/osservatorio/attivnuc.htm.

Hand, D.J. and M.J. Crowder (1996). Practical Longitudinal Data Analysis. Chapman & Hall, London.

Kasabov, N.K. (1996). Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering. MIT Press, Cambridge, MA.

Ray, J.J. (1990). Acquiescence and Problems with Forced-choice Scales. Journal of Social Psychology, 130 (3), 397-399.

Schuman, H. and S. Presser (1996). Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording, and Context. Sage Publications, Thousand Oaks, CA.

von Altrock, C. (1997). Fuzzy Logic and Neurofuzzy Applications in Business and Finance. Prentice Hall, New York.
