
Chapter 3

Univariate Descriptive Statistics

Mathematics is the alphabet with which God has written the Universe.
Galileo Galilei

3.1 INTRODUCTION

Descriptive statistics describes and summarizes the main characteristics observed in a dataset through tables, charts, graphs, and summary measures, giving the researcher a better understanding of how the data behave. The analysis is restricted to the dataset being studied (the sample), without drawing any conclusions or inferences about the population.

Researchers can use descriptive statistics to study a single variable (univariate descriptive statistics), two variables (bivariate descriptive statistics), or more than two variables (multivariate descriptive statistics). In this chapter, we will study the concepts of descriptive statistics involving a single variable.

Univariate descriptive statistics covers the following topics: (a) the frequency with which a set of data occurs, through frequency distribution tables; (b) the representation of the variable’s distribution through charts; and (c) measures that represent a data series, such as measures of position or location, measures of dispersion or variability, and measures of shape (skewness and kurtosis).

The four main goals of this chapter are: (1) to introduce the most common concepts related to the tables, charts, and summary measures in univariate descriptive statistics, (2) to present their applications in real examples, (3) to construct tables, charts, and summary measures using Excel and the statistical software SPSS and Stata, and (4) to discuss the results achieved.

As described in the previous chapter, before we begin using descriptive statistics, it is necessary to identify the type of variable being studied. The type of variable determines which descriptive statistics can be calculated and how the results can be represented graphically.

Fig. 3.1 shows the univariate descriptive statistics that will be studied in this chapter, represented by tables, charts, graphs, and summary measures, for each type of variable. Fig. 3.1 summarizes the following information:

a) The descriptive statistics used to represent the behavior of one qualitative variable’s data are frequency distribution tables and graphs/charts.
b) The frequency distribution table for a qualitative variable represents the frequency with which each category of the variable occurs.
c) Qualitative variables can be represented graphically by bar charts (horizontal and vertical), pie charts, and Pareto charts.
d) For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, dispersion or variability, and measures of shape). Frequency distribution tables can also be used to represent the frequency with which each possible value of a discrete variable occurs, or to represent the frequency of the data of continuous variables grouped into classes.
e) Line graphs, dot or dispersion plots, histograms, stem-and-leaf plots, and boxplots (box-and-whisker diagrams) are normally used as the graphical representation of quantitative variables.
f) Measures of position or location can be divided into measures of central tendency (mean, mode, and median) and quantiles (quartiles, deciles, and percentiles).
g) The most common measures of dispersion or variability are range, average deviation, variance, standard deviation, standard error, and coefficient of variation.
h) The measures of shape include measures of skewness and kurtosis.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00003-3 © 2019 Elsevier Inc. All rights reserved.


[Figure 3.1: tree diagram by variable type. Qualitative: tables (frequency distribution) and charts (bar, horizontal or vertical; pie; Pareto). Quantitative: tables (frequency distribution), graphs (line, scatter, histogram, stem-and-leaf, boxplot), and summary measures of position or location (central tendency: mean, mode*, median; quantiles: quartiles, deciles, percentiles), dispersion or variability (range, average deviation, variance, standard deviation, standard error, coefficient of variation), and shape (skewness, kurtosis).]

FIG. 3.1 A brief summary of univariate descriptive statistics. *The mode, which provides the most frequent value of the variable, is the only summary measure that can also be used for qualitative variables.

3.2 FREQUENCY DISTRIBUTION TABLE

Frequency distribution tables can be used to represent the frequency with which the values of a qualitative or quantitative variable occur. In the case of qualitative variables, the table represents the frequency with which each category of the variable occurs. For discrete quantitative variables, the frequency of occurrences is calculated for each discrete value of the variable. Continuous variable data, on the other hand, are first grouped into classes, and we then calculate the frequency with which each class occurs.

A frequency distribution table contains the following calculations:

a) Absolute frequency (Fi): number of times each value i appears in the sample.
b) Relative frequency (Fri): the absolute frequency expressed as a percentage of the sample size.
c) Cumulative frequency (Fac): sum of the absolute frequencies of all values equal to or less than the value being analyzed.
d) Relative cumulative frequency (Frac): the cumulative frequency expressed as a percentage of the sample size (the sum of the relative frequencies of all values equal to or less than the value being considered).
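As a minimal sketch of these four calculations, the Python snippet below builds the table from a mapping of categories to absolute frequencies. The function name `frequency_table` and the row layout are illustrative assumptions, not part of the chapter, which works in Excel, SPSS, and Stata. The counts are the donor totals of Example 3.1 (Table 3.E.1), with the Rh-negative types written as "A-", "B-", and so on.

```python
def frequency_table(counts):
    """Return rows (value, Fi, Fri %, Fac, Frac %) in the order given.

    `counts` maps each category (or discrete value) to its absolute
    frequency Fi. The name and the row layout are illustrative choices.
    """
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for value, fi in counts.items():
        cumulative += fi                            # Fac: running sum of Fi
        fri = round(100 * fi / total, 2)            # Fri: share of the sample
        frac = round(100 * cumulative / total, 2)   # Frac: cumulative share
        rows.append((value, fi, fri, cumulative, frac))
    return rows


# Donor counts per blood type, as listed in Table 3.E.1.
donors = {"A+": 15, "A-": 2, "B+": 6, "B-": 1,
          "AB+": 1, "AB-": 1, "O+": 32, "O-": 2}

table = frequency_table(donors)
for value, fi, fri, fac, frac in table:
    print(f"{value:<4}{fi:>4}{fri:>8.2f}{fac:>5}{frac:>8.2f}")
```

The same function can be applied to a discrete quantitative variable, provided the values are passed in ascending order.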

3.2.1 Frequency Distribution Table for Qualitative Variables

Through a practical example, we will build the frequency distribution table by calculating the absolute frequency, relative frequency, cumulative frequency, and relative cumulative frequency for each category of the qualitative variable being analyzed.

Example 3.1
Saint August Hospital provides 3000 blood transfusions to hospitalized patients every month. In order for the hospital to be able to maintain its stocks, 60 blood donations a day are necessary. Table 3.E.1 shows the total number of donors for each blood type on a certain day. Build the frequency distribution table for this problem.

TABLE 3.E.1 Total Number of Donors of Each Blood Type

Blood Type    Donors
A+            15
A-            2
B+            6
B-            1
AB+           1
AB-           1
O+            32
O-            2

Solution
The complete frequency distribution table for Example 3.1 is shown in Table 3.E.2:

TABLE 3.E.2 Frequency Distribution of Example 3.1

Blood Type    Fi    Fri (%)    Fac    Frac (%)
A+            15    25         15     25
A-            2     3.33       17     28.33
B+            6     10         23     38.33
B-            1     1.67       24     40
AB+           1     1.67       25     41.67
AB-           1     1.67       26     43.33
O+            32    53.33      58     96.67
O-            2     3.33       60     100
Sum           60    100

3.2.2 Frequency Distribution Table for Discrete Data

Through the frequency distribution table, we can calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency for each possible value of the discrete variable. Unlike the table for qualitative variables, the rows hold the possible numeric values instead of the possible categories. To facilitate understanding, the data must be presented in ascending order.

Example 3.2
A Japanese restaurant is defining the new layout for its tables and, in order to do that, it collected information on the number of people who have lunch and dinner at each table throughout one week. Table 3.E.3 shows the first 40 pieces of data collected. Build the frequency distribution table for these data.

TABLE 3.E.3 Number of People per Table

2    5    4    7    4    1    6    2    2    5
4    12   8    6    4    5    2    8    2    6
4    7    2    5    6    4    1    5    10   2
2    10   6    4    3    4    6    3    8    4


Solution
In the next table, each row of the first column represents a possible numeric value of the variable being analyzed. The data are sorted in ascending order. The complete frequency distribution table for Example 3.2 is shown below.

TABLE 3.E.4 Frequency Distribution for Example 3.2

Number of People    Fi    Fri (%)    Fac    Frac (%)
1                   2     5          2      5
2                   8     20         10     25
3                   2     5          12     30
4                   9     22.5       21     52.5
5                   5     12.5       26     65
6                   6     15         32     80
7                   2     5          34     85
8                   3     7.5        37     92.5
10                  2     5          39     97.5
12                  1     2.5        40     100
Sum                 40    100

3.2.3 Frequency Distribution Table for Continuous Data Grouped into Classes

As described in Chapter 2, continuous quantitative variables are those whose possible values lie in an interval of real numbers. Therefore, it makes no sense to calculate the frequency of each possible value, since values rarely repeat. It is better to group the data into classes or ranges.

The choice of class interval is arbitrary. However, we must be careful: if the number of classes is too small, a lot of information can be lost; if the number of classes is too large, the summary of information is compromised (Bussab and Morettin, 2011). The interval between the classes does not need to be constant, but to keep things simple we will assume the same interval for all classes.

The following steps must be taken to build a frequency distribution table for continuous data:

Step 1: Sort the data in ascending order.
Step 2: Determine the number of classes (k), using one of the options:
a) Sturges’ rule: $k = 1 + 3.3 \log_{10}(n)$
b) The expression $k = \sqrt{n}$
where n is the sample size. The value of k must be an integer.
Step 3: Determine the interval between the classes (h), calculated as the range of the sample (A = maximum value − minimum value) divided by the number of classes: $h = A/k$. The value of h is rounded up to the next integer.
Step 4: Build the frequency distribution table (calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency) for each class. The lower limit of the first class corresponds to the minimum value of the sample. To determine the upper limit of each class, we add h to its lower limit. The lower limit of each subsequent class corresponds to the upper limit of the previous class.
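Steps 2 to 4 can be sketched in a few lines of Python. The function names below are hypothetical, and rounding k to the nearest integer is an assumption: the text only requires k to be an integer. The logarithm is taken in base 10.

```python
import math


def number_of_classes(n, rule="sturges"):
    """Number of classes k for a sample of size n (Step 2).

    Rounding to the nearest integer is an assumption of this sketch.
    """
    if rule == "sturges":
        k = 1 + 3.3 * math.log10(n)   # option (a): Sturges' rule
    elif rule == "sqrt":
        k = math.sqrt(n)              # option (b): square root of n
    else:
        raise ValueError(f"unknown rule: {rule}")
    return round(k)


def class_interval(minimum, maximum, k):
    """Class interval h = A / k, rounded up to the next integer (Step 3)."""
    amplitude = maximum - minimum     # A: range of the sample
    return math.ceil(amplitude / k)


def class_limits(minimum, h, k):
    """(lower, upper) limits of the k classes, starting at the minimum (Step 4)."""
    return [(minimum + i * h, minimum + (i + 1) * h) for i in range(k)]


# A sample of 30 observations ranging from 3.5 to 8.8.
k = number_of_classes(30)
h = class_interval(3.5, 8.8, k)
print(k, h, class_limits(3.5, h, k))
```

For a sample of 30 observations between 3.5 and 8.8, this gives six classes of width 1, the first running from 3.5 to 4.5.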


Example 3.3
Consider the data in Table 3.E.5 regarding the grades of 30 students enrolled in the subject Financial Market. Build a frequency distribution table for this problem.

TABLE 3.E.5 Grades of 30 Students Enrolled in the Subject Financial Market

4.2    3.9    5.7    6.5    4.6    6.3    8.0    4.4    5.0    5.5
6.0    4.5    5.0    7.2    6.4    7.2    5.0    6.8    4.7    3.5
6.0    7.4    8.8    3.8    5.5    5.0    6.6    7.1    5.3    4.7

Note: To determine the number of classes, use Sturges’ rule.

Solution
Let’s apply the four steps to build the frequency distribution table of Example 3.3, whose variable is continuous:

Step 1: Let’s sort the data in ascending order, as shown in Table 3.E.6.

TABLE 3.E.6 Data From Table 3.E.5 Sorted in Ascending Order

3.5    3.8    3.9    4.2    4.4    4.5    4.6    4.7    4.7    5.0
5.0    5.0    5.0    5.3    5.5    5.5    5.7    6.0    6.0    6.3
6.4    6.5    6.6    6.8    7.1    7.2    7.2    7.4    8.0    8.8

Step 2: Let’s determine the number of classes (k) by using Sturges’ rule:

$k = 1 + 3.3 \log_{10}(30) = 5.87 \cong 6$

Step 3: The interval between the classes (h) is given by:

$h = \frac{A}{k} = \frac{8.8 - 3.5}{6} = 0.88 \cong 1$

Step 4: Finally, let’s build the frequency distribution table for each class. The lower limit of the first class corresponds to the minimum grade, 3.5. Adding the class interval (1) to this value gives the upper limit of the first class, 4.5. The second class starts from this value, and so on, until the last class is defined. We use the notation ├ to indicate that the lower limit is included in the class and the upper limit is not. The complete frequency distribution table for Example 3.3 is presented in Table 3.E.7.

TABLE 3.E.7 Frequency Distribution for Example 3.3

Class        Fi    Fri (%)    Fac    Frac (%)
3.5 ├ 4.5    5     16.67      5      16.67
4.5 ├ 5.5    9     30         14     46.67
5.5 ├ 6.5    7     23.33      21     70
6.5 ├ 7.5    7     23.33      28     93.33
7.5 ├ 8.5    1     3.33       29     96.67
8.5 ├ 9.5    1     3.33       30     100
Sum          30    100
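The class counts of Table 3.E.7 can be checked with a short Python sketch. The function name `grouped_frequency_table` is hypothetical; the grades are those of Table 3.E.5, and the starting value 3.5, the interval h = 1, and k = 6 classes are taken from Steps 2 to 4 of the solution.

```python
grades = [4.2, 3.9, 5.7, 6.5, 4.6, 6.3, 8.0, 4.4, 5.0, 5.5,
          6.0, 4.5, 5.0, 7.2, 6.4, 7.2, 5.0, 6.8, 4.7, 3.5,
          6.0, 7.4, 8.8, 3.8, 5.5, 5.0, 6.6, 7.1, 5.3, 4.7]


def grouped_frequency_table(data, start, h, k):
    """Rows (lower, upper, Fi, Fri %, Fac, Frac %) for k classes of width h.

    Each class includes its lower limit and excludes its upper limit.
    """
    n = len(data)
    rows = []
    cumulative = 0
    for i in range(k):
        lower, upper = start + i * h, start + (i + 1) * h
        fi = sum(1 for x in data if lower <= x < upper)
        cumulative += fi
        rows.append((lower, upper, fi, round(100 * fi / n, 2),
                     cumulative, round(100 * cumulative / n, 2)))
    return rows


rows = grouped_frequency_table(grades, start=3.5, h=1, k=6)
for lower, upper, fi, fri, fac, frac in rows:
    print(f"{lower:.1f} |- {upper:.1f}{fi:>4}{fri:>8.2f}{fac:>4}{frac:>8.2f}")
```

Because the comparison is `lower <= x < upper`, a grade of exactly 4.5 falls in the second class, in line with the ├ notation.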

3.3 GRAPHICAL REPRESENTATION OF THE RESULTS

The behavior of qualitative and quantitative variable data can also be represented graphically. Charts are a representation of numeric data in the form of geometric figures (graphs, diagrams, drawings, or images), allowing the reader to interpret these data quickly and objectively.

In Section 3.3.1, the main graphical representations for qualitative variables are illustrated: bar charts (horizontal and vertical), pie charts, and Pareto charts. Quantitative variables are usually represented by line graphs, dot plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams), as shown in Section 3.3.2.

Bar charts (horizontal and vertical), pie charts, Pareto charts, line graphs, dot plots, and histograms will be generated in Excel. The boxplots and histograms will be constructed using SPSS and Stata.

To build a chart in Excel, the variables’ data and names must first be standardized, codified, and selected in a spreadsheet. The next step consists of clicking on the Insert tab and, in the Charts group, selecting the type of chart we are interested in (Columns, Lines, Pie, Bar, Area, Scatter, or Other Charts). The chart is generated automatically on the screen, and it can be personalized according to the preferences of the researcher.

Excel offers a variety of chart styles, layouts, and formats. To use them, the researcher just needs to select the plotted chart and click on the Design, Layout, or Format tab. On the Layout tab, for example, many resources are available, such as Chart Title; Axis Titles (shows the names of the horizontal and vertical axes); Legend (shows or hides the legend); Data Labels (allows the researcher to insert the series name, the category name, or the values of the labels in the desired place); Data Table (shows the data table below the chart, with or without legend keys); Axes (allows the researcher to personalize the scale of the horizontal and vertical axes); and Gridlines (shows or hides horizontal and vertical gridlines), among others. The Chart Title, Axis Titles, Legend, Data Labels, and Data Table icons are in the Labels group, while the Axes and Gridlines icons are in the Axes group.

3.3.1 Graphical Representation for Qualitative Variables

3.3.1.1 Bar Chart

This type of chart is widely used for nominal and ordinal qualitative variables, but it can also be used for discrete quantitative variables, because it allows us to investigate the presence of data trends. As its name indicates, this chart uses bars to represent the absolute or relative frequencies of each possible category (or numeric value) of a qualitative (or quantitative) variable.

In vertical bar charts, each variable category is shown on the X-axis as a bar with constant width, and the height of the respective bar indicates the frequency of the category on the Y-axis. Conversely, in horizontal bar charts, each variable category is shown on the Y-axis as a bar of constant height, and the length of the respective bar indicates the frequency of the category on the X-axis. Let’s now build horizontal and vertical bar charts from a practical example.

Example 3.4
A bank ran a satisfaction survey with 120 customers to measure how agile its services were (excellent, good, satisfactory, and poor). The absolute frequencies for each category are presented in Table 3.E.8. Construct a vertical and a horizontal bar chart for this problem.

TABLE 3.E.8 Frequencies of Occurrences per Category

Satisfaction    Absolute Frequency
Excellent       58
Good            18
Satisfactory    32
Poor            12

Solution Let’s build the vertical and horizontal bar charts of Example 3.4 in Excel.

FIG. 3.2 Vertical bar chart for Example 3.4. [Chart: Satisfaction categories on the X-axis, absolute frequency on the Y-axis (0 to 70); bars for Excellent (58), Good (18), Satisfactory (32), and Poor (12).]

FIG. 3.3 Horizontal bar chart for Example 3.4. [Chart: Satisfaction categories on the Y-axis, absolute frequency on the X-axis (0 to 70); bars for Excellent (58), Good (18), Satisfactory (32), and Poor (12).]

First, the data in Table 3.E.8 must be standardized, codified, and selected in a spreadsheet. After that, we click on the Insert tab and, in the Charts group, select the option Columns. The chart is automatically generated on the screen. Next, to personalize the chart, we click on it and select the following icons on the Layout tab: (a) Axis Titles: let’s select the title for the horizontal axis (Satisfaction) and for the vertical axis (Frequency); (b) Legend: to hide the legend, we must click on None; (c) Data Labels: after clicking on More Data Label Options, the option Value must be selected in Label Contains (or we can select the option Outside End).

Fig. 3.2 shows the vertical bar chart of Example 3.4 generated in Excel. Based on Fig. 3.2, we can see that the categories of the variable being analyzed are presented on the X-axis by bars with the same width, and their respective heights indicate the frequencies on the Y-axis.

To construct the horizontal bar chart, we must select the option Bar instead of Columns. The other steps follow the same logic. Fig. 3.3 represents the frequency data from Table 3.E.8 through a horizontal bar chart constructed in Excel. The horizontal bar chart in Fig. 3.3 represents the categories of the variable on the Y-axis and their respective frequencies on the X-axis. For each variable category, we draw a bar with a length that corresponds to its frequency.

This chart therefore only shows how each category of the original variable behaves and supports a first investigation of the type of distribution. It does not allow us to calculate measures of position, dispersion, skewness, or kurtosis, since the variable being studied is qualitative.

3.3.1.2 Pie Chart

Another way to represent qualitative data, in terms of relative frequencies (percentages), is the pie chart. The chart corresponds to a circle with an arbitrary radius (the whole) divided into sectors, or slices, of different sizes (parts of the whole). This chart allows the researcher to visualize the data as slices of a pie or parts of a whole. Let’s now build a pie chart from a practical example.

Example 3.5
An election poll was carried out in the city of Sao Paulo to check voters’ preferences concerning the political parties running in the next elections for Mayor. The percentage of voters per political party can be seen in Table 3.E.9. Construct a pie chart for Example 3.5.

TABLE 3.E.9 Percentage of Voters per Political Party

Political Party    Percentage
PMDB               18
PSDB               22
PDT                12.5
PT                 24.5
PC do B            8
PV                 5
Others             10

Solution
Let’s build the pie chart for Example 3.5 in Excel. The steps are similar to the ones in Example 3.4. However, we now have to select the option Pie in the Charts group, on the Insert tab. Fig. 3.4 presents the pie chart obtained in Excel for the data shown in Table 3.E.9.

FIG. 3.4 Pie chart of Example 3.5. [Chart titled "Political party" with slices PMDB 18%, PSDB 22%, PDT 12.5%, PT 24.5%, PC do B 8%, PV 5%, and Others 10%.]

3.3.1.3 Pareto Chart

The Pareto chart is a quality control tool whose main objective is to investigate the types of problems and, consequently, to identify their respective causes, so that action can be taken to reduce or eliminate them.

The Pareto chart combines bars and a line graph. The bars represent the absolute frequencies of occurrence of the problems, and the line represents the relative cumulative frequencies. The problems are sorted in descending order of priority. Let’s now illustrate a Pareto chart with a practical example.

Example 3.6
A manufacturer of credit and magnetic cards has as its main objective to reduce the number of defective cards. The quality inspector classified a sample of 1000 cards that were collected during one week of production, according to the types of defects found, as shown in Table 3.E.10. Construct a Pareto chart for this problem.

TABLE 3.E.10 Frequencies of the Occurrence of Each Defect

Type of Defect        Absolute Frequency (Fi)
Damaged/Bent          71
Perforated            28
Illegible printing    12
Wrong characters      20
Wrong numbers         44
Others                6
Total                 181

Solution The first step in generating a Pareto chart is to sort the defects in order of priority (from the highest to the lowest frequency). The bar chart represents the absolute frequency of each defect. To construct the line graph, it is necessary to calculate the relative cumulative frequency (%) up to the defect analyzed. Table 3.E.11 shows the absolute frequency for each type of defect, in descending order, and the relative cumulative frequency (%).

TABLE 3.E.11 Absolute Frequency for Each Defect and the Relative Cumulative Frequency (%)

Type of Defect        Number of Defects    Cumulative %
Damaged/Bent          71                   39.23
Wrong numbers         44                   63.54
Perforated            28                   79.01
Wrong characters      20                   90.06
Illegible printing    12                   96.69
Others                6                    100
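The ordering and the cumulative percentages in Table 3.E.11 can be reproduced with the following sketch, which only prepares the numbers behind a Pareto chart and does not draw it. The function name `pareto_table` is hypothetical; the counts are those of Table 3.E.10.

```python
# Absolute frequency of each type of defect, as listed in Table 3.E.10.
defects = {
    "Damaged/Bent": 71,
    "Perforated": 28,
    "Illegible printing": 12,
    "Wrong characters": 20,
    "Wrong numbers": 44,
    "Others": 6,
}


def pareto_table(counts):
    """Rows (category, Fi, cumulative %) sorted from highest to lowest Fi."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    cumulative = 0
    for category, fi in ordered:
        cumulative += fi                 # running total of defects so far
        rows.append((category, fi, round(100 * cumulative / total, 2)))
    return rows


pareto = pareto_table(defects)
for category, fi, cum_pct in pareto:
    print(f"{category:<20}{fi:>4}{cum_pct:>8.2f}")
```

The bars of the chart use the second column and the line uses the third.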

Let’s now build a Pareto chart for Example 3.6 in Excel, using the data in Table 3.E.11. First, the data in Table 3.E.11 must be standardized, codified, and selected in an Excel spreadsheet. In the Charts group, on the Insert tab, let’s select the option Columns (and the clustered column subtype). Note that the chart is automatically generated on the screen. However, both the absolute frequency data and the relative cumulative frequency data are presented as columns. To change the type of chart used for the cumulative percentage, we must right-click on any bar of the respective series and select the option Change Series Chart Type, followed by a line graph with markers. The resulting chart is a Pareto chart.

To personalize the Pareto chart, we must use the following icons on the Layout tab: (a) Axis Titles: for the bar chart, we select the title for the horizontal axis (Type of defect) and for the vertical axis (Frequency); for the line graph, we name the vertical axis Percentage; (b) Legend: to hide the legend, we must click on None; (c) Data Table: let’s select the option Show Data Table with Legend Keys; (d) Axes: the major unit of the vertical axes for both charts is set to 20, and the maximum value of the vertical axis for the line graph is set to 100. Fig. 3.5 shows the chart constructed in Excel that corresponds to the Pareto chart for Example 3.6.


FIG. 3.5 The Pareto chart for Example 3.6. Legend: A, Damaged/Bent; B, Wrong numbers; C, Perforated; D, Wrong characters; E, Illegible printing; F, Others.

3.3.2 Graphical Representation for Quantitative Variables

3.3.2.1 Line Graph

In a line graph, points are represented by the intersection of the variables involved on the horizontal axis (X) and on the vertical axis (Y), and they are connected by straight lines. Despite considering two axes, line graphs will be used in this chapter to represent the behavior of a single variable. The graph shows the evolution or trend of a quantitative variable’s data, which is usually continuous, at regular intervals. The numeric variable values are represented on the Y-axis, and the X-axis only shows the data distribution in a uniform way. Let’s now illustrate a practical example of a line graph.

Example 3.7
Cheap & Easy is a supermarket that recorded the percentage of losses it had in the last 12 months (Table 3.E.12). Based on these data, it will adopt new prevention measures. Build a line graph for Example 3.7.

TABLE 3.E.12 Percentage of Losses in the Last 12 Months

Month        Losses (%)
January      0.42
February     0.38
March        0.12
April        0.34
May          0.22
June         0.15
July         0.18
August       0.31
September    0.47
October      0.24
November     0.42
December     0.09


Solution To build the line graph for Example 3.7 in Excel, in the Charts group, on the Insert tab, we must select the option Lines. The other steps follow the same logic of the previous examples. The complete chart can be seen in Fig. 3.6.

FIG. 3.6 Line graph for Example 3.7.

3.3.2.2 Scatter Plot

A scatter plot is very similar to a line graph. The biggest difference between them is in the way the data are plotted on the horizontal axis. As in a line graph, the points are represented by the intersection of the variables on the horizontal (X) and vertical (Y) axes. However, they are not connected by straight lines.

The scatter plot studied in this chapter is used to show the evolution or trend of a single quantitative variable’s data, as the line graph does, but in general at irregular intervals. As in a line graph, the numeric variable values are represented on the Y-axis, and the X-axis only represents the data behavior throughout time.

In the next chapter, we will see how a scatter plot can be used to describe the behavior of two variables simultaneously (bivariate analysis). The numeric values of one variable will be represented on the Y-axis and those of the other on the X-axis.

Example 3.8
Papermisto is the supplier of three types of raw materials for the production of paper: cellulose, mechanical pulp, and trimmings. In order to maintain its quality standards, the factory carries out a rigorous inspection of its products during each production phase. At irregular intervals, an operator must verify the esthetic and dimensional characteristics of the selected product with specialized instruments. For instance, in the cellulose storage phase, the product must be piled up in bales of approximately 250 kg each. Table 3.E.13 shows the weight of the bales collected in the last 5 hours, at irregular intervals varying between 20 and 45 minutes. Construct a scatter plot for Example 3.8.

TABLE 3.E.13 Evolution of the Weight of the Bales Throughout Time

Time (min)    Weight (kg)
30            250
50            255
85            252
106           248
138           250
178           249
198           252
222           251
252           250
297           245


Solution
To build the scatter plot for Example 3.8 in Excel, in the Charts group, on the Insert tab, we must select the option Scatter. The other steps follow the same logic as the previous examples. The scatter plot can be seen in Fig. 3.7.

FIG. 3.7 Scatter plot for Example 3.8. [Chart: Time (min) on the X-axis, 0 to 300; Weight (kg) on the Y-axis, 244 to 256.]

3.3.2.3 Histogram

A histogram is a vertical bar chart that represents the frequency distribution of one quantitative variable (discrete or continuous). The values of the variable being studied are presented on the X-axis: the base of each bar, with a constant width, represents each possible value of the discrete variable or each class of continuous values, sorted in ascending order. The height of the bars on the Y-axis, in turn, represents the frequency (absolute, relative, or cumulative) of the respective variable values.

A histogram is very similar to a Pareto chart, and it is also one of the seven quality tools. A Pareto chart represents the frequency distribution of a qualitative variable (types of problem), whose categories on the X-axis are sorted in order of priority (from the category with the highest frequency to the one with the lowest). A histogram represents the frequency distribution of a quantitative variable, whose values on the X-axis are sorted in ascending order.

Therefore, the first step in building a histogram is to build the frequency distribution table. As presented in Sections 3.2.2 and 3.2.3, for each possible value of a discrete variable or for each class of continuous data, we calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency. The data must be sorted in ascending order.

The histogram is then constructed from this table. The first column of the frequency distribution table, which holds the numeric values or the classes of the variable being studied, is presented on the X-axis, and the column of absolute frequency (or relative frequency, cumulative frequency, or relative cumulative frequency) is presented on the Y-axis.

Many statistical software packages generate the histogram automatically from the original values of the quantitative variable being studied, without the frequencies having to be calculated first. Even though Excel has the option of building a histogram from its analysis tools, we will show how to build it from the column chart, due to its simplicity.

Example 3.9
In order to improve its services, a national bank is hiring new managers to serve its corporate clients. Table 3.E.14 shows the number of companies dealt with daily in one of its main branches in the capital. Build a histogram from these data using Excel.

TABLE 3.E.14 Number of Companies Dealt With Daily

13    11    13    10    11    12    8     12    9     10
12    10    8     11    9     11    14    11    10    9


Solution
The first step is to build the frequency distribution table (Table 3.E.15). From the data in Table 3.E.15, we can build a histogram of absolute frequency, relative frequency, cumulative frequency, or relative cumulative frequency using Excel. The histogram generated here will be the absolute frequency one. Thus, we must standardize, codify, and select the first two columns of Table 3.E.15 (except the last row: Sum) in an Excel spreadsheet. In the Charts group, on the Insert tab, let’s select the option Columns. Let’s click on the chart so that it can be personalized. On the Layout tab, we select the following icons: (a) Axis Titles: select the title for the horizontal axis (Number of companies) and for the vertical axis (Absolute frequency); (b) Legend: to hide the legend, we must click on None. The histogram generated in Excel can be seen in Fig. 3.8.

TABLE 3.E.15 Frequency Distribution for Example 3.9

Number of Companies    Fi    Fri (%)    Fac    Frac (%)
8                      2     10         2      10
9                      3     15         5      25
10                     4     20         9      45
11                     5     25         14     70
12                     3     15         17     85
13                     2     10         19     95
14                     1     5          20     100
Sum                    20    100

FIG. 3.8 Histogram of absolute frequencies elaborated in Excel for Example 3.9. [Chart: Number of companies on the X-axis, 8 to 14; Absolute frequency on the Y-axis, 0 to 6.]

As mentioned, many statistical computer packages, including SPSS and Stata, build the histogram automatically from the original data of the variable being studied (in this example, the data in Table 3.E.14), without the frequencies having to be calculated. Moreover, these packages have the option of plotting the normal curve. Fig. 3.9 shows the histogram generated using SPSS (with the option of a normal curve) from the data in Table 3.E.14. Sections 3.6 and 3.7 show in detail how it can be constructed using SPSS and Stata, respectively. Note that the values of the discrete variable are presented in the middle of the base of each bar.

For continuous variables, consider the data in Table 3.E.5 (Example 3.3), regarding the grades of the students enrolled in the subject Financial Market. These data were sorted in ascending order, as presented in Table 3.E.6. Fig. 3.10 shows the histogram generated using SPSS (with the option of a normal curve) from the data in Table 3.E.5 or Table 3.E.6.


FIG. 3.9 Histogram constructed using SPSS for Example 3.9 (discrete data). [Chart: Number_of_companies on the X-axis, 6 to 16; Frequency on the Y-axis, 0 to 5.]

FIG. 3.10 Histogram generated using SPSS for Example 3.3 (continuous data). [Chart: Grades on the X-axis, 3.00 to 9.00; Frequency on the Y-axis, 0 to 5.]

Note that the data were grouped using a class interval of h = 0.5, unlike Example 3.3, which used h = 1. The lower limit of each class is represented on the left side of the base of the bar, and the upper limit (not included in the class) on the right side. The height of the bar represents the total frequency in the class. For example, the first bar represents the 3.5 ├ 4.0 class, and there are three values in this interval (3.5, 3.8, and 3.9).

3.3.2.4 Stem-and-Leaf Plot

Both bar charts and histograms represent the shape of the variable’s frequency distribution. The stem-and-leaf plot is an alternative way to represent the frequency distributions of discrete and continuous quantitative variables with few observations, with the advantage of maintaining the original value of each observation (it allows the visualization of all data information).


In the plot, the representation of each observation is divided into two parts, separated by a vertical line: the stem is located on the left of the vertical line and represents the observation’s first digit(s); the leaf is located on the right of the vertical line and represents the observation’s last digit(s). The choice of the number of initial digits that form the stem, or of complementary digits that form the leaf, is arbitrary. The stems usually contain the most significant digits, and the leaves the least significant.

The stems are represented in a single column, with their different values on successive lines. For each stem on the left-hand side of the vertical line, the respective leaves are shown on the right-hand side, across several columns. Stems as well as leaves must be sorted in ascending order. When there are too many leaves per stem, we can use more than one line for the same stem. The choice of the number of lines is arbitrary, as is the definition of the interval or the number of classes in a frequency distribution.

To build a stem-and-leaf plot, we can follow this sequence of steps:

Step 1: Sort the data in ascending order, to make the visualization of the data easier.
Step 2: Define the number of initial digits that will form the stem, or the number of complementary digits that will form the leaf.
Step 3: Elaborate the stems, represented in a single column on the left of the vertical line. Their different values are placed on successive lines, in ascending order. When the number of leaves per stem is very high, we can define two or more lines for the same stem.
Step 4: Place the leaves that correspond to the respective stems on the right-hand side of the vertical line, across several columns (in ascending order).

Example 3.10 A small company collected its employees’ ages, as shown in Table 3.E.16. Build a stem-and-leaf plot.

TABLE 3.E.16 Employees’ Ages

44  60  22  49  31  58  42  63  33  37
54  55  40  71  55  62  35  45  59  54
50  51  24  31  40  73  28  35  75  48

Solution To construct the stem-and-leaf plot, let’s apply the four steps described: Step 1 First, we must sort the data in ascending order, as shown in Table 3.E.17.

TABLE 3.E.17 Employees’ Ages in Ascending Order

22  24  28  31  31  33  35  35  37  40
40  42  44  45  48  49  50  51  54  54
55  55  58  59  60  62  63  71  73  75

Step 2 The next step is to define the number of initial digits of the observation that will form the stem. The complementary digits will form the leaf. In this example, all of the observations have two digits. The stems correspond to the tens and the leaves correspond to the units.

Step 3 The following step is to build the stems. Based on Table 3.E.17, we can see that there are observations that begin with the tens 2, 3, 4, 5, 6, and 7 (stems). The stem with the highest frequency is 5 (8 observations), and all of its leaves can be represented in a single line. Therefore, we will have a single line per stem. Hence, the stems are presented in a single column on the left of the vertical line, in ascending order, as shown in Fig. 3.11.


FIG. 3.11 Building the stems for Example 3.10.

2 |
3 |
4 |
5 |
6 |
7 |

Step 4 Finally, let’s place the leaves that correspond to each stem on the right-hand side of the vertical line. The leaves are represented in ascending order throughout many columns. For example, stem 2 contains leaves 2, 4, and 8. Stem 5 contains leaves 0, 1, 4, 4, 5, 5, 8, and 9, represented throughout 8 columns. If this stem were divided into two lines, the first line would have leaves 0 to 4, and the second line leaves 5 to 9. Fig. 3.12 illustrates the stem-and-leaf plot for Example 3.10. FIG. 3.12 Stem-and-Leaf plot for Example 3.10.

2 | 2 4 8
3 | 1 1 3 5 5 7
4 | 0 0 2 4 5 8 9
5 | 0 1 4 4 5 5 8 9
6 | 0 2 3
7 | 1 3 5
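The four steps above can be sketched in Python. This is only an illustrative sketch: the function name `stem_and_leaf` and its `scale` parameter are assumptions introduced here, not part of the text. The last digit of each (scaled) value is taken as the leaf and the remaining digits as the stem, with one line per stem.

```python
def stem_and_leaf(values, scale=1):
    """Return {stem: [leaves]} with stems and leaves in ascending order.

    `scale` (an assumption of this sketch) turns decimals into integers,
    e.g. scale=10 for data recorded with one decimal place.
    """
    # Step 1: sort the data in ascending order.
    scaled = sorted(round(v * scale) for v in values)
    stems = {}
    for number in scaled:
        # Step 2: the last digit is the leaf, the remaining digits the stem.
        stem, leaf = divmod(number, 10)
        stems.setdefault(stem, []).append(leaf)
    # Steps 3 and 4: one line per stem, including stems without leaves.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}


# Employees' ages from Table 3.E.16.
ages = [44, 60, 22, 49, 31, 58, 42, 63, 33, 37,
        54, 55, 40, 71, 55, 62, 35, 45, 59, 54,
        50, 51, 24, 31, 40, 73, 28, 35, 75, 48]

plot = stem_and_leaf(ages)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

For data with one decimal place, calling the same function with `scale=10` would treat the decimal digit as the leaf.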

Example 3.11 The average temperature, in Celsius, registered in the last 40 days in the city of Porto Alegre can be found in Table 3.E.18. Elaborate the stem-and-leaf plot for Example 3.11.

TABLE 3.E.18 Average Temperature in Celsius

 8.5  13.7  12.9   9.4  11.7  19.2  12.8   9.7  19.5  11.5
15.5  16.0  20.4  17.4  18.0  14.4  14.8  13.0  16.6  20.2
17.9  17.7  16.9  15.2  18.5  17.8  16.2  16.4  18.2  16.9
18.7  19.6  13.2  17.2  20.5  14.1  16.1  15.9  18.8  15.7

Solution Once again, let’s apply the four steps to construct the stem-and-leaf plot, but now we have to consider continuous variables. Step 1 First, let’s sort the data in ascending order, as shown in Table 3.E.19.

TABLE 3.E.19 Average Temperature in Ascending Order

 8.5   9.4   9.7  11.5  11.7  12.8  12.9  13.0  13.2  13.7
14.1  14.4  14.8  15.2  15.5  15.7  15.9  16.0  16.1  16.2
16.4  16.6  16.9  16.9  17.2  17.4  17.7  17.8  17.9  18.0
18.2  18.5  18.7  18.8  19.2  19.5  19.6  20.2  20.4  20.5


Step 2 In this example, the leaves correspond to the last digit. The remaining digits (to the left) correspond to the stems. Steps 3 and 4 The stems vary from 8 to 20. The stem with the highest frequency is 16 (7 observations), and its leaves can be represented in a single line. For each stem, we place the respective leaves. Fig. 3.13 shows the stem-and-leaf plot for Example 3.11. FIG. 3.13 Stem-and-Leaf Plot for Example 3.11.

 8 | 5
 9 | 4 7
10 |
11 | 5 7
12 | 8 9
13 | 0 2 7
14 | 1 4 8
15 | 2 5 7 9
16 | 0 1 2 4 6 9 9
17 | 2 4 7 8 9
18 | 0 2 5 7 8
19 | 2 5 6
20 | 2 4 5

3.3.2.5 Boxplot or Box-and-Whisker Diagram

The boxplot (or box-and-whisker diagram) is a graphical representation of five measures of position or location of a certain variable: minimum value, first quartile (Q1), second quartile (Q2) or median (Md), third quartile (Q3), and maximum value. In a sorted sample, the median corresponds to the central position, and the quartiles subdivide the sample into four equal parts, each one containing 25% of the data. Thus, the first quartile (Q1) is the value below which the first 25% of the data (organized in ascending order) are located. The second quartile corresponds to the median (50% of the sorted data are located below it and the remaining 50% above it), and the third quartile (Q3) is the value below which 75% of the observations are located. The dispersion measure resulting from these location measures is called interquartile range (IQR) or interquartile interval (IQI) and corresponds to the difference between Q3 and Q1. This plot allows us to assess the symmetry and distribution of the data. It also gives us a visual perspective of whether or not there are discrepant data (univariate outliers), since such data lie above the upper limit or below the lower limit. A representation of the diagram can be seen in Fig. 3.14.

FIG. 3.14 Boxplot.


Calculating the median, the first, and third quartiles, and investigating the existence of univariate outliers will be discussed in Sections 3.4.1.1, 3.4.1.2, and 3.4.1.3, respectively. In Sections 3.6.3 and 3.7, we will study how to generate the box-and-whisker diagram on SPSS and Stata, respectively, using a practical example.

3.4 THE MOST COMMON SUMMARY-MEASURES IN UNIVARIATE DESCRIPTIVE STATISTICS Information found in a dataset can be summarized through suitable numerical measures, called summary measures. In univariate descriptive statistics, the most common summary measures have as their main objective to represent the behavior of the variable being studied through its central and noncentral values, its dispersions, or the way its values are distributed around the mean. The summary measures that will be studied in this chapter are measures of position or location (measures of central tendency and quantiles), measures of dispersion or variability, and measures of shape, such as, skewness and kurtosis. These measures are calculated for metric or quantitative variables. The only exception is the mode, which is a measure of central tendency that provides the most frequent value of a certain variable, so, it can also be calculated for nonmetric or qualitative variables.

3.4.1 Measures of Position or Location

These measures provide values that characterize the behavior of a data series, indicating the data position or location in relation to the axis of the values assumed by the variable or characteristic being studied. The measures of position or location are subdivided into measures of central tendency (mean, median, and mode) and quantiles (quartiles, deciles, and percentiles).

3.4.1.1 Measures of Central Tendency

The most common measures of central tendency are the arithmetic mean, the median, and the mode.

3.4.1.1.1 Arithmetic Mean

The arithmetic mean can be a representative measure of a population with N elements, represented by the Greek letter $\mu$, or a representative measure of a sample with n elements, represented by $\bar{X}$.

3.4.1.1.1.1 Case 1: Simple Arithmetic Mean of Ungrouped Discrete and Continuous Data

Simple arithmetic mean, or simply mean, or average, is the sum of all the values of a certain variable (discrete or continuous) divided by the total number of observations. Thus, the sample arithmetic mean of a certain variable X ($\bar{X}$) is:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{3.1}$$

where n is the total number of observations in the dataset and $X_i$, for i = 1, …, n, represents each one of variable X’s values.

Example 3.12 Calculate the simple arithmetic mean of the data in Table 3.E.20, regarding the grades of the graduate students enrolled in the subject Quantitative Methods.

TABLE 3.E.20 Students’ Grades

5.7  6.5  6.9  8.3  8.0  4.2  6.3  7.4  5.8  6.9


Solution The mean is simply calculated as the sum of all the values in Table 3.E.20 divided by the total number of observations:

$$\bar{X} = \frac{5.7 + 6.5 + \cdots + 6.9}{10} = 6.6$$

The AVERAGE function in Excel calculates the simple arithmetic mean of the set of values selected. Let’s assume that the data in Table 3.E.20 are available from cell A1 to cell A10. To calculate the mean, we just need to insert the expression =AVERAGE(A1:A10). Another way to calculate the mean using Excel, as well as other descriptive measures, such as the median, mode, variance, standard deviation, standard error, skewness, and kurtosis, which will also be studied in this chapter, is by using the Analysis ToolPak add-in (Section 3.5).
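As a quick check of Expression (3.1) outside a spreadsheet, the same calculation can be sketched in Python. The function name `arithmetic_mean` is an assumption made for illustration.

```python
def arithmetic_mean(values):
    """Simple arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


# Grades from Table 3.E.20.
grades = [5.7, 6.5, 6.9, 8.3, 8.0, 4.2, 6.3, 7.4, 5.8, 6.9]

mean_grade = arithmetic_mean(grades)
print(round(mean_grade, 2))
```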

3.4.1.1.1.2 Case 2: Weighted Arithmetic Mean of Ungrouped Discrete and Continuous Data

When calculating the simple arithmetic mean, all of the occurrences have the same importance or weight. When we are interested in assigning different weights ($p_i$) to each value i of variable X, we use the weighted arithmetic mean:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i \cdot p_i}{\sum_{i=1}^{n} p_i} \tag{3.2}$$

If the weight is expressed in percentages (relative weight, rw), Expression (3.2) becomes:

$$\bar{X} = \sum_{i=1}^{n} X_i \cdot rw_i \tag{3.3}$$

Example 3.13 At Vanessa’s school, the annual average of each subject is calculated based on the grades obtained throughout all four quarters, with their respective weights being: 1, 2, 3, and 4. Table 3.E.21 shows Vanessa’s grades in mathematics in each quarter. Calculate her annual average in the subject.

TABLE 3.E.21 Vanessa’s Grades in Mathematics

Period        Grade   Weight
1st Quarter   4.5     1
2nd Quarter   7.0     2
3rd Quarter   5.5     3
4th Quarter   6.5     4

Solution The annual average is calculated by using the weighted arithmetic mean criterion. Applying Expression (3.2) to the data in Table 3.E.21, we have:

$$\bar{X} = \frac{4.5 \times 1 + 7.0 \times 2 + 5.5 \times 3 + 6.5 \times 4}{1 + 2 + 3 + 4} = 6.1$$


Example 3.14 There are five stocks in a certain investment portfolio. Table 3.E.22 shows the average yield of each stock in the previous month, as well as the respective percentage invested. Determine the portfolio’s average yield.

TABLE 3.E.22 Yield of Each Stock and Percentage Invested

Stock               Yield (%)   % Investment
Bank of Brazil ON   1.05        10
Bradesco PN         0.56        25
Eletrobras PNB      0.08        15
Gerdau PN           0.24        20
Vale PN             0.75        30

Solution The portfolio’s average yield (%) corresponds to the sum of the products between each stock’s average yield (%) and the respective percentage invested. Using Expression (3.3), we have:

$$\bar{X} = 1.05 \times 0.10 + 0.56 \times 0.25 + 0.08 \times 0.15 + 0.24 \times 0.20 + 0.75 \times 0.30 = 0.53\%$$
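Expressions (3.2) and (3.3) can be checked with one small Python function, sketched below. The name `weighted_mean` is an assumption of this sketch. When the weights are relative and sum to 1, dividing by their sum leaves the result of Expression (3.3) unchanged, so the same function covers both Example 3.13 and Example 3.14.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum of value * weight over the sum of weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


# Example 3.13: quarterly grades with weights 1, 2, 3, and 4.
annual_average = weighted_mean([4.5, 7.0, 5.5, 6.5], [1, 2, 3, 4])

# Example 3.14: yields (%) weighted by the fraction invested in each stock.
portfolio_yield = weighted_mean(
    [1.05, 0.56, 0.08, 0.24, 0.75],
    [0.10, 0.25, 0.15, 0.20, 0.30],
)

print(round(annual_average, 2), round(portfolio_yield, 2))
```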

3.4.1.1.1.3 Case 3: Arithmetic Mean of Grouped Discrete Data

When the discrete values of $X_i$ repeat themselves, the data are grouped into a frequency table. To calculate the arithmetic mean, we use the same criterion as for the weighted mean. However, the weight of each $X_i$ is its absolute frequency ($F_i$) and, instead of n observations with n different values, we have n observations with m different values (grouped data):

$$\bar{X} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{\sum_{i=1}^{m} F_i} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{n} \tag{3.4}$$

If the frequency of the data is expressed in terms of the percentage relative to the absolute frequency (relative frequency, Fr), Expression (3.4) becomes:

$$\bar{X} = \sum_{i=1}^{m} X_i \cdot Fr_i \tag{3.5}$$

Example 3.15 A satisfaction survey with 120 participants evaluated the performance of a health insurance company through grades that vary between 1 and 10. The survey’s results can be seen in Table 3.E.23. Calculate the arithmetic mean of the grades.

TABLE 3.E.23 Absolute Frequency Table

Grades   Number of Participants
1        9
2        12
3        15
4        18
5        24
6        26
7        5
8        7
9        3
10       1

Solution The arithmetic mean of Example 3.15 is calculated from Expression (3.4):

$$\bar{X} = \frac{1 \times 9 + 2 \times 12 + \cdots + 9 \times 3 + 10 \times 1}{120} = 4.62$$
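The calculation from Expression (3.4) can be reproduced with a short Python sketch that stores the frequency table as a dictionary. The names `grouped_mean` and `survey` are assumptions made for illustration.

```python
def grouped_mean(frequency_table):
    """Mean of grouped discrete data given as {value: absolute frequency}."""
    n = sum(frequency_table.values())
    return sum(value * freq for value, freq in frequency_table.items()) / n


# Table 3.E.23: grade -> number of participants.
survey = {1: 9, 2: 12, 3: 15, 4: 18, 5: 24, 6: 26, 7: 5, 8: 7, 9: 3, 10: 1}

mean_survey_grade = grouped_mean(survey)
print(round(mean_survey_grade, 2))
```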

3.4.1.1.1.4 Case 4: Arithmetic Mean of Continuous Data Grouped into Classes

In the simple arithmetic mean, the weighted arithmetic mean, and the arithmetic mean of grouped discrete data, $X_i$ represents each i value of variable X. For continuous data grouped into classes, each class does not have a single value defined, but a set of values. In order for the arithmetic mean to be calculated in this case, we assume that $X_i$ is the middle or central point of class i (i = 1, …, k), so Expressions (3.4) and (3.5) are rewritten in terms of the number of classes (k):

$$\bar{X} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{\sum_{i=1}^{k} F_i} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{n} \tag{3.6}$$

$$\bar{X} = \sum_{i=1}^{k} X_i \cdot Fr_i \tag{3.7}$$

Example 3.16 Table 3.E.24 shows the classes of salaries paid to the employees of a certain company and their respective absolute and relative frequencies. Calculate the average salary.

TABLE 3.E.24 Classes of Salaries (US$ 1000.00) and Their Respective Absolute and Relative Frequencies

Classes    Fi     Fri (%)
1 ├ 3      240    17.14
3 ├ 5      480    34.29
5 ├ 7      320    22.86
7 ├ 9      150    10.71
9 ├ 11     130    9.29
11 ├ 13    80     5.71
Sum        1400   100


Solution Considering $X_i$ the central point of class i and applying Expression (3.6), we have:

$$\bar{X} = \frac{2 \times 240 + 4 \times 480 + 6 \times 320 + 8 \times 150 + 10 \times 130 + 12 \times 80}{1400} = 5.557$$

or using Expression (3.7):

$$\bar{X} = 2 \times 0.1714 + 4 \times 0.3429 + \cdots + 10 \times 0.0929 + 12 \times 0.0571 = 5.557$$

Therefore, the average salary is US$ 5,557.14.
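A minimal Python sketch of Expression (3.6) follows. It assumes each class is stored as a hypothetical `(lower limit, upper limit, absolute frequency)` tuple and uses the class midpoint as $X_i$; the names `class_mean` and `salary_classes` are introduced here only for illustration.

```python
def class_mean(classes):
    """Mean of data grouped into classes, using class midpoints as X_i.

    `classes` is a list of (lower_limit, upper_limit, absolute_frequency).
    """
    n = sum(freq for _, _, freq in classes)
    weighted_sum = sum((lower + upper) / 2 * freq for lower, upper, freq in classes)
    return weighted_sum / n


# Table 3.E.24: salary classes in US$ 1000.00.
salary_classes = [(1, 3, 240), (3, 5, 480), (5, 7, 320),
                  (7, 9, 150), (9, 11, 130), (11, 13, 80)]

average_salary = class_mean(salary_classes)
print(round(average_salary, 3))
```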

3.4.1.1.2 Median

The median (Md) is a measure of location. It locates the center of the distribution of a set of data sorted in ascending order. Its value separates the series into two equal parts, so 50% of the elements are less than or equal to the median, and the other 50% are greater than or equal to the median.

3.4.1.1.2.1 Case 1: Median of Ungrouped Discrete and Continuous Data

The median of variable X (discrete or continuous) can be calculated as follows:

$$Md(X) = \begin{cases} \dfrac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2}, & \text{if } n \text{ is an even number} \\[2ex] X_{\frac{n+1}{2}}, & \text{if } n \text{ is an odd number} \end{cases} \tag{3.8}$$

where n is the total number of observations and $X_1 \le \ldots \le X_n$, considering that $X_1$ is the smallest observation or the value of the first element, and that $X_n$ is the highest observation or the value of the last element.

Example 3.17 Table 3.E.25 shows the monthly production of treadmills of a company in a given year. Calculate the median.

TABLE 3.E.25 Monthly Production of Treadmills in a Given Year

Month    Production (units)
Jan.     210
Feb.     180
Mar.     203
April    195
May      208
June     230
July     185
Aug.     190
Sept.    200
Oct.     182
Nov.     205
Dec.     196


Solution To calculate the median, the observations are sorted in ascending order. Therefore, we have the order of the observations and their respective positions:

Value      180   182   185   190   195   196   200   203   205   208    210    230
Position   1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th   11th   12th

The median will be the mean between the sixth and the seventh elements, since n is an even number, that is:

$$Md = \frac{X_{\frac{12}{2}} + X_{\frac{12}{2}+1}}{2} = \frac{196 + 200}{2} = 198$$

Excel calculates the median of a set of data through the MEDIAN function. Note that the median is not affected by the magnitude of the extreme values of the original variable. If, for instance, the highest value were 400 instead of 230, the median would be exactly the same, whereas the mean would be much higher. The median is also known as the 2nd quartile (Q2), 50th percentile (P50), or 5th decile (D5). These definitions will be studied in more detail in the following sections.
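Expression (3.8) and the remark about extreme values can be illustrated with a short Python sketch. The function name `median_ungrouped` is an assumption; positions in the formula are 1-based, so they are shifted by one when indexing the sorted list.

```python
def median_ungrouped(values):
    """Median of ungrouped data following Expression (3.8)."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 0:
        # Average of the elements in positions n/2 and n/2 + 1 (1-based).
        return (data[n // 2 - 1] + data[n // 2]) / 2
    # Element in position (n + 1)/2 (1-based).
    return data[(n + 1) // 2 - 1]


# Monthly production from Table 3.E.25 (January to December).
production = [210, 180, 203, 195, 208, 230, 185, 190, 200, 182, 205, 196]
median_production = median_ungrouped(production)

# Replacing the highest value (230) with 400 leaves the median unchanged.
altered = [400 if x == 230 else x for x in production]
median_altered = median_ungrouped(altered)

print(median_production, median_altered)
```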

3.4.1.1.2.2 Case 2: Median of Grouped Discrete Data Here, the calculation of the median is similar to the previous case. However, the data are grouped in a frequency distribution table. Analogous to Case 1, if n is an odd number, the position of the central element will be (n + 1)/2. We can see in the cumulative frequency column the group that has this position and, consequently, its corresponding value in the first column (median). If n is an even number, we verify the group(s) that contain(s) the central positions n/2 and (n/2) + 1 in the cumulative frequency column. If both positions correspond to the same group, we directly obtain their corresponding value in the first column (median). If each position corresponds to a distinct group, the median will be the average between the corresponding values defined in the first column. Example 3.18 Table 3.E.26 shows the number of bedrooms in 70 real estate properties in a condominium located in the metropolitan area of Sao Paulo, and their respective absolute and cumulative frequencies. Calculate the median.

TABLE 3.E.26 Frequency Distribution

Number of Bedrooms   Fi   Fac
1                    6    6
2                    13   19
3                    20   39
4                    15   54
5                    7    61
6                    6    67
7                    3    70
Sum                  70


Solution Since n is an even number, the median will be the average of the values that occupy positions n/2 and (n/2) + 1, that is:

$$Md = \frac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2} = \frac{X_{35} + X_{36}}{2}$$

Based on Table 3.E.26, we can see that the third group contains all the elements between positions 20 and 39 (including 35 and 36), whose corresponding value is 3. Therefore, the median is:

$$Md = \frac{3 + 3}{2} = 3$$
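The lookup in the cumulative frequency column can be sketched in Python as follows. The helper names `value_at_position` and `grouped_discrete_median` are assumptions of this sketch; the cumulative frequencies are built with `itertools.accumulate`, and `bisect_left` finds the first group whose cumulative frequency reaches the requested position.

```python
from bisect import bisect_left
from itertools import accumulate


def value_at_position(values, freqs, position):
    """Value of the group that contains the given 1-based position."""
    cumulative = list(accumulate(freqs))  # the Fac column
    return values[bisect_left(cumulative, position)]


def grouped_discrete_median(values, freqs):
    """Median of grouped discrete data (values sorted in ascending order)."""
    n = sum(freqs)
    if n % 2 == 1:
        return value_at_position(values, freqs, (n + 1) // 2)
    low = value_at_position(values, freqs, n // 2)
    high = value_at_position(values, freqs, n // 2 + 1)
    return (low + high) / 2


# Table 3.E.26: number of bedrooms and absolute frequencies.
bedrooms = [1, 2, 3, 4, 5, 6, 7]
counts = [6, 13, 20, 15, 7, 6, 3]

median_bedrooms = grouped_discrete_median(bedrooms, counts)
print(median_bedrooms)
```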

3.4.1.1.2.3 Case 3: Median of Continuous Data Grouped into Classes

For continuous variables grouped into classes, in which the data are presented in a frequency distribution table, we apply the following steps to calculate the median:

Step 1: Calculate the position of the median, regardless of whether n is an even or an odd number, through the following expression:

$$Pos(Md) = \frac{n}{2} \tag{3.9}$$

Step 2: Identify the class that contains the median (median class) from the cumulative frequency column.

Step 3: Calculate the median using the following expression:

$$Md = LI_{Md} + \frac{\left(\frac{n}{2} - F_{ac(Md-1)}\right)}{F_{Md}} \times A_{Md} \tag{3.10}$$

where:
$LI_{Md}$ = lower limit of the median class;
$F_{Md}$ = absolute frequency of the median class;
$F_{ac(Md-1)}$ = cumulative frequency of the class immediately before the median class;
$A_{Md}$ = range of the median class;
n = total number of observations.
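The three steps can be sketched in Python as shown below. This is only an illustration under stated assumptions: the function name `class_median` and the `(lower limit, upper limit, absolute frequency)` tuple layout are hypothetical, and the data used are the salary classes already presented in Table 3.E.24.

```python
def class_median(classes):
    """Median of data grouped into classes, following Expression (3.10).

    `classes` is a list of (lower_limit, upper_limit, absolute_frequency),
    sorted in ascending order.
    """
    n = sum(freq for _, _, freq in classes)
    position = n / 2                      # Step 1: Pos(Md) = n / 2
    cumulative_before = 0
    for lower, upper, freq in classes:
        # Step 2: the median class is the first one whose cumulative
        # frequency reaches the position of the median.
        if cumulative_before + freq >= position:
            # Step 3: interpolate inside the median class.
            return lower + (position - cumulative_before) / freq * (upper - lower)
        cumulative_before += freq
    raise ValueError("empty frequency table")


# Salary classes (US$ 1000.00) from Table 3.E.24.
salary_classes = [(1, 3, 240), (3, 5, 480), (5, 7, 320),
                  (7, 9, 150), (9, 11, 130), (11, 13, 80)]

median_salary = class_median(salary_classes)
print(round(median_salary, 4))
```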

Example 3.19 Consider the data in Example 3.16 regarding the classes of salaries paid to the employees of a company and their respective absolute and cumulative frequencies (Table 3.E.27). Calculate the median.

TABLE 3.E.27 Classes of Salaries (US$ 1000.00) and Their Respective Absolute and Cumulative Frequencies

Classes    Fi     Fac
1 ├ 3      240    240
3 ├ 5      480    720
5 ├ 7      320    1040
7 ├ 9      150    1190
9 ├ 11     130    1320
11 ├ 13    80     1400
Sum        1400


Solution In the case of continuous data grouped into classes, let’s apply the following steps to calculate the median:

Step 1: First, we calculate the position of the median:

$$Pos(Md) = \frac{n}{2} = \frac{1400}{2} = 700$$

Step 2: Through the cumulative frequency column, we can see that the median is in the second class (3 ├ 5).

Step 3: Calculating the median:

$$Md = LI_{Md} + \frac{\left(\frac{n}{2} - F_{ac(Md-1)}\right)}{F_{Md}} \times A_{Md}$$

where: $LI_{Md}$ = 3, $F_{Md}$ = 480, $F_{ac(Md-1)}$ = 240, $A_{Md}$ = 2, n = 1400. Therefore, we have:

$$Md = 3 + \frac{(700 - 240)}{480} \times 2 = 4.9167 \quad (\text{US\$ } 4{,}916.67)$$

3.4.1.1.3 Mode

The mode (Mo) of a data series corresponds to the observation that occurs with the highest frequency. The mode is the only measure of position that can also be used for qualitative variables, since these variables only allow us to calculate frequencies. 3.4.1.1.3.1 Case 1: Mode of Ungrouped Data Consider a set of observations X1, X2, …, Xn of a certain variable. The mode is the value that appears with the highest frequency. Excel gives us the mode of a set of data through the MODE function. Example 3.20 The production of carrots in a certain company is divided into five phases, including the post-harvest handling phase. Table 3.E.28 shows the average time the processing (in seconds) takes in this phase for 20 observations. Calculate the mode.

TABLE 3.E.28 Processing Time in the Post-Harvest Handling Phase in Seconds

45.0  44.5  44.0  45.0  46.5  46.0  45.8  44.8  45.0  46.2
44.5  45.0  45.4  44.9  45.7  46.2  44.7  45.6  46.3  44.9

Solution The mode is 45.0, which is the most frequent value in the dataset (Table 3.E.28). This value could be determined directly in Excel by using the MODE function.
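The same result can be obtained by counting occurrences, as in the Python sketch below. The function name `modes` is an assumption; it returns a list because a dataset may have more than one value tied for the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


# Processing times (seconds) from Table 3.E.28.
times = [45.0, 44.5, 44.0, 45.0, 46.5, 46.0, 45.8, 44.8, 45.0, 46.2,
         44.5, 45.0, 45.4, 44.9, 45.7, 46.2, 44.7, 45.6, 46.3, 44.9]

print(modes(times))
```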

3.4.1.1.3.2 Case 2: Mode of Grouped Qualitative or Discrete Data

For qualitative or discrete quantitative data grouped in a frequency distribution table, the mode can be obtained directly from the table. It is the value with the highest absolute frequency.

Example 3.21 A TV station interviewed 500 viewers to analyze their preferences in terms of interest categories. The result of the survey can be seen in Table 3.E.29. Calculate the mode.


TABLE 3.E.29 Viewers’ Preferences in Terms of Interest Categories

Interest Categories   Fi
Movies                71
Soap Operas           46
News                  90
Comedy                98
Sports                120
Concerts              35
Variety               40
Sum                   500

Solution Based on Table 3.E.29, we can see that the mode corresponds to the category Sports (the highest absolute frequency). This illustrates that the mode is the only measure of position that can also be used for qualitative variables.

3.4.1.1.3.3 Case 3: Mode of Continuous Data Grouped into Classes

For continuous data grouped into classes, there are several procedures to calculate the mode, such as Czuber’s and King’s methods. Czuber’s method has the following steps:

Step 1: Identify the class that contains the mode (modal class), which is the one with the highest absolute frequency.

Step 2: Calculate the mode (Mo):

$$Mo = LI_{Mo} + \frac{F_{Mo} - F_{Mo-1}}{2 \cdot F_{Mo} - (F_{Mo-1} + F_{Mo+1})} \times A_{Mo} \tag{3.11}$$

where:
$LI_{Mo}$ = lower limit of the modal class;
$F_{Mo}$ = absolute frequency of the modal class;
$F_{Mo-1}$ = absolute frequency of the class immediately before the modal class;
$F_{Mo+1}$ = absolute frequency of the class immediately after the modal class;
$A_{Mo}$ = range of the modal class.

Example 3.22 A set of continuous data with 200 observations is grouped into classes with their respective absolute frequencies, as shown in Table 3.E.30. Determine the mode using Czuber’s method.

TABLE 3.E.30 Continuous Data Grouped into Classes and Their Respective Frequencies

Class      Fi
01 ├ 10    21
10 ├ 20    36
20 ├ 30    58
30 ├ 40    24
40 ├ 50    19
Sum        200


Solution Considering continuous data grouped into classes, we can use Czuber’s method to calculate the mode:

Step 1: Based on Table 3.E.30, we can see that the modal class is the third one (20 ├ 30), since it has the highest absolute frequency.

Step 2: Calculating the mode (Mo):

$$Mo = LI_{Mo} + \frac{F_{Mo} - F_{Mo-1}}{2 \cdot F_{Mo} - (F_{Mo-1} + F_{Mo+1})} \times A_{Mo}$$

where: $LI_{Mo}$ = 20, $F_{Mo}$ = 58, $F_{Mo-1}$ = 36, $F_{Mo+1}$ = 24, $A_{Mo}$ = 10. Therefore, we have:

$$Mo = 20 + \frac{58 - 36}{2 \times 58 - (36 + 24)} \times 10 = 23.9$$
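Expression (3.11) can be written as a small Python function, sketched below. The function name `czuber_mode` and its argument names are assumptions made for illustration; the inputs are the values identified for the modal class in Example 3.22.

```python
def czuber_mode(lower_limit, class_range, f_modal, f_previous, f_next):
    """Mode of data grouped into classes by Czuber's method (Expression 3.11)."""
    numerator = f_modal - f_previous
    denominator = 2 * f_modal - (f_previous + f_next)
    return lower_limit + numerator / denominator * class_range


# Modal class 20 ├ 30 of Table 3.E.30 and its neighboring classes.
mode_czuber = czuber_mode(lower_limit=20, class_range=10,
                          f_modal=58, f_previous=36, f_next=24)
print(round(mode_czuber, 1))
```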

On the other hand, King’s method consists of the following steps:

Step 1: Identify the modal class (the one with the highest absolute frequency).

Step 2: Calculate the mode (Mo) using the following expression:

$$Mo = LI_{Mo} + \frac{F_{Mo+1}}{F_{Mo-1} + F_{Mo+1}} \times A_{Mo} \tag{3.12}$$

where:
$LI_{Mo}$ = lower limit of the modal class;
$F_{Mo-1}$ = absolute frequency of the class immediately before the modal class;
$F_{Mo+1}$ = absolute frequency of the class immediately after the modal class;
$A_{Mo}$ = range of the modal class.

Example 3.23 Once again, consider the data from the previous example. Use King’s method to determine the mode.

Solution In Example 3.22, we saw that: $LI_{Mo}$ = 20, $F_{Mo+1}$ = 24, $F_{Mo-1}$ = 36, $A_{Mo}$ = 10. Applying Expression (3.12):

$$Mo = LI_{Mo} + \frac{F_{Mo+1}}{F_{Mo-1} + F_{Mo+1}} \times A_{Mo} = 20 + \frac{24}{36 + 24} \times 10 = 24$$
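For comparison with Czuber’s method, Expression (3.12) can be sketched in Python in the same way. The function name `king_mode` and its argument names are assumptions of this sketch. Unlike Czuber’s expression, King’s does not use the frequency of the modal class itself, only those of its two neighbors.

```python
def king_mode(lower_limit, class_range, f_previous, f_next):
    """Mode of data grouped into classes by King's method (Expression 3.12)."""
    return lower_limit + f_next / (f_previous + f_next) * class_range


# Same modal class (20 ├ 30) and neighboring frequencies as in Example 3.22.
mode_king = king_mode(lower_limit=20, class_range=10, f_previous=36, f_next=24)
print(round(mode_king, 1))
```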

3.4.1.2 Quantiles

According to Bussab and Morettin (2011), only the use of measures of central tendency may not be suitable to represent a set of data, since they are also impacted by extreme values. Moreover, with these measures alone, it is not possible for the researcher to have a clear idea of the data dispersion and symmetry. As an alternative, we can use quantiles, such as quartiles, deciles, and percentiles. The 2nd quartile (Q2), 5th decile (D5), or 50th percentile (P50) correspond to the median; therefore, they are measures of central tendency.

3.4.1.2.1 Quartiles

Quartiles ($Q_i$, i = 1, 2, 3) are measures of position that divide a set of data, sorted in ascending order, into four parts of equal size.

Min.    Q1    Md = Q2    Q3    Max.


Thus, the 1st quartile (Q1 or the 25th percentile) indicates that 25% of the data are less than Q1, or that 75% of the data are greater than Q1. The 2nd quartile (Q2, or the 5th decile, or the 50th percentile) corresponds to the median, indicating that 50% of the data are less than Q2 and 50% are greater than Q2. The 3rd quartile (Q3 or the 75th percentile) indicates that 75% of the data are less than Q3, or that 25% of the data are greater than Q3.

3.4.1.2.2 Deciles

Deciles ($D_i$, i = 1, 2, ..., 9) are measures of position that divide a set of data, sorted in ascending order, into 10 equal parts.

Min.    D1    D2    D3    D4    D5 (= Md)    D6    D7    D8    D9    Max.

Therefore, the 1st decile (D1 or 10th percentile) indicates that 10% of the data are less than D1 or that 90% of the data are greater than D1. The 2nd decile (D2 or 20th percentile) indicates that 20% of the data are less than D2 or that 80% of the data are greater than D2. And so on, until the 9th decile (D9 or 90th percentile), which indicates that 90% of the data are less than D9 or that 10% of the data are greater than D9.

3.4.1.2.3 Percentiles

Percentiles ($P_i$, i = 1, 2, ..., 99) are measures of position that divide a set of data, sorted in ascending order, into 100 equal parts. Hence, the 1st percentile (P1) indicates that 1% of the data is less than P1 or that 99% of the data are greater than P1. The 2nd percentile (P2) indicates that 2% of the data are less than P2 or that 98% of the data are greater than P2. And so on, until the 99th percentile (P99), which indicates that 99% of the data are less than P99 or that 1% of the data is greater than P99.

3.4.1.2.3.1 Case 1: Quartiles, Deciles, and Percentiles of Ungrouped Discrete and Continuous Data

If the position of the quartile, decile, or percentile we are interested in is an integer or is exactly between two positions, calculating the respective quartile, decile, or percentile becomes easier. However, this does not happen all the time (imagine a sample with 33 elements in which the objective is to calculate the 67th percentile). There are many methods proposed for this kind of calculation, which lead to close but not identical results. We will present a simple and generic method that can be applied to calculate any quartile, decile, or percentile of order i, considering ungrouped discrete and continuous data:

Step 1: Sort the observations in ascending order.

Step 2: Determine the position of the quartile, decile, or percentile of order i that we are interested in:

$$\text{Quartile} \rightarrow Pos(Q_i) = \left[\frac{n}{4} \times i\right] + \frac{1}{2}, \quad i = 1, 2, 3 \tag{3.13}$$

$$\text{Decile} \rightarrow Pos(D_i) = \left[\frac{n}{10} \times i\right] + \frac{1}{2}, \quad i = 1, 2, \ldots, 9 \tag{3.14}$$

$$\text{Percentile} \rightarrow Pos(P_i) = \left[\frac{n}{100} \times i\right] + \frac{1}{2}, \quad i = 1, 2, \ldots, 99 \tag{3.15}$$

Step 3: Calculate the value of the quartile, decile, or percentile that corresponds to the respective position. Assume that Pos(Q1) = 3.75, that is, the value of Q1 is between the 3rd and 4th positions (75% closer to the 4th position, and 25% to the 3rd position). Therefore, Q1 will be the sum of the value that corresponds to the 3rd position multiplied by 0.25 and the value that corresponds to the 4th position multiplied by 0.75.


Example 3.24 Consider the data in Example 3.20 regarding the average carrot processing time in the post-harvest handling phase, as specified in Table 3.E.28. Determine Q1 (1st quartile), Q3 (3rd quartile), D2 (2nd decile), and P64 (64th percentile).

Solution For ungrouped continuous data, we must apply the following steps to determine the quartiles, deciles, and percentiles we are interested in:

Step 1: Sort the observations in ascending order.

Position   1st    2nd    3rd    4th    5th    6th    7th    8th    9th    10th
Value      44.0   44.5   44.5   44.7   44.8   44.9   44.9   45.0   45.0   45.0

Position   11th   12th   13th   14th   15th   16th   17th   18th   19th   20th
Value      45.0   45.4   45.6   45.7   45.8   46.0   46.2   46.2   46.3   46.5

Step 2: Calculation of the positions of Q1, Q3, D2, and P64:

a) $Pos(Q_1) = \left[\frac{20}{4} \times 1\right] + \frac{1}{2} = 5.5$
b) $Pos(Q_3) = \left[\frac{20}{4} \times 3\right] + \frac{1}{2} = 15.5$
c) $Pos(D_2) = \left[\frac{20}{10} \times 2\right] + \frac{1}{2} = 4.5$
d) $Pos(P_{64}) = \left[\frac{20}{100} \times 64\right] + \frac{1}{2} = 13.3$

Step 3: Calculating Q1, Q3, D2, and P64:

a) Pos(Q1) = 5.5 means that its corresponding value is 50% near position 5 and 50% near position 6, that is, Q1 is simply the average of the values that correspond to both positions:

$$Q_1 = \frac{44.8 + 44.9}{2} = 44.85$$

b) Pos(Q3) = 15.5 means that the value we are interested in is between positions 15 and 16 (50% near the 15th position and 50% near the 16th position), so Q3 can be calculated as follows:

$$Q_3 = \frac{45.8 + 46.0}{2} = 45.9$$

c) Pos(D2) = 4.5 means that the value we are interested in is between positions 4 and 5, so D2 can be calculated as follows:

$$D_2 = \frac{44.7 + 44.8}{2} = 44.75$$

d) Pos(P64) = 13.3 means that the value we are interested in is 70% closer to position 13 and 30% closer to position 14, so P64 can be calculated as follows:

$$P_{64} = (0.70 \times 45.6) + (0.30 \times 45.7) = 45.63$$

Interpretation Q1 = 44.85 indicates that, in 25% of the observations (the first 5 observations listed in Step 1), the carrot processing time in the post-harvest handling phase is less than 44.85 seconds, or that in 75% of the observations (the remaining 15 observations), the processing time is greater than 44.85 seconds. Q3 = 45.9 indicates that, in 75% of the observations (15 of them), the processing time is less than 45.9 seconds, or that in 5 observations, the processing time is greater than 45.9 seconds. D2 = 44.75 indicates that, in 20% of the observations (4 of them), the processing time is less than 44.75 seconds, or that in 80% of the observations (16 of them), the processing time is greater than 44.75 seconds. P64 = 45.63 indicates that, in 64% of the observations (12.8 of them), the processing time is less than 45.63 seconds, or that in 36% of the observations (7.2 of them), the processing time is greater than 45.63 seconds.

Excel calculates the quartile of order i (i = 0, 1, 2, 3, 4) through the QUARTILE function. As arguments of the function, we must define the array or set of data for which we want to calculate the respective quartile (it does not need to be in ascending order), in addition to the quartile we are interested in (minimum value = 0; 1st quartile = 1; 2nd quartile = 2; 3rd quartile = 3; maximum value = 4). The k-th percentile (0 ≤ k ≤ 1) can also be calculated in Excel through the PERCENTILE function. As arguments of the function, we must define the array we are interested in, in addition to the value of k (for example, in the case of P64, k = 0.64).
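The position-based method of Expressions (3.13) to (3.15) can be sketched in Python as a single function. The name `quantile` and its `order` and `parts` parameters are assumptions of this sketch (`parts` is 4 for quartiles, 10 for deciles, and 100 for percentiles). Returning the smallest or largest observation when the position falls outside the range 1 to n is also an assumption, since the text does not cover that situation. As noted for Excel, other software may implement different algorithms and return slightly different values.

```python
import math


def quantile(values, order, parts):
    """Quantile of ungrouped data via Pos = (n / parts) * order + 1/2.

    parts = 4 (quartiles), 10 (deciles), or 100 (percentiles).
    """
    data = sorted(values)                  # Step 1: ascending order
    n = len(data)
    position = n / parts * order + 0.5     # Step 2: 1-based position
    lower = math.floor(position)
    fraction = position - lower
    # Assumption of this sketch: clamp positions outside [1, n].
    if lower < 1:
        return data[0]
    if lower >= n:
        return data[-1]
    # Step 3: weighted sum of the two neighboring observations.
    return (1 - fraction) * data[lower - 1] + fraction * data[lower]


# Processing times (seconds) from Table 3.E.28.
times = [45.0, 44.5, 44.0, 45.0, 46.5, 46.0, 45.8, 44.8, 45.0, 46.2,
         44.5, 45.0, 45.4, 44.9, 45.7, 46.2, 44.7, 45.6, 46.3, 44.9]

q1 = quantile(times, 1, 4)
q3 = quantile(times, 3, 4)
d2 = quantile(times, 2, 10)
p64 = quantile(times, 64, 100)
print(round(q1, 2), round(q3, 2), round(d2, 2), round(p64, 2))
```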


The calculation of quartiles, deciles, and percentiles using SPSS and Stata statistical software will be demonstrated in Sections 3.6 and 3.7, respectively. SPSS and Stata software use two methods to calculate quartiles, deciles, or percentiles. One of them is called Tukey’s Hinges and it is the method used in this book. The other method is related to the Weighted Average, whose calculations are more complex. Excel, on the other hand, implements another algorithm that gets similar results.

3.4.1.2.3.2 Case 2: Quartiles, Deciles, and Percentiles of Grouped Discrete Data

Here, the calculation of quartiles, deciles, and percentiles is similar to the previous case. However, the data are grouped in a frequency distribution table. In the frequency distribution table, the data must be sorted in ascending order, with their respective absolute and cumulative frequencies. First, we must determine the position of the quartile, decile, or percentile of order i we are interested in through Expressions (3.13), (3.14), and (3.15), respectively. From the cumulative frequency column, we must verify the group(s) that contain(s) this position. If the position is a whole number, its corresponding value is obtained directly in the first column. If the position is a fractional number, for example 2.5, and the 2nd and 3rd positions are in the same group, its respective value will also be obtained directly. On the other hand, if the position is a fractional number, for example 4.25, and positions 4 and 5 are in different groups, we must add the value that corresponds to the 4th position multiplied by 0.75 to the value that corresponds to the 5th position multiplied by 0.25 (similar to Case 1).

Example 3.25
Consider the data in Example 3.18 regarding the number of bedrooms in 70 real estate properties in a condominium located in the metropolitan area of Sao Paulo, and their respective absolute and cumulative frequencies (Table 3.E.26). Calculate Q1, D4, and P96.

Solution
Let’s calculate the positions of Q1, D4, and P96 through Expressions (3.13), (3.14), and (3.15), respectively, and their corresponding values:

a) $Pos(Q_1) = \frac{70}{4} \times 1 + \frac{1}{2} = 18$
Based on Table 3.E.26, we can see that position 18 is in the second group (2 bedrooms), so Q1 = 2.

b) $Pos(D_4) = \frac{70}{10} \times 4 + \frac{1}{2} = 28.5$
Through the cumulative frequency column, we can see that positions 28 and 29 are in the third group (3 bedrooms), so D4 = 3.

c) $Pos(P_{96}) = \frac{70}{100} \times 96 + \frac{1}{2} = 67.7$
that is, P96 is 70% closer to position 68 and 30% closer to position 67. Through the cumulative frequency column, we can see that position 68 is in the seventh group (7 bedrooms) and position 67 is in the sixth group (6 bedrooms), so P96 can be calculated as follows:

$$P_{96} = (0.70 \times 7) + (0.30 \times 6) = 6.7$$

Interpretation
Q1 = 2 indicates that 25% of the real estate properties have fewer than 2 bedrooms, or that 75% of the real estate properties have more than 2 bedrooms.
D4 = 3 indicates that 40% of the real estate properties have fewer than 3 bedrooms, or that 60% of the real estate properties have more than 3 bedrooms.
P96 = 6.7 indicates that 96% of the real estate properties have fewer than 6.7 bedrooms, or that 4% of the real estate properties have more than 6.7 bedrooms.

3.4.1.2.3.3 Case 3: Quartiles, Deciles, and Percentiles of Continuous Data Grouped into Classes

For continuous data grouped into classes in which data are represented in a frequency distribution table, we must apply the following steps to calculate the quartiles, deciles, and percentiles:

Step 1: Calculate the position of the quartile, decile, or percentile of order i we are interested in through the following expressions:

Quartile → $Pos(Q_i) = \frac{n}{4} \times i, \quad i = 1, 2, 3$ (3.16)

Decile → $Pos(D_i) = \frac{n}{10} \times i, \quad i = 1, 2, \ldots, 9$ (3.17)

Percentile → $Pos(P_i) = \frac{n}{100} \times i, \quad i = 1, 2, \ldots, 99$ (3.18)


Step 2: Identify the class that contains the quartile, decile, or percentile of order i we are interested in (quartile class, decile class, or percentile class) from the cumulative frequency column.

Step 3: Calculate the quartile, decile, or percentile of order i we are interested in through the following expressions:

Quartile → $Q_i = LL_{Q_i} + \left( \frac{Pos(Q_i) - F_{cum(Q_i - 1)}}{F_{Q_i}} \right) \times R_{Q_i}, \quad i = 1, 2, 3$ (3.19)

where:
$LL_{Q_i}$ = lower limit of the quartile class;
$F_{cum(Q_i - 1)}$ = cumulative frequency of the class before the quartile class;
$F_{Q_i}$ = absolute frequency of the quartile class;
$R_{Q_i}$ = range of the quartile class.

Decile → $D_i = LL_{D_i} + \left( \frac{Pos(D_i) - F_{cum(D_i - 1)}}{F_{D_i}} \right) \times R_{D_i}, \quad i = 1, 2, \ldots, 9$ (3.20)

where:
$LL_{D_i}$ = lower limit of the decile class;
$F_{cum(D_i - 1)}$ = cumulative frequency of the class before the decile class;
$F_{D_i}$ = absolute frequency of the decile class;
$R_{D_i}$ = range of the decile class.

Percentile → $P_i = LL_{P_i} + \left( \frac{Pos(P_i) - F_{cum(P_i - 1)}}{F_{P_i}} \right) \times R_{P_i}, \quad i = 1, 2, \ldots, 99$ (3.21)

where:
$LL_{P_i}$ = lower limit of the percentile class;
$F_{cum(P_i - 1)}$ = cumulative frequency of the class before the percentile class;
$F_{P_i}$ = absolute frequency of the percentile class;
$R_{P_i}$ = range of the percentile class.

Example 3.26 A survey on the health conditions of 250 patients collected information about their weight. The data are grouped into classes, as shown in Table 3.E.31. Calculate the first quartile, the seventh decile, and the 60th percentile.

TABLE 3.E.31 Absolute and Cumulative Frequency Distribution of Patients’ Weight Grouped into Classes

Class (kg)     Fi     Fac
50 ├ 60        18      18
60 ├ 70        28      46
70 ├ 80        49      95
80 ├ 90        66     161
90 ├ 100       40     201
100 ├ 110      33     234
110 ├ 120      16     250
Sum           250


Solution
Let’s apply the three steps to calculate Q1, D7, and P60:

Step 1: Let’s calculate the position of the first quartile, the seventh decile, and the 60th percentile through Expressions (3.16), (3.17), and (3.18), respectively:

1st Quartile → $Pos(Q_1) = \frac{250}{4} \times 1 = 62.5$
7th Decile → $Pos(D_7) = \frac{250}{10} \times 7 = 175$
60th Percentile → $Pos(P_{60}) = \frac{250}{100} \times 60 = 150$

Step 2: Let’s identify the class that contains Q1, D7, and P60 from the cumulative frequency column in Table 3.E.31:

Q1 is in the 3rd class (70 ├ 80)
D7 is in the 5th class (90 ├ 100)
P60 is in the 4th class (80 ├ 90)

Step 3: Let’s calculate Q1, D7, and P60 from Expressions (3.19), (3.20), and (3.21), respectively:

$$Q_1 = LL_{Q_1} + \left( \frac{Pos(Q_1) - F_{cum(Q_1 - 1)}}{F_{Q_1}} \right) \times R_{Q_1} = 70 + \left( \frac{62.5 - 46}{49} \right) \times 10 = 73.37$$

$$D_7 = LL_{D_7} + \left( \frac{Pos(D_7) - F_{cum(D_7 - 1)}}{F_{D_7}} \right) \times R_{D_7} = 90 + \left( \frac{175 - 161}{40} \right) \times 10 = 93.5$$

$$P_{60} = LL_{P_{60}} + \left( \frac{Pos(P_{60}) - F_{cum(P_{60} - 1)}}{F_{P_{60}}} \right) \times R_{P_{60}} = 80 + \left( \frac{150 - 95}{66} \right) \times 10 = 88.33$$

Interpretation
Q1 = 73.37 indicates that 25% of the patients weigh less than 73.37 kg, or that 75% of the patients weigh more than 73.37 kg.
D7 = 93.5 indicates that 70% of the patients weigh less than 93.5 kg, or that 30% of the patients weigh more than 93.5 kg.
P60 = 88.33 indicates that 60% of the patients weigh less than 88.33 kg, or that 40% of the patients weigh more than 88.33 kg.
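The three steps for data grouped into classes can be sketched as a single Python function. The name `grouped_quantile` and its argument layout are hypothetical; the arithmetic follows Expressions (3.16) to (3.21) and is applied to the class frequencies of Table 3.E.31.

```python
def grouped_quantile(lower_limits, widths, freqs, fraction):
    """Quantile of continuous data grouped into classes.

    fraction is i/4, i/10 or i/100 for quartiles, deciles or percentiles.
    Returns LL + ((Pos - Fcum_previous) / F) * R for the class holding Pos.
    """
    n = sum(freqs)
    pos = n * fraction                  # Step 1: position
    cumulative = 0                      # cumulative frequency of previous classes
    for lower, width, f in zip(lower_limits, widths, freqs):
        # Step 2: first class whose cumulative frequency reaches Pos
        if f > 0 and pos <= cumulative + f:
            # Step 3: linear interpolation inside that class
            return lower + (pos - cumulative) / f * width
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")

# Patients' weight classes (kg) and absolute frequencies from Table 3.E.31
lower_limits = [50, 60, 70, 80, 90, 100, 110]
widths = [10] * 7
freqs = [18, 28, 49, 66, 40, 33, 16]

q1 = grouped_quantile(lower_limits, widths, freqs, 1 / 4)
d7 = grouped_quantile(lower_limits, widths, freqs, 7 / 10)
p60 = grouped_quantile(lower_limits, widths, freqs, 60 / 100)
print(round(q1, 2), round(d7, 2), round(p60, 2))
```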

3.4.1.3 Identifying the Existence of Univariate Outliers

A dataset can contain observations that are extremely distant from most observations or that are inconsistent. These observations are called outliers or atypical, discrepant, abnormal, or extreme values. Before deciding what will be done with the outliers, we must know the causes that lead to such an occurrence. In many cases, these causes can determine the most suitable treatment for the respective outliers. The main causes are measurement mistakes, execution/implementation mistakes, and variability inherent to the population.

There are many outlier identification methods: boxplots, discordance models, Dixon’s test, Grubbs’ test, Z-scores, among others. In the Appendix of Chapter 11 (Cluster Analysis), a very efficient method for detecting multivariate outliers will be presented (BACON algorithm: Blocked Adaptive Computationally Efficient Outlier Nominators).

The existence of outliers through boxplots (the construction of boxplots was studied in Section 3.3.2.5) is identified from the IQR (interquartile range), which corresponds to the difference between the third and first quartiles:

$$IQR = Q_3 - Q_1 \qquad (3.22)$$

Note that the IQR is the length of the box. Any value located more than 1.5 × IQR below Q1 or above Q3 is considered a mild outlier and is represented by a circle. Such values may even be accepted in the population, but with some suspicion. Thus, the value X° of a variable is considered a mild outlier when:

$$X^{\circ} < Q_1 - 1.5 \times IQR \qquad (3.23)$$

$$X^{\circ} > Q_3 + 1.5 \times IQR \qquad (3.24)$$


FIG. 3.15 Boxplot with the identification of outliers.

Likewise, any value located more than 3 × IQR below Q1 or above Q3 is considered an extreme outlier and is represented by an asterisk. Thus, the value X* of a variable is considered an extreme outlier when:

$$X^{*} < Q_1 - 3 \times IQR \qquad (3.25)$$

$$X^{*} > Q_3 + 3 \times IQR \qquad (3.26)$$

Fig. 3.15 illustrates the boxplot with the identification of outliers.

Example 3.27
Consider the sorted data in Example 3.24 regarding the average carrot processing time in the post-harvest handling phase:

44.0  44.5  44.5  44.7  44.8  44.9  44.9  45.0  45.0  45.0
45.0  45.4  45.6  45.7  45.8  46.0  46.2  46.2  46.3  46.5

where Q1 = 44.85, Q2 = 45, Q3 = 45.9, mean = 45.3, and mode = 45. Check whether there are mild and extreme outliers.

Solution
To verify if there is a possible outlier, we must calculate:

$$Q_1 - 1.5 \times (Q_3 - Q_1) = 44.85 - 1.5 \times (45.9 - 44.85) = 43.275$$

$$Q_3 + 1.5 \times (Q_3 - Q_1) = 45.9 + 1.5 \times (45.9 - 44.85) = 47.475$$

Since there is no value in the distribution outside this interval, we conclude that there are no mild outliers. Because the limits for extreme outliers are wider than those for mild outliers, it is not necessary to calculate the interval for extreme outliers.

If only one outlier is identified in a certain variable, the researcher can treat it through an existing procedure, for example, the complete elimination of this observation. On the other hand, if there is more than one outlier for one or more variables individually, the elimination of all these observations can reduce the sample size significantly. To avoid this problem, it is very common for observations considered outliers for a certain variable to have their atypical values replaced by the mean of the variable, thus excluding the outliers (Fávero et al., 2009). The authors mention other procedures for dealing with outliers, such as replacing them with values from a regression, or winsorization, which, in an organized way, eliminates an equal number of observations from each side of the distribution.

Fávero et al. (2009) also highlight the importance of dealing with outliers when the researcher is interested in investigating the behavior of a certain variable without the influence of observations with atypical values. On the other hand, if the main goal is to analyze the behavior of these atypical observations or to define subgroups through discrepancy criteria, eliminating these observations or substituting their values may not be the best solution.
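The boxplot rule of Expressions (3.22) to (3.26) can be sketched in Python as follows. The function names are hypothetical, and one design choice here is an assumption: values beyond the 3 × IQR limits are reported only as extreme outliers, so the mild list holds values between the two sets of limits.

```python
def iqr_fences(q1, q3):
    """Return the (lower, upper) limits for mild and extreme outliers."""
    iqr = q3 - q1
    return {
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def classify_outliers(data, q1, q3):
    """Split the observations into mild and extreme outliers."""
    fences = iqr_fences(q1, q3)
    mild_low, mild_high = fences["mild"]
    ext_low, ext_high = fences["extreme"]
    extreme = [x for x in data if x < ext_low or x > ext_high]
    mild = [x for x in data
            if (x < mild_low or x > mild_high) and not (x < ext_low or x > ext_high)]
    return mild, extreme

# Carrot processing times (seconds) and quartiles from Example 3.27
times = [44.0, 44.5, 44.5, 44.7, 44.8, 44.9, 44.9, 45.0, 45.0, 45.0,
         45.0, 45.4, 45.6, 45.7, 45.8, 46.0, 46.2, 46.2, 46.3, 46.5]
fences = iqr_fences(44.85, 45.9)
mild, extreme = classify_outliers(times, 44.85, 45.9)
print(fences["mild"], mild, extreme)
```

For these data both lists come back empty, in line with the conclusion of the example.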

3.4.2 Measures of Dispersion or Variability

To study the behavior of a set of data, we use measures of central tendency, measures of dispersion, in addition to the nature or shape of the data distribution. Measures of central tendency determine a value that represents the set of data. In order to characterize the dispersion or variability of the data, measures of dispersion are necessary. The most common measures of dispersion are the range, average deviation, variance, standard deviation, standard error, and the coefficient of variation (CV).

3.4.2.1 Range

The simplest measure of variability is the total range, or simply range (R), which represents the difference between the highest and lowest values of the set of data:

$$R = X_{max} - X_{min} \qquad (3.27)$$

3.4.2.2 Average Deviation

Deviation is the difference between each observed value and the mean of the variable. Thus, for population data, it is represented by $(X_i - \mu)$, and for sample data, by $(X_i - \bar{X})$. The modulus, or absolute deviation, ignores the sign and is denoted by $|X_i - \mu|$ or $|X_i - \bar{X}|$. The average deviation, or absolute average deviation, represents the arithmetic mean of the absolute deviations.

3.4.2.2.1 Case 1: Average Deviation of Ungrouped Discrete and Continuous Data

The average deviation (D) is the sum of the absolute deviations of all observations divided by the population size (N) or the sample size (n):

$$D = \frac{\sum_{i=1}^{N} |X_i - \mu|}{N} \quad \text{(for the population)} \qquad (3.28)$$

$$D = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} \quad \text{(for samples)} \qquad (3.29)$$

Example 3.28
Table 3.E.32 shows the distances traveled (in km) by a vehicle in order to deliver 10 packages throughout the day. Calculate the average deviation.

TABLE 3.E.32 Distances Traveled (km)

12.4  22.6  18.9  9.7  14.5  22.5  26.3  17.7  31.2  20.4

Solution
For the data in Table 3.E.32, we have $\bar{X} = 19.62$. Applying Expression (3.29), we get the average deviation:

$$D = \frac{|12.4 - 19.62| + |22.6 - 19.62| + \cdots + |20.4 - 19.62|}{10} = 4.98$$

The average deviation can be directly calculated in Excel using the AVEDEV function.
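As a cross-check of the calculation above, Expression (3.29) can be written in a few lines of Python. The function name `average_deviation` is illustrative only.

```python
def average_deviation(data):
    """Mean of the absolute deviations from the arithmetic mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n

# Distances traveled (km) from Table 3.E.32
distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]
mean_distance = sum(distances) / len(distances)
d = average_deviation(distances)
print(round(mean_distance, 2), round(d, 2))
```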

3.4.2.2.2 Case 2: Average Deviation of Grouped Discrete Data

For grouped data, presented in a frequency distribution table for m groups, the calculation of the average deviation is:

$$D = \frac{\sum_{i=1}^{m} |X_i - \mu| \cdot F_i}{N} \quad \text{(for the population)} \qquad (3.30)$$

$$D = \frac{\sum_{i=1}^{m} |X_i - \bar{X}| \cdot F_i}{n} \quad \text{(for samples)} \qquad (3.31)$$

bearing in mind that $\bar{X} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$.

Example 3.29 Table 3.E.33 shows the number of goals scored by the D.C. soccer team in their last 30 games, with their respective absolute frequencies. Calculate the average deviation.

TABLE 3.E.33 Frequency Distribution of Example 3.29

Number of Goals     Fi
0                    5
1                    8
2                    6
3                    4
4                    4
5                    2
6                    1
Sum                 30

Solution
The mean is $\bar{X} = \frac{0 \times 5 + 1 \times 8 + \cdots + 6 \times 1}{30} = 2.133$. The average deviation can be determined from the calculations presented in Table 3.E.34:

TABLE 3.E.34 Calculations of the Average Deviation for Example 3.29

Number of Goals     Fi     |Xi − X̄|     |Xi − X̄|·Fi
0                    5      2.133        10.667
1                    8      1.133         9.067
2                    6      0.133         0.800
3                    4      0.867         3.467
4                    4      1.867         7.467
5                    2      2.867         5.733
6                    1      3.867         3.867
Sum                 30                   41.067

Therefore, $D = \frac{\sum_{i=1}^{m} |X_i - \bar{X}| \cdot F_i}{n} = \frac{41.067}{30} = 1.369$.
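The frequency-weighted calculation in Table 3.E.34 can be sketched in Python. The function name `grouped_average_deviation` is hypothetical; it follows Expression (3.31) and could equally be fed class midpoints in place of discrete values.

```python
def grouped_average_deviation(values, freqs):
    """Average deviation for data given as (value, absolute frequency) pairs."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    return sum(abs(x - mean) * f for x, f in zip(values, freqs)) / n

# Goals scored and their absolute frequencies from Table 3.E.33
goals = [0, 1, 2, 3, 4, 5, 6]
games = [5, 8, 6, 4, 4, 2, 1]
d_goals = grouped_average_deviation(goals, games)
print(round(d_goals, 3))
```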


3.4.2.2.3 Case 3: Average Deviation of Continuous Data Grouped into Classes

For continuous data grouped into classes, the calculation of the average deviation is:

$$D = \frac{\sum_{i=1}^{k} |X_i - \mu| \cdot F_i}{N} \quad \text{(for the population)} \qquad (3.32)$$

$$D = \frac{\sum_{i=1}^{k} |X_i - \bar{X}| \cdot F_i}{n} \quad \text{(for samples)} \qquad (3.33)$$

Note that Expressions (3.32) and (3.33) are similar to Expressions (3.30) and (3.31), respectively, except that, instead of m groups, we consider k classes. Moreover, $X_i$ represents the middle or central point of each class i, where $\bar{X} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{n}$, as presented in Expression (3.6).

Example 3.30
In order to determine its variation due to genetic factors, a survey with 100 newborn babies collected information about their weight. Table 3.E.35 shows the data grouped into classes and their respective absolute frequencies. Calculate the average deviation.

TABLE 3.E.35 Newborn Babies’ Weight (in kg) Grouped into Classes

Class          Fi
2.0 ├ 2.5      10
2.5 ├ 3.0      24
3.0 ├ 3.5      31
3.5 ├ 4.0      22
4.0 ├ 4.5      13
Sum           100

Solution
First, we must calculate $\bar{X}$:

$$\bar{X} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{n} = \frac{2.25 \times 10 + 2.75 \times 24 + 3.25 \times 31 + 3.75 \times 22 + 4.25 \times 13}{100} = 3.270$$

The average deviation can be determined from the calculations presented in Table 3.E.36:

TABLE 3.E.36 Calculations of the Average Deviation for Example 3.30

Class          Fi      Xi      |Xi − X̄|     |Xi − X̄|·Fi
2.0 ├ 2.5      10      2.25    1.02         10.20
2.5 ├ 3.0      24      2.75    0.52         12.48
3.0 ├ 3.5      31      3.25    0.02          0.62
3.5 ├ 4.0      22      3.75    0.48         10.56
4.0 ├ 4.5      13      4.25    0.98         12.74
Sum           100                           46.6

Therefore, $D = \frac{\sum_{i=1}^{k} |X_i - \bar{X}| \cdot F_i}{n} = \frac{46.6}{100} = 0.466$.


3.4.2.3 Variance Variance is a measure of dispersion or variability that evaluates how much the data are dispersed in relation to the arithmetic mean. Thus, the higher the variance, the higher the data dispersion.

3.4.2.3.1 Case 1: Variance of Ungrouped Discrete and Continuous Data

Instead of considering the mean of absolute deviations, as discussed in the previous section, it is more common to calculate the mean of squared deviations. This measure is known as variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - \frac{\left( \sum_{i=1}^{N} X_i \right)^2}{N}}{N} \quad \text{(for the population)} \qquad (3.34)$$

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - \frac{\left( \sum_{i=1}^{n} X_i \right)^2}{n}}{n - 1} \quad \text{(for samples)} \qquad (3.35)$$

The relationship between the sample variance ($S^2$) and the population variance ($\sigma^2$) is given by:

$$S^2 = \frac{N}{n - 1} \cdot \sigma^2 \qquad (3.36)$$

Example 3.31
Consider the data in Example 3.28 regarding the distances traveled (in km) by a vehicle in order to deliver 10 packages throughout the day. Calculate the variance.

Solution
We saw in Example 3.28 that $\bar{X} = 19.62$. Applying Expression (3.35), we have:

$$S^2 = \frac{(12.4 - 19.62)^2 + (22.6 - 19.62)^2 + \cdots + (20.4 - 19.62)^2}{9} = 41.94$$

The sample variance can be directly calculated in Excel using the VAR.S function. To calculate the population variance, we must use the VAR.P function.
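For readers working outside Excel, the same figures can be reproduced with Python's standard `statistics` module, whose `variance` and `pvariance` functions divide by n − 1 and by n, respectively. The sketch below also checks the shortcut form of Expression (3.35) and the relationship in Expression (3.36), taking N = n because both variances are computed on the same 10 observations.

```python
import statistics

# Distances traveled (km) from Table 3.E.32
distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]
n = len(distances)

sample_var = statistics.variance(distances)       # divides by n - 1
population_var = statistics.pvariance(distances)  # divides by n

# Shortcut form: (sum of squares - (sum)^2 / n) / (n - 1)
shortcut_var = (sum(x ** 2 for x in distances) - sum(distances) ** 2 / n) / (n - 1)

print(round(sample_var, 2), round(population_var, 2), round(shortcut_var, 2))
```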

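3.4.2.3.2 Case 2: Variance of Grouped Discrete Data

For grouped data, represented in a frequency distribution table by m groups, the variance can be calculated as follows:

$$\sigma^2 = \frac{\sum_{i=1}^{m} (X_i - \mu)^2 \cdot F_i}{N} = \frac{\sum_{i=1}^{m} X_i^2 \cdot F_i - \frac{\left( \sum_{i=1}^{m} X_i \cdot F_i \right)^2}{N}}{N} \quad \text{(for the population)} \qquad (3.37)$$

$$S^2 = \frac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \frac{\sum_{i=1}^{m} X_i^2 \cdot F_i - \frac{\left( \sum_{i=1}^{m} X_i \cdot F_i \right)^2}{n}}{n - 1} \quad \text{(for samples)} \qquad (3.38)$$

where $\bar{X} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{n}$.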


Example 3.32
Consider the data in Example 3.29 regarding the number of goals scored by the D.C. soccer team in the last 30 games, with their respective absolute frequencies. Calculate the variance.

Solution
As calculated in Example 3.29, the mean is $\bar{X} = 2.133$. The variance can be determined from the calculations presented in Table 3.E.37:

TABLE 3.E.37 Calculations of the Variance

Number of Goals     Fi     (Xi − X̄)²     (Xi − X̄)²·Fi
0                    5      4.551         22.756
1                    8      1.284         10.276
2                    6      0.018          0.107
3                    4      0.751          3.004
4                    4      3.484         13.938
5                    2      8.218         16.436
6                    1     14.951         14.951
Sum                 30                    81.467

Therefore, $S^2 = \frac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \frac{81.467}{29} = 2.809$.

3.4.2.3.3 Case 3: Variance of Continuous Data Grouped into Classes

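For continuous data grouped into classes, we calculate the variance as follows:

$$\sigma^2 = \frac{\sum_{i=1}^{k} (X_i - \mu)^2 \cdot F_i}{N} = \frac{\sum_{i=1}^{k} X_i^2 \cdot F_i - \frac{\left( \sum_{i=1}^{k} X_i \cdot F_i \right)^2}{N}}{N} \quad \text{(for the population)} \qquad (3.39)$$

$$S^2 = \frac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \frac{\sum_{i=1}^{k} X_i^2 \cdot F_i - \frac{\left( \sum_{i=1}^{k} X_i \cdot F_i \right)^2}{n}}{n - 1} \quad \text{(for samples)} \qquad (3.40)$$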

Note that Expressions (3.39) and (3.40) are similar to Expressions (3.37) and (3.38), respectively, except that we consider k classes instead of m groups.

Example 3.33
Consider the data in Example 3.30 regarding the weight of newborn babies grouped into classes, with their respective absolute frequencies. Calculate the variance.

Solution
As calculated in Example 3.30, we have $\bar{X} = 3.270$.


The variance can be determined from the calculations presented in Table 3.E.38:

TABLE 3.E.38 Calculations of the Variance for Example 3.33

Class          Fi      Xi      (Xi − X̄)²     (Xi − X̄)²·Fi
2.0 ├ 2.5      10      2.25    1.0404        10.404
2.5 ├ 3.0      24      2.75    0.2704         6.4896
3.0 ├ 3.5      31      3.25    0.0004         0.0124
3.5 ├ 4.0      22      3.75    0.2304         5.0688
4.0 ├ 4.5      13      4.25    0.9604        12.4852
Sum           100                            34.46

Therefore, $S^2 = \frac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n - 1} = \frac{34.46}{99} = 0.348$.
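Expressions (3.38) and (3.40) differ only in whether the X values are discrete values or class midpoints, so one function can cover both. The sketch below uses a hypothetical name, `grouped_variance`, and reproduces the results of Examples 3.32 and 3.33.

```python
def grouped_variance(values, freqs, sample=True):
    """Variance for data given as values (or class midpoints) with frequencies.

    sample=True divides by n - 1; sample=False divides by n (population form).
    """
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    squared = sum((x - mean) ** 2 * f for x, f in zip(values, freqs))
    return squared / (n - 1 if sample else n)

# Example 3.32: goals scored and absolute frequencies
var_goals = grouped_variance([0, 1, 2, 3, 4, 5, 6], [5, 8, 6, 4, 4, 2, 1])

# Example 3.33: class midpoints of newborn weights (kg) and absolute frequencies
var_weights = grouped_variance([2.25, 2.75, 3.25, 3.75, 4.25], [10, 24, 31, 22, 13])

print(round(var_goals, 3), round(var_weights, 3))
```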

3.4.2.4 Standard Deviation

Since the variance is based on the mean of squared deviations, it is expressed in squared units, so its value tends to be large and difficult to interpret. To solve this problem, we calculate the square root of the variance. This measure is known as the standard deviation. It is calculated as follows:

$$\sigma = \sqrt{\sigma^2} \quad \text{(for the population)} \qquad (3.41)$$

$$S = \sqrt{S^2} \quad \text{(for samples)} \qquad (3.42)$$

Example 3.34
Once again, consider the data in Examples 3.28 and 3.31 regarding the distances traveled (in km) by the vehicle. Calculate the standard deviation.

Solution
We have $\bar{X} = 19.62$. The standard deviation is the square root of the variance, which has already been calculated in Example 3.31:

$$S = \sqrt{\frac{(12.4 - 19.62)^2 + (22.6 - 19.62)^2 + \cdots + (20.4 - 19.62)^2}{9}} = \sqrt{41.94} = 6.476$$

The standard deviation of a sample can be directly calculated in Excel using the STDEV.S function. To calculate the standard deviation of the population, we use the STDEV.P function.

Example 3.35
Consider the data in Examples 3.29 and 3.32 regarding the number of goals scored by the D.C. soccer team in the last 30 games, with their respective absolute frequencies. Calculate the standard deviation.

Solution
The mean is $\bar{X} = 2.133$. The standard deviation is the square root of the variance, so it can be determined from the variance already calculated in Example 3.32, as demonstrated in Table 3.E.37:

Therefore, $S = \sqrt{\frac{\sum_{i=1}^{m} (X_i - \bar{X})^2 \cdot F_i}{n - 1}} = \sqrt{\frac{81.467}{29}} = \sqrt{2.809} = 1.676$.


Example 3.36
Consider the data in Examples 3.30 and 3.33 regarding the weight of newborn babies grouped into classes, with their respective absolute frequencies. Calculate the standard deviation.

Solution
We have $\bar{X} = 3.270$. The standard deviation is the square root of the variance, so it can be determined from the variance already calculated in Example 3.33, as demonstrated in Table 3.E.38:

Therefore, $S = \sqrt{\frac{\sum_{i=1}^{k} (X_i - \bar{X})^2 \cdot F_i}{n - 1}} = \sqrt{\frac{34.46}{99}} = \sqrt{0.348} = 0.59$.

3.4.2.5 Standard Error

The standard error is the standard deviation of the mean. It is obtained by dividing the standard deviation by the square root of the population or sample size:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}} \quad \text{(for the population)} \qquad (3.43)$$

$$S_{\bar{X}} = \frac{S}{\sqrt{n}} \quad \text{(for samples)} \qquad (3.44)$$

The higher the number of measurements, the better the determination of the average value will be (higher accuracy), because random errors tend to compensate for one another.

Example 3.37
One of the phases in the preparation of concrete is mixing it in a concrete mixer. Tables 3.E.39 and 3.E.40 show the concrete mixing times (in seconds), considering a sample with 10 and 30 elements, respectively. Calculate the standard error for both cases and interpret the results.

TABLE 3.E.39 Concrete Mixing Time for a Sample With 10 Elements

124  111  132  142  108  127  133  144  148  105

TABLE 3.E.40 Concrete Mixing Time for a Sample With 30 Elements

125  102  135  126  132  129  156  112  108  134
126  104  143  140  138  129  119  114  107  121
124  112  148  145  130  125  120  127  106  148

Solution
First, let’s calculate the standard deviation for both samples:

$$S_1 = \sqrt{\frac{(124 - 127.4)^2 + (111 - 127.4)^2 + \cdots + (105 - 127.4)^2}{9}} = 15.364$$

$$S_2 = \sqrt{\frac{(125 - 126.167)^2 + (102 - 126.167)^2 + \cdots + (148 - 126.167)^2}{29}} = 14.227$$

To calculate the standard error, we must apply Expression (3.44):

$$S_{\bar{X}_1} = \frac{S_1}{\sqrt{n_1}} = \frac{15.364}{\sqrt{10}} = 4.858$$

$$S_{\bar{X}_2} = \frac{S_2}{\sqrt{n_2}} = \frac{14.227}{\sqrt{30}} = 2.598$$

Despite the small difference between the two standard deviations, the standard error of the first sample is almost double that of the second sample. Therefore, the higher the number of measurements, the higher the accuracy.
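The standard errors above can be reproduced with Python's standard library, as a sketch of Expression (3.44). The helper name `standard_error` is illustrative; `statistics.stdev` returns the sample standard deviation (divisor n − 1).

```python
import math
import statistics

def standard_error(sample):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Concrete mixing times (seconds) from Tables 3.E.39 and 3.E.40
sample_10 = [124, 111, 132, 142, 108, 127, 133, 144, 148, 105]
sample_30 = [125, 102, 135, 126, 132, 129, 156, 112, 108, 134,
             126, 104, 143, 140, 138, 129, 119, 114, 107, 121,
             124, 112, 148, 145, 130, 125, 120, 127, 106, 148]

se_10 = standard_error(sample_10)
se_30 = standard_error(sample_30)
print(round(se_10, 3), round(se_30, 3))
```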

3.4.2.6 Coefficient of Variation

The coefficient of variation (CV) is a relative measure of dispersion that provides the variation of the data in relation to the mean. The smaller the value, the more homogeneous the data will be, that is, the smaller the dispersion around the mean will be. It can be calculated as follows:

$$CV = \frac{\sigma}{\mu} \times 100 \ (\%) \quad \text{(for the population)} \qquad (3.45)$$

$$CV = \frac{S}{\bar{X}} \times 100 \ (\%) \quad \text{(for samples)} \qquad (3.46)$$

A CV can be considered low, indicating a set of data that is reasonably homogeneous, when it is less than 30%. If this value is greater than 30%, the set of data can be considered heterogeneous. However, this standard varies according to the application.

Example 3.38
Calculate the coefficient of variation for both samples of the previous example.

Solution
Applying Expression (3.46), we have:

$$CV_1 = \frac{S_1}{\bar{X}_1} \times 100 = \frac{15.364}{127.4} \times 100 = 12.06\%$$

$$CV_2 = \frac{S_2}{\bar{X}_2} \times 100 = \frac{14.227}{126.167} \times 100 = 11.28\%$$

These results confirm the homogeneity of the data of the variable being studied for both samples. We conclude, therefore, that the mean is a good measure to represent the data. Let’s now study the measures of skewness and kurtosis.
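A short Python sketch of Expression (3.46), applied to the two concrete mixing samples, is given below. The function names are hypothetical, and the 30% cut-off is the rule of thumb quoted in the text, which varies according to the application.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

def is_homogeneous(cv_percent, threshold=30.0):
    """Rule of thumb from the text: a CV below 30% suggests homogeneous data."""
    return cv_percent < threshold

# Concrete mixing times (seconds) from Tables 3.E.39 and 3.E.40
sample_10 = [124, 111, 132, 142, 108, 127, 133, 144, 148, 105]
sample_30 = [125, 102, 135, 126, 132, 129, 156, 112, 108, 134,
             126, 104, 143, 140, 138, 129, 119, 114, 107, 121,
             124, 112, 148, 145, 130, 125, 120, 127, 106, 148]

cv_10 = coefficient_of_variation(sample_10)
cv_30 = coefficient_of_variation(sample_30)
print(round(cv_10, 2), round(cv_30, 2), is_homogeneous(cv_10), is_homogeneous(cv_30))
```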

3.4.3 Measures of Shape

Measures of asymmetry (skewness) and kurtosis characterize the shape of the distribution of the population elements sampled around the mean (Maroco, 2014).

3.4.3.1 Measures of Skewness

Measures of skewness describe the shape of a frequency distribution curve. For a symmetrical curve or frequency distribution, the mean, the mode, and the median are the same. For an asymmetrical curve, the mean gets farther away from the mode, and the median is located in an intermediary position. Fig. 3.16 shows a symmetrical distribution.

On the other hand, if the frequency distribution is more concentrated on the left side, that is, the tail on the right is longer than the tail on the left, we will have a positively skewed distribution (skewed to the right), as shown in Fig. 3.17. In this case, the mean is greater than the median, and the latter is greater than the mode ($Mo < Md < \bar{X}$).

Conversely, if the frequency distribution is more concentrated on the right side, that is, the tail on the left is longer than the tail on the right, we will have a negatively skewed distribution (skewed to the left), as shown in Fig. 3.18. In this case, the mean is less than the median, and the latter is less than the mode ($\bar{X} < Md < Mo$).


FIG. 3.16 Symmetrical distribution.

FIG. 3.17 Skewness to the right or positive skewness.

FIG. 3.18 Skewness to the left or negative skewness.

3.4.3.1.1 Pearson’s First Coefficient of Skewness

Pearson’s first coefficient of skewness (Sk1) is a measure of skewness given by the difference between the mean and the mode, weighted by a measure of dispersion (the standard deviation):

$$Sk_1 = \frac{\mu - Mo}{\sigma} \quad \text{(for the population)} \qquad (3.47)$$

$$Sk_1 = \frac{\bar{X} - Mo}{S} \quad \text{(for samples)} \qquad (3.48)$$

which has the following interpretation:
If Sk1 = 0, the distribution is symmetrical;
If Sk1 > 0, the distribution is positively skewed (to the right);
If Sk1 < 0, the distribution is negatively skewed (to the left).

Example 3.39
From one set of data, we obtained the following measures: $\bar{X} = 34.7$, Mo = 31.5, Md = 33.2, and S = 12.4. Determine the type of skewness and calculate Pearson’s first coefficient of skewness.

Solution
Since $Mo < Md < \bar{X}$, we have a positively skewed distribution (to the right). Applying Expression (3.48), we can determine Pearson’s first coefficient of skewness:

$$Sk_1 = \frac{\bar{X} - Mo}{S} = \frac{34.7 - 31.5}{12.4} = 0.258$$

The classification of the distribution as positively skewed is also confirmed by the value Sk1 > 0.

3.4.3.1.2 Pearson’s Second Coefficient of Skewness

To avoid using the mode to calculate the skewness, we can adopt the empirical relationship between the mean, the median, and the mode, $\bar{X} - Mo = 3 \cdot (\bar{X} - Md)$, which leads to Pearson’s second coefficient of skewness (Sk2):

$$Sk_2 = \frac{3 \cdot (\mu - Md)}{\sigma} \quad \text{(for the population)} \qquad (3.49)$$

$$Sk_2 = \frac{3 \cdot (\bar{X} - Md)}{S} \quad \text{(for samples)} \qquad (3.50)$$

In the same way, we have:
If Sk2 = 0, the distribution is symmetrical;
If Sk2 > 0, the distribution is positively skewed (to the right);
If Sk2 < 0, the distribution is negatively skewed (to the left).

Pearson’s first and second coefficients of skewness allow us to compare two or more distributions and to evaluate which one is more asymmetrical. The modulus of the coefficient indicates the intensity of the skewness. That is, the higher the absolute value of Pearson’s coefficient of skewness, the more asymmetrical the curve is. Thus:
If 0 < |Sk| < 0.15, the skewness is weak;
If 0.15 ≤ |Sk| ≤ 1, the skewness is moderate;
If |Sk| > 1, the skewness is strong.

Example 3.40
From the data in Example 3.39, calculate Pearson’s second coefficient of skewness.

Solution
Applying Expression (3.50), we have:

$$Sk_2 = \frac{3 \cdot (\bar{X} - Md)}{S} = \frac{3 \times (34.7 - 33.2)}{12.4} = 0.363$$

Analogously, since Sk2 > 0, we confirm that the distribution is positively skewed.
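Both Pearson coefficients, together with the intensity bands listed above, can be sketched in Python. All function names are illustrative; the input figures are the summary measures of Example 3.39.

```python
def pearson_first(mean, mode, std_dev):
    """Pearson's first coefficient of skewness: (mean - mode) / S."""
    return (mean - mode) / std_dev

def pearson_second(mean, median, std_dev):
    """Pearson's second coefficient of skewness: 3 * (mean - median) / S."""
    return 3 * (mean - median) / std_dev

def skewness_intensity(sk):
    """Classify |Sk| using the bands given in the text."""
    size = abs(sk)
    if size == 0:
        return "symmetrical"
    if size < 0.15:
        return "weak"
    if size <= 1:
        return "moderate"
    return "strong"

# Summary measures from Example 3.39
sk1 = pearson_first(mean=34.7, mode=31.5, std_dev=12.4)
sk2 = pearson_second(mean=34.7, median=33.2, std_dev=12.4)
print(round(sk1, 3), round(sk2, 3), skewness_intensity(sk1), skewness_intensity(sk2))
```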

3.4.3.1.3 Bowley’s Coefficient of Skewness

Another measure of skewness is Bowley’s coefficient of skewness (SkB), also known as the quartile coefficient of skewness, calculated from quantiles, namely the first and third quartiles, in addition to the median:

$$Sk_B = \frac{Q_3 + Q_1 - 2 \cdot Md}{Q_3 - Q_1} \qquad (3.51)$$

In the same way, we have:
If SkB = 0, the distribution is symmetrical;
If SkB > 0, the distribution is positively skewed (to the right);
If SkB < 0, the distribution is negatively skewed (to the left).


Example 3.41
Calculate Bowley’s coefficient of skewness for the following dataset, which has already been sorted in ascending order:

Value      24    25    29    31    36    40    44    45    48    50    54    56
Position   1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th  11th  12th

Solution
We have Q1 = 30, Md = 42, and Q3 = 49. Therefore, we can determine Bowley’s coefficient of skewness:

$$Sk_B = \frac{Q_3 + Q_1 - 2 \cdot Md}{Q_3 - Q_1} = \frac{49 + 30 - 2 \times 42}{49 - 30} = -0.263$$

Since SkB < 0, we conclude that the distribution is negatively skewed (to the left).
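Expression (3.51) needs only three quantiles, so a Python sketch is short. The function name `bowley_skewness` is illustrative; the inputs are the quartiles and median of Example 3.41.

```python
def bowley_skewness(q1, median, q3):
    """Bowley's (quartile) coefficient of skewness."""
    if q3 == q1:
        raise ValueError("Q3 and Q1 must differ (zero interquartile range)")
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Quartiles and median from Example 3.41
sk_b = bowley_skewness(q1=30, median=42, q3=49)
print(round(sk_b, 3))
```

Because the coefficient is built from quartiles only, its value always lies between −1 and 1.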

3.4.3.1.4 Fisher’s Coefficient of Skewness

The last measure of skewness we will study is known as Fisher’s coefficient of skewness (g1), calculated from the third moment around the mean (M3), as presented in Maroco (2014):

$$g_1 = \frac{n^2 \cdot M_3}{(n - 1) \cdot (n - 2) \cdot S^3} \qquad (3.52)$$

where:

$$M_3 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n} \qquad (3.53)$$

which is interpreted the same way as the other coefficients of skewness, that is:
If g1 = 0, the distribution is symmetrical;
If g1 > 0, the distribution is positively skewed (to the right);
If g1 < 0, the distribution is negatively skewed (to the left).

Fisher’s coefficient of skewness can be calculated in Excel using the SKEW function (see Example 3.42) or using the Analysis ToolPak add-in (Section 3.5). Its calculation through SPSS software will be presented in Section 3.6.

3.4.3.1.5 Coefficient of Skewness on Stata

The coefficient of skewness on Stata is calculated from the second and third moments around the mean, as presented by Cox (2010):

$$Sk = \frac{M_3}{M_2^{3/2}} \qquad (3.54)$$

where:

$$M_2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} \qquad (3.55)$$


which is interpreted the same way as the other coefficients of skewness, that is:
If Sk = 0, the distribution is symmetrical;
If Sk > 0, the distribution is positively skewed (to the right);
If Sk < 0, the distribution is negatively skewed (to the left).
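The two moment-based coefficients, Expressions (3.52) and (3.54), can be compared with a short Python sketch. The function names are hypothetical, and the data reused here are the distances of Example 3.28, chosen only for illustration. Since $S^2 = \frac{n}{n-1} M_2$, the two coefficients are linked by $g_1 = Sk \cdot \frac{\sqrt{n (n - 1)}}{n - 2}$, which the last line checks numerically.

```python
import math

def central_moment(data, order):
    """Moment of the given order around the mean, with divisor n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** order for x in data) / n

def fisher_skewness(data):
    """g1 = n^2 * M3 / ((n - 1) * (n - 2) * S^3), with S the sample std. deviation."""
    n = len(data)
    s = math.sqrt(central_moment(data, 2) * n / (n - 1))
    return n ** 2 * central_moment(data, 3) / ((n - 1) * (n - 2) * s ** 3)

def moment_skewness(data):
    """Sk = M3 / M2^(3/2), the moment-based form described for Stata."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

# Illustrative data: distances traveled (km) from Example 3.28
distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]
n = len(distances)
g1 = fisher_skewness(distances)
sk = moment_skewness(distances)
print(g1, sk, math.isclose(g1, sk * math.sqrt(n * (n - 1)) / (n - 2)))
```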

3.4.3.2 Measures of Kurtosis In addition to measures of skewness, measures of kurtosis can also be used to characterize the shape of the distribution of the variable being studied. Kurtosis can be defined as the flatness level of a frequency distribution (height of the peak of the curve) in relation to a theoretical distribution that usually corresponds to the normal distribution. When the shape of the distribution is not very flat, nor very long, similar to a normal curve, it is called mesokurtic, as we can see in Fig. 3.19. In contrast, when the distribution shows a frequency curve that is flatter than a normal curve, it is called platykurtic, as shown in Fig. 3.20. Or, when the distribution presents a frequency curve that is longer than a normal curve, it is called leptokurtic, according to Fig. 3.21.

3.4.3.2.1 Coefficient of Kurtosis

One of the most common coefficients to measure the flatness level or kurtosis of a distribution is the percentile coefficient of kurtosis, or simply coefficient of kurtosis (k). It is calculated from the interquartile range, in addition to the 10th and 90th percentiles:

$$k = \frac{Q_3 - Q_1}{2\,(P_{90} - P_{10})} \tag{3.56}$$

which has the following interpretation:

If k = 0.263, we say that the curve is mesokurtic;
If k > 0.263, we say that the curve is platykurtic;
If k < 0.263, we say that the curve is leptokurtic.
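The reference value 0.263 is what expression (3.56) yields for a normal curve. The following Python sketch checks this with the theoretical quantiles of the standard normal distribution; the function names and the tolerance used to decide whether k "equals" 0.263 are illustrative assumptions, not part of the chapter.

```python
from statistics import NormalDist


def percentile_kurtosis(q1, q3, p10, p90):
    """Percentile coefficient of kurtosis: k = (Q3 - Q1) / (2 * (P90 - P10))."""
    return (q3 - q1) / (2 * (p90 - p10))


def classify(k, reference=0.263, tol=5e-4):
    """Label a curve by comparing k with the mesokurtic reference value.

    The tolerance `tol` is an assumption made for this sketch.
    """
    if abs(k - reference) <= tol:
        return "mesokurtic"
    return "platykurtic" if k > reference else "leptokurtic"


# Theoretical quartiles and 10th/90th percentiles of the standard normal curve.
z = NormalDist()
k_normal = percentile_kurtosis(
    z.inv_cdf(0.25), z.inv_cdf(0.75), z.inv_cdf(0.10), z.inv_cdf(0.90)
)
print(round(k_normal, 3), classify(k_normal))
```

In practice, the quartiles and percentiles passed to `percentile_kurtosis` would come from the sample, calculated with whichever percentile method is adopted.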

FIG. 3.19 Mesokurtic curve.

FIG. 3.20 Platykurtic curve.

FIG. 3.21 Leptokurtic curve.

3.4.3.2.2 Fisher's Coefficient of Kurtosis

Another very common measure to determine the flatness level or kurtosis of a distribution is Fisher's coefficient of kurtosis (g2). It is calculated using the fourth moment around the mean (M4), as presented in Maroco (2014):

$$g_2 = \frac{n^2 (n+1)\, M_4}{(n-1)(n-2)(n-3)\, S^4} - 3\,\frac{(n-1)^2}{(n-2)(n-3)} \tag{3.57}$$

where:

$$M_4 = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^4}{n} \tag{3.58}$$

which has the following interpretation:

If g2 = 0, the curve has a normal distribution (mesokurtic);
If g2 < 0, the curve is very flat (platykurtic);
If g2 > 0, the curve is very long (leptokurtic).

Many statistical software packages, among them SPSS, use Fisher's coefficient of kurtosis to calculate the flatness level or kurtosis (Section 3.6). In Excel, the KURT function calculates Fisher's coefficient of kurtosis (Example 3.42), and it can also be calculated through the Analysis ToolPak add-in (Section 3.5).

3.4.3.2.3 Coefficient of Kurtosis on Stata

The coefficient of kurtosis on Stata is calculated from the second and fourth moments around the mean, as presented by Bock (1975) and Cox (2010):

$$k_S = \frac{M_4}{M_2^2} \tag{3.59}$$

which has the following interpretation:

If kS = 3, the curve has a normal distribution (mesokurtic);
If kS < 3, the curve is very flat (platykurtic);
If kS > 3, the curve is very long (leptokurtic).
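Stata's coefficient and Fisher's coefficient are algebraically linked, which explains why their reference values are 3 and 0, respectively. As a sketch, assuming that S is the sample standard deviation with the n − 1 divisor, so that S⁴ = n²·M2²/(n − 1)², substituting into expression (3.57) gives:

```latex
g_2 = \frac{(n+1)(n-1)}{(n-2)(n-3)}\, k_S - \frac{3\,(n-1)^2}{(n-2)(n-3)}
    = \frac{n-1}{(n-2)(n-3)} \left[ (n+1)\, k_S - 3\,(n-1) \right]
```

As n grows, the factor that multiplies kS tends to 1 and the subtracted term tends to 3, so g2 approaches kS − 3.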

Example 3.42

Table 3.E.41 shows the prices of stock Y throughout a month, resulting in a sample with 20 periods (i.e., business days). Calculate:

a) Fisher's coefficient of skewness (g1);
b) The coefficient of skewness used on Stata;
c) Fisher's coefficient of kurtosis (g2);
d) The coefficient of kurtosis used on Stata.

TABLE 3.E.41 Prices of Stock Y Throughout the Month

18.7   18.3   18.4   18.7   18.8   18.8   19.1   18.9   19.1   19.9
18.5   18.5   18.1   17.9   18.2   18.3   18.1   18.8   17.5   16.9

Solution

The mean and the standard deviation of the data in Table 3.E.41 are $\bar{X} = 18.475$ and S = 0.6324, respectively. We have:

a) Fisher's coefficient of skewness g1:

It is calculated using the third moment around the mean (M3):

$$M_3 = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^3}{n} = \frac{(18.7 - 18.475)^3 + \cdots + (16.9 - 18.475)^3}{20} = -0.0788$$

Therefore, we have:

$$g_1 = \frac{n^2 M_3}{(n-1)(n-2)\, S^3} = \frac{(20)^2 \cdot (-0.079)}{19 \cdot 18 \cdot (0.63)^3} = -0.3647$$

Since g1 < 0, we can conclude that the frequency curve is more concentrated on the right side and has a longer tail to the left, that is, the distribution is asymmetrical to the left or negative.

Excel calculates Fisher's coefficient of skewness (g1) through the SKEW function. File Stock_Market.xls shows the data from Table 3.E.41 in cells A1:A20. Thus, to calculate it, we just need to insert the expression =SKEW(A1:A20).

b) The coefficient of skewness used on Stata:

It is calculated from the second and third moments around the mean:

$$M_2 = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2}{n} = \frac{(18.7 - 18.475)^2 + \cdots + (16.9 - 18.475)^2}{20} = 0.3799$$

$$M_3 = -0.0788$$

It is calculated as follows:

$$Sk = \frac{M_3}{M_2^{3/2}} = -0.3367,$$

which is interpreted the same way as Fisher's coefficient of skewness.

c) Fisher's coefficient of kurtosis g2:

It is calculated using the fourth moment around the mean (M4):

$$M_4 = \frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^4}{n} = \frac{(18.7 - 18.475)^4 + \cdots + (16.9 - 18.475)^4}{20} = 0.5857$$

Therefore, we calculate g2 as follows:

$$g_2 = \frac{n^2 (n+1)\, M_4}{(n-1)(n-2)(n-3)\, S^4} - 3\,\frac{(n-1)^2}{(n-2)(n-3)}$$

$$g_2 = \frac{(20)^2 \cdot 21 \cdot 0.5857}{19 \cdot 18 \cdot 17 \cdot (0.6324)^4} - 3\,\frac{(19)^2}{18 \cdot 17} = 1.7529$$

Thus, since g2 > 0, we can conclude that the curve is long or leptokurtic.

The KURT function in Excel calculates Fisher's coefficient of kurtosis (g2). To calculate it from the file Stock_Market.xls, we must insert the expression =KURT(A1:A20).

d) Coefficient of kurtosis on Stata:

It is calculated from the second and fourth moments around the mean, M2 = 0.3799 and M4 = 0.5857, as already calculated. Thus:

$$k_S = \frac{M_4}{M_2^2} = \frac{0.5857}{(0.3799)^2} = 4.0586$$

Since kS > 3, the curve is long or leptokurtic.

In the next three sections, we will discuss how to construct tables, charts, graphs, and summary measures in Excel and in the statistical software packages SPSS and Stata, using the data in Example 3.42.
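The four coefficients in Example 3.42 can be checked with a short script. The Python sketch below is not part of the chapter's Excel, SPSS, or Stata workflow; it applies the moment-based expressions of Section 3.4.3 to the prices in Table 3.E.41, and the function names are illustrative.

```python
from math import sqrt

# Prices of stock Y from Table 3.E.41 (Example 3.42).
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def central_moment(data, order):
    """Moment of the given order around the mean, with divisor n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** order for x in data) / n


def shape_coefficients(data):
    """Return (g1, Sk, g2, kS): Fisher's and Stata's skewness and kurtosis."""
    n = len(data)
    m2, m3, m4 = (central_moment(data, k) for k in (2, 3, 4))
    s = sqrt(n * m2 / (n - 1))  # sample standard deviation (n - 1 divisor)
    g1 = n**2 * m3 / ((n - 1) * (n - 2) * s**3)
    sk = m3 / m2**1.5
    g2 = (n**2 * (n + 1) * m4 / ((n - 1) * (n - 2) * (n - 3) * s**4)
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    ks = m4 / m2**2
    return g1, sk, g2, ks


g1, sk, g2, ks = shape_coefficients(prices)
print(f"g1 = {g1:.4f}, Sk = {sk:.4f}, g2 = {g2:.4f}, kS = {ks:.4f}")
```

Rounded to four decimal places, the script reproduces the values obtained above: g1 = −0.3647, Sk = −0.3367, g2 = 1.7529, and kS = 4.0586.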

3.5 A PRACTICAL EXAMPLE IN EXCEL

Section 3.3.1 showed the graphical representation of qualitative variables through bar charts (horizontal and vertical), pie charts, and the Pareto chart, and demonstrated how each of these charts can be obtained using Excel. Section 3.3.2, in turn, showed the graphical representation of quantitative variables through line graphs, scatter plots, and histograms, among others, and likewise presented how most of them can be obtained using Excel.

Section 3.4 presented the main summary measures, including measures of central tendency (mean, mode, and median), quantiles (quartiles, deciles, and percentiles), measures of dispersion or variability (range, average deviation, variance, standard deviation, standard error, and coefficient of variation), and measures of shape (skewness and kurtosis). We also presented how they can be calculated using Excel functions, except for those that are not available.

This section discusses how to obtain descriptive statistics (such as the mean, standard error, median, mode, standard deviation, variance, kurtosis, and skewness, among others) through the Analysis ToolPak add-in in Excel. To do that, let's consider the problem presented in Example 3.42, whose data are available in the file Stock_Market.xls, in cells A1:A20, as shown in Fig. 3.22.

To load the Analysis ToolPak add-in in Excel, we must first click on the File tab and then on Options, as shown in Fig. 3.23. The Excel Options dialog box will open, as shown in Fig. 3.24. In this box, we select the option Add-ins. In Add-ins, we must select the option Analysis ToolPak and click on Go. The Add-ins dialog box will then appear, as shown in Fig. 3.25. Among the add-ins available, we must select the option Analysis ToolPak and click on OK.

FIG. 3.22 Dataset in Excel—Price of Stock Y.

FIG. 3.23 File tab, focusing more on Options.

The option Data Analysis then becomes available on the Data tab, inside the Analysis group, as shown in Fig. 3.26. Fig. 3.27 shows the Data Analysis dialog box. Note that several analysis tools are available. Let's select the option Descriptive Statistics and click on OK. In the Descriptive Statistics dialog box (Fig. 3.28), we must select the Input Range (A1:A20) and, as Output options, select Summary statistics. The results can be presented in a new worksheet or in a new workbook. Finally, let's click on OK.

The descriptive statistics generated can be seen in Fig. 3.29 and include measures of central tendency (mean, mode, and median), measures of dispersion or variability (variance, standard deviation, and standard error), and measures of shape (skewness and kurtosis). The range can be calculated as the difference between the sample's maximum and minimum values. As mentioned in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by Excel (using the SKEW function or the Descriptive Statistics tool in Fig. 3.28) corresponds to Fisher's coefficient of skewness (g1), and the measure of kurtosis (using the KURT function or the Descriptive Statistics tool in Fig. 3.28) corresponds to Fisher's coefficient of kurtosis (g2).
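For readers who want to cross-check the summary outside Excel, the following Python sketch computes the same kind of measures with the standard library. The function name `describe` and the row labels are assumptions chosen to resemble a descriptive statistics summary; the shape measures are left out here because they follow expressions for g1 and g2 already discussed.

```python
import statistics as st
from math import sqrt

# Prices of stock Y from Table 3.E.41 (cells A1:A20 of Stock_Market.xls).
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def describe(data):
    """Summary measures of a sample (dispersion uses the n - 1 divisor)."""
    n = len(data)
    s = st.stdev(data)
    return {
        "Mean": st.mean(data),
        "Standard Error": s / sqrt(n),
        "Median": st.median(data),
        "Mode": st.mode(data),
        "Standard Deviation": s,
        "Sample Variance": st.variance(data),
        "Range": max(data) - min(data),
        "Minimum": min(data),
        "Maximum": max(data),
        "Sum": sum(data),
        "Count": n,
    }


summary = describe(prices)
for name, value in summary.items():
    print(f"{name:<20}{value:>10.4f}")
```

The mean (18.475) and standard deviation (0.6324) agree with the values used in the solution of Example 3.42.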

FIG. 3.24 Excel Options dialog box.

FIG. 3.25 Add-ins dialog box.

FIG. 3.26 Availability of the Data Analysis command, from the Data tab.

FIG. 3.27 Data Analysis dialog box.

FIG. 3.28 Descriptive Statistics dialog box.

FIG. 3.29 Descriptive statistics in Excel.

3.6 A PRACTICAL EXAMPLE ON SPSS

Based on a practical example, this section presents how to obtain the main univariate descriptive statistics studied in this chapter by using IBM SPSS Statistics Software. These include frequency distribution tables, charts (histograms, stem-and-leaf plots, boxplots, bar charts, and pie charts), measures of central tendency (mean, mode, and median), quantiles (quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, and standard error, among others), and measures of shape (skewness and kurtosis). The use of the images in this section has been authorized by the International Business Machines Corporation©.

The data presented in Example 3.42 are the input on SPSS and are available in the file Stock_Market.sav, as shown in Fig. 3.30. To obtain such descriptive statistics, we must click on Analyze → Descriptive Statistics. After that, three options can be used: Frequencies, Descriptives, and Explore.

FIG. 3.30 Dataset on SPSS: Price of Stock Y.

3.6.1 Frequencies Option

This option can be used for qualitative and quantitative variables, and it provides frequency distribution tables, as well as measures of central tendency (mean, median, and mode), quantiles (quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, and standard error, among others), and measures of skewness and kurtosis. The Frequencies option also plots bar charts, pie charts, or histograms (with or without a normal curve). To use it, on the toolbar, click on Analyze → Descriptive Statistics and select Frequencies..., as shown in Fig. 3.31.

FIG. 3.31 Descriptive statistics on SPSS: Frequencies Option.

FIG. 3.32 Frequencies dialog box: selecting the variable and showing the frequency table.

The Frequencies dialog box will then open. The variable being studied (stock price, called Price) must be selected in Variable(s), and the Display frequency tables option must be activated so that the frequency distribution table is shown (Fig. 3.32). The next step consists of clicking on Statistics... to select the summary measures that interest us (Fig. 3.33). Among the quantiles, let's select the option Quartiles (which calculates the first and third quartiles, in addition to the median). To get the percentile of order i (i = 1, 2, ..., 99), we must select the option Percentile(s) and add the desired order. In this case, we chose to calculate the percentiles of order 10 and 60. The measures of central tendency that we have to select are the mean, median, and mode. As measures of dispersion, let's select Std. deviation (standard deviation), Variance, Range, and S.E. mean (standard error). Finally, let's select both measures of shape of a distribution: Skewness and Kurtosis. To go back to the Frequencies dialog box, we must click on Continue.

FIG. 3.33 Frequencies: Statistics dialog box.

Next, let's click on Charts... and select the chart that interests us. As options, we have Bar charts, Pie charts, or Histograms. Let's select the last one, with the option of plotting a normal curve (Fig. 3.34). Bar or pie charts can be shown in terms of absolute frequencies (Frequencies) or relative frequencies (Percentages). To go back to the Frequencies dialog box once again, we must click on Continue. Finally, click on OK.

Fig. 3.35 shows the calculations of the summary measures selected in Fig. 3.33. As studied in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by SPSS corresponds to Fisher's coefficient of skewness (g1), and the measure of kurtosis corresponds to Fisher's coefficient of kurtosis (g2). Also in Fig. 3.35, note that the percentiles of order 25, 50, and 75, which correspond to the first quartile, median, and third quartile, respectively, were calculated automatically. The method used to calculate the percentiles was the Weighted Average.

The frequency distribution table can be seen in Fig. 3.36. The first column represents the absolute frequency of each element (Fi), the second and third columns represent the relative frequency of each element (Fri, in %), and the last column represents the relative cumulative frequency (Frac, in %). Also in Fig. 3.36, we can see that most values occur only once or twice (the highest absolute frequency is 3, for the price 18.8). Since we have a continuous quantitative variable with 20 observations and few repetitions, constructing bar or pie charts would not give the researcher any additional information, that is, it would not allow a good visualization of how the stock prices behave in terms of bins. Hence, we chose to construct a histogram with previously defined bins. The histogram generated using SPSS with the option of plotting a normal curve can be seen in Fig. 3.37.

FIG. 3.34 Frequencies: Charts dialog box.

FIG. 3.35 Summary measures obtained from Frequencies: Statistics.

FIG. 3.36 Frequency distribution.

FIG. 3.37 Histogram with a normal curve obtained from Frequencies: Charts.

[Figure: histogram; x-axis Price (17.0 to 20.0), y-axis Frequency (0 to 8); Mean = 18.47, Std. Dev. = .632, N = 20.]

3.6.2 Descriptives Option

Unlike Frequencies..., which also offers the frequency distribution table, in addition to bar charts, pie charts, or histograms (with or without a normal curve), Descriptives... only makes summary measures available (therefore, it is recommended for quantitative variables). Nevertheless, measures of central tendency such as the median and mode are not made available; nor are quantiles, such as quartiles and percentiles. To use it, let's click on Analyze → Descriptive Statistics and select Descriptives..., as shown in Fig. 3.38.

The Descriptives dialog box will then open. The variable being studied must be selected in Variable(s), as shown in Fig. 3.39. Let's click on Options... and select the summary measures that interest us (Fig. 3.40). Note that the same summary measures chosen in Frequencies... were selected, except for the median, the mode, the quartiles, and the percentiles, which are not available, as already mentioned. Let's click on Continue to go back to the Descriptives dialog box. Finally, click on OK. The results are available in Fig. 3.41.

FIG. 3.38 Descriptive statistics on SPSS—Descriptives Option.

FIG. 3.39 Descriptives dialog box: selecting the variable.

FIG. 3.40 Descriptives: Options dialog box.

FIG. 3.41 Summary measures obtained from Descriptives: Options.

3.6.3 Explore Option

Like Descriptives..., Explore... does not provide the frequency distribution table. Regarding the types of chart, unlike Frequencies..., which offers bar charts, pie charts, and histograms, Explore... provides stem-and-leaf plots and boxplots, in addition to histograms. However, it does not have the option of plotting a normal curve. Regarding summary measures, Explore... provides measures of central tendency, such as the mean and median (there is no option for the mode); quantiles, such as percentiles (of order 5, 10, 25, 50, 75, 90, and 95); measures of dispersion, such as the range, variance, and standard deviation, among others (it does not calculate the standard error); and measures of skewness and kurtosis.

Therefore, this command is the best one to generate descriptive statistics for quantitative variables. From Analyze → Descriptive Statistics, select Explore..., as shown in Fig. 3.42.

FIG. 3.42 Descriptive statistics on SPSS: Explore Option.

FIG. 3.43 Explore dialog box: selecting the variable.

The Explore dialog box will then open. The variable being studied must be selected from the list of dependent variables (Dependent List), as shown in Fig. 3.43. Next, we must click on Statistics... to open the Explore: Statistics box, and select the options Descriptives, Outliers, and Percentiles, as shown in Fig. 3.44. Let's click on Continue to go back to the Explore box. Next, we must click on Plots... to open the Explore: Plots box and select the charts that interest us, as shown in Fig. 3.45. In this case, we select Boxplots: Factor levels together (the resulting boxplots will be together in the same chart), Stem-and-leaf, and Histogram (note that there is no option for plotting the normal curve). Once again, we must click on Continue to go back to the Explore dialog box. Finally, click on OK.

The results obtained are presented next. Fig. 3.46 shows the results obtained from Explore: Statistics, with the Descriptives option. Fig. 3.47 shows the results obtained from Explore: Statistics, with the Percentiles option. The percentiles of order 5, 10, 25 (Q1), 50 (median), 75 (Q3), 90, and 95 were calculated using two methods: the Weighted Average and Tukey's Hinges. The latter corresponds to the method proposed in this chapter (Section 3.4.1.2, Case 1). Thus, applying the expressions in Section 3.4.1.2 to this example, we get the same results seen in Fig. 3.47 for P25, P50, and P75 under Tukey's Hinges method. Coincidentally, in this example, the value of P75 was the same for both methods, but they are usually different.

Fig. 3.48 shows the results obtained from Explore: Statistics, with the Outliers option. The extreme values of the distribution are presented here (the highest five and the lowest five), with their respective positions in the dataset. The charts constructed from the options selected in Explore: Plots (histograms, stem-and-leaf plots, and boxplots) are presented in Figs. 3.49, 3.50, and 3.51, respectively.

FIG. 3.44 Explore: Statistics dialog box.

FIG. 3.45 Explore: Plots dialog box.

FIG. 3.46 Results Obtained from the Descriptives Option.

FIG. 3.47 Results obtained from the Percentiles option.

FIG. 3.48 Results obtained from the Outliers option.

FIG. 3.49 Histogram constructed from the Explore: Plots dialog box.

[Figure: histogram; x-axis Price (17.0 to 20.0), y-axis Frequency (0 to 8); Mean = 18.48, Std. Dev. = .632, N = 20.]

FIG. 3.50 Stem-and-leaf chart generated from the Explore: Plots dialog box.

Price Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     1.00 Extremes    (=<16.9)
      .00       17 .
     2.00       17 .  59
     6.00       18 .  112334
     8.00       18 .  55778889
     2.00       19 .  11
     1.00 Extremes    (>=19.9)

 Stem width:  1.0
 Each leaf:   1 case(s)

FIG. 3.51 Boxplot generated from the Explore: Plots dialog box.

[Figure: boxplot; y-axis Price (16.0 to 20.0); observations 10 and 20 are labeled as outliers.]

The histogram in Fig. 3.49 is the same as the one generated by Frequencies... (Fig. 3.37), but without the normal curve, since Explore... does not provide this function. Fig. 3.50 shows that the first two digits of the number (the integer part, before the decimal point) form the stem, and the decimals correspond to the leaf. Moreover, stem 18 is represented in two lines because it contains several observations.

In Section 3.4.1.3, we learned how to identify an extreme outlier through the expressions X* < Q1 − 3·(Q3 − Q1) and X* > Q3 + 3·(Q3 − Q1). If we consider that Q1 = 18.15 and Q3 = 18.8, we have X* < 16.2 or X* > 20.75. Since there are no observations outside these limits, we conclude that there are no extreme outliers. Repeating the same procedure for mild outliers, that is, applying the expressions X° < Q1 − 1.5·(Q3 − Q1) and X° > Q3 + 1.5·(Q3 − Q1), we can see that there is one observation with a value less than 17.175 (20th observation), and another with a value greater than 19.775 (10th observation). These values are therefore considered mild outliers.

The boxplot in Fig. 3.51 shows that observations 10 and 20, with values 19.9 and 16.9, respectively, are mild outliers (represented by circles). Depending on their research goals, researchers can then decide whether to keep them, exclude them (at the cost of a smaller sample, which may harm the analysis), or replace their values with the variable's mean.

Still in Fig. 3.51, the values of Q1, Q2 (Md), and Q3 correspond to 18.15, 18.5, and 18.8, respectively, which are those obtained from Tukey's Hinges method (Fig. 3.47), considering all of the initial 20 observations. Therefore, the boxplot's measures of position (Q1, Md, and Q3), unlike its minimum and maximum values, are calculated without excluding the outliers.
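The outlier check above can be sketched in Python. The sketch assumes that Tukey's hinges are the medians of the lower and upper halves of the sorted data (with the overall median included in both halves when n is odd); the function names are illustrative and not part of the chapter.

```python
from statistics import median

# Prices of stock Y from Table 3.E.41, in their original order.
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def tukey_hinges(data):
    """Return (Q1, median, Q3) using medians of the lower and upper halves."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2  # the middle value belongs to both halves if n is odd
    return median(xs[:half]), median(xs), median(xs[n - half:])


def find_outliers(data):
    """Return (mild, extreme) lists of (observation number, value) pairs."""
    q1, _, q3 = tukey_hinges(data)
    iqr = q3 - q1
    mild, extreme = [], []
    for position, x in enumerate(data, start=1):
        if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
            extreme.append((position, x))
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            mild.append((position, x))
    return mild, extreme


q1, md, q3 = tukey_hinges(prices)
mild, extreme = find_outliers(prices)
print(q1, md, q3)
print("mild:", mild, "extreme:", extreme)
```

For the stock prices, the sketch flags observations 10 (19.9) and 20 (16.9) as mild outliers and finds no extreme outliers, in line with the discussion of Fig. 3.51.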

3.7 A PRACTICAL EXAMPLE ON STATA

The same descriptive statistics obtained in the previous section through SPSS software will be calculated in this section through Stata Statistical Software. The results will be compared to those obtained algebraically and to those obtained using SPSS. The use of the images in this section has been authorized by StataCorp LP©. The data presented in Example 3.42 are the input on Stata and are available in the file Stock_Market.dta.

3.7.1 Univariate Frequency Distribution Tables on Stata

Through the command tabulate, or simply tab, as we will use throughout this book, we can obtain frequency distribution tables for a certain variable. The syntax of the command is:

    tab variable*

where the term variable* should be replaced with the name of the variable considered in the analysis. Fig. 3.52 shows the output obtained using the command tab price. Just like the frequency distribution table obtained through SPSS (Fig. 3.36), Fig. 3.52 provides the absolute, relative, and relative cumulative frequencies for each value of the variable price.

FIG. 3.52 Frequency distribution on Stata using the command tab.
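As a rough illustration of what a one-way frequency table contains, the following Python sketch builds the absolute, relative, and relative cumulative frequencies for the stock prices. It is only an analogue written for this purpose (the function name is an assumption) and does not reproduce Stata's output layout.

```python
from collections import Counter

# Prices of stock Y from Table 3.E.41.
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def frequency_table(data):
    """Rows of (value, absolute freq., relative freq. in %, cumulative %)."""
    counts = Counter(data)
    n = len(data)
    rows, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / n, 100 * cumulative / n))
    return rows


table = frequency_table(prices)
for value, freq, pct, cum in table:
    print(f"{value:>6.1f}{freq:>6d}{pct:>9.2f}{cum:>9.2f}")
```

Accumulating the integer counts, rather than the percentages, keeps the last cumulative value at exactly 100.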

Consider a case with more than one variable being studied, in which the objective is to construct univariate frequency distribution tables (one-way tables), that is, one table for each variable being analyzed. In this case, we must use the command tab1, with the following syntax:

    tab1 variables*

where the term variables* should be replaced with the list of variables being considered in the analysis.

3.7.2 Summary of Univariate Descriptive Statistics on Stata

Through the command summarize, or simply sum, as we will use throughout this book, we can obtain summary measures, such as the mean, standard deviation, and minimum and maximum values. The syntax of this command is:

    sum variables*

where the term variables* should be replaced with the list of variables to be considered in the analysis. If no variable is specified, the statistics will be calculated for all of the variables in the dataset.

Through the option detail, we can obtain additional statistics, such as the coefficient of skewness, the coefficient of kurtosis, the four lowest and highest values, as well as several percentiles. The syntax of this command is:

    sum variables*, detail

Therefore, for the data in our example, available in the file Stock_Market.dta, we must first type the following command:

    sum price

obtaining the statistics in Fig. 3.53. To obtain additional descriptive statistics, we must type the following command:

    sum price, detail

Fig. 3.54 shows the generated outputs. As shown in Fig. 3.54, the option detail provides the calculation of the percentiles of order 1, 5, 10, 25, 50, 75, 90, 95, and 99. These results are obtained by Tukey's Hinges method. We have seen, through Fig. 3.47 on the SPSS software, the results of the percentiles of order 25, 50, and 75 obtained by the same method. Fig. 3.54 also provides the four lowest and highest values of the sample analyzed, as well as the coefficients of skewness and kurtosis. Note that these values coincide with the ones calculated in Example 3.42 using the expressions in Sections 3.4.3.1.5 and 3.4.3.2.3, respectively.

FIG. 3.53 Summary measures using the command sum on Stata. FIG. 3.54 Additional statistics using the option detail.

FIG. 3.55 Results obtained from the command centile on Stata.

3.7.3 Calculating Percentiles on Stata

The previous section discussed how to calculate the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles through Tukey's Hinges method. By using the command centile, on the other hand, we can specify the percentiles to be calculated. The method used in this case is the Weighted Average. The syntax of this command is:

    centile variables*, centile(numbers*)

where the term variables* should be replaced with the list of variables to be considered in the analysis, and the term numbers* with the list of numbers that represent the order of the percentiles to be reported.

Let's suppose that we want to calculate the percentiles of order 5, 10, 25, 60, 64, 90, and 95 for the variable price, through the Weighted Average. To do that, we must use the following command:

    centile price, centile(5 10 25 60 64 90 95)

The results can be seen in Fig. 3.55. We have seen, through Fig. 3.35, the results of the SPSS software for the percentiles of order 10, 25, 50, 60, and 75 using the same method. Fig. 3.47 on SPSS also provided the calculation of the percentiles of order 5, 10, 25, 50, 75, 90, and 95 through the Weighted Average. The only percentile that had not been specified previously was the one of order 64; the others coincide with the results in Figs. 3.35 and 3.47.
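To make the Weighted Average idea concrete, the Python sketch below assumes that the percentile of order p is located at position (n + 1)·p/100 in the sorted data and is obtained by linear interpolation between the two neighboring observations. This definition and the function name are assumptions made for illustration; the software documentation should be consulted for the exact rule each package applies.

```python
from math import floor

# Prices of stock Y from Table 3.E.41.
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def weighted_average_percentile(data, p):
    """Percentile of order p (0 < p < 100) at position (n + 1) * p / 100.

    The value is interpolated linearly between the two order statistics
    that surround the position; positions outside 1..n are clamped.
    """
    xs = sorted(data)
    n = len(xs)
    position = (n + 1) * p / 100
    lower = floor(position)
    fraction = position - lower
    if lower < 1:
        return xs[0]
    if lower >= n:
        return xs[-1]
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])


for order in (25, 50, 75):
    print(order, round(weighted_average_percentile(prices, order), 4))
```

Under this assumption, the sketch gives P25 = 18.125, P50 = 18.5, and P75 = 18.8 for the stock prices, whereas Tukey's hinges are 18.15, 18.5, and 18.8. This is consistent with the earlier remark that P75 happens to coincide for both methods in this example.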

3.7.4 Charts on Stata: Histograms, Stem-and-Leaf, and Boxplots

Stata makes a series of charts available, including bar charts, pie charts, scatter plots, histograms, stem-and-leaf plots, and boxplots, among others. Next, we will discuss how to obtain histograms, stem-and-leaf plots, and boxplots on Stata, for the data available in the file Stock_Market.dta.

3.7.4.1 Histogram

Histograms on Stata can be obtained for continuous and discrete variables. In the case of continuous variables, to obtain a histogram of absolute frequencies, with the option of plotting a normal curve, we must type the following syntax:

    histogram variable*, normal frequency

or simply:

    hist variable*, norm freq

as we will use throughout this book. As mentioned before, the term variable* must be replaced with the name of the variable being studied. For discrete variables, we must include the term discrete:

    hist variable*, discrete norm freq

FIG. 3.56 Frequency histogram on Stata.

[Figure: frequency histogram; x-axis Price (17 to 20), y-axis Frequency (0 to 10).]

Going back to the data in Example 3.42, to obtain a frequency histogram, with the option of plotting a normal curve, we must type the following command:

    hist price, norm freq

The obtained output is shown in Fig. 3.56.

3.7.4.2 Stem-and-Leaf

The stem-and-leaf plot on Stata can be obtained using the command stem, followed by the name of the variable being studied. For the data in the file Stock_Market.dta, we just need to type the following command:

    stem price

The obtained output is shown in Fig. 3.57.
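The logic behind a stem-and-leaf plot can be sketched in a few lines of Python. The sketch assumes non-negative values recorded with one decimal place, uses the integer part as the stem and the first decimal as the leaf, and optionally splits each stem into two lines (leaves 0 to 4 and 5 to 9). The function name and the output layout are illustrative and do not reproduce the exact format of Stata or SPSS.

```python
from collections import defaultdict

# Prices of stock Y from Table 3.E.41.
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]


def stem_and_leaf(data, split=True):
    """Return the plot as a list of text lines, one per (split) stem."""
    groups = defaultdict(list)
    for x in sorted(data):
        tenths = round(x * 10)          # work with whole tenths
        stem, leaf = divmod(tenths, 10)
        half = leaf // 5 if split else 0
        groups[(stem, half)].append(str(leaf))
    return [f"{stem:>3} | {''.join(leaves)}"
            for (stem, _), leaves in sorted(groups.items())]


lines = stem_and_leaf(prices)
print("\n".join(lines))
```

Unlike the SPSS output in Fig. 3.50, this sketch does not set the outliers 16.9 and 19.9 aside as extremes; they simply appear under stems 16 and 19.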

3.7.4.3 Boxplot

To obtain the boxplot on the Stata software, we must use the following syntax:

    graph box variables*

where the term variables* should be replaced with the list of variables to be considered in the analysis; one chart is constructed for each variable. For the data in Example 3.42, the command is:

    graph box price

The chart is shown in Fig. 3.58, which corresponds to the chart in Fig. 3.51 generated using SPSS.

FIG. 3.57 Stem-and-Leaf plot on Stata.

FIG. 3.58 Boxplot on Stata.

[Figure: boxplot; y-axis Price (17 to 20).]

3.8 FINAL REMARKS

In this chapter, we studied descriptive statistics for a single variable (univariate descriptive statistics), in order to acquire a better understanding of the behavior of each variable through tables, charts, graphs, and summary measures, identifying trends, variability, and outliers.

Before we start using descriptive statistics, it is necessary to identify the type of variable we will study. The type of variable is essential for calculating descriptive statistics and for the graphical representation of the results.

The descriptive statistics used to represent the behavior of a qualitative variable's data are frequency distribution tables and charts. The frequency distribution table for a qualitative variable represents the frequency with which each category of the variable occurs. Qualitative variables can be represented graphically by bar charts (horizontal and vertical), pie charts, and Pareto charts.

For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, dispersion or variability, and measures of shape). Frequency distribution tables can also be used to represent the frequency with which each possible value of a discrete variable occurs, or to represent the frequency of continuous variables' data grouped into classes. Line graphs, dot or scatter plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams) are normally used to graphically represent quantitative variables.

3.9 EXERCISES

1) What statistics can be used (and in which situations) to represent the behavior of a single quantitative or qualitative variable?
2) What are the limitations of only using measures of central tendency in the study of a certain variable?
3) How can we verify the existence of outliers in a certain variable?
4) Describe each one of the measures of dispersion or variability.
5) What is the difference between Pearson's first and second coefficients used as measures of skewness in a distribution?
6) What is the best chart to check the position, skewness, and discrepancy among the data?
7) In the case of bar charts and scatter plots, what kind of data should be used?
8) What are the most suitable charts to represent qualitative data?

9) Table 3.1 shows the number of vehicles sold by a dealership in the last 30 days. Construct a frequency distribution table for these data.

TABLE 3.1 Number of Vehicles Sold

 7    5    9   11   10    8    9    6    8   10
 8    5    7   11    9   11    6    7   10    9
 8    5    6    8    6    7    6    5   10    8

10) A survey on patients' health was carried out and information regarding the weight of 50 patients was collected (Table 3.2). Build the frequency distribution table for this problem.

TABLE 3.2 Patients' Weight

60.4   78.9   65.7   82.1   80.9   92.3   85.7   86.6   90.3   93.2
75.2   77.3   80.4   62.0   90.4   70.4   80.5   75.9   55.0   84.3
81.3   78.3   70.5   85.6   71.9   77.5   76.1   67.7   80.6   78.0
71.6   74.8   92.1   87.7   83.8   93.4   69.3   97.8   81.7   72.2
69.3   80.2   90.0   76.9   54.7   78.4   55.2   75.5   99.3   66.7

11) At an electrical appliances factory, in the door component production phase, the quality inspector verifies the total number of parts rejected per type of defect (lack of alignment, scratches, deformation, discoloration, and oxidation), as shown in Table 3.3.

TABLE 3.3 Total Number of Parts Rejected per Type of Defect

Type of Defect        Total
Lack of Alignment        98
Scratches                67
Deformation              45
Discoloration            28
Oxidation                12
Total                   250

We would like you to:
a) Elaborate a frequency distribution table for this problem.
b) Construct a pie chart, in addition to a Pareto chart.

12) To preserve açaí, it is necessary to carry out several procedures, such as blanching, pasteurization, freezing, and dehydration. The files Dehydration.xls, Dehydration.sav, and Dehydration.dta show the processing times (in seconds) in the dehydration phase throughout 100 periods. We would like you to:
a) Calculate the measures of position regarding the arithmetic mean, the median, and the mode.
b) Calculate the first and third quartiles and see if there are any outliers.
c) Calculate the 10th and 90th percentiles.
d) Calculate the 3rd and 6th deciles.
e) Calculate the measures of dispersion (range, average deviation, variance, standard deviation, standard error, and coefficient of variation).

f) Check if the distribution is symmetrical, positively skewed, or negatively skewed. g) Calculate the coefficient of kurtosis and determine the flatness level of the distribution (mesokurtic, platykurtic or leptokurtic). h) Construct a histogram, a stem-and-leaf plot, and a boxplot for the variable being studied. 13) In a certain bank branch, we collected the average service time (in minutes) from a sample with 50 customers regarding three types of services. The data can be found in files Services.xls, Services.sav, and Services.dta. Compare the results of the services based on the following measures: a) Measures of position (mean, median, and mode). b) Measures of dispersion (variance, standard deviation, and standard error). c) First and third quartiles; check if there are any outliers. d) Fisher’s coefficient of skewness (g1) and Fisher’s coefficient of kurtosis (g2). Classify the symmetry and the flatness level of each distribution. e) For each one of the variables, construct a bar chart, a boxplot, and a histogram. 14) A passenger collected the average travel times (in minutes) of a bus in the district of Vila Mariana, on the Jabaquara route, for 120 days (Table 3.4). We would like you to: a) Calculate the arithmetic mean, the median, and the mode.

b) Calculate Q1, Q3, D4, P61, and P84.
c) Are there any outliers?
d) Calculate the range, the variance, the standard deviation, and the standard error.
e) Calculate Fisher’s coefficient of skewness (g1) and Fisher’s coefficient of kurtosis (g2). Classify the symmetry and the flatness level of the distribution.
f) Construct a bar chart, a histogram, a stem-and-leaf plot, and a boxplot.

TABLE 3.4 Average Travel Times in 120 Days

Time    Number of Days
30      4
32      7
33      10
35      12
38      18
40      22
42      20
43      15
45      8
50      4

15) In order to improve the quality of its services, a retail company collected the average service time, in seconds, of 250 employees. The data were grouped into classes, with their respective absolute and relative frequencies, as shown in Table 3.5. We would like you to:
a) Calculate the arithmetic mean, the median, and the mode.
b) Calculate Q1, Q3, D2, P13, and P95.
c) Are there any outliers?
d) Calculate the range, the variance, the standard deviation, and the standard error.
e) Calculate Pearson’s first coefficient of skewness and the coefficient of kurtosis. Classify the symmetry and the flatness level of the distribution.
f) Construct a histogram.

TABLE 3.5 Average Service Time

Class        Fi     Fri (%)
30 ├ 60      11     4.4
60 ├ 90      29     11.6
90 ├ 120     41     16.4
120 ├ 150    82     32.8
150 ├ 180    54     21.6
180 ├ 210    33     13.2
Sum          250    100
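
As a starting point for exercise 15, the following Python sketch computes a few of the requested measures from the grouped data in Table 3.5. It is only an illustration under stated assumptions: each class is represented by its midpoint, the median is obtained by linear interpolation inside the median class, and the variance uses the sample divisor n - 1. The variable and function names are hypothetical, and the results may differ slightly from those obtained with other conventions or with Excel, SPSS, or Stata.

```python
import math

# Classes from Table 3.5 as (lower limit, upper limit, absolute frequency Fi).
classes = [
    (30, 60, 11),
    (60, 90, 29),
    (90, 120, 41),
    (120, 150, 82),
    (150, 180, 54),
    (180, 210, 33),
]

n = sum(f for _, _, f in classes)  # total number of observations

# Mean: weight each class midpoint by its frequency (assumption: midpoint
# represents every observation in the class).
mean = sum((lo + hi) / 2 * f for lo, hi, f in classes) / n


def grouped_median(table):
    """Median by linear interpolation inside the class that contains n/2."""
    total = sum(f for _, _, f in table)
    cumulative = 0
    for lo, hi, f in table:
        if f > 0 and cumulative + f >= total / 2:
            return lo + (total / 2 - cumulative) / f * (hi - lo)
        cumulative += f
    raise ValueError("empty frequency table")


median = grouped_median(classes)

# Sample variance (divisor n - 1, an assumption), standard deviation,
# and standard error of the mean.
variance = sum(f * ((lo + hi) / 2 - mean) ** 2 for lo, hi, f in classes) / (n - 1)
std_dev = math.sqrt(variance)
std_error = std_dev / math.sqrt(n)

print(f"n = {n}")
print(f"mean = {mean:.2f}, median = {median:.2f}")
print(f"variance = {variance:.2f}, sd = {std_dev:.2f}, se = {std_error:.2f}")
```

The same interpolation idea extends to quartiles, deciles, and percentiles by replacing n/2 with the corresponding position, such as n/4 for Q1.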

16) A financial analyst wants to compare the price of two stocks throughout the previous month. The data are listed in Table 3.6.

TABLE 3.6 Stock Price

Stock A    Stock B
31         25
30         33
24         27
24         34
28         32
22         26
24         26
34         28
24         34
28         28
23         31
30         28
31         34
32         16
26         28
39         29
25         27
42         28
29         33
24         29
22         34
23         33
32         27
29         26

Carry out a comparative analysis of the price of both stocks based on:
a) Measures of position, such as the mean, median, and mode.
b) Measures of dispersion, such as the range, variance, standard deviation, and standard error.
c) The existence of outliers.
d) The symmetry and flatness level of the distribution.
e) A line graph, scatter plot, stem-and-leaf plot, histogram, and boxplot.

17) Aiming to determine the standards of the investments made in hospitals in Sao Paulo (US$ millions), a state government agency collected data regarding 15 hospitals, as shown in Table 3.7.

TABLE 3.7 Investments in 15 Hospitals in the State of Sao Paulo

Hospital    Investment (US$ millions)
A           44
B           12
C           6
D           22
E           60
F           15
G           30
H           200
I           10
J           8
K           4
L           75
M           180
N           50
O           64

We would like you to:
a) Calculate the sample’s arithmetic mean and standard deviation.
b) Eliminate possible outliers.
c) Once again, calculate the sample’s arithmetic mean and standard deviation (without the outliers).
d) What can we say about the standard deviation of the new sample without the outliers?
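
One possible way to approach exercise 17 is sketched below in Python. It is a minimal illustration, not a reference solution: it assumes that outliers are identified with the common boxplot rule (values beyond 1.5 times the interquartile range from Q1 or Q3) and that quartiles follow the default convention of `statistics.quantiles`, which may differ from the quartile positions used elsewhere in the chapter or by Excel, SPSS, and Stata. The variable names are hypothetical.

```python
import statistics

# Investments (US$ millions) from Table 3.7, keyed by hospital.
investments = {
    "A": 44, "B": 12, "C": 6, "D": 22, "E": 60,
    "F": 15, "G": 30, "H": 200, "I": 10, "J": 8,
    "K": 4, "L": 75, "M": 180, "N": 50, "O": 64,
}
values = sorted(investments.values())

# Mean and sample standard deviation of the full sample.
mean_all = statistics.mean(values)
sd_all = statistics.stdev(values)

# Quartiles (default 'exclusive' method; an assumption about the convention).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Boxplot fences: observations outside them are treated as outliers (assumption).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]
trimmed = [v for v in values if lower_fence <= v <= upper_fence]

# Mean and sample standard deviation after removing the flagged observations.
mean_trimmed = statistics.mean(trimmed)
sd_trimmed = statistics.stdev(trimmed)

print(f"all data:      mean = {mean_all:.2f}, sd = {sd_all:.2f}")
print(f"fences:        [{lower_fence:.2f}, {upper_fence:.2f}], outliers = {outliers}")
print(f"without them:  mean = {mean_trimmed:.2f}, sd = {sd_trimmed:.2f}")
```

Under these assumptions the rule flags the investments of 180 and 200 (hospitals M and H). Comparing `sd_all` with `sd_trimmed` gives the basis for answering item d).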
