9 Analyzing Data: Import, Transformation, Compilation, Restructuring, Aggregation and Use of the Statistics Toolbox
9.1. Importing and transforming

We always recommend recording as many measures as possible when designing an experiment. It can prove costly to have to restart the experimentation process because a measure that turned out to be vital was omitted at the start. Even so, we may still need to recode variables post hoc. For example, it might be useful to recode the number of repetitions of the same task when two tasks alternate randomly, or to recode the switch from one task to another. Standard statistics programs do not allow several lines to be compared in order to recode data. They only allow line-by-line transformation of the data, for example, by taking the logarithm of each of the cells of one column in order to create a new variable. How then do we code a new variable that requires comparing several trials from one line to another? Excel functions can be used to recode sequences. We shall see how these functions are limited, and therefore why we prefer to use MATLAB®. Nonetheless, here are some examples of practical recoding for use in Excel.
For a color version of the code appearing in this chapter, see www.iste.co.uk/mathy/experiments.zip.
The following table holds the answers to a questionnaire. During data collection, for better or for worse, only affirmative responses were written down (represented by a "yes"). A "no" answer is thus indicated by an empty cell.

       Question 1    Question 2
s1     Yes           Yes
s2     Yes
s3                   Yes
s4                   Yes
A table like the following one would be more practical for adding up the number of yeses per question. This can be achieved using the basic function SUM:

       Question 1    Question 2
s1     1             1
s2     1
s3                   1
s4                   1
Total  2             3
It is possible to search for all of the yeses and replace them with a 1, but this can be tricky if we do not want to use the option "Replace all". One solution is to write a 1 whenever the cell is not empty, which can be done with the function ISBLANK. To obtain the totals above, we applied the formula =SUM(IF(ISBLANK(B2:B5);0;1)) to the first table. For the four lines of column B, the function ISBLANK returns: FALSE FALSE TRUE TRUE
The IF(…;0;1) part then transforms these Boolean values into numerical values: 1 1 0 0
Finally, the function SUM returns the sum of the 1's and 0's of each column. Unusually, the formula =SUM(IF(ISBLANK(B2:B5);0;1)) only works by pressing "Ctrl+shift+enter" (on a PC) or "cmd+enter" (on a Mac), because the function ISBLANK is applied to a vector (B2:B5). The usual reasoning is the following: in order to calculate a sum of squares in EXCEL, for example, for the numbers 1, 2 and 3, we first calculate a column of squares using the function POWER(…;2), applied to each line in order to obtain the numbers 1, 4 and 9. Only after this has been done do we calculate the sum of squares using the formula =SUM(B1:B3). To carry out all of these calculations in one go, we can write =SUM(POWER(A1:A3;2)) and once again press "Ctrl+shift+enter" (on a PC) or "cmd+enter" (on a Mac), or else an error message will appear. This type of formula is called an "array formula".

1     1
2     4
3     9
     14
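By comparison, counting the non-empty cells takes a single line in MATLAB®. Here is a minimal sketch, assuming the column of answers has been imported as a cell array (the variable name answers is hypothetical):

% answers: one cell per participant, 'Yes' or empty (hypothetical import)
answers = {'Yes'; 'Yes'; ''; ''};
nYes = sum(~cellfun(@isempty, answers))   % returns 2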
The following example shows how to recode a column that was badly coded from the start. It is not very practical to try to code the values of several variables with a single cell, as is the case in the first column of the following table, which indicates that the first subject underwent conditions x and y, and the second one only underwent condition x, the third only condition y and the fourth no conditions at all. In the second column, the formula =IF(FIND(“x”;A1)>=1;“x”; “”) does not work very well as it returns an error message when the FIND function does not find “x”. In the case where the FIND function does find “x”, it returns its position in the character string of cell A1. If the character string “yETx” appeared in cell A1, FIND(“x”;A1) would return the value 4 and FIND(“x”;A1)>=1 would
return the value TRUE as a result. This formula fails when FIND does not find anything. The solution is to use the function ISERROR. In the formula =IF(ISERROR(FIND("x";A1));"";"x"), ISERROR returns TRUE when "x" is not found, in which case the cell is left empty (""); otherwise, the cell is filled with an "x". Similarly, in order to find the y's, we use =IF(ISERROR(FIND("y";A1));"";"y"). The third and fourth columns are now as expected, with a clear distinction between the X and Y dimensions.

x and y    x           x    y
x          x           x
y          #VALUE!          y
nothing    #VALUE!
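In MATLAB®, this kind of recoding does not require trapping errors. A minimal sketch, assuming the mixed labels have been imported as a cell array of character vectors (the variable name codes is hypothetical; contains requires R2016b or later):

% codes: the original mixed labels (hypothetical import)
codes = {'x and y'; 'x'; 'y'; 'nothing'};
hasX = contains(codes, 'x')   % logical column vector, no error when 'x' is absent
hasY = contains(codes, 'y')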
While simple rules can indeed be used in EXCEL for recoding sequences (=IF …), this quickly becomes impossible to manage when the recoding is too complex. Take the example of an experiment in which the participant must randomly alternate/switch between two tasks (1 and 2), giving, for example, the sequence 11211122121. We can later determine the presence of a switch (that is, going from 1 to 2, or from 2 to 1), which is likely to have involved some form of cognitive cost. The presence of a switch is coded as a 1, and the repetition of a task as a 0. By recoding in this way, we obtain a new variable called switchLines, equal to 00110010111. Going further, we can study the cognitive mechanisms involved in alternating between two tasks more precisely. We can calculate the number of times that a task is repeated after a switch, for example --001201000 (the symbol - means that nothing is recoded as no switch has taken place yet; 2 means that 2 repetitions have occurred, etc.). This allows us to test the hypothesis that the execution of a task is sped up when it is repeated. Finally, we might want to calculate the number of repetitions executed for one task (task 1, for example) before going back to the other task (task 2). For a situation coded as 21112, this number is 3 when going back to task 2, since task 1 has been carried out three times between the two task 2's. This is actually the number of repetitions between two switches. The new recoded variable (---1--3-211) lets us test the hypothesis that the more recently a task has been inhibited, the harder it is to execute it again. The point here is not this specific example, but rather that for all three of these recodings it is useful to understand the algorithmics
involved, which we shall highlight here in a short MATLAB® program. You can find the example by opening the file chap9transform.m. Importing a data file into MATLAB® is easy – just use the import function in the menu. In MATLAB®, we start by importing the data 11211122121 from the Excel file chap9.xls, by opening it in the main menu "Files/Import data…". Once the file is imported, MATLAB® generates a variable called data that is best not deleted. A precaution worth taking is to use the command clearvars -except data, which clears everything apart from this variable. As a reminder, it is best to clear all variables before starting; otherwise, we run the risk of writing over existing variables that might not have the correct dimensions. The first part of the program is extremely simple:

for lineNum=2:length(data)
    if data(lineNum)~= data(lineNum-1)
        switchLines(lineNum)=1;
    end
end

This means that if the current line is different from the previous one, then the line contains a switch. We suggest the reader try coding the remaining variables --001201000 and ---1--3-211 in order to compare the code they obtain with ours. Here, we propose two solutions, a short one that is hard to read and a longer one that is easier to read. This example highlights a common trade-off: shorter code may be more elegant, but it is not always better in terms of readability.

Code:

%First of all, choose File/Import data and open chap9.xls, located in the chap9xls folder
%clean up
clc
clearvars -except data
%dimensioning
switchLines=zeros(length(data),1);
NTasksRepBeforeGoingBack=nan(length(data),1);
numTasksRepAfterSwitch=nan(length(data),1); %0 is a switch
%Finds the switches
switchLines(1)=0;
for lineNum=2:length(data)
    if data(lineNum)~= data(lineNum-1)
        switchLines(lineNum)=1;
    end
end
%Find the 1st and 2nd switches
switchPositions=find(switchLines==1);
firstSwitchLineNum=switchPositions(1);
secondSwitchLineNum=switchPositions(2);
% main code (short version)
foundPreviousSwitch=0;
for lineNum=secondSwitchLineNum:length(data)
    if switchLines(lineNum)==1
        numTasksRepAfterSwitch(lineNum)=0;
        lineNum2=lineNum;
        while foundPreviousSwitch==0
            lineNum2=lineNum2-1;
            if switchLines(lineNum2)==1
                foundPreviousSwitch=1;
                NTasksRepBeforeGoingBack(lineNum)=lineNum-lineNum2;
                numTasksRepAfterSwitch(lineNum2)=0;
                if lineNum-lineNum2 >1
                    incr=0;
                    for lineBtw=lineNum2+1:lineNum-1
                        incr=incr+1;
                        numTasksRepAfterSwitch(lineBtw)=incr;
                    end
                end
            end
        end
        foundPreviousSwitch=0; %reset to zero
    end
end
%Displays the result
switchPositions'
switchLines'
NTasksRepBeforeGoingBack'
numTasksRepAfterSwitch'

Output (as a reminder, NaN means Not a Number, indicating that the cell of the matrix contains nothing)

ans =
     3     4     7     9    10    11

ans =
     0     0     1     1     0     0     1     0     1     1     1

ans =
   NaN   NaN   NaN     1   NaN   NaN     3   NaN     2     1     1

ans =
   NaN   NaN     0     0     1     2     0     1     0     0     0
The next piece of code is a bit lengthier, but a lot easier to read, simpler and therefore less likely to cause bugs. This code is based on counting tasks as they happen, rather than looking back at past tasks from the current line.

Code:

%main (main code, longer, but simpler)
NTasksRepBeforeGoingBack=nan(length(data),1);
numTasksRepAfterSwitch=nan(length(data),1);
incrRep=0;
for lineNum=firstSwitchLineNum:length(data)
    if data(lineNum)~= data(lineNum-1)
        incrRep=0;
        numTasksRepAfterSwitch(lineNum)=0;
    else
        incrRep=incrRep+1;
        numTasksRepAfterSwitch(lineNum)=incrRep;
    end
end
incrSucc=1;
NsuccessiveTasks(1)=1;
for lineNum=2:length(data)
    if data(lineNum)~= data(lineNum-1)
        incrSucc=1;
    else
        incrSucc=incrSucc+1;
    end
    NsuccessiveTasks(lineNum)=incrSucc;
end
for lineNum=secondSwitchLineNum:length(data)
    if switchLines(lineNum)==1
        NTasksRepBeforeGoingBack(lineNum)=NsuccessiveTasks(lineNum-1);
    end
end
%Shows
NTasksRepBeforeGoingBack'
numTasksRepAfterSwitch'

Output

ans =
   NaN   NaN   NaN     1   NaN   NaN     3   NaN     2     1     1

ans =
   NaN   NaN     0     0     1     2     0     1     0     0     0
9.2. Compiling data files

Imagine the scenario where you have 100 separate files, corresponding to the data of 100 different participants. Would you rather open-copy-paste-close-save the 100 files, sticking them one under the other, or use a short program to carry out this work? Compiling files refers to the automation process that groups several files into a single file. The following program allows us, for example, to compile three files, each containing the subject number, their age in months, the trial number, the presented image number for a given trial and the response time for a given image recorded at the time of the mouse click (there can be one or several response times for a single image, or none, in which case the response time is zero; when the image has been clicked on several times, several lines appear with the same image number – this is the case for image 201, where the subject clicked twice). The program automatically extracts all of the files that end with the extension "dta" from a folder chosen at the start. The extension could be ".txt", ".xlsx" or another – this is not important. The advantage is that we can automatically count the number of "….dta" files, read each of the files, and remove the first line that shows the names of the columns in each of the files in order to put it back as a single column heading in the final file. Next, we can make the most of MATLAB® to export the final file in the desired format (.xls, .csv, etc.). The advantage of a program like this is that it can apply to a great variety of situations. We just need to change some of the parameters at the start for it to properly rename the column headings. This program can also serve as a base for recoding other variables (for example, making a column with the logarithms of the response times). However, we would not recommend mixing a program that compiles with a program that recodes, as the resulting program would no longer be generic and therefore could not be applied to other data.

name    ageMonths    trialNum    picNum    RT_Click
1       295          1           229       2250
1       295          2           221       1609
1       295          3           201       2281
1       295          3           201       3219
1       295          4           239       1641
etc.
Code:

%This program finds all of the files of the same type located in a folder.
%All of these files are then read and compiled into a single file.
%We use a cell array type matrix to mix numbers and character strings.
%Clears the variables
clc
clear all
%Parameters to be chosen based on the data to compile
headColumns={'name', 'ageMonths', 'trialNum', 'picNum', 'RT'};
nColumns=length(headColumns);
whereColumnsShouldBeNumbers=[2:5];
resultLocation = '/Users/fabien/DOCS/COURS/M2 Manuel MATLAB®/chap9data/';
nameNewFile='dataCompiled';
%Lists the names of the individual files
tic %Start of the timer for the entire program
tmpListOfFiles = dir(resultLocation); %Lists all of the files located in resultLocation, as well as all of the properties of the files
tmpListOfFileNames={tmpListOfFiles.name}; %Extracts just the name from the properties of the files
tmpNFiles=length(tmpListOfFileNames); %Counts the number of files
listOfDataFiles={};
for tmpFileNum=1:tmpNFiles
    if isempty(findstr('dta', tmpListOfFileNames{1,tmpFileNum}))==0
        %dir *.dta is an alternative solution to findstr; or better, use filenames=ls('*.dta')
        listOfDataFiles=[listOfDataFiles,tmpListOfFileNames{tmpFileNum}];
    end
end
Nparticipants=length(listOfDataFiles);
%% Reading the files
incrLine=0;
datamissing=[];
for fileNum =1:Nparticipants
    tic %starts the timer for reading a file
    disp(['subject = ', num2str(fileNum)])
    data_one_subject{fileNum}=textread([resultLocation listOfDataFiles{1,fileNum}],'%s'); %use textscan instead of textread if problem
    size_file=size(data_one_subject{fileNum},1);
    %We look for missing data by checking whether the number of measures read, divided by the
    %number of columns, leaves a remainder that is not zero. The function mod gives us the remainder of a division.
    if mod(length(data_one_subject{fileNum}),nColumns)~=0
        datamissing=[datamissing,fileNum] %The variable datamissing accumulates the numbers of the problematic files. It can be used to verify the files.
    end
    incrementCell=nColumns; %We skip the first line (the column heading corresponding to the names of the variables).
    incrementCol=0;
    for lineNum=2:size_file/nColumns %also skips the first line
        incrLine=incrLine+1;
        for colNum=1:nColumns
            incrementCell=incrementCell+1;
            if colNum==1
                subjectNameAnonymous=['subject' num2str(fileNum)]; %We make the most of this to make the data anonymous if it is not already the case.
                tmpData(fileNum,lineNum-1,colNum)={subjectNameAnonymous}; %lineNum-1 follows the skip of the first line
            else
                tmpData(fileNum,lineNum-1,colNum)=data_one_subject{fileNum}(incrementCell);
            end
            compiledDataWithoutHeadCol{incrLine,colNum}=char(tmpData(fileNum,lineNum-1,colNum));
        end
    end
    toc
    disp(' ')
end
%We get the column names back and we place them at the top of the new table
for col=1:nColumns
    compiledData{1,col}=headColumns{col};
end
for line=1:length(compiledDataWithoutHeadCol)
    for col=1:nColumns
        compiledData{line+1,col}=compiledDataWithoutHeadCol{line,col};
    end
end
%We EXPORT the new compiled table and add some comments describing the work carried out.
disp(' ')
disp('Export data to text and Excel files ...')
cell2csv([nameNewFile,'.csv'],compiledData,',')
cell2csv([nameNewFile,'.xls'],compiledData,'\t')
notes={'Here we provide useful information for ourselves, with useful comments for future simulations by other people'};
cell2csv([nameNewFile,'Notes.txt'],notes,'\t')
disp(' ')
disp('Total running time =')
toc %measures running time
Output
At the end, we get the following file:

name,ageMonths,trialNum,picNum,RT
subject1,295,1,229,2250
subject1,295,2,221,1609
subject1,295,3,201,2281
etc.
subject1,295,90,209,3766
subject2,285,1,567,4289
subject2,285,2,559,2391
etc.
subject2,285,89,511,4055
subject2,285,90,515,2934
subject3,287,1,372,2152
subject3,287,2,364,2961
subject3,287,3,344,2730
etc.
subject3,287,88,316,0
subject3,287,89,302,0
subject3,287,90,307,0
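Note that cell2csv is not a built-in MATLAB® function; it is typically obtained from the MATLAB File Exchange. In recent releases (R2019a onward), the built-in function writecell offers an alternative; a minimal sketch of the same export:

writecell(compiledData,[nameNewFile,'.csv'])
writecell(compiledData,[nameNewFile,'.xls'])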
9.3. Extracting numerical information from a file that is not organized as a table

Here, we reuse the previous compilation method in order to automatically search for numerical values in files that contain both text and numbers and that are not laid out as tables. Using this type of file lets us quickly obtain descriptive statistics without having to carry out calculations on raw data in a table format.
The raw data from the first subject looks like this:

firstname    familyname    ageMonths    trialNum    stimulus    response
xxx          xxx           221          1           a           a
xxx          xxx           221          2           a           a
xxx          xxx           221          3           ag          ag
xxx          xxx           221          4           cf          cf
xxx          xxx           221          5           ahb         ahb
xxx          xxx           221          6           gad         ecb
xxx          xxx           221          7           edfa        adfc
The experiment involved a classical measurement of memory span. Series of letters of increasing length were presented, with two trials per length, until two errors were committed for a given length. In this file, we can see that the participant readily recalls the two series of 1 letter, as well as the two series of 2 letters. This can therefore be qualified as a memory span of 2. To be more precise, given that they were also able to recall "ahb" (but not "gad" and "edfa"), we can say that their memory span is 2.5. It would be unfair to say that the participant can only recall 2 letters, and it would be a stretch to say that they can recall 3, since their 3-element memorization is tenuous. The program that we created for this experiment directly summarized the span score in a companion file with the extension Misc.dta (Misc for miscellaneous). This score was calculated automatically by adding 0.5 for every correct recall. The summary file data1_Misc.dta contains the following information:

Basic information + SPAN
------------------------------------------------------------------
Surname: xxx
Forename: xxx
Age in months: 221
Date ((careful, the computer might provide a date with inverted month and day)) : 03-21-2012
SPAN = 2.5
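The scoring rule itself is simple to express in code. A minimal sketch (the variable names are hypothetical; this is not the original experiment program), using the participant shown above:

% correctRecall: one logical value per presented list, in order of presentation
correctRecall = [true true true true true false false];  % a, a, ag, cf, ahb correct; gad, edfa failed
span = 0.5 * sum(correctRecall)                           % returns 2.5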
In order to calculate the memory span of 60 participants, we would have to painstakingly open the 60 files, find the SPAN score and write it in a table beside the participant number. We might also want to write down their age, as required by a description of the sample in the Methods section. The following program was created in 30 minutes, which is a lot less time-consuming; once again, it is better to spend time algorithmically,
structuring your thought process, than on repetitive tasks (which, again, are the source of many mistakes). As in the previous section, the program lists all the files, then reads them (all the fields being separated by a space). The content read from the first file, stored in the variable data_one_subject{1}, is shown below. Note that the age (in months) is always located in the 13th field (the 13th row) and the span in the second-to-last row. The program then calculates the mean and the standard deviation of the span and of the age using the function grpstats from the Statistics Toolbox. It is quite possible to do without this function, however, by using the functions mean and std. Finally, we create the histogram of the spans.

'Basic' 'information' '+' 'SPAN'
'------------------------------------------------------------------'
'Surname:' 'xxx'
'Forename:' 'xxx'
'Age' 'in' 'months:' '221'
'Date' '((careful,' 'the' 'computer' 'might' 'provide' 'a' 'date' 'with' 'inverted' 'month' 'and' 'day))' ':' '03-21-2012'
'SPAN' '=' '2,5'
'------------------------------------------------------------------'
Code:

%% Calculates the memory span for Exp #2
%Reset
clc
clear all
%Data localization
resultLocation = '/Users/fabien/DOCS/Articles/16-Mustapha 1 compression/DATA2/';
%Lists the data files
tmpListOfFiles = dir(resultLocation);
tmpListOfFileNames={tmpListOfFiles.name};
tmpNFiles=length(tmpListOfFileNames);
listOfDataFiles={};
for tmpFileNum=2:tmpNFiles
    if isempty(findstr('Misc.dta', tmpListOfFileNames{1,tmpFileNum}))==0
        listOfDataFiles=[listOfDataFiles,tmpListOfFileNames{tmpFileNum}];
    end
end
Nparticipants=length(listOfDataFiles);
%Reads each file, and searches for the span and the age in each file.
spanVector=[];
ageVector=[];
for fileNum =1:Nparticipants
    disp(['subject = ', num2str(fileNum)])
    data_one_subject{fileNum}=textread([resultLocation listOfDataFiles{1,fileNum}],'%s');
    %use textscan instead of textread if problem; use ... '%s','delimiter','\t') if the data is separated by tabs.
    %Searches for the span and age
    lastCellNum=size(data_one_subject{fileNum},1);
    decimalPosition=findstr(',',char(data_one_subject{fileNum}(lastCellNum-1)));
    spanString=char(data_one_subject{fileNum}(lastCellNum-1));
    spanString(decimalPosition)='.';
    span=eval(spanString);
    age=eval(char(data_one_subject{fileNum}(13)));
    if span < 2 %Searches for aberrant data
        disp(char(data_one_subject{fileNum}(7)))
    end
    spanVector=[spanVector,span];
    ageVector=[ageVector,age];
end
%Histogram
freq=hist(spanVector,5);
bar(freq)
set(gca,'XTickLabel',{'2' '2.5' '3' '3.5' '4'});
h = findobj(gca,'Type','patch');
set(h,'FaceColor','g','EdgeColor','w')
xlabel('Span','FontSize',12);
%Descriptive statistics
group=ones(1,Nparticipants);
[meanSpan,stdSpan] = grpstats(spanVector,group,{'mean','std'}) %use the mean and std functions if you do not have the Statistics Toolbox to calculate grpstats
[meanAge,stdAge] = grpstats(ageVector/12,group,{'mean','std'})
Output

subject = 1
subject = 2
subject = 3
...
subject = 59
subject = 60

meanSpan =
    2.5417
stdSpan =
    0.4897
meanAge =
   21.2931
stdAge =
    3.8097
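If the Statistics Toolbox is not available, the same descriptive values can be obtained with base functions. A minimal sketch of the equivalent calculation:

%equivalent without grpstats
meanSpan = mean(spanVector);
stdSpan  = std(spanVector);
meanAge  = mean(ageVector/12);
stdAge   = std(ageVector/12);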
9.4. Import, combine and manipulate data in a table format

To import the content of data files (csv, txt, xlsx, etc.), we can also use the function readtable. This creates datasets in the "table" format in MATLAB®. This format is flexible and allows us to combine data and metadata in the same table: in a single table, we can group numbers, character strings and names of columns or of rows, everything in a single object. Having named the columns, it is possible, for example, to carry out operations by calling the column name. We take the example of two text files that contain participant data (children or adults) for several immediate memory tasks. One line corresponds to one participant, with the children recorded in one data file and the adults in another. The group (school, college or university), the age, sex and memory span scores for the different tasks ("objects", "alphabet", "colors" and "kanjis" for the adults only) of the participants are reported in the corresponding columns. A preview of the two raw files to be imported (called dataAdults.csv and dataChildren.csv) is shown below.
dataAdults:

'participant'; 'group'; 'age'; 'sex'; 'objects'; 'alphabet'; 'colors'; 'kanjis'
A-01; univ; '25'; 'female'; 5; 6; 6; 6
...
A-15; univ; '25'; 'female'; 5; 5; 4; 5
A-16; univ; '25'; 'male'; 5; 5; 5; 5

dataChildren:

'participant'; 'group'; 'age'; 'sex'; 'objects'; 'alphabet'; 'colors'
E-01; 'school'; '8'; 'male'; 2; 4; 3
E-02; 'school'; '8'; 'female'; 4; 6; 3
...
E-16; 'college'; '13'; 'female'; 5; 6; 5
Before importing the data, we insert a short preamble to specify the path of the folder and clean the workspace. We then import the two source files into two tables called "children" and "adults", which we combine into a single dataset called "total", itself also a table.

% The data comes from the results of 'Adult' and 'Children'
% participants in a series of immediate memory tests
% Each participant carried out several tasks ('objects';'alphabet';'colors'; and 'kanjis' for the adults only)
% specify the directory
cd('/Users/mchekaf/Desktop/manuelMatlab®')
% returns the path of the folder that contains the code file
currentFolder = pwd;
% Creates an output folder
mkdir output
clear;clc; % clears all
______________________________________________________________
%% import data and combine them
% the function readtable imports the data into tables,
% which here we name 'children' and 'adults'
children = readtable('dataChildren.csv','Delimiter', ';');
adults = readtable('dataAdults.csv','Delimiter', ';');
% The two datasets each contain a variable 'participant' which serves as a grouping key.
% In this example, the tests are common to all groups, and one of the tests is carried out only by the adults.
% To combine the datasets we use the function outerjoin
total = outerjoin(adults, children, 'MergeKeys',true);

Output:

    participant    group        Age    sex         objects    alphabet    colors    kanji
    'A-01'         'univ'       25     'female'    5          6           6           6
    'A-02'         'univ'       22     'male'      6          6           5           6
    'A-03'         'univ'       23     'female'    6          6           6           6
    'A-04'         'univ'       22     'male'      4          6           4           6
    'A-05'         'univ'       26     'female'    6          5           6           5
    'A-06'         'univ'       19     'male'      5          6           5           6
    'A-07'         'univ'       18     'female'    3          5           6           5
    'A-08'         'univ'       23     'male'      5          5           5           5
    'A-09'         'univ'       19     'female'    4          6           4           6
    'A-10'         'univ'       21     'male'      4          6           5           6
    'A-11'         'univ'       21     'female'    5          6           5           6
    'A-12'         'univ'       25     'male'      3          5           5           5
    'A-13'         'univ'       18     'female'    5          6           5           6
    'A-14'         'univ'       20     'male'      4          5           5           5
    'A-15'         'univ'       19     'female'    5          5           4           5
    'A-16'         'univ'       23     'male'      5          5           5           5
    'E-01'         'school'      8     'male'      2          4           3         NaN
    'E-02'         'school'      7     'female'    4          6           3         NaN
    'E-03'         'school'      7     'male'      2          6           3         NaN
    'E-04'         'school'      8     'female'    4          6           3         NaN
    'E-05'         'school'      7     'male'      4          5           3         NaN
    'E-06'         'school'      8     'female'    4          3           3         NaN
    'E-07'         'school'      8     'male'      3          3           4         NaN
    'E-08'         'school'      9     'female'    5          6           5         NaN
    'E-09'         'college'    12     'male'      4          6           5         NaN
    'E-10'         'college'    13     'female'    5          3           4         NaN
    'E-11'         'college'    11     'male'      2          4           4         NaN
    'E-12'         'college'    12     'female'    5          6           5         NaN
    'E-13'         'college'    12     'male'      5          6           5         NaN
    'E-14'         'college'    11     'female'    3          4           4         NaN
    'E-15'         'college'    12     'male'      4          4           6         NaN
    'E-16'         'college'    13     'female'    5          6           5         NaN
Once the dataset has been imported, we can carry out operations to clean the data. These operations are often carried out manually and upstream, for example, by deleting a row if a participant does not fulfill the inclusion criteria, or by removing useless rows or columns. In order to delete the "kanji" column, we just need to assign the empty matrix to every row of the column called "kanji" in the "total" table, which removes the column. Similarly, we can use a variable as an indexing vector. In the following example, we have explicitly named it "toDelete". To make it easier to read, "toDelete" is a set of 1's (to delete) and 0's (to keep). We use strcmp (for string comparison) to compare character strings, in order to test whether the character string to be deleted ('A-12') is present (1) or not (0) in the "participant" column of the "total" set (total.participant). The vector "toDelete" thus identifies all the rows containing "A-12" and is therefore used to point out the rows to be deleted. We then simply ask for all the rows indicated by "toDelete" to be deleted. Once the row has been deleted, in the example, we decide to replace "A-" with "Adultes-" (adults) and "E-" with "Enfants-" (children). The function strrep (string replace) lets us search for and replace a character string with another. Here, by specifying the column (total.participant), we can replace the "A-" and "E-" as desired. It is also easy to create data subsets, grouping together all or a specific part of a dataset. For example, we can create a subset (called "sousTotal" in the code) containing only the rows for which the value of "Age" is not 8. For this, we use a logical vector, which assigns to "sousTotal" the rows of "total" whose "Age" values are different from 8. To sort the values according to a variable, we can use sortrows, specifying the column that is to serve as the sorting key, and the direction ("ascend" or "descend"). In the same command, we can specify several sorting variables, as well as the direction we want to apply to each.

______________________________________________________________
%% Organize, clean the data
% If we want to delete the column 'kanji' for example
total(:,'kanji') = []; % Deletes all of the values of the column 'kanji'
______________________________________________________________
%% If we want to delete the row of participant A-12 for example
toDelete = strcmp(total.participant,'A-12');
sousTotal=total;
sousTotal(toDelete,:) = [];
______________________________________________________________
%% We want to replace 'A-' with 'Adultes-' and 'E-' with 'Enfants-'
total.participant = strrep(total.participant,'A-','Adultes-');
total.participant = strrep(total.participant,'E-','Enfants-');
______________________________________________________________
%% select: if Age is different from 8
sousTotal = total(total{:,'Age'} ~= 8,:);
______________________________________________________________
%% Sort the data by age
total = sortrows(total,'Age','ascend')
______________________________________________________________
%% Sort the data by age (increasing) and sex (reverse alphabetical)
total = sortrows(total,{'Age','sex'},{'ascend','descend'});
9.5. Restructuring and aggregating data in MATLAB®

The last part of this chapter deals with restructuring and aggregating data. The goal is to obtain summary tables from raw data. A classic case in cognitive psychology is to repeat the measures of x conditions in order to get more robust data. For example, we might want to measure participants' response times to see whether a face is judged more attractive depending on pupil dilation. In this case, we might present 50 faces per condition. We get 100 measures per subject, but in order to simplify the statistics, it is sometimes simpler to reduce the table to two mean response times, one for each of the conditions. Note that this is typical of analysis using ANOVA. The inverse would be true for mixed-model analysis, one advantage of which is to keep all of the raw data to get a more precise estimate. If the experiment involves 20 participants, the final table would therefore have 20 rows (the participants) and 2 columns (the mean times for each of the two conditions). This reduction is called data aggregation (also called data collapse). The term restructuring simply refers to the inversion of rows and columns, which is called "transposition" in Excel. In the final section of this chapter, we will look at how to aggregate and restructure data using Excel or SPSS. We shall now look at how to aggregate and restructure data according to variables in MATLAB®. In this example, aggregation lets us find the mean score and the standard error of the mean (sem) for all of
the participants in each of the tests, separating the participants by group (school-college-university) and by sex (male-female). If we want to proceed in steps, we can first extract a data subset (dsGroupSex: ds for dataset) in which the "participant" column is removed and only the target columns are preserved, that is, the grouping variables and the numerical variables for which we want to calculate the means. Then, using the function grpstats, we can calculate the aggregated mean and standard error by group (school-college-university) and by sex (male-female). These steps can also be gathered into a single line:

______________________________________________________________
%% subset for aggregation (mean, s.e.) by group and sex
dsGroupSex= total(:,{'group' 'sex' 'objects' 'alphabet' 'colors'});
______________________________________________________________
%% aggregation by age group and by sex
agregSexGroup = grpstats(dsGroupSex,{'sex' 'group'},{'mean','sem'});
______________________________________________________________
%% same result on a single line
agregSexGroup = grpstats(total(:,{'sex' 'group' 'objects' 'alphabet' 'colors'}),{'sex' 'group'},{'mean','sem'})
______________________________________________________________
%% export the resulting table to the output folder in csv format
writetable(agregSexGroup,[pwd '/output/agregSexGroup.csv'],'Delimiter',';')

Output:

agregSexGroup =

                      sex         group        mean_objects    sem_objects    mean_alphabet    sem_alphabet    mean_colors    sem_colors
    male_school       'male'      'school'     2.75            0.47871        4.5              0.6455          3.25           0.25
    male_college      'male'      'college'    3.75            0.62915        5                0.57735         5              0.40825
    male_univ         'male'      'univ'       4.5             0.32733        5.5              0.18898         4.875          0.125
    female_school     'female'    'school'     4.25            0.25           5.25             0.75            3.5            0.5
    female_college    'female'    'college'    4.5             0.5            4.75             0.75            4.5            0.28868
    female_univ       'female'    'univ'       4.875           0.35038        5.625            0.18298         5.25           0.31339
The Statistics and Machine Learning Toolbox provides a large number of functions that we can only touch on here. It offers a comprehensive collection of tools for descriptive statistics, statistical visualization, probability distributions, hypothesis testing, ANOVAs, regression, classification, industrial statistics, etc. To illustrate this, after using grpstats and aggregation methods, we move on to a simple example of the graphical representation of the results by group (school-college-university) from the same example, with error bars
representing +/- 1 standard error. This code, which is a continuation of the previous one, creates two subsets using only the scores in each of the tests ("objects", "alphabet" and "colors") and the "group" column. Next, grpstats gives us, by aggregation, the mean and standard error for each group in each of the tests. The function errorbar then creates graphical representations to which error bars can be added. The variables are written into a formula to which other attributes of the graph can be added (in this example, the size of the markers and the thickness of the lines).

% creates two subsets: age groups and scores in the tests
groups=total{:,{'group'}};
tasks=total{:,{'objects' 'alphabet' 'colors'}};
% aggregates by group for mean and standard error of the mean
% means and std errors are assigned to meanSpan and semSpan
% with task as a dependent variable and with groups as an independent variable
[meanSpan,semSpan] = grpstats(tasks,groups,{'mean','sem'})

Output:

meanSpan =
    3.5000    4.8750    3.3750
    4.1250    4.8750    4.7500
    4.6875    5.5625    5.0625

semSpan =
    0.3780    0.4795    0.2631
    0.3981    0.4407    0.2500
    0.2366    0.1281    0.1700

______________________________________________________________
%% Graph
figure1=figure; % States the name of the figure
fig = gcf;
% Creates the axes
axes1 = axes('Parent',figure1,'FontSize',16,'Color','none');
% draws the graph with error bars (+/- 1 s.e.) corresponding
% to the values contained in meanSpan and semSpan
errorbar1=errorbar(meanSpan,semSpan,'MarkerSize',8,'LineWidth',1);
______________________________________________________________
%% specifies the attributes of the curves, the line style and marker type
set(errorbar1(1),'LineStyle','-','Marker', 'none');
set(errorbar1(2),'LineStyle','-.','Marker', 'd','color','r');
set(errorbar1(3),'LineStyle','--','Marker', 'o','color','b');
xlabel('Age','FontSize',22)
ylabel('Prop. correct','FontSize',22)
box('off');
legend({'objects','alphabet','colors'},'FontSize',14,'Location','BestOutside'),legend('boxoff');
set(gca,'color','none')
set(axes1,'Color','none','FontSize',14,'XTick',[1 2 3],'XTickLabel',{'School','College','University'});
______________________________________________________________
%%% exports the plot
% Exports the graph in pdf and jpg format
saveas(gcf,[pwd '/output/errorbarPlot.pdf'])
saveas(gcf,[pwd '/output/errorbarPlot.jpg'])

Output:
Figure 9.1. For a color version of this figure, see www.iste.co.uk/mathy/experiments.zip
We continue with an ANOVA-type analysis of variance (one-way ANOVA), in which we want to measure the effect of the "age group" factor on performance in two of the three memory tests. We start by calculating the mean span (spanAvg) per participant. Next, we create the dataset we are interested in, dataGroup, in which we keep, over two columns, the mean of each participant in the two tests we are interested in (for example, "objects" and "colors") and the group the subject belongs to ("school",
"college" or "university"), with one row for each of the 32 observations. Let us imagine that we never planned to analyze the effect of the different kinds of stimuli, but only the effect of the group. In this case, a simple ANOVA can be calculated. The ANOVA is calculated using the formula anova1(dependent variable, group), which in our example is anova1(dataGroup.spanAvg, dataGroup.group). By default, the result is given in a new output window with a graph.

Code:
______________________________________________________________
%% calculate the mean of each participant for two memory span tests
% calculate the mean obtained in the tests, shown in the column spanAvg
total.spanAvg = mean(total{:,{'objects' 'colors'}},2);
______________________________________________________________
%% One Way ANOVA
dataGroup = total(:,{'group' 'spanAvg'});
p = anova1(dataGroup.spanAvg,dataGroup.group)

Output:

Source     SS         df    MS         F        Prob>F
-------------------------------------------------------
Groups     11.0312     2    5.51562    10.89    0.0003
Error      14.6875    29    0.50647
Total      25.7188    31
Figure 9.2. For a color version of this figure, see www.iste.co.uk/mathy/experiments.zip
In order to demonstrate how to restructure a dataset, we will use an example of data taken from an immediate memory test in which there were three presentation orders. The type of order varied depending on whether the items were organized by similarity (s), dissimilarity (d) or regularity (r). The data table has 10 rows per participant, one row per trial, and for each trial gives the order type (ordertyp), the response time (RT) and the result (accuracy). We would like to obtain a table with only one row per participant and, in the columns, the mean response times and mean correct answers for each of the three order conditions, which results in 6 columns on top of the "participant" column. A (truncated) preview of the raw file called "compilation.csv" (extracted from real data for the example) displays the following:

participant    ordertyp    RT       accuracy
1              r           15309    0
1              d           10141    0
1              r            9344    1
…
1              d            5438    0
2              r           16297    0
2              r            6859    0
4              r           17156    0
4              d            7406    1
4              d            8438    0
…
The function unstack lets us simultaneously restructure and aggregate the data according to the desired grouping variables. By default, the grouping variable is the one that does not appear in the formula; in the example, this is the participant number. The function unstack spreads one variable of the original table over several variables of the new table ("myRestructuredData"). Thus, myRestructuredData = unstack(myData, 'accuracy', 'ordertyp'), applied to the table "myData", creates the restructured table "myRestructuredData" by splitting the variable "accuracy" according to the modalities of "ordertyp" (r, s, d). In our example, we would like to directly obtain the means of the response times and of the correct responses, aggregated by order type for each participant. For this, we place the two variables to be aggregated in braces and add the aggregation instruction for the mean, as shown below.
Code:

% demo file for restructuring and aggregating data
% The data is extracted from an immediate memory test in which there were
% three types of presentation order. The order type varied depending on whether the
% items were organized by similarity (s), dissimilarity (d) or regularity (r)
% specify the directory
cd('/Users/mchekaf/Desktop/manuelMatlab®/statisticsTbox')
clear;clc; % clear all
______________________________________________________________
%% import data then restructure-aggregate it
myData = readtable('compilation.csv','Delimiter', ';');
myRestructuredData = unstack(myData,{'accuracy','RT'},'ordertyp','AggregationFunction',@mean)
% creates myRestructuredData by ungrouping the variable 'ordertyp' (r, s, d).
% AggregationFunction lets us aggregate for the mean of the restructured data
______________________________________________________________
%% exports in csv format
writetable(myRestructuredData,[pwd '/output/myRestructuredData.csv'],'Delimiter',';')

Output

myRestructuredData =

    participant    accuracy_d    accuracy_r    accuracy_s    RT_d      RT_r      RT_s
    1              0             0.2           0.5           8083.7    7696.2     6547
    2              0.33333       0             0             9729      12554     15535
    3              0.5           0.33333       0             7366.2    10073     15514
    4              0.33333       0             0             8880.3    12711     12515
9.6. Restructuring and aggregating data with Excel or SPSS

The two solutions that we suggest for aggregating data, in order of preference, are the use of pivot tables in Excel, and the Aggregate and Restructure functions in SPSS. An excellent tutorial on these two SPSS functions is provided by Lacroix and Giguère (2006). Moreover, their article is very handy for following the steps required for
restructuring a table, as SPSS uses a vocabulary that is not always very intuitive. With regard to the first solution, restructuring data using Excel, the PivotTable function can be found in the "Insert" menu in recent versions of Excel, or in the "Data" menu in older versions. In the following starting file, we can find the subject number, the trial number, the trial type (r, s, i or d) and the response time (RT).
The following pivot table shows the mean RT for the conditions r, s, i and d. We have taken the simplest possible example, but we could have chosen a more complex option in order to get means for each of the subjects.
We now look at the SPSS solution, which is preferable as it helps avoid "slips" of the mouse, which can be risky in Excel and other software. The data below (Figure 9.3) comes from the results obtained from participants in a short-term memory test. The number of trials per subject was 50, and we can see that for the same subject, there are as many rows as there were trials carried out. This data table therefore has 50 × n rows (where n is the number of subjects). If we carry out an analysis of the response time (RT) by the type of presentation order (variable called "type" in the table, whose modalities are the values r, s, i or d) over the whole dataset, the number of degrees of freedom will be high and the results of the test will be biased (by confounding inter- and intra-individual variance). This is why it is a good idea to carry out aggregation and restructuring of the data in order to get a single row per subject, unless, again, you wisely adopt a mixed-model approach, which uses a long-format table to take all of the trials into account. In SPSS, the golden rule is that the data from one subject must only appear in a single row. A new row must correspond to a measure that is independent of the previous one. Aggregation
therefore lets us calculate a mean value for each of the manipulations (the measures having been repeated for each of these manipulations in order to obtain more precise means). Restructuring then lets us transpose these mean results onto a single row.
Figure 9.3.
To start with, after opening the original data file in SPSS, click on the “Data” tab and choose the option “Aggregate”. A new window then opens in which, among all of the variables (left column), you can select the aggregation criteria that you want and the variables that you want to analyze. In our example (Figure 9.4), we selected “type” and “subject” as the aggregation criteria (aggregation is necessarily done by subject if we want to obtain one row per subject) as well as the response time “RT” and “accuracy” as variable summaries. You have the option of adding the aggregated variables to an active dataset, or creating a new file containing only aggregated variables. We would recommend the last option in order to preserve the original data file. You can then rename this new file, making
sure to use a name that is as explicit as possible. You are likely to generate as many files containing aggregated variables as you have analyses to carry out. If the name of the file is a precise reflection of its content, you will find it more quickly.
Figure 9.4.
In our example (Figure 9.5), for each subject, we obtain one row per order type (our aggregation criterion), so four rows per subject, corresponding to the means obtained for each of the possible orders (instead of the 50 rows at the start). This result is not sufficient though, as we still have to transpose these rows into columns in order to obtain one row per subject.
Figure 9.5.
In order to get one row per subject, we have to restructure the table to invert the rows and columns (this operation is also called a transposition). This way, instead of having a single column for the response times, for example, we get four response time columns corresponding to the four possible orders. To carry out this operation, click on the "Data" tab and choose the option "Restructure…". The restructure data wizard opens in a new window and offers three choices for restructuring: restructure the selected variables into cases, restructure the selected cases into variables, or transpose all data. In our case, in order to reorganize the datasets from each of the observation groups into a single row, we have chosen the second option (Figure 9.6). Now, click on "next" to select the observations to be transformed into variables (Figure 9.7). To do this, you must select the identifier variable(s) (in our example, the subjects, as we want to obtain one single row per subject) and the index variable(s) (here, the presentation order type).
Figure 9.6.
Figure 9.7.
After clicking on next, you can close the wizard by clicking on “finish” (Figure 9.8). The following page lets you choose the order in which the columns are laid out, if you so wish.
Figure 9.8.
Unlike aggregation, which keeps the data table, restructuring modifies it directly. In our example, the result can be seen in Figure 9.9. Our table contains a single row per subject with, for each of the two variables “accuracy” and “RT”, four columns corresponding to the four presentation order types, identified with the suffixes d, i, r and s. Random factors: this is a complicated point for those who are not trained in advanced statistics at the graduate level, but there is a way of avoiding data restructuring in order to preserve the recording of successive trials over successive rows by choosing statistical analyses using the subject factor as a random factor (or by using mixed models, which are more complex; see Baayen et al. 2008). The three following tables are an illustration of the preceding section, showing the successive steps needed to aggregate and restructure the data.
Figure 9.9.
Factor    Trials    Subject    Result
1         1         1           1
1         2         1           2
1         1         2           3
1         2         2           4
1         1         3           4
1         2         3           5
1         1         4           6
1         2         4           8
1         1         5           7
1         2         5           9
2         1         1          10
2         2         1          12
2         1         2          13
2         2         2          14
2         1         3          16
2         2         3          20
2         1         4          23
2         2         4          25
2         1         5          25
2         2         5          19
After restructuring: Condition 1
Condition 2
Trial 1
Trial 2
Trial 1
Trial 2
1
2
10
12
3
4
13
14
4
5
16
20
6
8
23
25
7
9
25
19
After aggregation:

Result Condition 1    Result Condition 2
1.5                   11.0
3.5                   13.5
4.5                   18.0
7.0                   24.0
8.0                   22.0
Nonetheless, we can do without these steps by choosing a univariate ANOVA with the subject factor declared as a random factor, applied directly to the first (long-format) table. The analysis gives F(1,4) = 85.6, p < .001; the value and the degrees of freedom are the same as those of a repeated-measures ANOVA carried out on the last table, after aggregation.
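In MATLAB®, this type of analysis can be run with anovan by declaring the subject grouping variable as random. A minimal sketch under that assumption (the vectors below simply retype the first table; this is not code from the original chapter):

% Long-format data: 2 conditions x 5 subjects x 2 trials (from the first table above)
factor  = [ones(10,1); 2*ones(10,1)];                 % condition 1 then condition 2
subject = repmat(kron((1:5)', [1;1]), 2, 1);          % subjects 1..5, two trials each
result  = [1 2 3 4 4 5 6 8 7 9 10 12 13 14 16 20 23 25 25 19]';
% Condition is fixed, Subject is random; Condition is then tested against
% the Condition*Subject interaction, which reproduces F(1,4) = 85.6
p = anovan(result, {factor, subject}, 'random', 2, ...
    'model', 'interaction', 'varnames', {'Condition','Subject'});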