Data management and data analysis

G. Seegers, H. Kempff and C.J. van Rees *

Poetics 16 (1987) 535-552. North-Holland.

* Gwen van Duijvenbode, Mariette Deenen and Gonny de Rooij were helpful in preparing the manuscript. Peter Nieuwint corrected our English. Correspondence address: G. Seegers, Tilburg University, Department of Language and Literature R 24, P.O. Box 90153, 5000 LE Tilburg, The Netherlands.
To raise relevant questions regarding not only the nature and functioning of literary institutions such as publishing houses, public libraries, book clubs and criticism, but also the behaviour of book-buyers and book-borrowers, one should have relevant data at one's disposal. Data bases are likely to be quite extensive. Usually, even before starting to collect these data, the researcher who has adequate computer facilities at his disposal will benefit from considering the question of how to manage and analyse the data (to be) collected. The first part of this paper focusses on a number of special programs for data storage and data retrieval, including facilities for manipulations like selecting and ordering the data. These programs are known as data management systems. The second part contains a discussion of some fundamental aspects of data analysis. The usefulness of a number of descriptive and statistical methods for analysis is illustrated on the basis of a sample of investigations current at the department of the sociology of literature at Tilburg University.

1. Background

From the 1970s on, a change in the object of the study of literature has been taking place. Students of literature, as can be seen from their publications in Poetics, have shifted their attention. Up till the '70s they focussed on inquiries into the assumptions concerning the linguistic character of literary texts (cf. Van Rees (1983b)), an emphasis that was based on the idea that a linguistic analysis of literary texts could reveal their specific properties and characteristics. This idea hinged on the assumption that the literary character of texts could be defined by intrinsic textual properties. As a consequence, little attention was paid to the role of literary texts within the context of production, distribution, acquisition and assessment. In changing the topics of research, students of literature show a growing interest in the role of the reader and in the production and distribution of books. This interest arises from the view that the assignment of value to literary works - or, to works of art in general - is a social process, and cannot be restricted to an analysis of the intrinsic properties of these works.


The research that has developed on the basis of this view takes as a starting point the assumption that texts derive their 'literary' or 'aesthetic' character from factors related to this social context. At the same time a critical attitude towards the scientific status of the study of literature arose. Some researchers argued in favour of using the methods of research that are common in the social sciences. In line with this view, it is argued that 'literature' should be regarded as a social construct and is to be located within a social context. In investigating this context, a break with the hermeneutical approach to literature, which focussed on the interpretation of texts, became apparent. As a consequence of this shift in focus, there is a need for other research methods; especially the empirical methods developed in sociological and psychological research are relevant.

A number of important topics of research in the sociology of literature are derived from the social context within which literature exists. In the first place, special attention is given to the role that is played by students of literature (critics, students of literary history); secondly, research focusses on the relationship between these and other institutions that are engaged in the production and distribution of literary works (publishers, libraries, schools). This research is directed towards the specific activities of these institutions, the kind of relations between them, the influence they have on the valorisation of literature, and the way literary works are classified in hierarchical order. The idea is that texts owe their 'literary' status to the fact that specific social groups and institutions subject them to a process of estimation and valorisation.

It was Bourdieu (1979) who laid the groundwork for this kind of research in literature. In his view there is a constant struggle between the institutions, which all together constitute the literary field. This continuous rivalry is the result of each institution trying to achieve the monopoly of conferring cultural legitimacy on works of art. Fundamental is Bourdieu's idea of the interdependency of the power that institutions have to get their judgements accepted: the position of any of these institutions cannot be determined independently of the others. In his empirical research, Bourdieu accentuates the dominating role of institutions like the educational system and art criticism (cf. Bourdieu (1977, 1983)).

To raise relevant questions regarding these institutions, one should have relevant data at one's disposal. At present, most of these data are lacking, or, when available, they are not amenable to further processing. It may be important to know how much it will cost to publish a literary title; how the range of works published by publishing houses or the collection of a public library is composed; which titles in the field of children's and juvenile literature are used in education; how the titles in publishers' lists or contributions to literary journals are distributed over the various literary categories; which authors contribute to literary journals, and what is the rate of their contribution; to what extent literary books that have been published during the last 10 years have been reviewed in (national and regional) papers, in weeklies and in literary journals.


Anyone attempting to answer these and similar questions has to collect the relevant data first. The issues enumerated have not been listed haphazardly, but contain examples of the data that researchers at Tilburg University have collected in several studies over the last couple of years. In part these data have been taken from existing data files, for instance the registered loans in a public library or the records of publishing houses (cf. Verdaasdonk (1985), Seegers and Verdaasdonk (1987)), although in their original form many of these data were hardly fit for further processing. Most data, however, had to be collected by the researchers themselves, for instance by conducting a survey.

2. Data management and data analysis

Investigating these and similar topics implies, besides the use of methods of data collection, the application of methods for data management and data analysis. In this paper we will restrict ourselves to the latter methods, leaving aside the issue of how to collect the data. A prerequisite is the availability of adequate computer facilities. Without them, the collection, management and analysis of the data, as applied in empirical research, are practically impossible. Some fundamental aspects will be discussed, on the assumption that the researcher has adequate computer facilities at his disposal. The discussion will be restricted to a short sketch of the limits of application of these methods.

Concerning data management, we will emphasize the problem of how to manage extensive data bases, which usually have a complex structure. To handle these bases with the help of a computer, special programs for data storage and data retrieval have been developed, including facilities for manipulations like selecting and ordering the data. These programs are known as data-base management systems. In the second part of this paper some aspects of data analysis will be discussed. Some general remarks will be made concerning the essence of statistical methods. In addition, some descriptive and statistical methods for analysis will be introduced; their usefulness will be illustrated on the basis of a sample of current investigations.

3. When to use a Data-base Management System (DBMS)?

Generally, the use of a computer application has its advantages and its drawbacks. To select from a number of possible applications the one that appears most appropriate, one has to know about the various possible alternatives.


This holds for word processing as well as for data management. When only a small number of texts has to be produced, it does not seem worthwhile to acquire an automatic word processing system. The same argument holds for data-base management systems: a DBMS should not be applied unless the amount of data is large.

Data can in principle be managed in the following three ways. Firstly, there is the well-known pencil-and-paper method, for instance in the form of a list of addresses or in the form of a box of filing cards. Secondly, data can be stored in a computer memory, for instance in the form of one or more files; thus they can be manipulated, combined, selected and printed with the aid of (standard) programs. The third possibility is the application of a genuine data-base management system. Whether or not to use a DBMS depends first of all on the amount and the complexity of the data. Small amounts of data, or data which are not subject to frequent change, can easily be managed without a DBMS. In such a case, it is clearly preferable not to rely on a DBMS. On the other hand, data-bases are likely to grow, and such an extension may render it difficult to handle the data without a DBMS.

A fictitious example may be helpful in clarifying this. Suppose that a student of literature is interested in the work of a particular writer, notably his poems. The researcher decides to make an inventory of the poet's production. Having collected all the volumes of poems, he notes a number of data on filing cards: the names of the poems, the titles of the books in which they were published, the year of publication and the name of the publisher. The cards are kept in alphabetical order in an old shoe box, which is the appropriate way to handle data of this kind. Then one day he becomes interested in the literary reception of several authors and the role reviewers and critics play in this reception. Before burying himself in the archives of the newspapers, he decides to make a design for the data-base and to purchase a personal computer to manage this data-base. The data-base program that he is going to make will also be used to feed the data into the computer.

The data-base design is fairly simple. For every published volume the researcher wants to collect all reviews. For each review, the name of the critic, the title of the newspaper, the month and year of publication of the review, and the number of words per review are recorded. In data-base terms these are called the attributes. Once the kind of information to be recorded for each entry is known, the values this information can take can be specified. The set of values information can have is called the domain of an attribute. Specifying a domain essentially amounts to restricting information values. 'Text' information such as names cannot readily be restricted apart from length, but other information (dates, for instance) can be specified more precisely, while for still other information (e.g. sex) the values can be enumerated.

Table 1
Example of the structure of a simple review file.

Attribute         Domain         Attribute          Domain
Book title        Text (30)      Title newsp.       Text (30)
Publ. name        Text (30)      Month of publ.     Digits (2)
Year of publ.     Digits (4)     Year of publ.      Digits (4)
Critic's name     Text (30)      No. of words       Digits (4)
Basically, this situation is similar to one in which a data-base on filing cards is designed; these cards are divided into regions, each of which is bound to contain a particular bit of information, for instance the name of a reviewer, a date of publication, etc. Sloppy data-bases will result if these regions cannot accommodate this information. In developing a computer data-base one is likewise bound to specify in advance the possible values of the attributes. In addition, a number of more specific constraints on these values have to be formulated. By specifying the domains of attributes, one is able to check the correctness of data entry at an early stage, which results in fewer errors in the data-base. 'Textual' information such as names, titles etc. cannot be constrained, although the maximum length of the values can be specified. Other domains, however, can and must be defined more precisely. Numbers are made up of digits, and for dates some format can be designed that consists of predefined elements. A more or less formal account of the data-definition of the data-base could be the one shown in table 1.

For several reasons one should strive for the greatest precision. First there is a practical reason. Suppose a name is too long for the region reserved for it on the filing card. If written by hand, it could be forced to fit by writing small letters. But a computer is unable to write small letters; in order to avoid complicating programs, the maximum length of information has to be specified in advance. Specifications of length can be changed, but only at the cost of time and effort. Another reason to specify domains of attributes has already been mentioned: by defining domains one will reduce the risk of errors in the data-base (although one cannot prevent the occurrence of errors). The third reason is theoretical. In some data-base management systems (notably relational DBMS) certain operations can be performed only on attributes having equal domains. For instance, two attributes containing textual data may only be compared when their domain specifications are equal.

So far we have focussed on the structure of a data-base. Although elementary, our observations provide a basis for understanding the types of programs that are intended to handle the data-base. Programs of this kind have to perform the following tasks.


First, input of large amounts of data must be possible. During this input, the constraints on the domain definitions must be applied. Second, procedures for adding new information to the data-base must be provided for, and there must be an opportunity to correct errors that are detected. These tasks belong to the field of data maintenance.

Considering the example of a data-base, the proper task for the data-base management program consists of the following. The program must be able to produce (alphabetically ordered) lists of books of poems per author, and a list containing individual book titles along with the reviews of each title. Furthermore, it would be convenient if the program could produce lists of names of critics and newspapers. All these different tasks consist of the retrieval and manipulation of the stored information. To accomplish these tasks quickly and efficiently it is essential to organize the data-base in one way or another. When, for instance, the information on the reviews of volume X is needed, and the data-base is not organized, the only way to find this information is to search the data-base sequentially until all volumes with the title X are found. If the data-base is alphabetically organized it is possible to access the information more directly and thus more quickly.

This example may illustrate yet another point. Titles and names are not unique. For a number of operations, for instance that of updating existing information, it must be clear which name or title is requested. Every item must therefore be made unique by (a combination of) attribute values. This value (or this combination of values) is called the key of the record; any such key must be unique by definition.

A program as briefly outlined might operate satisfactorily. In general, however, it is difficult to adapt ad-hoc programs; changes in the original questions are difficult to realize. An important aspect of data management, not mentioned in the previous example, deserves special notice. In our previous example all the information was contained in one file. This is not necessarily always the case, and sometimes it is necessarily not the case. For instance, to eliminate redundancy and to reduce possible errors, it is advisable to store information in several subfiles. Suppose, for instance, that in the review data-base the name of a newspaper or reviewer occurs more than once. If the same name is found in different places, this is a potential source of errors. The solution to this problem is reduction of redundancy. This can be achieved by constructing separate files for newspapers and reviewers, where each newspaper and critic is represented once, in combination with a unique key, for instance a code. This code replaces the reference to the newspaper or critic in question. Errors in names are thus confined to one file only and can be corrected there. Data-bases will often be realised in this way, i.e. in the form of a set of subfiles. Although this leads to a more complex data-base, an organization of this kind permits a reduction of error rates and may increase the efficiency of maintenance.

Having reviewed a number of aspects of data management, or, more specifically, having indicated some ways of managing a data-base, we shall now focus on Data-base Management Systems proper.
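
The splitting into subfiles with key codes can be sketched concretely. The schema below is a hypothetical illustration (Python with the built-in sqlite3 module), not the system described in this paper; the table and column names are invented for the review example.

    # Hypothetical sketch: the review example split into subfiles, each row
    # identified by a unique key, and names stored only once to reduce redundancy.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE reviewer  (r_code TEXT PRIMARY KEY, reviewer_name TEXT);
    CREATE TABLE newspaper (n_code TEXT PRIMARY KEY, newspaper_name TEXT);
    CREATE TABLE book      (b_code TEXT PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE review    (b_code TEXT, r_code TEXT, n_code TEXT,
                            year INTEGER,
                            month INTEGER CHECK (month BETWEEN 1 AND 12),
                            no_of_words INTEGER);
    """)

    # A reviewer's name is entered once; reviews refer to it by the key only,
    # so a spelling error can be corrected in a single place.
    con.execute("INSERT INTO reviewer VALUES ('REV1', 'J. Smith')")
    con.execute("INSERT INTO newspaper VALUES ('NWS1', 'De Volkskrant')")
    con.execute("INSERT INTO book VALUES ('BK1', 'Lucky Jim', 1954)")
    con.execute("INSERT INTO review VALUES ('BK1', 'REV1', 'NWS1', 1975, 3, 750)")
    con.commit()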


4. What is a Data-Base Management System?

A DBMS is a program or a set of programs that is independent of its application. In other words, from the point of view of a DBMS it is irrelevant what kind of information is contained in the data-base or how this information is organized. Note that this is not true for ad-hoc programs: they only operate on the data-base for which they have been developed. The structure and features of the data are - so to speak - hardwired into these programs. A DBMS therefore has to provide methods to define the data.

There are a number of formal constraints on the design of data-base management systems. First, a DBMS must permit the reduction of redundancies in a data-base. Next, it must allow for shared access to the data; that is, the data-base must be accessible to more than one user at a time. Access to a data-base normally implies the permission to change, add or restructure data in the data-base. A DBMS must guarantee that, although several users can access the same part of the data-base, the data will be identical for all users. This is called data integrity.

Normally a DBMS represents data at several levels. The lowest level of representation is the physical storage of data, from which the user should be shielded: ordinary users should not be bothered with the details of the actual storage. A DBMS presents data to a user by means of a higher level of representation; this is called the conceptual view. A DBMS has the option to define different views for different (classes of) users. Views differ with respect to the amount of data they make visible or with respect to the organization of the data. In this way it is possible to protect certain information that is available in the data-base but not open to public access. Views can also determine which actions a user may perform: some users are only allowed to extract data from the data-base, while other users have other permissions. The highest level of data representation in a DBMS is the external representation, the representation of (part of) the data after selection and ordering by a user. The tool that a DBMS offers to manipulate data in the conceptual view and to produce the external representation is called the 'query language'. SQL is a well-known example of such a query language.
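
As an illustration of such a query language, the fragment below continues the hypothetical sqlite3 sketch given in section 3: a single SQL statement combines the subfiles and produces an external representation (reviews per book, ordered by title). The query and names are again invented for illustration and are not drawn from the original study.

    # Hypothetical sketch, continuing the sqlite3 schema from section 3 (the
    # connection `con` and its tables): a query produces an ordered external
    # representation without the user touching the physical storage.
    rows = con.execute("""
        SELECT book.title, reviewer.reviewer_name, newspaper.newspaper_name,
               review.year, review.month, review.no_of_words
        FROM review
        JOIN book      ON book.b_code = review.b_code
        JOIN reviewer  ON reviewer.r_code = review.r_code
        JOIN newspaper ON newspaper.n_code = review.n_code
        ORDER BY book.title, review.year, review.month
    """).fetchall()

    for title, critic, paper, year, month, words in rows:
        print(f"{title}: reviewed by {critic} in {paper}, {month}/{year}, {words} words")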

5. Relational and Network DBMS

There are two different kinds of DBMS: the relational one and the so-called network DBMS. They differ with respect to the model underlying the system. Each of these models implies a distinct way of defining relations between the data, and a mode of combining data given these relations.


Table 2
Simple example of the contents of an author data-base.

Author          Title
Kingsley Amis   The old devils
Kingsley Amis   Lucky Jim
John Irving     The hotel New Hampshire
John Irving     The world according to Garp
John Irving     The water-method man

Usually a data-base will consist of more than one file for storing the data; that is, it is made up of subfiles. A DBMS must permit the combination of these subfiles to produce sensible information. The two types of DBMS combine data in different ways. A relational DBMS can combine data from subfiles on the basis of equal values for certain attributes that have equal domains. A network DBMS, on the other hand, can combine data from subfiles on the basis of existing links in the data-base.

A data-base that, for example, stores information on books will consist of more than one file, for reasons already mentioned. One subfile contains information on writers and another subfile will contain (part of) the information on books. Suppose that the information shown in table 2 has to be stored in the data-base. A relational DBMS might store this information as shown in table 3. If the domains of the attribute A-code are equal and the values of this attribute are equal for a certain pair of records, a relational DBMS can combine these records from the two subfiles and produce information on books. Connections between records in subfiles are made at the time the subfiles are searched. Changes in the order or substance of the subfiles are irrelevant with respect to the possibility of retrieving information, and do not affect retrieval. Of course, changes in content will result in different information, but they do not affect the possibility of retrieving information. In a network DBMS, on the other hand, subfiles may be organized as shown in table 4. The arrows represent links or pointers to records in some other subfile.

Table 3
Example of the storage structure of a relational DBMS.

Author-file
A-code   Author name
AMI1     Kingsley Amis
IRV1     John Irving

Title-file
A-code   Title
AMI1     The old devils
AMI1     Lucky Jim
IRV1     The hotel New Hampshire
IRV1     The world according to Garp
IRV1     The water-method man

Table 4
Example of the storage of a network DBMS.

Author-file
Author name        Links (a)
Kingsley Amis      -> T1, T2
John Irving        -> T3, T4, T5

Title-file
T1   The old devils
T2   Lucky Jim
T3   The hotel New Hampshire
T4   The world according to Garp
T5   The water-method man

(a) The 'T' is an indication of the subfile (here: the title-file).

In a network DBMS, information bearing on the same topic is retrieved by following these pointers. Relations between pieces of information have to be defined at input, by specifying the information a pointer is linked with. These differences in storage and retrieval procedures between relational and network DBMS have consequences for the efficiency of retrieval, but also for what can be retrieved. A network DBMS is fast when retrieval adheres to the pointer structure as defined at the outset, i.e. when the data-base was designed. In other words, a properly defined and constructed network data-base is efficient in retrieving the information that is linked by predefined pointers. Other information may be retrieved, but retrieval in most cases is inefficient and sometimes even impossible. Changing and updating a network data-base is a complicated matter, because changes to subfiles are not confined to one subfile only; besides, any updating implies the updating of the pointer structure.

Retrieval in a relational data-base is generally slower, because connections are made by means of comparisons of values of attributes, while a network DBMS permits fast retrieval by following the stored linkages. Retrieval in a relational DBMS also implies a more time-consuming process of comparison than the activation of pointers. The advantages of relational data-bases are, however, that relations can be made whenever attributes have equal values, and that changes and updates are simpler because in most instances they do not affect other subfiles.
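
The contrast between the two retrieval styles can be made concrete with a small sketch. The structures below are hypothetical and only mimic tables 3 and 4: the relational variant combines the two subfiles wherever the A-code values are equal, while the network variant simply follows pointers stored with each author.

    # Hypothetical sketch of the two retrieval styles for the data of tables 3 and 4.

    # Relational style: two flat subfiles; records are combined by comparing A-code values.
    author_file = [("AMI1", "Kingsley Amis"), ("IRV1", "John Irving")]
    title_file = [("AMI1", "The old devils"), ("AMI1", "Lucky Jim"),
                  ("IRV1", "The hotel New Hampshire"),
                  ("IRV1", "The world according to Garp"),
                  ("IRV1", "The water-method man")]

    def titles_relational(author_name):
        # Connections are made at search time, by value comparison.
        codes = {code for code, name in author_file if name == author_name}
        return [title for code, title in title_file if code in codes]

    # Network style: each author record carries pointers (here: list indices) into the title file.
    titles = ["The old devils", "Lucky Jim", "The hotel New Hampshire",
              "The world according to Garp", "The water-method man"]
    authors = {"Kingsley Amis": [0, 1], "John Irving": [2, 3, 4]}

    def titles_network(author_name):
        # Retrieval follows the stored links; it is fast, but only along predefined paths.
        return [titles[i] for i in authors.get(author_name, [])]

    print(titles_relational("John Irving"))
    print(titles_network("John Irving"))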

6. An example of a data-base application

The research group of the department of the sociology of literature at Tilburg University has collected a number of large data-bases. One of these, the so-called review data-base, will serve to illustrate the prospects and problems of a DBMS in maintaining a large data-base of a moderately complex nature. The review data-base consists of all the reviews that appeared over the period 1975-1980 in the Dutch press (dailies and weeklies), and were collected by the Letterkundig Museum ('Literary Museum').

Table 5
The organization of the review data-base.

Attribute                          Domain
Title of the book reviewed         text
Author of the book                 text
Year of publication                number < 99
Genre of the book                  P(oems); G(eneral); E(ssay); N(arrative prose)
Name of the publisher              text
Publication of the review:
  year                             number < 99
  month                            number > 0 & < 13
  name of the reviewer             text
Sex of the reviewer                M; F; U(nknown)
Number of titles discussed         O(ne); M(ore)
Number of words in the review      number
Title and kind of newspaper        text

The number of reviews in the data-base is approximately 20,000. The data definition for a review consists of the attributes shown in table 5. The data-base is stored as one file. This means that all the information on a review is found in one record of a file. This storage design is in conflict with the rules on proper data-base design, some of which have been explained before. The consequence of this choice of design is what is called the update anomaly, and data inconsistency. Because the data-base contains redundant information, updating certain review data may lead to inconsistencies. The name of a particular newspaper, for instance, is represented hundreds of times in the data-base. Changes in this name, therefore, have to be made in as many records. If an error is made, with, for instance, an incorrectly spelled newspaper name, the data-base will become inconsistent: the same newspaper will be represented under several names.

This problem can be solved by introducing normalization rules. These are rules specifying constraints. Data-base theory defines five normal forms, ordered by degree from one to five. Normal forms restrict the information a record in a subfile is allowed to contain. The higher the degree of normalization, the greater the number of subfiles that represent the data. If a data-base is in a particular normal form, it is guaranteed to be free of certain forms of data inconsistency. As usual, everything has its price. It will be clear that when a data-base consists of more subfiles - a consequence of higher normal forms - it will take more time to perform retrieval operations. Data integrity can thus be bought at the cost of efficiency.

Without lengthy exercises in data-base theory and normal forms, a number of consequences of normalization can be illustrated with respect to the review data-base introduced as an example above.

Table 6
Example of a set of subfiles in the review data-base.

Author-file:     A-code, Author-name
Newspaper-file:  N-code, Newspaper-name, Distribution type
Publisher-file:  P-code, Publisher
Book-file:       B-code, Title, Genre, Year, A-code
Review-file:     A-code, B-code, R-code, P-code, N-code, Year, Month, Number-titles, Number-words
Reviewer-file:   R-code, Reviewer-name, Reviewer-sex

Normalization results in subfiles that contain only pieces of information that are mutually dependent; i.e. the structure of the data-base allows for combining information belonging to one topic. The records in the data-base have to be split up into subfiles. A possible set-up is represented in table 6. In such a design, a subfile only contains 'real' data when these data may be linked with data from other subfiles. Title, genre and year of publication identify a particular book and nothing but that book. Authors usually write more than one book and therefore, to avoid redundancy, the author identification is given by a code that can be used to combine the book-file and the author-file. (The same argument holds for other subfiles.) There is yet another reason to maintain a separate author-file: additional information on an author is independent of the information kept in the book-file. For instance, the address of an author, his/her sex or his record of awards is not a fact about his books. Normalization demands subfiles in these instances. When a correct code is used in the insertion of information, this information will be accepted and stored in a subfile, regardless of any errors that may occur in, for instance, the spelling of names.

Although the review data-base in Tilburg is stored as one non-normalized file, a certain degree of normalization was performed outside of the DBMS. Before any review record is added to the data-base, the name of the author, the newspaper title and the name of the publisher are checked automatically against lists of legitimate names. In combination with the impossibility of changing any data in the data-base, this precaution achieves functionally the same thing that normalization would achieve. The advantage of this procedure is higher performance, but the inevitable drawback is a more troublesome input procedure.
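
A sketch of the kind of automatic check described here - validating names against lists of legitimate values before a record is accepted - might look as follows. This is a hypothetical reconstruction in Python; the actual Tilburg input procedure is not documented in this paper, and the name lists are invented.

    # Hypothetical sketch: reject a review record unless its author, newspaper and
    # publisher names occur in the lists of legitimate names.

    LEGITIMATE_AUTHORS = {"Kingsley Amis", "John Irving"}
    LEGITIMATE_NEWSPAPERS = {"NRC Handelsblad", "De Volkskrant"}
    LEGITIMATE_PUBLISHERS = {"De Bezige Bij", "Querido"}

    def accept_review(record):
        """Return (True, []) if the record may be added, otherwise (False, problems)."""
        problems = []
        if record["author"] not in LEGITIMATE_AUTHORS:
            problems.append("unknown author: " + record["author"])
        if record["newspaper"] not in LEGITIMATE_NEWSPAPERS:
            problems.append("unknown newspaper: " + record["newspaper"])
        if record["publisher"] not in LEGITIMATE_PUBLISHERS:
            problems.append("unknown publisher: " + record["publisher"])
        return (not problems, problems)

    ok, problems = accept_review({"author": "John Irving",
                                  "newspaper": "De Volkskrant",
                                  "publisher": "Quereido"})   # misspelled on purpose
    print(ok, problems)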


7. Data analysis

In the preceding section we discussed some aspects of data-base management systems and their application in managing extensive data-bases. In this section we should like to consider a number of issues bearing on data analysis, confining ourselves to only a rough sketch. To illustrate the use of descriptive and statistical methods we shall refer to a number of investigations presently in progress at the department of the sociology of literature.

There are roughly two kinds of methods for data analysis: besides methods to describe characteristics of the data, there are techniques to test specific hypotheses. This distinction does not parallel the contrast between descriptive and statistical methods; in many cases methods may be applied in a descriptive or in a statistical sense. Roughly, a more or less standardised form can be observed in the formulation of the hypotheses. The question may be, for example, whether there are differences between groups, whether there is a correlation between variables, or whether the scores on a given variable can be explained (or predicted) from knowledge of the scores on another variable. Especially the terms 'differences', 'correlation' and 'explained' constitute the canonical form of such hypotheses.

As to the description of the data, a further distinction must be made between methods of general description and methods by which the original data are reduced to facilitate interpretation. The former are made up of rather simple operations like the summing of frequencies, the calculation of mean and standard deviation, and the description of the data by means of frequency tables. Usually this is only a first step, in order to present the reader with a clear picture of the data collected. Further processing of the data requires the use of methods developed to determine the structure underlying the data; that is, they permit the reduction of the original data-set to a more condensed one, in which the relations between the variables are expressed more clearly. Methods for structuring the original data in a more condensed set are still to be regarded as descriptive, but the operations employed are much more complex.

Among the techniques that are most frequently used we mention factor analysis. In this type of analysis a given (objects * variables) matrix is reduced to a more condensed (variables * factors) matrix. This means that the original matrix, with objects in the n rows and m columns containing the variables, has been changed into a matrix with only m rows containing the variables, and a smaller number of columns, each column containing one factor. These factors roughly represent the most striking relations between the variables. Suppose, for example, that data on 15 variables have been collected. When in a factor analysis 3 or 4 meaningful factors can be extracted, the interpretation may be facilitated considerably: no longer do 15 variables have to be interpreted, but a description of the meaning of the factors reveals the structure of the data. In this context we will not discuss issues concerning the number of factors that may be extracted, or the conditions under which a factor is meaningful (see Seegers and Verdaasdonk (1987) for a description of the way this technique may be applied).
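
The reduction that factor analysis performs can be sketched with a small numerical example. The code below is a hypothetical illustration using a principal-component style extraction (eigendecomposition of the correlation matrix) rather than the specific factor-analysis variant used in the Tilburg studies; the data are random and the choice of two factors is arbitrary.

    # Hypothetical sketch: reduce an (objects * variables) matrix to a
    # (variables * factors) matrix of loadings via the correlation matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    n_objects, n_variables, n_factors = 100, 6, 2
    data = rng.normal(size=(n_objects, n_variables))      # objects in rows, variables in columns

    corr = np.corrcoef(data, rowvar=False)                 # (variables * variables) correlation matrix
    eigenvalues, eigenvectors = np.linalg.eigh(corr)       # eigh returns eigenvalues in ascending order
    order = np.argsort(eigenvalues)[::-1][:n_factors]      # keep the largest components

    # Loadings: correlation-scale weights of each variable on each retained factor.
    loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])
    print(loadings.shape)      # (6, 2): six variables described by two factors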


Another frequently used technique, which may lead to highly comparable results, is multiple regression analysis. Application of this technique allows a researcher to specify to what extent variables determine the scores on a dependent variable. Both techniques may be applied in a descriptive and in a statistical sense; it is more a matter of methodological tradition which application will prevail. Factor analysis will usually be applied as a descriptive technique, although it may be adapted to fit statistical application. Applications of multiple regression analysis mostly imply statistical testing. The lack of a clear distinction between description and statistics on the level of methods explains why these techniques are usually referred to as descriptive statistics when applied in a descriptive sense.

The most important applications of statistical methods may be divided into two groups:
- estimation of population values: estimation intervals, distributions, etc., and
- testing hypotheses about the relations between variables and/or groups (individuals).

Suppose a researcher has collected data on a number of variables. A problem of methodology is how to infer valid arguments from these data. A fundamental problem is how conclusions may be inferred concerning a population when data are only available from a (random) sample. To solve this problem the researcher must have recourse to statistical procedures. Roughly speaking, statistics may be described as the methodology that renders it possible to draw inferences about a population given a (random) sample. Arguments are obtained by inferring from the particular to the general. Statements of this kind cannot, however, be made with complete certainty: margins have to be accepted, marking the degree of certainty. This exemplifies the first type of application: sample values are estimates of population values. Intervals can be determined giving the area where a population value can be expected with a certain probability.

In practical research, this application is not often used in isolation. Researchers are more interested in questions as to whether group A differs from group B, or whether knowledge of the scores on a variable X helps to explain the scores on variable Y. In more comprehensible words: is there an influence of variable X which might explain the scores on variable Y? To mention an example: how important is a variable like education when we want to account for the purchase of literary books?
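
The first type of application - estimating a population value from a sample and giving an interval for it - can be illustrated in a few lines. This is a hypothetical example with simulated data; the 95 percent interval uses the usual normal approximation, which is not a method prescribed by the authors.

    # Hypothetical sketch: estimate a population mean from a random sample and give
    # a 95 percent confidence interval (normal approximation).
    import numpy as np

    rng = np.random.default_rng(1)
    # Simulated 'population': number of literary books bought per year by book-buyers.
    population = rng.poisson(lam=4.0, size=100_000)

    sample = rng.choice(population, size=200, replace=False)   # a random sample of 200 buyers
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(sample.size)            # standard error of the mean

    low, high = mean - 1.96 * sem, mean + 1.96 * sem
    print(f"sample mean: {mean:.2f}, 95% interval: [{low:.2f}, {high:.2f}]")
    print(f"true population mean: {population.mean():.2f}")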


In the case of statistical testing, these applications belong to the second group. Statistical testing implies determining the probability of a given result, assuming some specific hypothesis about the data. To mention an example: when we compare two groups, randomly selected from a population, the values observed will usually deviate more or less from the true or population value. An observed difference has to be tested: what is the probability of finding this difference when in reality the scores for both groups are estimates of the same population value? Testing whether two groups differ means testing the hypothesis that there is no difference, and that the observed results show a haphazard difference, against the alternative that the difference is significant (in the statistical sense). The assumption that the two groups are in fact samples from one population, and that the observed difference is the result of haphazard fluctuation in data collection, is what is called the null hypothesis. This hypothesis is tested against the alternative that the observed difference results from the fact that the two groups are taken from two different populations. In most cases, the null hypothesis states that no difference between groups or variables is observed, or that a correlation between variables is absent; but the null hypothesis may also be understood to mean the assumption about a certain difference (or correlation) that is tested against an alternative hypothesis.

Significance in the statistical sense refers to the conditions under which rejection of the null hypothesis in favour of an alternative is required. When the probability of the observed result, assuming the null hypothesis, is very small, we may be fairly certain that the assumed null hypothesis is incorrect, and that the observed result is more plausible on the basis of an alternative hypothesis. Usually a probability - or significance - level of 0.05 or 0.01 is accepted. A significance level of 0.05 (or 0.01) indicates that, if a random sample from the research population is taken, there is only a 5 (or 1) percent chance of finding the observed result when in fact the null hypothesis is true.

Statistical testing permits a researcher to determine the (un)certainty with which valid arguments may be inferred from the data. This is why the second group of applications plays such a dominant role in practical research. To these applications belong univariate techniques like the computation of correlation coefficients and the testing of the means of two groups by the use of a t-test. These methods, common in research in the social sciences, are essential to enable the researcher to draw conclusions from the collected data. In their most simple form, these techniques involve comparing the data on two variables, for instance the score means of two groups, the scores on two variables, or the correlation between these. Together they make up the univariate techniques. In most cases, however, (the scores on) more variables or groups need to be compared in a single analysis; that is, in most research designs the variables under investigation are entangled in a complex structure with surrounding variables.
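
A minimal sketch of such a test of two group means is given below. It uses simulated data and scipy's standard two-sample t-test; the group labels and the 0.05 level follow the discussion above, but the example itself is not taken from the Tilburg studies.

    # Hypothetical sketch: test the null hypothesis that two groups of book-buyers
    # have the same mean number of purchases, at the 0.05 significance level.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    group_a = rng.normal(loc=5.0, scale=2.0, size=80)   # e.g. highly educated buyers
    group_b = rng.normal(loc=4.2, scale=2.0, size=80)   # e.g. less highly educated buyers

    t_statistic, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t = {t_statistic:.2f}, p = {p_value:.3f}")

    if p_value < 0.05:
        print("Reject the null hypothesis: the observed difference is significant.")
    else:
        print("Do not reject the null hypothesis: the difference may be haphazard.")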


To disentangle this structure one needs more powerful methods than the step-by-step univariate techniques. The methods in question are called multivariate techniques. They range from extensions of univariate techniques (multiple regression and variance analysis) to methods that permit testing the tenability of models containing a specification of causal relations between variables.

To exemplify the kind of situation requiring a multivariate analysis, we should like to refer to a research project which focusses on the variable 'purchase of literary books'. By questioning a number of book-buyers, data are collected for a number of variables, including the number of books purchased (the 'dependent' variable), education, age, income and amount of time for recreation. It is easy to understand that income and age are interdependent (i.e. correlate) in a specific and not coincidental way. In order to determine the influence of the respective variables on the dependent variable we have to account for the interrelationships between the influencing (or independent) variables. One of the techniques that allows a researcher to determine the influence of each variable is multiple regression analysis. By employing a univariate analysis, that is, by determining the influence of the variables age and income independently, a researcher might be led to conclude that both have a significant influence on the purchase of books. By using a multivariate technique, however, he might come to a quite different conclusion, i.e. that only income is of significant influence, while the influence of age is spurious, being the result of its correlation with income.

Research in the field of the empirical sociology of literature leans heavily upon the application of these techniques. Research as it has developed over the last 10 years demands the use of these complex statistical tools, as its main objects of research are located in the analysis of variables within a complex interactive structure. Fortunately, there is a large number of statistical programs available which greatly facilitate the analyses required. Occasionally, a researcher has to do some programming of his own; if so, it usually takes only minor adaptations of the data to facilitate further processing. Three packages of statistical programs practically dominate the market. All three include descriptive as well as statistical techniques. They are the Statistical Package for the Social Sciences (SPSS), the Statistical Analysis System (SAS), and a package that was originally developed for research in medicine and biology, the BMDP package. Besides these three overall packages, several programs have been developed for specific purposes. Often the most useful applications, however, will be added to the overall packages.
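
The age/income example can be sketched numerically. Below, simulated data are constructed so that purchases depend on income only, while age is correlated with income; an ordinary least-squares fit with both predictors then assigns the effect to income rather than to age. The simulation and coefficients are invented for illustration and do not reproduce the survey discussed in section 8.

    # Hypothetical sketch: a multiple regression in which the apparent effect of age
    # on book purchases turns out to be spurious once income is taken into account.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 500
    age = rng.normal(45, 12, n)
    income = 0.8 * age + rng.normal(0, 8, n)            # income correlates with age
    purchases = 0.5 * income + rng.normal(0, 4, n)      # purchases depend on income only

    # Ordinary least squares with an intercept, age and income as predictors.
    X = np.column_stack([np.ones(n), age, income])
    coefficients, *_ = np.linalg.lstsq(X, purchases, rcond=None)
    intercept, b_age, b_income = coefficients
    print(f"age coefficient: {b_age:.2f}, income coefficient: {b_income:.2f}")

    # The simple (univariate) correlations, by contrast, suggest that both matter.
    print(f"corr(age, purchases):    {np.corrcoef(age, purchases)[0, 1]:.2f}")
    print(f"corr(income, purchases): {np.corrcoef(income, purchases)[0, 1]:.2f}")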

8. Some examples of research

Some examples of investigations currently in progress at the department of the sociology of literature of Tilburg University will be mentioned to illustrate how aspects of data collection and analysis fit within the framework described above.

A first example concerns a market research project on buyers of literary works. Its three most important goals are:
- description of the buyer of literary works; this description will be elaborated on the basis of a model of consumer behaviour;
- segmentation of the market and the consequences for the marketing policies of bookshops and publishing houses;
- deciding what variables influence the acquisition of literary books.

The data were collected by means of a survey, held with 727 buyers of literary books. Data on 234 variables were collected per respondent (cf. Leemans and Van Doggenaar (1987) for a report on an aspect pertaining to the first goal). The most important research questions may be inferred from the goals, a few of these being:
- What is the interdependency between variables that in principle may be relevant in explaining the acquisition of literary books? What underlying structure can be derived for these variables?
- How, and to what extent, do variables explain the purchase of literary books? Is education, for instance, of more importance than income? What is their influence on the acquisition of literary books?

To answer these questions, which both bear on the explanation of a (dependent) variable, a technique like multiple regression analysis seems most appropriate. When structural characteristics within a set of variables are to be derived, a technique like factor analysis will be useful.

In another investigation an analysis is made of the reviews of works of poetry which were published in the period 1975-1980. The original set contains the data of about 2500 reviews, including information on the following variables:

- name of the reviewer,
- journal in which the review is published,
- kind of journal (national/regional paper, paper/weekly, etc.),
- length of the review (in number of words),
- characteristics of the work under review: author, year of publication, edition (yes/no), and publisher.


Research questions focus on the hypothesized relationships between, for instance, publishers and reviewers, authors and reviewers, and reviews and publishing houses. Again, these questions suggest the relevance of techniques that render possible the disentanglement of variables from a complex structure of interacting influences.

In still another research project a survey was held among members and former members of a Dutch book club. The main research questions focussed on topics like the following:
- What reasons to become a member or to quit membership are most important?
- Is there a stimulating influence of membership on the acquisition of books?
- What variables most characteristically define members vs. ex-members?

Again we may point to multivariate techniques as relevant for answering these and similar questions. Especially multivariate regression analysis may be mentioned: variables like membership and acquisition of books are (dependent) variables for which the researcher tries to determine how they are determined by other variables.

Another investigation concerned the variables that determine the decisions of selectors in public libraries. This, too, was a central research topic that may be formulated in terms of explaining and explained variables (Verdaasdonk (1986)).

9. Conclusions

Although this set of examples can easily be extended, the enumeration just given will undoubtedly have given an impression not only of the research being conducted, but more specifically of the way statistical methods are being applied. This kind of research depends heavily on the availability of adequate computer facilities. In discussing some fundamental aspects of data management and data analysis we took it for granted that the researcher had these facilities at his disposal. Without them, empirical research is practically impossible. For the researcher who does indeed have them at his disposal, they represent powerful tools enabling him to carry out fast and efficient data manipulation and data analysis. Using a data-base management system may facilitate these manipulations considerably. Applying descriptive methods of data analysis may be helpful in interpreting the data. Application of statistical methods is fundamental to the drawing of inferences from the data, since they supply the researcher with knowledge about the (un)certainty with which conclusions from the data may be held to be valid.


The problem of validity is fundamental to the issue of how to design a research project, given specific questions and hypotheses. As has been said before, statistical methods may dictate some more or less canonical forms for the hypotheses to be tested. Consequently, before actually doing the research a researcher should formulate specific questions, using this canonical form, and subsequently choose a research design that allows adequate answers. In the context of this journal, which aims at promoting empirical research in the field of books and of the arts, we have attempted to broach some fundamental topics bearing on the application of the computer and of empirical procedures by researchers in this field. Clearly our observations were introductory in character. We are confident, however, that they illustrate how empirical research on literature may be designed and executed adequately with the help of powerful methods of data management and data analysis.

References

Bourdieu, P., 1977. La production de la croyance. Actes de la Recherche en Sciences Sociales 13, 3-43.
Bourdieu, P., 1979. La distinction. Paris: Minuit.
Bourdieu, P., 1983. The field of cultural production or: The economic world reversed. Poetics 12, 311-356.
Leemans, H. and J. van Doggenaar, 1987. Subdivisions of book supply made by individual buyers of literature. Poetics 16, 255-268.
Rees, C.J. van (ed.), 1983a. Empirical sociology of literature and the arts. Poetics 12, 285-487. (Special issue.)
Rees, C.J. van, 1983b. Advances in the empirical sociology of literature and the arts: The institutional approach. In: C.J. van Rees (ed.), 285-318.
Seegers, G. and H. Verdaasdonk, 1987. Choice patterns of adult users of a public library. Poetics 16, 353-368.
Verdaasdonk, H., 1985. The influence of certain socio-economic factors on the composition of the literary programs of large Dutch publishing houses. Poetics 14, 575-608.
Verdaasdonk, H., 1986. Boekselectie en collectiegebruik. [Book selection and the use of the collection of a public library.] The Hague: Nederlands Bibliotheek- en Lectuur-Centrum.