The problems of input and their implications in mechanization

Inform. Stor. Retr. Vol. 4, pp. 253-256. Pergamon Press 1968. Printed in Great Britain

J. FARRADANE

The City University, London

Summary--Systems with unorganized keywords have been found increasingly unsatisfactory, and classificatory devices (broader or narrower terms, etc.) have been introduced into thesauruses, or structural devices (links, roles) have been used; man-machine interaction is really a means of getting re-classification done by the user. Fully structured systems (Gardin, Farradane) are being investigated. Results by means of relational indexing are showing high values for both recall and precision (or zero or low discard ratios), showing the value of structured (pre-coordinate) indexing in combination with browsing techniques by means of concept organization (new generalized classificatory principles). The computer techniques required to apply these methods are discussed; they may include associative memories, simultaneous access to structure-linked terms, and browsing facilities.

ALTHOUGH in a great many places information is still being stored on ordinary filing cards, on edge-punched cards, on peek-a-boo or optical coincidence cards, or on mechanically-handled body-punched cards of various types, the development of the computer has presented an attractive but deceptive means of handling information. I use "deceptive" quite deliberately, because the power of the computer has led many to assume that the machine will somehow undertake for us the intellectual tasks of understanding, analysing, and piecing together required information from a mass of other information. This has never been expected from manually-operated card-indexing systems where the human being expects to undertake the intellectual tasks of analysis or classifying at input, and of identifying, in conjunction with various degrees of intellectual understanding and discrimination, the information required for output.
Because all these intellectual processes are to a large extent carried out without clearly specified conscious procedures, they have received little detailed examination, and have indeed been largely overlooked. With the increasing use of machines, this necessity of intellectual operations has again been overlooked, perhaps because their role in manual methods was insufficiently appreciated, perhaps because the difficulties of improving intellectual methods seemed too formidable, or perhaps because too naive a view was taken of the capabilities of the machine. Thus Perry, at the ICSI meeting in 1958, could suggest that the computer could produce new combinations of information; we should all be aware now that you cannot get out meaningful information if it has not initially been put in. Similarly, if the computer is required to handle a full natural language text, it cannot be expected, without the previous provision of much more sophisticated programming than anyone has yet used, to overcome the vagaries of the writers of the texts. Few authors, if indeed any, write sentences free from unintended implications, the casual use of synonyms (which may perhaps be desirable to give variety in reading), colloquial phrases, or, far too often, plain grammatical errors, let alone incorrect statements or printers' errors. The human being, reading such text, is
usually capable of penetrating the veil of such ambiguities and of understanding the intended meaning, without much conscious effort. Unless the machine can be programmed with equal powers, it will perpetuate the errors.

For many reasons, such as disillusionment with library classification schemes, most mechanized information storage and retrieval systems have been based on the use of keywords, or descriptors; in the majority of cases, certainly in earlier systems, there is no coordination of terms at input (pre-coordination), and the interconnexion of descriptors takes place only at the time of search (post-coordination). Increasingly, such methods have been found unsatisfactory, partly because of failure to retrieve wanted information, but much more because of the amount of unwanted information, or noise, produced by the systems.

Gradually, a number of means have been tried for overcoming the faults. The free choice of descriptors has been replaced by the authority list called a thesaurus; initially, such a thesaurus was only a plain standard set of terms, without extra devices; later, it was seen to be necessary to make clear the removal of synonyms, by "see also" references, and to introduce a small degree of classification, by listing so-called broader terms, narrower terms, and related terms; such devices were often not introduced very consistently, until controlled by means of the computer. Other attempts to control false drops and other noise have used rather arbitrary and elementary structural devices such as links and roles, often without much consistency. Others have introduced pre-coordination in terms of the most elementary forms of Boolean algebra. None of these methods mentioned so far has received any serious intellectual examination, or been developed beyond an elementary stage. It is not surprising, therefore, that the difficulties have hardly been reduced, and that the problem of noise remains considerable.
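The false drop described above can be made concrete with a toy example. The following Python sketch is an illustration only: the two documents, their descriptors, the role names and the function names are all invented for the purpose and do not come from any system discussed here. It shows a purely post-coordinate Boolean search returning an unwanted paper, and an elementary role device assigned at input excluding it.

```python
from collections import defaultdict

# Hypothetical two-document collection: both papers carry the same descriptors,
# but the (invented) role assigned to each metal at input differs.
DOCUMENTS = {
    "D1": {
        "title": "Lead coating of copper pipes",
        "roles": {"lead": "coating material", "copper": "coated object",
                  "coating": "process", "pipes": "product"},
    },
    "D2": {
        "title": "Copper coating of lead pipes",
        "roles": {"copper": "coating material", "lead": "coated object",
                  "coating": "process", "pipes": "product"},
    },
}


def build_inverted_index(documents):
    """Map each descriptor to the set of documents indexed under it."""
    index = defaultdict(set)
    for doc_id, record in documents.items():
        for descriptor in record["roles"]:
            index[descriptor].add(doc_id)
    return index


def post_coordinate_search(index, query_terms):
    """Coordinate descriptors only at search time, by Boolean AND."""
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()


def search_with_roles(documents, required):
    """Keep documents in which every (descriptor, role) pair was assigned at input."""
    return {
        doc_id
        for doc_id, record in documents.items()
        if all(record["roles"].get(term) == role for term, role in required)
    }


index = build_inverted_index(DOCUMENTS)

# Question: "copper coating of lead pipes".
plain = post_coordinate_search(index, ["copper", "coating", "lead", "pipes"])
with_roles = search_with_roles(
    DOCUMENTS, [("copper", "coating material"), ("lead", "coated object")]
)

print(sorted(plain))       # both documents match; D1 is a false drop
print(sorted(with_roles))  # only D2 survives
```

The role device removes the noise only because a human indexer chose the roles at input, which is the sense in which such devices reintroduce intellectual effort.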
A more recent approach to alleviation of the difficulties has been that of "man-machine interaction". In these methods the question put to the system is modified in one or more steps by referring the initially produced answers, or some of them, to the questioner, who can then modify his question in an attempt to obtain answers which will be closer to his original, or possibly even not initially realized, intentions; alternatively, the questioner can indicate which answers are satisfactory, and this selection can be used to guide or modify the system's output in a second, or further, search process. It is clear that this is another form of pre-coordination or classification carried out by the user, instead of the indexer, before proceeding with searches.

It will thus be seen that the deviser of information retrieval systems has been forced back from a purely mechanized, post-coordinate procedure to one which uses one or more pre-coordinate approaches. These pre-coordinate methods are all methods incorporating human judgement, though as intellectual devices they tend to be relatively elementary.

If the dream of a purely mechanical information storage and retrieval system is fading, surely much more attention should be paid to the exact nature of the intellectual pre-coordination methods that are to be used, so that the most efficient procedures are applied. Basically this means obtaining a better understanding of our intellectual operations, and of the way we think and express our ideas and needs. Many people may consider that we already have such understanding within the disciplines of logic, especially mathematical logic, but almost nothing has been done to test this in any advanced form; furthermore, the evidence from experimental psychology is that human thinking does not take place solely, or even to any large extent, on the lines of such logic.

Another line of approach has been through linguistic analysis.
Apart from the fact that our theoretical pictures of the structure of languages are as yet by no means perfected, there remain the difficulties which I have already noted, that
language is itself not consistent, but can include many variations of forms of expression, colloquial phrases, and meanings which are often dependent upon extra-linguistic factors such as intonation, emphasis and gesture; admittedly these are not present in the written word, but the reader, from his long experience of language as he hears it used, and often from his knowledge of the subject matter, can usually sense the intended meaning and emphasis, especially by the aid of punctuation, inverted commas, exclamation marks, italics and so on. However, linguistic and grammatical errors by careless authors may well nullify or misdirect these judgements; the trained reader may be able to guess his way past these errors; the machine cannot. Salton has produced a system of mechanized recognition of grammatical structure; Gardin has devised an intellectually derived and stylized system of representing structure, also based on linguistic theory. Both seem to me to be liable to suffer considerably from the sources of linguistic error mentioned above.

In my own researches I have therefore concentrated on methods of semantic analysis for sophisticated pre-coordinate indexing. These methods I have described elsewhere [1,2], and they need not be repeated here. We have indexed some 1200 papers, and we are testing the index with some 150 questions. The results of some 40 questions have been analysed so far, and the results are of an order quite different from that of other published tests. In terms of the parameters of recall and precision (which, for various reasons, we do not consider entirely satisfactory, particularly when plotted against each other graphically), we can achieve results, compared with the user's selection of highly relevant papers, which are almost all within the 50-100 per cent range for both parameters.
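For readers unfamiliar with the parameters, the following Python sketch shows how recall and precision are conventionally computed from sets of document identifiers, assuming the standard set-based definitions. It also computes fallout, the measure which the text goes on to equate with the discard ratio. The function name and the sample figures are invented for illustration; only the collection size of 1200 echoes the text, and none of the numbers are results from the tests reported here.

```python
def retrieval_measures(retrieved, relevant, collection_size):
    """Return recall, precision and fallout for one question.

    retrieved       -- identifiers of the papers the index produced
    relevant        -- identifiers the questioner judged relevant
    collection_size -- total number of papers indexed
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)            # relevant papers actually retrieved
    noise = len(retrieved) - hits               # retrieved but not relevant
    non_relevant = collection_size - len(relevant)

    return {
        # share of the relevant papers that were found
        "recall": hits / len(relevant) if relevant else 0.0,
        # share of the retrieved papers that were relevant
        "precision": hits / len(retrieved) if retrieved else 0.0,
        # share of the non-relevant papers that were retrieved anyway
        "fallout": noise / non_relevant if non_relevant else 0.0,
    }


# Invented example: 10 relevant papers in a collection of 1200;
# the search returns 8 papers, 6 of which are relevant.
relevant_ids = set(range(1, 11))
retrieved_ids = {1, 2, 3, 4, 5, 6, 500, 501}

measures = retrieval_measures(retrieved_ids, relevant_ids, collection_size=1200)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Because fallout is normalized by the size of the non-relevant part of the collection, it stays very small for any search that returns only a handful of papers from a large file, which is one reason for examining it alongside precision rather than in place of it.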
When we include desired answers of a second order of relevance (possible answers), the recall may drop to about 50 per cent, but with the application of our browsing techniques, we can bring the recall figure up quite considerably, without much change in the precision ratio.

We are also investigating other parameters, particularly what we have called the discard ratio (Cleverdon's "fallout"). So far we have not established any particularly consistent connexions between parameters, and can say only that the results have in general been most gratifyingly good; when we have completed the analysis of all the tests we shall no doubt have much more adequate material from which to make calculations and discover interconnexions. What we do already feel confident in claiming is that this pre-coordinate structuring produces results of an order quite different from that of post-coordinate methods, and that browsing techniques, also based on semantic principles, improve the results without introducing disadvantages. Those browsing techniques which arise from principles of concept organization (a generalized form of classificatory methods which are very different from library classifications) are only of limited use so far; the browsing techniques (one of which we call "condensation") based on semantic principles appear to provide a more powerful aid.

Now, assuming that our full results confirm these initial figures, we need to consider very carefully the implications for the mechanization of these procedures. So far we have conducted the research on ordinary filing cards (with a slightly laborious system of duplication of cards with cross-referencing so as to make easy entrance to the file at any point). Anyway, the ICT computer was not installed when we started, and only now has sufficient peripheral equipment and languages, apart from the fact that we have at present no available analysts or programmers.
The possibilities of mechanization are of course of great interest to us, if only in order to speed up the researches. Mechanization could not, however, affect the results.

The methods of semantic analysis seem inevitably to require structured representations of knowledge which take the form of at least two-dimensional (or, more accurately, two-directional) diagrams composed of terms interconnected by semantic relations (which can be symbolized) [3]. One reason for the two-directional diagrams is that many complex items of information exist in which a circular scheme of relations has to be represented, such as when a particular substance is chemically identified by measurement of the spectral characteristics of a derivative of the original substance. Another, more obvious, reason is that with most subjects of any complexity, linear representation is more difficult and confusing, especially when browsing techniques are to be applied. Our analyses of subjects therefore look rather like the structural formulae of complex organic chemical compounds. (It is interesting to note that Gardin's "Syntol" also partly uses similar-looking diagrams.)

The semantic browsing methods result in the omission of certain terms from the diagrams of the indexed information (more rarely of the questions), with a closing of the gaps to produce condensed diagrams, special rules being applied to determine which relations survive. Classificatory browsing (which may be applicable to one term at a time in the diagrams) requires a means of reference to the concept organization schedules, but will be more complicated than the known procedures of dictionary look-up.

If the computer is indeed the appropriate machine for our purposes (and it is, of course, the only available machine with sufficient power at present), then new techniques would appear to be needed. The structures of organic compounds, and Gardin's diagrams, have been programmed by somewhat laborious means, but the methods would not permit our sort of browsing without much greater complications. The principles so far developed for providing associative memories (e.g. Cheydleur) may be suitable for classificatory browsing, but the programming has been very complex.
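To give a rough idea of what such a diagram and its condensation might look like as a data structure, the following Python sketch stores a diagram as a set of labelled, directed relations between terms. Everything in it is an assumption made for illustration: the relation labels are placeholders and not the relational operators of [1-3], and the rule deciding which relation survives a condensation (`keep_outgoing`) is a stand-in, since the special rules themselves are not given here.

```python
from typing import Callable, NamedTuple, Set


class Relation(NamedTuple):
    source: str
    target: str
    label: str  # placeholder name for a semantic relation


def keep_outgoing(incoming: Relation, outgoing: Relation) -> str:
    """Stand-in survival rule (an assumption): the outgoing relation survives."""
    return outgoing.label


def condense(diagram: Set[Relation], omitted: str,
             survive: Callable[[Relation, Relation], str] = keep_outgoing) -> Set[Relation]:
    """Omit one term and close the gap between its neighbours."""
    incoming = [r for r in diagram if r.target == omitted]
    outgoing = [r for r in diagram if r.source == omitted]
    kept = {r for r in diagram if omitted not in (r.source, r.target)}
    for before in incoming:
        for after in outgoing:
            if before.source != after.target:  # do not create a self-loop
                kept.add(Relation(before.source, after.target, survive(before, after)))
    return kept


def has_cycle(diagram: Set[Relation]) -> bool:
    """True if some term can be reached again by following relations from itself."""
    successors = {}
    for r in diagram:
        successors.setdefault(r.source, set()).add(r.target)

    def reachable(start):
        seen, stack = set(), list(successors.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        return seen

    return any(term in reachable(term) for term in successors)


# The circular example from the text: a substance is identified by the
# spectral characteristics of one of its derivatives.
DIAGRAM = {
    Relation("substance", "derivative", "yields"),
    Relation("derivative", "spectrum", "characterized by"),
    Relation("spectrum", "substance", "identifies"),
}

condensed = condense(DIAGRAM, "derivative")
print(has_cycle(DIAGRAM))
for relation in sorted(condensed):
    print(relation)
```

Even in this toy form the circular scheme cannot be written as a single linear string of terms without repeating one of them, and the condensation step needs access to all the neighbours of the omitted term at once, which hints at why simultaneous access to structure-linked terms matters.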
Methods of direct input of diagrams with a light pencil, from "Sketchpad" onwards, offer promise for handling the semantic diagrams, but have scarcely been sufficiently developed as yet.

One is tempted to wonder whether the digital computer, which is so effective for numerical computation and data retrieval, is the appropriate machine for handling information in the semantic sense, i.e. non-numerical information. The classificatory type of browsing seems to have more affinity with the processes of an analogue computer. The real need is for simultaneous access to several stored terms, together with the expression of their semantic interlinkings, and means of manipulating these, or browsing, which is virtually the direct recognition of two-dimensional diagrams and of their possible modifications in various ways. Even if this is possible on a digital computer, will it prove forever much too expensive? If so, is it possible that a special purpose machine of a different type will eventually have to be devised for information storage and retrieval? For my part, I can only hope to make clear the semantic features of information handling. The rest I leave to the electronic engineers.

REFERENCES

[1] J. FARRADANE: Relational indexing and classification in the light of recent experimental work in psychology. Inform. Stor. Retr., 1, 5-9 (1963).
[2] J. FARRADANE: Concept organization for information retrieval. Inform. Stor. Retr., 3, 297-312 (1967).
[3] J. FARRADANE, S. DATTA and R. K. POULTON: Report on research on information retrieval by relational indexing. Part I--Methodology. March 1966 (City University, London).