World Patent Information, Vol. 5, No. 3, pp. 186-187, 1983.
Pergamon International Information Corp. Printed in Great Britain.
ONLINE Patent Information
Chemical Information Online: Some Thoughts
Users of patent information soon learn that chemical information is far more accessible than any other type. This is due largely to the efforts of database producers such as the Chemical Abstracts Service (CAS), Derwent Publications, IFI/Plenum Data Company, and the American Petroleum Institute (API); online hosts such as SDC, Dialog, and Questel; and the many information users who support their products. But the more those users utilize these information services, the more they recognize the shortcomings of all of the available products. They’re all good, in various ways - but there are still many things that could be improved. This subject has captured my attention recently in many ways, not the least of which was through a study of patent retrieval from various databases carried out by the API’s Patent Task Force, which I chair. It is the subject of a paper I’m slated to deliver to this Fall’s American Chemical Society Meeting. Perhaps if this issue is published early enough I might motivate you to attend that session, which should be a provocative one.
Let us consider a hypothetical database which indexes each substance used in the patent examples, indexes further the function that the substance serves and ties the two by a linking operator. The concept is already present in the API’s databases, where terms such as CATALYST, OXIDATION INHIBITOR, and the like are linked to terms describing the chemical structure of the substance. API only does this for ‘significant’ substances in the patent, and I’m looking for something more. The indexing record would be a long one, of course, but there’s already online software that can cut through a maze of indexing terms and prepare a compact and pinpointed record of the information we seek: SDC’s PRINT HIT capability, which allows the selective printing of the portion of an indexing record which produced the hit. By searching for solvent along with appropriate terms for a given area of technology one could, by specifying PRINT HIT of linked indexing fields, get just the sort of pinpointed information collection that is needed. Now all we need is for somebody to produce that sort of deeply-indexed database. In the meantime the PRINT HIT feature can be used in this way to spotlight many sorts of information retrieved in SDC searches.
Let us look at the substances we’re indexing. Some of them are specific substances, some of them generic - the Markush structures that create such problems for everyone dealing with chemical patents. Specific substances have always been called by names, but nomenclature is often highly complex and inconsistent. Over the past two decades, though, a new identifying medium has been developed and has taken hold: the CAS Registry Number (RN), a serial number which identifies unambiguously each known chemical substance.
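The linked substance-and-role indexing described in this passage, together with a PRINT HIT style display, can be illustrated with a minimal sketch. Everything in it is hypothetical: the patent labels, the index terms, and the `search_print_hit` function are invented for illustration, and the code is not a description of SDC's software or of the API's indexing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One linked indexing entry: a substance tied to the role it plays."""
    substance: str
    role: str


# Hypothetical records; patent labels and index terms are invented.
RECORDS = {
    "PATENT-A": {
        "subject": {"COATINGS"},
        "links": [Link("TOLUENE", "SOLVENT"),
                  Link("COBALT NAPHTHENATE", "CATALYST")],
    },
    "PATENT-B": {
        "subject": {"COATINGS"},
        "links": [Link("TOLUENE", "REACTANT"), Link("ACETONE", "SOLVENT")],
    },
    "PATENT-C": {
        "subject": {"LUBRICANTS"},
        "links": [Link("HEXANE", "SOLVENT")],
    },
}


def search_print_hit(records, subject, role):
    """Return, per matching record, only the linked entries that caused the hit."""
    hits = {}
    for patent, record in records.items():
        if subject not in record["subject"]:
            continue  # wrong area of technology
        matched = [link for link in record["links"] if link.role == role]
        if matched:
            hits[patent] = matched
    return hits


if __name__ == "__main__":
    for patent, links in search_print_hit(RECORDS, "COATINGS", "SOLVENT").items():
        print(patent, [link.substance for link in links])
```

Because the role is stored with the substance rather than as a separate term, PATENT-B is reported for acetone only: toluene appears in that record, but as a reactant, so it is not part of the hit.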
Information services in general attempt to index the products of synthesis and the essential components of formulations. They sometimes deal with starting materials, less frequently with the products of intermediate reaction steps. Needless to say the information user is best served when all of these materials are indexed, especially when their role in a process is identified, as it is in several databases. Indexing concentrates on claimed substances - even CAS, which used to index only material from examples, now has a policy of indexing specific substances that are claimed, even when they are not supported by data. Databases generally index information from examples, whether or not it is claimed; less frequently they cover information from the rest of the disclosure.
I believe that there is no longer any valid excuse for the patent offices of the world to fail to require the use of CAS RNs to identify each significant substance in each patent specification. Certainly the United States Patent and Trademark Office should take the lead. U.S. chemical manufacturers have learned to identify their substances by RN under the terms of the Toxic Substances Control Act; companies and scientists around the world use RNs more and more in information retrieval and processing; they could learn to do so in patent applications as well.
The emphasis in indexing is on the novel feature of the patent, and for legal purposes this is not unreasonable. But patent information can be useful for many other purposes, including the influencing of research programs and of marketing efforts. For example, a manufacturer of solvents may want to glean from patents information on typical uses of its solvents, or of other solvents which its products might potentially replace, in various industries. But nobody bothers to index the ‘lowly’ solvent, unless it turns out to be novel.
Of course there are potential problems, the most obvious of which is the potential for the applicant’s using the wrong RNs. An arrangement in which CAS would participate in the patenting process by reviewing RNs could provide one solution, and the effort expended would be partly offset by CAS efficiencies when it came time to index the patent. Other obstacles could be raised, but one thing is certain: patent documents containing RNs could be indexed far more effectively by the various database producers. All users of patent information would benefit.
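One partial safeguard against mistyped RNs, not discussed in the text, is the check digit that forms the last digit of every Registry Number. The sketch below assumes the commonly documented rule: the digits before the check digit are weighted 1, 2, 3, ... counting from the right, and the weighted sum modulo 10 must equal the check digit. The function name `is_valid_cas_rn` is hypothetical. A test of this kind can catch transcription errors, but it cannot tell whether a well-formed RN belongs to the substance the applicant intended, so it would supplement rather than replace a review by CAS.

```python
import re


def is_valid_cas_rn(rn: str) -> bool:
    """Check the format and check digit of a CAS Registry Number string."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", rn)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... counting from the rightmost digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check


if __name__ == "__main__":
    for candidate in ("7732-18-5", "64-17-5", "7723-18-5"):
        print(candidate, is_valid_cas_rn(candidate))
```

Here 7732-18-5 (water) and 64-17-5 (ethanol) pass, while 7723-18-5, a transposition of two digits in the RN for water, fails.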
What about the generic Markush structures, or generic searches for specific substances, for example, C1 to C6 alcohols? Derwent, IFI/Plenum and API all feature fragmentation coding systems to cope with generics, but all of these are in the final analysis inadequate for the full representation of structure. They leave out much of the detail about how the fragments are put together. One of the greatest shortcomings of fragmentation systems is their inability to distinguish whether two groups, X and Y, are both present at the same time in a molecule, or merely alternatives. A search for compounds with both X and Y inevitably retrieves unwanted references, where the two are alternatives. A fragmentation system which could make this distinction would be a significant improvement over the current situation.
In the Derwent, IFI, and API systems a fragment can be used directly in the database, in combination with other indexing terms. This is certainly preferable to a staged process in which an external dictionary file is queried to identify compounds with a given structural feature, and a list of those compounds is transferred to a second file to be combined with other terms. That is the current situation with the CAS Registry dictionaries and CASearch files offered by DIALOG and SDC, as well as with the CAS ONLINE and Questel/DARC Registry files. First of all, these files can only cope at present with fully defined structures, not generics. Secondly, the technique produces lists of compounds which can be very long. It is wonderful that several hosts have devised techniques for automatically transferring these lists from the dictionary files to CASearch, and these techniques certainly work well with lists that are moderate in length, but they break down with lists of many hundreds, even thousands of compounds. A partial solution to this problem could be achieved by a modification in the content of the CASearch files.
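The X-and-Y problem described above can be made concrete with a toy model. The fragment codes, patent labels, and both search functions are hypothetical and do not reproduce the Derwent, IFI/Plenum, or API coding systems. The sketch assumes that each alternative embodiment of a generic claim could be coded as its own set of fragments, which is practical only when the alternatives are few.

```python
# Hypothetical fragment codes; each inner set is one alternative embodiment.
PATENTS = {
    "PATENT-1": [{"X", "Y", "RING"}],            # X and Y in the same molecule
    "PATENT-2": [{"X", "RING"}, {"Y", "RING"}],  # X or Y, as alternatives
}


def flat_search(patents, required):
    """Fragmentation-style search: all codes of a patent are pooled together."""
    return sorted(
        patent for patent, variants in patents.items()
        if required <= set().union(*variants)
    )


def variant_search(patents, required):
    """Require all fragments to occur together within a single alternative."""
    return sorted(
        patent for patent, variants in patents.items()
        if any(required <= variant for variant in variants)
    )


if __name__ == "__main__":
    print(flat_search(PATENTS, {"X", "Y"}))     # includes the unwanted PATENT-2
    print(variant_search(PATENTS, {"X", "Y"}))  # PATENT-1 only
```

The pooled search retrieves PATENT-2 even though no single compound in it carries both groups; keeping the alternatives separate removes that false drop at the cost of a longer indexing record.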
Registry Numbers are invaluable for their unambiguous identification of substances, but they have a serious shortcoming in that they convey no structural information in themselves. But what if the CASearch files contained not only the RNs, but also the CAS preferred nomenclature for the substances they represent? We would have then added to the files words and word fragments which can frequently be used to represent structural fragments, and would have enhanced our ability to do generic searching. The files are very large ones now, of course, and I’m proposing enlarging them still further, which could certainly create problems. The benefits could be substantial, though.
Topological coding systems are of course more complete than any fragmentation system, at least for specific structures. They also have the potential for being more reliable, because they are based unequivocally on a molecule’s structure, rather than on an indexer’s judgement. Some very exciting work is being done by several groups in an attempt to develop topological systems that can deal with Markush structures. Hopefully we’ll see some fruit before long. Hopefully that fruit will allow us to search substructures in combination with other bibliographic indexing terms; alternatively I’d like the capability for the facile transfer of long lists of compounds from a structure search into a bibliographic file for further manipulation. Finally, consider the benefits to information retrieval from an overall system in which patent specifications were required to contain RNs, and the database producer had (as CAS does) a program for automatically generating the topological record from the RN.
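The proposal to carry preferred nomenclature alongside the RNs can be sketched as a word-fragment search over names. This is only an illustration under stated assumptions: the small table and the `rns_with_name_fragment` function are hypothetical, a real file would hold CAS index names for far more substances, and plain substring matching is a crude stand-in for whatever truncation features a host would actually offer.

```python
# RN -> name pairs for a few common substances, used purely as sample data.
NAMES_BY_RN = {
    "64-17-5": "Ethanol",
    "67-56-1": "Methanol",
    "71-23-8": "1-Propanol",
    "67-64-1": "2-Propanone",
    "71-43-2": "Benzene",
}


def rns_with_name_fragment(names_by_rn, fragment):
    """Return the RNs whose name contains the word fragment, ignoring case."""
    needle = fragment.lower()
    return sorted(rn for rn, name in names_by_rn.items()
                  if needle in name.lower())


if __name__ == "__main__":
    print(rns_with_name_fragment(NAMES_BY_RN, "anol"))    # the three alcohols
    print(rns_with_name_fragment(NAMES_BY_RN, "propan"))  # both C3 compounds
```

The fragment "anol" picks out the three alcohols directly in the file, with no separate dictionary search and no list transfer. A name fragment is still only a rough proxy for a structural fragment and will both miss and over-retrieve in less tidy cases.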
In this brief piece I haven’t been able to do more than point up a few of the aspects of chemical substance retrieval. I’ve carefully avoided discussion of full text search systems this time, but don’t be surprised if I bring up the subject in the near future. Chemical patent information is currently in a state of considerable ferment. The product could indeed be an interesting brew.
Stuart M. Kaback, Exxon Research and Engineering Co., Linden, NJ 07036, U.S.A.