Computers & Geosciences Vol. 22, No. 2, pp. 193-194, 1996. Elsevier Science Ltd. Printed in Great Britain. 0098-3004/96 $15.00 + 0.00
ANOTHER NODE ON THE INTERNET
JOHN C. BUTLER Department of Geosciences, University of Houston, Houston, TX 77204, U.S.A. (e-mail:
[email protected])
As you become proficient in moving through cyberspace, there will come a time when you ask: “Is there information about geoscience-related newsgroups, or geophysics degree programs at the University of Houston, or general statistical packages on the Net, or what is Enid Grubbman’s e-mail address at Miami University?” On the ANON home page (http://www.nsm.uh.edu/anon.html) you will find several references that may prove helpful. The ANON page has links to listings of Geosciences Resources, People, Places, Universities, Education Resources, and Software. None of these lists of lists should be viewed as definitive. One of the simultaneous strengths and weaknesses of the way the Internet has evolved is that relatively few people are paid to maintain Web resources; maintenance is most often accomplished by volunteers. This is but one of a complex of issues faced by those who use the Net and want to see its utility increased.

Sooner rather than later you will need to locate specific information, and a number of searching services are available on the Internet. This issue will focus on WWW search procedures. A special file on Searching has been created for the readers of this column and is available at http://www.nsm.uh.edu/anon.search. You may save this file (save the source) to your desktop/main directory and access it with any browser that allows you to specify a particular file. In Netscape, Command-O allows the user to specify which file is to be viewed. If you do this, you will note that there is a missing image/icon, indicated by the small rectangle with a question mark in the upper right-hand corner. If you go to the ANON page and reload the search file, you can download this small image by placing the cursor on the image and holding down the command key. A window will open and you will be allowed to save the image to your machine.
If you put the hypertext mark-up language file and the image in the same folder (or directory), you will have a duplicate of the ANON page file. Thus, any material which you can view with a browser can be downloaded to your machine. This raises other critical issues regarding intellectual property, copyrights, and plagiarism. I will try to arrange a guest article on these subjects for a future issue.

Load the downloaded file or access the file on the ANON home page. On the left-hand side of the table are links to pages which contain a number of different search engines and which seem to be maintained on a regular basis. I recommend, however, starting with one of the individual search engines on the right-hand side of the page, especially if you think you may be using the search process as an integral part of your work. Each of these algorithms rests on a set of rules of which the user should be aware. Some search the full text, whereas others search only the titles of files or the contents of the address (URL).

The World Wide Web Worm (WWWW) search engine (http://wwwmcb.cs.colorado.edu/home/mcbryan/WWWW.html) is limited to searching URL references, URL addresses, document titles, and document addresses. If you produce WWW resources which you would like others to locate via this searching strategy, be certain to name the files so that their contents are obvious. A file labeled stuff.html might be in great demand, but the Worm would locate it only if the user specified “stuff” as the key word in the search. The Worm user should read the sections on Instructions, Examples, Search, Failure, Register, and WWWW Paper, which are referenced at the top of the WWWW page at the address given above or in the downloaded Anon.search file.

Other search strategies in the file Anon.search search on the text of the html documents. The Web Crawler (http://webcrawler.com/) was an early text search service.
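The consequence of title-and-address matching, discussed above for the Worm and a file named stuff.html, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Worm's actual algorithm: the two-page index, the example.edu addresses, and the function title_url_search are hypothetical.

```python
# Hypothetical miniature index: each entry holds only a URL and a document
# title, the kinds of fields the Worm is described as searching. The page
# text itself is deliberately absent.
PAGES = [
    {"url": "http://example.edu/geology/stuff.html", "title": "Stuff"},
    {"url": "http://example.edu/geology/ries-crater-diamonds.html",
     "title": "Diamonds at the Ries Crater"},
]


def title_url_search(keyword, pages):
    """Return URLs whose address or title contains the keyword (case-insensitive)."""
    key = keyword.lower()
    return [
        page["url"]
        for page in pages
        if key in page["url"].lower() or key in page["title"].lower()
    ]
```

In this sketch a search for “diamonds” finds only the descriptively named file, whereas the file named stuff.html is found only by the key word “stuff”, whatever its text may contain.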
Web Crawler users should take time to read the explanations (Help, Facts, Top 25 Sites, Submit URLs, Random Links, and No-forms Search), which appear as links near the top of the home page. The design is described in the Facts section. Note that a developer can submit the URL of a particular site. The Web Crawler “agent” will then visit the site and “read” the text. Most of these search agents allow submission of URLs, which should be considered when the developer is “semi satisfied” with the product.

Lycos (http://lycos.cs.cmu.edu/), Yahoo (http://www.yahoo.com/search.html), and Inktomi (http://inktomi.berkeley.edu/), the other text-based search algorithms listed in the table, are similar to the Web Crawler, but each has its own spin on how the search takes place and how the results are displayed. All require submission of key words and all allow use of some form of Boolean operation(s). The sequence in which “matches” are displayed is based on the frequency of occurrence of the key word(s) in the documents, with those at the top of the list having the highest frequency. All of these search engines allow the user to specify how many “matches” are displayed, starting from the top of the list, and all allow the user to examine the entire set of matches. The matches are presented as links to the articles. The search engines differ in the size of the database which is accessed by the searching algorithm and in how the database was assembled. Inktomi (the newest of the four referenced above) has a paper on Counting URLs which should be reviewed by the user.

A limited, but personally revealing, set of tests is described in the remainder of this month’s column. While looking for some resources for the physical geology course I was teaching, I found a short article on diamonds found at the Ries Crater. The article is “buried” in files maintained by the Open University and is located at: http://exodus.open.ac.uk/earth/research/groups/Gilmour/Ries3.html. The number of matches for Ries and Crater (Boolean and) obtained from the four search engines is given below.

Inktomi    Lycos    Web Crawler    Yahoo
13         7        4              0
None of the searches uncovered the article, which means either that the URL has not been submitted or that a search agent has not found that site. This is particularly frustrating, as the amount of useful material about the Ries Crater on the Web is unknown. The four matches found by Web Crawler are anomalous in that Ries does not occur, but “. . . ries” as part of summaries, repositories, etc. does occur. Perhaps this resulted from a problem in coding the text of these four sources. Five of the seven Lycos matches occur within the set of 13 Inktomi matches. The top three documents located by Inktomi are identical versions of the same document stored at three different addresses. Up to a point this is good, as these sites literally can disappear overnight.

A second test was a search on the key word Petrography.

Inktomi    Lycos    Web Crawler    Yahoo
656        641      76             4
Most, but certainly not all, of the more than 600 references identified by Lycos and Inktomi are parts of detailed descriptions of courses, academic programs, and faculty research efforts published by academic institutions. The searcher should keep this in mind as the number of such Web “publications” continues to increase. At some point a coding scheme to identify materials as (a) course descriptions, (b) research publications, (c) course presentations, etc. may help discriminate among kinds of Web contributions. Each user should read the pertinent material supplied with a particular search algorithm and experiment prior to deciding which one(s) will be used routinely. Although there are obvious shortcomings, as noted above, life on the Web without these algorithms truly would be like “drinking from a firehose”.
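The ranking rule described in this column (matches ordered by how often the key words occur) and the “ries” anomaly seen in the first test can be illustrated with a small Python sketch. It is not the method used by any of the four services, whose internals are not described here. The three documents, their file names, and the functions count_hits and and_search are hypothetical, and substring matching is only one possible explanation for matches on words such as “summaries”.

```python
import re


def count_hits(text, keyword, whole_word=True):
    """Count case-insensitive occurrences of keyword in text."""
    pattern = re.escape(keyword)
    if whole_word:
        # Require word boundaries so "ries" inside "summaries" is not counted.
        pattern = r"\b" + pattern + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def and_search(docs, keywords, whole_word=True):
    """Boolean AND search: keep documents containing every keyword,
    ranked by total keyword frequency, highest first."""
    ranked = []
    for name, text in docs.items():
        counts = [count_hits(text, k, whole_word) for k in keywords]
        if all(counts):  # every keyword occurs at least once
            ranked.append((sum(counts), name))
    # Sort by descending frequency; ties are broken by document name.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in ranked]


# Hypothetical documents, invented purely for illustration.
DOCS = {
    "ries.html": "The Ries Crater in Germany: diamonds were found at the Ries impact crater.",
    "tour.html": "A field tour of the Ries crater.",
    "list.html": "Summaries of crater repositories.",
}
```

With whole-word matching, and_search(DOCS, ["Ries", "Crater"]) returns ries.html and tour.html. With whole_word=False, list.html also qualifies, because “ries” occurs inside “Summaries” and “repositories”, and it ranks above tour.html on frequency alone.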