IRList Digest           Monday, 10 Feb 1986      Volume 2 : Issue 8

Today's Topics:
   Email - Change in Origin of IRList (repeat of msg in Issue 7)
   Query - Opportunities for well-trained graduate with MLS in NYC?
         - Central source on new technology and development?
   Abstracts - Articles selected by Salton or Raghavan (pt. 2 of 3)

----------------------------------------------------------------------

From fox Mon Feb 3 09:40 EST 1986
Subject: Changing of site of origin of IRlist (repeat of msg in Issue 7)

Dear IRList Subscribers,

In the interest of reliability and cost savings, IRlist will be sent from
   seismo!vtisr1!irlistrq
instead of
   fox%vpi@csnet-relay

To be sure there is no mishap, Issue 7 is being sent from irlistrq with this message, and Issue 8 is being sent from the CSNET address above. If you receive one issue but not the other, please notify me at one of the addresses given below. Also, if you are missing any issues (V1 1-28, V2 1-8), feel free to contact me if you can't get a copy from another user or site.

I hope this works smoothly! - Ed

__________________________________________________________________
UUCP:   seismo!vtisr1!irlistrq
ARPA:   vtisr1!irlistrq@seismo
        foxea@vtvax3.bitnet@wiscvm
        fox@vtcs1.bitnet@wiscvm
        fox%vpi@csnet-relay
CSNET:  fox@vpi
BITNET: foxea@vtvax3
        fox@vtcs1

------------------------------

From: KJP%ibm-sj.arpa@CSNET-RELAY
Date: 4 Feb 86 13:50:36 EST

... [...] just completed her MLIS degree in Library and Information Science from Berkeley. She concentrated in Information Systems (i.e., took several programming, data-base, etc. courses) and is now looking for a job in the New York City area. Her ultimate goal is to go into MIS or become a data-base administrator. The problem that she is encountering is that, on the East Coast, people look at an MLS degree and conclude that you are able to do only library work.
On the other hand, had she remained in the San Francisco area, where companies know about Berkeley's MLIS program, she could have had two job offers: as a programmer with [...], or as a data-base administrator with a [...] company.

So, my question for you is: Do you know of any companies in the NY area which look beyond the "MLS" label to see that this degree is well-suited for non-traditional "library" jobs? Any help would be appreciated.

Thanks,
Ken Perry

[Note: Information Science is indeed an area where employers must really look at the individual's background and gauge ability for the task at hand! The person mentioned might try the Information Industry Association, 316 Penn. Ave., SE, Ste. 400, Washington D.C. 20003 (202) 544-1969 or JOBLINE at American Society for Information Science, 1424 Sixteenth St., N.W., Suite 404, Washington D.C. 20036 (202) 462-1000 to file a resume and ask to be listed in announcements. Readers - if you have other suggestions, send them to Ken or to me to forward. - Ed]

------------------------------

Date: 3-FEB-1986 16:20:31
From: ARCHIVE%vax3.oxford.ac.uk@cs.ucl.ac.uk

.. I'd be interested in IRList - I do a lot of work at present in that area using special architectures to give content addressing capabilities. Use my personal account LOU @ OX.VAX1 rather than ARCHIVE tho.

.. Incidentally, a company called Sydney has been showing a CD-ROM version of the Library of Congress Catalogue around here lately; also word has reached us from California of a CD-ROM version of the Thesaurus Linguae Graecae. Is there any central place where information about these evolving technologies can be obtained?

Best wishes,
Lou Burnard

[Note: There is an annual publication of the Amer. Society of Inf. Science called ARIST. Volume 20 is the next due. They have good surveys on many topics such as one in Vol. 19 by Chuck Goldstein on Storage Technology. ASIS has numerous special interest groups to try to cover the field.
[Too] many conferences are being held -- March 4-7 in Seattle will be the 1st Int'l Conf. on CD ROM. Does anyone have other comments on information sources? - Ed]

------------------------------

From: "V.J. Raghavan"
Date: Fri, 24 Jan 86 19:20:08 cst
To: IRList%vpi.csnet@CSNET-RELAY
Subject: submission to IR list

[long set of abstracts - Ed]

ABSTRACTS
(Chosen by G. Salton or V. Raghavan from 1983 issues of journals in the retrieval area)

11. INFORMATION RETRIEVAL AT THE SEDGWICK MUSEUM
M.F. Porter
Dept. of Earth Sciences, University of Cambridge, Downing Street, Cambridge CB2 3EQ, UK

The Sedgwick Museum at the University of Cambridge now has a high quality and comprehensive online IR system covering its collection of 450,000 catalogued fossil objects. The indexing process and the retrieval capabilities are described in detail, and an example is given of how the IR system is used with real museum enquiries. It is also shown how the IR system is used as an aid in many different aspects of data management, such as catalogue updating and editing, and dealing with loans of specimens and movements of specimens between drawers.
(INFORMATION TECHNOLOGY: RESEARCH & DEVELOPMENT, Vol. 2, No. 4, pp. 169-186, 1983)

12. THE UTAH TEXT RETRIEVAL PROJECT
L.A. Hollaar
Dept. of Computer Science, University of Utah, Salt Lake City, UT 84112

The Utah Text Retrieval Project seeks well-engineered solutions to the implementation of large (over 50 x 10**9 characters), inexpensive (less than a dollar a query), rapid (average response time of 10 seconds) text information retrieval systems. It was established in 1980 in the Department of Computer Science at the University of Utah, and is an outgrowth of a similar project at the University of Illinois with which the author was associated. At the present time, the project has three major components.
Perhaps the best known is the work on the specialized processors, particularly search engines, necessary to achieve the desired performance and cost. The other two concern the user interface to the system and the system's internal structure. The work on user interface development is not only concentrating on the syntax and semantics of the query language, but also on the overall environment the system presents to the user. Environmental enhancements include convenient ways to 'browse' through retrieved documents, access to other information retrieval systems through gateways supporting a common command interface, and interfaces to word processing systems. The system's internal structure is based on a high-level data communications protocol linking the user interface, index processor, search processor, and other system modules. This allows them to be easily distributed in a multi- or specialized-processor configuration. It also allows new modules, such as a knowledge-based query reformulator, to be added.
(INFORMATION TECHNOLOGY: RESEARCH & DEVELOPMENT, Vol. 2, No. 4, pp. 155-168, 1983)

13. A GENERALIZED TERM DEPENDENCE MODEL IN INFORMATION RETRIEVAL
C.T. Yu
Dept. of Information Engineering, University of Illinois-Chicago Circle, Chicago, Illinois, 60680
C. Buckley
Dept. of Computer Science, Cornell University, Ithaca, NY 14853
K. Lam
Dept. of Statistics, Hong Kong University, Hong Kong
G. Salton
Dept. of Computer Science, Cornell University, Ithaca, NY 14853

The tree dependence model has been used successfully to incorporate dependencies between certain term pairs in the information retrieval process, while the Bahadur Lazarsfeld Expansion (BLE) which specifies dependencies between all subsets of terms has been used to identify productive clusters of items in a clustered database environment. The successes of these models are unlikely to be accidental; it is of interest therefore to examine the similarities between the two models.
The disadvantage of the BLE model is the exponential number of terms appearing in the full expression, while a truncated BLE system may produce negative probability values. The disadvantage of the tree dependence model is the restriction to dependencies between certain term pairs only and the exclusion of higher-order dependencies. A generalized term dependence model is introduced in this study which does not carry the disadvantages of either the tree dependence or the BLE models. Sample evaluation results are included to illustrate the operations of the generalized system.
(INFORMATION TECHNOLOGY: RESEARCH & DEVELOPMENT, Vol. 2, No. 4, pp. 129-154, 1983)

14. FULLY AUTOMATIC BOOK INDEXING
Martin Dillon
School of Library Science, University of North Carolina
Laura K. McDonald
Information Systems, Blue Cross-Blue Shield of North Carolina

The Fully Automatic Syntactically-based Indexing of Text (FASIT) system represents the contents of a document without a full parse or semantic analysis of the text. Content-bearing units are isolated and then grouped into quasi-synonymous classes whose main term is used to index the document. Previous experiments with FASIT demonstrated its usefulness in an associational retrieval environment; the experiment described here explores FASIT's value as a book-indexing system. It is difficult to avoid the conclusion that this indexing approach offers the promise of being practical and effective.
(JOURNAL OF DOCUMENTATION, Vol. 39, No. 3, pp. 135-154, 1983)

15. EXTENDED BOOLEAN INFORMATION RETRIEVAL
Gerard Salton
Cornell University
Edward A. Fox
International Institute for Tropical Agriculture, Ibadan, Nigeria
Harry Wu
ITT Programming Technology Center

A new, extended Boolean information-retrieval system is introduced that is intermediate between the Boolean system of query processing and the vector-processing model.
The query structure inherent in the Boolean system is preserved, while at the same time weighted terms may be incorporated into both queries and stored documents; the retrieved output can also be ranked in strict similarity order with the user queries. A conventional retrieval system can be modified to make use of the extended system. Laboratory tests indicate that the extended system produces better retrieval output than either the Boolean or the vector-processing system.
(ACM COMMUNICATIONS, Vol. 26, No. 11, pp. 1022-1036, 1983)

16. HIERARCHICAL FILE ORGANIZATION AND ITS APPLICATION TO SIMILAR-STRING MATCHING
Tetsuro Ito and Makoto Kizawa
University of Library and Information Science, Ibaraki, Japan

The automatic correction of misspelled inputs is discussed from a viewpoint of similar-string matching. First a hierarchical file organization based on a linear ordering of records is presented for retrieving records highly similar to any input query. Then the spelling problem is attacked by constructing a hierarchical file for a set of strings in a dictionary of English words. The spelling correction steps proceed as follows: (1) find one of the best-match strings which are most similar to a query, (2) expand the search area for obtaining the good-match strings, and (3) interrupt the file search as soon as the required string is displayed. Computational experiments verify the performance of the proposed methods for similar-string matching under the UNIX time-sharing system.
(ACM TRANSACTIONS ON DATABASE SYSTEMS, Vol. 8, No. 3, pp. 410-433, 1983)

17. INDEXING AND RETRIEVAL STRATEGIES FOR NATURAL LANGUAGE FACT RETRIEVAL
Janet L. Kolodner
Georgia Institute of Technology

Researchers in artificial intelligence have recently become interested in natural language fact retrieval; currently, their research is at a point where it can begin contributing to the field of Information Retrieval.
In this paper, strategies for a natural language fact retrieval system are mapped out, and approaches to many of the organization and retrieval problems are presented. The CYRUS system, which keeps track of important people and is queried in English, is presented and used to illustrate those solutions.
(ACM TRANSACTIONS ON DATABASE SYSTEMS, Vol. 8, No. 3, pp. 434-463, 1983)

18. PARTIAL MATCH RETRIEVAL USING HASHING AND DESCRIPTORS
K. Ramamohanarao, John W. Lloyd, and James A. Thom
University of Melbourne

This paper studies a partial-match retrieval scheme based on hash functions and descriptors. The emphasis is placed on showing how the use of a descriptor file can improve the performance of the scheme. Records in the file are given addresses according to hash functions for each field in the record. Furthermore, each page of the file has associated with it a descriptor, which is a fixed-length bit string, determined by the records actually present in the page. Before a page is accessed to see if it contains records in the answer to a query, the descriptor for the page is checked. This check may show that no relevant records are on the page and, hence, that the page does not have to be accessed. The method is shown to have a very substantial performance advantage over pure hashing schemes, when some fields in the records have large key spaces. A mathematical model of the scheme, plus an algorithm for optimizing performance, is given.
(ACM TRANSACTIONS ON DATABASE SYSTEMS, Vol. 8, No. 4, pp. 552-576, 1983)

19. OUTLINE OF A GENERAL PROBABILISTIC RETRIEVAL MODEL
Abraham Bookstein
University of Chicago

For reasons of technical convenience, current retrieval algorithms based on probabilistic reasoning are derived from models that assume patrons evaluate documents using a two value relevance scale. This paper extends the theory by describing a model which includes a more general relevance scale.
This model permits a re-examination of the earlier theory as a special case of that developed here and leads to a more satisfying interpretation of the ranking principle of the earlier models.
(JOURNAL OF DOCUMENTATION, Vol. 39, No. 2, June 1983, pp. 63-72)

20. TEXT ANALYSIS AND BASIC CONCEPT STRUCTURES
John M. Weiner
University of Southern California, School of Medicine, 2025 Zonal Avenue, Los Angeles, CA 90033, U.S.A.

Information specialists frequently are called upon to analyze unfamiliar subjects. With the growth in volume and topic, specialists will require techniques to deal with textual material rapidly and effectively. This paper describes a method of text analysis designed to facilitate extraction of terms related to a single characteristic or concept. The term extraction is performed by completing the sentence: "The characteristic of interest is described by ( - descriptive term - )." Using this method, the analyst can extract attributes of the basic characteristic and terms representing related characteristics. With these two classes of terms, the analyst can build a basic concept structure describing the subject matter. Prior knowledge of the subject is not required. The method is illustrated using pathological descriptions of female genital cancers.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 19, No. 5, pp. 313-319, 1983)

21. AUTOMATIC SPELLING CORRECTION USING A TRIGRAM SIMILARITY MEASURE
Richard C. Angell, George E. Freund and Peter Willett
Department of Information Studies, University of Sheffield, Sheffield S10 2TN, England

A nearest neighbour search procedure is described for the automatic correction of misspellings. The procedure involves the replacement of a misspelt word by that word in a dictionary which best matches the misspelling, the degree of match being calculated using a similarity coefficient based on the number of trigrams common to the two words.
Experiments with a collection of 1544 misspellings and a dictionary of 64,636 words suggest that the procedure results in the unique identification of the correct spelling for over 75% of the misspellings if the correct form of the word is in the dictionary, and that this figure may be increased to over 90% if near, rather than nearest, neighbours are acceptable.
(INFORMATION PROCESSING & MANAGEMENT, Vol. 19, No. 4, pp. 255-261, 1983)
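
As a rough illustration of the idea in abstract 21, the following Python sketch ranks dictionary words by the number of character trigrams they share with a misspelling. The abstract does not give the exact coefficient or any padding convention, so the Dice-style formula, the "$" padding character, and the function names (trigrams, similarity, correct) are all assumptions made for this example and should not be read as the authors' implementation.

```python
from collections import Counter


def trigrams(word, pad="$"):
    """Return the multiset of character trigrams of a lowercased word.

    Padding with two pad characters on each side is an assumption of this
    sketch; it lets the first and last letters contribute to trigrams.
    """
    padded = pad * 2 + word.lower() + pad * 2
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a, b):
    """Dice-style coefficient: 2 * common trigrams / total trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    common = sum((ta & tb).values())  # multiset intersection
    total = sum(ta.values()) + sum(tb.values())
    return 2 * common / total if total else 0.0


def correct(misspelling, dictionary, k=1):
    """Return the k dictionary words most similar to the misspelling.

    k=1 gives the nearest neighbour; a larger k gives near neighbours.
    """
    ranked = sorted(dictionary,
                    key=lambda w: similarity(misspelling, w),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    words = ["retrieval", "retrial", "reversal", "arrival"]  # toy dictionary
    print(correct("retreival", words))       # nearest neighbour only
    print(correct("retreival", words, k=2))  # two near neighbours
```

Calling correct with k greater than 1 corresponds to accepting near, rather than nearest, neighbours. A linear scan over the dictionary is used here only for brevity; a dictionary of the size reported in the abstract would more plausibly be searched through an inverted index from trigrams to words.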