Received: by stubbs.ucop.edu (5.57/1.34) id AA04569; Wed, 24 Feb 93 16:22:16 -0800 Message-Id: <9302250022.AA04569@stubbs.ucop.edu> Received: from UCCVMA.UCOP.EDU by uccvma.ucop.edu (IBM VM SMTP V2R2) with BSMTP id 4477; Wed, 24 Feb 93 16:28:12 PDT Received: from UCCMVSA.BITNET (NJE origin NCG$UR@UCCMVSA) by UCCVMA.UCOP.EDU (LMail V1.1d/1.7f) with BSMTP id 0181; Wed, 24 Feb 1993 16:28:11 -0700 Received: by UCCMVSA.BITNET Wed, 24 Feb 93 16:27:35 PST Date: Wed, 24 Feb 93 16:27:35 PST From: "Nancy Gusack" To: NCG@stubbs IRLIST Digest March 16, 1990 Volume VII Number 8 Issue 14 ********************************************************** This issue and future issues of IRLIST Digest will be much shorter than the issues in 1989 and early 1990. The basic outline, or table of contents, will be included in every issue, although we may not have received submissions for each topic. Issues will be compiled and produced as material is received. I. NOTICES: A. Meetings annoncements/Calls for papers B. Publications announcements C. Miscellaneous II. QUERIES: A. Questions and answers 1. Cache theory question (UNIX) B. Requests for information III. JOB ANNOUNCEMENTS: IV. PROJECTS: A. Initiatives and proposals B. Bibliographies 1. Selected IR-related dissertation abstracts (This will be an ongoing submission as there are hundreds of abstracts.) C. Abstracts D. Miscellaneous ********************************************************** II. QUERIES: II.A.1. Fr: Ronald Guilmette Re: Cache theory question (UNIX) In article <1456@darkstar.ucsc.edu> cjp%megatek.UUCP@ucsd.edu (Chris Pikus) writes: > The Unix disk buffer cashe is organized as >a combination of both a hash table and a linked list. >that is there several linked lists and which one to use >is determined by a hashing function. > > To wit, when one wishes to access/add a particular block >in the buffer cache, the kernel performs a hashing function >on BLKNO (usually a mod( 17) i beleive) and then walks down >the that list to find said BLKNO. > > My question is thus: WHY mod 17??? > > I know the answer is fairly obvious, since 17 is >a prime number, there would more likely to be an even >distribution amongst the lists. If something like a filesystem >started to have certain things regularly spaced, they might have >a tendency to "resonate" adn fill up one list if one used a number >like 16. > > To be more precise, is there any reason they picked 17, and >if so, is there any supporting documentation/articles/mathematical_treatises? >Could they have picked any prime (like 29, or 53, or 2011)? > > I know this is a pretty esoteric question but I was >curious and I know there is a lot of UNIX historians on the net. > > Thank you very much, > Christopher J. Pikus I am curious about this very same question. Recently, I was writing a tool which reads assembly code for a particular machine and does some mangling on it. While doing this, I was hunting for a nice hash function for the valid mnemonics for that machine. I quickly found that hash functions which involved prime numbers gave much better results (i.e. in terms of minimizing collisions) than did ones without a prime involved. A quick check of Knuth's books confirmed this to be a well known fact. Now the only question remaining for me was "Which prime should I use for this particular set of mnemonics?" I wrote a short program which would try all of the primes up to 65536 with the following function: unsigned int hash_mask = /* some power of two */ ; static unsigned int hash (const char *string, unsigned int prime) { unsigned int ret_val = 0; for (const char *p = string; *p; p++) ret_val = (*p  ret_val) * prime; return ret_val & hash_mask; } I had the program evaluate the "efficiency" of all of the various primes for the given mnemonic set. For the mnemonic set I was interested in (consisting of 165 strings) my program found a prime that (when used with a primary table size of 2048) yielded a perfect hash function. The prime was 5381. Curiously however, the prime number 17 also yielded good efficiency (i.e. a small percentage of collisions) with several different primary table sizes (with my specific input key set). Other numbers that seemed good with various table sizes were 19 and 29, but 17 was always better. Now in my case, I had a fixed input key set that was not going to be changed, so I went with 5381 as my prime, but it occured to me that a different prime would probably be best for my hash function which hashed *identifiers*. Obviously, there is no way to predict the set of identifiers which will be used in a given input program, so I would like to know if there have been any studies which would suggest that any particular prime number (17?) would (on average) yield good results for arbitrary sets of input strings and arbitrary primary table sizes. Anybody know? Shall I do one? Is there any theoretical basis for determining the one "right" answer to this question (if there is one), or is an empirical study the only way to come up with (at least) a good guess? P.S. Knuth uses 1009 in his examples, but what does he know! :-) // rfg ********************************************************** IV. PROJECTS IV.B.1. Fr: Susanne Humphrey Re: Selected IR-Related Dissertation Abstracts Compiled by: Susanne M. Humphrey National Library of Medicine Bethesda, MD 20894 The following are citations selected by title and abstract as being related to Information Retrieval (IR), resulting from a computer search, using the BRS Information Technologies retrieval service, of the Dissertation Abstracts International (DAI) database produced by University Microfilms International. Included are the UMI order number; author; university, degree, and, if available, number of pages; title; DAI subject category chosen by the author of the dissertation; and abstract. Unless otherwise specified, paper or microform copies of dissertations may be ordered from University Microfilms International, Dissertation Copies, Post Office Box 1764, Ann Arbor, MI 48106; telephone for U.S. (except Michigan, Hawaii, Alaska): 1-800-521-3042, for Canada: 1-800-268-6090. Price lists and other ordering and shipping information are in the introduction to the published DAI. An alternate source for copies is sometimes provided at the end of the abstract. The dissertation titles and abstracts contained here are published with permission of University Microfilms International, publishers of Dissertation Abstracts International (copyright by University Microfilms International), and may not be reproduced without their prior permission. AU UGALDE ARIAS, LUIS ALBERTO. IN University of Minnesota Ph.D. 1988, 258 pages. TI EFFECTIVE INFORMATION MANAGEMENT IN FORESTRY: AN APPLICATION TO FUELWOOD AND MULTI-PURPOSE TREE SPECIES RESEARCH IN CENTRAL AMERICA. DE Agriculture, Forestry and Wildlife. AB The main goals of this study were: (a) the development of a methodology to collect and organize silvicultural and environmental information in forestry research on fuelwood and multi-purpose tree species (MPTS) production, and (b) the design of a Management Information System (MIS) to supply decision-support for different end-users. This study was supported by a project of USAID for fuelwood research in six Central American countries. Uniform standards and guidelines for implementing fuelwood and MPTS experiments were established in order to permit global exchange and transfer of information on MPTS research. These standards for data collection and field measurements, and minimum data sets were developed in coordination with scientists to ensure the collection of useful information and to gain acceptance of these standards. The minimum data sets were developed to reflect what can be achieved at a reasonable logistic expense with an acceptable degree of consistency. The approach emphasized flexibility and simplicity in the database implementation to allow for added complexity as the demand for information and models increases. A menu-driven retrieval process for the information was developed using a microcomputer. For the implementation of the database KnowledgeMan/2 software was used. Establishment of MPTS information databases will permit improvement in all phases of forest management, including seed procurement and species selection for environmental zones. The database can support the decision on what, where, and how to plant trees. Assessments of trade-offs among various production systems in terms of different types of biomass components are also possible. It should elucidate climatic, biophysical, and social constraints in tree seedling production and establishment, and improved nursery operation and stand-management techniques. Finally, the MIRA (Manejo de Informacion sobre Recursos Arboreos) system developed in this study can be considered as a pioneering effort in database and MIS technology in the tropical regions of the world for the application of silvicultural research. It is also a first in terms of the cooperative effort carried out by six Central American countries to develop and use standardized data collection procedures for better coordinated research in MPTS. AN University Microfilms Order Number ADG88-19418. AU MORRIS, ANDREW HUNTER. IN Texas Tech University Ph.D. 1988, 237 pages. TI SUPPORTING ENVIRONMENTAL SCANNING AND ORGANIZATIONAL COMMUNICATION WITH THE PROCESSING OF TEXT: THE USE OF COMPUTER-GENERATED ABSTRACTS. DE Business Administration, General. Information Science. AB This research proposes a model text-based decision support system designed to support the activities of environmental scanning and organizational communication by actively filtering and condensing text. To filter text-based information requires the use of automatic routing schemes; to condense text requires the use of computer-generated abstracts or extracts. A key element in the model system is the ability of the computer to condense text by generating short abstracts of documents. Two approaches to condensing text have been proposed: (1) using natural language processing techniques to construct a knowledge base of the document contents, from which to write an abstract, and (2) employing algorithm-based extracting systems to generate extracts of important sentences and phrases. Systems using natural language techniques are still being researched; most are successful only in limited domains. Systems using extracting algorithms have been researched, but have not been applied to the problem of information overload in an organizational decision-making context. These two approaches were tested in a laboratory setting with student subjects. An algorithm for generating extracts was developed based on the combined work of previous researchers, and tested against an expertly written abstract such as might be constructed by a non-domain specific artificial intelligence system if one is developed in the future. Results of the study indicate that there was no difference in comprehension of the documents when the information was presented with the full text, by extract, or by abstract. These results demonstrate that an algorithm for computer-generated extracts can be successfully applied to text, reducing reading time and document length without significantly reducing comprehension of the information contained in the original text. AN University Microfilms Order Number ADG88-23547. AU LEITHEISER, ROBERT LEO. IN University of Minnesota Ph.D. 1988, 490 pages. TI AN EXAMINATION OF THE EFFECTS OF ALTERNATIVE SCHEMA DESCRIPTIONS ON THE UNDERSTANDING OF DATABASE STRUCTURE AND THE USE OF A QUERY LANGUAGE. DE Business Administration, Management. Information Science. AB Business organizations are increasingly relying on data resources that are stored in computerized databases. Utilization of these resources requires knowing (1) the logical organization of data in the database and (2) a language for retrieving data from that organization. The logical organization of a database is described by a database representation. The purpose of this research is to investigate the effects of different features of database representations on the learning and use of database systems by business end users. Three features are examined: (1) type of concept used, (2) type of symbol used, and (3) use of explicit representations of associations between major database objects. These features define different approaches to representing database organizations. A two stage experiment was performed that compared four representation approaches. Each approach used a combination of the three representation features listed above. In the first stage, subjects (MBA students) learned one of the representation approaches and then applied what they had learned to three non-language database tasks. In the second stage they learned the SQL query language, and performed three database tasks with the approach and the language. The principal findings were that the approach that used semantic concepts took less time to learn than any of the storage concept approaches and led to higher performance on one of the "without language" performance tasks. The same approach resulted in longer times for learning the SQL language and poorer performance on "with language" tasks. No major differences were found in learning and performance that were due to the type of symbols used or the use of explicit association representations. These findings were taken as evidence that the approach that used semantic concepts has some advantages over the storage concept approaches. Unfortunately, these advantages are lost when the popular SQL query language is learned and used. The study calls for further research to develop more appropriate languages for semantic representations and to further explore the effects caused by specific representation features. AN University Microfilms Order Number ADG88-19930. AU CHEN, HUNG-PIN. IN The Louisiana State University and Agricultural and Mechanical Col. Ph.D. 1988, 173 pages. TI QUERY PROCESSING ON THE ENTITY-RELATIONSHIP GRAPH BASED RELATIONAL DATABASE SYSTEMS. DE Computer Science. AB An ERG (Entity-Relationship Graph) can be used to provide a semantic structure to a relational database system. An ERG is defined by local regions. A local region contains two nodes of entity types and a node of relationship type. The semantic constraints of the database represented by the ERG (Entity-Relationship Graph) can be used to enforce the global integrity of the database system. A query is mapped onto the ERG to obtain an ERQG (Entity-Relationship Query Graph). This mapping can be specified by the user by navigating the database or automatically allocated by the system via a universal relation interface. The ERQG representation of a query can be semantically decomposed into a sequence of Local Regions. These Local Regions can then be processed according to their order in the query. The ER-semijoin operation is introduced to process this sequence of Local Regions. Using this approach, architectures of database systems are proposed--two-phase interface and one-phase interface. An implementation of a user interface is also discussed. AN University Microfilms Order Number ADG88-18780. AU EPSTEIN, RICHARD GARY. IN Temple University Ph.D. 1988, 631 pages. TI INFORMATICS CALCULUS: A GRAPHICAL, FUNCTIONAL QUERY LANGUAGE FOR INFORMATION RESOURCE SYSTEMS. (VOLUMES I AND II). DE Computer Science. AB This dissertation presents the functional information resource model, a data model for multi-media data bases which is an extension of the functional data model. The significant contribution of this distribution is in the presentation of a graphical, functional query language for this data model. This query language is called the informatics calculus. The data model can be viewed as a data model for hypertext systems. Viewed in this light, the informatics calculus can be viewed as a proposal for an interface for hypertext systems which will enable such systems to capture the computation power of traditional databases. The informatics calculus is a data base language in which the user works in a workspace of objects. An object has a type and a state. The type of an object determines its functionality. The emphasis in this dissertation is upon query objects, objects which extract information or generate applications from the information resource. All informatics calculus objects consist of mosaics. A mosaic consists of tiles denoting entity classes, subclasses, conditions, functions, functors, functionals and other types of operations provided by the model. The spatial arrangement of these tiles determine a unique functional expression. For example, horizontal juxtaposition is used to denote the functional combinator of Cartesian product and vertical juxtaposition is used to denote functional composition. Each informatics calculus expression denotes a particular arrangement of information. The final chapter of the dissertation discusses extensions of the informatics calculus which includes an update language and a language for querying the system catalogue. None of the proposed extensions to the informatics calculus departs from the functional framework or the graphical framework of programming by constructing mosaics. The amenability of the informatics calculus for parallel execution is also discussed. AN University Microfilms Order Number ADG88-13974. AU LYNCH, CLIFFORD ALAN. IN University of California, Berkeley Ph.D. 1987, 247 pages. TI EXTENDING RELATIONAL DATABASE MANAGEMENT SYSTEMS FOR INFORMATION RETRIEVAL APPLICATIONS. DE Computer Science. Information Science. Library Science. AB This thesis studies the use of relational database systems to construct large, high performance information retrieval systems such as online library catalogs or citation retrieval applications. The major problem areas in relational implementations are query execution costs, poor space utilization, and functionality deficiencies both in query processing and in query languages such as SQL. Analytic and simulation methods are applied to quantify these problems. Proposals extending earlier work on user-defined operators for relational query languages and accompanying secondary index support allow both efficient query formulation and the definition of space-efficient relational bibliographic databases. When column values follow distributions typical of bibliographic databases (Zipf distributions), a key performance problem is inaccurate selectivity estimation. A framework for incorporating user-defined selectivity estimators into a relational query optimizer is established, and methods are given to construct highly accurate selectivity estimators for bibliographic databases. Relational query optimizer extensions are specified which incorporate query execution plans that use TID list manipulation algorithms for evaluating single-relation queries into the optimizer's vocabulary. With these extensions a relational system can outperform an inverted file retrieval system on bibliographic databases. Also explored are query planner extensions to implement nonmaterialized relations (allowing both partially deferred evaluation of queries and inexpensive iterative query construction) and preexecution identification of queries that will be costly to evaluate or will produce very large results. Both of these features are important for public access information retrieval applications. Finally, the thesis examines difficulties that arise in using a relational query language to support advanced information retrieval techniques such as ranking and weighted retrieval, and develops query language extensions that would significantly improve the performance of such searching techniques in a relational setting. ********************************************************** IRLIST Digest is distributed from the University of California, Division of Library Automation, 300 Lakeside Drive, Oakland, CA. 94612-3550. Send subscription requests to: LISTSERV@UCCVMA.BITNET Send submissions to IRLIST to: IR-L@UCCVMA.BITNET Editorial Staff: Clifford Lynch lynch@postgres.berkeley.edu calur@uccmvsa.bitnet Mary Engle engle@cmsa.berkeley.edu meeur@uccmvsa.bitnet Nancy Gusack ncgur@uccmvsa.bitnet The IRLIST Archives will be set up for anonymous FTP, and the address will be announced in future issues. These files are not to be sold or used for commercial purposes. Contact Mary Engle or Nancy Gusack for more information on IRLIST. The opinions expressed in IRLIST do not represent those of the editors or the University of California. Authors assume full responsibility for the contents of their submissions to IRLIST.