IRList Digest           Tuesday, 23 September 1986      Volume 2 : Issue 48

Today's Topics:
   COGSCI - Dimensionality Reduction (conn. networks)
   Article - Software Reuse Through Information Retrieval - Part 2

News addresses are
   ARPANET: fox%vt@csnet-relay.arpa    BITNET: foxea@vtvax3.bitnet
   CSNET: fox@vt                       UUCPNET: seismo!vtisr1!irlistrq

----------------------------------------------------------------------

Date: Mon, 15 Sep 86 18:47:55 edt
From: DEJONG%OZ.AI.MIT.EDU@MC.LCS.MIT.EDU
Subject: Cognitive Science Calendar

Date: Monday, 15 September 1986  12:09-EDT
subject: Center for Biological Information Processing Seminar

Wednesday, 17 September  12:00pm  Room: E25-401

        Dimensionality-Reduction Using Connectionist Networks

                             Eric Saund
                    MIT Department of Psychology

Recently, methods have been developed for training "connectionist"
networks of simple computing elements to perform a wide variety of
input/output associative mappings. These methods are interesting
because, under some circumstances, networks are able to acquire
automatically intermediate representations that capture regularities
in the mapping.

In this talk I will present a way to perform dimensionality-reduction
in such a network. Dimensionality-reduction is a coding of
multi-dimensional data that is constrained to lie on a
lower-dimensional surface embedded in a high-dimensional
feature-space; dimensionality-reduction is a generalization of factor
analysis. I will discuss why dimensionality-reduction may prove a
useful computational tool for later visual processing such as shape
analysis.

------------------------------

Date: Fri, 12 Sep 86 22:31:44 EDT
From: seismo!allegra!hoqam!wbf
Subject: Software Reuse through IR [ Part 2 - Ed]

            Software Reuse Through Information Retrieval

                            W. B. Frakes
                            B. A. Nejmeh

                     AT&T Bell Laboratories
                    Holmdel, New Jersey 07733

[Note: sections 1-4 appeared in the last IRList issue - Ed]

5. Types of Interactive Information Systems

Many different types of systems for handling information are
currently in use.
These systems have different underlying models and capabilities.
Perhaps the best known type of system is the database management
system (DBMS) [11]. DBMS are widely used for storing, managing, and
retrieving highly structured information such as parts lists,
personnel files, etc. Retrieval from these systems is deterministic.
For example, if a query is put to a DBMS asking for records of all
employees in Kansas City who make more than $35,000, the system will
retrieve all and only those records matching the query criteria.

While DBMS are powerful, they are usually limited in their ability to
handle data that is not highly structured, such as text or source
code. Current systems for handling this kind of data are information
retrieval (IR) systems [12] [13]. Originally developed to manage the
literature of the natural sciences, IR systems incorporate many
techniques for storing and retrieving unstructured data. These
techniques, such as boolean queries and partial string matching, are
discussed below.

As a demonstration of the use of IR systems for software reuse, we
built a small database of software modules using CATALOG, an IR
system developed by Bill Frakes, Steve Cox, and Bill Leighton at AT&T
Information Systems. These modules were from SUPER [14], a system
built at Bell Laboratories for interactive reliability analysis. The
information used to index these modules was taken from the
descriptive headers required of each module in the SUPER system. The
text from these headers was passed to CATALOG, which placed the words
from the text in inverted files. Searches of this database could then
be made via CATALOG's menu or command driven search interfaces as
described below.

6. The CATALOG Information Retrieval System

CATALOG is a high performance information retrieval system designed
to allow end users to create, maintain, and search databases
containing both formatted records, such as are typically found in
DBMS, and unformatted records, such as text, which most DBMS handle
poorly. It is now being used widely within AT&T for such tasks as
document management, marketing information management and
distribution, as the basis of LATTIS, the AT&T IS library system
[15], and as the basis of Video Data Locator, a CATALOG application
that allows retrieval of both text and color images.

CATALOG features a database generator which assists users in setting
up databases, an interactive tool for creating, modifying, adding,
and deleting records, and a search interface with a menu driven mode
for novice users and a command driven mode for expert users. The
search interface allows full boolean combinations of search terms and
sets of retrieved records, and sophisticated partial term matching
techniques such as automatic stemming and phonetic matching.

CATALOG databases are built using B-Trees, providing rapid search and
retrieval capabilities. CATALOG was written in the C programming
language, and currently runs under UNIX and MS-DOS. CATALOG was
originally developed on a VAX 11-780, and has since been ported to
the 3B2, 3B5, and 3B20, the IBM PC, the AT&T PC6300, and the PC7300.

6.1 Searching using CATALOG

CATALOG will allow the complete source code for a module to be
entered into the system, providing full source searchable databases.
It is also possible to enter only source code module surrogates, for
example the information in the header such as title, author, and
description. These surrogates are then available as primary
searchable records, and the full records are available as secondary
records for viewing and printing. Both record size and database size
are unlimited.
Searching is carried out using inverted indexes of every significant
word in a database. CATALOG creates sets of records in response to
user queries. These sets can then be combined using boolean operators
to form new sets. The display of these sets shows the query used to
create the list, and the number of records that match the query.

6.2 Multiple Search Interfaces

CATALOG has two main user modes: a novice user mode, which is menu
driven, and a command mode for more experienced users. This allows
the system to adapt itself to the user's level of knowledge.

Novice user mode assumes as little knowledge as possible of the user.
In this menu driven mode, users are prompted to provide search
queries. CATALOG then retrieves and places records corresponding to
the queries into sets. By selecting the appropriate items from menus,
users can sort, display, and perform boolean operations on retrieved
sets of records. It is also possible for an expert user to override
many of the default settings for novice user mode using the methods
described below.

Expert mode assumes a knowledgeable system user, and thus provides
only a simple prompt for commands.

6.3 Queries

In novice mode, CATALOG prompts for queries with the phrase
"Look for:". In response, a query or command (described below) may be
entered. For example, the query:

     Look for: sorting routines

will cause CATALOG to attempt to find records that contain the terms
"sorting" and "routines", and their variants as described below.

CATALOG provides for full boolean search specification through menu
selection. It is also possible, though not necessary, to specify
boolean logic in a query. For example,

     Look for: ((sorting and routines) or quicksort) not heapsort

This query will retrieve source code records about sorting routines
or quicksort, which are not about heapsort.
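The combination of an inverted index with boolean set operations can
be sketched as follows. This is only an illustration in Python, not
CATALOG's implementation (CATALOG is written in C and uses B-Trees).
The record texts and the names build_index and lookup are
hypothetical, and stemming and related-word matching are left out.

```python
from collections import defaultdict

def build_index(records):
    """Map each lowercased word to the set of record ids containing it."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for word in text.lower().split():
            index[word].add(rec_id)
    return index

# Hypothetical module descriptions, invented for this example.
records = {
    1: "sorting routines for integer arrays",
    2: "quicksort implementation",
    3: "heapsort and other sorting routines",
    4: "matrix inversion",
}

index = build_index(records)

def lookup(term):
    """Return the set of record ids for an exact term (empty if absent)."""
    return index.get(term, set())

# ((sorting and routines) or quicksort) not heapsort
# AND is set intersection, OR is union, NOT is set difference.
result = ((lookup("sorting") & lookup("routines"))
          | lookup("quicksort")) - lookup("heapsort")
print(sorted(result))
```

With these sample records the query keeps records 1 and 2: record 3
matches "sorting" and "routines" but is removed because it also
mentions heapsort.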
To find records relevant to a query, CATALOG will take the words in
the query one at a time, and try to find other words in the database
which might be related. If it finds any possibly related words,
CATALOG will present its guesses to the user for selection. For the
query term "sorting", for example, CATALOG might respond as follows:

     Search Term: sorting

          Term          Occurrences
       1. sort               15
       2. sorting             1
       3. sorts               3

     Which terms (0 = none, CR = all) :

Users select the terms they want by entering their numbers.

The "related word" feature can be suppressed by putting the character
"\" at the end of an entered word, in which case the index is
searched for an exact match. The "related word" feature can also be
suppressed by putting wild card characters into an entered word. Two
wild card characters are available. The character "*" stands for zero
or more occurrences of any character, and the character "?" stands
for a single arbitrary character. Thus, the term airlin* will match
the words airline, airlines, airliner, etc., while the term airlin?
will match airline but not airlines or airliner. Wild card characters
cannot be used as the first character of a word. That is, air*ine and
airlin? are legal search terms, but *irline and ???line are not.

CATALOG also provides the ability to match on phonetic variants of a
query term. This feature will be most useful with human names. If a
field has been marked for phonetic searching, the phonetic match
routine will relate such names as "Kahn", "Cohen", "Cohn", etc. The
phonetic match is invoked by appending the character "#" to a search
term.

6.4 List Display

When a user has made his choices for all the words in the query, a
list such as the following is formed.

     Lists (& indicates a stemmed term)        records
     a)  (software)                               26
     b)  (sorting and algorithms)                  9
     c)  (system& and call&)                       3

This display indicates that three searches have been done, and that
the last search formed list "c", which contains three records. These
three records are related to both the concept "system" (i.e. the
records contain one or more words related to the word stem "system")
and the concept "call" (i.e. the records contain one or more words
related to the word stem "call").

6.5 Main Menu Display

The main menu in CATALOG allows users to exit the system, access help
messages, go back to the "Look for:" prompt, and perform operations
on record lists. The main menu display looks like this.

     Options:
     1  Exit from system
     2  Display items from a list
     3  Do another search (Go back to 'Look for')
     4  Create a new list by finding items common to
        2 or more lists (AND)
     5  Create a new list by including all items from
        2 or more lists (OR)
     6  Create a new list by removing from a list
        items from 1 or more other lists (NOT)
     7  Delete 1 or more lists
     8  Help message menu

     Choice:

By selecting appropriate items from this menu, users can manipulate
the system to give desired results such as:

     o  Creating new lists of records, e.g. software modules, from
        old lists
     o  Removing lists
     o  Displaying and printing records
     o  Sorting lists
     o  Placing records in files
     o  Restricted field searching
     o  Restricted field display
     o  Restricted date searching

6.6 Help Messages

Detailed help messages are available for all system functions. When
users first log into CATALOG, they are asked if they need help. If
they say yes, they can access help through interactive menus. Help
messages can also be accessed throughout the search session by
specifying the appropriate menu items or commands.

6.7 Command Mode

By typing the command "-c" at the search prompt in novice user mode,
a user can enter command mode. Command mode assumes a knowledgeable
user, and thus only prompts for commands. It is possible to use any
search or display function with commands.

6.8 SDI - Setting Up a User Interest Profile

Selective Dissemination of Information (SDI) is a technique used to
keep users of an information system alerted to new additions to the
system database.
CATALOG provides such a capability. SDI using CATALOG is done by
maintaining a file containing user IDs and lists of interest words.
The general form is:

     userid word1 word2 ... wordn

for example,

     smith set manipulation algorithms

This file can be matched against the full database, or against
updates to the database. Lists of the records matching the profiles
are then sent to the appropriate users. The SDI feature will be
useful for alerting software developers to software modules of a
given type that have been added to the system.

[Note: continued in next issue of IRList - Ed]