IRList Digest           Wednesday, 25 Sep 1985      Volume 1 : Issue 11

Today's Topics:
   Query - Location of Bruce Croft
         - Info desired on projects using NLP to create knowledge base
         - Format for SIGIR Forum, Values from cosine correlations
   Article - Proposal for SIGIR/SIGDOC Workshop

----------------------------------------------------------------------

From: "Robert B. Allen"
Date: Tue, 17 Sep 85 15:59:12 edt
Subject: Bruce Croft

Do you happen to have an electronic address for Bruce Croft in Dublin?

[Try bcroft@irlearn on BITNET, or bcroft%IRLEARN.BITNET@csnet-relay - Ed]

Thanks
Bob Allen

------------------------------

From: MARS%red.rutgers.edu@CSNET-RELAY
Date: 18 Sep 85 15:30:36 EDT
Subject: NLP for knowledge acquisition
ReSent-From: Ken Laws
ReSent-To: IRList%vpi.csnet@CSNET-RELAY

Hi:

I am interested in info about projects which use Natural Language Processing techniques to analyse scientific articles or abstracts with the aim of deriving knowledge bases from them. I am aware of a few projects in that field (UCLA, IBM Heidelberg, Leiden University, Chemical Abstracts), but I would appreciate any further pointers. Please reply directly to me, and I will summarize to the net.

Thanks.
Nicolaas J.I. Mars

[Can anyone at NLM or CMU report on their efforts? Other projects? - Ed]

------------------------------

From: Michael_Gordon%UB-MTS%UMich-MTS.Mailnet@MIT-MULTICS
Date: Mon, 23 Sep 85 11:15:30 EDT
Subject: SIGIR Forum, etc.

...

1) I'm submitting a camera-ready article for Winter's SIGIR Forum (at the request of Vijay Raghavan). Do you want it in a special format ("long" paper, two columns) or is a typo-free laser printer document OK?

[Since printing many pages is expensive, we ask that people either send us fairly closely typed submissions or else provide machine readable form (preferably TeX or TROFF so we can reformat). DO NOT double space. Laser printer or typeset originals, single spaced, without big margins, are the preferred form.
Long paper, 2 cols, is fine since ACM can reduce and we can save costs. More comments, Vijay? - Ed]

2) Do you have Vijay's electronic mail address?

[I use ihnp4!sask!regina!raghavan@ucb-vax which should help people on DARPA Internet and using UUCP. - Ed]

3) I've been performing some (simulated) Cosine-based retrieval experiments. I consistently see *extremely* high Cosine scores (often over 0.8) between queries and relevant documents. The queries are weighted, often employing 10 or more non-0 terms. These correlations seem too high. I'm wondering what the data for some "typical" Cosine-based retrieval experiments looked like. I've looked in some places I expected to find such results, but without success. In particular, what I'd like to see are things like this:

   a query that was submitted to the system (all terms, plus weights)

   a rank ordering of documents (including the indexed descriptions, including weights, if any) plus an indication of whether, in fact, the document was relevant to the query.

I'm interested in seeing a representative sampling of such data, for weighted and unweighted queries combined with weighted and unweighted docs. Data showing how Cosines seem to vary with the number of terms used to index a document (and/or query) are of interest, too. If you can point anything out, or forward any data, I'd appreciate it greatly.

Thanks a lot,
Mike

[Similarity depends on the queries and documents, obviously! Fairly short queries matched against titles can give many hits with similarity close to 1. Has anyone looked at the distributions of similarities for test collections, in detail?
I do have data that may help and hope others can send some too, so you can summarize results. - Ed]

------------------------------

From: Michael Lesk
Date: Sat, 7 Sep 85 18:28:58 edt
Subject: SIGIR/SIGDOC Workshop

The following is a proposal for a workshop which, although not yet formally approved, [Note: Diana Patterson of SIGDOC has signed - Ed] is very likely to take place in Snowbird, Utah, June 30-July 2, 1986. Chair: Michael Lesk; Local Arrangements: Lee Hollaar; Treasurer: Karen Kukich. Attendance will be limited to 75; there will be no formal proceedings, but a report will be written for some ACM publication; a number of prominent people (Karen Sparck Jones, David McDonald, Donald Walker, Patricia Wright, etc.) have indicated interest in attending.

Comments on the workshop, or indications of interest, are welcome. Please notify the chair at: bellcore!lesk, or lesk%bellcore@csnet-relay, or (if you have current routing tables) lesk@bellcore. Phone: 201-829-4070. NOTE: I will be on vacation Sept 9 - Oct 4; failure to reply during those dates merely means your message has not been read!!

-- Thanks, Michael Lesk

Writing to be Searched:
A Workshop on Document Generation Principles

As computers learn to write English, and others improve at searching it, they ought to benefit from people who know how to do these jobs. We're proposing a workshop bringing together AI specialists in document generation, information retrieval experts, people who know how to write manuals, and those who write programs to evaluate writing.

Introduction.

In recent years there has been a surge of interest in the use of computer programs that write English.[1,2,3] Expert systems, for example, need to explain what they are doing.
Programs are making increasing strides in fluency, domain coverage, and expressive power.[4,5] In fact, it is remarkable that there has been a long discussion over the last ten years about whether or not apes have mastered language, based on utterances such as ``Please tickle more, come Roger tickle''[6] while computer programs saying things like ``The market crept upward early in the session yesterday, but stumbled shortly before trading ended''[7,8] have not impressed the public nearly as much.

But even supposing that computers can now write English, what should they write? One obvious answer is computer programming manuals (``if X is good, recursive X is better''). Today it is more and more important to have good documentation for the increasingly complex systems, typically computer based, which now pervade automobiles, airplanes, military systems, hospitals, telephone companies, and many other areas of life. The manuals associated with many a microcomputer weigh more than the computer does. Worse yet, these manuals vary widely in quality and good manuals are very important for proper use. It is particularly urgent, for example, that operators of complex military systems or nuclear power plants be able to find out what to do in emergencies or other unusual circumstances.

It is also important to consider language generation for other purposes, such as answering questions or explaining the output of expert systems. In fact, we should really be considering the entire information transfer system, in which English serves to represent knowledge and deliver it to people. The use of knowledge representation formalisms for the first purpose and of graphical interaction for the second may greatly affect conventional writing. Indexing techniques, although used now primarily to identify relevant passages, may also serve the knowledge representation function.
Browsing systems may well produce a need for new kinds of documents (Hypertext, Polytext) and thus new kinds of writing.

What We Know Today.

Reference manuals are not conventional literary works. Much of this documentation is never read cover to cover, but is only referred to as necessary. Thus the indexing of this material is almost as essential as its composition. Again, significant strides in this area have also been made by researchers. It is now possible to design full-text retrieval systems that accept conventional documents and questions in natural English,[9] and then retrieve documents or passages from documents that probably answer the questions. Such systems are now for sale from several vendors.[10] Meanwhile, the researchers are exploring the construction and use of thesauri to identify synonyms and related terms automatically, and the use of feedback to improve retrieval based on the results of earlier searches.[11] The introduction of user feedback into retrieval systems and indexing means that we are moving towards a world in which the process of writing followed by reading may be replaced with more integrated information systems, which respond to questions by generating appropriate replies on the fly.

Retrieval studies have indicated that certain kinds of vocabulary control can improve indexing efficiency. Avoiding very common and broad words, for example, or very infrequent words, makes it easier to find the correct documents in response to queries. However, this information is rarely used to affect the writing of documents with the idea that they can then be indexed more satisfactorily. When new computer systems are being designed, and names must be given to a collection of invented objects, nobody bothers to consult retrieval experts to decide on good names, despite their experience with vocabulary control.
The choice of names in fact matters, and people don't agree very much.[12] Indexers have, for years, been familiar with the problems of vocabulary control and choice of words to describe subjects. The methodologies they use, and the similar expertise of lexicographers, have implications for natural language generation. The problems of the optimum selection of names, both with respect to writing descriptions and to designing the systems being described (where possible), are not often considered, nor are the choices evaluated. A recent paper suggesting the use of linguistic principles in designing computer command languages, however, was greeted with enthusiasm at a meeting of computer hackers, so that interactions between these fields seem profitable.[13]

Some interesting retrieval experiments have been run on unusually formatted or unusually structured texts. The National Library of Medicine, for example, assembled the Hepatitis Knowledge Base to serve as an on-line encyclopedia of hepatitis information;[14,15] rather than a conventional monograph or review article, it is a tree-structured outline of this subject area, with many cross references. To date, however, no such experimental document architecture has been widely accepted. Nor have the more formal AI knowledge representation languages, despite their obvious promise and their attraction to information scientists, yet been able to cover a large subject domain.[16]

There is, of course, an overall process of information transfer here. Somewhere, there is a database of information; and there are people who need that information. In between, we produce a document, which describes the program or the database, and which people can then refer to. Note that they rarely read it cover to cover, so that conventional writing rules about plot and characterization are often inappropriate.
``Technical manuals are not bedtime reading.''[17] These are documents typically meant only for retrieval; and yet they tend to be written in the same way, and in the same form, as documents written as literary works. Many of the same problems arise, of course, with respect to textbooks, handbooks, news articles, and even ordinary business correspondence. Again, much of it needs indexing and retrieval. And in many cases it doesn't get indexed or retrieved, because it is too much work when done by hand.

Using reference manuals as an example, there is no general agreement on the kind of manual that ought to be provided. Research on the use of models to teach people about programs has not indicated whether having a concrete analogy to the program task helps or hinders learning.[18] The relative value of examples, explanations, and terse reference summaries is not established by research. And there are arguments about how long a manual should be, with some computer scientists believing that longer is better and others believing that shorter is better. Much of the discussion eventually comes down to ``I know good writing when I see it,'' a statement that even if valid gives little guidance to those trying to produce computer-written documentation.

Nor are the conventional style checking programs of a great deal of help. Too many of the programs now sold to assist writers are trivial Flesch-type readability indexes, or other very low level style and spelling checkers. There is research, but little production, on programs to catch grammatical errors; and the idea of rating English for any rhetorical quality is impossible today. We often describe writing as ``informative,'' ``exciting,'' ``convincing,'' or ``easy to read,'' but there is no program that can evaluate any of these attributes when given a text.

It is clear, however, that documentation is very important: nearly every survey of computer software rates not only the performance but also the manual.
Users depend critically on the manual, and the absence of good documentation will often make an otherwise attractive service or piece of software unusable. One experience indicated that documentation was the most frequent source of trouble in using a microcomputer system.[19]

An important avenue now being explored is the use of computers to generate both their programs and their descriptions from the same formal specification, as in the GIST project of Bill Swartout.[20] This will at least guarantee the consistency of the program and its manual. It is important, of course, not to lose more in style and understandability than is gained in cost and timeliness through the use of computer-written text.

Much talk about documentation today concerns format: we should emphasize that the primary purpose of this workshop is to go beyond that. Page layout is not irrelevant, but it is not a substitute for good English, and it is unfortunate that the ease with which word processors can manipulate format has resulted in much experimentation with appearance, and less with content. The intent of this workshop is to deal with rhetoric and semantics, not with page makeup.

Questions.

Is there a better kind of information transfer system that could be devised? How should expert systems explain what they are doing? What kind of business and military correspondence systems should be built in the future? When, as we expect, computer systems will not only contain but generate their own explanations, should these look like conventional manuals?[21] Is there a difference, for example, between the principles of writing documents and writing explanations for expert systems? Which is more useful? Perhaps only a retrieval system is needed, with an explanation generator; or perhaps conventional manuals should be written, but designed specially to be searched rather than read.
In practice, the people who produce documentation are already starting to ask what kinds of formats, and what kinds of style and typography, are appropriate for documentation anyway. We'd like to upgrade this discussion to talk about rhetorical style.

In particular, given programs that can produce reasonably well-phrased English, it should be possible to turn knobs inside the programs, and see what effects these have on the larger properties of the text. This offers the possibility of producing text under very controlled conditions, much better understood than any generation of such complex material for normal psychological testing. In addition, merely the presence in a workshop of several different natural language generation systems, combined with experts in producing actually useful documents, should be very valuable in discussing what properties of the systems are connected to what features of the resulting English.

Workshop Specifics.

In this workshop we will bring together subject specialists in four main areas:

* Artificial intelligence researchers working in natural language generation;

* Documentation specialists interested in writing style and quality, and in the definition of a `good' document;

* Text analysis developers, building programs that analyze text automatically and try to make value judgments about it; and

* Retrieval experts, who know how to build systems for keyword matching and retrieval.

Another major area that should be represented, but possibly not until a later meeting, is computer graphics. The value of illustrations, diagrams, and charts is unquestioned but it is not clear how we can integrate graphics with text today.

Here are some examples of interesting comments, mostly recent research results in the above fields:

1. A high degree of grammatical variation does not seem important to produce natural effects in short paragraphs (as evidenced by Karen Kukich's stock market report generator).[22]

2.
Structure in queries is not very useful in retrieval; unordered lists of keywords do about as well (Ellen Voorhees and Gerard Salton).[23]

3. Checking for hackneyed phrases, although seemingly a trivial operation, is perceived as very valuable by many writers (either Writer's Workbench, by Nina Macdonald and Lorinda Cherry,[24] or Epistle, by Lance Miller and George Heidorn).[25]

4. Syntax is much less important for retrieval than semantics; you need to know what the words mean more than you need to know their relationship (Harris, Cowie, and Tuttle).[26,27,28]

5. People frequently leaf through manuals, even when tables of contents and indexes are available; documents should be formatted to cater for this (Patricia Wright).[29,30]

6. Editing manuals to make them suitable for machine translation, requiring simple language, has turned out to make them better in the original language as well.

7. Even humor has its place in documentation. ``The grace which eloquence had failed to work in those men's hearts, had been wrought by a laugh.'' (Mark Twain). Seriously, although the point of manuals is not to make the reader laugh, in some computer manuals anything that would keep the reader awake would be valuable.

A possible new strategy might be to bypass the typical indexing step of making a list of words to represent a document, each with an assigned weight, by having the generator select these itself, possibly with greater accuracy. An improvement in the reverse direction might be the use of the same vocabulary control data base that is used for indexing to select the words used in the text.

Note that for retrieval, what really matters is the choice of specific words used to name objects and actions. The structures which connect these words are less important and have been almost unused in retrieval systems. Yet, for those who evaluate English, the specific word choice is almost ignored! Instead, the vast majority of the effort is spent on syntax.
Thus, for documents intended for reference, almost the entire current literature on automatic checking is mis-aimed. Moreover, relatively little effort in document generation as a whole has been spent on choice of specific words or phrases, and yet this is the most important aspect for retrieval purposes.

We hope that by talking to each other, the generators will discover that they can significantly increase the utility of their output without increasing the effort of generating it. And we hope that the retrieval and analysis experts will learn what it is that they should be looking for in documents, and increase the performance of their systems without an increase in cost.

Our best possible outcome, of course, is that the participants will find something which is not quite a conventional reference manual, but serves the same purpose and does it better. Whether this will be a structured document still written in English, or a question-answering database with an explanation generator, it is impossible to say. But unless the various groups start talking to one another, we'll never find out.

Michael Lesk
Bell Communications Research
435 South St., Rm. 2A-385
Morristown, NJ 07960
August 9, 1985

References

1. E. Conklin and D. McDonald, "Salience: The Key to the Selection Problem in Natural Language Generation," Proc. 20th Meeting ACL, pp. 129-135, 1982.

2. K. R. McKeown, "The TEXT System for Natural Language Generation: An Overview," Proc. 20th Meeting ACL, pp. 113-120, Toronto, Ont., 1982.

3. R. E. Cullingford, M. W. Krueger, M. Selfridge, and M. A. Bienkowski, "Automated Explanations as a Component of a Computer-Aided Design System," IEEE Trans. Sys., Man & Cybernetics, pp. 168-181, 1982.

4. W. C. Mann, "An Overview of the NIGEL Text Generation Grammar," Proc. 21st ACL Meeting, pp. 79-84, 1983.

5. A. K. Joshi and B. L. Webber, "Beyond Syntactic Sugar," Proc. 4th Jerusalem Conf. on Information Technology, pp. 590-594, 1984.

6. S.
Chevalier-Skolnikoff, "The Clever Hans Phenomenon, Cuing and Ape Signing: A Piagetian Analysis of Methods for Instructing Animals," in The Clever Hans Phenomenon: Communication with Horses, Whales, Apes and People, ed. Thomas Sebeok and Robert Rosenthal, vol. 364, pp. 60-93, New York Academy of Sciences, 1981.

7. Karen Kukich, Knowledge-Based Report Generation: A Knowledge-Engineering Approach to Natural Language Report Generation, Ph.D. Thesis, University of Pittsburgh, 1983.

8. Karen Kukich, "ANA's First Sentences: Sample Output from a Natural Language Stock Report Generator," Proc. Nat'l Online Meeting, pp. 271-80, 1983.

9. G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.

10. Among sellers of free text retrieval systems are ``Cucumber Information Systems'' (5611 Kraft Drive, Rockville, MD 20852) and ``Knowledge Systems, Inc.'' (12 Melrose St., Chevy Chase, MD 20815).

11. G. Salton, The SMART Retrieval System -- Experiments in Automatic Document Processing, Prentice-Hall, 1971.

12. G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais, "Statistical Semantics: Analysis of the potential performance of key-word information systems," Bell Sys. Tech. J., vol. 62, no. 6, pp. 1753-1806, 1983.

13. Marion O. Harris, "Thoughts on an All-Natural User Interface," Proc. Summer USENIX Conf., pp. 343-347, Portland, Oregon, June 1985.

14. L. M. Bernstein and R. E. Williamson, "Testing of a Natural Language Retrieval System for a Full Text Knowledge Base," J. Amer. Soc. Inf. Sci., vol. 35, no. 4, pp. 235-247, 1984.

15. R. E. Williamson, "ANNOD -- A Navigator of Natural-Language Organized (Textual) Data," Proc. 8th SIGIR Meeting, pp. 252-266, Montreal, Quebec, 1985.

16. M. E. Lesk, "Programming Languages for Text and Knowledge Processing," Ann. Rev. Inf. Sci. and Tech., vol. 19, pp. 97-128, 1984.

17. Janet Asteroff, "On Technical Writing and Technical Reading," Information Technology and Libraries, vol. 4, no. 1, pp.
3-8, March 1985.

18. Christine Borgman, "The User's Mental Model of an Information Retrieval System," Proc. 8th SIGIR Meeting, pp. 268-273, Montreal, Quebec, 1985.

19. Marilyn Mantei and Nancy Haskell, "Autobiography of a First-Time Discretionary Microcomputer User," Human Factors in Computing Systems: Proc. CHI '83 Conference, pp. 286-290, 1983.

20. Bill Swartout, "GIST English Generator," Proc. AAAI-82, pp. 404-409, Pittsburgh, Penn., 1982.

21. Ariel Shattan and Jenny Hecker, "Documenting UNIX: Beyond Man Pages," Proc. Summer USENIX meeting, pp. 437-454, Portland, Ore., 1985.

22. Karen Kukich, "Design of a Knowledge-Based Report Generator," Proc. 21st Meeting ACL, pp. 145-50, 1983.

23. E. Voorhees and G. Salton, "Automatic Assignment of Soft Boolean Operators," Proc. SIGIR Conf., pp. 54-69, 1985.

24. L. L. Cherry and N. H. Macdonald, "The Unix Writer's Workbench Software," Byte, vol. 8, no. 10, pp. 241-248, Oct. 1983.

25. G. E. Heidorn, K. Jensen, L. A. Miller, and R. J. Byrd, "The Epistle Text-Critiquing System," IBM Systems J., vol. 21, no. 3, pp. 305-326, 1982.

26. M. O. Harris, Howto: An Amateur System for Program Counseling, 1983. Private communication.

27. J. R. Cowie, "Automatic Analysis of Descriptive Texts," Conf. on Applied Natural Language Processing, pp. 117-123, Santa Monica, Cal., Feb. 1-3, 1983.

28. M. S. Tuttle, D. D. Sherertz, M. S. Blois, and S. Nelson, "Expertness from Structured Text? Reconsider: A Diagnostic Prompting System," Conf. on Applied Natural Language Processing, pp. 124-131, Santa Monica, Cal., Feb. 1-3, 1983.

29. Patricia Wright, "Manual Dexterity: a user-oriented approach to creating computer documentation," Human Factors in Computing Systems: Proc. CHI '83 Conference, pp. 11-18, 1983.

30. T. G. Sticht, "Comprehending Reading at Work," in Cognitive Processes in Comprehension, ed. M. A. Just and P. A. Carpenter, Lawrence Erlbaum, 1977.
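
Note on the cosine correlations discussed in the SIGIR Forum query earlier in this issue: the measure is the inner product of the query and document term-weight vectors divided by the product of their lengths. The following is only a minimal sketch, assuming sparse vectors stored as term-to-weight dictionaries; the function name and the sample terms and weights are hypothetical, invented for illustration, and are not data from any test collection. It shows how a short query scored against a short, title-like description can land above 0.8, while the same query against a longer description scores lower.

```python
import math


def cosine(query, doc):
    """Cosine correlation between two sparse term-weight vectors (dicts)."""
    # Inner product over the terms the two vectors share.
    dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # An empty vector is treated as matching nothing.
    return dot / (norm_q * norm_d)


# Hypothetical unweighted (all weights 1.0) query and descriptions.
query = {"cosine": 1.0, "retrieval": 1.0}
title = {"cosine": 1.0, "retrieval": 1.0, "experiments": 1.0}
long_doc = {
    "cosine": 1.0, "retrieval": 1.0, "experiments": 1.0, "weights": 1.0,
    "queries": 1.0, "documents": 1.0, "ranking": 1.0, "relevance": 1.0,
}

for name, vec in (("title", title), ("long_doc", long_doc)):
    print(name, round(cosine(query, vec), 3))
```

With these made-up vectors the title-like description scores 2/sqrt(6), about 0.816, and the eight-term description scores exactly 0.5, so the score falls as the number of non-matching index terms grows.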