IRLIST Digest November 1, 1990 Volume VII Number 31 Issue 37 ********************************************************** I. NOTICES A. Meetings announcements/Calls for papers 1. Avignon '91: Expert Systems and Their Applications 11th International Workshop Conference on Second Generation Expert Systems May 27-31, 1991 Avignon, France 2. International Conference on Current Issues in Computational Linguistics June 10-11, 1991 (Tutorials) June 12-14, 1991 (Conference) Universiti Sanis Malaysia, Penang, Malaysia II. QUERIES B. Requests for information 1. Suggestions on detecting identical texts 2. CDs on the holocaust ********************************************************** I. NOTICES I.A.1. Fr: Nancy M. Ide Re: Avignon '91: Expert Systems and Their Applications 11th International Workshop Conference on Second Generation Expert Systems May 27-31, 1991 Avignon, France For the third consecutive year, one of the AVIGNON conferences will be devoted to the study of Second Generation Expert Systems. The term "second generation" expert systems is used to characterize knowledge-based systems able to solve problems by combining different types of reasoning. Such systems often use multiple representations of the problem to develop different problem-solving strategies. The first generation expert systems were largely based on heuristic, associational rules. To overcome their limitations, a new line of research was begun into the use of deeper knowledge, often referred to as "model-based", "causal" or "qualitative" reasoning. Since model-based and heuristic approaches appear to be largely complementary, recent work has begun to combine these two reasoning processes into a single problem-solver. Another thread of research has been aimed at making the problem solving methods used much more explicit and elaborating "task-specific architectures". Research has then been conducted into designing particular problem-solvers by combining multiple generic or primitive task-specific architectures. Second Generation Expert Systems are intended to have those two approches converge. Building systems that make explicit the tasks to be realized, the problem solving methods to be implemented and the associated domain models would appear to be the basic objective of this new field. And because a non-trivial problem can only be solved by bringing a number of different resolution methods and domain models into play, the cooperation and integration of these methods and models is one of the key problems to be met in the building of such systems. Topics ------ The Program Committee is seeking papers on the following themes (list non exhaustive): + combining different reasoning types + architectures integrating heuristic and model-based reasoning; + reasoning with multiple models; + multi-expert, multi-agent cooperation; + cooperation of distributed problem-solvers; + task-specific architectures; + knowledge acquisition, explanation, validation, in second generation expert systems; + the use of qualitative, model-based, causal or temporal reasoning to supplement heuristic reasoning; + integrating qualitative and quantitative reasoning; + applications of these techniques to real-world problems (e.g. diagnosis, design, scheduling). Papers describing applications should outline the strengths as well as the weaknesses of the implemented systems. In particular, examples and analysis of failures will be appreciated in order to delineate the applicability of the methods. Theoretical papers should be clearly related to previous work and should enlighten the advantages and originality of the proposed approach. Submission ---------- Authors should submit 6 copies of their papers before January 7, 1991 to AVIGNON '91 general chairman: Jean-Claude Rault EC2 269-287, rue de la Garenne ; 92000 Nanterre ; France tel: 33 - 1 - 47.80.70.00 ; fax: 33 - 1 - 47.80.66.29 Paper should be a minimum of 2000 words to a maximum of 5000 words (about 10 pages single-spaced). Each submission should contain the following information: title of paper; full name of all authors; complete address of first author (including telephone, fax number and e-mail address if available); abstract of 100-200 words; list of key-words. Each submission will be reviewed by at least three referees. Notifications of acceptance or rejection will be mailed from March 1, 1991. Program Committee ----------------- Jean-Marc David (chairman) Renault ; Service Systemes Experts 860, Quai Stalingrad; Bt J4-D14; 92109 Boulogne Billancourt; France. e-mail: david@renault.uucp ; tel : 33 - 1 - 46.94.54.86 fax : 33 - 1 - 46.94.50.23 Alice Agogino (University of California; Berkeley, USA); Bert Bredeweg (University of Amsterdam; The Netherlands); B. Chandrasekaran (Ohio State University; Columbus, USA); Marie-Odile Cordier (Universite de Rennes; France); Jean-Luc Dormoy (Etudes et Recherches EDF; Clamart, France); Jacques Ferber (Universite Paris 6; France); Massimo Gallanti (CISE; Segrate, Italy); Jean-Paul Krivine (Etudes et Recherches EDF; Clamart, France); Benjamin Kuipers (University of Texas; Austin, USA); Roy Leitch (Heriot-Watt University; Edinburgh, UK); Robert Milne (Intelligent Applications; Livingston, UK); Richard Pelavin (Philips Laboratories; Briarcliff Manor, USA); Olivier Raiman (XEROX Palo Alto Research Centre, USA); Reid Simmons (Carnegie Mellon University; Pittsburgh, USA); Luc Steels (Vrije Universiteit; Brussels, Belgium); Jon Sticklen (Michigan State University; East-Lansing, USA); Pietro Torasso (Universita di Torino; Italy); Louise Trave-Massuyes (LAAS - CNRS; Toulouse, France); Walter van de Velde (Centre of Advanced Studies; Blanes, Spain). ********** I.A.2. Fr: Don Walker Re: International Conference on Current Issues in Computational Linguistics June 10-11, 1991 (tutorials) June 12-14, 1991 (conference) Universiti Sains Malaysia, Penang, Malaysia The above conference will be held in the university campus on the given dates. Please note the change of dates with respect to our previous announcement, which has been made specifically to allow those wishing to attend the 1991 ACL annual meeting to return in time. Topics covered include, but are not limited to: syntax, semantics, discourse, formal models, grammar formalisms, language analysis and generation, understanding and knowledge representation, lexical issues, machine translation, etc., with the main objectives of the conference being: (i) bringing awareness of the state-of-the-art of the various subfields, and (ii) highlighting the most recent developments and implementation. About 45 papers will be presented, including 8 from our invited speakers (who will also conduct the tutorials, as well as form the program committee for the selection of papers), namely: Makato NAGAO (Computational Linguistics - in general), Lauri KARTUNNEN (Morphology), Eva HAJICOVA (Syntax), Petr SGALL (Semantics), Martin KAY (Formal Models), Christian BOITET (Machine Translation), Yorick WILKS (Artificial Intelligence), with the program committee to be chaired by Makato NAGAO. Those wishing to submit papers are reminded that 4 copies (of length not exceeding 15 pages) are to be sent to the Program Chairman strictly before 1 December 1990, the address being: Prof. Makato Nagao, Dept. of Electrical Engineering, Kyoto University, Yoshida, Sakyo, Kyoto 606, JAPAN [FAX: +81-75-751-1576]. Notification of acceptance will be sent out by 15 March 1991, together with the formatting instructions for the final copies. Camera ready copies of accepted papers will have to be in Penang by 15 April 1991. The final announcement containing detailed information as well as the registration form should be going out to selected addresses by the middle of October 1990. All those who are interested but have not indicated their interest to attend the conference/tutorials are invited to write to the Secretariat: Josephine Ong, Pusat Pengajian Luar Kampus, Universiti Sains Malaysia, 11800 Penang, Malaysia [FAX: 60-4-871526, Telex: MA 40254]. Academic matters are handled by Zaharin Yusoff, Projek Terjemahan Melalui Komputer, PPS Matematik & S Komputer, Universiti Sains Malaysia, 11800 Penang, Malaysia [FAX: 60-4-871526, Telex: MA 40254, Tel: 60-4-874125]. Registration fees are set at M$300 for the conference and M$200 for the tutorials, with a 50% discount given to students. The approximate exchange rate is US$1 = M$2.70 or M$1 = US$0.37. The fees include conference/tutorial material, coffee, tea, lunch, conference dinner, and transport to and from assigned hotels. Registration forms and full remittance are due in by 15 April 1991. ********************************************************** II. NOTICES II.B.1. Fr: Dave Lewis Re: Suggestions on detecting identical texts I posted a query several weeks ago inquiring about detecting duplicate or near-duplicate documents in a large corpus. This message contains a summary of the suggestions I got. Since I forgot to warn people in advance that I wanted to post a summary of their advice for the list, I reproduce in this message summaries of the suggestions without direct attribution. Many thanks to John Guidi, Fred Damerau, Ed O'Neal, IJsbrand Jan Aalbersberg, and Stavros Macrakis for their contributions. Any of them who would like to step forward to claim credit for their part of the following are welcome---all the good ideas below are theirs! Dave Lewis lewis@cs.umass.edu -------------- A. Exact Duplicates When only exact duplicates are suspected, the use of some relatively short, readily defined field (such as the title) seems to be the best approach. A strategy that's been used with large collections of bibliographic records is the following: 1. Create a file with a record of the following form: <id of original record> corresponding to each original record. The key is found by throwing away leading function words from the title, then taking the first 3 characters from the first title word, the first 2 characters from the second and third title words, and the first character from the fourth title word. 2. Sort the above file on the title field. 3. Scan through the sorted file looking for identical titles, which must be adjacent. The check for identity can done first on the 8 character key field, and only if two records match on that field do they need to be compared further. When duplicate titles are found, the person responsible for the database can decide which of the original records to retain. B. Near Duplicates Detecting "near duplicates" (slightly modified versions of a document, true duplicates corrupted by typos, etc.) is harder. The best suggestion I got was this one, which I reproduce with some minor editing: "I have been using an algorithm involving computation of a bit vector for this purpose, on a file of UPI newswire stories (many repeats and near duplicates). It works pretty well, but requires diddling with parameters to get a decent compromise between excessive rejection and keeping too many. I don't actually enter all the words of a document into a bit vector, but throw away the 2000 most frequent words. In the UPI, this tends to leave things like names as well as the truly rare words, which tend to distinguish stories. In a more general text collection, maybe you would only want to discard the 1000 most frequent. The idea is to avoid saturating the bit vector." C. Inexact String/Pattern Matching Several people suggested the use of various methods of inexact matching of patterns or strings. These potentially have a role to play, but such techniques would seem to need to be used in combination with some method of limiting the number of pairs of documents to compare. Some papers (there are many, many others) on these matching techniques are: %A Ehrenfeucht, A. %A Haussler, D. %T A new distance metric on strings computable in linear time %J Discrete Applied Matehematics %V 20 %D 1988 %P 191 - 203 Forgy, CL. Rete: A fast algorithm for the many pattern/many object pattern match problem.Artificial Intelligence 19(1982) 17-37. Wagner RA, Fischer MJ. The string-to-string correction problem. JACM Vol 21 No 1 Jan 1974 168-173. Davidson L. Retrieval of misspelled names in an airlines passenger record system. CACM 5,3 (March 1962) 169-171. Hall PAV, Dowling GR. Approximate string matching. Comp Surveys Vol 12 No 4 (Dec 1980)381-402. Pollock JJ. Spelling error detection and correction by computer: Some notes and a bibliography. J of Documentation Vol 38 No 4 (Dec 1982) 282-291. Fu KS. Error correcting parsing for syntactic pattern recognition. In: Data structures, computer graphics and pattern recognition. Editors: A Klinger, Fu KS, Kunii TL. Academic Press 1977. Landau GM, Vishkin U. Fast parallel and serial approximate string matching. J of Algorithms 10, 157-169 (1989). McClelland EB, Trueblood RP, Eastman CM. Two approximate opertators for a data base query language: Sounds_like and Close_To. IEEE Trans on Sys Man and Cyber. Vol 18 No 6 (Nov/Dec 1988). Raiha L. Approximate sequence comparison: A study with histograms. Pattern Recognition Vol 12 No 1/2. 159-169 (1990). Baeza-Yates RA, Gonnet GH. A new approach to text searching. In: Proceedings of the 12th annual int'l AMCSIGIR conference on research and development in information retrieval. Ed: Belkin NJ, van Rijsbergen CJ. ACM. June 1989.168-175. D. Unix Tools Finally, here's a mildly edited version of some thoughts one person sent on using UNIX tools in duplicate detection: "If near-identical means modulo punctuation and line breaks, you could hash the word sequence of each file (in Unix: cat input | tr -cs A-Za-z '\012' | hash ; you'll have to write `hash' yourself, because there is no standard hashing tool) and then find hashes that appear more than once: sort | uniq -c | awk '$1 > 1' Then you just have to look up the files whose hashes come out. The problem with this approach is that a single character out of place makes two files different (you'll presumably want to strip out any headers or trailers if there are any). Another approach which ought to work (but in fact will not, for reasons I'll explain below) is to sample each text in several different places -- say four sequences of 20 words each. Use these sequences to search all the files using Aho-Corasick (the algorithm used by fgrep). Unfortunately, this will not work for several reasons: 1) there is no way to handle multi-line strings in fgrep 2) if you try normalizing by getting rid of punctuation and LFs, you have to choose line breaks since fgrep requires lines to have a (small) maximum length. 3) fgrep has a (smallish) limit on the number of different patterns it will search for. So, this looks pretty bad. " ********** II.B.2. Fr: Avigail Oren (The Computer in Education Research Lab) <C37@taunos> Re: CDs on the holocaust Hi, I'm looking for any information about a CD concerning the holocaust. Thanks in advance, Avigail Oren ********************************************************** IRLIST Digest is distributed from the University of California, Division of Library Automation, 300 Lakeside Drive, Oakland, CA. 94612-3550. Send subscription requests to: LISTSERV@UCCVMA.BITNET Send submissions to IRLIST to: IR-L@UCCVMA.BITNET Editorial Staff: Clifford Lynch lynch@postgres.berkeley.edu calur@uccmvsa.bitnet Mary Engle engle@cmsa.berkeley.edu meeur@uccmvsa.bitnet Nancy Gusack ncgur@uccmvsa.bitnet The IRLIST Archives will be set up for anonymous FTP, and the address will be announced in future issues. These files are not to be sold or used for commercial purposes. Contact Mary Engle or Nancy Gusack for more information on IRLIST. The opinions expressed in IRLIST do not represent those of the editors or the University of California. Authors assume full responsibility for the contents of their submissions to IRLIST.