SIGIR'07

The 30th Annual International ACM SIGIR Conference
23-27 July 2007, Amsterdam

Tutorial 2H - XML Retrieval: Integrated IR-DB Challenges and Solutions

Sihem Amer-Yahia (Yahoo! Research), Ricardo Baeza-Yates (Yahoo! Research), Mariano Consens (Univ. of Toronto), Mounia Lalmas (Queen Mary Univ. of London)

Documents today contain a mixture of structured and unstructured content. One way to format this mixed content is according to the W3C (World Wide Web Consortium) standard for information repositories and exchanges, the eXtensible Markup Language (XML). As a continuously growing amount of XML content is being made available on the Web and in Digital Libraries and Enterprise Environments, numerous approaches are being developed to store and query XML content.

These developments have generated a wealth of issues that are being addressed by the database (DB) and information retrieval (IR) communities. The DB community focuses on developing query languages and efficient evaluation algorithms for accessing highly structured content. The IR community focuses on developing techniques for ranking query results and evaluating their effectiveness for accessing unstructured content. The two communities are now meeting to provide effective and efficient access to XML content.

This tutorial will provide an overview of the different issues and approaches put forward by the IR and DB communities and will in particular survey the DB-IR integration efforts as they focus on the problem of retrieval from XML content. The tutorial will first cover the problem space (basic concepts, requirements) and then describe the solution space (approaches, evaluation).

The tutorial will consist of the following content:

  1. Motivation:
    Introduction providing a historical perspective on DB and IR communities and semi-structured data. This introduction also provides motivating applications, their needs and requirements.
  2. Data model and queries:
    XML basics and standards, introducing the XML model and schema, structured text models and query algebras. Querying XML content, both with respect to content and structure, including user expectations, the nature and form of results.
  3. Effectiveness and efficiency:
    Ranking algorithms, document pre-processing and indexing. We will discuss streaming, summaries, indices, and top-k query processing.
  4. Evaluation:
    In particular as carried out by the INEX initiative, describing document collections, topics, tasks, relevance, and metrics.

Outcome:

By attending this tutorial, attendees will
- learn about application needs and requirements for combined DB and IR approaches to access XML content,
- learn about models and indexes tailored to XML retrieval,
- understand the specific problems of XML retrieval and XML query processing, and
- learn about XML retrieval evaluation, and INEX in particular.

Our main goal is to provide a clear view of the major challenges in XML Retrieval: its problems, its solutions and the pitfalls that should be avoided.

Course material: Handouts of slides, and detailed bibliography.

Sihem Amer-Yahia joined Yahoo! Research in May 2006. Until then, she was a member of Technical Staff at AT&T Labs for 7 years. She has worked on various aspects of XML full-text search. She is a co-editor of the XQuery Full-Text Language Specification and Use Cases published by the W3C Full-Text Task Force. Her current research interest is leveraging structure when querying content. In particular, she is focusing on issues related to processing top-k queries in online shopping and community-aware ranking in online communities.

Ricardo Baeza-Yates is director of Yahoo! Research Barcelona and Yahoo! Research Latin-America in Santiago, Chile. Until 2005 he was an ICREA Professor at Universitat Pompeu Fabra in Barcelona and also a professor at the Computer Science department of the University of Chile and director of the Center for Web Research, which he founded in 2002. His research interests include information retrieval, algorithms, and information visualization. He is co-author of the book Modern Information Retrieval, published in 1999 by Addison-Wesley. He received his PhD in Computer Science from the University of Waterloo, Canada, in 1989.

Mariano P. Consens's research interests are in the areas of Data Management Systems and the Web, with a focus on XML retrieval and autonomic systems. He received his PhD and MSc degrees in Computer Science from the University of Toronto, and his Computer Systems Engineer degree from the Universidad de la Republica, Uruguay. He has been a faculty member in Information Engineering at the MIE Department, University of Toronto, since 2003. Before that, he was research faculty at the School of Computer Science, University of Waterloo, from 1994 to 1999. He has been active in the software industry as a founder and CTO of several start-ups.

Mounia Lalmas received her PhD in Computer Science from the University of Glasgow in 1996. She is currently a Professor of Information Retrieval at the Department of Computer Science at Queen Mary, University of London, which she joined as a lecturer in 1999. Her research focuses on the development and evaluation of intelligent access to interactive heterogeneous and complex information repositories. She is the co-leader of the INEX initiative, with over 50 participating organizations worldwide.