SIGIR'07

The 30th Annual International ACM SIGIR Conference
23-27 July 2007, Amsterdam

Tutorial 2D - The Probabilistic Relevance Model: BM25 and beyond.

Biographies

Stephen Robertson runs the Information Retrieval and Analysis group at Microsoft Research Cambridge (UK), which is concerned with core search processes such as term weighting, document scoring and ranking algorithms, and combination of evidence from different sources. The group works closely with product groups to transfer ideas and techniques. His main research interests are in the design and evaluation of retrieval systems. He is the author, jointly with Karen Sparck Jones, of a probabilistic theory of information retrieval, which has been moderately influential. A further development of that model, with Stephen Walker, led to the term weighting and document ranking function known as Okapi BM25, which is used in many experimental text retrieval systems. Prior to joining Microsoft, he was at City University London, where he retains a part-time position as Professor of Information Systems in the Department of Information Science. He was awarded the Tony Kent STRIX award by the Institute of Information Scientists in 1998. In 2000, he was awarded the Salton Award by ACM SIGIR. He is a Fellow of Girton College, Cambridge.

Hugo Zaragoza is a researcher working on Information Retrieval at Yahoo! Research Barcelona. He is interested in the applications of machine learning (ML) and natural language processing (NLP) for information retrieval (IR). More specifically, he is interested in developing measures of relevance (i.e. ranking functions) between linguistic objects such as search queries and web documents. From 2001 to 2006, Hugo worked at Microsoft Research (Cambridge, UK) with Stephen Robertson, mostly on probabilistic ranking methods for corporate and web search, but also on document classification, expert finding, relevance feedback, and dialogue generation for games. He also collaborated extensively with Microsoft product groups such as MSN-Search and SharePoint Portal Server. Prior to Microsoft Research, Hugo taught computer science and completed a Ph.D. at the LIP6 (U. Paris 6) on the application of dynamic probabilistic models to a wide range of Information Access problems.

Tutorial Description

Introduction

The Probabilistic Relevance Model (PRM) is a formal framework for document retrieval developed in the 1970s and 1980s which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRM has led to new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). Again, this has led to some of the most successful web-search and corporate-search algorithms, such as BM25F. However, this has come at a cost: the explosion in the number of free parameters has put optimisation and machine learning at the core of the development of ranking functions in the PRM model.

One of the reasons for the success of the PRM, we believe, is the powerful combination of sound theoretical modelling and pragmatic parameter tuning that lies at its core. When it is well understood, the PRM can provide solid ground on which to analyse new IR problems and to derive new solutions beyond BM25. However, there are very few in-depth descriptions of the PRM framework in the literature. This leads many researchers to treat BM25 as a black box, and to develop their research on the outside of this box.

This tutorial will present the Probabilistic Relevance Model from a conceptual point of view, describing in depth the probabilistic modelling assumptions behind the model and the ranking algorithms that result from its application. Furthermore, it will explain recent developments in PRMs leading to ranking functions which integrate document structure, document meta-data and link-graph information. Finally, we will discuss the different free parameters of the ranking functions and how to set or optimise them.

The tutorial will be divided into two parts lasting roughly 90 minutes each:

I. Theory:

In the first part of this tutorial we will describe the intuitions behind the PRM and give its full derivation. We will then discuss how different modelling assumptions can be incorporated and how they lead to new ranking functions. In particular we will develop a series of document retrieval algorithms that can be derived from the PRM, from the binary independence model (BIM) to BM25. Free parameters will be introduced and discussed within the probabilistic inference and optimisation frameworks.

Topics:

  • Probabilistic Relevance Model
  • BIM & idf
  • Eliteness & BM25 [Robertson et al. 1981][Robertson & Walker 1994]
  • Structure & BM25F [Robertson et al. 2004][Zaragoza et al. 2004]
  • Divergence From Randomness [Amati & Van Rijsbergen 2002]
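To make the derivation concrete, here is a minimal sketch of the BM25 ranking function as it is usually stated, combining an idf component with a saturating, length-normalised term-frequency component. The corpus statistics below are illustrative, and k1 = 1.2 and b = 0.75 are commonly used defaults rather than values prescribed by the tutorial:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75):
    """Score one document against a query with the BM25 ranking function.

    doc_tf:  {term: term frequency in the document}
    df:      {term: document frequency in the collection}
    N:       number of documents in the collection
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        # Robertson / Sparck Jones idf component
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        # Saturating term frequency with document-length normalisation
        norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm_tf
    return score
```

Note how the k1 parameter controls how quickly the term-frequency contribution saturates, and b controls the strength of length normalisation; these are exactly the kinds of free parameters whose tuning the second part of the tutorial addresses.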

II. Practice:

In the second part of this tutorial, we will discuss models that take the PRM model beyond its original scope. In recent years there have been a number of important extensions to the original PRM dealing with document and hyperlink structure. In particular, the BM25F model was developed to retrieve structured documents with heterogeneous fields (such as title and anchor text). Furthermore, it has been shown that hyperlink document features as well as other non-textual document features can be naturally added to the PRM. Finally, all these extensions come at the cost of new free parameters in the model, which need to be set. Fortunately, recent developments in machine learning can be applied successfully to this tuning problem.

Topics:

  • Anchor Text [Zaragoza et al. 2004]
  • Metadata (Time, PageRank, etc.) [Craswell et al. 2005]
  • Tuning (Greedy, RankNet, etc.) [Taylor et al. 2006]
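The field-combination idea behind BM25F can be sketched as follows: per-field term frequencies are length-normalised, weighted, and summed into a single pseudo-frequency, and the saturating function is applied once after the combination rather than once per field. Field names, weights, and statistics here are illustrative assumptions, not values from the tutorial:

```python
import math

def bm25f_score(query_terms, fields, field_weights, b_field,
                field_lens, avg_field_lens, df, N, k1=1.2):
    """Score a structured document with a BM25F-style ranking function.

    fields: {field_name: {term: tf in that field}}, e.g. title, body, anchor text
    """
    score = 0.0
    for term in query_terms:
        # Combine per-field tfs into one weighted, length-normalised pseudo-frequency
        tf_combined = 0.0
        for f, tfs in fields.items():
            tf = tfs.get(term, 0)
            norm = 1 - b_field[f] + b_field[f] * field_lens[f] / avg_field_lens[f]
            tf_combined += field_weights[f] * tf / norm
        if tf_combined == 0 or term not in df:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        # Single saturation after field combination: the key difference from
        # naively summing independent per-field BM25 scores
        score += idf * tf_combined / (k1 + tf_combined)
    return score
```

With a higher weight on the title field, a title match contributes more than the same match in the body, which is the intended effect; each extra field adds a weight and a b parameter, illustrating why tuning methods become central.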

Course Materials

The main course material will be a 20-page technical report on the PRM and BM25F, with accompanying slides, to be published in June 2007. This will be compiled together with the rest of the papers in the bibliography below.

Bibliography

  • Gianni Amati and Cornelis Joost Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4):357-389, 2002.
  • Nick Craswell, Stephen Robertson, Hugo Zaragoza, and Michael Taylor. Relevance weighting for query independent evidence. In Proceedings of ACM SIGIR'2005, Salvador, Brazil, 2005.
  • S. E. Robertson, C. J. van Rijsbergen, and M. F. Porter. Probabilistic models of indexing and searching. In R.N. Oddy, S.E. Robertson, C.J. van Rijsbergen, and P.W. Williams, editors, Information Retrieval Research, pages 35-56, London, 1981. Butterworths.
  • S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In W B Croft and C J van Rijsbergen, editors, SIGIR '94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 345-354. Springer-Verlag, 1994.
  • S. E. Robertson, Hugo Zaragoza, and Michael Taylor. Simple BM25 extension to multiple weighted fields. In Thirteenth Conference on Information and Knowledge Management (CIKM), 2004.
  • Michael Taylor, Hugo Zaragoza, Nick Craswell, Stephen Robertson, and Chris Burges. Optimisation methods for ranking functions with multiple parameters. In Fifteenth Conference on Information and Knowledge Management (ACM CIKM), 2006.
  • Hugo Zaragoza, Nick Craswell, Michael Taylor, S. Saria, and S. E. Robertson. Microsoft Cambridge at TREC 13: Web and hard tracks. In Text REtrieval Conference (TREC-13), 2004.