Advances on the Development of Evaluation Measures


Bios


Ben Carterette is an assistant professor of Computer and Information Sciences at the University of Delaware in Newark, Delaware, US. He completed his PhD in Computer Science at the University of Massachusetts Amherst in 2008. His work on information retrieval evaluation has been recognized with several Best Paper Awards at conferences such as SIGIR, ECIR, and ICTIR. With Evangelos Kanoulas, he has been actively involved in coordinating the TREC Million Query Track and the TREC Session Track.

Evangelos Kanoulas is a postdoctoral research scientist at Google, Switzerland. Before that, he was a Marie Curie fellow in the Information School at the University of Sheffield. He received his PhD from Northeastern University, Boston. He has published extensively in the field of information retrieval evaluation at SIGIR, CIKM, and ECIR. He was actively involved in coordinating the TREC Million Query Track (2007-2009) and is one of the coordinators of the TREC Session Track (2010-2012).

Emine Yilmaz is a researcher at Microsoft Research Cambridge. She obtained her PhD from Northeastern University in 2008. Her main interests are information retrieval and applications of information theory, statistics, and machine learning. She has published extensively at major information retrieval venues such as SIGIR, CIKM, and WSDM. She has also organized several workshops on crowdsourcing and served as one of the organizers of the ICTIR conference.


Summary

The effectiveness of a retrieval system, i.e., its ability to retrieve items relevant to the information need of an end user, is one of the most important aspects of retrieval quality. A number of experimental frameworks have been designed in IR to measure retrieval effectiveness. The systems-based approach of computing evaluation measures over a test collection, comprising canned information needs and static relevance judgments, has served IR experimentation well, but it is often criticized for failing to capture what users really want from IR systems.

Today, the availability of query and click logs that record users' interactions with a retrieval system has led to increasing interest in building measures on better models of user needs and of user interactions with an engine. This approach has produced measures that correlate better with the actual utility a system offers to an end user. Such measures have rapidly become established in evaluation forums such as TREC and NTCIR, as well as in industry evaluations; it is therefore important that the approach be well understood by the IR community.

This tutorial focuses on methods of measuring effectiveness, in particular on recent work that more directly models the utility of an engine to its users. We first discuss traditional approaches to effectiveness evaluation based on test collections, then transition to approaches that combine test collections with explicit models of user interaction with search results. In particular, we will discuss: