Ricardo Baeza-Yates (Yahoo!)
Raffaele Perego (ISTI - CNR)
Fabrizio Silvestri (ISTI - CNR)
Abstract:
The Web continues to grow and evolve rapidly, changing our daily lives. It can be considered the unique result of the collaborative work of the millions of institutions and people who contribute content to it, as well as the one billion people who use it. This ocean of contributed data holds a huge amount of both explicit and implicit information and knowledge. Web Mining is the task of analyzing this data and extracting information and knowledge for many different purposes. The data comes in three main flavors: content (text, images, etc.), structure (hyperlinks) and usage (navigation information, Web search engine queries, etc.), each calling for different analysis techniques such as text, graph or sequence mining. Each flavor reflects the wisdom of the crowds, which can be used to make the Web better; user-generated tags on Web 2.0 sites are one example.
Because Web servers keep a history of the page requests they receive, all Web applications have stored usage information in logs ever since they first appeared on the Internet. Web search engines can be considered particular Web applications that belong to the more general class of Information Retrieval (IR) systems. Uncertainty about users' intent is present in Web search engines just as in other IR systems. Unlike traditional IR systems, though, Web IR systems can rely on the huge amount of usage information stored in query logs. Query log analysis therefore connects to IR in many different ways. For example, exploiting the knowledge contained in past queries helps to improve the quality (in terms of both effectiveness and efficiency) of a Web search engine.
Furthermore, users query Web search engines to satisfy their information needs. We will review studies analyzing how users interact with search engines, what it means for a query to be answered correctly, and so on. The main objective of this tutorial is to give participants a unified view of the literature on query log analysis. We will introduce the discipline of query log mining by presenting its foundations and by analyzing the basic algorithms and techniques that can be used to extract and exploit useful knowledge from this (potentially) infinite source of information. Some aspects of Web Site Optimization will also be covered.
Bios:
Ricardo Baeza-Yates
Ricardo Baeza-Yates is VP of Yahoo! Research for Europe and Latin America, leading the labs in Barcelona, Spain and Santiago, Chile. Until 2005 he was the director of the Center for Web Research at the Department of Computer Science of the Engineering School of the University of Chile, and ICREA Professor at the Dept. of Technology of Univ. Pompeu Fabra in Barcelona, Spain. He is co-author of the book Modern Information Retrieval, published in 1999 by Addison-Wesley; co-author of the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991; and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992, among more than 150 other publications. He received the Organization of American States award for young researchers in exact sciences (1993) and, with two Brazilian colleagues, obtained the COMPAQ prize for the best Brazilian CS research article (1997). In 2003 he was the first computer scientist to be elected to the Chilean Academy of Sciences. In 2007 he was awarded the Graham Medal for innovation in computing, given by the University of Waterloo to distinguished alumni.
Raffaele Perego
Raffaele Perego is a senior researcher at ISTI, an institute of CNR, the Italian National Research Council, where he leads the High Performance Computing Lab. He received his Laurea degree in Computer Science from the University of Pisa in 1985. His main research interests include efficiency issues in data mining and Web information retrieval, distributed information retrieval, and high performance parallel and distributed computing. He has co-authored more than 70 papers on these topics.
Fabrizio Silvestri
Fabrizio Silvestri is currently a Researcher at ISTI - CNR in Pisa. He received his Ph.D. from the Computer Science Department of the University of Pisa in 2004. His research interests focus mainly on Web Information Retrieval, in particular on efficiency-related problems such as caching, collection partitioning, and distributed IR in general. He serves on the program committees of many of the most important IR conferences, and is an organizer and current steering committee member of the workshop Large Scale and Distributed Systems for Information Retrieval (LSDS-IR). He has more than 40 publications in the field of efficiency in IR. In recent years his main research focus has been query log analysis for enhancing the performance of Web search engines. On the topic of this tutorial, Fabrizio Silvestri has recently written a survey paper for the journal Foundations and Trends in Information Retrieval, and gave a keynote speech at the LA-Web 2008 conference entitled “Past Searches Teach Everything: Including the Future”.