The world of recommender systems has undergone quite an expansion since Communications of the ACM published its feature issue on the topic two years ago. Projects such as GroupLens have gone on to become successful commercial ventures, and recommender systems are now de rigueur for Internet commerce. The basic technology has moved very quickly from the research world to popular applications.
Several methods for implementing recommender systems have emerged, including approaches that base recommendations on correlations of groups of users and methods that learn about individual users. However, the architectural issues of cold-start, sparse ratings, and scalability continue to dominate the field.
The state of the art in recommender systems will be advanced by the development of sound evaluation methodologies. User studies are difficult to conduct and to generalize from, and issues of presentation and relevance make traditional IR evaluation measures not entirely suited to the domain. Furthermore, test collections such as DEC SRC's EachMovie data set are becoming standard tools, but the need for larger collections in different domains is great.
Thus, the goal of the workshop was to discuss moving to the next phase of recommender systems research, from the basic "how do we do it" to "how can we do it better", and "how do we know that it's better".
SIGIR was a good forum for such a workshop. The recommender systems community has existed at the crossroads of information retrieval, machine learning, computer-human interfaces, digital libraries, and the World-Wide Web. The information retrieval community has long wrestled with questions of experiment and evaluation, and we hoped that for our particular goals we would benefit from the SIGIR environment.
The basic format of the workshop was three technical sessions of talks on submitted papers, bracketed by a keynote at the start of the day and an animated panel discussion at the end.
Our keynote speaker was Joseph Konstan of the University of Minnesota, who is well-known from his involvement in the GroupLens project. GroupLens produced some of the earliest work in collaborative recommendation, and much of the underlying algorithmic approach used throughout the field comes from it. GroupLens itself has grown into a successful commercial company, NetPerceptions, whose software is used in several popular Internet storefronts.
Dr. Konstan first gave a brief history of recommender systems, from their beginnings around 1992. He observed that in the "early days", the expectations were fairly low by today's standards: 10,000 users and 100 predictions per second was good performance, and "better than random" was effective. This received a lot of laughs, but Konstan emphasized that people used it and they seemed to come back. Not much of an evaluation measure from the point of view of researchers, but for a web portal this might really be the only measure that matters.
At the 1994 workshop in Berkeley, several big questions were identified: tackling scalability, sparse ratings, handling implicit ratings, adding content to collaboration, and whether the idea was even economically viable. That last question didn't take long to answer, as many groups quickly went commercial. NetPerceptions attacked the scalability problem by hiring Cray employees when Cray was bought by SGI. Morita and Shinoda's 1992 paper on implicit ratings had a lot of skeptics, but now buying histories drive web recommendation. More advanced use of implicit ratings, as well as integration of content, are still open questions.
In 1997, the CACM article gave us widespread use of the label of "Recommender Systems." Konstan remarked that this reflects a change of focus from the algorithm to the interface; now, of course, a recommender system might not involve any collaborative aspects at all.
The summer of 1997 also brought us the first community corpus, DEC's EachMovie. EachMovie has increased the number of CF experiments being done, and it means that a research group doesn't necessarily need a working system with many users to have enough data to work with. In general the field has experienced a vast "mushrooming" in the last year and a half, with workshops and papers in the IR, AI/ML, CHI, CSCW, and Agents communities. Growth is good, but it is now much harder for anyone to know everything that is being done.
While commercial adoption has been high, it's hard to gauge what this will mean in the research community. All the studies we have so far indicate that people like believing that they are getting personalized recommendations. However, it might not even be worthwhile to work hard to make good recommendations or even to do real personalization: CDNow at one point tried sending random recommendations, and found that people liked what they got anyway.
However, Konstan's sense was that there is a lot of opportunity now for the community. Information overload is a huge problem, and personalization is a very hot topic. There is a substantial interest in leveraging human expertise and knowledge. Finally, the core technology is mature, so the interesting questions are broadening.
To Konstan, these interesting questions are (a) dealing with massive scale and sparsity, (b) dealing with "recurring startup" problems, such as a newspaper that generates new, unrated articles every day, (c) actually making users more effective, that is, did the system improve users' decision-making process, and (d) building and leveraging recommendation communities that users identify and interact with.
Konstan also identified several obstacles standing in the way of these opportunities. One is that it is hard to do interesting recommender systems research: user experiments require users, and only large groups have the resources to construct systems which can attract enough users to get useful data. But a larger problem is that recommender systems science is fairly disorganized. Systems are mostly incomparable, metrics and data collections aren't standard, and there is no infrastructure for controlled experiments across research groups. There are widespread anecdotal feelings about what should work, but no well-defined "best practices". It has to be possible to do both "off-line" research on algorithms, and "on-line" research with user interfaces.
Given these opportunities and obstacles, Konstan presented a five-point solution.
Konstan said that GroupLens can offer some of these things in the near future, such as some data sets, a small engine, and statistical tools. They also have a lot of experience running experimental, on-line sites, and can help with network hosting and seeking funding for management and workshops.
For the workshop, Konstan said that the focus should be on brainstorming some of these ideas and consolidating the list of available resources. He also said that there is critical mass for a global standalone workshop. In the medium term, the area needs a steering committee and sustained funding. In the long term, the community should maintain and document resources, hold user workshops, and support more research.
David Evans asked about the relationship of recommender systems to data mining and statistical modeling. Konstan said that recommender systems are real-time, on-line, model-free associations among people, and are easily explainable. Evans pointed out that the current practice is not model-free, and that marketing people have the same goals. Konstan replied that data mining is presently very analyst-intensive.
Hinrich Schuetze asked why there is a need to move away from movies. Konstan said that the right domains are human-created media with a target audience, where you expect tastes to cluster people. With movies, community knowledge is high but user investment is low. Can we work with mutual funds? Eric Glover also pointed out that movies have relatively stable value, and that other artifacts have issues of temporal decay and portfolio effects.
Another question concerned the value of exposing the rules and enhancing explainability. Konstan pointed to Jon Herlocker's forthcoming dissertation. There are different ways of explaining the reason for what is recommended: which of your ratings were most pivotal, the system's success rate, or a simple reverse engineering of the computation ("You liked these other movies...").
Our first session consisted of three papers on improving the standard correlation-based prediction algorithms. Joaquin Delgado first presented a prediction algorithm which combines the correlation prediction with a weighted-majority voting scheme. In his paper, he was able to prove a bound for prediction errors in his algorithm, based on the size of the pool of predictors and the error of the best voting predictor. He used a subset of EachMovie, and simulated a chronological ordering. He reported precision and recall results, and compared them to those of Billsus and Pazzani. Questions included whether the technique can capture users' changes in taste as they occur, and whether it could be used to weight communities rather than individual users. Doug Oard pointed out that the metric was really relative recall.
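The weighted-majority idea can be made concrete with a small sketch. This is not Delgado's actual algorithm, just the classic weighted-majority scheme it builds on: each predictor in the pool votes like/dislike, the pool predicts by weighted vote, and predictors that were wrong have their weight multiplied by a penalty factor (the class name, the beta value, and the vote encoding are all illustrative assumptions).

```python
import numpy as np

class WeightedMajorityPool:
    """Combine binary like/dislike predictors; a predictor's weight is
    multiplied by beta (here 0.5, an illustrative choice) when it errs."""
    def __init__(self, n_predictors, beta=0.5):
        self.weights = np.ones(n_predictors)
        self.beta = beta

    def predict(self, votes):
        # votes: one +1 (like) / -1 (dislike) per predictor
        return 1 if np.dot(self.weights, votes) >= 0 else -1

    def update(self, votes, truth):
        # Penalize every predictor that voted against the observed rating.
        self.weights[np.asarray(votes) != truth] *= self.beta

pool = WeightedMajorityPool(3)
pool.update([1, 1, -1], truth=1)   # the third predictor was wrong
print(pool.weights)                # -> [1.  1.  0.5]
```

Bounds of the kind Delgado proved are characteristic of this family of algorithms: the pool's mistakes are bounded in terms of the pool size and the mistakes of the best single predictor.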
Second, Ken Goldberg presented a technique to reduce the time needed to compute predictions. Rather than computing a full correlation matrix at the time of prediction, they compute the principal components of the ratings matrix off-line, and use these to compute predictions quickly. Their test system, Jester, recommended jokes and was also a featured SIGIR demo. Jester has 30,000 registered users and around 1,000,000 ratings, but a small pool of jokes. Their evaluation focused on high-variance jokes.
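A minimal sketch of the off-line principal-components idea, not Goldberg's actual system: factor a dense ratings matrix once with SVD, place every user at a point in a low-dimensional "taste space", and answer a new prediction request by projecting the new user and borrowing the ratings of the nearest existing user (the matrix sizes, the choice of two components, and the nearest-neighbor step are all assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.uniform(-10, 10, size=(200, 10))  # 200 users x 10 gauge items

# Offline, expensive part: center and factor the ratings matrix once.
mean = ratings.mean(axis=0)
centered = ratings - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:2]                  # top-2 principal directions, (2, 10)
coords = centered @ components.T     # each user as a point in 2-D

def predict_for(new_user_ratings):
    """Online, cheap part: project the new user into the 2-D taste space
    and predict from the nearest existing user's full rating vector."""
    point = (new_user_ratings - mean) @ components.T
    nearest = np.argmin(np.linalg.norm(coords - point, axis=1))
    return ratings[nearest]
```

The payoff is that the online step works in two dimensions instead of the full item space, which is what makes per-request prediction cheap.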
Jon Herlocker presented work-in-progress on applying clustering and partitioning algorithms to the ratings matrix, and computing predictions within the clusters. Graph partitioning algorithms had a bias towards equal-sized partitions, while average link and ROCK produced a lot of single-item clusters. While there are scalability gains, the results were mixed on prediction accuracy. For evaluation, they used data from the MovieLens project. The approach might be suitable for parallelizing the problem without harming accuracy and coverage too much.
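The cluster-then-predict idea can be sketched as follows; this is an assumed simplification, not Herlocker's implementation (he used graph partitioning, average link, and ROCK, whereas this uses a bare-bones k-means for brevity). Users are partitioned by their rating vectors, and a prediction for a user consults only that user's cluster-mates.

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.integers(1, 6, size=(120, 15)).astype(float)  # users x items, 1..5

def kmeans(X, k=4, iters=20):
    """Bare-bones k-means: assign points to nearest center, recompute."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(R)

def predict(user, item):
    # Consult only the user's cluster-mates, not the whole user base.
    mates = labels == labels[user]
    return R[mates, item].mean()
```

Restricting each prediction to one cluster is what yields the scalability gain the talk described, and also why accuracy can suffer: informative neighbors who land in other clusters are simply never consulted.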
In the second session, three authors presented techniques for integrating document content with collaborative information. Mark Claypool presented his group's work in generating an online newspaper, which uses a weighted average of content and collaborative predictions. The weight is tunable on a per-user basis. Content-based prediction was initially more accurate, but the collaborative component eventually overtook it. It was asked whether the misses might be useful for topic detection.
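The per-user blend is simple enough to state directly. This sketch is mine, not Claypool's code: the prediction is a convex combination of the two sources, and the weight drifts toward whichever source was more accurate on recent items, which reproduces the reported behavior (content dominates for a new user, collaboration takes over as ratings accumulate).

```python
def blended_prediction(collab_pred, content_pred, w):
    """Convex combination: w is the trust placed in collaboration."""
    assert 0.0 <= w <= 1.0
    return w * collab_pred + (1.0 - w) * content_pred

def update_weight(w, collab_error, content_error, step=0.05):
    """Shift the weight toward whichever predictor was closer last time
    (the step size is an illustrative assumption)."""
    if collab_error < content_error:
        return min(1.0, w + step)
    if content_error < collab_error:
        return max(0.0, w - step)
    return w

w = 0.2                                  # new user: trust content more
print(blended_prediction(4.0, 2.0, w))   # -> 2.4
```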
Michelle Condliff presented a Bayesian model which integrates content and collaborative information. The model fits a naive Bayes classifier for each user based on item features, and then combines the classifiers using a regression model to approximate maximum covariance. The model also used a set of user covariates: age, gender, and geographic region (zip codes, which didn't work very well for university undergraduates), integrated into an information matrix with document features and ratings. She evaluated the model using a small example dataset and EachMovie.
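The per-user naive Bayes component can be sketched in isolation (the regression combination and the covariates are omitted here, and the Laplace smoothing is an assumption of mine): each user's like/dislike ratings over binary item features train a classifier, and the classifier's log-odds score says how strongly an unseen item matches that user's tastes.

```python
import math

def train_nb(features, labels):
    """features: list of binary feature tuples; labels: list of 0/1
    (dislike/like). Returns smoothed priors and per-feature conditionals."""
    n = len(labels)
    pos = sum(labels)
    prior = {1: (pos + 1) / (n + 2), 0: (n - pos + 1) / (n + 2)}
    cond = {}
    for j in range(len(features[0])):
        for c in (0, 1):
            rows = [f[j] for f, y in zip(features, labels) if y == c]
            cond[(j, c)] = (sum(rows) + 1) / (len(rows) + 2)
    return prior, cond

def log_odds_like(model, x):
    """Positive score: the item looks more like this user's liked items."""
    prior, cond = model
    score = math.log(prior[1] / prior[0])
    for j, v in enumerate(x):
        p1 = cond[(j, 1)] if v else 1 - cond[(j, 1)]
        p0 = cond[(j, 0)] if v else 1 - cond[(j, 0)]
        score += math.log(p1 / p0)
    return score

model = train_nb([(1, 0), (1, 1), (0, 1), (0, 0)], [1, 1, 0, 0])
```

In the full model, one such score per user feeds the regression stage, which is where the collaborative information enters.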
Finally, Raymond Mooney presented a book recommender which used text categorization methods on information extracted from Amazon.com book descriptions and reviews. The content information was modeled as a semistructured bag of words. Mooney's questions were whether it is possible to use content-based learning on the output of collaborative recommendations, and whether it is useful to let users control the training samples, avoiding the "rate these 100 sample items" approach.
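One way to read "semistructured bag of words" is that each document slot (title, description, reviews, and so on) keeps its own bag, so the same word counts differently depending on where it appears. The slot names and tokenization below are illustrative assumptions, not Mooney's representation verbatim.

```python
from collections import Counter

def slot_bag(document):
    """Build a single counter whose keys are slot-prefixed tokens, so
    'epic' in a review and 'epic' in a title remain distinct features."""
    bag = Counter()
    for slot, text in document.items():
        for token in text.lower().split():
            bag[f"{slot}:{token}"] += 1
    return bag

book = {"title": "Dune", "reviews": "epic epic saga"}
print(slot_bag(book)["reviews:epic"])  # -> 2
```

A standard text classifier can then be trained directly on these prefixed features, which is what lets the slot structure survive inside an otherwise ordinary bag-of-words method.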
The last session featured two papers with rather different subjects. In the first presentation, David McDonald discussed ongoing work in building recommender systems to support "expertise networks" in organizations. In the particular project he discussed, they conducted a field study of how people go about finding an expert in some domain in their organization. Based on this study, a recommender system was designed to support this process electronically.
Lastly, Eric Glover presented a World-Wide Web metasearch engine which models categories of user search needs. These prototypes are translated into engine-specific search queries, and are also used to filter and organize the results from each engine.
Each author was asked to address in their presentation how they evaluated their system, and what evaluation approaches were available to them.
Several researchers, both those enhancing traditional collaborative filtering and those exploring combinations of collaboration and content, were able to test their systems using the EachMovie database from Digital. It was generally agreed that this helps broaden our understanding of the dataset and its limitations. Many projects use prediction error as a metric, but there was not much discussion of metrics and their pros and cons.
Additionally, there are many domains which we think of as recommendation but which are fairly distinct. For example, the problem of recommending experts, both in McDonald's work and in Henry Kautz's classic ReferralWeb project, has different requirements and data than a collaborative filtering system. Glover's work on metasearch engines is perhaps illustrative of one way of overcoming privacy issues: asking users to assume a prototype need rather than provide personal information.
At the close of the workshop, a panel discussion was held to try to address some of these overarching issues. The panelists were Ali Kamal of TiVo, Inc., Clifford Lynch of the Coalition for Networked Information, and presenters Joseph Konstan and Raymond Mooney. Kamal brought some industry experience in recommending television programs, and Lynch illustrated several important social issues in collaborative systems, such as privacy and reliability. The panel was started off with a kind of open-ended, get-your-cards-on-the-table question, but the workshop participants quickly threw in a wide variety of interesting issues.