After adventuring my way up the Eastern Seaboard, I am finally settled into my swanky pad outside of Boston and beginning my summer with Project Aura. So far everyone seems quite friendly and they seem, by and large, to have their souls intact, so I’m hopeful that I’ll emerge from the summer not too deeply scarred from being an agent of the man.
I’m starting off with building a simple recommender, a collaborative filter, on top of the distributed datastore that they have been developing. Before delving into collaborative filters directly though, I wanted to consider some of the simpler forms of recommendations.
Perhaps the simplest recommendation is one that takes no preference into account at all. This is what I get when I put Amarok on random and set it playing. Not great on accuracy, but pretty much impossible to beat on speed.
A step up from that is a recommender system that incorporates preferences, but aggregates them — voting. Voting makes a lot of sense when the aggregate choice is going to be applied to everyone, say with a political candidate. It makes less sense when the choice is only for an individual. It is still used though: when you go to a website and it tells you the ten most popular products, those have effectively been chosen by a vote.
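As a rough sketch of what a "most popular" list amounts to, here's a popularity recommender over a made-up purchase log (all the names and products are hypothetical, and a real site would count over millions of rows, not seven):

```python
from collections import Counter

# Hypothetical purchase log: each entry is (user, product).
purchases = [
    ("ann", "kettle"), ("bob", "kettle"), ("cat", "teapot"),
    ("ann", "mug"), ("bob", "mug"), ("cat", "mug"),
]

# "Most popular" is just a vote count over all users -- the same
# list gets shown to everyone, so there's no personalization at all.
votes = Counter(product for _user, product in purchases)
top_two = [product for product, _count in votes.most_common(2)]
print(top_two)  # ['mug', 'kettle']
```

It's hard to beat on speed, which is part of why the pattern is so common despite being a group recommendation dressed up as a personal one.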
Not only are vote-based systems not personalized, they are also subject to some pretty serious mathematical problems. I’ll not go into them other than to mention that one-vote plurality is just about the worst possible method if you’re concerned about picking the choice with the highest average preference. Similar choices will cloud the genuine group preference, as Ralph Nader and Ross Perot helped to demonstrate.
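To make the vote-splitting problem concrete, here's a toy example with entirely made-up numbers: two near-identical options split their supporters' first-place votes, and plurality ends up picking the option with the lowest average preference.

```python
# Hypothetical scores each of seven voters assigns to three options.
# A1 and A2 are near-duplicates, so their supporters get split.
scores = {
    "A1": [5, 5, 4, 4, 2, 2, 2],
    "A2": [4, 4, 5, 5, 1, 2, 2],
    "B":  [1, 1, 1, 1, 5, 5, 5],
}

# One-vote plurality: each voter names only their single favorite.
first_choices = []
for voter in range(7):
    first_choices.append(max(scores, key=lambda opt: scores[opt][voter]))
plurality = max(set(first_choices), key=first_choices.count)

# Highest average preference tells a different story.
averages = {opt: sum(vals) / len(vals) for opt, vals in scores.items()}
best_average = max(averages, key=averages.get)

print(plurality, best_average)  # B wins plurality; A1 has the best average
```

Here A1 and A2 split four first-place votes two apiece, so B wins plurality with three votes even though B has the lowest average score of the three.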
There are other vote-based systems with better mathematical properties, but I’ll not go into them because voting is about making a recommendation for a group, and Aura is concerned with making recommendations for individuals. (Well, it may be interested in making recommendations for groups of individuals for speed purposes or community generation purposes, but not at this point.)
Some of the earliest methods to take individual preferences into account, and ones that have stood the test of time, are pieces of software known as collaborative filters.
The first collaborative filters were user-based. If, when you and I are asked about ten movies, we respond similarly, then there is a higher-than-average probability that our responses will correlate for an eleventh film. This is a reasonable assumption and it does work, but it has some problems.
One that affects almost all recommender systems is “cold start” — if I’ve not rated anything (or only rated a couple things) then the system doesn’t really know who to compare me to for guessing my preferences.
Another is update cost. There aren’t simply two users in the system; there are potentially millions. The system doesn’t generally look for one user whose preferences exactly match mine. Instead it compares my similarity to everyone and uses that similarity to weight the influence of their preferences. This means that any time anyone in the system adds or changes a preference for an item, it affects the weight for everyone else who also has a preference for that item.
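A minimal sketch of that weighting scheme, with made-up ratings and cosine similarity standing in for whatever measure a real system uses (Pearson correlation with mean-centering is common, but the shape is the same):

```python
import math

# Hypothetical 1-5 star ratings; missing keys mean "not rated".
ratings = {
    "me":    {"film_a": 5, "film_b": 4, "film_c": 1},
    "alice": {"film_a": 5, "film_b": 5, "film_c": 1, "film_d": 5},
    "bob":   {"film_a": 1, "film_b": 2, "film_c": 5, "film_d": 1},
}

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of everyone else's rating for `item`."""
    num = den = 0.0
    for other, prefs in ratings.items():
        if other == user or item not in prefs:
            continue
        w = similarity(ratings[user], prefs)
        num += w * prefs[item]
        den += w
    return num / den if den else None

print(predict("me", "film_d"))
```

Since I agree with alice far more than with bob, her 5 for film_d dominates the prediction, and the update-cost problem is visible right in the structure: every prediction touches every other user's weights.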
Collaborative filters are either active (explicitly asking users to rate items) or passive (collecting incidental data, such as time spent on a page, to infer a user’s preferences). Either way, a user can potentially make a meaningful change to their preference profile during a single visit.
A method pioneered by Amazon is based around identifying item similarity. Rather than comparing users to each other, take user preference data and generate a model of how related items are, on the assumption that the more people who are interested in both of two items, the more likely those items are to be related. This helps with the previous issues in two ways:
Cold start — when a new user enters the system you know nothing about the user. With a new preference item, however, you have the item. Depending on the nature of the item you can examine its properties and attempt to fit it into a relatedness model. This type of work is the purpose behind autotagging music and documents — figure out what they’re related to without human intervention.
Update cost — it isn’t inherently cheaper to update an item-based system, but item relationships change more slowly than user profiles, so the process doesn’t need to react as quickly. It also helps that if a new user sees a slow reaction you may never get another chance to collect more information from them, whereas an item will likely hang around for a few days.
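Here's a toy sketch of the co-occurrence counting that the item-based idea implies (the data is made up, and real systems normalize these counts for item popularity rather than using them raw):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical "user liked item" data.
likes = {
    "ann": {"tomato_book", "trowel", "seeds"},
    "bob": {"tomato_book", "seeds"},
    "cat": {"trowel", "seeds"},
    "dan": {"novel"},
}

# Count, for every pair of items, how many users liked both: the more
# people interested in both items, the more "related" they're assumed to be.
cooccur = defaultdict(int)
for items in likes.values():
    for a, b in combinations(sorted(items), 2):
        cooccur[(a, b)] += 1

def related(item, n=2):
    """Items most often liked alongside `item`."""
    scores = {}
    for (a, b), count in cooccur.items():
        if a == item:
            scores[b] = count
        elif b == item:
            scores[a] = count
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(related("tomato_book"))  # ['seeds', 'trowel']
```

Note that nothing here depends on who is asking, which is exactly the cold-start advantage and exactly the "relatedness is not similarity" problem I get into below.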
I have two problems with item relatedness. The concept makes sense within the context of people purchasing things: I often go shopping and buy a set of things for a project, or I tend to buy things related to one area. This is a measure of relatedness, however, and distinctly not of similarity. In fact, if I bought one book on growing tomatoes, it’s more likely than average that I’m not going to buy another unless I’m really into tomatoes. So there’s a blank spot around each item where items are similar enough not to be purchased.
Amazon seems to fill this in with browsing data, because I am likely to look at two similar items to compare them. There are also probably some time-series considerations in there as well, since the longer the gap between when I bought two things, the less likely they are to be related, I would think.
With preference data there is also the danger of not having all the salient characteristics in the system. For instance, I’m listening to Garth Brooks’ The Hits right now. I listened to country in high school, and I like ’90s country songs because they remind me of that period, but I don’t really listen to much modern country. The relatedness of items is particular to individual users, and item-based similarity loses those distinctions.
Also, for something like blogs or music there is a Heisenberg sort of effect: liking a song may get me to like a genre, or hearing too much of a song may turn me off it. The sweet spot is to give me what I want, but not too much of it, and relatedness alone is not going to give me that information.