Archive for May, 2008

Inner Classes Are Not Closures

I’ve been absorbing JINI all morning. My brain’s full, and while it percolates I think I’ll post on language design. I’ve been reading Matt’s writing about language design. This is a subject he actually knows: he has a vocabulary for describing language features and can discuss them in the abstract.

I, on the other hand, am just a programmer who can’t do what I want at times because computers are uncooperative.

For example, I’m testing the piece of code I mentioned previously, the one that loads a large dataset. I know it’s dying at some point, and I’d like to know how far it gets before it does.
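
To show what I mean in miniature: I want to bump a counter from inside a listener, and Java won’t let me, because an inner class is not a closure. (The ProgressListener interface and the loading code here are invented for illustration.)

    // ProgressListener and the loading loop are made up for illustration.
    interface ProgressListener {
        void entryLoaded();
    }

    public class ProgressDemo {
        public static void main(String[] args) {
            int count = 0; // what I'd like to bump from inside the listener

            ProgressListener broken = new ProgressListener() {
                public void entryLoaded() {
                    // count++;  // won't compile: an inner class can only
                    //           // capture *final* local variables
                }
            };

            // The standard workaround: smuggle mutable state inside a
            // final one-element array. A true closure wouldn't need this.
            final int[] loaded = { 0 };
            ProgressListener working = new ProgressListener() {
                public void entryLoaded() {
                    loaded[0]++;
                    if (loaded[0] % 10000 == 0) {
                        System.out.println("loaded " + loaded[0] + " entries");
                    }
                }
            };
            working.entryLoaded(); // the loader would call this once per entry
        }
    }

A language with real closures would let me capture count directly; in Java I get to allocate an array to fool the compiler.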


Java Memory Usage

I’ve been working a bit this morning with a tool that’s pretty interesting, and I thought I’d mention it for Matt because I know he works with big datasets.

I’m not quite to the scale of his utility systems, but I’m loading data for about 50,000 LastFM users and some of their artist preferences. All in all, it ought to be a couple million entries.

The system starts out strong, adding a thousand or so entries per second. Then it’ll choke for a bit and drop off, go collect some garbage, burn through a few more, slow down, and repeat. Each time it comes back for less and less time, until eventually it stops coming back and slowly dies.
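
If you want to watch the slow death happen, something like this crude monitor (my own toy sketch, not the tool I mentioned) running alongside the load makes the pattern obvious:

    public class HeapWatcher {
        public static void main(String[] args) throws InterruptedException {
            Runtime rt = Runtime.getRuntime();
            while (true) {
                long usedKb = (rt.totalMemory() - rt.freeMemory()) / 1024;
                long maxKb = rt.maxMemory() / 1024;
                System.out.printf("used %,d KB of %,d KB max%n", usedKb, maxKb);
                Thread.sleep(1000); // sample once a second while the load runs
            }
        }
    }

Running the load with -verbose:gc tells the same story from the collector’s side, and a bigger -Xmx raises the ceiling, though if the data genuinely doesn’t fit, that only postpones the end.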


Unit Testing

I’m currently trying to write some JUnit tests for parts of Project Aura and it is irritating me. JUnit is based around the concept that tests should be atomic and capable of being run in any order. I like the idea of avoiding side effects so you have conceptually purer tests, but there are times when tests are conceptually linked.

What I’m doing right now is instantiating a simple version of a datastore designed to be deployed on a grid. This creates dozens of files and takes a few seconds. Not that the overhead is a huge problem for this one test, but over the course of time it could start to matter.
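
JUnit 4 does give a partial escape hatch: @BeforeClass and @AfterClass run once for the whole test class, so at least the expensive setup can be shared. A sketch of what I mean, with DataStore standing in for the real grid datastore:

    import org.junit.AfterClass;
    import org.junit.BeforeClass;
    import org.junit.Test;
    import static org.junit.Assert.assertNotNull;

    public class DataStoreTest {

        private static DataStore store; // "DataStore" is a stand-in name

        @BeforeClass
        public static void createStore() throws Exception {
            // Pay the dozens-of-files, multi-second cost exactly once
            // for the whole class instead of once per test.
            store = new DataStore("/tmp/test-store");
        }

        @AfterClass
        public static void destroyStore() throws Exception {
            store.close(); // hypothetical cleanup
        }

        @Test
        public void storeComesUp() {
            assertNotNull(store);
        }

        @Test
        public void acceptsAnEntry() {
            // Runs against the same shared store. Order between the two
            // tests is still officially undefined, which is the rub.
        }
    }

That amortizes the cost, but it still doesn’t let me say that one test depends on another, which is the part that’s actually irritating me.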


Basic Recommenders

After adventuring my way up the Eastern Seaboard, I am finally settled into my swanky pad outside of Boston and beginning my summer with Project Aura. So far everyone seems quite friendly, and by and large they appear to have their souls intact, so I’m hopeful that I’ll emerge from the summer not too deeply scarred from being an agent of the man.

I’m starting off with building a simple recommender, a collaborative filter, on top of the distributed datastore that they have been developing. Before delving into collaborative filters directly though, I wanted to consider some of the simpler forms of recommendations.

Perhaps the simplest recommendation is one that takes no preference into account at all. This is what I get when I put Amarok on random and set it playing. Not great on accuracy, but pretty much impossible to beat on speed.

A step up from that is a recommender system that incorporates preferences, but aggregates them — voting. Voting makes a lot of sense when the aggregate choice is going to be applied to everyone, say with a political candidate. It makes less sense when the choice is only for an individual. It is still used though: when you go to a website and it tells you the ten most popular products, those have effectively been chosen by a vote.
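
Concretely, that ten-most-popular list is nothing more than a frequency count over everyone’s preference events, something like this toy sketch:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class Popularity {
        /** One "vote" per preference event; return the n most-voted items. */
        public static List<String> topN(List<String> preferenceEvents, int n) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String item : preferenceEvents) {
                Integer c = counts.get(item);
                counts.put(item, c == null ? 1 : c + 1);
            }
            List<Map.Entry<String, Integer>> tally =
                    new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
            Collections.sort(tally, new Comparator<Map.Entry<String, Integer>>() {
                public int compare(Map.Entry<String, Integer> a,
                                   Map.Entry<String, Integer> b) {
                    return b.getValue() - a.getValue(); // most votes first
                }
            });
            List<String> top = new ArrayList<String>();
            for (int i = 0; i < Math.min(n, tally.size()); i++) {
                top.add(tally.get(i).getKey());
            }
            return top;
        }
    }

Every user gets the same answer out of it, which is exactly the problem.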

Not only are vote-based systems not personalized, they are also subject to some pretty serious mathematical problems. I’ll not go into them, other than to mention that one-vote plurality is just about the worst possible method if you’re concerned about picking the choice with the highest average preference. Similar choices split the vote and cloud the genuine group preference, as Ralph Nader and Ross Perot helped to demonstrate.

There are other vote-based systems with better mathematical properties, but I’ll not go into them either, because voting is about making a recommendation for a group, and Aura is concerned with making recommendations for an individual. (Well, it may be interested in making recommendations for groups of individuals for speed or community-generation purposes, but not at this point.)

Some of the earliest methods to consider individual preferences, and ones that have stood the test of time, are pieces of software known as collaborative filters.

The first collaborative filters were user-based. If, when you and I are asked about ten movies, we respond similarly, then there is a higher-than-average probability that our responses will be correlated for an eleventh film. This is a reasonable assumption and it does work, but it has some problems.

One that affects almost all recommender systems is “cold start”: if I’ve not rated anything (or have only rated a couple of things), the system doesn’t really know who to compare me to when guessing my preferences.

Another is update cost. There aren’t simply two users in the system; there are potentially millions. The system doesn’t generally look for one user with exactly my preferences. Instead it computes my similarity to everyone and uses those similarities to weight the influence of their preferences. This means that any time anyone in the system adds or changes a preference for an item, it affects the weights for everyone else who also has a preference for that item.
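
In sketch form, the prediction is just a similarity-weighted average over everyone else’s ratings. This is my own toy version, with cosine similarity standing in for whatever measure a real system would use:

    import java.util.Map;

    public class UserBasedCF {
        /**
         * Predict user u's rating for an item as the similarity-weighted
         * average of every other user's rating of that item.
         * ratings maps userId -> (itemId -> rating).
         */
        public static double predict(String u, String item,
                                     Map<String, Map<String, Double>> ratings) {
            double weighted = 0, norm = 0;
            for (Map.Entry<String, Map<String, Double>> e : ratings.entrySet()) {
                String v = e.getKey();
                Double r = e.getValue().get(item);
                if (v.equals(u) || r == null) {
                    continue; // skip myself and users who haven't rated the item
                }
                double sim = similarity(ratings.get(u), e.getValue());
                weighted += sim * r;
                norm += Math.abs(sim);
            }
            return norm == 0 ? 0 : weighted / norm;
        }

        // Stand-in measure: cosine similarity, treating unrated items as 0.
        // Pearson correlation is the other usual choice.
        private static double similarity(Map<String, Double> a,
                                         Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                na += e.getValue() * e.getValue();
                Double other = b.get(e.getKey());
                if (other != null) {
                    dot += e.getValue() * other;
                }
            }
            for (double x : b.values()) {
                nb += x * x;
            }
            return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
        }
    }

This naive version recomputes the similarities on every call; a real system caches them, which is exactly why a single changed preference is expensive: it invalidates the cached weights of everyone who shares that item.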

Collaborative filters are either active (explicitly asking users to rate items) or passive (collecting incidental data, such as time spent on a page, to infer preference). Either way, a user’s preference profile can change meaningfully over the course of a single visit.

A method pioneered by Amazon is based around identifying item similarity. Rather than comparing users to each other, take user preference data and generate a model of how related items are, on the assumption that the more people are interested in the same two things, the more likely those things are to be related (there’s a concrete sketch after the list below). This helps with the previous issues in two ways:

Cold start — when a new user enters the system, you know nothing about them. With a new item, however, you at least have the item itself: depending on its nature, you can examine its properties and attempt to fit it into the relatedness model. That is the purpose behind autotagging music and documents: figuring out what they’re related to without human intervention.

Update cost — it isn’t cheaper to update an item-based system, but items change more slowly than users do, so the process doesn’t need to react as quickly. It also helps that if a new user sees a slow response you may never get another chance to collect information from them, whereas an item will likely hang around for days.
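
To make the item-item idea concrete, here is a toy co-occurrence version of the kind of model I mean (my own sketch, not Amazon’s actual algorithm):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class ItemCooccurrence {
        /**
         * A very naive relatedness model: two items are related in
         * proportion to how many users expressed a preference for both.
         * userItems maps userId -> the set of items that user liked.
         */
        public static Map<String, Map<String, Integer>> build(
                Map<String, Set<String>> userItems) {
            Map<String, Map<String, Integer>> related =
                    new HashMap<String, Map<String, Integer>>();
            for (Set<String> items : userItems.values()) {
                for (String a : items) {
                    for (String b : items) {
                        if (a.equals(b)) {
                            continue; // an item trivially co-occurs with itself
                        }
                        Map<String, Integer> row = related.get(a);
                        if (row == null) {
                            row = new HashMap<String, Integer>();
                            related.put(a, row);
                        }
                        Integer c = row.get(b);
                        row.put(b, c == null ? 1 : c + 1);
                    }
                }
            }
            return related;
        }
    }

A brand-new user’s first click immediately has somewhere to land in that table, and the table itself changes slowly enough that it can be rebuilt offline, which is where the two advantages above come from.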

I have two problems with item relatedness. The concept was pioneered to some extent by Amazon, and it makes sense in the context of people purchasing things. I often go shopping and buy a set of things for a project, or I tend to buy things related to one area. But that is a measure of relatedness, distinctly not of similarity. In fact, if I bought one book on growing tomatoes, it’s more likely than average that I’m not going to buy another, unless I’m really into tomatoes. So there’s a blank spot around each item, where other items are similar enough that they won’t be purchased together.

Amazon seems to fill this in with browsing data, since I am likely to look at two similar items to compare them. There are probably some time-series considerations in there as well: the longer between my purchases of two things, the less likely they are to be related, I would think.

With preferences there is also the danger of not having all the salient characteristics in the system. For instance, I’m listening to Garth Brooks’ The Hits right now. I listened to country in high school, and I like ’90s country songs because they remind me of that period, but I don’t really listen to much modern country. Relatedness of items is individual to particular users, and item-based similarity loses those distinctions.

Also, for something like blogs or music, there is a Heisenberg sort of effect: liking a song may get me into a genre, or hearing too much of a song may turn me off of it. The sweet spot is to give me what I want, but not too much of it. Relatedness is not really going to give me that information.


Recommending Meaning

I am a huge storytelling fan. I’m from Bristol, Tennessee, which is a mere stone’s throw away from Jonesborough, where each year of my youth we went to see the National Storytelling Festival.

I was down in Asheville these past few days visiting with the middle of the three Holcomb boys: Matt. (I’m the eldest.) We were there ostensibly for his birthday, and one of the events we went to see was the live simulcast of This American Life.

If you are unfamiliar with This American Life, I recommend you take a moment to listen to a piece or two from the archive. Every week they have a theme and collect three stories on that subject. The stories can be simple or they can be topical. I blogged months ago about crying while listening to stories from Iraq.

Anyhow, scientist and metacognician that I am, after I got through watching the movie I found myself thinking about what made particular stories particularly touching to me. There was a topical piece in which an Iraqi now living in the States went around with a little booth, à la Lucy Van Pelt, bearing a placard reading “Talk to an Iraqi”. He recorded the responses of everyone who came to talk to him and, with his own narrative, drew them together into a story.

The piece ultimately makes a compelling argument against the War in Iraq. My family discussed the piece in the car afterward. No one in the car is particularly fond of how the war is going, but despite our common opinion we had a variety of perspectives on what was important: my father and I were interested in the piece as propaganda, my brother in the salient information for American policy, my mom in the lives of some of the speakers…

Much as with what I understand of music, different characteristics appealed to different people. That raises the question of whether we can form a set of salient, discrete descriptors for a story (as we attempt to do with the timbral characteristics of music) so that I could guess ahead of time how much someone is going to like it.

It also makes me wonder what you would attempt to measure about a person to help build such a model.

I’ve always thought it would be fascinating to have access to a data source like OkCupid, where people have completed thousands of psychometric tests. I don’t know whether feeding those results into a clustering algorithm along with preference data would help us build a better model of how personality and musical preference interact, but it would be amazingly interesting to give it a shot.

(As I’ve mentioned, control of the data is already starting to shape scientific progress.)
