I have been discussing the idea of Survivor Bias from Taleb’s The Black Swan.
The basic idea of survivor bias is that we generally abstract the characteristics that make up a set only from the members of that set. The unknown component of the analysis is frequently the extent to which those characteristics were present in elements that didn’t make it into the set.
The example he gave is a researcher tasked with fortifying the planes going out to fight the Nazis in WWII. Simply adding more plating all over would make the planes unreasonably heavy, so he looked at all the planes coming back from missions and put plating wherever those planes hadn’t been shot.
It may seem as though you want to shore up your planes in the places they’re getting shot so they’ll be stronger. He realized, though, that he was working with the set of planes that had been shot in places too unimportant to take them down. The planes hit in the critical places never came back to be counted.
The basic idea is that your study can’t tell you what caused someone to succeed. At best, it can tell you things that probably contributed to failure.
And it can only do that if there’s only one factor at play. Imagine that planes shot through one wing have a 5% chance of going down, planes shot through both wings have a 25% chance of going down, and planes shot through both wings and the tail have a 75% chance of going down. The set of planes you look at is going to have holes in all those places, and you might not shore up any one of them. The analysis necessary to see the trends increases exponentially with the number of factors you consider.
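To make the numbers above concrete, here is a minimal sketch that works out what the returning planes would look like. The loss probabilities come from the example; the location names (`left_wing` standing in for "one wing") and the assumption that the same number of planes took each hit pattern are mine, purely for illustration.

```python
from fractions import Fraction

# Loss probabilities from the example, keyed by the set of places a plane was hit.
LOSS_PROBABILITY = {
    frozenset({"left_wing"}): Fraction(5, 100),
    frozenset({"left_wing", "right_wing"}): Fraction(25, 100),
    frozenset({"left_wing", "right_wing", "tail"}): Fraction(75, 100),
}

# Assumption (not in the example): the same number of planes took each hit pattern.
PLANES_PER_PATTERN = 100


def hole_share_among_survivors(loss_probability, planes_per_pattern):
    """For each location, the expected fraction of returning planes with a hole there."""
    survivors = {
        pattern: planes_per_pattern * (1 - p_loss)
        for pattern, p_loss in loss_probability.items()
    }
    total_returning = sum(survivors.values())
    locations = set().union(*loss_probability)
    return {
        location: sum(n for pattern, n in survivors.items() if location in pattern)
        / total_returning
        for location in sorted(locations)
    }


shares = hole_share_among_survivors(LOSS_PROBABILITY, PLANES_PER_PATTERN)
for location, share in shares.items():
    print(f"{location}: {float(share):.0%} of returning planes")
```

Under these assumptions every location shows up with holes on the returning planes, so "armor wherever the survivors weren't hit" no longer singles anything out. Tail holes appear on only about 13% of returning planes even though the pattern that includes the tail is the deadliest one, and separating the tail's contribution from the wings' would require also knowing how many planes were sent out with each pattern.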
One of the themes of The Black Swan is that, because there are so many factors and our models are so limited, there will always be “black swans”: events completely outside of reasonable expectations that change our ways of thinking.
How does this relate to recommendations? Well, I have some info on some songs and the sort of person I think likes them. The cold start problem is what to do when I either have a song I know nothing about or a person I know nothing about. The person is tricky, but with the song I can look at the auditory characteristics of the song and compare them to other songs a person likes. This guess, though, is potentially going to be skewed by survivor bias.
People like music for a variety of reasons: the song at my prom, the song I associate with my first infatuation, a song from a particularly awesome party, a song I sang in a choir. These are songs that have to do with the moment rather than the actual musical characteristics of the song. Using those songs to guess the characteristics of what I like musically isn’t going to work.
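As a rough sketch of that content-based guess, one option is to average the feature vectors of the songs a person likes and measure how close a new song is to that average with cosine similarity. The three features and their values here are hypothetical, and cosine similarity is just one reasonable choice of comparison, not something prescribed above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare against
    return dot / (norm_a * norm_b)


def taste_profile(liked_songs):
    """Average the feature vectors of the songs a listener likes."""
    count = len(liked_songs)
    return [sum(column) / count for column in zip(*liked_songs)]


# Hypothetical features, each scaled to 0..1: (tempo, energy, acousticness).
liked = [(0.8, 0.9, 0.1), (0.7, 0.8, 0.2)]
new_song = (0.75, 0.85, 0.15)

score = cosine_similarity(taste_profile(liked), new_song)
print(f"similarity to taste profile: {score:.3f}")
```

Note that `taste_profile` treats every liked song as evidence about musical taste, which is exactly the step where the bias can creep in.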
Pragmatically, this issue encourages a system that asks the question “how likely is someone to dislike this song?” as well as “how likely is someone to like this song?” It also reinforces the drive to incorporate external information sources that can help build a profile of the user that is independent of their listening history alone.
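One way to picture asking both questions is to keep separate estimates for liking and disliking and combine them only at the end. This is a minimal sketch under my own assumptions: the function names, the idea of counting likes and dislikes among comparable listeners, and the simple smoothing are all hypothetical, not a description of any real system.

```python
def smoothed_rate(hits, trials, prior_hits=1.0, prior_trials=2.0):
    """Estimate a rate from counts, pulled toward 0.5 when there is little data."""
    return (hits + prior_hits) / (trials + prior_trials)


def recommendation_score(likes, dislikes, plays):
    """Positive when liking looks more probable than disliking, negative otherwise."""
    p_like = smoothed_rate(likes, plays)
    p_dislike = smoothed_rate(dislikes, plays)
    return p_like - p_dislike


# Hypothetical counts among comparable listeners: (likes, dislikes, plays).
print(recommendation_score(8, 1, 10))  # mostly liked
print(recommendation_score(1, 8, 10))  # mostly disliked
print(recommendation_score(0, 0, 0))   # cold start: no evidence either way
```

Keeping the dislike estimate separate means a song with few likes but no dislikes is not treated the same as a song that people actively reject.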