Item:Item Artist Similarity

Item-to-item similarity is a method popularized by Amazon for computing the similarity of items in its catalog. The reasoning is that item similarities are more static than user similarities, so when computing similarities is expensive, item similarities hold up better when they are only updated infrequently.

In the paper “Amazon.com Recommendations: Item-to-Item Collaborative Filtering” the algorithm is applied to purchases, though I think conceptually it makes more sense to consider it in terms of pageviews. I’m not likely to purchase a bunch of similar items, but I am likely to look at a bunch of them consecutively while shopping.

The algorithm as described in the paper is:

  1. For each item in product catalog, I1

    1. For each customer C who purchased I1

      1. For each item I2 purchased by C

        1. Record that C purchased I1 and I2
    2. For each item I2

      1. Compute the similarity between I1 and I2

Converting this to artists and listeners is a simple transliteration:

  1. For each artist, a1

    1. For each listener who listened to a1

      1. For each artist a2 listened to by that listener

        1. Record how many times that listener listened to a1 and a2 respectively
    2. For each artist a2

      1. Compute the similarity between a1 and a2
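As a rough sketch, the artist and listener version of the loop might look like this in Python. The `plays` data, the artist and listener names, and the `artist_similarities` function are all made up for illustration, and the similarity step assumes cosine similarity over raw play counts.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical listen data: listener -> {artist: play count}.
plays = {
    "ann": {"Low": 10, "Slowdive": 4},
    "bob": {"Low": 2, "Autechre": 7},
    "cat": {"Autechre": 5},
}

def artist_similarities(plays):
    # Invert the data so we can ask "who listened to a1?".
    listeners_of = defaultdict(dict)
    for listener, counts in plays.items():
        for artist, n in counts.items():
            listeners_of[artist][listener] = n

    similarities = {}
    for a1, a1_listeners in listeners_of.items():
        # Inner loops: for every listener of a1, record the pair
        # (plays of a1, plays of a2) for each other artist a2 they played.
        recorded = defaultdict(list)
        for listener, n1 in a1_listeners.items():
            for a2, n2 in plays[listener].items():
                if a2 != a1:
                    recorded[a2].append((n1, n2))

        # Similarity step: cosine computed from the recorded pairs.
        norm1 = sqrt(sum(n * n for n in a1_listeners.values()))
        for a2, pairs in recorded.items():
            dot = sum(n1 * n2 for n1, n2 in pairs)
            norm2 = sqrt(sum(n * n for n in listeners_of[a2].values()))
            similarities[(a1, a2)] = dot / (norm1 * norm2)
    return similarities

sims = artist_similarities(plays)
for pair, score in sorted(sims.items()):
    print(pair, round(score, 3))
```

Artists with no listener in common never show up in `sims`, which amounts to a similarity of zero. In this toy data that is the case for Slowdive and Autechre.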

Since cosine similarity is being used, it is easiest to think of this initially as being about the construction of vectors. The algorithm could be rewritten as: “For each artist, construct a vector where each position in the vector represents a particular user. The similarity between any two artists is the cosine of the angle between those vectors.”
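To make the vector view concrete, here is a minimal sketch that builds one dense vector per artist and takes the cosine between two of them. The data and the helper names `artist_vectors` and `cosine` are assumptions for illustration. A real catalog would want sparse vectors rather than plain lists.

```python
from math import sqrt

# Hypothetical listen data: listener -> {artist: play count}.
plays = {
    "ann": {"Low": 10, "Slowdive": 4},
    "bob": {"Low": 2, "Autechre": 7},
    "cat": {"Autechre": 5},
}

def artist_vectors(plays):
    # Sorting the users fixes which position belongs to which user.
    users = sorted(plays)
    artists = sorted({a for counts in plays.values() for a in counts})
    return {a: [plays[u].get(a, 0) for u in users] for a in artists}

def cosine(v, w):
    dot = sum(x * y for x, y in zip(v, w))
    norm = sqrt(sum(x * x for x in v)) * sqrt(sum(y * y for y in w))
    # Treat an all-zero vector as similar to nothing.
    return dot / norm if norm else 0.0

vectors = artist_vectors(plays)
print(vectors["Low"])  # positions are ann, bob, cat
print(cosine(vectors["Low"], vectors["Slowdive"]))
```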

In Amazon’s system, either an item has been purchased, or it hasn’t, so the vectors are boolean: either one or zero. For listen or pageview data, however, the vector components can have more complex values. The question then becomes whether, and how, those values should be altered to give the best results.
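A small sketch shows how much the choice matters. The two vectors below are invented play counts for two artists across the same four listeners. The `binary` and `damped` transforms are just two illustrative options, with log damping picked as an assumption rather than anything taken from the paper.

```python
from math import log1p, sqrt

def cosine(v, w):
    dot = sum(x * y for x, y in zip(v, w))
    norm = sqrt(sum(x * x for x in v)) * sqrt(sum(y * y for y in w))
    return dot / norm if norm else 0.0

def binary(v):
    # Amazon-style: listened or not.
    return [1.0 if x > 0 else 0.0 for x in v]

def damped(v):
    # One possible alteration: log(1 + count) shrinks very large counts.
    return [log1p(x) for x in v]

# Hypothetical play counts; the first listener is obsessed with a1.
a1 = [200, 1, 1, 0]
a2 = [1, 1, 1, 0]

raw_sim = cosine(a1, a2)
binary_sim = cosine(binary(a1), binary(a2))
damped_sim = cosine(damped(a1), damped(a2))
print(raw_sim, damped_sim, binary_sim)
```

The same three listeners played both artists, so the binary vectors are identical and their cosine is 1 up to rounding. With raw counts the single heavy listener pulls the two vectors apart, and the damped version lands in between.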

It is worth noting that TF-IDF as described here is simply one method for weighting the elements of those vectors. Analyzing the effectiveness of TF-IDF can be done by comparing the results of TF-IDF to the output of other methods. Even without a definitive ground truth as to which artists are more or less similar, it is possible to say that two systems of computing similarity are similar or dissimilar.
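One way to compare two weighting schemes without ground truth is to check how often they agree on each artist’s nearest neighbours. The sketch below is only that, a sketch: the data is invented, the TF-IDF formulation (artist as document, listener as term, IDF as the log of total artists over artists that listener played) is one common choice that I am assuming here, and the `agreement` measure (mean Jaccard overlap of top-k neighbours) is just one plausible way to say that two systems are similar.

```python
from math import log, sqrt

# Hypothetical listen data: listener -> {artist: play count}.
plays = {
    "ann": {"Low": 10, "Slowdive": 4},
    "bob": {"Low": 2, "Slowdive": 1, "Autechre": 7, "Plaid": 3},
    "cat": {"Autechre": 5, "Plaid": 6},
    "dan": {"Low": 1, "Plaid": 2},
}

def cosine(v, w):
    # Sparse vectors stored as {listener: weight}.
    dot = sum(x * w.get(k, 0.0) for k, x in v.items())
    norm = sqrt(sum(x * x for x in v.values())) * sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

def raw_vectors(plays):
    vectors = {}
    for listener, counts in plays.items():
        for artist, n in counts.items():
            vectors.setdefault(artist, {})[listener] = float(n)
    return vectors

def tfidf_vectors(plays):
    # Assumed formulation: tf = play count, idf = log(artists / artists this listener played).
    vectors = raw_vectors(plays)
    n_artists = len(vectors)
    return {
        artist: {l: n * log(n_artists / len(plays[l])) for l, n in vec.items()}
        for artist, vec in vectors.items()
    }

def top_k(vectors, artist, k):
    others = [a for a in vectors if a != artist]
    # Highest similarity first; ties broken by name so the result is stable.
    others.sort(key=lambda a: (-cosine(vectors[artist], vectors[a]), a))
    return others[:k]

def agreement(system_a, system_b, k=2):
    # Mean Jaccard overlap of each artist's top-k neighbours under the two systems.
    scores = []
    for artist in system_a:
        a = set(top_k(system_a, artist, k))
        b = set(top_k(system_b, artist, k))
        scores.append(len(a & b) / len(a | b))
    return sum(scores) / len(scores)

raw = raw_vectors(plays)
weighted = tfidf_vectors(plays)
print(agreement(raw, weighted))
```

An agreement of 1.0 means the two systems pick the same neighbours for every artist, and lower values mean the weighting is changing the results. Note that under this formulation a listener who has played every artist, like bob above, gets an IDF of zero and stops contributing at all.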
