Archive for computers

Music Memes

I was riding home and two songs came on the radio, one right after the other: Katy Perry’s “I Kissed a Girl” and Usher’s “Love In This Club.”

Two popular songs focusing explicitly on anonymous sexual behavior caught my attention because NPR ran a report that morning on this year’s CDC Youth Risk Behavior study, which shows slight increases in sexual activity.

I think it would be interesting for some far-off researcher to look at correlations between the popularity of themes in songs and the studies tracking those behaviors in populations. I’m especially interested in subpopulations that are closely attuned to cultural trends, maybe hipsters and teenagers, and in how cultural concepts spread. If recommender systems get really good at what they are trying to do, they will hold potentially fascinating results for sociological researchers.

I was reading a dystopian future story where someone figures out a concrete epidemiology for memes and proceeds to rule the world by controlling what everyone believes. I suppose in some ways that’s what the Illuminatus Trilogy is about (which I am wading through currently).


What’s Going On At Sun

Hoare’s Dictum: Premature optimization is the root of all evil.

— C.A.R. Hoare

I’ve been spending the last couple days reading about collaborative filters again. I got sidetracked for a bit trying to load some data into the distributed data store, but I’m back to wandering through papers. In particular, Herlocker’s “Evaluating Collaborative Filtering Recommender Systems” is an excellent summary of the issues surrounding writing a collaborative filter.

One of the things I’ve been realizing is that I simply don’t know enough about the field to design a general purpose framework. I’ve looked at the structure of systems like Taste, Cofi and CoFE, but I don’t really have the background, other than in a broad systems design sense, to evaluate what they’ve done.

The reason I’ve been thinking about Hoare’s Dictum is that the ideal computer program would run instantaneously, use no resources and do everything. Premature optimization is usually discussed only in terms of execution time or resource allocation, but if I attempt to create a general model for collaborative filtering before I really understand the field, I am optimizing along the axis of flexibility before I have the conceptual background to do so.

I’m a programmer. The way I get a conceptual background in something is to write a program. I’m not going to say I’m going to write a throwaway program (since people argue that most throwaway programs aren’t [thrown away]), but secretly that’s what I want.

Because I’m shooting not for a general framework but for a specific program for the purpose of learning, I can set a specific, demonstrable goal, which is much more manageable from a research perspective. So, what is a good program for a collaborative filter?

I figure a good choice would be something that is already being done by Project Aura so that we can compare their text mining techniques with a collaborative filter. So, what is Project Aura doing? Here’s what I know of from my month (has it been a month?!?) here:

  • Document Similarity Based Recommendations:
    • Tagomendations — Finding artists that are similar to each other based on the tags provided by last.fm. The tags are “cleaned” prior to clustering so that distinctive tags will be more influential.
    • User-Based Recommendations — Instead of using the tags as the criteria for determining if artists are similar, use the listeners who enjoy a particular artist.
  • Aardvark — Generating blog recommendations based on an RSS feed of entries. The RSS feed is generally generated from Google Reader’s shared items.
  • Tastebroker:
  • GUI Visualizations — Populating 3D interfaces with similarities of both blogs and music using dimensionality reductions and color and size to represent some of the characteristics.

It seems like a good initial project is simply to take a last.fm user profile and recommend an artist to them based on the tag space. It’s been done before, but that’s fine for a learning project.
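As a rough sketch of that learning project (not Project Aura’s actual code), here is one way a tag-space recommendation could look in Python. Everything in it is an assumption for illustration: the artist names and tag counts are made up, the inverse-document-frequency weighting is only my stand-in for the tag “cleaning” that makes distinctive tags more influential, and cosine similarity is just one common choice of similarity measure.

```python
import math
from collections import Counter

# Hypothetical toy data: artist -> {tag: how often the tag was applied}.
# Real data would come from last.fm; these names and counts are invented.
ARTIST_TAGS = {
    "Artist A": {"rock": 10, "indie": 5},
    "Artist B": {"rock": 8, "indie": 6, "lo-fi": 2},
    "Artist C": {"rock": 9, "jazz": 7},
    "Artist D": {"jazz": 10, "bebop": 4},
}


def idf_weights(artist_tags):
    """Weight each tag so that tags used by fewer artists count for more."""
    n_artists = len(artist_tags)
    doc_freq = Counter(tag for tags in artist_tags.values() for tag in tags)
    return {tag: math.log(n_artists / df) + 1.0 for tag, df in doc_freq.items()}


def weighted_vector(tags, idf):
    """Turn raw tag counts into an IDF-weighted sparse vector."""
    return {tag: count * idf.get(tag, 0.0) for tag, count in tags.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tag, 0.0) for tag, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(listened, artist_tags, top_n=1):
    """Rank artists the user has not listened to by tag similarity."""
    idf = idf_weights(artist_tags)
    # The user's profile is the sum of the tag vectors of their artists.
    profile = Counter()
    for artist in listened:
        for tag, weight in weighted_vector(artist_tags[artist], idf).items():
            profile[tag] += weight
    scores = {
        artist: cosine(profile, weighted_vector(tags, idf))
        for artist, tags in artist_tags.items()
        if artist not in listened
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend(["Artist A"], ARTIST_TAGS))  # -> ['Artist B']
```

With this toy data, someone who only listens to “Artist A” gets “Artist B” first, because the two share both “rock” and the rarer “indie” tag, while “Artist C” only shares the very common “rock” tag.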


Software Is Applied Philosophy

Yesterday, I wrote an entry on science as a coping mechanism for dealing with complexity.

It’s a fairly lengthy and philosophical piece. I originally intended to structure things a bit differently, but once it started to get long, I decided to split things up.

It would, after all, be somewhat ironic to spend three pages arguing that good science is based around breaking things into mentally manageable chunks and then, having reached a potential stopping place, water down my point by stuffing more information into the post.

Which is central to what I would like to focus on: establishing boundaries. The previous post started from a concept by Patrick Grim that over time rational models go from being broad philosophical abstractions to specific systematized sciences. I added the idea that the systemic boundaries are not necessarily “natural” characteristics of reality, but rather that they are drawn to accommodate the limited capacity of our mental hardware to deal with complexity.

For a somewhat expanded (and more entertaining) version of that argument, consider reading Lewis Padgett’s Mimsy Were the Borogoves.


Progressing Toward Specificity

I went down to New Haven this weekend to help Wayne move into his beautiful new apartment. Along the way I was listening to a recording of Patrick Grim discuss “A Philosophy of Mind.”


Inner Classes Are Not Closures

I’ve been absorbing JINI all morning. My brain’s full and as it percolates, I think I’ll post on language design. I’ve been reading Matt writing about language design. This is a subject he actually knows. He has a vocabulary for describing language features and can discuss them in the abstract.

I, on the other hand, am just a programmer who can’t do what I want at times because computers are uncooperative.

For example, I am testing the piece of code that I mentioned previously that loads a large dataset. I know it’s dying at some point and I would like to know the progress.


Java Memory Usage

I’ve been working a bit this morning with a tool that’s pretty interesting, and I thought I would mention it for Matt because I know he works with big datasets.

I’m not quite to the scale of his utility systems, but I’m loading data for about 50,000 LastFM users and some of their artist preferences. All in all, it ought to be a couple million entries.

The system starts out strong, adding a thousand or so entries per second. Then it’ll choke for a bit and drop off, go collect some garbage, burn through a few more, slow down and repeat. Each time it comes back for less and less time, until eventually it stops coming back and slowly dies.


Unit Testing

I’m currently trying to write some JUnit tests for parts of Project Aura and it is irritating me. JUnit is based around the concept that tests should be atomic and capable of being run in any order. I like the idea of avoiding side effects so you have conceptually purer tests, but there are times when tests are conceptually linked.

What I’m doing right now is instantiating a simple version of a datastore designed to be deployed on a grid. This creates dozens of files and takes a few seconds. Not that the overhead is a huge problem for this one test, but over the course of time it could start to matter.


Building the Matrix

I made a snippy comment the other day to Jenni (my fiancée who’s doing her Ph.D. in public health at Johns Hopkins) about the difficulty of “soft” sciences like history and psychology versus “hard” sciences like physics and chemistry.

In the ensuing amicable discussion, she effectively proved her primary thesis — I was being an ass. ☺

Soft sciences represent the forefront of human knowledge. The models don’t lack precision because the people making them aren’t bright enough to make them. They lack precision because they’re so amazingly complex that the sum total of human knowledge hasn’t given us a precise model.

Consider the process of immunology. Long long ago we ate whatever the hell we wanted if it didn’t smell too funky. After a while people started to recognize a trend: certain things smell ok, but will still kill you. For example, trichinosis can affect seemingly healthy pork. So, we come up with a model where God tells us to not eat pigs.

If you follow a religion that still doesn’t dig on swine, I’m cool with that, but most of the Western world has decided God’s ok with it. I blame bacon. How long could we be expected to hold out against that delicious temptation?


Blog Recommenders

On the subject of my previous post about economic modeled music recommendations — I think a really good application of this would also be to blogging.

Imagine an app something like Google Reader, but where instead of me manually adding in bunches of feeds by myself, I log in and the program gives me a feed of items I am likely to like.

It’s related to the service that Stumble is doing, but collected in one place and with a more visible data model. Since the barrier to entry for creating blog entries is lower than with music, you’d have a new factor. You’d have your audience, your exemplars and, if the application was popular, the bloggers would start to react as well.

Thinking about it in terms of blogging made me realize an assumption I made about music geeks. I assumed that a music geek would just slowly widen their horizons and start to like new genres, but that the set of optimal songs for a given genre would stay the same.

That’s not necessarily the case though. Imagine that as someone expands their musical horizons they begin to recognize good musical form. Peppy but sloppy songs they used to like may fall out of favor. Musicianship doesn’t always correlate to popularity.

Mathematically, what this means is that a song isn’t simply a member of a single cluster, because the quality of that song will differ depending on the cluster. A song is essentially in every cluster simultaneously, to a varying (and frequently very small) degree. It suggests a different method for finding clusters: looking at patterns across axes rather than using something like k-centroids, which looks at all of them as a whole.
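To make the “in every cluster to a varying amount” idea concrete, here is a small Python sketch that borrows the membership formula from fuzzy c-means. That choice is an assumption on my part, as are the made-up 2D coordinates; it only illustrates graded membership for one song against fixed cluster centers, not the across-axes clustering method speculated about above.

```python
import math


def soft_memberships(point, centers, fuzziness=2.0):
    """Return one membership weight per cluster center; weights sum to 1.

    Uses the fuzzy c-means membership formula: a cluster's weight is
    proportional to distance ** (-2 / (fuzziness - 1)).
    """
    if fuzziness <= 1.0:
        raise ValueError("fuzziness must be greater than 1")
    dists = [math.dist(point, center) for center in centers]
    # A point sitting exactly on one or more centers belongs only to those.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (fuzziness - 1.0)
    inverse = [d ** -exponent for d in dists]
    total = sum(inverse)
    return [w / total for w in inverse]


# Hypothetical song and cluster centers in an invented 2D feature space.
song = (1.0, 0.0)
centers = [(0.0, 0.0), (4.0, 0.0), (1.0, 10.0)]
for center, weight in zip(centers, soft_memberships(song, centers)):
    print(center, round(weight, 4))
```

The nearest cluster gets most of the weight, but the far ones never drop to exactly zero, which matches the “frequently very small” memberships described above.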


Geekery in Meme Form

This is one of the geekier memes I’ve seen, but I really like it. You’ll know what it means if it applies to you:

will@ebene:~$ history|awk '{a[$2]++} END{for(i in a){printf "%5d\t%s\n",a[i],i}}'|sort -rn|head
  169   fg
  138   make
   35   emacs
   29   svn
   26   cd
   24   ls
   20   for
   15   rm
    7   R
    6   mv

I found it from a fellow named Killus. I’ve been wandering the archive of his blog for a bit and I like the simple-but-complete math explanations theme. There’s quite a bit of neat stuff, but I’ll just mention a post on working for WRPI where he argues (as a side note) that music recommenders will deviate from public opinion simply by virtue of being so deeply involved in it.

I’ve got a note on that, but I’ll put it into another post to not melange these ideas.
