Archive for June, 2008

Array Indexing Methods In Java, R and Python

I’ve been working a bit more with R doing some data analysis. I keep getting hung up on the array access semantics and wanted to write them out so I could remember them.

Arrays are a very common data structure in computer science and most every language has support for them. Most intro classes describe an array as a set of boxes where you can stick stuff and refer to them by number.
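To make the “boxes” picture concrete, here is a minimal sketch in Python using a plain list. The variable names are just for illustration. Python and Java both number the boxes starting at 0, while R starts at 1, and a negative index means “count from the end” in Python but “leave this element out” in R.

```python
# A list plays the role of the row of boxes; positions start at 0 in Python.
boxes = ["a", "b", "c", "d"]

first = boxes[0]      # the first box
last = boxes[-1]      # negative indexes count back from the end
middle = boxes[1:3]   # a slice: start is included, stop is excluded

boxes[2] = "z"        # a box can be refilled in place
```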

Read the rest of this entry »

Comments (3)

Leaving the Ocean Unboiled

I’ve been wandering a bit about intellectually. There was a method to this madness. Here was my plan for profit:

  1. Semiotics — Separate items and their meanings. Rather than considering a song a discrete thing that a user has a preference for, think of it as a complex symbol that has meaning for a user.
  2. Memetics — Examine shared cultural myths as philosophies of human nature and argue that the process guiding their specification is the same as the one driving philosophies about the world toward sciences.
  3. Preference as Conditioning — Distinguish between cultural symbols (guys wear pants) and simple symbols (the sun meaning warmth), argue that music communicates both types and that a unified perspective of messages can incorporate both.
  4. Hidden Markov Models — Posit that preference arises from the conditioning of a relatively small number of elements. Attempt to use patterns in the expressed preferences to guess the layout of this hidden network. Introduce the concept of ego as a state maintenance function on a stateless network.
  5. Vector Distance — Come up with some sort of unified way to train a Markov model on cepstral coefficients, tags and lyrics and use the weights of the nodes as a vector for computing user similarity (or, by examining the networks of the users liking a certain thing, compute a vector representing the messages communicated by a complex sign).
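
For the last step, one common way to compare two weight vectors is cosine similarity. This is only a sketch of that one idea, assuming each user has already been boiled down to a list of numbers of the same length. The function name and the example vectors are made up, and nothing here covers the Markov model itself.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical node weights for two users.
alice = [0.9, 0.1, 0.0]
bob = [0.8, 0.2, 0.1]
similarity = cosine_similarity(alice, bob)
```

A value near 1 means the two vectors point the same way, and a value near 0 means they have little in common.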

So, kinda out there as an idea. I didn’t really mean for it to come out quite that strangely. The task I was given was, “come up with a collaborative filter.” Recall that my definition of “collaborative filter” is pretty amorphous. I’ve read several papers on collaborative filtering, but none of them is particularly explicit. The definition I got from Wikipedia was:

Read the rest of this entry »


Webmaster Tools and WordPress 404

I have Google’s webmaster tools as a gadget on my Google start page.

You have to verify that you control a domain before Google will tell you stuff like the top search phrases and RSS subscribers and whatnot. When I logged in this morning I noticed that a couple of the domains that I had previously verified by placing a specially named HTML file on the site were now unverified.

“Stupid Google,” I thought to myself. “Why would you forget that I had verified those things?” I was thinking that perhaps it was because I had removed the specially named files and Google wanted to be able to reverify every so often.

Turns out Google was in fact brighter than whoever wrote my WordPress theme.

Every time your web browser gets a page from a webserver there’s a number, called a status code, associated with that transaction that categorizes the type of response you are getting. If you try to access a page that doesn’t exist on my site you get a cute picture of a puppy. This is all fine and good except the status code that page sends is 200 (“OK”), which isn’t right. The reason Google unverified my site is that it realized my site reported every single page requested as being there rather than returning a 404 (“Not Found”) status for pages that don’t exist.

This means that if someone else tried to put my page into webmaster tools and verify it then whatever the special filename that Google gives them would also come back as being present (status 200).
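
I don’t know exactly how Google runs its check, but the idea can be sketched in a few lines of Python: ask the site for a page that almost certainly doesn’t exist and see whether it still answers 200. The function names and the random probe filename below are my own inventions for illustration. Only the `urllib` calls are real standard-library APIs.

```python
import urllib.error
import urllib.request
import uuid

def http_status(url):
    """Return the HTTP status code for url, including error codes like 404."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urlopen raises on 4xx/5xx; the code is on the error

def looks_like_soft_404(base_url, fetch_status=http_status):
    """True if a made-up page name comes back 200 instead of 404."""
    probe = f"{base_url.rstrip('/')}/{uuid.uuid4().hex}.html"
    return fetch_status(probe) == 200
```

Passing `fetch_status` in as a parameter makes it easy to try the logic with a stand-in function before pointing it at a live site.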

Smart as it is, Google made me fix this. So, I’m sorry Google for my lack of faith.


Signs When Driving

I was driving home from work on I-95, where traffic was about two car lengths apart and going around 70 mph. I was thinking about semiotics and my turn signal.

It’s a good example because the most surface interpretation of a turn signal is a blinking light. When I stick it on a car, and particularly on a certain side of the car it has additional levels of meaning from that context.

Why do I signal?

  • It’s habit? I sometimes signal when I’m on a road by myself. I am conditioned to catch the turn signal with my hand as I turn the wheel.
  • It’s the rule? I’m more likely to signal if there’s a cop driving behind me.
  • It’s polite? I sometimes wave to drivers that let me into traffic and there’s an abstract sense of a social dynamic between me and the rest of the cars. If someone cuts me off there is a sense of a breach of social contract. It can even change the dynamic and I’ll start driving more aggressively, treating the whole of traffic as an organism to be responded to.

Read the rest of this entry »


Semiotics 101

I’ve been considering tags and what exactly it means when I “tag” a song. It has a different meaning than rating. I think I have an idea of how to design a collaborative filter using tags, but I lack the vocabulary to really work out the idea.

I think the terms I need exist in the field of semiotics. This post is to define them so I can use them. To be precise, this is a combination of selected definitions with some additional interpretation. Semiotics is large, complex and controversial, and this is in no way authoritative.


Semiotics attempts to define a terminology to take the complex inferences underlying interactions and make them explicit. The primary thesis is that interactions are significantly more complex than they seem at first glance, and as a result semiotic writings frequently end up taking something seemingly simple and describing it in excruciating detail.

The basic building block of semiotics is the “sign:”

Read the rest of this entry »

Comments (1)

Morality and Food Production

Matt has a post on issues with food production.

I don’t really have time for a detailed response, but would like to mention a sermon I heard Lori Kenschaft deliver at the local Unitarian Church entitled Life In Balance: The Spiritual Practice of Gardening.

I really liked it as a mature perspective on food production. I particularly enjoyed her blending of spirituality and science.

For most of the history of life, nitrogen was one of the things that most limited how much life could exist. There was plenty of nitrogen in the air, but most living beings cannot use that nitrogen. Some bacteria, however, can take nitrogen from the air and bring it into their bodies, a process that is usually called “fixing nitrogen.” Until the 1940s most of the world’s biologically available nitrogen was fixed by those bacteria, and the rest of us depended on them for our lives.

During World War I, however, German scientists figured out how to fix nitrogen chemically, on an industrial scale. After World War II this technology was adapted to make fertilizer instead of munitions. Chemical fertilizers allowed a huge expansion in the world’s food supply and therefore the world’s human population. It is estimated that about half of the nitrogen that is now biologically available on this planet was fixed chemically, not by bacteria. And about 40% of the human population, more than two billion people, owe their existence to those chemical fertilizers.

The problem is that chemical fertilizers, and chemical herbicides and pesticides, kill many of the things that live in soil and form the community that makes soil healthy. Over time, therefore, farmers have to use more and more fertilizer to keep producing the same amount of food. If the soil becomes too damaged, crops fail no matter what.

Fixing nitrogen chemically requires a lot of energy. It requires heating the mixture to about 700° Fahrenheit at 200 times atmospheric pressure. So farmers nowadays often use more than a calorie’s worth of fossil fuel to produce a calorie’s worth of food. As the soil becomes more damaged, that ratio gets worse over time. This is, by the way, why corn ethanol is not an answer to our climate change problem.

There are several reasons why the global price of food commodities increased by more than 60% in the last year. But one of them is that the price of food is now closely linked to the price of oil. And the diversion of corn into making ethanol has helped create real problems for the 2½ billion people who live on less than two dollars a day.

Personally, I like the concept of community gardens both as a mechanism for creating interpersonal connections and to help address some of the issues surrounding food production.


Music Memes

I was riding home and two songs came on the radio, one right after the other: Katy Perry’s “I Kissed a Girl” and Usher’s “Love In This Club.”

Two popular songs focusing explicitly on anonymous sexual behaviors were particularly interesting because NPR ran a report that morning on this year’s CDC Youth Risk Behavior study, which is showing slight increases in sexual activity.

I think it would be interesting for some far off researcher to look at correlations between the popularity of themes in songs and the studies tracking those behaviors for populations. I’m particularly interested in subpopulations that are especially attuned to cultural trends, maybe hipsters and teenagers, and how cultural concepts spread. If recommender systems get really good at what they are trying to do, they will hold potentially fascinating results for sociological researchers.

I was reading a dystopian future story where someone figures out a concrete epidemiology for memes and proceeds to rule the world by controlling what everyone believes. I suppose in some ways that’s what the Illuminatus Trilogy is about (which I am wading through currently).


Songs That Stick

One of the arguments put forth in the Herlocker survey is that if you ask a person to rate a song several times over the course of a few months, they are highly unlikely to give the same answer every time.

They describe this as a “natural variability” in human preference that perhaps represents a hard limit to how effective recommender systems can be.

While I do agree that preference is the product of a chaotic system influenced by variables, many of which are unavailable to the computer system, it is also true that much of the variability can be explained by simple, easy-to-capture information. For example, I sometimes listen to dub when I’m writing code, because I can ignore it, but it would be completely inappropriate for working out. If the computer knew the types of music I liked while coding it could do a better job of pulling stuff for that category.

François is working on stuff to address the problem by allowing an explicit weighting of tags. Paul mentioned automatic characterization of tags such that I might have a “coding music” tag that is recognized as being situational rather than genre or mood, and specifically that the computer will figure out the category of that tag rather than me specifying it.

Another tack that I think will ultimately be necessary is to model preference as a time series characteristic rather than something static. What I like today is simply not the same thing that I will like tomorrow. The plasticity of the mind is undeniable (though there is certainly a neophile / neophobe continuum along which most people lie).

I would not at all be surprised if there are characteristics common to songs that I have continued to like over the course of years and other characteristics common to songs that I liked for a bit but have fallen out of favor. The changes are not just noise, they are important predictive data.
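
As a very rough illustration of treating ratings as a time series, here is a sketch that weights each rating by how recent it is. The exponential decay and the 90 day half-life are assumptions I picked for the example, not something taken from the Herlocker survey, and a real model would want to look at the trend itself rather than just discounting old data.

```python
def decayed_preference(ratings, now, half_life_days=90.0):
    """Average of (day, rating) pairs, with older ratings counting for less.

    A rating loses half its weight every half_life_days (an assumed value).
    Returns None when there are no ratings at all.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for day, rating in ratings:
        age = now - day
        weight = 0.5 ** (age / half_life_days)
        weighted_sum += weight * rating
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight

# Hypothetical history: loved the song on day 0, lukewarm by day 90.
history = [(0, 5.0), (90, 1.0)]
current = decayed_preference(history, now=90)
```

With this history the plain average would be 3.0, while the decayed estimate lands closer to the more recent rating.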

Maybe I’ll tackle that. Right after I manage to write a baby collaborative filter. Who am I to let a complete lack of knowledge prevent me from doing something? ☺


What’s Going On At Sun

Hoare’s Dictum: Premature optimization is the root of all evil.

— C.A.R. Hoare

I’ve spent the last couple of days reading about collaborative filters again, after getting sidetracked for a bit trying to load some data into the distributed data store. In particular, Herlocker’s Evaluating Collaborative Filtering Recommender Systems is an excellent summary of the issues surrounding writing a collaborative filter.

One of the things I’ve been realizing is I simply don’t know enough about the field to design a general purpose framework. I’ve looked at the structure of systems like Taste, Cofi and CoFE, but I don’t really have the background other than in a broad systems design sense to evaluate what they’ve done.

The reason I’ve been thinking about Hoare’s Dictum is the ideal computer program would run instantaneously using no resources and do everything. Premature optimization is frequently discussed only in terms of execution time or resource allocation, but if I attempt to create a general model for collaborative filtering before I really understand the field I am optimizing the axis of flexibility before I really have the conceptual background to do so.

I’m a programmer. The way I get a conceptual background in something is to write a program. I’m not going to say I’m going to write a throwaway program (since people debate that most throwaway programs aren’t [thrown away]), but secretly that’s what I want.

Because I’m not shooting for a general framework but rather a specific program for the purpose of learning, I can set a specific, demonstrable goal, which is much more manageable from a research perspective. So, what is a good program for a collaborative filter?

I figure a good choice would be something that is already being done by Project Aura so that we can compare their text mining techniques with a collaborative filter. So, what is Project Aura doing? Here’s what I know of from my month (has it been a month?!?) here:

  • Document Similarity Based Recommendations:
    • Tagomendations — Finding artists that are similar to each other based on the tags provided by last.fm. The tags are “cleaned” prior to clustering so that distinctive tags will be more influential.
    • User-Based Recommendations — Instead of using the tags as the criteria for determining if artists are similar, use the listeners who enjoy a particular artist.
  • Aardvark — Generating blog recommendations based on an RSS feed of entries. The RSS feed is generally generated from Google Reader’s shared items.
  • Tastebroker:
  • GUI Visualizations — Populating 3D interfaces with similarities of both blogs and music using dimensionality reductions and color and size to represent some of the characteristics.

It seems like a good initial project is simply to take a last.fm user profile and recommend an artist to them based on the tag space. Been done before, but that’s fine for a learning project.
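
Here is a minimal sketch of what that learning project might look like. It is not how Project Aura does it and it doesn’t touch the last.fm API. The artist names and tags are placeholders, the data is a plain dictionary, and I’m standing in for the tag “cleaning” with a simple inverse document frequency weight so that rarer tags count for more.

```python
import math
from collections import Counter

def idf_weights(artist_tags):
    """Weight each tag by log(N / number of artists carrying it)."""
    n = len(artist_tags)
    doc_freq = Counter(tag for tags in artist_tags.values() for tag in set(tags))
    return {tag: math.log(n / count) for tag, count in doc_freq.items()}

def recommend(user_artists, artist_tags):
    """Return the unheard artist whose tags best match the user's, or None."""
    weights = idf_weights(artist_tags)
    # Build the user's tag profile from the artists they already listen to.
    profile = Counter()
    for artist in user_artists:
        for tag in set(artist_tags.get(artist, ())):
            profile[tag] += weights[tag]
    best, best_score = None, 0.0
    for artist, tags in artist_tags.items():
        if artist in user_artists:
            continue
        score = sum(profile[tag] for tag in set(tags))
        if score > best_score:
            best, best_score = artist, score
    return best

# Placeholder catalog: artist -> tags.
catalog = {
    "Artist A": ["dub", "reggae"],
    "Artist B": ["dub", "electronic"],
    "Artist C": ["metal", "rock"],
    "Artist D": ["rock", "pop"],
}
suggestion = recommend({"Artist A"}, catalog)
```

A tag that every artist carries gets a weight of zero here, which is a crude version of letting distinctive tags be more influential.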

Comments (1)

Scheduling Productivity

This last week has been less than stellar for me productivity-wise. The issues have been, to some extent, systemic.

One of my biggest problems has been a simple one of biology. I do pretty well in the morning. Getting settled in, catching up on e-mail, doing some coding… I cruise along until around lunchtime when I hit a lull and, if I’m lucky, end up in a stupor staring off into the space about two feet behind my monitor. Equally unproductive, but slightly less discreet, is lolling back in my chair, snoring slightly and drooling on myself.

I’ve tried various methods for combating this phenomenon. It isn’t just that I’m worried someone is going to catch me, it’s how pointless it is. If I’m going to be productive, I want to be productive. If I’m going to rest, I want to rest. What I’m doing with this half-breed amalgam is the worst of both worlds: being unproductive in a really uncomfortable way.

I thought for a while that it might be the act of eating. Maybe energy necessary for running my brain was being redirected to my stomach, so if I reduce resources going to the stomach, I can keep the brain going stronger. This line of reasoning led to the not terribly successful experiments in boosting energy levels by not eating.

I did have some luck with the grazing pattern where I make a sandwich and eat it a bite at a time over the course of five or six hours. (A really good diet strategy, fyi. It significantly reduced my overall caloric intake.) Part of the reason I’m here at Sun for the summer is the people I’m around. When I go down to lunch I get to hear interesting people espouse unique ideas, and I think it might seem a bit odd if I just came to lunch and took two bites out of my sandwich in half an hour.

Grazing wasn’t a complete solution in any case. The central issue is thinking is taxing. If I was loading hay all day, I wouldn’t try to come up with some magical plan whereby at the end of the day I’m not tired. Because the work of an engineer goes on inside our heads, we are more apt to assume that we can simply change the ramifications of doing it with moral resolve. Just because the action isn’t visible however doesn’t mean it is less real.

The solution I’ve been relying on for the last week has been a tasty one: chocolate covered espresso beans. Coffee might taste like dirty water, but the magical font of Goodness that is chocolate manages to make it delicious. Honestly, I think if I stuck with it for long enough I could condition myself to enjoy the taste of coffee (much like I now enjoy beer, which pretty much every first-time drinker agrees tastes like horse pee).

The problem isn’t really solved though. I do manage to stay conscious through the afternoon, but my focus is sharp right after a shot of caffeine and sugar, and drops off again with increasing rapidity. The big problem is there isn’t such a thing as a free lunch — three nights this week I got home around 6:00 and was asleep by 7:00 only to wake up at 2am unable to get back to sleep. (And being up from 2-5am leaves one with the reactive efficiency of roadkill the following day.)

I suppose I could view making it through the work day as a success and say that it is unprofessional of me to consider my discomfort at home when structuring my schedule, but I’m pretty sure that is the express train to getting your soul sucked out. (Something that would ultimately not only be bad for me, but for Sun as well.)

I figure though that my specialty is systemization, and if there is a solution to be had, I can find it. I’ve got a more formal set of ideas on the process for doing that, but for the sake of brevity I’ll not go into all that. I’ll just mention that the plan for the next week is to work 7am-2pm, go home, probably nap and then work another hour or so in the evening. I’ve not got the criteria yet for doing a more formal evaluation, but I figure I’ll at least get a sense of how it leaves me feeling.

Comments (2)
