Switching Development Paths

This is a note primarily for my own reference. I’ve been screwing around with code for a week, and this is just a record of my thinking, should I wonder in the future about the clarity of my reasoning ☺.

lotus

The base project is trying to move toward some form of a GGG (Giant Global Graph), specifically a P2P one where each user writes to their own namespace.

The ecology of information can be divided into three layers:

  1. Documents — Data that has been combined with context and is expected to be consumed directly. This could be anything from books to webpages to graphs.
  2. Transformations & Aggregations — Information about relationships of documents to atomic data. This is programmatic logic of various sorts.
  3. Atomic Data — Raw information, be it text, images or structured data.

The schema would probably be better represented by chaining holons and transformations, but I keep getting bogged down trying to draw a picture of that.

I want to try to capture those transformations and relationships in a system that creates links rather than copies. In essence there is one big graph and the user connects to it in a variety of ways. When they add information, it propagates through a pub/sub model.

The graph is different from the XML model. XML has two types of relationships to a node: attributes, which are a hashtable local to the node, and children. I was moving toward a graph closer to the filesystem model of just named links to blobs.
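As a rough sketch of the "named links to blobs" idea (the Node class and its method names are hypothetical, made up for illustration, and not code from the project):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A graph node: an optional blob plus named links to other nodes."""
    blob: Optional[bytes] = None
    links: Dict[str, "Node"] = field(default_factory=dict)

    def link(self, name: str, target: "Node") -> "Node":
        """Attach `target` under `name` and return it for chaining."""
        self.links[name] = target
        return target

    def resolve(self, path: str) -> "Node":
        """Follow a slash-separated path of link names, like a file path."""
        node = self
        for name in filter(None, path.split("/")):
            node = node.links[name]
        return node


root = Node()
posts = root.link("posts", Node())
hello = posts.link("hello", Node(blob=b"Hello, graph"))
# A second name pointing at the same node: a link, not a copy.
root.link("latest", hello)
```

Unlike the XML model, there is only one kind of relationship here, and the same node can be reached by more than one name.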

All that aside, I’m not working on any of the distributed graph stuff at this point. I’m just trying to make a simple application to generate content from a collaboratively maintained graph. Each person’s edits are stored in their own section of the graph and content generation happens by generating parse events from a token bouncing around the metagraph.
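To make that concrete, here is one way it might look. This is only a guess at the shape of the idea: the per-user dictionary layout, the rule that the first section defining a node wins, and the event names are all assumptions for illustration, and the sketch assumes the graph has no cycles.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical layout: each user's section maps a node name to that node's
# ordered children. A child that no section defines is treated as text.
Section = Dict[str, List[str]]


def lookup(sections: List[Section], name: str) -> Optional[List[str]]:
    """Return the children of `name` from the first section defining it."""
    for section in sections:
        if name in section:
            return section[name]
    return None


def walk(sections: List[Section], name: str) -> Iterator[Tuple[str, str]]:
    """Move a token depth-first from `name`, yielding SAX-like events."""
    children = lookup(sections, name)
    if children is None:
        yield ("text", name)
        return
    yield ("start", name)
    for child in children:
        yield from walk(sections, child)
    yield ("end", name)


alice = {"page": ["title", "body"], "title": ["Switching Paths"]}
bob = {"body": ["Draft text"], "title": ["Bob's title"]}
events = list(walk([alice, bob], "page"))
```

Here the title comes from the first section and the body from the second, and nobody's edits overwrite anyone else's.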

The problem has been eventual deployment. I would like to try this out as a publishing platform, but that means making it available on the web somehow. I’ve been interested in Cocoon’s processing model for a while, but it’s written in Java, which my webhost doesn’t support.

To avoid having to migrate things or open a new account, I was hoping to find an alternative. I noticed that Google’s App Engine recently added Java support and is free for under 5 million views a month (which I don’t expect to pass). Unfortunately, App Engine doesn’t allow local storage, and Cocoon seems (from a cursory look at the source) to rely on a disk cache heavily enough that switching the backend would be prohibitive.

The GAE datastore is a type of non-relational store that looked promising for what I wanted to do. So I wrote a test program that reads a file into the datastore using an lxml parser and then reads it back out.

All App Engine processes have to complete in under 9 seconds, and my app chokes on even a small file. I’ve poked at it some, and I’m pretty sure it’s the datastore that’s slowing it down.

As a test, the program can use either a GAE-based model or a memory-based model. Both models retrieve a file and load it into a tree on the first run, then load a parsed form on subsequent runs (from the datastore or from a pickle, respectively).
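The memory-based side of that comparison could look roughly like the sketch below. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the function names, file names and dictionary layout are hypothetical rather than taken from the actual test program.

```python
import os
import pickle
import tempfile
import xml.etree.ElementTree as ET


def to_tree(elem):
    """Convert an element into plain dicts so the result pickles cleanly."""
    return {
        "tag": elem.tag,
        "attrs": dict(elem.attrib),
        "text": elem.text or "",
        "children": [to_tree(child) for child in elem],
    }


def load(xml_path, cache_path):
    """Return (tree, source): parse on the first run, unpickle afterwards."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f), "cache"
    tree = to_tree(ET.parse(xml_path).getroot())
    with open(cache_path, "wb") as f:
        pickle.dump(tree, f)
    return tree, "parsed"


# Write a tiny sample document to a scratch directory and load it twice.
workdir = tempfile.mkdtemp()
xml_path = os.path.join(workdir, "doc.xml")
cache_path = os.path.join(workdir, "doc.pickle")
with open(xml_path, "w") as f:
    f.write('<doc id="1"><p>hello</p></doc>')

first, first_source = load(xml_path, cache_path)
second, second_source = load(xml_path, cache_path)
```

The first call parses the file and writes the pickle; the second skips parsing entirely, which corresponds to the "Parsed/Out" case.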

The run times (according to time, using the dev server) are:

              App Engine   Memory
In/Out        7.5s         0.75s
Parsed/Out    1.5s         0.25s

I could likely parse and load the data separately and get the program to run, but for a proof-of-concept prototype there are a lot of potential hassles that I could run into.

That, and the App Engine datastore isn’t really a graph database. Well, it is in the sense that it is nodes and edges, but that’s pretty much any object model. What I would really like to experiment with is Neo4j, which has a promising-looking traversal mechanism. It means that I wouldn’t be able to deploy anything I write on my existing host, but since I plan on seeking employment once I finish this project, if it proves worthwhile, I can find a host.
