Monday, August 25, 2008

The Joys of DIY Dynamic Programming

For nearly the first decade of my bioinformatics career I carried around a dirty little secret -- well, at least at times I felt it was one. I had coded many things, I could explain many algorithms, but I had never coded a dynamic programming alignment algorithm -- the core to so much I did. I had slightly hacked one version (just to have it do an all-all comparison of a database, doing each possible pairing only once). Finally, for a bunch of reasons, I sat down and did it -- my very own Smith-Waterman implementation.

I'm reminded of this because a couple of weeks ago I rolled back my sleeves and knocked one out again. Now, just the fact I did this reveals a bit about me. I did find at least two freely available C# implementations on-line (e.g. the C# version of JAlign) and there is a plethora of C implementations. There is also Ewan Birney's magnificent Dynamite, pretty much the catch-all for the field (Dynamite is a programming kit for doing this; in effect a programming language for dynamic programming). But, partly as a point of pride & partly because I saw I'd need to hack the one C# copy I looked at in detail, I did it. I even wrote a schmancy version -- a simple cDNA to genomic sequence aligner with two classes of gaps (one being an intron, with a really trivial model of a splice junction -- I think it used dinucleotides) All coded in Perl -- no speed demon, but it solved the problem where we needed it.

Now, it took me a good few hours to do it -- better than the few days of the first time, but not instant. I can claim that this time I didn't fall back on any study aids, such as the many online descriptions or Eddy & Durbin's & co. very well written book.

The implementation says a lot about me too. I thought of many ways to code it and finally settled on one. For example, there is the question of how to represent the alignment matrix; I used a two-dimensional array scheme (actually implemented using dictionaries -- a holdover from my Perl-centric days) but I could have also made it a graph of nodes. There is also the actual thrashing through the matrix -- the algorithm is inherently recursive, but following familial idiosyncracies I wrote the code to use loops -- well, actually I completely waffled and implemented so it can use recursion, but actually loops through! The applications I'm considering are going to be short alignments, so I didn't worry about memory efficiency (who wants to be that will bite me back!) nor did I fixate on speed (care to double the bet?) -- indeed, I wrote it to allow all sorts of baroque variations, such as different penalties for opening gaps in the two different sequences & for basic profile-to-sequence alignments. Plus it is either Smith-Waterman (local) or Needleman-Wunsch-Sellers (global), with a simple toggle.

So now the pitch: If you are a bioinformatics programmer & you haven't written one, I urge you to do it. It's great practice & nothing illustrates an algorithm like trying to implement it. If you don't consider yourself a programmer, guess what? It's perhaps not the obviously easy first start, but just thinking about it will stretch your mind. Plus, you get a free bioinformatics Rorschach test from your implementation choices!

One last thought: who can think up (and execute) the most comically baroque -- but functional -- implementation of S-W/NWS? Has it already been done in PostScript? How about in a relational database (I've written some pretty baroque SQL this year, but I doubt I could tackle this)? S-W as an Excel spreadsheet? Coded with glider guns? A full description for a true Turing machine? Of course, the grand prize winner would clearly either be to build a DNA computer to compute an alignment -- but perhaps that could even be topped by implementing the algorithm with living cells as the alignment cells!

Sunday, August 24, 2008

One more Olympic thought

One other item that was in the mental draft of yesterday's Olympic pondering, but was inadvertantly dropped. Another possible genetically-driven edge in athletic performance would not be directly on performance but on the reaction to performance. Prime athletes might have different pain or endorphin responses, less post-exercise inflammation, different injury responses. Some of these might be specific to specific events or types of sports -- joint pounding running or gymnastics puts very different stresses on the body than something like swimming or speedskating.

Saturday, August 23, 2008

An Olympic Pondering Decathlon

Okay, my biannual stint of Olympic watching is about to conclude. A bunch of speculations suggested by this year's stretch, starting with the utterly unscientific and ending with more genomic oriented queries.

1) Having now watched two Olympics using a Digital Video Recorder, it's completely clear that having a few fast forward speeds is no way to navigate multi-hour recordings to find what you want. Surely there are better UIs for this! The thumbwheel on an iPod is one obvious choice, but there must be other ways.

2) The Summer games are blessed with multiple events which touch on multiple disciplines: the decathlon, heptathlon, modern pentathlon & triathlon. Why isn't there a true multi-discipline Winter sport? Biathlon is a glorious combination of two diametrically opposed skills -- racing and precision shooting -- and there is also the Nordic combined, but neither of these sample a wide range. How about this for a Winter hexathlon:
A) 500m long-track speedskating
B) 2500m long-track speedskating
C) Downhill skiing
D) Slalom
E) 10Km X-C skiing
F) ski jump (small hill)

3) I once contemplated attending a HUPO meeting in Beijing; atop an interesting program there was a post-conference trip option to tour the country. There were two issues: I'd have to foot my own travel expenses & I'd be in serious hot water at home for visiting the Wolong panda center solo. But, now I'd definitely go -- particularly if they replaced a typical dry scientific kickoff with another Zhang Yimou spectacular, I wouldn't hesitate!

4) As a kid I did occasionally have Olympic daydreams. I'm a bit over the hill now, but between athletes older than I such as Dara Torres and hearing that some countries will field just about anyone, perhaps I gave up too soon. If I could pick anything, it would be long track speedskating, the most graceful speed sport bar none. But, more realistically perhaps I could go for the 1500 meter freestyle swimming -- I'd estimate my time is off by only a factor of 3 -- perhaps with some regular training I can get that down to 2!

5) Of course, even with some extensive training I'd look pretty odd on the blocks -- I'm 5'8" and from what I can tell in the TV coverage, it's a rare swimmer who isn't a few inches over 6'. Clearly there are advantages to height in a large number of sports -- but I'm also clearly too tall for a shot at women's gymnastics (atop the other obvious issue). It would be interesting to see which sports have the highest and lowest dispersion in athlete height -- and what those patterns look like. What sport should I have chosen based only on my height?

6) During the Olympics, the world's tallest living woman died, after a life with many health difficulties. Clearly, there are limits to the advantages to height. What sport has the tallest athletes?

7) What more subtle anatomic characteristics might lead to athletic advantage? Differences in muscle fiber composition are an oft-cited one. A TV profile of superswimmer Michael Phelps claimed he is 'double jointed'. But what else. For example, are there subtle differences in some individual's lungs which lead to more efficient air exchange? Smoother surfaces on bone joints?

8) Diving lower, are there biochemical differences? Again, could there be differences in oxygen transfer or usage? Differences in energy metabolism?

9) If we did genome screens of the athletes, what SNPs would we find over-represented? How many of those would be 'obvious' and how many would lead to new genes which influence performance? Already there is at least one company offering genome scans to predict what sport you should stuff (er, steer) your kid into.

10) The sad story of Flo Hyman illustrates another aspect of selection for unusual body types: she died of a aortic dissection due to Marfan's syndrome, which probably also led to her tall, thin stature which was an advantage on the volleyball court. What other genetic variants have a dicey risk/reward trade-off in the athletic arena? And how many of these are a serious medical issue for regular folks?

Friday, August 22, 2008

Any leads on when pandas join the genome club?

Tonight's bedtime conversation veered all over the map (par for the course), but at one point touched on the announced Chinese effort to sequence the genome of the giant panda. Of course, a key concern was that this did not involve any pain or injury to the beloved bicolors, so the concept of buccal swabbing was introduced.

This project was announced last spring. I thought it was planned to be released during the Beijing Olympics, but that is apparently an invention of my imagination.

So, anyone out there in the know care to hint or leak? When will the first ursid genome arrive? And who was the lucky bear?

Wednesday, August 13, 2008

Larry Ellison, please join the 21st century!

I've sniped at Microsoft at least once, so in the interest of balance I'll take a crack at the other software giant I rely on but also frequently complain about: Oracle.

Oracle is truly amazing. Now, I don't have much experience with other relational database systems, so this isn't comparative. But relational databases are amazing. I give it a query of what I want and if I cross all my t's and dot all my i's, then huge databases are searched rapidly (often a matter of seconds).

My first complaint is with inconsistency in syntax. Oracle has several flavors of text types depending on how big you might let your text get. I mostly query databases, not create them, and so I generally want to treat them all the same. Now there might be some good reason I need to use a different function to get the substring from each type, but I really don't want that hassle. But if I'm stuck with it, why couldn't you keep the argument orders the same? Standard substring, like every substring method I've ever met, has the order: string, start, length. But for the really big text columns ("CLOB"s), it's string, length, start. WHY???

But worse, is when I'm having trouble dotting those i's and crossing the t's, Oracle really doesn't give much help. The error messages are somewhere out of the 60's.

For example, one handy feature of Perl (and other environments) is some attempt to identify common pitfalls and give hints about them in the error messages. A common mistake for me is to include an extra , in my query

select x,y,z,
from mytable

In this example, of course, it's small -- but many of my queries are 30-40 lines long. Surely it could detect that the unrecognized field name is a reserved word and therefore hint that I've included an extra comma.

Another example. For one query I have I've been parsing out a numeric string and then trying to convert it to a number. Alas, somehow my parse is failing and I'm getting some unconvertable strings back. Oracle gives me an error that it can't convert something to a number -- but keeps that something a secret from me!

I could go on-and-on. Line numbers for the error are frequently non-helpful, the error messages don't give the context of the offending bit, etc, etc.

The one thing I haven't tried is to edit my queries in Visual Studio, which has an SQL mode. I really should try that -- not that VS's error messages are always golden, but it is good about highlighting the likely neighborhood of mistakes in a way SQL Developer (the Oracle interface I use) just doesn't even attempt

Ah well, I'll live. Larry probably has bigger fish to fry. Personally, though, if it was my software I'd be cringing.