Friday, December 22, 2006

My Year's End, My Millennium's End

Tonight will be my last post for 2006. The holiday week is a good time to clear one's head and focus on family and fun. Thank you all for taking a gander at this new blog -- it's been fun & I look forward to continuing. I have a bunch of topics scribbled down & a few dozen bookmarks of interesting papers.

Today was also my final day at Millennium. I am on the payroll through the end of the year, will receive severance for a while next year, and am bound by various confidentiality agreements to my grave. But for me, this was the end. I turned in all my trappings -- laptop, badge, security token -- sent a farewell message, and took my last box out.

For ten years, Millennium has been a constant in my life, yet nothing has been constant at Millennium. Eight desks, six bosses (one twice), uncountable reorganizations (and bosses' bosses), a multitude of department/group names -- I once joked that the company portal should have a banner "Today is Monday & you report to XXX in department YYY" so you could keep track of it. Two CEOs, five CFOs (I think), three big mergers (defined as changing my daily life!), four big 'restructurings' -- one could invent dozens more measures for how MLNM exemplified creative destruction.

As one might expect, that meant a lot of different people. By my guess, somewhere between 5-10% of the people who worked for Millennium the day I started will still be there after I leave. One back-of-the-envelope estimate is about 3K people who can claim to be Millennium alumni, with perhaps close to 2K who were in Discovery at some point.

It's been an amazing experience & quite an education. I will miss it, and it seems a lot to hope that the next career waystation can match that. Fare-thee-well Millennium.

Thursday, December 21, 2006

Nature's The Year In Pictures

Great science does not require great photography, but it never hurts. Nature has a very nice collection of scientific images from the year (free!). Only one is directly relevant to the usual topics here, an image of a microfluidic DNA sequencing chip, but they are all striking. Enjoy!

Wednesday, December 20, 2006

Blast from the past

In my senior year at Delaware I took a course in molecular evolution from a wonderful teacher, Hal Brown. Hal was the first (I'm pretty sure) to suggest that the resemblance of many enzyme cofactors to RNA was a glimpse of an earlier RNA-dominated biochemistry. Another interesting brush with history is that when at Berkeley he got Dobzhansky's desk. He's a great guy & was a wonderful professor.

I had Hal's class fall semester that year, and still was planning to continue my undergraduate research line of molecular biology in a plant system; my overnight mental conversion to a computational genomicist wouldn't occur until Christmas break. So I picked a term paper topic that sounded interesting (and was!), but in retrospect was a glimpse at my future.

Very few eukaryotes have a single genome, as most have membrane-bounded organelles with their own genomes. For mammals, these are the mitochondria, and in plants and algae there are both mitochondria and chloroplasts. The fact that chloroplasts and mitochondria had their own genetics, with quirks such as uniparental inheritance, had been known since the early 60's, but their origin had been hotly disputed. The original theory that they had somehow blebbed off from the nuclear genome had been challenged by Lynn Margulis's radical notion of organelles as captured endosymbionts. By the time I wrote my paper, Margulis's thesis had pretty much won out. But it still made a fascinating paper topic, especially when thinking about genomes.

The most fascinating thing about these organelle genomes is not that they have shed many genes they no longer need (those functions are provided by the nucleus), but that so many genes for their maintenance & operation are now found in the nucleus. Over evolutionary time, genes have somehow migrated from one genome to another. For metazoan mitochondria, the effect is striking: only a tiny number of genes remain (human mitochondrial DNA is <20kb). At the time, the few database entries >100kb were either viruses or complete plant chloroplast genomes, so there was some real data to ponder -- my first flicker of genomics interest.

One strong argument for the endosymbiont hypothesis is an unusual alga with the appropriate name Cyanophora paradoxa, which has chloroplast-like structures called cyanelles -- but the cyanelles retain rudimentary peptidoglycan walls -- just like the cyanobacteria postulated to be the predecessors of chloroplasts. Another strong argument is that for a set of enzymes found both in organelle and cytoplasm, in most cases the organellar isozyme treed with bacterial enzymes (and in the right group: proteobacteria for mitochondrial enzymes, cyanobacteria for chloroplast ones). Even the exceptions are interesting, such as a few known examples where alternative transcripts can generate the appropriate signals to lead to cytoplasmic or organellar targeting.

I hadn't really kept up closely with the field after that (not for lack of interest: one graduate posting I considered was on a Cyanophora sequencing project at Penn State). So it was neat when I spotted a mini-review in Current Biology on the current state of things. What was new to me is that the Arabidopsis sequencing effort had revealed that nuclear-encoded proteins of chloroplastic origin covered a wide spectrum of metabolism and not just chloroplast-specific functions. What is interesting in the newer work is that Cyanophora does not share this pattern: here nuclear-encoded genes of chloroplast origin are strongly restricted to functioning in the chloroplast.

Things get even weirder in some other unicellular creatures, which executed secondary captures: they captured as endosymbionts eukaryotes which had already captured endosymbionts.

The same issue contains some back-and-forth arguing over the distinction between organelles and endosymbionts, which I don't care to take a stand on, but the exchange illustrates another case I wasn't very familiar with: a sponge which has apparently taken in an algal boarder. If we can figure the mechanisms out & replicate them, the applications might enable some people to truly be 'in the green' and 'looking green around the gills' will take on a whole new meaning!

Tuesday, December 19, 2006

Next-Gen Sequencing Blips

Two items on next generation sequencing that caught my eye.

First, another company has thrown its hat in the next generation ring: Intelligent Bio-Systems. As detailed in GenomeWeb, it's located somewhere here in the Boston area & is licensing technology from Columbia.

The Columbia group last week published a proof-of-concept paper in PNAS (open access option, so free for all!). The technology involves using reversible terminators -- the labeled terminator blocks further extension, but then can be converted into a non-terminator. Such a concept has been around a long time (I'm pretty sure I heard people floating it in the early '90s) & apparently is close to what Solexa is working on, though Solexa (soon to be Illumina) hasn't published their tech. One proposed advantage is that reversible terminators shouldn't have problems with homopolymers (e.g. CCCCCC) whereas methods such as pyrosequencing may -- and the paper contains a figure showing the contrast in traces from pyrosequencing and their method. The company is also claiming they can have a much faster cycle time than other methods. It will be interesting to see if this holds up.
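
As an aside, here is a toy sketch (mine, not from the paper) of why homopolymers trip up flow-based chemistries: in an idealized pyrosequencing run, the signal for each flow scales with the length of the homopolymer just incorporated, so telling seven C's from eight means resolving a small relative intensity difference, whereas a reversible terminator reads exactly one base per cycle.

```python
# Toy illustration (not from the paper): in pyrosequencing the signal per
# flow is ~proportional to homopolymer length, so long runs blur together;
# a reversible terminator never faces this, reading one base per cycle.

def pyrosequencing_flows(template, flow_order="TACG", n_flows=20):
    """Return (base, signal) per flow for an idealized, noise-free run."""
    flows = []
    pos = 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(template) and template[pos] == base:
            run += 1
            pos += 1
        flows.append((base, run))  # signal ~ number of bases incorporated
    return flows

if __name__ == "__main__":
    # the six C's show up as a single flow with 6x intensity
    for base, signal in pyrosequencing_flows("TTACCCCCCGA"):
        print(base, signal)
```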

Given the very short reads of many of these technologies, everyone knows they won't work on repeats, right? It's nice to see someone choosing to ignore the conventional wisdom. Granger Sutton, who spearheaded TIGR's & then Celera's assembly efforts, has a paper in Bioinformatics describing an assembler using suffix trees which attempts to assemble the repeats anyway while assuming no errors -- but with a high degree of oversampling that may not be a bad assumption. They report significant success:

We ran the algorithm on simulated error-free 25-mers from the bacteriophage PhiX174 (Sanger, et al., 1978), coronavirus SARS TOR2 (Marra, et al., 2003), bacteria Haemophilus influenzae (Fleischmann, et al., 1995) genomes and on 40 million 25-mers from the whole-genome shotgun (WGS) sequence data from the Sargasso sea metagenomics project (Venter, et al., 2004). Our results indicate that SSAKE could be used for complete assembly of sequencing targets that are 30 kbp in length (eg. viral targets) and to cluster millions of identical short sequences from a complex microbial community.
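
For flavor, here is a minimal sketch of the underlying idea (my own toy, not the published SSAKE implementation, which indexes read prefixes in a tree structure for speed): with error-free reads and deep oversampling, a contig can be grown by repeatedly finding a read whose prefix exactly matches the current 3' end, extending only when the choice is unambiguous.

```python
# A minimal sketch of greedy exact-overlap extension on error-free reads.
# Not the published algorithm -- just the core idea that with perfect
# k-mers and deep coverage, unambiguous extension gets surprisingly far.

from collections import defaultdict

def assemble(reads, k=25, min_overlap=16):
    prefix_index = defaultdict(set)              # prefix -> reads with that prefix
    for r in reads:
        for olen in range(min_overlap, k):
            prefix_index[r[:olen]].add(r)

    contig = reads[0]
    used = {reads[0]}
    while True:
        extended = False
        for olen in range(k - 1, min_overlap - 1, -1):   # prefer longest overlap
            candidates = prefix_index[contig[-olen:]] - used
            if len(candidates) == 1:                     # extend only if unambiguous
                (nxt,) = candidates
                contig += nxt[olen:]
                used.add(nxt)
                extended = True
                break
        if not extended:
            return contig

# toy usage: shred a made-up sequence into overlapping 25-mers and reassemble
genome = "ACGTACGGTTCAGGCATTACGGATCCATGCAAGTCCGATTACGGCTAAGT"
reads = [genome[i:i + 25] for i in range(len(genome) - 24)]
print(assemble(reads) == genome)   # True
```

Repeats, of course, are where this gets hard: the moment two distinct reads share the same overlap, the simple greedy extension above has to stop.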

Monday, December 18, 2006

Breast Cancer Genomics

This month's Cancer Cell has a pair of papers (from the same group), plus a minireview, on breast cancer genomics.

One paper focuses on comparing 51 breast cancer cell lines to 145 breast cancer samples, using a combination of array CGH and mRNA profiling. The general notion is to identify which cell lines resemble which subsets of the actual breast cancer world. Cell lines long propagated in vitro are likely (almost assured) to have undergone evolution in the lab; this means they are not the perfect proxies for studying the disease. Array CGH is a technique for examining DNA copy number changes, which are rampant in many cancers. Its use has exploded over the last few years, with a number of interesting discoveries. It is also a useful way to fingerprint cell lines; at least one cell line was described recently as an imposter (wrong tissue type), but I can't find the paper because of the huge flood of papers a query for 'array CGH' brings up.

The second paper looks at a set of clinical samples from early breast cancer, and again uses both transcriptional profiling and aCGH. I need to really dig into this paper, but the abstract has some interesting tidbits (CNAs = copy number aberrations) -- emphasis my own

It shows that the recurrent CNAs differ between tumor subtypes defined by expression pattern and that stratification of patients according to outcome can be improved by measuring both expression and copy number, especially high-level amplification. Sixty-six genes deregulated by the high-level amplifications are potential therapeutic targets.
The mini-review does highlight a key point: as impressive as this study is, no study can ever hope to be the final word. As new omics tools are developed, new studies will be desirable. Two obvious examples here: running intensive proteomics and looking in depth at alternative transcripts.

Friday, December 15, 2006

Weird memories from cleaning up

One week left. Time to get serious about the lack of time. One week.

I am a terrible pack rat. I periodically attempt to organize things into folders, but for the most part I use the geologic filing method -- that stratum is roughly November, below that October, below that September (earthquakes & uplifting occur frequently!).

Occasionally my supervisors would crack down (most notably prior to the FDA swinging through the labs one time), but in general there was a better trigger: moving. I was pretty good about lightening up prior to each move. One office lasted 5 years, so there was quite a lot of overburden to deal with that time, but I had been in the office I occupied just before the layoffs only 2 years, and we had just moved in the spring. Even at my worst, that's not much time to lay down a mountain. The planners threw in one more twist by moving me after the layoffs -- but then again, I was on extended time and they hadn't planned on me being there at all.

However, there was still a lot to go through, with several major categories:

  1. Paper for recycling

  2. Confidential material to shred

  3. Items to throw out

  4. Items to forward within Millennium or return

  5. Items to bring home or to next position



We have these big shredder bins which collect stuff for an outside vendor to shred in big trucks -- this is a huge improvement over office shredders, as I always spent more time unjamming them than shredding. It's not efficient to run to the bin each time, so I had a paper grocery bag for batching things. This worked very well -- my four-foot tall unpaid consultant gleefully fed the bins one weekend while I went through papers.

Due to the shortcomings of my system, I came across all sorts of obsolete things. Will I ever again need a serial-to-USB converter? Vendor catalog CD-ROMs from 3 years ago?

On the other hand, some things are really valuable, such as address lists from recent meetings. Others are things I collect too much of but which are useful: papers that I might want to comment on in this space, old papers I consider really interesting and might refer to.

And, of course, lots goes to recycling or shredding: papers relevant to projects, sequence alignments, snippets of code, etc.

One of the more interesting mixed bags is the business cards, and that's also where I got a strange trip down memory lane. I found some recent ones I thought I had lost, which would have been good contacts to have in my job search (aargh!). Others I couldn't remember at all -- I really should put some context on the back. And finally, I found one from early in my career that I remember vividly.

We had a group (MBio) trying to find the next Epogen and I was the main bioinformatics scientist attached to the group. They were constantly growing & constantly recruiting. I was going to the Hilton Head conference, and the MBio research chief wanted me to screen a candidate: simple enough.

We set up a meeting in one of the hotel bars. The conversation was pleasant, but neither of us seemed to have a strong reaction either way. He wasn't sure he wanted to leave his existing position or take this new one. I reported back to base the equivocal meeting, and moved on.

So it was stunning to see a news item several years later that the same person I had interviewed was the perpetrator of a murder-suicide. I think I saw it on GenomeWeb, but they don't seem to archive very well. I found (via Wikipedia) another item, which adds a truly surreal note about what happened to the pizzas used to lure the victim from her home (you have to read it to believe it).

You meet a lot of unusual people in science, but perhaps you never know -- and never want to -- who are truly outside the norm.

Thursday, December 14, 2006

Red Alert Mr. Pseudomonas!

I finally decided that four weeks of laryngitis was perhaps too long and got myself in to the nurse practitioner, who obliged me with an antibiotic script. Our bar for using antibiotics has historically been too low, but perhaps I overshot in the other direction.

Or maybe not. A recent paper in PNAS presents the provocative thesis that low doses of antibiotics can stimulate nasty traits in pathogenic bacteria. Using a microarray and low doses of three structurally unrelated antibiotics, they detected switching on of a number of unpleasant genetic programs.

All three antibiotics induce biofilm formation; tobramycin increases bacterial motility, and tetracycline triggers expression of P. aeruginosa type III secretion system and consequently bacterial cytotoxicity. Besides their relevance in the infection process, those determinants are relevant for the ecological behavior of this bacterial species in natural, nonclinical environments, either by favoring colonization of surfaces (biofilm, motility) or for fighting against eukaryotic predators (cytotoxicity)


The authors go on to suggest that antibiotics may be important signalling molecules in natural communities. This is in contrast to the older model of antibiotics as weapons in microbial battles for dominance. It is a provocative thesis worth watching for stronger evidence. In my mind, their data still fits the weapons model -- what they see is the same sort of signalling as my blood in the water signalling a shark. Or, a bacterial Captain Kirk detecting an unseen ship raising its shields and assuming a defensive posture.

Wednesday, December 13, 2006

Grading the graders

I picked up a copy of The Economist last week, as is my habit when flying, and it happened to have a quarterly review of technology. There is a quite accurate story on microarrays that does a good job of explaining the technology for non-scientists.

One of the bits of the microarray story I had forgotten is retold there: how both the Affymetrix and Stanford groups pioneering microarrays had grant proposals which received truly dismal priority scores.

For those readers not steeped in academic science, when you ask various funding sources for money, your proposal is put together with a bunch of related proposals. A group of volunteers, called a study section, review the proposals and rate them. The best are given numeric scores and these scores are used to decide which proposals will receive funding. Study sections also have some power to suggest changes to grants -- i.e. cuts -- and to make written critiques. Ideally, these are constructive in nature but such niceties are not always observed.

A commonly heard complaint is that daring grant proposals are not funded. Judah Folkman apparently has an entire office wallpapered with grant rejections for his proposal of soluble pro- and anti-angiogenic factors. Robert Langer apparently has a similar collection trashing his ideas for novel drug delivery methods, such as drug-releasing wafers to be embedded in brain tumors. Of course, both of these concepts have now been clinically validated so they can gleefully recount these tales (I heard them at a Millennium outside speaker series I will dearly miss).

I've participated once (this summer) in a grant review study section and would love to comment on it -- but by rule what happens in Gaithersburg stays in Gaithersburg. There are good reasons for such secrecy, but it is definitely a double-edged sword. It has the potential to encourage both candor and back-stabbing. It certainly prevents any sort of systematic review of how study sections function and dysfunction.

What I think is a serious issue is that such grant review processes have little or no mechanism for selecting good judges and avoiding poor ones. Reviewers who torpedo daring good proposals have no sanction and those who champion heterodoxy no bonus. It isn't obvious how you could do this, so I do not propose a solution, but I wish somehow it could work.

One wonders whether the persons who passed over microarrays regret their decisions or stand by them (and what got funded instead?). Do they even remember their role in retarding these technologies? If you could ask them now, would they say "Boy did I blow it!" or "Microarrays? Why ask about that passing fad?"

Wednesday, December 06, 2006

Image Enhancers!

That must have been the command that went out at the Joint Genome Institute, as they have a nice paper in Nature from a few weeks back showing how sequence conservation can be used to find enhancer elements.

They started with non-coding sequence elements that are either ultraconserved between mammalian species or showing conservation in Fugu. These elements were placed in a vector with a naked promoter driving a reporter gene (lacZ) & microinjected into mouse eggs. Embryos were then stained at day 11.5.

Greater than 50% of the ultraconserved elements drove expression of the reporter gene, and further conservation in Fugu did not improve the finding of enhancers. But more than 1/4 of the Fugu-conserved sequences lacking mammalian ultra-conservation functioned as enhancers. I do wish they had put in the frequency with which random mammalian fragments will score positive in this assay; surely that sort of negative control data is out there somewhere. It's probably a very small fraction.

The articles, alas, require a Nature subscription -- but you can also browse the data at http://enhancer.lbl.gov.

One of their figures shows a variety of staining patterns driven from elements pulled from near the SALL1 gene. Various elements show very different staining patterns.

They also use 4 enhancers driving forebrain expression to find motifs, and in turn use those motifs to search the Fugu-Human element set. 17% of the hits act as forebrain-specific enhancers, whereas only 5% of the tested elements overall are forebrain enhancers. So even with a very small training set they were able to significantly enrich for the target expression pattern.
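
A back-of-the-envelope way to see that this is a real enrichment is a one-sided Fisher exact test on the two proportions; the counts below are hypothetical, since the post doesn't give the actual numbers of elements tested.

```python
# Back-of-the-envelope check of the enrichment (hypothetical counts --
# the actual numbers of elements tested aren't quoted above).  Suppose 100
# motif-matching elements were tested and 17 drove forebrain expression,
# versus 5 of 100 elements tested without motif selection.

from scipy.stats import fisher_exact

motif_pos, motif_total = 17, 100              # assumed, for illustration only
background_pos, background_total = 5, 100     # assumed, for illustration only

table = [[motif_pos, motif_total - motif_pos],
         [background_pos, background_total - background_pos]]
odds_ratio, p_value = fisher_exact(table, alternative="greater")

enrichment = (motif_pos / motif_total) / (background_pos / background_total)
print(f"{enrichment:.1f}-fold enrichment, one-sided Fisher p = {p_value:.3g}")
```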

It will be interesting to see this approach continued, especially to annotate GUUFs (Genes of Utterly Unknown Function).

Tuesday, December 05, 2006

Cousin May's Least Favorite Bacteria

Ogden Nash was a witty poet, but skipped some key biology. Termites may have found wood yummy, but without some endosymbiotic bacteria, wood wouldn't be more than garnish to them -- and the parlor floor would still support Cousin May.

It shouldn't be surprising that such bacteria might be a challenge to cultivate in a non-termite setting. Conversely, university facilities departments are not keen on keeping the native culture system in numbers! :-) Last week's Science has another paper showing off the digital PCR microfluidic chip I mentioned previously. They are again performing single cell PCR, except this time it is going for one cell per reaction chamber rather than one cell per set of chambers. That's because the goal now is not to count mRNAs, but to count bacteria positive for molecular markers. By performing multiplex PCR, they can count categories such as 'A not B', 'B not A', and 'A and B'.

The particular A's and B's are degenerate primers targeting bacterial 16S ribosomal RNA and a key enzyme for some termite endosymbionts, FTHFS. The 16S rRNA primers have very broad specificity, whereas the FTHFS primers are specific to a subtype called 'clone H'. One more twist: reaction chambers in which both primer pairs amplified were retrieved, further amplified, and sequenced. This enabled specific identification of the bacteria present in the positive wells, and in most cases the same 16S and FTHFS sequences were retrieved from wells amplifying both. This is some nifty linkage analysis!
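
A sketch of the logic (my own toy, not the paper's actual analysis): tally the chambers into the categories above and ask whether the 'both' chambers occur more often than independent loading of the two markers would predict.

```python
# Toy sketch (not the paper's analysis) of tallying multiplex digital-PCR
# chambers and checking whether the two markers co-occur more often than
# chance co-loading of unrelated cells would predict.

def tally(chambers):
    """chambers: iterable of (rRNA_positive, fthfs_positive) booleans."""
    counts = {"both": 0, "16S only": 0, "FTHFS only": 0, "neither": 0}
    for rrna, fthfs in chambers:
        if rrna and fthfs:
            counts["both"] += 1
        elif rrna:
            counts["16S only"] += 1
        elif fthfs:
            counts["FTHFS only"] += 1
        else:
            counts["neither"] += 1
    return counts

def expected_both_if_independent(counts):
    n = sum(counts.values())
    p_rrna = (counts["both"] + counts["16S only"]) / n
    p_fthfs = (counts["both"] + counts["FTHFS only"]) / n
    return n * p_rrna * p_fthfs   # expected 'both' chambers under independence

# illustrative numbers only
example = {"both": 30, "16S only": 400, "FTHFS only": 5, "neither": 765}
print(expected_both_if_independent(example), "expected vs", example["both"], "observed")
```

If the observed 'both' count greatly exceeds the independence expectation, the two markers are very likely riding in the same cells -- which is exactly what the sequencing of double-positive chambers then nails down.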

In addition to all sorts of uses in microbiology, such chips might be interesting to apply to cancer samples. Tumors are complex evolving ecosystems, with both the tumors and some of their surrounding tissue undergoing a series of mutations. An interesting family of questions is what mutations happen in what order, and which mutations might be antagonistic. This device offers the opportunity to ask those sorts of questions, if you can design the appropriate PCR primer sets.

Monday, December 04, 2006

Computing Cancer

Last week's Cell has a paper using simulations to estimate the influence of the local microenvironment on the development of invasiveness in cancers. Their model includes both discrete elements (cells, which have a number of associated discrete states) and continuous variables, and is therefore referred to as a Hybrid Discrete-Continuum, or HDC, model. Properties associated with a cell include both internal activities, such as metabolism, and external ones, such as oxygen tension (which, the authors point out, could be any diffusible nutrient), secretion of extracellular matrix degrading enzymes, and the concentration of extracellular matrix.

With any model, the fun part is the predictions it makes -- particularly the unorthodox ones. Predictions enable verification or invalidation. For the Cell paper, they do make quite an interesting prediction:
The HDC model predicts that invasive tumor properties are reversible under appropriate microenvironment conditions and suggests that differentiating therapy aimed at cancer-microenvironment interactions may be more useful than making the microenvironment harsher (e.g. by chemotherapy or antiangiogenic therapy).


Experimentally testing such predictions is decidedly non-trivial, but at least now the challenge has been posed. This prediction has clear implications for choosing therapeutic strategies, particularly in picking combinations of oncology drugs -- and few cancer patients are on only a single anti-tumor agent.

This paper is also part of Cell's experiment in electronic feedback -- readers can submit comments on the paper. The opportunity to be the first commenter still appears available -- is there anyone reading this brave enough to go for it?

Sunday, December 03, 2006

Perverse Milestone

Well, this blog hit a dubious milestone today -- my first comment spam! Unsurprisingly, given my topic choice, it was for an online "pharmacy".

On the other hand, it is wonderful to see helpful & friendly comments -- people actually are reading this! Thank you thank you thank you!

Friday, December 01, 2006

Dead Manuscripts #0

When Millennium originally cleaned house, I thought I would be idled almost immediately & this blog was one new initiative to maintain my sanity during the downtime. But then, thanks to some campaigning by friendly middle managers, I was given an extension until end-of-year. It's nice, since it gives me a little more time to hand off a couple of projects to people.

But, there's still a lot of time left over, and so like Derek Lowe I find myself trying to cobble together some manuscripts before I go.

Now the problem here is that I tend to think I have a lot of interesting stuff to publish -- until I actually get going. It's serious work preparing something for publication. Plus, sometimes when you start dotting the i's and crossing the t's your results start looking less and less attractive.

I am trying to tackle too many papers, especially since all but one are solo affairs. So it will be time to cull some of the ideas soon. There's also stuff that previously stalled somewhere along the way & I doubt I'll ever resurrect. One was even submitted -- and rejected; I found myself agreeing with half the reviewer's comments about the quality of the writing.

Normally, these just go back into memory as items to trot out if they answer questions in interviews ("oh, yes, I once did a multiple alignment of llama GPCRs..."). But now, I have a place to unleash them on the world! I'm the editor & review board! (& 1/10th the readership? :-) Perhaps some of the nuggets will be useful to someone, and perhaps some will even be worked up by someone else into a full paper.

A lot of these little items are interesting, but not quite a Minimum Publishable Unit, or MPU. In academia, there are often debates as to the minimum content of a paper, and some authors push to publish no more than an MPU. Others go in the opposite direction: you need to read every last footnote in one of George Church's papers to get all the stuff he tries to cram in. For example, Craig Venter was the first to succeed at whole genome shotgun sequencing in 1995, but George was trying it back in 1988: see the footnotes to his multiplex sequencing paper.

I almost killed one idea today, but alas I figured out one more question to ask of the data. I'll do that, but I really should kill this one. It's the one where I'm skating way outside my recognized expertise and the results are useful but not stunning. The clock is ticking away, and it would be better to wrap up one good story than have 4 manuscript fragments to add to the queue for this space.

Wednesday, November 29, 2006

Phage Renaissance

Bacteriophage, or phage, occupy an exalted place in the history of modern biology. Hershey & Chase used phage to nail down DNA (and not protein) as the genetic material. Benzer pushed genetic mapping to the nucleotide level. And much, much more. Phage could be made in huge numbers, to scan for rare events. Great stuff, and even better that so many of the classic papers are freely available online!

Phage have also been great toolkits for molecular biology. First, various enzymes were purified, many still in use today. Later, whole pieces of phage machinery were borrowed to move DNA segments around.

Two of the best studied phage are T7 and lambda. Both have a lot of great history, and both have recently undergone very interesting makeovers.

T7 is a lytic phage; after infection it simply starts multiplying and soon lyses (breaks open) its host. T7 provided an interesting early computational conundrum, one which I believe is still unsolved. Tom Schneider has an elegant theory about information and molecular biology, which can be summarized as: locational codes contain only as much information as they need to be located uniquely in a genome, no more, no less. Testing on a number of promoters suggested the theory was valid. However, a sore thumb stuck out: T7 promoters contain far more information than the theory called for, and a clever early artificial evolution approach showed that this information really wasn't needed by T7 RNA polymerase. So why is there more conservation than 'necessary'? It's still a mystery.
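
For the curious, the two quantities at the heart of the theory are easy to compute in rough form. Rfrequency = log2(G/gamma) is the information needed to pick gamma sites out of G possible positions; Rsequence is the information actually present in the aligned sites. The sketch below ignores the small-sample corrections Schneider's actual method applies, and the toy alignment and genome numbers are purely illustrative.

```python
# A rough sketch of Schneider's two quantities (ignoring the small-sample
# corrections his real method uses).  Rfrequency: bits needed to locate the
# sites; Rsequence: bits actually present in the aligned site sequences.

import math
from collections import Counter

def r_frequency(genome_positions, n_sites):
    return math.log2(genome_positions / n_sites)

def r_sequence(aligned_sites):
    """Sum over columns of (2 - Shannon entropy), in bits."""
    total = 0.0
    for column in zip(*aligned_sites):
        counts = Counter(column)
        n = len(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total

# toy alignment around the familiar T7 promoter consensus (illustrative only)
sites = ["TAATACGACTCACTATAG",
         "TAATACGACTCACTATAG",
         "TAATACGTCTCACTATAG",
         "TAATACGACTCACTAGAG"]
print(f"Rsequence ~ {r_sequence(sites):.1f} bits")
print(f"Rfrequency for ~17 sites in a ~40 kb genome ~ {r_frequency(40000, 17):.1f} bits")
```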

Phage lambda follows a very different lifestyle. After infection, most times it goes under deep cover, embedding itself at a single location in its E.coli host's genome, a state called lysogeny. But when the going gets tight, the phage get going and go through a lytic phase much like that of T7. The molecular circuitry responsible for this bistable system was one of the first complex genetic systems elucidated in detail. Mark Ptashne's book on this, A Genetic Switch, should be part of the Western canon -- if you haven't read it, go do so! (Amazon link)

With classical molecular biology techniques, modest tinkering or wholesale vandalism were the only really practical ways to play with a phage genome. You could rewrite a little or delete a lot. Despite that, it is possible to do a lot with these approaches. In today's PNAS preprint section (alas, you'll need a subscription to get beyond the abstract) is a paper which re-engineers the classic lambda switch machinery. The two key repressors, CI and Cro, are replaced with two other well-studied repressors whose activity can be controlled chemically, LacI and TetR. Appropriate operator sites for these repressors were installed in the correct places. In theory, the new circuit should perform the same lytic-lysogeny switch as lambda phage 1.0, except now under the control of tetracycline (TetR, replacing CI) and lactose (LacI, replacing Cro). Of course, things don't always turn out as planned.
These variants grew lytically and formed stable lysogens. Lysogens underwent prophage induction upon addition of a ligand that weakens binding by the Tet repressor. Strikingly, however, addition of a ligand that weakens binding by Lac repressor also induced lysogens. This finding indicates that Lac repressor was present in the lysogens and was necessary for stable lysogeny. Therefore, these isolates had an altered wiring diagram from that of lambda.
When theory fails to predict, new science lies ahead!

Even better, with the advent of cheap synthesis of short DNA fragments ("oligos") and new methods of putting those together, the era of "all the phage that's fit to print" is really here. This new field of "synthetic biology" offers all sorts of new experimental options, and of course a new set of potential misuses. Disclosure: my next posting might be with one such company.

Such rewrites are starting to show up. Last year one team reported rewriting T7. Why rewrite? A key challenge in trying to dissect the functions of viral genes is that many viral genes overlap. Such genetic compression is common in small genomes, and gets more impressive the smaller the genome. But, if tinkering with one gene also tweaks one or more of its neighbors, interpreting the results becomes very hard. So by rewriting the whole genome to eliminate overlaps, cleaner functional analysis should be possible.

With genome editing becoming a reality, perhaps it's time to start writing a genetic version of Strunk & White :-)

Monday, November 27, 2006

Graphical table-of-contents

I am a serious journal junkie, and have been for some time. As an undergraduate I discovered where the new issues of each key journal (Cell, Science, Nature, PNAS) could be first found. In grad school, several of us had a healthy competition to first pluck the new issue from our advisor's mailbox -- and the number of key journals kept going up. Eventually, of course, all the journals went on-line and it became a new ritual of hitting the preprint sites at the appropriate time -- for example, just after 12 noon on Thursdays for the Cell journals. A good chunk of my week is organized around the rituals.

Most pre-print sites, indeed most on-line tables-of-contents, are barebones text affairs. That's fine and dandy with me -- quick & easy to skim. But, I do appreciate a few that have gone colorful. Some now feature a key figure from each article, or perhaps a figure collage specifically created for display (much like a cover image, but one for each paper).

Journal of Proteome Research is at the forefront of this trend. Of course, since it is a pre-print site the particular images will change over time. As I write this, I can see a schematic human fetus in utero, flow charts, Venn diagrams, spectra, a dartboard (!), bananas, 1D & 2D gels, a grossly overdone pie chart, and much more.

Nature Chemical Biology is the other journal I am aware of with this practice. The current view isn't quite such a riot, because NCB doesn't have the large set of pre-prints that JPR has, but both a fly and a worm are gracing the page.

The graphical views do provide another hint of what might be in the paper beyond the title. In particular, they give some feel for what the tone of the paper might be (that dartboard must indicate a bit of humor!). They certainly add some color to the day.

Sunday, November 26, 2006

Gene Patents


Today's Parade magazine has an article titled "How Gene Patents are Putting Your Health at Risk". The topic of gene patents deserves public scrutiny & debate, but better coverage than this article.

Featured prominently (with a picture in the print edition) is Michael Crichton, whose new book has been touched on previously in this space. Crichton in particular makes a number of concrete statements, some of which are a bit dubious.

First, let's take the statement
A fifth of your genes belong to someone else. That’s because the U.S. Patent Office has given various labs, companies and universities the rights to 20% of the genes found in everyone’s DNA— with some disturbing results.
The first sentence is just plain wrong, and given its inflammatory nature that is very poor journalism. Nobody can own your genes -- genes, as natural entities, are not themselves patentable. What can be patented are uses of the information in those genes, not the genes themselves -- a critical, subtle distinction which is too often lost. It is just as if I could patent a novel use for water, but not water itself.

Time for full disclosure: I am a sole or co-inventor on 11 issued gene patents (e.g. U.S. Patent 6,989,363), many of which are for the same gene, ACE2. Many more gene patents were applied for on my behalf, but most have already been abandoned as not worth the investment. Those patents attempted to make a wide range of claims, but interestingly they missed what may be the key importance for ACE2 (we never guessed it), which is that it is a critical receptor for the SARS virus.

Many of the gene patents do illustrate a key shortcoming of current patent law. When filing a gene patent, we (and all the other companies) tried to claim all sorts of uses for the information in the gene. These sorts of laundry lists are the equivalent of being able to buy as many lottery tickets as you like for free. A rational system would penalize multiple claims, just as multiple testing is penalized in experimental designs. The patent office should also demand significant evidence for each claim (they may well do this now; I am no expert on the current patent law).

Another one of Crichton's claims deserves at least some supporting evidence, plus it confuses two distinct concepts in intellectual property law:
Plus, Crichton says, in the race to patent genes and get rich, researchers are claiming they don’t have to report deaths from genetic studies, calling them “trade secrets.”

First, just because some idiots have the chutzpah to make such claims doesn't mean they are believed or enforceable. Second, such claims have nothing to do with gene patents -- such claims could exist in any medical field. Finally, trade secrets and patents are two different beasts altogether. In a patent, the government agrees to give you a monopoly on some invention in return for you disclosing that invention so others may try to improve on it; a trade secret must be kept secret to retain protection and should someone else discover the method by legal means, your protection is shot.

The on-line version also includes a proposed "Genetic Bill of Rights". I would propose that before enacting such a bill, one think very carefully about the ramifications of some of the proposals.

Take, for example,
Your genes should not be used in research without your consent, even if your tissue sample has been made anonymous.
What exactly does this mean? What it will probably mostly mean is that the thicket of consent hurdles around tissue samples will get thicker. Does this really protect individual privacy more, or is it simply an impediment which will deter valuable research? Will it somehow put genetic testing of stored samples on a different footing than other testing (e.g. proteomic), in a way which is purely arbitrary?

Another 'right' proposed is
Your genes should not be patented.
First, an odd choice of verb: "should"? Isn't that a bit mousy? Does it really change anything? And what, exactly, does it mean to patent "your genes"?

On the flip side, I'm no fan of unrestricted gene patenting. All patents should be precise and have definite bounds. They should also be based on good science. Patents around the BRCA (breast cancer) genes are the most notorious, both because they have been extensively challenged (particularly in Europe) and because the patent holders have been aggressive in defending them. This has led to the strange situation in (at least part of) Europe where the patent coverage on testing for breast cancer susceptibility depends on what heritage you declare: the patent applies only to testing in Ashkenazi Jews.

In a similar vein, I can find some agreement with Crichton when he states
During the SARS epidemic, he says, some researchers hesitated to study the virus because three groups claimed to own its genome.
It is tempting to give non-profit researchers a lot of leeway around patents. However, the risk is that some such researchers will deliberately push the envelope between running research studies and running cut-rate genetic testing shops. Careless changes to the law could also hurt companies selling patented technologies used in research: if a researcher can ignore patents for genetic tests, why not for any other patented technologies?

Gene patents, like all patents, are an attempt by government (with a concept enshrined in the U.S. Constitution) to encourage innovation yet also enable further progress. There should be a constant debate as to how to achieve this. Ideas such as 'bills of rights', research exemptions, the definitions of obviousness and prior art, and many other topics need to be hashed over. But please, please, think carefully before throwing a huge stone, or volley of gravel, into the pool of intellectual property law.

Wednesday, November 22, 2006

Enjoy Your W!



No, this isn't revealing my politics. W is the single letter code for tryptophan, which is reputedly richly found in turkey meat (If you are a vegetarian, what vegetable matter is richest in W?)

Tryptophan is an odd amino acid, and probably the last one added to the code -- after all, in most genomes it has only a single codon and it is one of the rarest amino acids in proteins. It has a complex ring system (indole), which would also suggest it might have come last.

Why bother? Good question -- one I should know the answer to but don't. When W is conserved, is the chemistry of the indole ring being utilized where nothing else would do? That's my guess, but I'll need to put that on the list of questions to figure out sometime.

But why W? Well, there are 20 amino acids translated into most proteins (plus a few others in special cases) and they were named before anyone thought their shorthand would be very useful. There are clearly mnemonic three letter codes, but for long sequences a single letter works better -- once you are indoctrinated in them, the single-letter codes become nice compact representations which can be scanned by eye. Some have even led to names of domains and motifs, such as WD40, PEST and RGD.

The first choice is to use the first letter of the name: Alanine, Cysteine, Glycine, Histidine, Isoleucine, Leucine, Methionine, Proline, Serine, Threonine and Valine follow this rule. If multiple amino acids start with the same letter, the smallest amino acid gets to take the letter. Some others are phonetically correct: aRginine, tYrosine, F:Phenylalanine. Others just fill in: D:aspartate, E:glutamate, K:lysine, N:asparagine and Q:glutamine.
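
For reference, here is the full mapping as a small Python dict, grouped by the rough rules above:

```python
# The one-letter amino acid code, grouped by the rough naming rules above
# -- with W, of course, as the orphan this post is about.

ONE_LETTER = {
    # first letter of the name
    "A": "Alanine", "C": "Cysteine", "G": "Glycine", "H": "Histidine",
    "I": "Isoleucine", "L": "Leucine", "M": "Methionine", "P": "Proline",
    "S": "Serine", "T": "Threonine", "V": "Valine",
    # phonetic
    "R": "aRginine", "Y": "tYrosine", "F": "Phenylalanine",
    # fill-ins
    "D": "Aspartate", "E": "Glutamate", "K": "Lysine",
    "N": "Asparagine", "Q": "Glutamine",
    # and the oddball
    "W": "Tryptophan",
}
assert len(ONE_LETTER) == 20
```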

But Tryptophan? Perhaps it was studied early on by Dr. Fudd, who gave long lectures about the wonders of tWiptophan.

What's a good 1Gbase to sequence?



My newest Nature arrived & has on the front a card touting Roche Applied Science's 1Gbase grant program. Submit an entry by December 8th (1000 words), and you (if you reside in the U.S. or Canada) might be able to get 1Gbase of free sequencing on a 454 machine. This can be run on various numbers of samples (see the description). They are guaranteeing 200bp per read. The system runs 200K reads per plate and the grant is for 10 plates -- 2M reads -- but 2M x 200 bases = 400Mb -- so somewhere either I can't do math or their materials aren't quite right. The 200bp/read is a minimum, so apparently their average is quite a bit higher (or again, I forgot a factor somewhere). Hmm, paired end sequencing is available but not required, so that isn't the obvious factor of 2.
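
Making that arithmetic explicit (all numbers as quoted from their materials above):

```python
# The post's back-of-the-envelope, spelled out.
reads_per_plate = 200_000
plates = 10
min_read_len = 200          # bp, guaranteed minimum

total_reads = reads_per_plate * plates
print(total_reads * min_read_len / 1e6, "Mb at the 200 bp minimum")   # 400.0 Mb

target = 1_000_000_000      # the advertised 1 Gbase
print(target / total_reads, "bp average read needed to hit 1 Gb")     # 500.0 bp
```

So either the average read is closer to 500bp than the 200bp guarantee, or there is a factor of ~2.5 hiding somewhere in the fine print.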

So what would you do with that firepower? I'm a bit embarrassed that I'm having a hard time thinking of good uses. For better or worse, I was extended at Millennium until the end of the year, so any brainstorms around cancer genomics can't be surfaced here. There are a few science-fair-like ideas (yikes! will kids soon be sequencing genomes as science fair projects?), such as running metagenomics on slices out of a Winogradsky column. When I was an undergrad, our autoclaved solutions of MgCl2 always turned a pale green -- my professor said this was routine & due to some alga that could live in that minimal world. Should that be sequenced? Metagenomics of my septic system? What is the most interesting genetic system that isn't yet undergoing a full genome scan?

Well, submission is free -- so please submit! Such an opportunity shouldn't be passed up lightly.

Thursday, November 16, 2006

One Company I'm Not Sending My Resume To


On some site today a Google ad caught my eye -- alas I cannot remember why -- and I found a site for NEXTgencode. However, it doesn't take long to realize that this isn't a real company. The standard links for a real biotech company, such as 'Careers', 'Investors', etc. are missing.

Clicking around what is there leads to a virtual Weekly World News of imaginative fabrications, though some have previously been presented as true by places that should know better, such as the BBC. I think I saw an item on grolars (grizzly-polar bear hybrids) in the press as well. Also thrown in is a reference to the recently published work: "Humans and Chimps Interbred Until Recently".

Various ads show products in development. My favorite ad is the one for Perma Puppies, which never grow old or even lose their puppy physique (though their puppy isn't nearly as cute as mine was!). There's also the gene to buy with the HUGO symbol BLSHt.

The giveaway is the last news article, which describes a legal action by the company
Michael Crichton's book "Next" claims to be fiction, but its story line reveals proprietary information of Nextgencode, a gene manipulation company.

Surprise! "Next" will be released at the end of the month.

When I read Andromeda Strain as a kid, I fell hook, line & sinker for a similar ploy in that book -- all of the photos were labeled just like the photos of real spacecraft in books on NASA ("Photo Courtesy of Project SCOOP"). It took some convincing from older & wiser siblings before I caught on.

Going back over the news items with the knowledge of who is behind it was revealing. Crichton has become noted for throwing his lot in with global warming skeptics. "Burn Fuel? Backside Fat Powers Boat" is the tamer of the digs; another item suggests Neanderthals were displaced by the Cro-Magnon due to the Neanderthals' environmentalist tendencies.

Well, at least it's a tame fake -- a fake company purely to hawk a book. Sure beats the shameless hucksters who set up companies to peddle fake cures (we have stem cell injections to cure hypochondria!) to desperate patients.

Wednesday, November 15, 2006

Counting mRNAs


If you want to measure many mRNAs in a cell, microarray technologies are by far the winner. But for more careful scrutiny of the expression of a small number of genes, quantitative RT-PCR is the way to go. qRT-PCR is viewed as more consistent & has higher throughput (for lower cost) when looking at the number of samples which can be surveyed. It doesn't hurt that one specific qRT-PCR technology was branded TaqMan, which plays on both the source of the key PCR enzyme (Thermus aquaticus aka Taq) and the key role of Taq polymerase's exonuclease activity, which munches nucleotides in a manner reminiscent of a certain video game character (though I've never heard of any reagent kits being branded 'Power Pills'!).

RT-PCR quantitation relies on watching the time course of amplification. Many variables can play with amplification efficiencies, including buffer composition, primer sequence, and temperature variations. As a result, noise is introduced and results between assays are not easily comparable.

The PNAS site has an interesting paper which uses a different paradigm for RT-PCR quantitation. Instead of trying to monitor amplification dynamics, it relies on a digital assay. The sample is diluted and then aliquoted into many amplification chambers. At the dilutions used, only a fraction of the aliquots will contain even a single template molecule. By counting the number of chambers positive for amplification & working back from the dilution, the number of template molecules in the original sample can be estimated.

Such digital PCR is very hot right now and lies at the heart of many next generation DNA sequencing instruments. What makes this paper particularly interesting is that the assay has been reduced to microfluidic chip format. A dozen diluted samples are loaded on the chip, which then aliquots each sample into 1200 individual chambers. Thermocycling the entire chip drives the PCR, and the number of positive wells is counted. While the estimate is best if most chambers are empty of template (because then very few started with multiple templates), the authors show good measurement at higher (but non-saturating) template concentrations.
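
The arithmetic of 'working back from the dilution' is the standard Poisson correction; here is a minimal sketch (the paper's own statistics may differ in detail):

```python
# A minimal sketch of the standard digital-PCR estimate.  If templates
# land in chambers independently, occupancy is Poisson, so the fraction
# of *negative* chambers is exp(-lambda).

import math

def templates_from_digital_pcr(positive, total_chambers, dilution_factor=1.0):
    """Estimate template molecules in the pre-dilution sample."""
    if positive >= total_chambers:
        raise ValueError("saturated panel: cannot estimate")
    lam = -math.log(1.0 - positive / total_chambers)   # mean templates per chamber
    return lam * total_chambers * dilution_factor

# e.g. 240 positive chambers out of a 1200-chamber panel
print(round(templates_from_digital_pcr(240, 1200)))   # ~268, not 240: the
# Poisson correction accounts for chambers that received 2+ templates
```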

An additional layer of neato is here as well -- each sample is derived from a single cell, separated from its mates by flow sorting. While single cell sensitivity has been achieved previously, the new paper claims greater measurement consistency. By viewing individual cells, misunderstandings created by looking at populations are avoided. For example, suppose genes A and B were mutually exclusive in their expression -- but a population contained equal quantities of A-expressors and B-expressors. For a conventional expression analysis, one would just see equal amounts of A and B. By looking at single cells, the exclusive relationship would become apparent. The data in this paper show examples of wide mRNA expression ranges for the same gene in the 'same' type of cells; a typical profile of the cell population would see only the weighted mean value.

The digital approach is very attractive since it is counting molecules. Hence, elaborate normalization schemes are largely unnecessary (though the Reverse Transcriptase step may introduce noise). Furthermore, from a modeler's perspective actual counts are gold. Rather than having fold-change information with fuzzy estimates of baseline values, this assay is actually enumerating mRNAs. Comparing the expression of two genes becomes transparent and straightforward. Ultimately, such measurements can become fodder for modeling other processes, such as estimating protein molecule per cell counts.

Cell sorters can also be built on chips (this is just one architecture; many others can be found in the Related Articles for that reference). It doesn't take much to imagine marrying the two technologies to build a compact instrument capable of going from messy clinical samples to qRT-PCR results. Such a marriage might one day put single cell qRT-PCR clinical tests into a doctor's office near you.

Tuesday, November 14, 2006

More lupus news


Hot on the heels of my previous rant around lupus is some more news. Human Genome Sciences has announced positive results for its Lymphostat B drug in lupus. I won't go into detail on their results, other than to comment that the study size is large (>300) and the trial is a Phase II double-blind, placebo-controlled trial (open-label, single-arm trials are much more common for Phase II -- HGS isn't taking the easy route), but these results haven't yet been subject to full peer review in a journal article.

Lymphostat B has a number of unusual historical notes attached to it. It is in that very rarefied society of discoveries from genomics which have made it far into therapeutic clinical trials -- there are other examples (not on hand, but trust me on this!), but not many. It doesn't hurt that it was in a protein family (TNF ligands) which suggested a bit of the biology (e.g. a cognate receptor) & has led to quite a bit of biology which is in the right neighborhood for a lupus therapy (B-cell biology); most genomics finds were Churchillian enigmas.

Second, this is a drug that initially failed similar trials -- but HGS conducted a post-hoc subset analysis on the previous trial. However, instead of begging their way forward (such analyses get all the respect due used cat litter, but that doesn't stop desperate companies from trying to argue for advancement) they designed a new trial using a biomarker to subset the population. If their strategy works, it is likely that doctors will only prescribe it to this restricted population. HGS has, in effect, decided it is better to treat some percent of a small population than risk getting approval for 0% of a larger one -- a bit of math the pharmaceutical industry has frequently naysayed.

HGS and their partner GSK still have a long way to go on Lymphostat B. Good luck to them -- everyone in this business needs it, especially the patients.

Monday, November 13, 2006

Small results, big press release


The medical world is full of horrible diseases which need tackling, but you can't track them all. For me, it is natural to focus a touch more on those to which I have a personal connection.

Lupus is one such disease, as I have a friend with it. Lupus is an autoimmune disease in which the body produces antibodies targeting various normal cellular proteins. The result can be brutal biological chaos.

The pharmaceutical armamentarium for lupus isn't very good. Anti-lupus therapies fall into two general categories: anti-inflammatory agents and low doses of cancer chemotherapeutics (primarily anti-metabolite therapies such as methotrexate). Few of these have been adequately tested in lupus, and certainly not well tested in combination. The docs are flying by the seat of their pants. The side effects of the drugs are quite severe, so much so that lupus therapy can be an endless back-and-forth between minimizing disease damage & therapy side effects.

One reason lupus hasn't received a lot of attention from the pharmaceutical industry is that we really don't understand the disease. It is almost certainly a 'complex disease', meaning there are multiple genetic pathways that lead to or influence the disease. Different patients manifest the disease in different ways. For many patients, the most dangerous aspect is an autoimmune assault on the kidneys, but for my friend the most vicious flare-ups involve pericarditis, an inflammation of the sac around the heart. These differences could reflect very different disease mechanisms; we really don't know.

We need to understand the mechanisms of lupus, so it is with interest I read items such as this one: New biomarkers for lupus found. The item starts promisingly

A Wake Forest University School of Medicine team believes it has found biomarkers for lupus that also may play a role in causing the disease.

The biomarkers are micro-ribonucleic acids (micro-RNAs), said Nilamadhab Mishra, M.D. He and colleagues reported at the American College of Rheumatology meeting in Washington that they had found profound differences in the expression of micro-RNAs...


So far, so good -- except now things go south
...between five lupus patients and six healthy control patients who did not have lupus.


Five patients? Six controls? These are exquisitely tiny samples, particularly when looking at microRNAs, of which there are >100 known for humans. With so few samples, the risk of a chance association is high. And are these good comparisons? Were the samples well matched for age, concurrent & previous therapies, gender, etc.?

Farther down is even more worrisome verbiage
In the new study, the researchers found 40 microRNAs in which the difference in expression between the lupus patients and the controls was more than 1.5 times, and focused on five micro-RNAs where the lupus patients had more than three times the amount of the microRNAs as healthy controls, and one, called miR 95 where the lupus patients had just one third of the gene expression of the microRNA of the controls.


Fold-change cutoffs are popular in expression studies, because they are intuitive, but on their own they are generally meaningless. Depending on how tight the assays are, fold changes of 3X can be meaningless (in an assay with high technical variance) and ones smaller than 1.5X can be quite significant (in an assay with very tight technical variance). Well-designed microarray studies are far more likely to use proper statistical tests, such as t-tests.
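
To make the point concrete, here is a toy simulation (synthetic numbers, nothing to do with the actual study): the same nominal fold change can be convincing or worthless depending on the assay's technical variance, which is exactly what a t-test captures and a bare fold-change cutoff ignores.

```python
# Toy illustration with synthetic data (not the study's data): identical
# fold changes, very different statistical support, depending on variance.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def simulate(fold_change, cv, n=6):
    """n controls vs n patients, expression in log2 units."""
    controls = rng.normal(loc=0.0, scale=cv, size=n)
    patients = rng.normal(loc=np.log2(fold_change), scale=cv, size=n)
    t, p = ttest_ind(patients, controls, equal_var=False)
    observed_fc = 2 ** (patients.mean() - controls.mean())
    return observed_fc, p

print("3-fold change, noisy assay:   FC=%.1f, p=%.2f" % simulate(3.0, cv=2.0))
print("1.4-fold change, tight assay: FC=%.1f, p=%.4f" % simulate(1.4, cv=0.1))
```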

And one last statement to complain about
The team reported the lesser amount of miR 95 "results in aberrant gene expression in lupus patients."

Is this simply correlation between miR 95 and other gene expression -- which suffers both from the fact that correlation is not causation and that, with such small samples, gene expression differences will be found by pure chance? Are these genes which have previously been shown to be targets of miR 95? Has it been shown that actually interfering with miR 95 expression in the patient samples reverts the gene expression changes?

Of course, it is patently unfair for me to beat up on a scientific poster of preliminary results for which I have only seen a press release - one hopes that before this data gets to press a much more detailed workup is performed (please, please let me review this paper!). But, it is also patently unfair to yank the chains of patients with understudied diseases with press releases that take a nub of a preliminary result and headline it into a major advance.

Friday, November 10, 2006

Systems Biology Review Set



The Nature publishing group has made a set of reviews on systems biology available for free. It looks like an interesting set of discussions by key contributors.

Hairy Business



In addition to the sea urchin genome papers, the new Science also contains an article describing the positional cloning of a mutant gene resulting in hair loss. The gene encodes an enzyme which is now presumed to play a critical role in the health of hair follicles.

The first round of genomics companies had two basic scientific strategies. Companies such as Incyte and Human Genome Sciences planned to sequence the expressed genes & somehow sift out the good stuff. Another set of companies, such as Millennium, Sequana, Myriad and Mercator planned to find important genes through positional cloning. Positional cloning uses either carefully collected human family samples or carefully bred mice to identify regions of the genome that track with the trait of interest. By progressively refining the resolution of the genetic maps, the work could narrow down the region to something that could be sequenced. Further arduous screening of the genes in that region for mutations which tracked with the trait would eventually nail down the gene. Prior to the human genome sequence this was a long & difficult process, and sometimes in the end not all the ambiguity could be squeezed out. It is still serious work, but the full human genome sequence and tools such as gene mapping chips make things much cheaper & easier.

It seemed like every one of the positional cloning companies picked the same new indications -- obesity, diabetes, depression, schizophrenia, etc. This set up heated rivalries to collect families, find genes, submit patents & publish papers. Sequana & Millennium were locking horns frequently when I first showed up at the latter. If memory serves, on the hotly contested genes it was pretty much a draw -- each side sometimes beating the other to the prize.

Eventually, all of the positional cloning companies discovered that while they could achieve scientific success, it wasn't easy to convert that science into medical reality. Most of the cloned genes turned out to be not easily recognizable in terms of their function, and certainly not members of the elite fraternity of proteins known as 'druggable targets' -- the types of proteins the pharmaceutical industry has had success at creating small molecules (e.g. pills) to target. A few of the genes found were candidates for protein replacement therapy -- the strategy which has made Genzyme very rich -- but these were rare. Off-hand, I can't think of a therapeutic arising from one of these corporate positional cloning efforts that even made it to trials (anyone know if this is correct?).

Before long, the positional cloning companies either moved into ESTs & beyond (as Millennium did) or disappeared through mergers or even just shutting down.

I'm reminded of all this by the Science paper because hair loss was one area that wasn't targeted by these companies -- although the grapevine said that every one of them considered it. The commercial success of Rogaine made it an attractive area, and there was certainly a suggestion of a strong genetic component.

If a company had pursued the route that led to the Science paper, it probably would have been one more commercial disappointment. While the gene encodes an enzyme (druggable), the hairless version is a loss-of-function mutant -- and small molecules targeting enzymes generally reduce, not restore, their function. The protein isn't an obvious candidate for replacement therapy either. So, no quick fix. The results will certainly lead to a better understanding of what makes hair grow, but only after lots of work tying this gene into a larger pathway.

As for me, I'm hoping I inherited my hair genes from my maternal grandfather, who had quite a bit on his head even into his 90's, rather than my father's side, where nature is not quite so generous. As for urchins, I learned to avoid them after a close encounter on my honeymoon. I was lucky the hotel had a staff doctor, but I discovered on my return to the States that we had missed one & perhaps I still carry a little product of the sea urchin genome around in my leg.

Thursday, November 09, 2006

Betty Crocker Genomics



It is one thing to eagerly follow new technologies and muse about their differences; it is quite another to be in the position of playing the game with real money. In the genome gold rush years it was decided we needed more computing power to deal with searching the torrent of DNA sequence data, and so we started looking at the three then-extant providers of specialized sequence analysis computers. But how to pick which one, with each costing as much as a house?

So, I designed a bake-off: a careful evaluation of the three machines. Since I was busy with other projects, I attempted to define strict protocols which each company would follow with their own instrument. The results would be delivered to me within a set timeframe along with pre-specified summary information. Based on this, I would decide which machines met the minimal standard and which was the best value.

Designing good rules for a bake-off is a difficult task. You really need to understand your problem, as you want the rules to ensure that you get the comparative data you need in all the areas you need it, with no ambiguity. You also want to avoid wasting time drafting or evaluating criteria that aren't important to your mission. Of key importance is to not unfairly prejudice the competition against any particular entry or technology -- every rule must address the business goal, the whole business goal, and nothing but the business goal.

Our bake-off was a success, and we did purchase a specialized computer which sounded like a jet engine when it ran (fans! it ran hot!) -- but it was tucked away where nobody would routinely hear it. The machine worked well until we no longer needed it, and then we retired it -- and not long after the manufacturer retired from the scene, presumably because most of their customers had followed the same path as us.

I'm thinking about this now because a prize competition has been announced for DNA sequencing, the Archon X Prize. This is the same organization which successfully spurred the private development of a sub-orbital space vehicle, SpaceShip One. For the Genome X Prize, the basic goal is to sequence 100 diploid human genomes in 10 days for $1 million.

A recent GenomeWeb article described some of the early thoughts about the rules for this grand, public bake-off. The challenges in simply defining the rules are immense, and one can reasonably ask how they will shape the technologies which are used.

First off, what exactly does it mean to sequence 100 human genomes in 10 days for $1 million? Do you have to actually assemble the data in that time frame, or is that just the time to generate raw reads & filter them for variations? Can I run the sequencer in a developing country, where the labor & real estate costs are low? Does the capital cost of the machine count in the calculation? What happens to the small gaps in the current genome? To mistakes in the current assembly? To structural polymorphisms? Are all errors weighted equally, and what level is tolerable? Does every single repeat need to be sequenced correctly?
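
Even before those questions are settled, the raw arithmetic is worth writing down. A back-of-the-envelope sketch, with my own assumptions about diploid genome size and read coverage:

```python
# Back-of-the-envelope on the stated goal: 100 diploid genomes, 10 days, $1M total.
genomes, days, budget = 100, 10, 1_000_000
diploid_gb = 6.0          # ~2 x 3 Gb per diploid genome (rough)
coverage = 10             # assumed read redundancy; pick your own number

cost_per_genome = budget / genomes                    # $10,000 per genome
consensus_gb_per_day = genomes * diploid_gb / days    # 60 Gb/day of finished sequence
raw_gb_per_day = consensus_gb_per_day * coverage      # 600 Gb/day of raw reads at 10x

print(cost_per_genome, consensus_gb_per_day, raw_gb_per_day)
```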

The precise laying down of rules will significantly affect which technologies will have a good chance. Requiring that repeats be finished completely, for example, would tend to favor long read lengths. On the other hand, very high basepair accuracy standards might favor other technologies. Cost calculation methods can be subject to dispute (e.g. this letter from George Church's group).

One can also ask the question as to whether fully sequencing 100 genomes is the correct goal. For example, one might argue that sequencing all of the coding regions from a normal human cell will get most of the information at lower cost. Perhaps the goal should be to sequence the complete transcriptomes from 1000 individuals. Perhaps the metagenomics of human tumors is what we really should be shooting for -- with appropriate goals for extreme sensitivity.

Despite all these issues, one can only applaud the attempt. After all, Consumer Reports does not review genomics technologies! With luck, the Genome X Prize will spur a new round of investment in genomics technologies and new companies and applications. Which reminds me, if anyone has Virgin Galactic tickets they don't plan to use, I'd be happy to take them off your hands...

Urchin Genome



The new Science has a paper reporting the sequence of a sea urchin genome, as well as articles looking at specific aspects. This is an important genome, since it is the first echinoderm sequenced and echinoderms share many key developmental aspects with vertebrates.

At ~860 Mb the urchin genome is roughly twice the size of the Fugu (pufferfish) genomes which have been sequenced but substantially smaller than mammalian genomes (generally around 3,000 Mb). With fast, cheap sequencing power on the horizon, soon all our favorite developmental models will have their genomes revealed.

Tuesday, November 07, 2006

Long enough to cover the subject, short enough to be interesting.



That was the advice my 10th grade English teacher passed on when asked how much we should produce for a writing assignment. The context (a woman's skirt) he gave was risque enough to get a giggle from 10th graders of the 80's; probably the same joke would get him in serious hot water today -- unless perhaps he pointed out that the same applies for a man's kilt.

A letter in a recent Nature suggests that the same question that vexed me in my student days also bedevils the informatics world. The writer lodges a complaint against MIAME (Minimum Information About a Microarray Experiment), a standard for reporting the experimental context of a microarray experiment. MIAME attempts to capture some key information, such as what the samples are and what was done to them.
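
To give a sense of the flavor, here is a loosely MIAME-style annotation record. The field names and values are my own illustration of the kind of context the standard asks for, not the official schema (real submissions go through formats such as MAGE-ML).

```python
# Illustrative, MIAME-flavored experiment annotation (hypothetical study).
experiment_annotation = {
    "experiment_design": "tumor vs. adjacent normal tissue, paired samples",
    "samples": [
        {"id": "S1", "organism": "Homo sapiens", "tissue": "colon tumor",
         "treatment": "none", "rna_extraction": "TRIzol"},
        {"id": "S2", "organism": "Homo sapiens", "tissue": "adjacent normal colon",
         "treatment": "none", "rna_extraction": "TRIzol"},
    ],
    "array_design": "Affymetrix HG-U133A",
    "hybridization_protocol": "manufacturer's standard protocol",
    "raw_data": "CEL files deposited alongside the processed values",
    "normalization": "RMA",
}
```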

The letter writer's complaint is that this is all a fool's errand, as one cannot possibly capture all the key information, especially since what is key to record keeps changing. All reasonable points.

The solution proposed made me re-read the letter for a hint of satire, but I'm afraid they are dead serious.
How should we proceed? Reducing the costs of microarray technology so that experiments can be readily reproduced across laboratories seems a reasonable approach. Relying on minimal standards of annotation such as MIAME seems unreasonable, and should be abandoned.

At first, this just seems like good science. After all, the acid test in science is replication by an independent laboratory.

This utterly ignores two facts. First, depositing annotated data in central databanks allows it to be mined by researchers who don't have access to microarray gear. Second, most interesting microarray experiments involve either specialized manipulations (which only a few labs can do) or very precious, limited samples (such as clinical ones); replication would be nice but just can't be done on those same samples.

This "the experiments will be too cheap to database" argument has come up before; I had it sent my way during a seminar in my graduate days. But, like electricity too cheap to meter, it is a tantalizing mirage which fades on close inspection.

Monday, November 06, 2006

From biochemical models to biochemical discovery


An initial goal of genome sequencing efforts was to discover the parts lists for various key living organisms. A new paper in PNAS now shows how far we've come in figuring out how those parts go together, and in particular how discrepancies between prediction & reality can lead to new discoveries.

E.coli has been fully sequenced for almost 10 years now, but we still don't know what all the genes do. A first step would be to see if we could explain all known E.coli biology in terms of genes of known function -- if we can, that would say the rest are either for biology we don't know about or for fine-tuning the system beyond the resolution of our models. But if we can't, that says there are cellular activities we know about but haven't yet mapped to genes.

This is precisely the approach taken in Reed et al. First, they have a lot of data as to which conditions E.coli will grow on, thanks to a common assay system called Biolog (a PDF of the metabolic plate layout can be found on the Biolog website -- though curiously marked "Confidential -- do not circulate"!). They also have a quantitative metabolic model of E.coli. Marry the two and some media that support growth cannot be explained -- in other words, E.coli is living on nutrients it "shouldn't" according to the model.

Such a list of unexplained activities is a set of assays for finding the missing parts of the model, and deletion strains of E.coli provide the route to figuring out which genes plug the gaps. If a given deletion strain fails to grow in one of the unexplained growth-supporting media, then the gene deleted in that strain is probably the missing link. The list of genes to test can be kept small by choosing based on the model -- if the model is missing a transport activity, then the initial efforts can focus on genes predicted to encode transporters. Similarly, if the model is missing an enzymatic reaction, one can prioritize possible enzymes. The haul in this paper was functional assignments for 8 more genes -- a nice step forward.
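
The logic is simple enough to sketch in a few lines -- toy data with hypothetical gene and nutrient names, not the paper's actual tables:

```python
# Gap-filling sketch: find growth conditions the model can't explain, then ask
# which candidate gene deletions abolish growth on exactly those media.
observed_growth = {"nutrient_X", "glucose", "acetate"}   # Biolog-style observations (hypothetical)
model_predicted = {"glucose", "acetate"}                 # media the in-silico model can explain

unexplained = observed_growth - model_predicted
print("media the model can't explain:", unexplained)

# Hypothetical deletion-strain screen: media on which each knockout fails to grow.
knockout_no_growth = {
    "geneA": {"nutrient_X"},   # predicted transporter
    "geneB": {"nutrient_X"},   # predicted enzyme
    "geneC": set(),            # unrelated gene
}

for gene, lost_media in knockout_no_growth.items():
    hits = unexplained & lost_media
    if hits:
        print(f"{gene} is a candidate for the missing activity on {hits}")
```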

It is sobering how much of each sequenced genome is of unknown function, even in very compact genomes. Integration of experiment and model, as illustrated in this paper, is our best hope for closing that gap.

Friday, November 03, 2006

Phosphopallooza.

Protein phosphorylation is a hot topic in signal transduction research. Kinases can add phosphate groups to serines, threonines & tyrosines (and very rarely histidines), and phosphatases can take them off. These phosphorylations can shift the shape of the protein directly, or create (or destroy) binding sites for other proteins. Such bindings can in turn cause the assembly/disassembly of protein complexes, trigger the transport of a protein to another part of the cell, or lead to the protein being destroyed (or prevent such) by the proteasome. This is hardly a comprehensive list of what can happen.

Furthermore, a large fraction (by some estimates 1/5 to 1/4) of the pharmaceutical industry's efforts, including those at my (soon to be ex-) employer Millennium, target protein kinases. If you wish to drug kinases, you really want to know what the downstream biology is, and that starts with what your kinase phosphorylates, when it does so, and what events those phosphorylations trigger.

A large number of methods have been published for finding phosphorylation sites on proteins, but by far the most productive have been mass spectrometric ones (MS for short). Using various sample workup strategies, cleverer-and-cleverer instrument designs, and better software, the MS folks keep pushing the envelope in an impressive manner.

The latest Cell has the newest leap forward: a paper describing 6,600 phosphorylation sites (on 2,244 proteins). To put this in perspective, the total number of previously published human phosphorylation sites (by my count) was around 12,000 -- this single paper has found more than half as many as were previously known! Some prior papers (such as these two examples) had found close to 2,000 sites.

Now some of this depth came from many MS runs -- but that in itself illustrates how this task is getting simpler; otherwise so many runs wouldn't be practical. The multiple runs also were used to gather more data: looking at phosphorylation changes (quantitatively!) over a timecourse.

One thing this study wasn't designed to do is clearly assign the sites to kinases. Bioinformatic methods can be used to make guesses, but without some really painful work you can't make a strong case. And if the site doesn't look like any pattern for a known kinase -- good luck! There really aren't great methods for solving this (not to say there aren't some really clever tries).
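
To give a flavor of what those bioinformatic guesses look like -- and why they are only guesses -- here is a sketch of scanning the sequence around a phosphosite for a few textbook consensus motifs. These are simplified rules of thumb, not a validated predictor.

```python
# Simplified kinase consensus motifs scanned against a window around a phosphosite.
import re

consensus = {
    "PKA-like (basophilic)":        r"[RK][RK].[ST]",   # R/K-R/K-x-S/T
    "CDK/MAPK (proline-directed)":  r"[ST]P",           # S/T followed by proline
    "CK2-like (acidophilic)":       r"[ST]..[DE]",      # S/T-x-x-D/E
}

def guess_kinases(window: str) -> list[str]:
    """Return the names of the simplified motifs matching a short sequence window."""
    return [name for name, pattern in consensus.items() if re.search(pattern, window)]

# Hypothetical 11-mer centered on a phosphoserine:
print(guess_kinases("LKRRASVAGLE"))    # matches only the PKA-like rule
```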

Also interesting in this study is the low degree of overlap with previous studies. While the reference set they used is probably quite a bit smaller than the 12K estimate I give, it is still quite large -- and most sites in the new paper weren't found in the older ones. There are in excess of 20 million Ser/Thr/Tyr in the proteome and many are probably never phosphorylated, but certainly a reasonable estimate is that north of 20K are.

For drug discovery, the sort of timecourse data in this paper is another proof-of-concept for discovering biomarkers for your kinase using high-throughput MS approaches (another case can be found in another paper). By pushing for so many sites, the number of candidates goes up substantially, even though many of the sites found aren't modulated in an interesting way, at least in terms of pursuing a biomarker. This is noted in Figure 3 -- for the same protein, the temporal dynamics of phosphorylation at different sites can be quite different.

However, it remains to be seen how far into the process these MS approaches can be pushed. Most likely, the sites of interest will need to be probed with immunologic assays, as previously discussed.

Thursday, November 02, 2006

Metagenomics backlash.

Metagenomics is a burgeoning field enabled by cheap sequencing firepower -- which grows cheaper each year. You take some interesting microbial ecosystem (such as your mouth, a septic tank, the Sargasso sea), perform some minimal prep, and sequence everything in the soup. The results find everything in the sample, not just what you can culture in a dish.

Now in Nature we can see the backlash -- angry microbiologists irked at uneducated oafs stomping on their turf. One complaint (scroll to the bottom) is the oft-used term "unculturable species" -- i.e. the new stuff that metagenomics discovers. Quite appropriately, the microbiologists cry foul on this aspersion against their abilities, as the beasties aren't unculturable, just not yet cultured.

The new letter says 'Amen' and goes on to gripe that sequencing unknown microbes is no way to properly discover biological diversity, only culturing them will do.

IMHO, a lot of this is the usual result of new disciplines with eager, arrogant new members (moi?) wading into the domain of old disciplines. According to my microbiology teaching assistant, a molecular biologist is defined as "someone who doesn't understand the biological organism they are working with". Similar issues of "hey, who's muscling in on my turf?" beset chemistry, as illustrated in this item from Derek Lowe's excellent medicinal chemistry blog.

These sorts of spats have some value but aren't terribly fun to watch. Worse, the smoke & dust from them can obscure the real common ground. There is already at least one example of using genome sequence data to guide culture medium design. Perhaps future metagenomic microbiologists will make this standard practice.

Wednesday, November 01, 2006

In vivo nanobodies.

The new print issue of Nature Methods showed up, and it is rare for this journal not to have a cool technology or two in it. If you are in the life sciences, you can generally get a free subscription to this journal.

Antibodies are cool things, but also complex molecular structures. They are huge proteins composed of 2 heavy chains & 2 light chains, all held together by disulfide linkages. Expressing recombinant antibodies is not a common feat -- it is very hard to do given the precise chain ratio & folding required. Trying to express them inside the cytoplasm would be even trickier, as the redox potential won't let those disulfides form.

Camels & their kin, however, have very funky antibodies -- heavy chains only, with no light chains. I've never come across the history of how these were found -- presumably some immunologist sampling all mammals to look for weird antibodies. Because of this structure, they are much smaller & don't require interchain disulfide linkages. In fact, the constant parts of the camelid antibody can be lopped off as well, leaving the very small variable region, termed a nanobody.

The new paper (subscription required for full text, alas) describes fusing nanobodies to fluorescent proteins & then expressing them in vivo. Since only a single chain is needed, the nanobody coding region can be PCRed out & fused to your favorite fluorescent protein. The paper shows that when expressed in cells, these hybrids glow just where you would expect them to. The ultimate vital stain for any protein or modification! With multiple fluorescent protein of different colors, multiplexing is even theoretically possible (though not approached in this paper).

Of course, one is going to need to generate all those nanobodies. There is already a company planning to commercialize therapeutic nanobodies (ablynx). Perhaps another company will specialize in research tool nanobodies -- ideally without the nanoprofits and nanoshareprices which are all too common in biotechnology!

Tuesday, October 31, 2006

Administrative note: RSS Feed

In my usual fashion, I plunged into this after doing all of 1 minute of homework. So I really didn't know what I was doing. Several people have suggested I needed an RSS feed, but I couldn't find that at the pet store :-) Seriously, here it is (I assumed RSS came with things -- I have some learnin' to do!)

http://feeds.feedburner.com/OmicsOmics

If anyone is willing to comment, how gauche is it to host Google AdSense ads on a blog? With a little luck, I could use the proceeds to park a Solexa 1G in my garage, right? :-)

Monday, October 30, 2006

You can't always get what (samples) you want.

A key problem in applying omics approaches to medical research is getting the samples you need.

When I was an undergraduate, I had a fuzzy notion of a scheme for personalized medicine. Some analyzer would take a sample of what ailed you, look at it, and then generate a vial of customized antisense medicine that your doc would inject into you. I drove the pre-med in the lab nuts with my enthusiasm for it.

In graduate school, the analyzer became more clear: expression profiling. Look at the mRNA profile, figure out the disease, and voila, you are cured.

Fast forward to the latter part of my Millennium tenure. Rude surprise: you can rarely get the samples you want.

Most of my later work at Millennium was around cancer, originally because that is the research area I gravitated to & later because that was the one research area left (corporate evolution can be brutal!). Getting cancer samples turns out to be decidedly non-trivial.

If you are working in leukemia or related diseases (such as myeloproliferative syndromes), then things aren't bad. Your target tissue is floating around in the bloodstream & can be gotten with an ordinary blood draw. Patients in our society have been conditioned to expect lots of needle sticks, so this isn't hard.

For multiple myeloma and some lymphomas, you can go into the bone marrow. I'm a needlephobe, so the idea of a needle that crunches on the way in is decidedly unpleasant & sounds painful, and apparently is. Patients will do this infrequently, but not daily.

For a lot of solid tumors and other lymphomas, good luck -- particularly with recurrent disease. The tumors are hidden away (which is why they are often deadly) and quite small (if detected early). In many cases, getting a biopsy is surgical, painful, and perhaps significantly dangerous. You might get one sample; repeat visits are generally out of the question. Melanomas are one possible exception, but only for the primary lesion and not the metastases hiding everywhere.

This has significant implications. For a lot of studies, you would like to watch things over time. For example, what does the expression profile look like before and after drug treatment? How long does it take a pharmacodynamic protein marker to come up and what does its decay look like? Without multiple samples, these studies just can't happen.

Worse, what comes out may not be any good. Surgeons are in the business of saving lives, not going prospecting. Traditional practice is to cut first, then put away the samples after the patient is in recovery. But RNA & protein translational states are fragile, so if you don't pop the sample in liquid nitrogen immediately your sample may go downhill in a hurry. Multiple papers have reported finding expression signatures relating more to time-on-benchtop than any pathological state. It often takes dedicated personnel to perform this -- personnel the surgeons would rather not have 'in their way' (I've heard this first-hand from someone who used to be the sample grabber). A dirty not-so-secret in the business is that fresh frozen tissue just isn't practical for routine practice; you have to go with something else.

That is going to mean going with several less palatable, but more available, options. One is to develop techniques to look at paraffin-embedded sections, which are the standard way of storing pathology samples. There are gazillions of such blocks sitting in hospitals, tempting the researchers. But most of those sat on benchtops for uncontrolled time periods, so there may be some significant noise. Another is to try to fish the tiny number of tumor cells (or DNA) out of the bloodstream or perhaps an accessible fluid from the correct site (mucus from the lung; nipple aspirate for breast cancer). Or, you try to find markers in the blood or skin -- not where you are trying to treat, but easy to get to.

Whether these will work depends on what you are really looking for. For a predictive marker, it seems plausible that shed DNA or an old block might work. On the other hand, for a pharmacodynamic marker these are useless. A good PD marker allows you to measure whether your drug is hitting the target in vivo and at the correct site, and only by getting the real deal is that going to truly work. By necessity some studies use accessible non-tumor tissue, such as a skin punch or peripheral white blood cells, to at least see if the target is being hit somewhere. But that doesn't answer the question of whether the drug is getting to the tumor, a critical question. And many studies still use the traditional oncology PD marker of whether you are starting to destroy the patient's blood forming system.

At ASCO this summer, one speaker in a glioma section exhorted that a central repository for glioma samples must be imposed on the community, with a central authority determining who could do what experiments on which samples. That sort of extreme rationing shows how precious these samples are.

The scarcity of such samples also underlines why sensitive approaches, such as the nanowestern, are so critical. With small sample requirements, you might be able to go with fine-needle biopsies rather than surgical biopsies, or be able to take lots of looks at the same sample (for different analytes).

Of course, things could be worse. What if you go to the trouble of getting a good sample, but then you look in the wrong place in that sample? Well, that's a post for another day.

Sunday, October 29, 2006

Nanowesterns: The future of signal transduction research?

Western blots are a workhorse of biology. When everything goes right, they allow for interrogating the state & quantity of a protein in a cellular system. They can be exquisitely sensitive and specific; Western blot assays have long been used as the definitive test for a number of medical conditions, particularly HIV infection. Given the right antibody, you can detect anything, including miniscule amounts of phosphorylated proteins. And, to a first approximation, they are quantitative.

A Western blot involves several steps. First, the samples of interest are placed in a denaturing buffer, causing the proteins to unfold & disaggregate. The unfolding is performed by large quantities of detergent (primarily SDS, which also shows up in your toothpaste, laundry detergent, dishwashing liquid, etc -- the stuff is ubiquitous) and the disaggregation is assisted by some sort of sulfhydryl compound to destroy disulfide bridges. Such compounds are uniformly smelly, except to a lucky few (I had a graduate school colleague who was smell-blind for them).

Now the samples are loaded on an SDS-PAGE gel, which uses electricity to separate the proteins by size -- approximately. In theory. Once they are separated, the proteins are transferred to a membrane by capillary action or by electrophoresis perpendicular to the first direction. The extra protein binding sites on the membrane are then blocked, often with Carnation non-fat dry milk (I kid you not; the stuff is cheap & works). An antibody for the target of interest is added, and then an antibody to detect the first antibody; this one carries a label of some sort. Occasionally it is a third antibody which detects the 2nd (which bound the first) -- each level can enhance sensitivity. The appropriate detection chemistry is run & voila! You have a Western blot. Between all the steps after the blocking are lots of washes to remove excess reagents.

The beauty of a Western is that the technology is pretty cheap & simple -- I did a bunch of Westerns for my senior thesis in a lab that ran on a shoestring budget -- and I'm all thumbs in the lab. The truly amazing part is that Westerns today are run pretty much the same way. You might buy pre-poured gels, but the basics are all the same.

The problems are legion.

First, this is a decidedly low-throughput assay scheme -- typical gels have maybe two dozen lanes for running. This is one reason it is used as a confirmatory test for HIV and other infections; large scale testing is out of the question.

Second, it is very labor intensive. Setting up the transfer from gel to membrane is inherently a manual process, but somewhat surprisingly it still seems uncommon to automate the later steps or even the washing. During a short-lived Western blot process improvement project I initiated, I discovered that the folks running the blots disliked the washing but also found it a social activity -- everyone is doing something mindless, so there is time to talk (simple fly pushing in a Drosophila lab is similar; the lab I was in almost always had NPR going in the background).

Third, they can require a lot of tuning. Different extraction ("lysis") buffers for the initial extraction, different gel or running conditions, different membranes, antibody dilutions, etc. -- these are all variables one can play with on the blot. Some rules of thumb are out there based on the location of the protein or how greasy it is, but it is clearly more art & lore than science. Some proteins never seem to work. Many result in big messes -- but this is where another advantage of Westerns comes in: the electrophoretic separation can parse the mixture into uninteresting bands plus the band you want, which may well be much fainter than the junk. And, of course, the gels don't always run the right way. Too hot -- trouble. Not poured evenly -- trouble. And please don't drop them on the floor!

Some of the trouble comes from the antibodies, but that is easily a topic for another time. But most is inherent in the Western scheme (no, there was no Dr. Western -- but there was a Dr. Southern and the other compass blots are plays on that).

But they are still extremely useful. Some folks have tried to push the envelope within the boundaries of a conventional Western. Perhaps the best example of this has been commercialized at Kinexus, which has pushed multiplex Westerns to amazing limits. They work carefully to identify sets of antibodies which will not interfere with each other & which also generate non-overlapping bands. One way to think about this is a really good Western antibody generates a single band in the same spot on the gel -- which means the rest of the blot is wasted data. Kinexus tries to maximize the amount of data from one gel. But this is a lot of work.

A new publication (free!) describes an approach that has a bit in common with the Western, but in many ways is altogether a different beast. The work is done by a startup called Cell Biosciences.

The slab gel is replaced by capillaries -- easy to control on the thermal side. SDS-PAGE is replaced by isoelectric focusing (IEF). Instead of blotting to a membrane, the separated proteins are locked onto the wall of the capillary. But other than everything being different, it's a Western!

Isoelectric focusing is a technique for electrophoretic separation of proteins. Instead of size, which SDS-PAGE sorts on, IEF uses protein charge. Each protein contains some amino acids which carry positive charge and some with negative charge, plus positively & negatively charged ends. Post-translational modifications can further stir the pot: phosphorylation adds two negative charges per phosphate, and something big like ubiquitin tacks on a complex mess. On the other end, some modifications, such as acetylation, may replace a charged group with an uncharged one. In a gel with a pH gradient & subject to an electric field, the proteins will migrate to the pH where they have no net charge -- the positively ionized and negatively ionized groups are in perfect balance.
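
To make the charge-balance idea concrete, here is a rough sketch of computing a peptide's net charge as a function of pH and bisecting for the pH where it crosses zero (the isoelectric point). The pKa values are generic textbook numbers and the sequence is invented; adding phosphates (modeled here with two acidic ionizations) drags the zero-crossing toward acidic pH -- the phosphoshift that comes up below.

```python
# Rough Henderson-Hasselbalch model of net charge vs. pH, and a bisection for the pI.
from collections import Counter

PKA_BASIC  = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_ACIDIC = {"Cterm": 3.1, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1,
              "phospho1": 1.2, "phospho2": 6.5}        # phosphate: roughly -1 to -2 over the IEF range

def net_charge(seq: str, ph: float, n_phospho: int = 0) -> float:
    """Net charge of a peptide at a given pH (very crude: ignores local environment)."""
    counts = Counter(seq)
    counts["Nterm"] = counts["Cterm"] = 1
    counts["phospho1"] = counts["phospho2"] = n_phospho
    positive = sum(counts[g] / (1 + 10 ** (ph - pka)) for g, pka in PKA_BASIC.items())
    negative = sum(counts[g] / (1 + 10 ** (pka - ph)) for g, pka in PKA_ACIDIC.items())
    return positive - negative

def isoelectric_point(seq: str, n_phospho: int = 0) -> float:
    """Bisect for the pH of zero net charge; charge falls monotonically as pH rises."""
    lo, hi = 0.0, 14.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if net_charge(seq, mid, n_phospho) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

peptide = "MKRHSSDEYLK"                                    # invented sequence
print(round(isoelectric_point(peptide), 2))                # unmodified pI
print(round(isoelectric_point(peptide, n_phospho=1), 2))   # pI with one phosphate: clearly more acidic
```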

An advantage of this pointed out in the nanowestern paper is that you can load a lot more sample on a gel. For a size separation, only a narrow band of sample can enter the gel because the separation is based on different sized proteins traveling at different speeds. Because IEF is an equilibrium method, you can actually fill the entire capillary with sample and then apply the electric field. This has important sensitivity implications.

The paper also describes an apparatus for automating the whole shebang; quite a contrast with an ordinary Western. The model described runs only a dozen capillaries, but modern DNA sequencers routinely run hundreds simultaneously, so there is plenty of room to grow. Each capillary detects one analyte for one sample, so with hundreds you could process hundreds of samples or detect hundreds of analytes, or some interesting middle ground.

Capillaries are also intriguing because they are at the heart of many lab-on-a-chip schemes. This paper might suggest the notion of a multi-analyte integrated chip.

The paper also describes using multiple fluorescent peptides as internal standards. These are synthesized with the opposite handedness to natural peptides; this doesn't change their IEF properties, but it does make them unpalatable to proteases that might be present in the sample (though in general you use cocktails of inhibitors to prevent those proteases from attacking your sample).

The authors describe using a single antibody to assess the phosphorylation states of two related proteins, ERK1 and ERK2. Such determinations can be challenging on a Western if the two run closely with each other. In an SDS-PAGE gel, the behavior of phosphorylated proteins is maddening -- sometimes they run with the unphosphorylated form and sometimes they form new bands (a "phosphoshift"). Murphy's law rules; whichever behavior you don't want is the one you get! With IEF, phosphorylation should always give a strong phosphoshift, as a weakly ionizable side chain (hydroxyl on a Ser, Thr or Tyr) is replaced with a strongly acidic phosphate.

The paper describes using their system with a few proteins. That's a good start, but I suspect most people will want to see more. A lot more. And with other modifications. Ubiquitination would be particularly interesting, both because it is a big modification and because ubiquitin chains often form; presumably this will lead to a laddering effect. Also interesting to look at would be more complicated phosphorylation systems than the ones examined here, with tens of phosphorylation sites rather than a handful. A reasonable guess is that the approach will still count sites, but if you want to distinguish them, which generally you will, you will still need specific antibodies for each site.

One last plus of their scheme, which they place right in the title & which is the source of the nano moniker: the system is very sensitive, at least with the one analyte tested. This is important for rare or small samples (more on samples in another post). They even claim they might be able to push it from 25 cells down to 1 cell. If this is true, or even if 25 cells is achieved consistently with many antibodies, this will be an impressive feat & make this technique very attractive for signal transduction research.