Monday, December 20, 2010

Google's Ngram Viewer

I've been playing off and on with Google's Ngram viewer since it was announced on Friday. This is the tool that enables you to graph how frequently given words or phrases have been used over time. All sorts of interesting experiments are possible -- for example, try comparing the usage of a word and a synonym vs. an antonym or a euphemism (or, you could examine those three words -- "antonym" seems to be much less frequently used but growing in frequency!).

But, I've already noted some anomalies. The plot for "United States of America" is surprisingly spiky, with remarkably few mentions in the early 1800s. That is perhaps an artifact of the sources available for the Google book digitization project, but it does cast doubt on some of the conclusions being drawn from this tool.

But worse, there are definitely some issues with dating and with automated text recognition. Search for "Genomics", and some awfully early references show up. These seem to fall into two categories: serious book dating errors and text errors. In the former category, I don't believe Nucleic Acids Research published in 1835, and a number of other periodicals seem to be afflicted with similar misdatings. In the latter, "générales" seems to be a favorite to transmute to "genomics".

These issues do not invalidate the tool, but they do urge caution in interpreting results -- particularly if trying to explore the emergence and acceptance of a new term.

An approach to deal with this would be to turn the problem around. A systematic search for anachronistic word patterns could identify misdatings or questionable datings in either direction. Not only would this identify documents transported backwards in time, but also ones which should be flagged for time travel in the other direction. For example, using the tool I discovered that someone sharing my surname co-authored a screed against Masonry back in the 1700s -- and this same work shows up as a modern book due to a reprinting in recent years.
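As a rough illustration of turning the problem around, here is a minimal sketch of an anachronism check. Everything in it is an assumption made for illustration: the `Document` record, the function names, and especially the lookup table of "earliest plausible" years, whose entries are placeholder guesses rather than curated values. It has nothing to do with how Google actually dates books.

```python
from dataclasses import dataclass

# Hypothetical lookup: the earliest year each term could plausibly appear in
# print. These entries are illustrative guesses, not curated values.
EARLIEST_PLAUSIBLE_YEAR = {
    "genomics": 1986,
    "internet": 1970,
}


@dataclass
class Document:
    title: str
    stated_year: int
    text: str


def anachronisms(doc, earliest=EARLIEST_PLAUSIBLE_YEAR):
    """Return the terms that appear in doc but postdate its stated year."""
    # Crude tokenization: split on whitespace, trim punctuation, lowercase.
    words = {w.strip('.,;:!?"()').lower() for w in doc.text.split()}
    return sorted(
        term
        for term, year in earliest.items()
        if term in words and doc.stated_year < year
    )


def flag_suspect_dates(docs):
    """Map each suspicious title to the terms that triggered the flag."""
    flagged = {}
    for doc in docs:
        hits = anachronisms(doc)
        if hits:
            flagged[doc.title] = hits
    return flagged
```

This only catches documents dated too early. Catching a reprint of an old work that is dated as modern would need the mirror-image check, looking for vocabulary that had fallen out of use by the stated year, and OCR errors of the "générales" sort would still produce false alarms.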

But in any case, it is an interesting way to explore language and culture. Even without a little tidying & curation.

Sunday, December 19, 2010

Bone Marrow Registries of Contention, and the Future of Tissue Typing?

This summer I took TNG to the Portsmouth Air Show to enjoy viewing aerobatics and looking at some aircraft up close. As with many such events, there is a vendors section & at this one I came across the Caitlyn Raymond Bone Marrow Registry. Curious to check out someone else's consent form, experience a buccal swab & contribute to a good cause, I signed up. Quick & painless. It's also only the second time I've consented to have my own DNA analyzed, and the first professional job (we sequenced one polymorphism from each student in an undergrad class at Delaware).

I don't regret that decision, but one part that was a bit odd at first was filling out my medical insurance information. Okay, someone has to pay for the DNA testing but it seemed a little odd to stick my insurance with it -- but I didn't give it a lot of thought at the time. Since that time, I've regularly seen the registry at various community events as well as a kiosk at the local mall.

Yesterday's Globe had an article causing me to revisit that memory. U Mass Medical Center runs the Caitlyn Raymond registry, and someone there saw a dubious opportunity and ran with it. The lead for the article focused on the fact that professional models had been used as greeters at many events, helping a very high recruitment rate. Okay, that could be seen as just creative. But, the back end is that U Mass has been charging as much as $4K per sample for testing. YIKES! That's in excess of what I've heard BRCA testing goes for. Now U Mass will be getting a lot of attention from a number of attorneys general.

One point in the article is a concern that the use of models may compromise the informed consent process. The proof, as it continued, will be if registrants from the Raymond pool fail to follow through with donations at an unusually high rate, but given that most will never be contacted it may never be known.

But it got me thinking: since the testing is purely a DNA analysis, then presumably each complete human genome sequence can be used to type an individual. Perhaps even the tests from 23 et al* hit the right markers, or at least some tightly linked ones.

So, is it ethical to reach out to such individuals? Given that I could, in theory, search the released DNA sequences from the personal genome project, would it be reasonable to try to track one down and beg for a donation? Of course, the odds of a successful match are tiny -- but as more and more PGP sequences pile up, the chance of such a search succeeding goes up.

What about non-public DNA databases? Again, suppose a 23 et al had the right markers (or nearly so). Should you have to opt-in to be notified that you are predicted to be a possible donor match? Is there a mechanism to publish profiles to a central database, with an ability to ping the user back if a match is made? And if every newborn is being typed for a few thousand other markers, will testing for transplantation markers also be required?

* -- a great term, I believe originated by Kevin Davies in his $1K genome book.

Friday, December 10, 2010

Is Pacific Biosciences Really Polaroid Genomics?

The New England Journal of Medicine this week carries a paper from Harvard and Pacific Biosciences detailing the sequence of the Vibrio cholerae strain responsible for the outbreak of cholera in Haiti. The paper and supplementary materials (which contain detailed methods and some ginormous tables) are free for now. There's also a nice piece in BioIT World giving a lot of backstory. Not a few other media outlets have carried it as well, but that's where I've read about it.

All in all, the project took about a month from an initial phone call from Harvard to Pacific Biosciences until the publication in NEJM. Yow!! Actual sequence generation took place 2 days after PacBio received the DNA. And this is sequencing two isolates (which turned out to be essentially identical) of the Haitian bug plus three reference strains. While full sequence generation took longer, useful data emerged three hours after getting on the sequencer (though there are apparently around 10 wall clock hours of prep before you can get on the sequencer). With the right software & sufficient computational horsepower, one really could imagine standing around the sequencer and watching a genome develop before your eyes (just don't shake the machine!).

Between this & the data on PacBio's DevNet site (you'll need to register to get a password), in theory one could find the answers to all the nagging questions about the performance specs. Actually, this dataset is apparently available only as the assembled sequence plus summary statistics for certain aspects of it. For example, apparently dropped bases are focused on C's & G's, so these were discounted.

Read lengths were 1,100+/-170bp, which is quite good -- and this is after filtering out lower quality data -- and 5% of the reads were monsters bigger than 2800 bases. It is interesting that they did not use the circular consensus method, which was previously published in a method paper (which I covered earlier) and yields higher qualities but shorter fragments. It would be particularly useful to know if the circular consensus approach effectively dealt with the C/G dropout issue.

One small focus of the paper, especially in the supplement, is depth of sequence analysis to infer copy number variation. There is a nice plot in Supplementary Figure 2 illustrating how the copy number varies with distance from the origin of replication. If you haven't looked at bacterial replication before, most bacteria have a single circular chromosome and initiate synthesis starting at one point (the 0 minute point in E.coli). In very rapidly dividing bacteria, the cell may not even wait for one round of synthesis to complete before firing off another synthesis round, but in any case in any dividing population there will be more DNA near the origin than near the terminus of replication. Presumably one could estimate the growth kinetics based on the slope of the copy number from ori to ter!
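To make that last speculation concrete, here is a minimal sketch of the calculation. It assumes the textbook Cooper-Helmstetter picture, in which the log of copy number falls off linearly from ori to ter and the ori:ter ratio equals 2^(C/tau), where C is the time needed to replicate the chromosome and tau is the doubling time. The function names, the binning, and the C period value are all hypothetical inputs of mine; nothing here is taken from the paper's methods, and it treats a single replicon at a time.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ori_ter_ratio(frac_dist, depth):
    """Estimate the ori:ter copy-number ratio from binned read depth.

    frac_dist: each bin's distance from the origin as a fraction of the
               ori-to-ter distance (0 = origin, 1 = terminus).
    depth:     mean read depth in each bin (must be positive).
    """
    log_depth = [math.log2(d) for d in depth]
    slope, _ = fit_line(frac_dist, log_depth)
    # log2(depth) drops by log2(ratio) between x = 0 and x = 1.
    return 2 ** (-slope)


def doubling_time(ratio, c_period_min):
    """Doubling time tau = C / log2(ratio), under the assumed model."""
    return c_period_min / math.log2(ratio)
```

For example, a culture whose depth halves between ori and ter has a ratio of 2, so with an assumed C period of 40 minutes the doubling time would also come out at 40 minutes. A real estimate would need an independently measured C period for the organism in question.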

After subtracting out this effect, most of the copy number fits a Poisson model quite nicely (Supplementary Figure 3). However, there is still some variation. Much of this is around ribosomal RNA operons, which are challenging to assemble correctly since they appear in arrays of nearly (or completely) perfect repeats which are quite long. There's actually even a table of the sequencing depth for each strain at 500 nucleotide intervals! Furthermore, Supplementary Figure 4 shows the depth of coverage (uncorrected for the replication polarity effect) at 6X, 12X, 30X and 60X coverage, illustrating how many of the trends are actually noticeable in the 6X data.
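A quick way to see which bins break the Poisson expectation is to compute a tail probability for each one. The sketch below is my own illustration, not the paper's analysis: the function names, the use of the median depth as the Poisson mean, and the cutoff are all assumptions, and it expects integer depths that have already been corrected for the ori-to-ter gradient.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam); k is a non-negative integer count."""
    if k <= 0:
        return 1.0
    # Each term is computed in log space to avoid overflow at large depths.
    cdf = sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k)
    )
    return max(0.0, 1.0 - cdf)


def overcovered_bins(depths, alpha=1e-3):
    """Indices of bins whose depth is improbably high under a Poisson model.

    The Poisson mean is taken to be the median depth (an assumption chosen
    so that a few repeat-heavy bins do not inflate the baseline).
    """
    lam = sorted(depths)[len(depths) // 2]
    return [i for i, d in enumerate(depths) if poisson_sf(d, lam) < alpha]
```

Bins flagged this way would be candidates for collapsed repeats such as the rRNA operons; with thousands of 500-nucleotide bins, the cutoff would need to be tightened to account for the number of bins tested.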

What biology came out of this? A number of genetic elements were identified in the Haitian strains which are consistent with it being a very bad actor and also that it is a variant of a nasty Asian strain.

All-in-all, this neatly demonstrates how PacBio could be the backbone of a very rapid biosurveillance network. It is surprising that in this day-and-age the CDC (as detailed in the BioIT article) even bothered with a pulsed field study; even on other platforms the turnaround for a complete sequence wouldn't be much longer than to do the gel study, and the results are so much richer. Other technologies might work too, but the very long read lengths and fast turnaround offered should be very appealing, even if the cost of the instrument (much closer to $1M than to my budget!) isn't. But, a few instruments around the world serving other customers but with priority given to such samples could form an important tripwire for new infections, whether they be acts of nature or evil persons. Now, it is important to note that this involved a known, culturable bug and the DNA was derived from pure cultures, not straight environmental isolates.

On a personal note, I am quite itchy to try out one of these beasts. As a result, I'm making sure we stash some DNA generated by several projects so that we could use them as test samples. We know something about the sequence of these samples and how they performed with their intended platforms, so they would be ideal test items. None of my applications are nearly as exciting as this work, but they are the workaday sorts of things which could be the building blocks of a major flow of business. Anyone with a PacBio interested in collaborating is welcome to leave me a private comment (I won't moderate it through), and of course my employer would pay reasonable costs for such an exercise. Or, I certainly wouldn't stamp "return to sender" on the crate if an instrument showed up on the loading dock! I don't see PacBio clearing the stage of all competitors, but I do see it both opening new markets and throwing some serious elbows against certain competing technologies.

Friday, December 03, 2010

Arsenic and New Microbes

Yesterday's announcement of a microbe which not only tolerates arsenic but actually appears to incorporate it in place of phosphorus has traveled a typical path for such a discovery: while it is quite a find, the media has generated more than a few ridiculous headlines. Yes, this potentially expands the definition of life, at least in an elemental sense, but it hardly suggests that such life forms exist elsewhere. A similar absurd atmosphere briefly reigned around a discovery of a potentially habitable world around a distant star -- the discoverer was quoted in at least one outlet that his find was guaranteed to have life. Given that we know very little about the probability of life starting, I always cringe when I hear someone announce that such events are either certain or certainly impossible; we simply can't calculate believable odds given our poor knowledge base. On the other end, suggestions have been raised as to this bug being a starting point for bioremediation of arsenic-contaminated aquifers; but really this discovery isn't a huge step in that direction beyond species already known to tolerate the stuff. It's also disappointing that none of the popular news items I've seen have pointed out how a periodic table can be read to show chemical similarity of phosphorus and arsenic.

That said, it is an intriguing discovery. The idea that all those phosphates on the metabolic diagrams might be substituted with arsenate is quite jarring. No reader of this space will be surprised to hear me advocate for immediate sequencing of this bug (if it hasn't already happened and just not yet reported). A microbial genome these days can be roughed out in well under a month (actually, sequence generation for Mycoplasma a decade ago took that long; clearly we can go faster now).

In order to interpret that genome, though, another whole line of experiments is needed. Assuming that the ability of this organism to incorporate arsenate in place of phosphate is confirmed, some of the precise enzymes capable of doing this trick need to be located. Simply finding arsenate-analogs of some key metabolites (such as phosphorylated intermediates in glycolysis) would point at a few enzymes, and then it would be valuable to demonstrate the purified enzymes pulling the trick. The next step then would be to test whether more conventional enzymes have this activity. Despite what many of us learned in various exposures to biochemistry from elementary school on up, enzymes aren't utterly specific for their substrates. Instead, there is a certain degree of promiscuity, though generally not with equal activity. So, to extend my example, if the new bug's triose phosphate isomerase can work on triose arsenates, then testing that activity in well-characterized TPIs would be in order.

Assuming that such enzymes (from E.coli or human or yeast or what-not) do not have the activity, then crystal structures of the arsenate-lover would be an important next step. Of course, repeating this for the whole roster of enzymes in the bug would be quite an undertaking, but perhaps a number could be modeled to see if a consistent pattern of substitutions or other alterations emerges.

At one time, it was in vogue to speculate on life forms which used silicon in place of carbon, given its location one rung down on the periodic table. Did any author ever dare suggest arsenic for phosphate? I doubt it, but perhaps there was some mind playing with the possibilities who wrote it down somewhere (along with a large pile of other guesses that will not pan out).

Wednesday, November 03, 2010

Mild alleles in severe diseases: an opportunity for enlightenment

Monday's Globe had a blurb about a book signing which re-kindled a previous interest of mine. The author, Michael Dana Kennedy, had quit his job as a medical researcher to write the novel, a tale of two brothers ending up on opposing forces in the Pacific during WW2. While the book concept might have enough appeal to go to the back of my infinite reading list, it's the author's backstory that really grabs me.

The reason Kennedy quit his job is that it was perceived as a health threat: he was diagnosed with cystic fibrosis in his mid-50s. This reminded me of an elderly female patient the Gene Sherpa had mentioned who he had diagnosed with CF.

Cystic fibrosis is a very difficult disease, and for a long time few patients made it out of their twenties. I understand that with modern care, including antibiotics and regular respiratory therapy, many patients live substantially longer. But, this underscores what I find interesting about these two patients -- without any treatment at all they have long outlived most of their peers afflicted with cystic fibrosis. Hence, they must have comparatively mild cases. And those should be interesting.

The key question is why are these cases so mild? The simplest answer would be that they carry at least one allele which retains substantial function of CFTR (the gene mutated in CF). The more complex answer would be that they carry other genetic variants which substantially moderate the impact of the defective allele(s). Either answer would be very enlightening, both for CFTR specifically and for better understanding protein function in general.

I would expect that with the PGP and 1000 genomes project and all the other human genome sequencing efforts public and private, many new alleles will be discovered in many well understood disease genes (or as well understood as any disease gene is). A key follow-up to execute, when possible, is to determine which of these alleles had health impact. CF is appealing from this angle because we know a biochemical phenotype (altered salt excretion) which can be measured and we know a possible medical issue to assess (history of frequent respiratory infections). BRCA1 would be another valuable case where we already know many disease alleles, though there the question is more complicated to answer. I'm sure there are many more.

Studying some of these easier cases will, with luck, help shed some light on the avalanche of novel genetic variants which are pouring from germline genome projects -- and an order of magnitude higher from cancer genome projects (since in many tumors there is some combination of deficient DNA repair/replication as well as significant historical exposure to mutagens such as cigarette smoke). Lacking good high-throughput ways to assess most of these functionally, it would behoove the community to leverage the ultimate functional tests -- human survival.

Sunday, October 31, 2010

Plenty of Genomes are Still Fair Game for Sequencing

I've been grossly neglecting this space for an entire month with only the usual excuses -- big work projects, a lot of reading, etc. None good enough. Worst of all, as usual, it's not that I haven't composed possible entries in my head -- they just never get past my fingertips.

Tonight is the night most associated with pumpkins, and an earlier highlight was attending the Topsfield Fair, where the pictured specimen was on display. Amazing as it is, it fell nearly 15 pounds shy of the world record. If you want to try to grow your own, every year the variety which has dominated the winners can be purchased. Nature isn't all though; champion pumpkin growing requires a lot of specialized culture ranging from allowing only a single fruit to set to injecting nutrients just upstream of that fruit.

Sometime in recent memory there were some other blogs noted in GenomeWeb for discussing whether there are any truly remarkable genome sequencing projects left. That has had me pondering what makes for a very interesting species to sequence. Now, both of the bloggers mentioned clearly were not fond of either "K" genome project -- the 1,000 humans or 10,000 vertebrates. There were also some potshots taken at the "delicious or cute" genomes concept. One suggested that no interesting metazoa ("animals") are left.

So, what does make an interesting genome? Well, I can think of several broad categories. I'll try to throw out possible examples of each, though to be honest I wouldn't be surprised if some of these genomes are sequenced or nearly so -- it's very hard to keep track of complete genomes these days!

First, which I think would resonate with those two critical articles, would be genomes with interesting histories -- genomes that might tell us stories purely about DNA. This was the bent of these papers I refer to. In particular, they were thinking of many of the unicellular eukaryotes which are the result of multiple endosymbiont acquisition / genome fusion events. But, I would definitely throw into this category a particular animal: the Bdelloid rotifers, which have gone without recombination for a seeming eternity. Of course, to really understand that genome, you'd need to also sequence one of the less chaste rotifers.

Another hugely interesting class of genomes would be those to shed light on development and its evolution (evo-devo). In particular, there are a lot of arthropod genomes yet unsequenced -- from what I've noted it appears that most sequenced arthropods are either disease vectors, agricultural pests or economically important (plus, of course, the model Drosophila). Even so, I'd guess there are not many more than a dozen complete arthropod genomes so far -- quite a paucity considering the wealth of insects alone. And, if I'm not mistaken, mostly insects and an arachnid or two have gone fully through the sequencer -- where are all the others? By the way, I'd be happy to help with sample prep for the Homarus americanus genome!

Another huge space of genomes worth exploring are those where we are likely to find unusual biochemistry going on. Now, a lot of those genomes are bacterial or fungal, but there are also an awful lot of advanced plants that have interesting & useful biochemical syntheses.

All that said, I find it odd that some don't see the import and utility of sequencing many, many humans and a lot of vertebrates also. It is important to remember that a lot of funding is from the public, and the public considers many of these other pursuits less important than making medical advances. It is easy for those of us in the biology community to see the longer threads connecting these projects to human health or just the importance of pursuing curiosity, but that doesn't always sell well in public.

An optimistic view is that all the frustrated sequencers should hunker down and patiently wait; data generation for new genomes is getting cheaper by the minute, with short reads to fill out the sequence and ultra-long reads to replace physical mapping. A more conservative view holds that bioinformatics & data storage will soon dominate the equation, which might still make it hard to get lots of worthy genomes sequenced.

Personally, I can't stroll a country fair without wanting to sequence just about everything I see on display -- the chickens that look like Philadelphia Mummers, the two yard long squash, bizarrely shaped tomatoes -- and of course, the three quarter ton plus pumpkins.

Tuesday, September 28, 2010

Scenes from the Cancer Personalized Medicine Wilderness

I'm going to attempt to synthesize a number of thoughts which I've long pondered along with a bunch of news items I came across today. With luck, the result will be coherent and I'll not make a fool of myself.

There was a very interesting article last week in the New York Times on a serious ethical dilemma in melanoma and how different specialists in the field are voicing opinions on both sides of the divide. Even better, today I came across an excellent blog post reviewing that article which also added a lot of expert background. I'll summarize the two very quickly.

Metastatic melanoma is an awful diagnosis; the disease is very aggressive. Furthermore, the standard-of-care chemotherapy drug is a very ugly cytotoxic, with nasty side effects and very poor efficacy (more on that later). Sequencing studies have revealed that well over half of metastatic melanomas have a mutant form of the kinase B-RAF (gene: BRAF), most commonly the mutation V600E (which, alas, due to some sequencing error was for a while known as V599E). That's the substitution of an acidic residue (glutamate) for a hydrophobic one (valine), and it is right in the kinase active site.

Now, a biotech called Plexxikon, in conjunction with Roche, has developed an inhibitor of B-RAF called PLX4032. In Phase I trial results reported this summer in the New England Journal of Medicine, very promising tumor regressions were seen. Now remember, this was a single-arm Phase I trial for safety, meaning we don't have an objective comparison to make.

And there begins the rub. To some doctors (and many patients), the combination of great preclinical results, the theoretical and experimental underpinnings for targeting B-RAF in melanoma and the observed regression means we have a winner on our hands and it is now unethical to have a randomized trial comparing the new compound against the standard-of-care.

At the other pole are doctors who worry that we have been fooled before. My standard example to trot out for such cases is a famous CAST cardiovascular trial to which a placebo arm was grudgingly added -- a sound theory had been advanced that suppressing arrhythmias in certain patients would prevent death. CAST was stopped early when it was clear the placebo arm fared far better; the toxicities of the drugs overwhelmed any benefits. Even closer to our current story is the drug sorafenib, which was originally developed as a B-RAF antagonist. Now, there are many in the field who argued that it really wasn't, but Bayer and Onyx got it to market (probably based on its inhibition of numerous other kinases) and the "raf" syllable in the generic name points to their belief in the B-RAF theory. Unfortunately, in randomized clinical trials it failed to work in V600E melanomas.

One idea that was apparently floated by at least one oncologist working in the trials, but rejected by the corporate sponsors, was to try to win approval based on nearly miraculous recoveries seen in some patients on death's door. What the NYT article failed to discuss is whether the FDA would buy that argument; there are many reasons to think they wouldn't -- they really do not like single arm trials, because all too often spurious results occur due to random chance (or rarely, to manipulation of the trial).

An important idea discussed in all this is the concept that once we have established a therapy as efficacious, it is generally unethical to withhold that therapy from patients. But, we are often not on such solid ground even in this area. Clinical trials represent a horrible case of multiple testing; more than a few drugs that squeaked through their trial would not if you ran the trial again; they just got lucky. Don't believe me? Think back to Iressa, which received accelerated approval for lung cancer and then had it withdrawn (only to later be reintroduced). We now know a key piece of that particular puzzle: Iressa works in patients whose tumors have mutant forms of the EGFR. The first trial, by chance, was enriched for such patients and the second trial (also by chance) was not as enriched. Given that the EGFR hypothesis wasn't known, neither trial could have been manipulated.
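The multiple-testing point can be put in numbers with a back-of-the-envelope sketch. The assumptions are mine: the trials are independent, every drug tested is truly ineffective, and each trial has the same false-positive rate (0.05 here, purely as a conventional placeholder rather than any regulator's actual threshold). The function name is hypothetical.

```python
def chance_of_lucky_winner(n_trials, alpha=0.05):
    """Probability that at least one of n_trials independent trials of
    ineffective drugs crosses the significance threshold by chance alone."""
    return 1.0 - (1.0 - alpha) ** n_trials
```

Under those assumptions, `chance_of_lucky_winner(20)` comes out to roughly 0.64: run twenty trials of duds and the odds are better than even that one of them looks like a success.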

But another recent item, covered in a different post on the same blog, reminds us that even well-established clinical approaches may not hold true over time. Screening mammography is a hot potato issue in cancer: can you save lives by screening healthy women for breast cancer? Various studies have tried to ask this question not just for women overall, but by age groups since the incidence of breast cancer and the quality of mammograms changes with patient age. The newest fuel on this fire is a very clever Norwegian study, which I won't attempt to summarize, that suggests that much (but perhaps not all) of the benefit of screening mammography has been eroded by improvements in cancer care. In other words, the advantage of early detection has been blunted by better treatments. Now, I'm not qualified to really review that study, but certainly this is a concept we should keep in mind: the utility of medical strategies may change over time, and not always for the better.

In my mail tonight was a thick magazine-sized volume from Scientific American, which I confess I am not a subscriber of (it's a fine magazine; I just already subscribe to too many fine magazines). This special edition, titled "Pathways: The changing science, business & experience of health", focuses on healthcare with a mix of articles. Some appear to be written by professional writers, while others are thinly-veiled advertisements for various companies.

In scanning the table of contents, I was caught by "Pioneering Personalized Cancer Care", though unfortunately this turns out to be one of the puffier pieces. Written by two principals in the company, it mostly describes N-of-one, a company which has as its customers cancer patients. N-of-one tries to distill the available knowledge on a person's tumor and help them navigate to the most appropriate tests. It's a business model I've sometimes wondered about for myself, since playing an oncologic Sherlock Holmes could be both fascinating and rewarding. On the other hand, the regulatory environment is fraught with uncertainty and most likely this sort of organization will have to rely on wealthy customers willing to pay their own way.

Now, the article did set my teeth on edge early on with the statement "Recently, projects such as the Cancer Genome Atlas have documented thousands of mutations in cancer cells that can lead to unregulated cell growth and prevent apoptosis (cell death), the hallmarks of malignancy". Any regular reader of this space knows that I am a gung-ho proponent of sequencing tumors, but with that comes an obligation to be honest. And the honest truth is that sequencing has yielded thousands of candidates, but only a handful of those have actually been shown to have transforming ability -- there's just no high-throughput way to do that en masse.

But, what N-of-one and others are doing is where I strongly believe the future of oncology lies. But, it will be a complicated place. Getting back to B-RAF, I've heard noise that it has been found in a number of additional tumor types, albeit at low frequency. So, suppose it occurs at 1 in 1000 frequency in some awful tumor type. With routine whole-genome sequencing of tumors, we could detect that. Such sequencing is starting to be used to good effect, as reported recently in Nature. That leads to a conundrum for everyone. For a patient or clinician, do you go with PLX4032, given that we know it targets BRAF -- but knowing that we don't know whether BRAF is really driving your tumor (especially if the mutation is not V600E)? For those wanting to design clinical trials, could you really find enough patients to stock a trial -- or are you willing to have a trial with "any cancer, as long as it has a BRAF mutation"?
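The enrollment problem is simple arithmetic, sketched below. The 1-in-1000 figure is the hypothetical from the paragraph above, the trial size is a number I picked for illustration, and the function names are made up; it also assumes every carrier found would be eligible and willing to enroll, which is optimistic.

```python
def patients_to_screen(target_enrollment, one_in_n):
    """Expected number of tumors to sequence to find target_enrollment
    carriers of a mutation present in 1 of every one_in_n patients."""
    return target_enrollment * one_in_n


def chance_of_any_carrier(n_screened, one_in_n):
    """Probability that screening n_screened patients finds at least one
    carrier, assuming each patient is an independent draw."""
    return 1.0 - (1.0 - 1.0 / one_in_n) ** n_screened
```

With a 1-in-1000 mutation, `patients_to_screen(50, 1000)` says a 50-patient trial would need about 50,000 sequenced tumors of that type, and sequencing 1,000 of them gives only about a 63% chance of turning up even a single carrier.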

This is the challenge that personalized medicine presents us. With genome sequencing (and eventually also routine whole methylome profiling), we can find what makes cancers different -- but how will we ever actually sort through all those differences? Should we move away from randomized trials to going where the science seems to lead us, even knowing that more than a few times there have been dead ends?

I can find only one easy answer to all this: don't trust anyone who offers an easy answer to all this.

Tuesday, September 21, 2010

Review: The $1000 Genome

Kevin Davies' "The $1000 Genome" deserves to be widely read. Readers of this space will not be surprised that there are a few changes I might have imposed had I been its editor, but on the whole it presents a careful and I think entertaining view of the past and possible future of personal genomics.

The book is intended for a far wider audience than geeky genomics bloggers, so the emphasis is not on the science. Rather, it is on some of the key movers-and-shakers in the field and some of the companies which have been dominating this space, ranging from the first personal genetic mapping companies (23 and Me, Navigenics, Pathway Genomics and deCodeMe) to the instrument makers (such as Solexa/Illumina, Helicos, Pacific Biosciences, ABI and Oxford Nanopore) to those working on various aspects of human genome sequencing services (such as Knome and Complete Genomics). Various ups and downs of these companies -- and the debates they have engendered -- are covered as well as the possible impacts on society. Along the way, we see a few glimpses of Davies exploring his own genome and some of the biological history which he seeks to enlighten through these expeditions.

It is not a trivial task to try to explain this field to an educated lay public, but I think in general Davies does a good job. The overviews of the technologies are limited but give the gist of things. Anyone writing in this space is faced with the dilemma of trying to explain too much and losing the main thread or failing to explain and preventing the reader from finding it. Mostly I think he has succeeded in threading this needle, perhaps because only rarely did I feel he had missed. One example I did note was in explaining PacBio's technology; hardly anyone in science will know what a zeptoliter is, let alone someone outside of it. On the other hand, what analogy or refactoring of that term could remove it from the edges of science fiction? Not an easy challenge!

For better or worse, once I've decided I generally like a book like this my next thoughts are what could be removed and what could be added. I really could find little to remove. But, there are a few things I wish were either expanded or had made it in altogether.

It would be dreary to enumerate every company which has ever thrown its hat in the DNA sequencing ring. It is valuable that Davies covers a few of the abject failures, such as Manteia (which did yield some key technology to Illumina when sold for assets) and US Genomics. There is scant coverage, other than by mention, of most of the companies which have but nascent attempts to enter the arena. However, the one story I really did miss was anything about the Polonator. It's not that I really think this system will conquer the others (though perhaps I hope it will hold its own), it just represents a very different tack in corporate strategy that would have been interesting to contrast with the other players.

Davies has been in the thick of the field as editor of Bio IT World, so this is no stitching together of secondary sources. I also appreciated that he includes both the ups and the downs for these companies, emphasizing that this has not been easy for any of them. But, that added to my surprise at several incidents which were left out (believe me, many were left in that I had never heard before). Davies describes how Helicos delivered an instrument to the CRO Expression Analysis, but not that it was very publicly returned for failing to perform to spec. Nor is Helicos' failed attempt to sell themselves mentioned. An interesting anecdote on Complete Genomics is how a wildfire nearly disrupted one of their first human genome runs; left out is the near-death experience of that company when it was forced to either lay off or defer salaries for nearly all of its staff. The section on Complete's founder Rade Drmanac mentioned Hyseq, but not the company (or was it two) which he ran between Hyseq and Complete to try to commercialize sequencing-by-hybridization. This would have added to this portrait of determination -- and the travails of the corporate arena. I was also surprised that the short profile of Sydney Brenner as a personal genomics skeptic didn't include the fact he invented the technology behind Lynx, which was another early attempt in non-electrophoretic sequencing. Some would see that as irony.

Another area I would like to have seen expanded was the exploration of groups such as Patients Like Me, which are windows on how much people are willing to chance disclosing sensitive medical information. One section explores the fact that several prominent persons interested in this field became so when their children were diagnosed with rare recessive disorders, leading them to ponder whether they would have made the same marriage had they known in advance of this danger. I was surprised that little of the existing experience in this area was explored; I believe the Ashkenazi population has dealt with this in screening for Tay-Sachs and other horrific disorders which are prevalent there.

The book is stunningly up-to-date for something published at the beginning of September; some incidents as late as June are reported. Despite this, I found little evidence of haste. I'm still trying to figure out what a "nature capitalist" is, but that's the only case I spotted of a likely mis-wording.

Davies briefly explores possible uses of these sequencing technologies beyond our germline sequences, but only very briefly. Personally, I think that cancer genomics will have a more immediate and perhaps greater overall impact on human medicine, and wish it had gotten a bit more in-depth treatment.

Davies is an expatriate Brit, living not very far from me. The sections on the possible impact of widespread genome sequencing on medicine are written almost entirely from a U.S. perspective, with our hybrid public-private healthcare system. I suspect European readers would hunger for more discussion of how personal genomics might be handled within their socialized medical systems and different histories of handling the ethical issues (Germany, I believe, has pretty much banned personal genomics services). On this side of the pond, he does a nice job of showing how different state agencies have charged into the breach left, until recently, by the FDA.

Okay, too many quibbles. Well, maybe one last one -- it would have been nice to see more on some of the academic bioinformaticians who have created such wonderful and amazing open-source tools as Bowtie and BWA.

As I mentioned above, Davies injects a good amount of himself into all this. I've encountered books (indeed, one recently on moon walkers), in which this becomes a tedious over-exposure to the author's ego. This is not such a book. The personal bits either link pieces of the story or make them more approachable. We find out that he has already attained a greater age than his father did (due to testicular cancer, one of the few cancers in which overwhelming progress has been made), leading to questions he hopes his genome can answer. Hence, his trying out of pretty much all of the array-based personal genetic services. But, he does not address one question that the book raised in my mind: will the royalties from this project fund a complete Davies genome?

Saturday, September 11, 2010

ARID1A A Fertile Ground for Mutations in Ovarian Clear Cell Carcinoma

Although ovarian clear cell carcinoma does not respond well to conventional platinum–taxane chemotherapy for ovarian carcinoma, this remains the adjuvant treatment of choice, because effective alternatives have not been identified.

This sentence is a depressing reminder of the status of medical treatment of far too many tumor types. Present in roughly 12% of U.S. ovarian cancer cases, ovarian clear cell carcinoma (OCCC) is a dreadful diagnosis.

Two papers this week made a significant step forward in understanding the molecular basis -- and heterogeneity -- of this horror. Seemingly the finale of an old-fashioned race to publish, groups centered at the British Columbia Cancer Center (in New England Journal of Medicine) and Johns Hopkins University (in Science) published papers with the same headline finding: inactivating mutations in the chromatin regulating gene ARID1A (whose gene product is known as BAF250) are a key step in many -- but not all -- OCCC. I'll use the shorthand Vancouver and Baltimore to refer to the respective groups.

Both papers got here by the largest applications of second generation sequencing to cancer so far published. The Vancouver work relied on transcriptome sequencing (RNA-Seq) of a discovery cohort of 18 patients; the Baltimore group used hybridization targeted exome sequencing on just 8 patients. Both used Illumina paired-end sequencing for the discovery phase; Vancouver also used the same platform for validation on a larger cohort.

Whole genome sequencing is likely the future for cancer genomics. A non-cancer paper just published 20 genomes in one shot, underscoring how this is becoming routine with easy samples & a work which is apparently in press (I have no inside knowledge; it has been discussed at several public meetings) will have perhaps a dozen human genomes in it. But, there are still cost advantages to focusing on expressed genetic regions (and perhaps a bit more) and perhaps further information to be gleaned from actually looking at gene expression. These two papers give an opportunity, albeit a bit constrained, to compare the two approaches.

One interesting note comes straight out of the Vancouver data. After finding ARID1A mutations in 6/18 discovery samples, they re-screened those samples plus 211 additional samples. In total this set included 1 OCCC cell line, 119 OCCC, 33 endometrioid carcinomas and 76 high-grade serous carcinomas. The validation screen was by long-range PCR (mean product size 2067 bp) products sheared and sequenced on the Illumina. One exon proved troublesome and required further PCR and sequencing by Sanger. In any case, the key bit here is in the discovery cohort this approach found ARID1A mutations which had been missed by the original RNA-Seq. As the authors state, a likely culprit is nonsense mediated decay (NMD). It would be interesting to go into their dataset to see if these samples had a markedly lower expression of ARID1A, though I don't have easy access to it (it has been deposited, but with protections that should be the subject of a future post).

One interesting contrast between the two studies is the haul of genes. The Vancouver group found ARID1A as a recurrently mutated gene; the Hopkins group not only bagged ARID1A but also KRAS, PIK3CA and PPP2R1A. KRAS and PIK3CA are well-known oncogenes in multiple tumor types and had previously been implicated in OCCC, but PPP2R1A is a novel find. The Vancouver group did specifically search for KRAS and PIK3CA mutants in their cohorts by PCR assays and found one patient sample and one cell line with KRAS mutations. Again, it would be interesting to review the RNA-Seq data to generate hypotheses as to why these were not found in the Vancouver set. On the other hand, the RNA-Seq data did identify one case of a rearranged ARID1A. While it is possible to use hybridization capture to identify gene fusions, this cannot be practically done in a hypothesis-free manner. In other words, without advance interest in ARID1A that approach would not work. In addition, CTNNB1 (beta catenin) mutations had been found previously in OCCC and were specifically checked (and found) by the Vancouver group, but none were reported by the Baltimore group. One final small discrepancy: both groups looked at cell line TOV21G for their mutations of interest and both found the same activating KRAS and PIK3CA alleles. However, Vancouver found one ARID1A allele but Baltimore found that one and a second one (actually, the two mutations I am calling the same [1645insC and 1650dupC] aren't described precisely the same, though I'm guessing it is a difference in an ambiguous alignment).

One other surprise is that TP53 (p53) and PTEN mutants had apparently been reported either for OCCC or endometriosis-associated tumors, yet neither group reported any.

An analysis that is not explicitly found in either paper but I feel is valuable is to look at the co-occurrence of these mutations. If we look only at patient samples, then the big take-home is that neither group saw co-occurrence of KRAS and ARID1A (the TOV21G cell line is at odds with this conclusion). Mutually-exclusive mutations have been seen in many tumors. For example, KRAS mutations are generally mutually-exclusive with other mutations in the RTK-RAS-RAF-MAPK pathway. In contrast, ARID1A mutations are found in conjunction with mutations in CTNNB1, PIK3CA and PPP2R1A -- one patient sample in the Baltimore data was even triple mutant for ARID1A, PIK3CA and PPP2R1A. About 30-40% of samples are mutated for none of these genes as far as this data can tell; the hunt for further causes will continue. Will they be epigenetic? Mutations in regulatory elements?
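For anyone wanting to try this sort of tabulation on their own mutation calls, here is a minimal sketch. The sample IDs and genotypes below are invented placeholders, not data from either paper, and the function names are my own. It simply counts how often each gene pair is mutated in the same sample and lists pairs that never appear together.

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy calls: sample ID -> set of mutated genes.
# Placeholders for illustration only, NOT data from either paper.
calls = {
    "S1": {"ARID1A", "PIK3CA"},
    "S2": {"ARID1A", "PIK3CA", "PPP2R1A"},
    "S3": {"KRAS"},
    "S4": {"ARID1A", "CTNNB1"},
    "S5": set(),  # no mutation in any of the genes examined
}

def cooccurrence(calls):
    """Count, for every gene pair, how many samples carry both mutations."""
    pair_counts = Counter()
    for genes in calls.values():
        # sorted() gives each pair a canonical (alphabetical) order
        for a, b in combinations(sorted(genes), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def never_together(calls):
    """Gene pairs that are each mutated somewhere but never in one sample."""
    seen = sorted(set().union(*calls.values()))
    together = cooccurrence(calls)
    return [pair for pair in combinations(seen, 2) if together[pair] == 0]

print(cooccurrence(calls))
print(never_together(calls))
```

With cohorts this small, a pair that never co-occurs is only a hint; some formal test (Fisher's exact test is one common choice) would be needed before calling two mutations mutually exclusive.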

Another interesting comparison is simply the number of mutations per sample. The Hopkins exome data typically has very small numbers of mutations (after filtering out germ line variants); as few as 13 in a sample and as many as 125 -- and the high number was from a tumor which had previously been treated with DNA-damaging agents (all of the other tumors in the Hopkins study were treatment naive). In contrast, the Vancouver data often found more than 1000 non-synonymous variants per tumor. Unfortunately, no clinical history information is available for the Vancouver cohort, so we don't know if this is from DNA-damaging therapeutics or differences in the sequencing or variant filtering. In an ideal world, we could filter each data set with the other group's filtering scheme to see how much of an effect that would have.

The Vancouver group went beyond sequencing to examine samples by immunohistochemistry (IHC) for expression of the ARID1A gene product, BAF250. There is a strong, but imperfect, negative correlation between mutations and BAF250 expression. Some mutated but BAF250-expressing samples may be explained by the target of the antibody; the truncated forms may still express the correct epitope. Alternatively, ovarian cells may be very sensitive to the dosage of this gene product (in some samples both wt and mutant alleles were clearly found in the RNA-Seq data). Also of interest will be samples lacking expression but unmutated; these may be the places to identify further mechanisms for tumors to eliminate BAF250 expression.

The Vancouver study illustrates one additional bonus from RNA-Seq data: a list (in the supplemental data) of genes differentially expressed between ARID1A mutant and ARID1A wild-type cells.

Another interesting bit from the Vancouver paper is looking at two cases in which the tumor was adjacent to endometrial tissue. In one of these, the same truncating mutation was found in the adjacent lesion and tumor -- but not in a distant endometriosis. Hence, the mutation was not driving the endometriosis but occurred afterwards.

I'm sure I'm short-shrifting further details from the paper; there's a lot of data packed in these two reports. But, what will it all mean for ovarian cancer patients? Alas, none of the genes save PIK3CA are obvious druggable targets. PIK3CA encodes the alpha isoform of PI3 kinase, a target many companies are working on. But that wasn't novel to these papers. PPP2R1A is a regulatory subunit of a protein phosphatase and the mutations are concentrated on a single amino acid, suggesting these are activating mutations (as seen in ARID1A, inactivating mutations can sprawl all over a gene). Phosphatases have not been a productive source of drugs in the past, but perhaps that can be changed in the future. Chromatin regulation is a hot topic, but ARID1A is deficient here, not active. Given that tumors can apparently live with two mutated copies, the idea of further inactivating complexes with ARID1A mutations is probably not a profitable one. But, perhaps there is a yin-yang relationship with another chromatin regulator which can be leveraged. In other words, perhaps inhibiting an opposing complex could restore balance to the cell's chromatin regulation and inhibit the tumor. That's the sort of work which can build off of the foundation these two cancer genomics papers have provided.

Kimberly C. Wiegand, Sohrab P. Shah, Osama M. Al-Agha, Yongjun Zhao, Kane Tse, Thomas Zeng, Janine Senz, Melissa K. McConechy, Michael S. Anglesio, Steve E. Kalloger, Winnie Yang, Alireza Heravi-Moussavi, Ryan Giuliany, Christine Chow, John Fee, Abdalnas (2010). ARID1A Mutations in Endometriosis-Associated Ovarian Carcinomas New England Journal of Medicine : 10.1056/NEJMoa1008433

Jones S, Wang TL, Shih IM, Mao TL, Nakayama K, Roden R, Glas R, Slamon D, Diaz LA Jr, Vogelstein B, Kinzler KW, Velculescu VE, & Papadopoulos N (2010). Frequent Mutations of Chromatin Remodeling Gene ARID1A in Ovarian Clear Cell Carcinoma. Science (New York, N.Y.) PMID: 20826764

Tuesday, August 31, 2010

Worse Could Be Better

My eldest brother was in town recently on business & in our many discussions reminded me of the thought-provoking essay "The Rise of 'Worse is Better'". It is on a thought train similar to Clayton Christensen's books -- sometimes really elegant technologies are undermined by ones which are initially far less elegant. In the "WiB" case, the more elegant system is too good for its own good, and never gets off the ground. In Christensen's "disruptive technology" scenarios, the initially inferior serves utterly new markets priced out by the more elegant approaches, but the inferior technology then nibbles slowly but surely to replacing the dominant one. But a key conceptual requirement is to evaluate the new technology on the dimensions of the new markets, not the existing ones.

I'd argue that anyone trying to develop new sequencing technologies would be well advised to ponder these notions, even if they ultimately reject them. The newer and more different the technology, the longer they should ponder. For it is my argument that there are indeed markets to be served other than $1K high quality canid genomes, and some of those offer opportunities. Even existing players should think about this, as there may be interesting trade-offs that might go after totally new markets.

For example, I have an RNA-Seq experiment off at a vendor. In the quoting process, it became pretty clear that about 50% of my costs are going to the sequencing run and the other 50% of costs to library preparation (of course, within both of those are buried various other costs such as facilities & equipment as well as profit, but those aren't broken out). As I've mentioned before, the costs of the sequencing are plummeting but library construction is not on such a steep trend.
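To see where that trend leads, here is a rough sketch. The halving rate is a made-up assumption for illustration, not a vendor figure. It tracks what share of the bill goes to library prep if sequencing keeps getting cheaper while prep stays flat.

```python
def library_prep_share(seq_cost, prep_cost, seq_drop_per_step, steps):
    """Fraction of total cost going to library prep after each step,
    assuming sequencing cost falls by a fixed factor and prep stays flat."""
    shares = []
    for _ in range(steps + 1):
        shares.append(prep_cost / (seq_cost + prep_cost))
        seq_cost /= seq_drop_per_step
    return shares

# Start from the roughly 50/50 split described above (arbitrary units),
# with a hypothetical halving of sequencing cost at each step.
shares = library_prep_share(50.0, 50.0, 2.0, 3)
print([round(s, 2) for s in shares])
```

Under those assumptions library prep goes from half the cost to nearly 90% of it within three halvings, which is why removing that step looks so attractive.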

So, what if you had a technology that could do away with library construction? Helicos simplified it greatly, but for cDNA still required reverse transcription with some sort of oligo library (oligo-dT, random primers or a carefully picked cocktail to discourage rRNA from getting in). What if you could either get rid of that step, read the sequence during reverse transcription or not even reverse transcribe at all? A fertile imagination could suggest a PacBio-like system with reverse transcriptase immobilized instead of DNA polymerase. Some of the nanopore systems theoretically could read the original RNA directly.

Now, if the cost came down a lot I'd be willing to give up a lot of accuracy. Maybe you couldn't read mutations out or allele-specific transcription, but suppose expression profiles could be had for tens of dollars a sample rather than hundreds? That might be a big market.

Another play might be to trade read length or quality of an existing platform for more reads. For example, Ion Torrent is projected to initially offer ~1M reads of modal length 150 for $500 a pop. For expression profiling, that's not ideal -- you really want many more reads but don't need them so long. Suppose Ion Torrent's next quadrupling of features came at a cost of shorter reads and lower accuracy. For the sequencing market that would be disastrous -- but for expression profiling that might be getting in the ballpark. Perhaps a 16X the initial chip -- but with only 35bp reads -- could help drive adoption of the platform by supplanting microarrays for many profiling experiments.
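The trade can be put in numbers with a few lines. This is only a sketch: the 1M reads and 150 bp come from the projection above, while the "16X chip with 35bp reads" is my hypothetical.

```python
def chip_output(reads, read_length_bp):
    """Total bases from a run with the given read count and read length."""
    return reads * read_length_bp

initial_reads = 1_000_000   # ~1M reads, from the projection quoted above
initial_length = 150        # modal read length in bp

# Hypothetical "16X" chip that trades read length for feature count.
dense_reads = initial_reads * 16
dense_length = 35

print(chip_output(initial_reads, initial_length))  # total bases, initial chip
print(chip_output(dense_reads, dense_length))      # total bases, dense chip
print(dense_reads // initial_reads)                # fold increase in read count
```

Total bases rise less than four-fold, but the number of reads, which is what counts for expression profiling, rises 16-fold.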

One last wild idea. The PacBio system has been demonstrated in a fascinating mode they call "strobe sequencing". The gist is that the read length on PacBio is largely limited by photodamage to the polymerase, so letting the polymerase run for a while in the dark enables spacing reads apart by distances known to some statistical limits. There's been noise about this going at least 20K and perhaps much longer. How long? Again, if you're trapped in "how many bases can I generate for cost X", then giving up a lot of features for such long strobe runs might not make sense. But, suppose you really could get 1/100th the number of reads (300) -- but strobed out over 100Kb (with a 150bp island every 10Kb). I.e. get 5X the fragment size by giving up about 99% of the sequence data. 100 such runs would be around $10K -- but would give a 30,000 fragment physical map with markers spaced about every 10Kb (and in runs of 100Kb). For a roughly 3Gb mammalian genome, those 30,000 fragments span about 3Gb in total, or roughly 1X physical coverage per 100 runs before any loss due to unmappable islands. A deep map would take proportionally more runs, but that sort of long-range linking information is still not shabby at all!
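The arithmetic behind this thought experiment is easy to check. In the sketch below the read counts, spans and island sizes are the ones given above, while the 3 Gb genome size is my own assumption.

```python
GENOME_BP = 3_000_000_000    # assumed ~3 Gb mammalian genome (my assumption)

reads_per_run = 300          # 1/100th of a normal run's reads
span_bp = 100_000            # each strobed fragment spans 100 Kb
island_bp = 150              # length of each sequenced island
island_spacing_bp = 10_000   # one island every 10 Kb
runs = 100

fragments = reads_per_run * runs
islands_per_fragment = span_bp // island_spacing_bp
spanned_bp = fragments * span_bp
sequenced_bp = fragments * islands_per_fragment * island_bp

print(fragments)                 # fragments in the map
print(spanned_bp / GENOME_BP)    # physical (span) coverage
print(sequenced_bp / GENOME_BP)  # actual sequence coverage
```

With these inputs the spanned coverage comes to about 1X per 100 runs, so deeper physical coverage would need proportionally more runs or more reads per run.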

Now, I won't claim anyone is going to make a mint off this -- but with serious proposals to sequence 10K vertebrate genomes, such high-throughput physical mapping could be really useful and not a tiny business.

Sunday, August 29, 2010

Who has the lead in the $1K genome race?

A former colleague and friend has asked over on a LinkedIn group for speculation on which sequencing platform will deliver a $1K 30X human genome (reagent cost only). It is somewhat unfortunate that this is the benchmark, given the very real cost of sample prep (not to mention other real costs such as data processing), but it has tended to be the metric of most focus.

Of existing platforms, there are two which are potentially close to this arbimagical goal (that is, a goal which is arbitrary yet has obtained a luster of magic through repetition). ABI's SOLiD 4 platform can supposedly generate a genome for $6K, though even with pricing from academic core labs I can't actually buy that for less than about $12K (commercial providers will run quite a bit more; they have the twin nasty issues of "equipment amortization" and "solvency" to deal with). The SOLiD 4 hq upgrade is promised for this fall with a $3K/genome target. Could Life Tech squeeze that out? I'm guessing the answer is yes, as the hq does not use an optimal bead packing. Furthermore, the new paired end reagents will offer 75 bp reads in one direction but only 25 in the other. I've never understood why a ligation chemistry should have an asymmetry to it (though perhaps it is in the cleavage step), so perhaps there is significant room for improvement there. Of course, those possible 40 cycles are not free, so whether this would help with cost/genome is not obvious (though it would be advantageous for many other reasons). Though, since they can currently get a 30X genome on one slide, longer reads would enable packing more genomes per slide & perhaps that's where the accounting ends up favoring longer reads.

Complete Genomics is the other possible player, but we have an even murkier lens on the reagent costs per genome, given that Complete deals only in complete genomes and only in bulk. But, they do have to actually ensure they are not losing money (or at least, with their IPO they won't be able to hide the bleed). Indeed, Kevin Davies (who has a book on $1K genomes coming out) replied on the thread that Complete Genomics has already declared itself to be at $1K/genome in reagent costs. Perhaps we should move the target to something else (Miss Amanda suggests that $1K canid genomes are far more interesting).

What about Illumina? With HiSeq, they are supposedly at $10K/genome, and many have noted that the initial HiSeq specs were for a lower cluster packing than many genome centers achieve. That also brings up an interesting issue of consistency -- how variable are cluster packings & therefore the output per run? In other words, what sigma are we willing to accept in our $1K/genome estimate? Also, the HiSeq specs were for shorter reads than the 2 x 150 paired end reads that are quite common in 1000 genomes depositions in the SRA (how much longer can Illumina go?).

So, perhaps any of these three existing platforms might meet the mark (454 is a non-starter; piling up data cheaply is not its sweet spot). What about the ones in the wings? Of course, these are even murkier and we must rely even more on their maker's projections (and potentially, wishful thinking).

IonTorrent's technology (to be re-branded by Life Tech?) isn't nearly there right now. For $500 (the claim is) you'd get 150Mb of data, or about 0.1X for $1000, so we need about 300X improvement. However, there should be a lot of opportunity to improve. The one touted most in the past is further improvement in the feature density; Ion Torrent was apparently already working on a chip with about 4X the number of features. If we round 300 to 256, then that would only be 4 rounds of quadruplings. If Life could pump those out every 6 months, then that would only be two years to a $1K genome. Who knows how realistic that schedule would be?

But IonTorrent could push on other dimensions as well. Because the flowcell itself is a huge chunk of the cost of a run, squeezing longer read lengths should be possible. Since 454 gets nearly 500 basepair reads routinely (and up to a kilobase when things are really humming), perhaps there is a factor of nearly 4 to get from longer reads. In a similar manner, a paired-end protocol could potentially double the amount of sequence per chip (at a cost of perhaps a bit more than double the runtime; not such a big deal if the run is really an hour). Could that be done? I think I have the schematic for an approach (which might also work on 454); trade proposals for sequencing instruments will be put to my employer for consideration! Finally, as noted in a thread on SEQAnswers, IonTorrent is apparently achieving only about a 1/8th efficiency in converting chip features to sequence-generating sites; better loading schemes might squeeze another few fold out. So perhaps IonTorrent really is 1-2 years away from having $1K genomes (much more likely the 2).

Moving on, could Pacific Biosciences (or the Life Tech StarLight (nee VisiGen)) technology have a shot? Lumping them together (since we have virtually no price/performance information for StarLight), PacBio is initially promising $100 runs generating ~60Mb, so $1K would get you about 0.2X coverage, or about 150-fold off, which we'll round to 128-fold or 7 doublings. I think they've already been said to be testing a chip with twice the density, plus a better loading scheme to yield around 2X -- so perhaps it's only 5 doublings.
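The back-of-envelope estimates for Ion Torrent and PacBio can be reproduced in a few lines. This is a sketch under stated assumptions: a 3 Gb genome (my round number), the vendor-claimed run costs and yields quoted above, and a helper name of my own choosing.

```python
import math

GENOME_BP = 3_000_000_000   # assumed ~3 Gb human genome (my round number)
TARGET_COVERAGE = 30        # the 30X benchmark
BUDGET = 1000               # dollars of reagents

def fold_gap(run_cost, run_yield_bp):
    """Fold improvement in bases per dollar needed to reach 30X for $1K."""
    bases_needed = TARGET_COVERAGE * GENOME_BP
    bases_for_budget = BUDGET * run_yield_bp / run_cost
    return bases_needed / bases_for_budget

ion = fold_gap(500, 150_000_000)     # claimed $500 for 150Mb
pacbio = fold_gap(100, 60_000_000)   # promised $100 for ~60Mb

print(ion, round(math.log(ion, 4), 2))      # gap, and quadruplings to close it
print(pacbio, round(math.log2(pacbio), 2))  # gap, and doublings to close it
```

The gaps come out at 300-fold and 150-fold, which is a bit more than four quadruplings and a bit more than seven doublings respectively, so rounding to 256 and 128 is slightly generous to both platforms.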

Finally, there are the technologies which haven't yet demonstrated the ability to read any DNA, but could do so and then move quickly (or not). In this category are any nanopore-based systems (which is a dizzying array of approaches) and Gnu Bio's sequencing-by-synthesis-in-nanodrops approach. And perhaps a few more. These don't even work yet, so even speculative price performance information isn't available.

Finally, a quick note about what a $1K genome means. The X-prize folks have set very strong standards, standards which are far beyond what any short read technology could hope to accomplish and also far beyond what many sequencing applications need. The organizers did not super-design them for no reason; there are applications which need that rigor and also it will greatly cut down on false positives. But, as the regular stream of papers shows, much lower standards will suffice to get interesting biology of whole human genomes.

Tuesday, August 24, 2010

Lawyers v. Research Funding?

An ongoing personal quest is to attempt to fill in the gaps in my original education, particularly outside the areas of science in which I feel there exist gaping chasms. Through Wikipedia, books and especially recorded college courses, I slowly patch up what the deficiencies of my education (or all too commonly, my youthful deficiencies in attention during that education) have failed to cover. I'm currently making a third pass through a wonderful course on Roman history since I enjoyed it very much the first two times.

During Rome's early expansion it was ruled by rotating sets of elected officials under a system known to us as the Roman Republic. A series of events (known to scholars as the Roman Revolution) over many decades disrupted this system, culminating in the replacement of the Republic with the military dictatorship of the Emperors, which would remain until the fall of the empire. An initiating event in the Revolution was an official named Tiberius Gracchus, who in the service of high-minded ideals (rewarding landless soldiers with their own plots on which to support themselves), changed the nature of Roman politics by introducing mob violence to the process (as well as a certain degree of ruthlessness in dealing with the opposition of colleagues).

I fear that yesterday's court decision regarding embryonic stem cell research represents a similar horrible turn. Now, what most commentators will focus on is the very issue of creating human embryonic stem cells and whether the government should finance this. This is an area in which the proponents of both sides of the issue have deeply and sincerely held beliefs which I feel must be respected, though in the end they are fundamentally irreconcilable. But peripheral to that, the case represents a very scary intrusion of lawyers into the research funding process.

One of the claims made by the plaintiffs (in particular, the researcher James Sherley) is that the new guidelines on what embryonic stem cell research can be funded represent a very real cause of harm to those working on adult stem cell research; they will have more competition for research funding. That is certainly true, if we view research funding for stem cells as a zero sum game (and that is another whole can of balled waxworms I won't deal with). The danger now is that every possible change in federal (or even private?) funding aim will be an opportunity for litigators to intrude. Wind down project X to fund project Y? LAWSUIT! Either this will dissuade funding from the ebb and flow which is necessary, or a far worse than zero sum game ensues in which funding for science instead funds litigation (or the buy-offs of potential suits which are routine in that field).

Can this genie be stuffed back in the bottle? I'm not legally trained enough to know. Perhaps it was inevitable. Perhaps we need Congress to explicitly forbid it (but would that be legal?) -- and what are the chances of that? Has a terrible Rubicon been crossed? I hope I am wrong in thinking it has.

Saturday, August 21, 2010

Varus! Where are my legions (of data)!?!?

Bring up the subject of outsourcing, and many minds will immediately jump to the idea of a company using outside services to more cheaply replace operations formerly conducted in house. But the other side of the topic is what I frequently experience: outsourcing allows me to access technologies and capabilities which I simply could not afford on my own, or at least try very expensive technologies prior to investing in them. This is very useful, but has its own issues.

I've now gotten data from 4 different large outsourced sequencing projects. Rated on a five star system, they would (in order) be rated less than expected (**), complete failure (*), less than expected (**) and greater than expected (****). Samples for two more projects just shipped out last week. Given that we don't have any sort of sequencer in house (one project above was conventional Sanger) nor can we willy-nilly buy any specialized hardware for target enrichment (two projects involved enrichment), this has been valuable -- though I really wish I could have been able to rate all as greater than expected (or at least one off the charts).

After the quality of the delivered data, my next greatest frustration is with knowing when that data will be delivered. Now a few projects (plus some explicit vendor tests not included in the above) have gone on schedule, but the utter failure had the pain compounded by being grossly overdue (1-3 months, depending on quite how you define the start point) and one of the other projects came in a week overdue.

But even worse than being late is not knowing how late until the data shows up. Partly this revolves around trying to appropriately budget my time, but it also affects transmitting expectations to others awaiting the results.

In an ideal world, I'd have a real-time portal onto the vendor's LIMS -- one cancer model outfit claimed exactly this. But in any case, I'd really like to have regular updates as to the progress of my project -- and especially to what's happening if the vendor has gone into troubleshooting mode.

After all, what these outfits wish to claim is that they will act as an extension of my organization. Now, if the work was going on in house & I was concerned about progress, I could easily pop in and chat with the person(s) working on it. I'm not interested in hanging over someone's shoulder & making them nervous, but I do like to try to at least understand what is going on & what approaches are being used to solve this. Unfortunately, in several outsourcing projects this is specifically what was lacking -- no concrete estimate of a schedule nor any regular communication when projects were overdue.

In a basic sense, I'd like an update every time my project crosses a significant threshold. Now, the exact definition of that is tricky. But, imagine a typical hybridization capture targeted sequencing. The vendor receives my DNA, shears, size selects, ligates adapters, amplifies and has a library. Some QC happens at various stages. Then there is the hybridization, recovery and further amplification. At some point the platform-specific upstream-of-sequencer step occurs (cluster formation or ePCR). Then it goes on the sequencer. Each cycle of sequencing occurs, plus (for Illumina) cluster regeneration and paired end sequencing. Then downstream basecalling (if not in line). Once basecalls are done, then whatever steps occur to get me the data. And that's all the correct workflow: throw in some troubleshooting for problems should they occur.

Now, ideally I could see all of those steps. But how? I really don't want an email after every sequencer cycle. Could something like Twitter be adapted for this purpose?
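One way to picture the middle ground between silence and an email per sequencer cycle is a filter that passes along only new milestones and any troubleshooting notes. The following is a minimal sketch of that idea, not any vendor's actual system; the milestone names, the event tuples and the function name are all hypothetical.

```python
# Hypothetical milestone names, loosely following the workflow described above.
MILESTONES = {
    "sample received",
    "library built",
    "capture complete",
    "on sequencer",
    "basecalling done",
    "data delivered",
}

def updates_worth_sending(events):
    """Yield messages only for new milestones and for troubleshooting."""
    seen = set()
    for stage, detail in events:
        if stage == "troubleshooting":
            # Always surface problems, with whatever detail the vendor logged.
            yield f"ALERT: troubleshooting ({detail})"
        elif stage in MILESTONES and stage not in seen:
            seen.add(stage)
            yield f"Milestone: {stage}"
        # Anything else, such as an individual sequencer cycle, is dropped.

# A made-up event stream of the sort a vendor LIMS might record.
events = [
    ("sample received", ""),
    ("library built", ""),
    ("troubleshooting", "low library yield, repeating amplification"),
    ("capture complete", ""),
    ("on sequencer", ""),
    ("sequencer cycle", "cycle 1"),
    ("sequencer cycle", "cycle 2"),
    ("basecalling done", ""),
]

for message in updates_worth_sending(events):
    print(message)
```

The per-cycle events never reach the customer, but the troubleshooting note does, which is exactly the part I most want to hear about.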

Happily, in the recent experience that prompted the title for this post, the data did finally come in (after some hiccups with delivery) and was quite exciting. So I'm not tearing my lab coat like Augustus. But when vendors try to solicit my business or when I'm rating the experience afterwards, the transparency and granularity of their communication will be a critical consideration. Vendors who are reading this, take note!

Tuesday, August 17, 2010

Life Tech Gobbles Ion Torrent

Tonight's big news is that Life Technologies, the giant formed by the merger of ABI and Invitrogen, has acquired Ion Torrent for an eye popping $375M (mixed cash & stock) with another $325M possible in milestones and such.

I'm not shocked Ion Torrent was shopping itself; by linking with an established player Ion Torrent can access marketing channels -- an area in which they have displayed a serious handicap. While Ion Torrent was adept at creating buzz with founder Jonathan Rothberg's rock star presentations and their sequencer giveaway contests, actual marketing infrastructure to follow up on all the leads generated through those efforts was clearly lacking (as in, they have yet to contact me!).

One interesting detail of the press release is the fact that the price point for their sequencer is placed at "below $100K"; Ion Torrent had previously billed their machine at under $50K. Is this a real shift, or does it simply reflect the true cost once sample prep gear is thrown in?

Now, there are several interesting angles to watch. First, how will Life position their full lineup of sequencers -- now that they have 3 different technologies (SOLiD, Ion Torrent & VisiGen) with very different performance characteristics? Plus, they had the SOLiD PI in line to be an entry level second generation sequencer -- how will this affect that?

Another area to watch is how tightly Ion Torrent is tied into the SOLiD line. While the chemistry is very different, there are opportunities. For example, can the EZ Bead emulsion PCR robots be used for Ion Torrent sample prep (with the whole sample prep issue being a big black box for the technology)? Will the same library prep reagents for SOLiD be usable with Ion Torrent? I'd love to see that -- especially if Ion Torrent drives volumes which ultimately result in driving kit costs down. Of course, the biggest question is when can people actually buy one of the beasts?

Roche/454 seemed like a more obvious partner for Ion Torrent -- very similar chemistries & a tie-up of that sort might have meant a very rapid extension of Ion Torrent read lengths. Roche should be quite nervous; between Ion Torrent and Pacific Biosciences they are going to be under extreme pressure in long read niches and their next technology (GE's nanopores) is unlikely to be ready for many years. Ion Torrent could have also been an interesting play for a reagent company looking to jump into sequencing instruments. Such a company could have also brought the right sales network into play. A non-bio player could have happened, but I doubt that would have ended well -- Ion Torrent needs to complete their act & get their machine out to biologists.

Saturday, August 07, 2010

Perchance to dream

I had an amusing dream the other night. Nothing earth shattering: neither starved calves consuming fatted ones nor serpentine molecular orbitals. But, an amusing spin on something I had recently discussed with a friend.

In the dream, I've apparently gotten to a presentation late -- and just missed the announcement of the sample preparation upstream of the Ion Torrent instrument. I look to my side & it's a guy from Ion Torrent with all sorts of stuff in front of him, but when I try to ask him what I missed, he indicates silence. And then I wake up.

How exactly one goes from DNA to the instrument still appears to be a mystery. There is certainly nothing on the Ion Torrent website (which is rather focused on flash, not substance) to suggest it. A reasonable assumption is emulsion PCR, but there are other candidates (e.g. rolling circle).

Given that this is a rather important piece of the puzzle, there are several common guesses for why it is a mystery. One is that there are IP issues to still be resolved. In a similar vein, Ion Torrent just licensed some IP from a British company (DNA Electronics) which sounds like a near clone in terms of approach. A second is that they are still working out what approach to support. A third is that it isn't flashy enough to be worth mentioning.

Interestingly, Ion Torrent has apparently already sold a machine each to the USGS and NCI, plus there are the ones promised to the grant winners (which alas, I am not one of -- though a winning proposal was a kissing cousin of mine). And Ion Torrent certainly hasn't started beating the bushes hard for sales.

I'm still very eager to try out the Ion Torrent box. While it won't replace some of the other systems for many applications, the cost profile of Ion Torrent will open up very high throughput sequencing to many more labs. I have a number of ideas of how I might use one rather frequently. Now if only they'd try to sell me one -- and ideally in the real world, and not while I slumber!

Sunday, August 01, 2010

Curse you Larry the CEO!

A bit after getting to my current shop, I requested some serious iron for my work and it was decided I would have a Linux box. The question came up as to which flavor, and after canvassing my networks we went with Ubuntu. I had never administered a Linux system before and had to learn the whole package installation procedure, which is so easy even I could learn it. The "apt" tool works beautifully 99% of the time, not only getting and installing the package of interest but also all its dependencies. The occasional exceptions were cases where either the package of interest didn't seem to be available from a package repository or the Ubuntu repositories were behind the version I needed. But in general, it was nice and painless.

Earlier this year, it was clear I needed an Oracle play space and the obvious place was my machine -- not only is it quite powerful, but then any blow-back from any misdeeds of mine would hit only the perpetrator. However, when our skilled Oracle expert contractor tried to install Oracle, not much luck -- Oracle apparently doesn't support Ubuntu well. So the decision was made to switch to Red Hat.

This did not go cleanly -- the admins were fighting with the reinstall most of the week (the RAID drive had protections on it that did not wish to go quietly) but finally the new system was configured on Friday. So on Saturday night, I declared, "Amanda, I know what we are going to do today! Install packages!".

Now, I've actually made a consistent habit here of logging all my installs, so I had a menu of what to try to install. Some quick Googling found some guides to using the different installation tools on Red Hat. So I started trying to install stuff. A few went cleanly, but that is definitely the rarity -- and the worst part is that R is proving to be a major headache.

The problem is trying to get all the dependencies to install, and R has a heap. The fact that many have "-devel" in the title can't make things easy. Worse, one required package, "tetex-latex", is no longer supported by its creator. Despite configuring multiple repositories and trying to download some packages manually, I have made little headway so far. So from that standpoint, at the moment my system is "Busted!".
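Since I have a log of what was installed under Ubuntu, one small aid is a script that turns each logged name into candidate Red Hat names. The sketch below leans on the common convention that Debian-style "-dev" packages are called "-devel" on Red Hat, sometimes without the "lib" prefix. It is only a heuristic: the function name and the sample log entries are made up for illustration, and every candidate still has to be checked against the actual repositories.

```python
def redhat_candidates(debian_name):
    """Return plausible Red Hat names for a Debian/Ubuntu package name.

    Heuristic only: "-dev" becomes "-devel", and a leading "lib" is
    optionally dropped. Real names must be verified in the repository.
    """
    if not debian_name.endswith("-dev"):
        return [debian_name]
    base = debian_name[: -len("-dev")]
    candidates = [base + "-devel"]
    if base.startswith("lib"):
        candidates.append(base[len("lib"):] + "-devel")
    return candidates

# Hypothetical entries from an install log, one package name each.
install_log = ["r-base", "libreadline-dev", "libxml2-dev"]

for name in install_log:
    print(name, "->", ", ".join(redhat_candidates(name)))
```

This does nothing for packages whose names differ entirely between the two worlds, which is of course where most of the pain lies.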

Now, I could blame our contractor, but how was he to know this would be so miserable (though the comment by someone at Red Hat support that this is the first time he'd heard of someone going from Ubuntu to Red Hat does give pause!)? I could also take umbrage with the Linux community, which seems to be a hydra of endless subvariants (Ubuntu, Debian, Red Hat, Red Hat Enterprise, CentOS, Fedora, Mandriva -- and I'm sure that's an incomplete list!). But, it's easiest to blame Oracle, who doesn't support Ubuntu, and if I'm going to do that I'll single out the face of Oracle. On the other hand, it's a bit pointless to hold anger over this against Mr. Ellison. He's a CEO; they don't do much.

Friday, July 30, 2010

A huge scan through cancer genomes

Genentech and Affymetrix just published a huge paper in Nature using a novel technology to scan 4Mb in 441 tumor genomes for mutations, the largest number of tumor samples screened for many genes. Dan Koboldt over at MassGenomics has given a nice overview of the paper, but there are some bits I'd like to fill in as well. I'll blame some of my sloth in getting this out to the fact I was reading back through a chain of papers to really understand the core technique, but that's a weak excuse.

It's probably clear by now that I am a strong proponent (verging on cheerleader) for advanced sequencing technologies and their aggressive application, especially in cancer. The technology used here is intriguing, but it is in some ways a bit of a throwback. Now, thinking that (and then saying it aloud) forces me to think about why I say it. Perhaps this is a wave of the future, though I am skeptical -- but that doesn't detract from what they did here.

The technology, termed "mismatch repair detection", relies on some clever co-opting of the normal DNA repair mechanisms in E. coli. So clever is the co-opting, that the repair mechanisms are used to sometimes break a perfectly good gene!

The assay starts by designing PCR primers to generate roughly 200 bp amplicons. A reference library is generated from a normal genome and cloned into a special plasmid. This plasmid contains a functional copy of the Cre recombinase gene as well as the usual complement of gear in a cloning plasmid. This plasmid is grown in a host which does not Dam methylate its DNA, a modification in E. coli which marks old DNA to distinguish it from newly synthesized DNA.

The same primers are used to amplify target regions from the cancer genomes. These are cloned into a nearly identical vector, but with two significant differences. First, it has been propagated in a Dam+ E. coli strain; the plasmid will be fully methylated. Second, it also contains a Cre gene, but with a 5 nucleotide deletion which renders it inactive.

If you hybridize the test plasmids to the reference plasmids and then transform E. coli, one of two results occurs. If there are no point mismatches, then pretty much nothing happens and Cre is expressed from the reference strand. The E. coli host contains an engineered cassette for resistance to one antibiotic (Tet) but sensitivity to another antibiotic (Str). With active Cre, this cassette is destroyed and the antibiotic resistance phenotype switched to Tet sensitivity and Str resistance.

However, the magic occurs if there is a single base mismatch. In this case, the methylated (test) strand is assumed to be the trustworthy one, and so the repair process eliminates the reference strand -- along with the functional allele of Cre. Without Cre activity, the cells remain resistant to Tet and sensitive to Str.

So, by splitting the transformation pool (all the amplicons from one sample transformed en masse) and selecting one half with Str and the other with Tet, plasmids are selected that either carry or lack a variant allele. Compare these two populations on a two-color resequencing array and you can identify the precise changes in the samples.
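The selection logic is easier to keep straight when written out as a tiny truth table. The sketch below encodes only what is described above (mismatch means Cre is lost, Cre flips the cassette); the function names and amplicon labels are invented for illustration, and real biology is of course not this clean.

```python
def cre_active(has_mismatch):
    """With a mismatch, repair discards the unmethylated reference strand,
    taking the functional Cre allele with it."""
    return not has_mismatch

def phenotype(has_mismatch):
    """Antibiotic phenotype of a cell carrying one heteroduplex plasmid."""
    if cre_active(has_mismatch):
        # Cre destroys the cassette: Tet resistance lost, Str resistance gained.
        return {"Tet": "sensitive", "Str": "resistant"}
    return {"Tet": "resistant", "Str": "sensitive"}

def surviving_pool(amplicons, antibiotic):
    """Names of amplicons whose cells survive selection on one antibiotic."""
    return sorted(
        name
        for name, has_mismatch in amplicons.items()
        if phenotype(has_mismatch)[antibiotic] == "resistant"
    )

# Hypothetical sample: amplicon name -> does the tumor copy differ from reference?
sample = {"amp_001": False, "amp_002": True, "amp_003": False}

print("Tet plate (variant):", surviving_pool(sample, "Tet"))
print("Str plate (matches reference):", surviving_pool(sample, "Str"))
```

Tet selection enriches the amplicons carrying a variant and Str selection enriches those matching the reference, which is what makes the two-color comparison on the array work.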

A significant limitation of the system is that it is really sensitive only for single base mismatches; any sort of indels or rearrangements are not detectable. The authors wave indels away as "typically are a small proportion of somatic mutation", but of course they are a very critical type of mutation in cancer as they frequently are a means to knock out tumor suppressors. For large scale deletions or amplifications they use a medium density (244K) array, amusingly from Agilent. Mutation scanning was performed in both tumor tissue and matched normal, enabling the bioinformatic filtering of germline variants (though dbSNP was apparently used as an additional filter).

No cost estimates are given for the approach. Given the use of arrays, the floor can't be much below $500/sample or $1000/patient. The MRD system can probably be automated reasonably well but with a large investment in robots. Now, a comparable second generation approach (scanning about 4Mb) using any of the selection technologies would probably run $1000-$2000 per sample (2X that per patient), or perhaps 2-4X as much. So, if you were planning such an experiment you'd need to trade off your budget versus being blind to any sort of indels. The copy number arrays add expense but enable seeing big deletions and amplifications, though with sequencing the incremental cost of that information in a large study might be a few hundred dollars.
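For anyone wanting to play with these guesses, the arithmetic is simple enough to script. The numbers below are just my rough estimates from the paragraph above, not figures from the paper, and the two-samples-per-patient factor reflects the tumor plus matched normal design.

```python
SAMPLES_PER_PATIENT = 2  # tumor plus matched normal

def per_patient(cost_per_sample):
    """Scale a per-sample cost to a per-patient cost."""
    return cost_per_sample * SAMPLES_PER_PATIENT

mrd_floor = 500                 # guessed floor for the array-based MRD approach, $/sample
seq_low, seq_high = 1000, 2000  # guessed range for capture plus sequencing, $/sample

print("MRD per patient:", per_patient(mrd_floor))
print("Sequencing per patient:", per_patient(seq_low), "to", per_patient(seq_high))
print("Sequencing vs. MRD:", seq_low / mrd_floor, "to", seq_high / mrd_floor, "fold")
```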

I think the main challenge to this approach is it is off the beaten path. Sequencing based methods are receiving so much investment that they will continue to narrow the price gap (whatever it is). Perhaps the array step will be replaced with a sequencing assay, but the system both relies on and is hindered by the repair system's blindness to small indels. Sensitivity for the assay is benchmarked at 1%, which is quite good. Alas, no discussion was made of amplicon failure rates or regions of the genome which could not be accessed. Between high/low GC content and E. coli-unfriendly human sequences, there must have been some of this.

There is another expense which is not trivial. In order to scan the 4Mb of DNA, nearly 31K PCR amplicons were amplified out of each sample. This is a pretty herculean effort in itself. Alas, the Materials & Methods section is annoyingly (though not atypically) silent on the PCR approach. With correct automation, setting up that many PCRs is tedious but not undoable (though did they really make roughly 80 384 well plates per sample??). But, conventional PCR quite often requires about 10ng of DNA per amplification, with a naive implication of roughly a third of a milligram of input DNA -- impossible without whole genome amplification, which is at best a necessary evil as it can introduce biases and errors. Second generation sequencing libraries can be built from perhaps 100ng-1ug of DNA, a significant advantage on this cost axis (though sometimes still a huge amount from a clinical tumor sample).
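A quick back-of-the-envelope check of the scale involved, assuming one amplicon per well and a flat 10ng of template per reaction (both naive assumptions of mine, since the paper doesn't say how the PCR was done):

```python
import math

amplicons = 31_000        # roughly, per sample
wells_per_plate = 384
ng_per_reaction = 10      # typical conventional PCR input, assumed

plates = math.ceil(amplicons / wells_per_plate)
total_ug = amplicons * ng_per_reaction / 1000   # ng -> ug
library_ug = 1.0          # upper end of the 100ng-1ug library input range

print("384 well plates per sample:", plates)
print("Input DNA for PCR (ug):", total_ug)
print("Fold more DNA than a 1ug sequencing library:", total_ug / library_ug)
```

Even against the top of the library input range, the naive PCR route needs a few hundred fold more DNA.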

Now, perhaps one of the microfluidic PCR systems could be used, but if the hybridization of tester and reference DNAs requires low complexity pools, a technique such as RainDance isn't in the cards. My friend who sells the 48 sample by 48 amplicon PCR arrays would be in heaven if they adopted that technology to run these studies.

One plus of the study is a rigorous sample selection process. In addition to requiring 50% tumor content, every sample was reclassified by a board-certified pathologist and immunohistochemistry was used to ensure correct differentiation of the three different lung tumor types in the study (non-small cell adenocarcinoma, non-small cell squamous, and small cell carcinoma). Other staining was used to subclassify breast tumors by common criteria (HER2, estrogen receptor and progesterone receptor) and the prostate tumors were typed by an RT-PCR assay for a common (70+% of these samples!) driver fusion protein (TMPRSS2-ERG).

Also, it should be noted that they experimentally demonstrated a generic oncogenic phenotype (anchorage independent growth) upon transformation with mutants discovered in the study. That they could scan for so much and test so few is not an indictment of the paper, but a sobering reminder of how fast mutation finding is advancing and how slowly our ability to experimentally test those findings is keeping up.
Kan Z, Jaiswal BS, Stinson J, Janakiraman V, Bhatt D, Stern HM, Yue P, Haverty PM, Bourgon R, Zheng J, Moorhead M, Chaudhuri S, Tomsho LP, Peters BA, Pujara K, Cordes S, Davis DP, Carlton VE, Yuan W, Li L, Wang W, Eigenbrot C, Kaminker JS, Eberhard DA, Waring P, Schuster SC, Modrusan Z, Zhang Z, Stokoe D, de Sauvage FJ, Faham M, & Seshagiri S (2010). Diverse somatic mutation patterns and pathway alterations in human cancers. Nature PMID: 20668451

Thursday, July 22, 2010

Salespeople, don't forget your props!

I had lunch today with a friend & former colleague who sells some cool genomics gadgets. One thing I've noted about him is whenever we meet he has a part of his system with him; it's striking how often this isn't the case (he's also been kind enough to leave me one).

Now, different systems have different sorts of gadgets with different levels of portability and attractiveness. The PacBio instrument is reputed to weigh in at one imperial ton, making it impractical for bringing along. Far too many folks are selling molecular biology reagents, which all come in the same sorts of Eppendorf tubes.

But on the other hand, there are plenty of cool parts that can be shown off. Flowcells for sequencers are amazing devices, which I've seen far too few times in the hands of salespersons. One of the Fluidigm microfluidic disposables is quite a conversation piece -- and the best illustration for how the technology works. The ABI 3730 sequencer's 96 capillary array was so striking I once took a picture of it -- or I thought I had until I looked through the camera files. The capillaries are coated with Kapton (or a similar polymer), giving them a dark amber appearance. They are delicate yet sturdy, bending over individually but in an organized fashion.

However, my favorite memory of a gadget was an Illumina 96-pin bead array device. The beads are small enough that they produce all sorts of interesting optical effects -- 96 individually mounted opals!

Of course, those gadgets are not cheap. However, any well run manufacturing process is going to have failures, which is a good source for display units. Yes, if you have a really good process you won't generate any defective products, but given the rough state of the field I'm a bit suspicious of a process whose downstream QC never finds a problem. In any case, even if you can't afford to give them away at least the major trade show representatives should carry one. If a picture is worth a thousand words, then an actual physical object is worth exponentially more in persuasive ability.

One final thought. Given the rapidly changing nature of the business, many of these devices have very short lifetimes (in some cases because the company making them has a similarly short lifetime). I sincerely hope some museum is collecting examples of these, as they are important artifacts of today's technology. Plus, I really could imagine an art installation centered around some 3730 capillaries & Illumina bead arrays.

Wednesday, July 21, 2010

Distractions -- there's an app for that

Today I finally gave in to temptation & developed a Hello World application for my Droid. Okay, developed is a gross overstatement -- I successfully followed a recipe. But, it took a while to install the SDK & its plugin for the Eclipse environment plus the necessary device driver so I can debug stuff on my phone.

Since I purchased my Droid in November the idea of writing something for it has periodically tempted me. Indeed, one attraction of Scala (which I've done little with for weeks) was that it can be used to write Android apps, though it definitely means a new layer of complexity. This week's caving in had two drivers.

First, Google last week announced a novice "you can write an app even if you can't program" tool called AppInventor. I rushed to try it out, only to find that they hadn't actually made it available but only a registration form. Supposedly they'll get back to you, but they haven't yet. Perhaps it's because I'm not an educator -- the form has lots of fields tilted at educators.

The second trigger is that an Android book I had requested came in at the library. Now, it's for a few versions back of the OS -- but certainly okay for a start (trying to keep public library collections current on technical stuff is a quixotic task in my opinion, though I do enjoy the fruits of the effort). So that was my train reading this morning & it got me stoked. The book is certainly not much more than a starting springboard -- I'm debating buying one called "Advanced Android Programming" (or something close to that) or whether just to sponge off on-line resources.

The big question is what to do next. The general challenge is choosing between apps that don't do anything particularly sophisticated but are clearly doable vs. more interesting apps that might be a bit much to take on -- especially given the challenge of operating a simulator for a device very unlike my laptop (accelerometers! GPS!). I have a bunch of ideas for silly games or demos, most of which shouldn't be too hard -- and then one concept that could be somewhat cool but also really pushing the envelope on difficulty.

It would be nice to come up with something practical for my work, but right now I haven't many ideas in that area. Given that most of the datasets I work with now are enormous, it's hard to see any point to trying to access them via phone. A tiny browser for the UCSC genome database has some appeal, but that's sounding a bit ambitious.

If I were still back at Codon Devices, I could definitely see some app opportunities, either to demo "tech cred" or to be really useful. For example, at one point we were developing (through an outsource vendor) a drag-and-drop gene design interface. The full version probably wouldn't be very app appropriate, but something along those lines could be envisioned -- call up any protein out of Entrez & have it codon optimized with appropriate constraints & sent to the quoting system. In our terminal phase, it would have been very handy to have a phone app to browse metabolic databases such as KEGG or BioCyc.

That thought has suggested what I would develop if I were back in school. There is a certain amount of simple rote memorization that is either demanded or turns out to expedite later studies. For example, I really do feel you need to memorize the single letter IUPAC codes for nucleotides and amino acids. I remember having to memorize amino acid structures and the Krebs cycle and glycolysis and all sorts of organic synthesis reactions and so forth. I often devised either decks of flash cards or study sheets, which I would look at while standing in line for the cafeteria or other bits of solitary time. Some of those decks were a bit sophisticated -- for the pathways I remember making both compound-centric and reaction-centric cards for the same pathways. That sort of flashcard app could be quite valuable -- and perhaps even profitable if you could get students to try it out. I can't quite see myself committing to such a business, even as a side-line, so I'm okay with suggesting it here.
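To make the flashcard idea a little more concrete, here is a minimal sketch of the core logic using the IUPAC nucleotide codes. It builds cards in both directions (code to bases, bases to code), in the same spirit as my compound-centric and reaction-centric decks. The function names are made up, and a real app would obviously need a user interface and some way of tracking which cards keep getting missed.

```python
import random

# IUPAC single letter nucleotide codes and the bases each one covers.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def build_deck(codes):
    """Make two cards per code: one asking for the bases, one for the code."""
    deck = []
    for code, bases in codes.items():
        deck.append((f"Which bases does {code} stand for?", bases))
        deck.append((f"Which code means {'/'.join(bases)}?", code))
    return deck

def is_correct(answer, expected):
    """Accept answers in any order and case, with or without slashes."""
    cleaned = answer.upper().replace("/", "").replace(" ", "")
    return sorted(cleaned) == sorted(expected)

deck = build_deck(IUPAC_NUCLEOTIDES)
random.shuffle(deck)
question, expected = deck[0]
print(question)
```

The same two-directional trick would apply to amino acid codes or pathway steps; only the dictionary changes.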

Tuesday, July 13, 2010

There are 2 styles of Excel reports: Mine & Wrong

A key discovery which is made by many programmers, both inside and outside bioinformatics, is that Microsoft Excel is very useful as a general framework for reporting to users. Unfortunately, many developers don't get beyond that discovery to think about how to use this to best advantage. I've developed some pretty strong opinions on this, which have been repeatedly tested recently by various files I've been sent. I've also used this mechanism repeatedly, with some Codon reports for which I am guilty of excessive pride.

An overriding principle for me is that I am probably going to use any report in Excel as a starting point for further analysis, not an endpoint. I'm going to do further work in Excel or import it into Spotfire (my preference) or JMP or R or another fine tool. Unfortunately, there are a lot of practices which frustrate this.

First, as much data as possible should be packed into as few tabs as practical. Unless you have a very good reason, don't put data formatted the same way into multiple files or multiple tabs. I recently got some sequencing results from a vendor and there was one file per amplicon per sample. I want one file per total project!

Second, the column headers need to be ready for import. That means a single row of column headers, with every column having a specific and unique header. Yes, for viewing it sometimes looks better to have multiple rows and use cell fusing and other tricks to minimize repetition -- but for import this is a disaster, either guaranteed or likely.

Third, every row needs to tell as complete a story as possible. Again, don't go fusing cells! It looks good, but nobody downstream can tell that the second row really repeats the first N cells of the row above (because they are fused).

Fourth, don't worry about extra rows. One tool I use for analysis of Sanger data spits out a single row per sample with N columns, one column for each mutation. This is not a good format! Similarly, think very carefully before packing a lot into a single cell -- Excel is terrible for parsing that back out. Don't be afraid to create lots of columns & rows -- Excel is much better at hiding, filtering or consolidating than it is at parsing or expanding.
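As an illustration of the fourth point, here is a minimal sketch of converting the one-row-per-sample, one-column-per-mutation layout into one row per sample per mutation. The sample names and mutations are made up, and the choice to keep a blank-mutation row for samples with no calls is my own assumption about what is most useful downstream.

```python
import csv
import io

def wide_to_long(rows):
    """Turn [sample, mut1, mut2, ...] rows into (sample, mutation) pairs.

    Samples without mutations get one row with an empty mutation, so
    they don't silently vanish from the report.
    """
    long_rows = []
    for sample, *mutations in rows:
        for mutation in mutations or [""]:
            long_rows.append((sample, mutation))
    return long_rows

# Hypothetical output from a Sanger analysis tool: ragged, one row per sample.
wide = [
    ["S1", "KRAS G12D", "TP53 R175H"],
    ["S2", "BRAF V600E"],
    ["S3"],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sample", "mutation"])   # single, unique header row
writer.writerows(wide_to_long(wide))
print(buffer.getvalue())
```

Every row now tells its complete story, and filtering or pivoting in Excel (or Spotfire, or R) takes over from there.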

Finally, color or font coding can be useful -- but use it carefully and generally redundantly. Ignoring the careful part means generating confusing "angry fruit salad" displays (and never EVER make text blink in a report or slide!!!).

Follow these simple rules and you can make reports which are springboards for further exploration. It's also a good start to thinking about using Excel as a simple front end to SQL databases.

So what was so great about my Codon reports? Well, I had figured out how to generate the XML to handle a lot of nice features of the sort I've discussed above. The report had multiple tabs, each giving a different view or summary of the data. The top tab did break my rules -- it was a purely summary table & was not formatted for input into other tools (though now I'm feeling guilty about that; perhaps I should have had another tab with it properly formatted). But each additional tab stuck to the rules. All of them had AutoFilter already turned on and had carefully chosen highlighting when useful -- using a combination of cell color and text highlighting to emphasize key cells. Furthermore, it also hewed to my absolute dictum "Sequences must always be in a fixed width font!". I didn't have it automatically generate Pivot Tables; perhaps eventually I would have gotten there.
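For the curious, here is a rough sketch of what generating that sort of XML can look like. This is not the Codon code; it is a minimal reconstruction based on my recollection of the Excel 2003 "XML Spreadsheet" format, so the element and attribute names should be checked against the format documentation before relying on them. The function name, style ID and sample data are all hypothetical. It writes one tab with AutoFilter turned on and puts designated sequence columns in a fixed width font.

```python
from xml.sax.saxutils import escape, quoteattr

def spreadsheet_xml(sheet_name, header, rows, sequence_columns=()):
    """Build a one-tab Excel 2003 style XML workbook as a string.

    Columns named in sequence_columns (header cell included) get a
    fixed width font; all values are written as strings for simplicity.
    """
    n_rows, n_cols = len(rows) + 1, len(header)
    lines = [
        '<?xml version="1.0"?>',
        '<?mso-application progid="Excel.Sheet"?>',
        '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"',
        ' xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"',
        ' xmlns:x="urn:schemas-microsoft-com:office:excel">',
        ' <Styles>',
        '  <Style ss:ID="seq"><Font ss:FontName="Courier New"/></Style>',
        ' </Styles>',
        f' <Worksheet ss:Name={quoteattr(sheet_name)}>',
        '  <Table>',
    ]
    for row in [list(header)] + [list(r) for r in rows]:
        cells = []
        for col, value in enumerate(row):
            style = ' ss:StyleID="seq"' if header[col] in sequence_columns else ''
            cells.append(
                f'<Cell{style}><Data ss:Type="String">{escape(str(value))}</Data></Cell>'
            )
        lines.append('   <Row>' + ''.join(cells) + '</Row>')
    lines += [
        '  </Table>',
        # AutoFilter over the whole table, in R1C1 notation.
        f'  <AutoFilter x:Range="R1C1:R{n_rows}C{n_cols}"'
        ' xmlns="urn:schemas-microsoft-com:office:excel"/>',
        ' </Worksheet>',
        '</Workbook>',
    ]
    return "\n".join(lines)

report = spreadsheet_xml(
    "Constructs",
    ["construct", "sequence"],
    [["c001", "ATGGCTAGC"], ["c002", "ATGAAACCC"]],
    sequence_columns={"sequence"},
)
print(report)
```

Saved to a file, output along these lines should open as a spreadsheet in Excel; additional tabs would just be additional Worksheet elements.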