Sunday, August 29, 2010

Who has the lead in the $1K genome race?

A former colleague and friend has asked over on a LinkedIn group for speculation on which sequencing platform will deliver a $1K 30X human genome (reagent cost only). It is somewhat unfortunate that this is the benchmark, given the very real cost of sample prep (not to mention other real costs such as data processing), but it has tended to be the metric of most focus.

Of existing platforms, there are two which are potentially close to this arbimagical goal (that is, a goal which is arbitrary yet has obtained a luster of magic through repetition).
ABI's SOLiD 4 platform can supposedly generate a genome for $6K, though even with pricing from academic core labs I can't actually buy that for less than about $12K (commercial providers will run quite a bit more; they have the twin nasty issues of "equipment amortization" and "solvency" to deal with).
The SOLiD 4 hq upgrade is promised for this fall with a $3K/genome target. Could Life Tech squeeze that out? I'm guessing the answer is yes, as the hq does not use an optimal bead packing. Furthermore, the new paired end reagents will offer 75 bp reads in one direction but only 25 in the other.
I've never understood why a ligation chemistry should have an asymmetry to it (though perhaps it is in the cleavage step), so perhaps there is significant room for improvement there. Of course, those possible 40 cycles are not free, so whether this would help with cost/genome is not obvious (though it would be advantageous for many other reasons). Though, since they can currently get a 30X genome on one slide longer reads would enable packing more genomes per slide & perhaps that's where the accounting ends up favoring longer reads.

Complete Genomics is the other possible player, but we have an even murkier lens on the reagent costs per genome, given that Complete deals only in complete genomes and only in bulk. But, they do have to actually ensure they are not losing money (or at least, with their IPO they won't be able to hide the bleed). Indeed, Kevin Davies (who has a book on $1K genomes coming out) replied on the thread that Complete Genomics has already declared to be at $1K/genome in reagent costs. Perhaps we should move the target to something else (Miss Amanda suggests that $1K canid genomes are far more interesting).

What about Illumina? With HiSeq, they are supposedly at $10K/genome with the HiSeq and many have noted that
the initial HiSeq specs were for a lower cluster packing than many genome centers achieve. That also brings up an interesting issue of consistency -- how variable are cluster packings & therefore the output per run. In other words,
what sigma are we willing to accept in our $1K/genome estimate? Also, the HiSeq specs were for shorter reads than the 2 x 150 paired end
reads that are quite common in 1000 genomes depositions in the SRA (how much longer can Illumina go?).

So, perhaps any of these three existing platforms might meet the mark (454 is a non-starter; piling up data cheaply is not
its sweet spot). What about the ones in the wings? Of course, these are even murkier and we must rely even more on their maker's
projections (and potentially, wishful thinking).

IonTorrent's technology (to be re-branded by Life Tech?) isn't nearly there right now. For $500 (the claim is) you'd get 150Mb of data, or about 0.1X for $1000, so we need about 300X improvement. However, there should be a lot of opportunity to improve. The one touted most in the past is further improvement in the feature density; Ion Torrent was apparently already working on a chip with about 4X the number of features. If we round 300 to 256, then that would only be 4 rounds of quadruplings. If Life could pump those out every 6 months, then that would only be two years to a $1K genome. Who knows how realistic that schedule would be?

But IonTorrent could push on other dimensions as well. Because the flowcell itself is a huge chunk of the cost of a run, squeezing longer read lengths should be possible. Since 454 gets nearly 500 basepair reads routinely (and up to a kilobase when things are really humming), perhaps there is a factor of nearly 4 to get from longer reads. In a similar manner, a paired-end protocol could potentially double the amount of sequence per chip (at a cost of perhaps a bit more than double the runtime; not such a big deal if the run is really an hour). Could that be done? I think I have the schematic for an approach (which might also work on 454); trade proposals for sequencing instruments will be put to my employer for consideration! Finally, as noted in a thread on SEQAnswers, IonTorrent is apparently achieving only about a 1/8th efficiency in converting chip features to sequence-generating sites; better loading schemes might squeeze another few fold out. So perhaps IonTorrent really is 1-2 years away from having $1K genomes (much more likely the 2).

Moving on, could Pacific Biosciences (or the Life tech StarLight (nee VisiGen)) technology have a shot? Lumping them together (since we have virtually no price/performance information for StarLight), PacBio is initially promising $100 runs generating ~60Mb, so $1K would get you about 0.2X coverage, or about 150-fold off, which we'll round to 128-fold or 7 doublings. I think they've already been said to be testing a chip with twice the density, plus a better loading scheme to yield around 2X -- so perhaps it's only 5 doublings.

Finally, there are the technologies which haven't yet demonstrated the ability to read any DNA, but could do so and then move quickly (or not). In this category are any nanopore-based systems (which is a dizzying array of approaches) and Gnu Bio's sequencing-by-synthesis-in-nanodrops approach. And perhaps a few more. These don't even work yet, so even speculative price performance information isn't available.

Finally, a quick note about what a $1K genome means. The X-prize folks have set very strong standards, standards which are far beyond what any short read technology could hope to accomplish and also far beyond what many sequencing applications need. The organizers did not super-design them for no reason; there are applications which need that rigor and also it will greatly cut down on false positives. But, as the regular stream of papers shows, much lower standards will suffice to get interesting biology of whole human genomes.

5 comments:

Anonymous said...

Thanks for the overview, very helpful. As one working in the field, it is not trivial to keep up with the constantly changing pricing, and also to account for real costs to consumer versus claims about intrinsic costs at the companyies.

Anonymous said...

You seem to have left error rates out of your calculations, but most real applications depend either on de novo assembly or accurate mapping to a reference, both of which rely more on the length and accuracy of the reads than on the total volume of data.

Anonymous said...

For your question about the asymmetrical SOLiD runs in ligation chemistry -- it's because it is a reverse chemistry on the same strand rather than a forward chemistry on newly sythesized second strand. Reverse ligation chemistry is more difficult than forward, but provides the benefit of removing second strand sythesis error.

Anonymous said...

I think that Illumina has a lot more headroom than SOLiD in terms of price per base despite the current situation (with HS2000 at ~$10,000/genome and SOLiD at ~$6,000/genome). With Illumina's market share, however, there's no point in dropping their reagent cost relative to SOLiD's.

Roald Forsberg said...

Thanks for the overview - any thoughts on where Ox ford Nanopore are in this race ?