Friday, December 10, 2010

Is Pacific Biosciences Really Polaroid Genomics?

The New England Journal of Medicine this week carries a paper from Harvard and Pacific Biosciences detailing the sequence of the Vibrio cholerae strain responsible for the outbreak of cholera in Haiti. The paper and supplementary materials (which contain detailed methods and some ginormous tables) are free for now. There's also a nice piece in BioIT World giving a lot of backstory. Quite a few other media outlets have carried the story as well, but those are the sources I've read.

All in all, the project took about a month from an initial phone call from Harvard to Pacific Biosciences until the publication in NEJM. Yow!! Actual sequence generation took place 2 days after PacBio received the DNA. And this is sequencing two isolates (which turned out to be essentially identical) of the Haitian bug plus three reference strains. While full sequence generation took longer, useful data emerged three hours after getting on the sequencer (though there are apparently around 10 wall clock hours of prep before you can get on the sequencer). With the right software & sufficient computational horsepower, one really could imagine standing around the sequencer and watching a genome develop before your eyes (just don't shake the machine!).

Between this & the data on PacBio's DevNet site (you'll need to register to get a password), in theory one could find the answers to all the nagging questions about the performance specs. In practice, this dataset is apparently available as the assembled sequence, with only summary statistics for certain aspects of it. For example, dropped bases are apparently concentrated at C's & G's, so these were discounted.

Read lengths were 1,100+/-170bp, which is quite good -- and this is after filtering out lower quality data -- and 5% of the reads were monsters longer than 2,800 bases. It is interesting that they did not use the circular consensus method, which was previously published in a methods paper (which I covered earlier) and yields higher qualities but shorter fragments. It would be particularly useful to know if the circular consensus approach effectively dealt with the C/G dropout issue.

One small focus of the paper, especially in the supplement, is depth of sequence analysis to infer copy number variation. There is a nice plot in Supplementary Figure 2 illustrating how the copy number varies with distance from the origin of replication. If you haven't looked at bacterial replication before, most bacteria have a single circular chromosome and initiate synthesis starting at one point (the 0 minute point in E. coli). In very rapidly dividing bacteria, the cell may not even wait for one round of synthesis to complete before firing off another round; either way, any dividing population will contain more DNA near the origin than near the terminus of replication. Presumably one could estimate the growth kinetics based on the slope of the copy number from ori to ter!
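To make that last thought concrete, here is a minimal Python sketch of how one might pull a growth-related number out of the ori-to-ter slope. It is not the paper's method. It assumes the textbook Cooper-Helmstetter picture, in which log2 of copy number falls linearly with fractional distance from ori to ter and the slope equals -C/tau (C being the time to replicate the chromosome, tau the doubling time). It also assumes the terminus sits directly opposite the origin. The function names, the genome length, the origin position, and the synthetic depths are all made up for illustration.

```python
import math

def ori_distance(pos, ori, genome_len):
    """Fractional distance from ori (0.0) to ter (1.0) on a circular chromosome.

    Assumption: ter lies exactly opposite ori, so the two replication
    arms are of equal length.
    """
    d = abs(pos - ori) % genome_len
    d = min(d, genome_len - d)          # take the shorter way around the circle
    return d / (genome_len / 2)

def fit_log2_slope(positions, depths, ori, genome_len):
    """Ordinary least-squares fit of log2(depth) against ori-to-ter distance.

    Returns (slope, intercept). All depths must be > 0.
    """
    xs = [ori_distance(p, ori, genome_len) for p in positions]
    ys = [math.log2(d) for d in depths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic, noise-free data: hypothetical 3 Mb chromosome, ori at position 0,
# depth sampled every 500 nt, 60X at the origin, C/tau = 0.8.
genome_len = 3_000_000
ori = 0
true_c_over_tau = 0.8
positions = list(range(0, genome_len, 500))
depths = [60 * 2 ** (-true_c_over_tau * ori_distance(p, ori, genome_len))
          for p in positions]

slope, intercept = fit_log2_slope(positions, depths, ori, genome_len)
c_over_tau = -slope            # estimated C/tau under the model above
ori_ter_ratio = 2 ** c_over_tau  # implied ori:ter copy number ratio
```

With real depth tables you would swap in the measured positions and depths. Turning C/tau into an actual doubling time still needs an independent estimate of C, which this sketch does not provide.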

After subtracting out this effect, most of the copy number fits a Poisson model quite nicely (Supplementary Figure 3). However, there is still some variation. Much of this is around ribosomal RNA operons, which are challenging to assemble correctly since they appear in arrays of quite long, nearly (or completely) perfect repeats. There's actually even a table of the sequencing depth for each strain at 500-nucleotide intervals! Furthermore, Supplementary Figure 4 shows the depth of coverage (uncorrected for the replication polarity effect) at 6X, 12X, 30X and 60X coverage, illustrating how many of the trends are actually noticeable in the 6X data.
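For anyone wanting to play with a depth table like that, here is a rough Python sketch of checking depths against a Poisson model and flagging the windows that don't fit. Again, this is my own illustration and not the analysis from the supplement: the function names, the significance cutoff, and the ten depth values are invented, and the Poisson mean is simply taken as the average depth (a real analysis would first remove the ori-to-ter trend and might prefer a more robust estimate).

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def dispersion_index(depths):
    """Sample variance divided by mean; close to 1 for Poisson-like counts."""
    n = len(depths)
    mean = sum(depths) / n
    var = sum((d - mean) ** 2 for d in depths) / (n - 1)
    return var / mean

def flag_non_poisson(depths, alpha=1e-3):
    """Flag windows whose depth is improbably low or high under Poisson(mean).

    alpha is an arbitrary illustrative cutoff applied to each tail.
    """
    lam = sum(depths) / len(depths)
    flags = []
    for d in depths:
        lower = sum(poisson_pmf(k, lam) for k in range(0, d + 1))    # P(X <= d)
        upper = 1.0 - sum(poisson_pmf(k, lam) for k in range(0, d))  # P(X >= d)
        flags.append(min(lower, upper) < alpha)
    return flags

# Made-up integer depths for ten windows; the ninth mimics a pile-up
# such as one might see at a collapsed rRNA repeat.
window_depths = [28, 31, 30, 27, 33, 29, 32, 30, 95, 31]

flags = flag_non_poisson(window_depths)
overdispersion = dispersion_index(window_depths)
```

A dispersion index well above 1, as in this toy example, says the counts are more variable than Poisson sampling alone would produce, and the flags point at which windows are responsible.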

What biology came out of this? A number of genetic elements were identified in the Haitian strains which are consistent with the Haitian bug being a very bad actor and a variant of a nasty Asian strain.

All-in-all, this neatly demonstrates how PacBio could be the backbone of a very rapid biosurveillance network. It is surprising that in this day and age the CDC (as detailed in the BioIT article) even bothered with a pulsed field study; even on other platforms the turnaround for a complete sequence wouldn't be much longer than for the gel study, and the results are so much richer. Other technologies might work too, but the very long read lengths and fast turnaround on offer should be very appealing, even if the cost of the instrument (much closer to $1M than to my budget!) isn't. But a few instruments around the world, serving other customers but with priority given to such samples, could form an important tripwire for new infections, whether they be acts of nature or of evil persons. Now, it is important to note that this involved a known, culturable bug and the DNA was derived from pure cultures, not straight environmental isolates.

On a personal note, I am quite itchy to try out one of these beasts. As a result, I'm making sure we stash some DNA generated by several projects so that we could use them as test samples. We know something about the sequence of these samples and how they performed with their intended platforms, so they would be ideal test items. None of my applications are nearly as exciting as this work, but they are the workaday sorts of things which could be the building blocks of a major flow of business. Anyone with a PacBio interested in collaborating is welcome to leave me a private comment (I won't moderate it through), and of course my employer would pay reasonable costs for such an exercise. Or, I certainly wouldn't stamp "return to sender" on the crate if an instrument showed up on the loading dock! I don't see PacBio clearing the stage of all competitors, but I do see it both opening new markets and throwing some serious elbows against certain competing technologies.

1 comment:

Anonymous said...

The data is at NCBI:
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP004712

ftp://ftp.ncbi.nlm.nih.gov/sra/Submissions/SRA026/SRA026766/provisional/fastq/

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

Compare to:
http://www.ncbi.nlm.nih.gov/sra/SRP004647