Thursday, May 18, 2017

London Calling 2017: Plant & Animal de novo Genomes

Okay, I'm desperately behind on writing up the external science from London Calling.  Not helpful that I claimed I would not only do so, but in multiple installments.  A number of the plenaries focused on large genome assembly, so that's what I'll tackle now -- plus a few other bits.   See also my Storify summaries, which include other reports on the conference.  Also check out my storifies on the SMRT Leiden conference, which ran at the beginning of the same week and discusses many similar topics.

Karen Miga: Human Y-Centromere

Karen Miga from UCSC got the science off to a roaring start by demonstrating a particularly valuable use for the ultra-long reads coming off the nanopore sequencers.  Well, these days a mere 200kb isn't so ultra-long, but let's just call it that for now.  Miga pointed out the dirty little secret of the human genome, which is that there are large heterochromatic regions, such as centromeres but also entire chromosome short arms, which haven't been assembled.  Y-chromosome centromeres are particularly interesting due to length heterogeneity.  They are also an attractive target for sequencing as they are on the shorter side, in the neighborhood of 300kbp.

Starting with a set of 5 BACs which David Page's group at MIT had shown tile a complete Y centromere, Miga set out to assemble these into a contiguous sequence.  The challenge is first to sequence the individual BACs and then to figure out precisely how to overlap them.

For the BAC stage, Miga used the Longboard protocol developed at UCSC by Miten Jain. By titrating the transposase so that large numbers of BACs receive one-and-only-one hit, a large number of reads can be obtained which contain the entire BAC sequence.  So rather than assemble (at least to me, a post for the near future), one just reorients-and-permutes on the BAC vector.  So assembling really hairy repeats becomes a non-issue; it's now a relatively simple problem of multiple alignment of a bunch of roughly equal length sequences which are anchored by unique sequence (the BACs) on each end. Over 3500 reads of 150kb or more in length were generated for this project, including a 221kbp BAC!
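To make the reorient-and-permute idea concrete, here is a minimal sketch in Python. It is not Miga's or Jain's actual code: the function names and the `anchor` parameter are my own inventions, and it assumes the vector anchor can be found by exact string match. Real nanopore reads are noisy, so a real pipeline would need an aligner to locate the vector.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def reorient_and_permute(read, anchor):
    """Rotate a full-length read of a circular BAC so it starts at `anchor`.

    `anchor` is a hypothetical stretch of known vector sequence. Both strands
    are tried, so every read comes out in the same orientation. Returns None
    if the anchor is not found (exact matching only; an assumption).
    """
    for candidate in (read, reverse_complement(read)):
        # Doubling the read lets us find an anchor that the random
        # transposase cut happened to split across the two read ends.
        doubled = candidate + candidate
        pos = doubled.find(anchor)
        if 0 <= pos < len(candidate):
            return doubled[pos:pos + len(candidate)]
    return None
```

Once every read begins at the same vector position on the same strand, the reads are roughly equal-length sequences that can go straight into a multiple alignment.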

Once the BACs were in hand, Miga could use the rare landmarks within these highly-repetitive sequences, such as transposon insertions, to position them relative to each other.  Presto! A complete human centromere, 346kbp in length.

Tsil Gabrieli, CATCH

Okay, this breakout session (parallel track) talk wasn't about mammalian de novo, but there's a connection to the Miga talk.  Tsil Gabrieli from the Weizmann Institute talked about CATCH.  I touched on CATCH in my piece on the Sage HLS instrument.  CATCH is a scheme for capturing long pieces of DNA by specifically cutting them with Cas9 and custom guide RNAs. Because the original goal was cloning the fragments, the group hasn't yet pushed the boundaries very far.  They did attain 74X coverage of a 200kb fragment, with very little off-target capture.  Indeed, most of the off-targets were fragments on the "other" side of the Cas9 cutting.  A beautiful part of CATCH is that it requires knowing only flanking DNA; the precise sequence to be captured isn't important so long as it doesn't have target sites for the guides.

Spinning the imagination, suppose CATCH was used on human DNA to capture more Y centromeres?  It might be possible to generate reads spanning entire Y centromeres, perhaps quickly surveying the diversity of these structures from a variety of ethnic groups.  The largest human centromeres, according to Miga, are 3Mbp in size, which could well be beyond current technology, but it could well be possible to capture multiple smaller centromeres in a single CATCH experiment.  Perhaps CATCH augmented with a bit of random fragmentation could capture very long chunks of the large centromeres, giving a greater window into their evolution.

Over at SMRT Leiden, one of the talks mentioned a company called Samplix which is creating a microfluidic system for targeted enrichment of long fragments. DNAs bound by a fluorescent probe lead to their enclosing microdroplet being sorted away from probe-negative droplets.  Another microfluidic approach, recently described for generating human genome sequences, compartmentalizes long fragments (much like 10X Genomics or iGenomX).

Sally James: Counting Chromosomes by Assembly

Sally James spoke on sequencing an extremophile red alga, Galdieria sulphuraria. She brought up an issue that is often overlooked: for many species, including this one, the karyotype is not precisely known.  Nanopore sequencing gave very long contigs, and comparison of these showed that many have very similar sequences at one or both ends.  That strongly suggests she has in some cases succeeded in assembling complete chromosomes, with the similar sequences being the telomeres. Of her 76 contigs, 40 appear to be complete chromosomes.
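As a rough illustration of how contig ends might be compared, here is a small Python sketch. This is not James's method, which wasn't described in that detail. It simply flags a contig end as telomere-like if it shares a k-mer with enough other contig ends, and counts how many ends of each contig get flagged. The function names and the `window`, `k` and `min_ends` parameters are all hypothetical choices of mine.

```python
from collections import Counter


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def canonical_kmers(seq, k):
    """Set of k-mers in seq, each stored as the smaller of itself and its
    reverse complement so that strand does not matter."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def count_shared_ends(contigs, window=30, k=12, min_ends=3):
    """Map each contig name to how many of its ends (0, 1 or 2) look shared.

    contigs: dict of name -> sequence. An end counts as shared when at least
    one of its k-mers occurs in `min_ends` or more contig ends overall.
    Thresholds are illustrative assumptions, not tuned values.
    """
    end_kmers = {}
    for name, seq in contigs.items():
        end_kmers[(name, "L")] = canonical_kmers(seq[:window], k)
        end_kmers[(name, "R")] = canonical_kmers(seq[-window:], k)
    # Each end contributes a k-mer at most once, so this counts ends.
    ends_per_kmer = Counter()
    for kmers in end_kmers.values():
        ends_per_kmer.update(kmers)
    return {
        name: sum(
            any(ends_per_kmer[km] >= min_ends for km in end_kmers[(name, side)])
            for side in ("L", "R")
        )
        for name in contigs
    }
```

A contig scoring 2 is a candidate complete chromosome; a score of 1 suggests one real chromosome end and one assembly break.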

Ivo Gut: Birds

Ivo Gut from Spain discussed efforts to sequence two avian species, hummingbirds and a species of bustard.  A hummingbird has been selected as a pilot species for grand vertebrate sequencing projects and pretty much every plausible technology is being thrown at it (the project, not the birds -- they are good at darting out of harm's way!).  Gut has already generated a hummingbird assembly with an 8Mb contig N50, which is quite respectable among vertebrate genome assemblies.

David Eccles: Evil Repeats in a Worm

David Eccles, hailing from New Zealand, described his work trying to sequence Nippostrongylus brasiliensis, a parasitic worm known as "Nippo".  He has run into some hideous repeats in these worms.  One striking result is that a single nanopore run can generate better contiguity statistics than the existing reference sequence!

Kazuharu Arakawa: Evil Repeats in Spiders

I'll admit that fatigue was really setting in by the time Kazuharu Arakawa took the stage for the final scientific talk, but he quickly grabbed everyone's attention by extolling the structural wonder of spider silk.  Silk is amazingly strong, different silks have different properties, and he is working with a Japanese company trying to engineer silk-based materials.

The fun part is that spider silk genes have highly repetitive structures.  He put one protein sequence on a slide and reminded everyone that it really was a protein: all those runs of A in the sequence might have you thinking otherwise.  Periodic repeats showed up as patterns in the FASTA sequence.  So only long reads can assemble these.

Bjorn Usadel: Tomatoes

Bjorn Usadel described some of the general challenges of sequencing plant species as well as the specific experiences sequencing a tomato cultivar.  Plants often have huge genomes and high ploidies.  Genome size can be driven by multiple ancient duplications, allopolyploidy (two species merging; common tobacco is a natural example) and repeat expansions.  

In addition to the high economic importance of tomatoes, they are also interesting from an ecological standpoint.  Many wild species of tomatoes will at least to some degree interbreed.  Some of these may be inedible (or possibly even poisonous), yet have useful traits which might be crossed into edible cultivars.  

Usadel also believes from his experience that generating very long reads will significantly reduce coverage requirements.

Raymond Hulzink, Plant DNA Prep

Atop all the other issues with plants, polysaccharides and secondary metabolites can interfere with getting a good DNA prep.  Usadel and others mentioned moving towards nuclei preps as sources of high quality, high molecular weight DNA, as the secondary metabolites are not abundant in nuclei. This showed up not only in Hulzink's talk, but also several others. Hans Jansen mentioned to me in conversation that this can be improved further by extracting nuclei from actively growing root tips, as they have nearly no secondary metabolites.  The catch, as any gardener knows, is that actively growing root tips are tiny structures, so making a lot of DNA by this route is a challenging undertaking.

Hulzink discussed using the Boreal Genomics Aurora instrument for DNA preparation.  Aurora uses circulating electric fields to tease high molecular weight DNA from other contaminants.  Hulzink is using this approach in an ongoing project to generate 150X coverage of the honeydew melon genome.

Christiaan Henkel & Hans Jansen: Eels and Tulips

Colleagues Christiaan Henkel and Hans Jansen discussed at different times (plenary and the workshop respectively) the sequencing of eel genomes and ambitions to sequence the vast tulip genome.  Oxford Nanopore CEO Gordon Sanghera had opened the conference by commenting that the venue had once been a fishmarket with a granted near monopoly, with the exception being that the Dutch were allowed to sell eels.  This was in recognition of Dutch eel sales feeding survivors of the Great Fire of London.

The tulip genome is huge: it has 11 chromosome pairs, and most chromosomes are individually as long or longer than the entire human genome.  Tulips are economically important to Holland and very challenging to breed; I think the quote was on the order of a decade to go from a single prized bulb to sufficient breeding stock to enable export.  Their group has developed an assembler, also named TULIP (and written in Perl!) which uses some heuristics to speed the assembly. 

Benjamin Istace: Bananas

Benjamin Istace from Genoscope in France described sequencing the banana genome on nanopore.  Four R9.4 runs gave 22Gb with a read N50 of 25.5kbp, enabling a 1.85Mb assembly N50.
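Since N50 values are scattered all over this post, here is a quick reminder of how the statistic is computed, as a minimal Python sketch. The function name is mine, and the lengths in the example are toy numbers, not data from any of the talks. The same calculation applies whether the lengths are reads (read N50) or contigs (assembly N50).

```python
def n50(lengths):
    """Return the N50 of a collection of lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total is 30, so we need 15 bases; 10 + 8 gets us there.
print(n50([2, 4, 6, 8, 10]))  # prints 8
```

Because the statistic is weighted by bases rather than by count, a handful of very long reads or contigs can move it a lot, which is why it gets quoted so often for long-read data.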

Closing Thoughts

Okay, so there's a taste of some plant genomes and some animals you really shouldn't taste!  The next big nanopore-focused all-comers scientific event will be the New York Community meeting in early December.  By that time, I would expect that more genomes will be falling to MinIONs and the first GridIONs (and perhaps even PromethIONs).  There's a lot of interest in long read preps, which even if they don't provide new record monsters are likely to provide much higher quality assemblies.  In particular, the nanopore human genome assembly paper (which both Jared Simpson and Sergey Koren spoke on at the meeting) used simulation to estimate that an assembly using only ultra-long reads might achieve a genome N50 in excess of 30Mb.

For example, at Josh Quick's oversubscribed presentation during a coffee break he apparently discussed obtaining runs with read N50s exceeding 100kbp.  A trick he is now using is to dilute the ultra-HMW DNA with shorter DNA in order to titrate the transposase.  Measuring out very goopy long DNA is difficult, so salting in the more manageable conventional preps allows better control (a similar approach is used by seqWell in their Illumina library scheme).  An idea I've thrown out for consideration (both in chats there and now here) is that if you used a small, known DNA such as lambda as the diluent, then it might be possible to use read-until to avoid sequencing the diluent. At a minimum that would reduce the basecalling compute load, but might also enable a higher yield of long DNAs of interest. Alternatively, on a system such as Sage HLS, it may be possible to use physical size selection to filter out the diluent fragments after library selection. Or, just use something interesting with a less complex genome (such as bacterial samples) as your diluent so these reads are valuable too.

Certainly the recent reports of leviathan reads have spurred interest in DNA preparation methods, causing the dusting off of many ancient ones and a turn to more involved approaches such as nuclei prep and instruments such as the Boreal Genomics Aurora and Sage Science HLS.  Will the NY meeting see a gaggle of different methods, or will the rapid communication (via Twitter and preprints) which characterizes the nanopore experimental community trigger a consolidation around relatively few methods?
