Saturday, April 22, 2017

Pinniped Karyotypes & N50 Statistics

In my recent piece on long read assembly, I laid out part of the case against the N50 statistic.  Historically, the complaint has been that the statistic can be gamed at the expense of assembly correctness or assembly coverage. These are concerns for the typical sort of short read assemblies we've grown used to: lots of contigs and the temptation (perhaps justified) to go for higher N50s by merging more aggressively or by filtering out the short contigs.  Elin Videvall over at The Molecular Ecologist has a nice ongoing series of posts illustrating the statistic and these commonplace issues.
I'm going to come at the problem from the other end, as a new preprint from 10x Genomics illustrates the problem of using an N50 statistic (or any related Nxx statistic) with good long-read / linked read assemblies -- but doesn't demonstrate this point quite as strongly as I thought when I first started drafting this.
10x Genomics has two recent preprints on de novo assembly of mammalian genomes using their Supernova assembler.  One tackles multiple human samples using 10x's technology alone, whereas the other generates an assembly for the endangered Hawaiian monk seal using both 10x and BioNano Genomics optical mapping.  When I first read the abstract, I was struck by the modest improvement in N50 that was reported when the BioNano data was added to the 10x assembly, 22.23Mb to 29.65Mb.  Digging into the paper, I realized that this was a good example of the N50 statistic failing, as the number of scaffolds had gone from 7,932 to 216 with the addition of the optical mapping data.  Why the discrepancy?

With a typical single-replicon microbial genome, the problem can crop up as the assembly really comes together.  Since the N50 statistic is the size of the contig for which half of the assembly lies in contigs of that size or greater, the approach fails if the largest contig accounts for more than half the data. Once that occurs, an assembly could improve by correctly coalescing smaller contigs, but if they don't join the main contig then the statistic won't budge.  
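The saturation effect is easy to reproduce with a few lines of Python. This is only a sketch: the contig lengths below are invented for a hypothetical 5Mb genome and are not taken from any real assembly.

```python
def n50(lengths):
    """Return the largest length L such that sequences of length >= L
    together hold at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one sequence")


# Hypothetical 5 Mb microbial assembly: one dominant contig plus 20 fragments.
before = [3_000_000] + [100_000] * 20

# A clearly better assembly: the fragments coalesce into two 1 Mb contigs,
# but neither of them joins the main contig.
after = [3_000_000, 1_000_000, 1_000_000]

print(len(before), "contigs, N50 =", n50(before))
print(len(after), "contigs, N50 =", n50(after))
```

The contig count drops from 21 to 3, yet both assemblies report the same N50, because the 3Mb contig already holds more than half of the bases.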

With eukaryotes, an even more serious problem arises.  Since eukaryotes have multiple chromosomes, the N50 statistic has a ceiling that is much smaller than the genome size.  For example, the maximum N50 for a human male genome (based on GRCh38) is 156Mb, the length of the X chromosome, as chromosomes 1-7 plus X account for 50.1% of the total genome.  If your N50 hits 156Mb, it's never getting any better.
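For anyone who wants to check that ceiling, here is a small sketch of the calculation. The chromosome lengths are an assumption on my part: they are the GRCh38 primary assembly lengths as I remember them, counting only chromosomes 1-22, X and Y, and should be verified against the official assembly report before being relied on. The helper name `n50_with_members` is made up for this example.

```python
# GRCh38 chromosome lengths in bp (chromosomes 1-22, X, Y only).
# ASSUMPTION: quoted from memory, not from the post; verify before reuse.
GRCH38_BP = {
    "1": 248_956_422, "2": 242_193_529, "3": 198_295_559, "4": 190_214_555,
    "5": 181_538_259, "6": 170_805_979, "7": 159_345_973, "8": 145_138_636,
    "9": 138_394_717, "10": 133_797_422, "11": 135_086_622, "12": 133_275_309,
    "13": 114_364_328, "14": 107_043_718, "15": 101_991_189, "16": 90_338_345,
    "17": 83_257_441, "18": 80_373_285, "19": 58_617_616, "20": 64_444_167,
    "21": 46_709_983, "22": 50_818_468, "X": 156_040_895, "Y": 57_227_415,
}


def n50_with_members(lengths_by_name):
    """Return (N50, names of the sequences needed to reach half the total,
    fraction of the total those sequences cover)."""
    total = sum(lengths_by_name.values())
    ranked = sorted(lengths_by_name.items(), key=lambda kv: kv[1], reverse=True)
    running = 0
    members = []
    for name, length in ranked:
        running += length
        members.append(name)
        if 2 * running >= total:
            return length, members, running / total
    raise ValueError("need at least one sequence")


ceiling, members, fraction = n50_with_members(GRCH38_BP)
print(f"N50 ceiling: {ceiling / 1e6:.1f} Mb, set by chromosome {members[-1]}")
print(f"needs chromosomes {', '.join(members)} ({fraction:.1%} of the genome)")
```

If those lengths are right, a perfect chromosome-level assembly tops out at the length of X, with chromosomes 1-7 plus X covering 50.1% of the total, matching the figures above.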

Dogs are more closely related to seals than humans are.  The dog reference is female and the seal genome sequenced by 10x is male, but let's assume the Y makes only a small contribution.  The maximum N50 for a female dog is 64.2Mb, requiring 14 chromosomes (X, the largest in dog, plus 1-8, 10-12, 15 and 17).

Now, the 29.65Mb achieved with 10x plus optical mapping falls well short of that.  Apparently all dog chromosomes are acrocentric except for the sex chromosomes.  But another seal (the Baikal seal) has only 15 pairs of autosomes, all appearing (to my untrained eye) to be metacentric (the squares in the diagram below mark the centromeres; synteny is indicated for HSA=human, CFA=dog, MFG=stone marten).

Now, it turns out that pinnipeds (seals, sea lions and walruses) show variation in chromosome counts, and even within the family Phocidae (seals) there are two different chromosome counts: 2n=32 and 2n=34.  The 2n=34 count is regarded as ancestral, with a fusion generating the 2n=32 count. FISH studies exist for two different 2n=32 seals, harbor seals and Baikal seals, but not for any 2n=34 seals such as the Hawaiian monk seal (as far as I can tell; my sudden expertise on pinniped karyotypes is due to all the key papers being available as free full text; I was a pinhead on such issues before tonight).  Chromosome 1 in the diagram above appears to be the fusion result.

Now, in the absence of details on the 10x+BioNano scaffolds (which unfortunately are not made available in the preprint), I can't prove anything, but my original hypothesis, that the improvement in assembly N50 is capped by the sizes of the chromosomes, doesn't appear likely to hold water.  I looked at estimating a seal genome N50 using the dog chromosomes and the FISH mappings, but this turns out to be quite a mess.  For example, dog chromosome 5 maps to parts of Baikal seal chromosomes 1, 5, and 15.  If I calculate the N50 from the chromosome lengths measured in a 1970s karyotype paper, it looks like N50 would be reached with the 7 largest chromosomes.  So the maximum N50 in seal might well be about twice that for dog, or well over 120Mb.

So why didn't the BioNano data have a greater impact on the assembly N50?  It could be that most centromeres are uncrossable even with this data, but it's not obvious that this could have quite such a drastic impact. If we were to calculate N50 for chromosome arms, I think it can't be less than 50% of the chromosome N50 (pure intuition; I'm happy to entertain the idea that this is mistaken), with the exact number depending on where the centromeres are placed.  But that's still much larger (about 60Mb) than the attained N50.
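That intuition about chromosome arms can be poked at with a toy karyotype. Everything in this sketch is an assumption made for illustration: the 16 chromosome lengths are invented (arbitrary units, not seal data), every chromosome is given the same centromere position, and `split_arms` is a hypothetical helper.

```python
def n50(lengths):
    """Return the largest length L such that sequences of length >= L
    together hold at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one sequence")


def split_arms(chromosomes, short_arm_percent):
    """Break every chromosome at its centromere, placing the centromere
    short_arm_percent of the way along (integer arithmetic throughout)."""
    arms = []
    for length in chromosomes:
        short = length * short_arm_percent // 100
        arms.extend([short, length - short])
    return arms


# Invented karyotype: 16 chromosomes of broadly similar size.
karyotype = list(range(200, 360, 10))  # 200, 210, ..., 350

chrom_n50 = n50(karyotype)
for percent in (50, 40):
    arm_n50 = n50(split_arms(karyotype, percent))
    print(f"centromere at {percent}%: arm N50 = {arm_n50}, "
          f"ratio to chromosome N50 = {arm_n50 / chrom_n50:.3f}")
```

In this toy case perfectly metacentric chromosomes give an arm N50 of exactly half the chromosome N50, while moving every centromere to the 40% mark pushes the ratio slightly below one half. So 50% looks like a reasonable rule of thumb for a karyotype of similar-sized chromosomes rather than a strict floor, and even a figure somewhat under half of 120Mb is far above the attained N50.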

Another new preprint details assembly of a human genome entirely from MinION data.  The paper has all sorts of interesting side excursions in it, one of which is modeling the effect of different long read lengths on the contiguity of a human genome assembly.  Presumably BioNano data approximately fits this model as well.  While the repeat structure of seals is likely different in specifics, this is probably the explanation -- some very large repeats, probably segmental duplications, require much longer read spans.  That paper (which deserves its own post for a detailed look) saw a significant improvement in assembly contiguity when a small set of reads with a read N50 approaching 100kb was added to the input data.  The DNA preparation method used by 10x for the seal paper is QIAGEN's MagAttract, which claims to generate DNA in the 100kb-200kb size range. It would be interesting to see the degree to which ultra-HMW DNA preps would boost the 10x assembly contiguity.

When I do get around to writing up the nanopore human genome preprint (and besides a backlog of ideas of my own, I have requests to look at Oxford Nanopore's legal salvo at Pacific Biosciences -- oh joy!), I plan to compare and contrast it to the 10x Supernova preprint, as one of the genomes assembled by 10x is the one tackled with nanopore.  High contiguity de novo genome assembly continues to evolve in nearly real time; if only I could write equally rapidly.
