Scientists have developed a vast array of technologies that are used today to reconstruct complex genomes.
By Patrick James Hibbert
1 Jul 2019
For those genomes, assembly software often produces incomplete and fragmented reconstructions that require additional, experimentally derived information and manual intervention to reconstruct individual chromosome arms. Newer technologies, designed to capture chromatin structure, have proven to complement sequencing data effectively, creating more contiguous reconstructions of genomes than previously possible.
Automating the reconstruction of entire genomes is a difficult task, mostly because of genomic repeats. The ambiguity that repeats produce cannot be resolved with the information contained in the reads alone. Genomes also contain regions that are difficult to sequence because of unusual base-pair compositions.
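The ambiguity caused by repeats can be illustrated with a toy example. The sketch below is only an illustration under simplifying assumptions: the two short "genomes", the 7-base repeat, and the helper name kmer_counts are all hypothetical, and every fixed-length substring stands in for an error-free read. The two genomes differ in the order of the segments between repeat copies, yet reads shorter than the repeat cannot tell them apart.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every length-k substring, standing in for error-free reads."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


REPEAT = "GATTACA"  # hypothetical 7-base repeat that occurs three times

# Two toy genomes that differ only in the order of the segments
# lying between the repeat copies ("CCC" and "AAG" are swapped).
genome_1 = "TTG" + REPEAT + "CCC" + REPEAT + "AAG" + REPEAT + "TGC"
genome_2 = "TTG" + REPEAT + "AAG" + REPEAT + "CCC" + REPEAT + "TGC"

# Reads of length 5 are shorter than the repeat and never span it.
short_reads_agree = kmer_counts(genome_1, 5) == kmer_counts(genome_2, 5)

# Reads of length 9 cover the repeat plus one flanking base on each side.
long_reads_agree = kmer_counts(genome_1, 9) == kmer_counts(genome_2, 9)

print("length-5 reads identical:", short_reads_agree)
print("length-9 reads identical:", long_reads_agree)
```

In this toy case the short reads are identical for both genomes, so no algorithm could choose between the two reconstructions from those reads alone, while reads long enough to span the repeat do distinguish them.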
As a result, typical genome assemblies of eukaryotes are highly fragmented and contain thousands of contiguous genomic segments (contigs). Ordering and orienting these contigs along a chromosome is known as scaffolding, and any type of information that hints at the relative location of genomic segments along a chromosome can be used for it. Usually, the information comes from custom genomic technologies designed to analyze the structure of chromosomes.
Recently, researchers from the University of Maryland surveyed technologies and algorithms used to assemble and analyze large eukaryotic genomes and placed them within the historical context of genome scaffolding technologies that have been in existence since the dawn of the genomic era. They published this research in the journal PLOS Computational Biology.
Their work shows how advances on both the experimental and computational sides have dramatically improved the ability to reconstruct the genomes of complex eukaryotic organisms. It looks at and explains six sequencing and mapping approaches: physical mapping, subcloning, long-read data, paired-read data, chromosome conformation data, and synteny.
Physical mapping technologies attempt to estimate the location of specific loci along chromosomes. The loci can be short DNA segments that are unique within the genome, as in the case of sequence-tagged sites.
The approximate location of the markers along chromosomes can be identified through a number of techniques, from fluorescence in situ hybridization (FISH) to the analysis of the random breakage of DNA exposed to X-rays (radiation hybrid mapping) to the direct measurement of restriction fragment sizes, as performed in restriction mapping. Physical mapping data were among the earliest sources of information used to order genomic contigs along a chromosome.
Subcloning involves breaking up the genome into large fragments that are then sequenced separately. The connection between the sequencing reads generated from the same fragment is retained, creating what are subsequently called “linked reads”. The assembly process is then run for each fragment separately, and the resulting assemblies are merged to reconstruct the complete genome sequence.
This technique was used in the early days of genomics to sequence the first human genome. Recently, new technologies have been developed that perform the subcloning process in labs.
Long-read sequencing technologies generate long sequencing reads and can be seen as a special case of subcloning. Genome assemblers are effective at reconstructing genomic contigs from long-read data. They can achieve high-quality assemblies from long-read data alone, but the genome needs to be sequenced at high coverage, incurring significant costs.
Paired-read technologies are the most common source of information for scaffolding. The technology yields information about the relative placement of pairs of reads along the genome being sequenced. Typically, this information is produced by carefully controlling DNA shearing prior to sequencing in order to obtain fragments of uniform sizes and by tracking the link between DNA sequences “read” from the same fragment.
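Because the fragment size is controlled, a pair of reads that lands on two different contigs hints at how far apart those contigs are. The sketch below is a minimal illustration of that arithmetic, not the method of any particular scaffolder. It assumes a hypothetical setup in which the first read of each pair starts on contig A, the second read ends on contig B, B follows A in the same orientation, and the expected fragment length is known; the function name and coordinates are made up for the example.

```python
from statistics import median


def estimate_gap(contig_a_length, links, fragment_length):
    """Estimate the gap between contig A and the contig B that follows it.

    links is a list of (start_on_a, end_on_b) tuples, one per read pair:
    start_on_a is where the fragment begins on contig A and end_on_b is
    where it ends on contig B (both 0-based offsets from the contig start).
    Each fragment covers the tail of A, the unknown gap, and the head of B:
        fragment_length = (contig_a_length - start_on_a) + gap + end_on_b
    """
    gaps = [
        fragment_length - (contig_a_length - start_on_a) - end_on_b
        for start_on_a, end_on_b in links
    ]
    # The median dampens the effect of a few mis-mapped pairs.
    return median(gaps)


# Hypothetical example: a 1,000-base contig and fragments of about 500 bases.
example_links = [(800, 100), (850, 150), (900, 200)]
example_gap = estimate_gap(1000, example_links, fragment_length=500)
print("estimated gap:", example_gap)
```

A negative estimate would suggest that the two contigs overlap rather than being separated by unsequenced DNA. Real fragment sizes vary, so in practice the estimate carries an uncertainty that this sketch ignores.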
Chromosomal contact data is a special type of paired-read data generated by techniques recently developed to study the three-dimensional structure of chromosomes inside a cell.
These techniques, collectively referred to as chromosome conformation capture, generate pairwise linking information between reads that originate from genomic regions that are physically adjacent in a cell. Unlike mate-pair data, the distance and the relative orientation between the paired reads are not known beforehand.
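One way to picture how such contact data can order contigs is a simple greedy procedure: contigs that share many contacts are assumed to be neighbors, and the strongest links are joined first. The sketch below is an assumption-laden toy, not the algorithm of any published tool. The contact scores are invented, the function name greedy_order is hypothetical, and the scores are assumed to be already normalized for contig length.

```python
def greedy_order(contacts):
    """Chain contigs by joining the highest-scoring contact pairs first.

    contacts maps (contig_a, contig_b) to a normalized contact score.
    Each contig may gain at most two neighbors, and a join is skipped
    if it would close a loop, so every result is a linear chain.
    """
    neighbors = {}
    for a, b in contacts:
        neighbors.setdefault(a, [])
        neighbors.setdefault(b, [])
    component = {c: c for c in neighbors}

    def find(c):
        # Follow parent pointers to the representative of c's chain.
        while component[c] != c:
            c = component[c]
        return c

    ranked = sorted(contacts.items(), key=lambda item: item[1], reverse=True)
    for (a, b), _score in ranked:
        if len(neighbors[a]) < 2 and len(neighbors[b]) < 2 and find(a) != find(b):
            neighbors[a].append(b)
            neighbors[b].append(a)
            component[find(a)] = find(b)

    # Walk every chain from one of its endpoints.
    chains, seen = [], set()
    for start in sorted(neighbors):
        if start in seen or len(neighbors[start]) == 2:
            continue
        chain, previous, current = [], None, start
        while current is not None:
            chain.append(current)
            seen.add(current)
            onward = [n for n in neighbors[current] if n != previous]
            previous, current = current, (onward[0] if onward else None)
        chains.append(chain)
    return chains


# Hypothetical scores: contacts decay as contigs lie farther apart.
example_contacts = {
    ("A", "B"): 90, ("B", "C"): 80, ("C", "D"): 70,
    ("A", "C"): 20, ("B", "D"): 10, ("A", "D"): 5,
}
example_chains = greedy_order(example_contacts)
print(example_chains)
```

The sketch orders contigs only. Because the orientation of the paired reads is not known beforehand, orienting each contig requires additional analysis that is left out here.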
Synteny refers to the co-localization of genes or genomic loci along a chromosome. In many cases, whereas the DNA sequence itself may diverge significantly during evolution, related organisms often preserve synteny and gene order.
The conservation of synteny can be used to help order contigs along a chromosome: the placement of each contig is inferred from where the orthologs of its genes are located within a related genome. Despite the rapid increase of complete and draft genomes in public databases, the use of synteny information in genome reconstruction has not been widely adopted.
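The idea of placing contigs by the location of their orthologs can be sketched in a few lines. This is a simplified illustration that assumes orthologs have already been identified and that gene order is perfectly conserved; the gene names, coordinates, and the function name order_by_synteny are hypothetical.

```python
from collections import Counter
from statistics import median


def order_by_synteny(contig_genes, reference_positions):
    """Order contigs by where their genes' orthologs sit in a related genome.

    contig_genes maps a contig name to the genes found on it.
    reference_positions maps a gene to (chromosome, position) of its
    ortholog in the related genome. Contigs with no ortholog are skipped.
    """
    placements = {}
    for contig, genes in contig_genes.items():
        hits = [reference_positions[g] for g in genes if g in reference_positions]
        if not hits:
            continue
        # Assign the contig to the chromosome that holds most of its orthologs.
        chromosome = Counter(ch for ch, _ in hits).most_common(1)[0][0]
        position = median(pos for ch, pos in hits if ch == chromosome)
        placements[contig] = (chromosome, position)
    # Sort by chromosome name, then by median ortholog position.
    return sorted(placements, key=placements.get)


# Hypothetical data: three contigs and a related reference genome.
example_contigs = {"contig1": ["g3", "g4"], "contig2": ["g1", "g2"], "contig3": ["g5"]}
example_reference = {
    "g1": ("chr1", 100), "g2": ("chr1", 200),
    "g3": ("chr1", 500), "g4": ("chr1", 700),
    "g5": ("chr2", 50),
}
example_order = order_by_synteny(example_contigs, example_reference)
print(example_order)
```

Genome rearrangements between the two species would mislead this kind of inference, so the result is best treated as a hint to be combined with other evidence.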
Advances in genomic technologies may make the automatic reconstruction of mammalian genomes possible in the near future. Recent advances in nanopore sequencing devices are already creating longer reads. This could lead to the ability to assemble complete eukaryotic genomes from nanopore data alone.
In the near future, it is also likely that many previously intractable genomes will be reconstructed with the help of long-read sequencing data coupled with paired-read information from chromosome conformation capture technologies, augmented by short-read and short mate-pair technologies aimed at resolving the small-scale structure of genomes.