Initial publication year: 2022
How to cite
Deciding on a sequencing approach for a new project can be challenging. Your choice depends primarily on the questions of interest, but also constraints set by funding, existing genomic resources, sample size, genome characteristics and expertise in the research group.
The main categories of population genomic sequencing approaches include…
Two approaches for whole-genome resequencing are distinguished by their coverage depth. Coverage depth refers to the average number of sequencing reads per region of the genome (distinct from coverage breadth, see (Sims et al. 2014)). With more sequencing reads at a region (higher coverage), errors that arise during sequencing have a smaller impact on accurately identifying variable sites (termed “genotyping”) (Fuentes-Pardo and Ruzzante 2017).
Whole-genome resequencing (WGS) is where whole genomes of multiple individuals are sequenced and mapped to an assembled reference genome in order to compare variable sites between individuals or populations. High-coverage (>20x) sequencing is able to find rare variants with high confidence. This approach is considered the ‘gold standard’ for DNA re-sequencing because it includes all variant types (i.e SNVs, indels, structural variants and CNVs) (Sims et al. 2014).
Low coverage whole genome sequencing (lcWGS) is where whole genomes of many individuals are sequenced, but depth of coverage is so low (<5× per site per individual, on average (R. Nielsen et al. 2011)) that individual genotypes cannot be confidently assigned. Instead of assigning specific (“hard called”) genotypes, lcWGS data requires probabilistic analyses that account for uncertainty about the true genotype and can incorporate uncertainty across all sequenced samples (Korneliussen, Albrechtsen, and Nielsen 2014). This approach is useful in cases where the research questions of interest are targeted at the population-level (i.e. allele frequencies, linkage disequilibrium (LD) patterns, etc) (Lou et al. 2021).
As the name suggests, this approach aims to sequence a reduced portion of the genome across multiple individuals at moderate to high coverage for variant discovery and genotyping with high confidence. The genome fragments can be selected at random, or through targeted probe assays. Variations of this type of sequencing approach include…
A class of methods involving sequencing a subset of genomic regions randomly distributed throughout the genome, often using restriction enzymes (i.e. restriction site-associated DNA (RADseq)).
Sequence capture methods use a set of probes designed with a-priori knowledge to focus sequencing effort on a set of hundreds to tens of thousands of specific loci. Several capture approaches for non-model species can utilise other sequencing strategies to design probes corresponding to genomic regions (i.e. Whole-exome sequencing (WES), see (Jones and Good 2016) for a review, Expressed Exome Capture Sequencing (eecSeq)(Puritz and Lotterhos 2018)).
Sequencing pools of individuals to provide information on genome-wide population allele frequencies (see review in (Futschik and Schlötterer 2010)).
It should be noted that you don’t need to choose only one approach for all samples! If you are starting a research program on a species without any genomic resources, you may want to start with denovo RADseq to understand basic population structure and genetic diversity, then once a reference genome is assembled move to high/low coverage WGS.In the MarineOmics seminar, Misha Matz recommended choosing one individual to use for genome assembly, doing high coverage WGS on a few other individuals that represent most of the variation in your taxa, and then WGS many other individuals at low coverage. The high coverage individuals can be useful for imputing missing genotypes in the low coverage samples.
The table below briefly summarizes the pros and cons of five sequencing approaches and their appropriateness for answering specific questions. If you are interested in answering multiple questions (i.e. neutral population structure and adaptive variation) then you would often want a method that can do both (i.e. WGS).
Goals | hcWGS | lcWGS | RADseq | Poolseq | Target.Capture |
---|---|---|---|---|---|
pros | provides high quality, high density genotypes; can be used to improve development of a reference genome | provides high density genotypes at a reduced cost | generally cheaper per sample, density of genotypes can be tuned to fit question, does not require a reference | many individuals can be mixed for a low library prep cost, only option for some larval studies | allows consistent capture of loci between batches, good for sequencing targeted regions across many samples |
cons | most expensive per sample, requires a reference genome assembly, large computational resources | requires a reference, can produce false (+) heterozygotes, sensitive to batch effects, may be inappropriate to use individual SNP calls for some analyses | only 1-5% of genome covered which limits studies on adaptive variation, de novo assembly can result in paralogs without quality control, can be hard to capture the same loci between batches | need higher sequencing depth so it can be more expensive than RAD (but less than other options), no info on individual genotypes (duh), requires a reference, requires some replicates to account for batch effects | expensive to start a new project and requires reference of some kind to design baits (unless using method like eecSeq), only gives info on targeted region so can result in ascertainment bias for some analyses |
population structure, genetic diversity | great, but cheaper methods work almost as well | good, but limited to popgen analyses based on genotype likelihoods | great, esp. for many samples | good for methods that only use allele frequencies, won’t work for individual based methods (eg Admixture) | ascertainment bias is likely to skew results, esp. for measures of genetic diversity |
demography (mig. rates, pop. size through time) | great if incorporating haplotype-based methods | good, esp. with some high coverage samples to help with imputation | good for methods based on site frequency spectrum (eg moments), but not ideal for methods using extended haplotype or phasing info | active area of method development, but still not common except in cases of multiple temporal samples (Nielsen et al. 2020) | likely not appropriate due to ascertainment bias |
signatures of selection, GWAS | great | good | hotly debated, but unless your genome is way too big, WGS is the way to go | good if covering majority of the genome | only works for targeted regions |
phylogenetic inference | computationally challenging, but good at all divergence levels | good for tree shapes but not branch lengths, good for extracting organelle sequences and cost-effective primers for species and hybrid ID studies | good at shallow to medium divergence levels, but need to play with filtering parameters | would work in certain situations | an improvement over RAD as there will be less missing data among samples |
genetic crosses/mapping panels | expensive | great | great as it allows for many invididuals | maybe | poor |
molecular evolution (accurate low-freq alleles required) | best | poor | poor | poor | good if only interested in targeted regions |
Below we list a number of great review articles and empirical comparison studies that can help when making the decision. This topic was also discussed during the MarineOmics seminars.
Review articles:
Empirical comparisons:
Presently, our site covers four approaches:
For each approach, we provide guiding principles and tutorials for bioinformatic processing from raw data to genotypes. While we currently don’t discuss library preparation methods in depth, we do include advice on how to organize your sequencing runs in order to improve quality control downstream.