Initial publication year: 2022
How to cite

Introduction

For the purpose of this guide, we use “whole genome (re)sequencing (WGS)” to refer to methods where a reference genome already exists (whether for the focal species or a related species), and uniquely barcoded samples are sequenced and then mapped to most or all of the reference genome. This method can provide high-density genetic variants (e.g., SNPs) and structural information for population genomic analyses, while also facilitating functional insights if the reference genome is functionally annotated. This is separate from whole-genome de novo sequencing, which aims to produce a reference genome by sequencing and assembling a complete genome for a species for the first time.

An excellent and thorough review of the methodology and considerations for WGS approaches, especially as they apply to non-model organisms, is Fuentes-Pardo and Ruzzante (2017). Briefly, current WGS approaches can be broadly categorized into two types: high-to-moderate-coverage WGS and low-coverage WGS. Their major distinction is that with high-to-moderate-coverage WGS, each individual is sequenced at a depth at which genotypes can be confidently called at most sites, whereas with low-coverage WGS, each individual is sequenced at a depth too low to call genotypes reliably, and downstream analyses should take the resulting genotype uncertainty into account.
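To make the notion of genotype uncertainty concrete, here is a minimal sketch of how the same observation (only reference reads at a site) supports a genotype call very differently at low and high depth. The simplified biallelic model, the 1% error rate, the flat prior implied by the normalization, and the function names are our assumptions for illustration; this is not the model used by any particular genotype likelihood software.

```python
import math

def genotype_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Likelihood of the read counts under genotypes RR, RA, and AA.

    Simplified biallelic model (an assumption for illustration): each read
    independently shows the alternate allele with probability error_rate
    (genotype RR), 0.5 (RA), or 1 - error_rate (AA).
    """
    p_alt = {"RR": error_rate, "RA": 0.5, "AA": 1.0 - error_rate}
    n = n_ref + n_alt
    return {
        g: math.comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
        for g, p in p_alt.items()
    }

def normalize(likelihoods):
    """Rescale likelihoods so they sum to 1 (equivalent to a flat prior)."""
    total = sum(likelihoods.values())
    return {g: v / total for g, v in likelihoods.items()}

# Two reference reads (low coverage) versus twenty (high coverage).
low = normalize(genotype_likelihoods(n_ref=2, n_alt=0))
high = normalize(genotype_likelihoods(n_ref=20, n_alt=0))

for label, probs in (("2 reads", low), ("20 reads", high)):
    summary = ", ".join(f"{g}={p:.4f}" for g, p in probs.items())
    print(f"{label}: {summary}")
```

Under these assumptions, two reference reads still leave roughly a one-in-five chance that the individual is heterozygous, whereas twenty reference reads make that possibility negligible. This is why low-coverage analyses carry the likelihoods forward rather than committing to a single called genotype.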

However, the line between high-to-moderate-coverage WGS and low-coverage WGS is not always as clear-cut as presented above. For example, with moderate-coverage WGS (e.g., 5-20x), many sites within an individual can still have low coverage due to random sampling, resulting in unreliable genotype calls and/or missing data that can become problematic in downstream analyses. Therefore, it could be preferable to avoid hard-calling genotypes with moderate-coverage WGS in certain applications. On the other hand, in populations with high levels of linkage disequilibrium (LD), it could be possible to leverage LD to carry out genotype imputation, making genotype calling considerably more accurate with low-coverage WGS data. Imputation is more likely to be successful when a high-quality reference panel exists in the system (Fuller et al. 2020; Rubinacci et al. 2021), but methods for imputation without such reference panels have also been developed, in which case a very large sample size would be required (Davies et al. 2016).
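The effect of random sampling on per-site depth can be approximated with a short calculation. The sketch below assumes that depth at each site follows a Poisson distribution around the genome-wide mean, and uses a hypothetical threshold of 4 reads as the minimum for a reliable genotype call; both are our assumptions for illustration. Real sequencing data are usually more variable than a Poisson model predicts, so these fractions are best read as optimistic.

```python
import math

def prob_depth_below(mean_depth, min_depth):
    """P(depth < min_depth) when per-site depth ~ Poisson(mean_depth)."""
    return sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_depth)
    )

# min_depth=4 is a hypothetical cutoff for a "reliable" genotype call.
for mean_depth in (5, 10, 20):
    frac = prob_depth_below(mean_depth, min_depth=4)
    print(f"mean depth {mean_depth}x: {frac:.4f} of sites expected below 4x")
```

Under this idealized model, about a quarter of sites in an individual sequenced to a mean depth of 5x would fall below 4 reads, compared with about 1% at 10x and a negligible fraction at 20x.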

That said, here are the main differences between these two approaches in practice. With the same sample size, high-to-moderate-coverage WGS tends to provide higher-resolution data that are more versatile and less susceptible to technical artefacts (especially those caused by sequencing errors) than low-coverage WGS, but it can be much more costly. Low-coverage WGS, in contrast, can be used to achieve a larger sample size on a fixed budget, which can then contribute to higher-resolution population-level inferences. It does, however, require a different computational toolbox that takes genotype uncertainty into account, and is thus constrained by the limitations of the current toolbox (see Section 6 in Lou et al. 2021 for a more detailed discussion).
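The budget trade-off described above comes down to simple arithmetic. The sketch below uses entirely hypothetical costs, a hypothetical genome size, and a hypothetical function name; substitute quotes from your own sequencing provider. It assumes a fixed library preparation cost per sample plus a sequencing cost that scales with genome size and target depth.

```python
def max_samples(budget, cost_per_library, cost_per_gb, genome_size_gb, depth):
    """Number of individuals affordable at a given target depth."""
    cost_per_sample = cost_per_library + cost_per_gb * genome_size_gb * depth
    return int(budget // cost_per_sample)

# All values below are hypothetical placeholders, not real prices.
BUDGET = 10_000          # total budget, in arbitrary currency units
COST_PER_LIBRARY = 10    # library preparation cost per sample
COST_PER_GB = 5          # sequencing cost per gigabase of output
GENOME_SIZE_GB = 1.0     # genome size in gigabases

for depth in (20, 2):
    n = max_samples(BUDGET, COST_PER_LIBRARY, COST_PER_GB, GENOME_SIZE_GB, depth)
    print(f"target depth {depth}x: up to {n} individuals")
```

With these placeholder numbers, the same budget covers 90 individuals at 20x or 500 individuals at 2x. Because library preparation is a fixed per-sample cost, it takes up a growing share of the budget as depth decreases, so it should not be left out of this kind of estimate.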

Tutorials

Here we provide a couple of tutorials by our working group members, as well as links to other detailed tutorials on the web. The goal of these MarineOmics tutorials is to explain in detail why certain parameters are chosen, and to offer some guidance on how to evaluate different parameter options to fit your data.

High to moderate coverage WGS

Fastq-to-VCF SnakeMake pipeline: in-depth explanation of an automated short-read mapping and variant calling pipeline maintained by Harvard Informatics. Useful if you plan to adopt this pre-packaged automated and parallelizable pipeline, and would like to understand its different components, but not necessarily change it substantially.
Fastq-to-VCF workflow: detailed walkthrough of processing 15x sequencing depth WGS data for cod, from raw reads to a VCF. Useful if you would like to run each component of the pipeline yourself and potentially tweak some of them for your own purpose. In other words, you can more easily add, skip, or change parts of this pipeline, but will lose the convenience offered by an automated pipeline.

Low coverage WGS

The quality control and read alignment part of the pipeline for high-to-moderate-coverage WGS also applies for low-coverage WGS. Therefore, the two tutorials for high-to-moderate-coverage WGS are also useful for low-coverage WGS until the point where variants and genotypes are called. In addition to these, here are some resources specifically designed for low-coverage WGS.

Low-coverage WGS tutorial: a tutorial for the processing and analysis of low-coverage WGS data (i.e. from raw fastq files to population genomic inference), with example datasets and hands-on exercises. It is associated with the paper Lou et al. (2021).
Detection and mitigation of batch effects: a tutorial for the detection and mitigation of batch effects with low-coverage WGS data, with example datasets and hands-on exercises. It is associated with the paper Lou and Therkildsen (2021).
Low-coverage WGS data analysis pipeline: a collection of scripts for the efficient and reproducible analysis of low-coverage WGS data (i.e. from bam to population genomic inference).
Low-coverage WGS data processing pipeline: a collection of scripts for the efficient and reproducible processing of low-coverage WGS data (i.e. from raw fastq to bam). This pipeline should also be compatible with high-to-moderate-coverage WGS data.

References

Davies, Robert W, Jonathan Flint, Simon Myers, and Richard Mott. 2016. “Rapid Genotype Imputation from Sequence Without Reference Panels.” Nat. Genet. 48 (8): 965–69. https://doi.org/10.1038/ng.3594.

Fuentes-Pardo, Angela P, and Daniel E Ruzzante. 2017. “Whole-Genome Sequencing Approaches for Conservation Biology: Advantages, Limitations and Practical Recommendations.” Mol. Ecol. 26 (20): 5369–5406. https://doi.org/10.1111/mec.14264.

Fuller, Zachary L., Veronique J. L. Mocellin, Luke A. Morris, Neal Cantin, Jihanne Shepherd, Luke Sarre, Julie Peng, et al. 2020. “Population Genetics of the Coral Acropora millepora: Toward Genomic Prediction of Bleaching.” Science 369 (6501): eaba4674. https://doi.org/10.1126/science.aba4674.

Lou, Runyang Nicolas, Arne Jacobs, Aryn P Wilder, and Nina Overgaard Therkildsen. 2021. “A Beginner’s Guide to Low-Coverage Whole Genome Sequencing for Population Genomics.” Mol. Ecol., July. https://doi.org/10.1111/mec.16077.

Lou, Runyang Nicolas, and Nina Overgaard Therkildsen. 2021. “Batch Effects in Population Genomic Studies with Low-Coverage Whole Genome Sequencing Data: Causes, Detection, and Mitigation.” Authorea Preprints, August. https://doi.org/10.22541/au.162791857.78788821/v2.

Rubinacci, Simone, Diogo M Ribeiro, Robin J Hofmeister, and Olivier Delaneau. 2021. “Efficient Phasing and Imputation of Low-Coverage Sequencing Data Using Large Reference Panels.” Nat. Genet. 53 (1): 120–26. https://doi.org/10.1038/s41588-020-00756-0.

Whole Genome Resequencing for Population Genomics

Katherine Silliman, Nicolas Lou
