Initial publication year: 2022
How to cite

Setup for running code

If you would like to run the R code examples that are scattered throughout the guide (recommended but not required!), you will need to install some R packages. Only need to run this code once:

install.packages("tidyverse")
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SeqArray")
BiocManager::install("SNPRelate")

Now load those packages, if using:

library(SeqArray) # efficient storage and filtering of genomic data

## Loading required package: gdsfmt

## 
## Attaching package: 'SeqArray'

## The following object is masked from 'package:stringr':
## 
##     fixed

library(tidyverse) # plotting data formatting and manipulation
library(SNPRelate) # PCA and other popgen analyses

## SNPRelate

We also provide optional alternative coding examples that are based only on the command line, requiring these software packages:

Reduced Representation Sequencing (RADseq/GBS)

“Restriction-site Associated DNA sequencing” - RADseq - combines restriction enzymes with next-gen, massively parallel, short-read sequencing. RADseq involves the use of restriction enzymes, which are used to shear DNA at restriction enzyme cutsites. RADseq comes in different flavors. Double-digest RADseq (ddRAD; Peterson et al. (2012)) selects markers with two restriction enzymes with different cut frequencies. Genotype-by-Sequencing (GBS; Elshire et al. (2011)) uses a frequent-cutting restriction enzyme with PCR size selection. There are a number of reviews comparing different RADseq and GBS methods (e.g., Andrews et al. (2016)). In this guide we use the term “RADseq” to refer to any of these protocols, including those that don’t involve random shearing of data. When recommendations are specific to a certain type of RAD/GBS, we will explicitly say so.

Using RADseq to generate single nucleotide polymorphisms (SNPs) involves:

library preparation in the lab
bioinformatic processing through assembly and/or mapping to a reference, then
filtering of SNPs and individuals for quality.

All of these steps can (and will) introduce some error, so the goal is to minimize this error through mitigation steps at all three parts of the process. Every dataset is different!

Much of this guide is directly inspired by this excellent review paper (O’Leary et al. 2018), especially the section on minimizing errors due to library prep. We recommend reading and cross-referencing with this paper, and citing it if you follow its suggestions. Table 1 from this paper summarizes the various potential issues that can arise from RAD datasets and some mitigation steps. The goal of this guide is to expand on the O’Leary paper and provide some example code to help implement quality control and mitigation steps.

Table 1 from (O’Leary et al. 2018).

Considerations During Lab Work

There are steps you can take before you even sequence RAD libraries that can help minimize issues downstream. Here, we use “library” to refer to a set of RADseq fragments from a group of individuals that are barcoded and sequenced together on a single lane or group of lanes. While specific RAD library prep methods have their own nuances for minimizing error, there are some steps you can take that are common across methods.

If this RAD/GBS method has not been done in your species or in your molecular lab setup before, spend some time optimizing the protocol using a representative subset of individuals. Then try to keep everything about the library prep as consistent as possible across samples (eg, DNA extraction kit, PCR cycles, sequencing platform). This isn’t always possible, especially if you need to optimize the protocol for certain tricky samples. Just make sure to keep track of everything!
Randomize samples across library prep batches and sequencing lanes! For example, if you are sequencing two different groups of samples on two different sequencing lanes, make sure they are randomized with respect to sample location or whatever your groups of interest are. If you are preparing groups of individuals in different batches to be pooled later, randomize across these batches. Alicia Mastretta-Yanes even recommends randomizing across DNA extractions, especially if the person doing the extractions is new to molecular work. This is to allow you to control for potential batch effects that are often observed with RAD data.

Keep track of all potential batch effect sources in a Metadata file (eg, storage conditions of tissue/DNA, date/method of DNA extraction, date/method library prep batch, sequencing lane).
Have a core set of 2-4 technical sample replicates across all libraries. Ideally these are true technical replicates, meaning the same tissue/DNA sample is processed multiple times all the way from library preparation to sequencing. Sequencing replicates, where you sequence the same sample library preparation multiple times, can also be useful for downstream quality control.

Principles for Analyzing Your Data

Steps for a robust RAD analysis

This is just one approach for working through your data. Some people will prefer to run just a subset of samples through a pipeline at first and evaluate parameters, then run all the samples through. Either way, be prepared to make MULTIPLE assemblies and go through this process iteratively, especially if this is a new-to-you study system.

Look at your raw data.
Run an assembly pipeline, using either all samples or a representative subset.
Evaluate potential sources of error.
Remove individuals based on initial evaluation.
Test a range of key parameters to optimize your assembly.
Run your optimized assembly on all non-removed samples.
Filter your SNPs
Repeat as needed!

First, look at the raw data!

Always look at your data with FastQC before starting an assembly. First, this is a good check to just make sure the sequencing worked. If you have demultiplexed data, you can use MultiQC to generate FastQC plots for all individuals and quickly identify ones that did not sequence well.

Check out this informative tutorial on running FastQC and interpreting the output.

Questions to ask: Do you have a lot of adapter sequences? Are the ends of your reads very low quality? If so, you should expect a fair amount of trimming and read filtering to occur prior to assembly. If that doesn’t occur or if too many reads are being filtered so as to only recover a small number of SNPs, something might need to be tweaked with your trimming and filtering step.

You should also look at the top few reads in the Terminal.

## zcat: unZip and conCATenate the file to the screen
## head -n 20: Just take the first 8 lines of input

$ zcat raw-fastqs/BC2_10_C5.fastq.gz | head -n 20
@BC2_10_C5.1 1 length=96
CAGCGTGTAATAGTCACCGGCGGCTCCCTCTGGAGAATAGCACAAGTGATCATTTTGCTCATCTTCCGTCCACTGGTGATTGTGGACCAGCCTCAC
+BC2_10_C5.1 1 length=96
<GGGGGA<GGGGIIIGIGGIGGGIIIIIGGGGGGGGGGGGGGGGGIIIIIIIIGGIIIGGGGIGIIGIGGGIIIIIAGGGGIIIGIIGGGGGAAGG
@BC2_10_C5.2 2 length=96
CTGCTACATGCAGTGTTCTGTATTACTTTTATTGTACGTTGATATGAATGAATGAGTGTTTTGTATACTTAGAGTACAAGTTTGTCAGTCATATCG
+BC2_10_C5.2 2 length=96
GIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIG
@BC2_10_C5.3 3 length=96
CAGCACATGTTCCTGTGTAGAAAGCTTGTTAGTAGAATAAATAACACATGGCTGGTCAAACACAACACATGAAGAAACAACTTTCTGAACAGTTTT
+BC2_10_C5.3 3 length=96
GIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIGGGGIIGIIIGGGII
@BC2_10_C5.4 4 length=96
CAGCGATTCGGCCCAAATTTGCACCACATCAGGCCCTTGACAGGGCGCTTCGATGGTGCAAATTTGGTGCGATTCGCTGCGCACCTAGCATATATG
+BC2_10_C5.4 4 length=96
AGGGIGIIIGIGGGIIGIGGGIIIGGGGAGGGGGAGGIGIGGIGGGGAGGGGGGGGGGGGGGGIIIII.G<GGIGGGIIIGGGGGIGGIIIGAGGI
@BC2_10_C5.5 5 length=96
CAGCAGTTTGGTGGAGTTCTGCAACCTTCCATTTCCAAAGAATTACCCAGGAGCTCTTCCCAGTGAATTTCTTCGGCACTTTTCATTGACCTTTTA
+BC2_10_C5.5 5 length=96
GGAGAAA.<AGAA.G.GGAGA<.GGA.<GAAGGAAGGIA<...<GA<..<G.<.<<.<AAAGGGG..<GGGGG<G.A.<GGGII.AG..<.GGGGG

If the sequencing center gave you one fastq file with all your samples, you should expect to see a barcode sequence, followed by the cutsite at the start of the read. If the Data are already demultiplexed (as the example is), you should only see the cutsite overhang (in this case, CWGC). Sometimes you can look at your fastq data files and see that there was a problem with the sequencing such that the cut site is either offset by one or more bases, or contains many errors. If this is not being addressed by the default filtering steps in your assembly pipeline, you can trim off N bases from the beginning or end of R1 and R2 reads in ipyrad (trim_reads param), or with cutadapt before using stacks, or customize the Trimmoatic step for dDocent.

Run an assembly pipeline

There are a number of freely available pipelines for processing RADseq data, with the most popular being dDocent, Stacks, and ipyrad. These pipelines vary slightly in their underlying methodologies, customization options, and additional included analyses.

	dDocent	ipyrad	Stacks2
Supported datatypes	Paired-end: ddRAD, ezRAD, RAD (random shearing), data with large overlap between forward and reverse reads; Single-end: any RADseq method de novo and reference-based. If doing de novo assembly, reads cannot be trimmed outside of dDocent	Wide range of RAD/GBS methods with paired/single end data, see ipyrad documentation	Paired-end and single-end data for single and double digest RAD and DART; GBS with single-end sequencing. NOT suitable for paired-end GBS.
Unique aspects	Novel data reduction approach used to inform coverage cutoffs	Python API for popgen and phylogenetic analyses written specifically for RAD data	`populations` module calculates sliding window and site-specific popgen metrics
Documentation	Very good (esp. tutorials), active community support on Google Groups	Excellent (esp. installation, parameter explanations, and tutorials), active community support on Gitter	Very good, in depth tutorials as published manuscripts, active community support on Google Groups
Speed/Accuracy (based on discussion in panel seminar)	fastest and most accurate	close to dDocent accuracy, more over-splitting is possible if parameters are not tuned	produces some untrue genome fragments (esp. with higher levels of indel polymorphism), but can be addressed with downstream filtering
Open source/development	Open source, primarily combines existing software	Open source	No
Filtering options	Minimal default filtering as implemented in VCFtools. Ideal for those who want full freedom in filtering their SNPs.	Wide range of filtering options, some hard-coded filtering to deal with paralogs	Moderate range of filtering options
Output options	Only produces a VCF, fasta of the de novo assembly, and individual BAM files from mapping to the reference	Lots of output formats (see documentation)	Lots of Stacks-specific output and log files, as well as various inputs for popgen programs (see documentation)
Popgen analyses	No additional popgen analyses included	Unique, flexible Python API implementing range of popgen and phylogenetic methods (see documentation)	Popgen summary statistics (F-statistics, pi), including kernel-smoothing along reference genome. Underlying equations and assumptions for these are not very clear.

Some groups have also developed pipelines for specific flavors of RAD (e.g., Matz lab for 2b-RAD) or proprietary software (eg, DARTseq). In most cases, it is recommended that you use a pipeline developed and tested for RAD data, especially if you are making a de novo assembly. (LaCava et al. 2020) have an excellent study where they review various de novo assemblers that are used in these pipelines. Assemblers that were not explicitly developed for short reads (eg, Velvet, ABySS) performed very poorly, while CD-HIT (the assembler in dDocent) performed the best. Stacks/Stacks2 and vsearch (the assembler in ipyrad) both performed worse when analyzing simulated data with indels. It should be noted that all of these RAD pipelines have optimized the specific parameters of these de novo assemblers to work with RAD data of various flavors, which may not be fully reflected in the LaCava study.

Based on the MarineOmics RAD panel, the general consensus is that all the popular pipelines can produce adequately accurate datasets (with appropriate parameter optimization and data filtering).
Table 3 from LaCava et al 2019. CD-HIT is used in dDocent, VSEARCH is used in ipyrad.

Evaluate potential sources of error

Once you have processed your samples (or a subset of samples) through a genotyping pipeline, you will have a bunch of different output file options for genetic data. The VCF file is one of the most popular file formats, and is the most versatile for initial data exploration as many programs exist to filter and accept VCF files. Below are a list of potential confounding factors that may exist in your data, and how to tease them out.

Using SeqArray in R If you are following along with the R code examples, we need to 1st read in our data (skip this section if not running R code).

For the code examples in this section, we primarily make use of the R package SeqArray (Zheng et al. 2017), which can read and manipulate VCFs. If you’re familiar with R, SeqArray is simple to use. The package can efficiently store sequence variant calls with hundreds of thousands of individuals, variants (e.g., SNPs, indels and structural variation calls) and annotations. It employs a data container called a CoreArray Genomic Data Structure (GDS). It’s super-fast (5X faster than PLINK v1.9; 16X faster than vcftools) and it integrates well with other R packages you might use in your analysis pipeline. (i.e. SNPRelate, SeqVarTools). We also like it because you can filter your data before running certain analyses without 1st generating a separate filtered VCF file.

First, we need to convert our VCF file into the GDS format. We will do this once here, then use the GDS file for subsequent code examples.

filename = "OL_subset" #replace with y our file name 
filename.gds = paste0("POP_02_RADseq_files/", paste0(filename, ".gds"))
filename.vcf = paste0("POP_02_RADseq_files/", paste0(filename, ".vcf"))
  # 1 . Convert VCF to GDS
SeqArray::seqVCF2GDS(vcf.fn = filename.vcf, out.fn = filename.gds, storage.option="ZIP_RA")

## Sat Mar 18 13:56:29 2023
## Variant Call Format (VCF) Import:
##     file(s):
##         OL_subset.vcf (22.3M)
##     file format: VCFv4.0
##     genome reference: pseudo-reference (most common base at site)
##     the number of sets of chromosomes (ploidy): 2
##     the number of samples: 18
##     genotype storage: bit2
##     compression method: ZIP_RA
##     # of samples: 18
## Output:
##     POP_02_RADseq_files/OL_subset.gds
## Parsing 'OL_subset.vcf':
## + genotype/data   { Bit2 2x18x70005 ZIP_ra(34.4%), 216.0K }
## Digests:
##     sample.id  [md5: dd28e5c928ffd0a817743a0e9447a808]
##     variant.id  [md5: 03df6156357de104368e6ed4694ebf92]
##     position  [md5: 690fe39440c87b7bfb5eeb07d7c0a310]
##     chromosome  [md5: 9f967c382b54d12060ab45d8c293652b]
##     allele  [md5: fb5bdfa95fb2448dd960bb41c96f7bff]
##     genotype  [md5: 25350d469a1102cc70b367663332282f]
##     phase  [md5: cd66242aebb89cfc3f2082c9413847dc]
##     annotation/id  [md5: c95d7c12f4fdae536da42bfec73942c9]
##     annotation/qual  [md5: 2864e4ded2a2cdc9dccfeb45c4fb3465]
##     annotation/filter  [md5: 2134c67ca1fdd51b7d3bf17ca1ca2c9e]
##     annotation/info/NS  [md5: 4ad6db26b594a567be24b0bff7f1f909]
##     annotation/info/DP  [md5: c0e5e2a0856c1f43788d2fa9842d2e7a]
##     annotation/format/DP  [md5: ae6991de5adbf082a733198faf0f11ca]
##     annotation/format/CATG  [md5: 71d62ffa848999a375e8bb5f329be275]
## Done.
## Sat Mar 18 13:56:30 2023
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'POP_02_RADseq_files/OL_subset.gds' (2.8M)
##     # of fragments: 214
##     save to 'POP_02_RADseq_files/OL_subset.gds.tmp'
##     rename 'POP_02_RADseq_files/OL_subset.gds.tmp' (2.8M, reduced: 1.8K)
##     # of fragments: 64
## Sat Mar 18 13:56:30 2023

gdsin = SeqArray::seqOpen(filename.gds)
print(paste0("The number of SAMPLES in data: ", length(c(SeqArray::seqGetData(gdsin, "sample.id")))))

## [1] "The number of SAMPLES in data: 18"

print(paste0("The number of SNPs in data: ",  length(c(SeqArray::seqGetData(gdsin, "variant.id")))))

## [1] "The number of SNPs in data: 69989"

It is always helpful to have a metadata file with information for each sample, such as sampling site, sequencing library, etc. In our example, our metadata file (OL.popmap) is tab-delimited and has the column headers:
ID: sample ID STRATA: sampling location/population PLATE: sequencing batch
Next we read in our metadata file, and make sure the samples are in the same order as your VCF file:

metafile = "POP_02_RADseq_files/OL.popmap"
sample.ids = seqGetData(gdsin, "sample.id")
sample.strata =  read.table(metafile, header = T, sep = "\t") %>%
                  dplyr::select(ID, STRATA, PLATE)

Now, on to evaluating our data!

“Bad” samples

Sometimes a sample doesn’t sequence well (few sequencing reads, higher error rate). This can be due to DNA quality, an issue during library prep, or not enough sequencing depth (average # of reads per sample). Generally, it will lead to a sample with fewer sequencing reads, higher missing data in a SNP dataset, and fewer shared loci with other samples. Identifying and then removing these samples prior to the final RADseq assembly analysis can help minimize mis-assembled loci, genotyping errors, and excessive filtering of acceptable loci.

The process of identifying low quality individuals is usually iterative, as the way you initially filter your SNPs will influence the amount of missing data and locus sharing among samples. This is why we recommend minimally filtering your SNPs for sample coverage (the # of individuals a locus is called in) when initially exploring your data. Some ways to identify bad samples:

For every SNP dataset you generate, it is a good idea is always evaluate the missingness per sample (and report this distribution in your manuscript!). Identify samples with way more missingness than the rest, and observe how they look in a PCA and locus sharing plot. If they stick out or all cluster together in the middle, then try removing them from the assembly and seeing if it changes downstream analyses. If so, you may want to specify a missingness cutoff for including samples in the final analysis.

Missingness in R with SeqArray

#using previously loaded gdsin object
print("Per variant: ")

## [1] "Per variant: "

summary(m1 <- SeqArray::seqMissing(gdsin, per.variant=TRUE))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2778  0.3889  0.3855  0.4444  0.9444

print("Per sample: ")

## [1] "Per sample: "

summary(m2 <- SeqArray::seqMissing(gdsin, per.variant=FALSE))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1218  0.1574  0.2287  0.3855  0.6213  0.9177

samples <- SeqArray::seqGetData(gdsin, "sample.id")
cbind(samples,m2)[order(-m2),]

##       samples     m2                 
##  [1,] "OR3_13_C3" "0.917672777150695"
##  [2,] "CA4_12_C1" "0.886539313320665"
##  [3,] "OR3_3_C7"  "0.873308662789867"
##  [4,] "CA4_13_C1" "0.692623126491306"
##  [5,] "BC2_6_C2"  "0.634999785680607"
##  [6,] "BC2_9_C4"  "0.580248324736744"
##  [7,] "CA4_8_C9"  "0.365486004943634"
##  [8,] "CA4_1_C4"  "0.292274500278615"
##  [9,] "OR3_1_C3"  "0.245538584634728"
## [10,] "CA4_2_C3"  "0.211790424209519"
## [11,] "CA4_14_C8" "0.201074454557145"
## [12,] "OR3_20_C4" "0.180756976096244"
## [13,] "BC2_10_C5" "0.165683178785238"
## [14,] "OR3_9_C8"  "0.154638586063524"
## [15,] "OR3_5b_C6" "0.15163811456086" 
## [16,] "BC2_17_C7" "0.132277929388904"
## [17,] "BC2_12_C6" "0.130006143822601"
## [18,] "BC2_16_C6" "0.121776279129578"

#plot histogram 
hist(m2,breaks=50)

Another common method of filtering and evaluating you data without ever using R is VCFtools.

Missingness with vcftools, on the command line

vcftools --vcf POP_02_RADseq_files/OL_subset.vcf --missing-indv --out POP_02_RADseq_files/OL_subset
# sort the file by most missing data and print the top 10 samples  
cat POP_02_RADseq_files/OL_subset.imiss | (read h; echo "$h"; sort -k5 -r)

## /Users/jason/.bashrc: line 1: /Users/jason/perl5/perlbrew/etc/bashrc: No such file or directory
## 
## VCFtools - 0.1.17
## (C) Adam Auton and Anthony Marcketta 2009
## 
## Parameters as interpreted:
##  --vcf POP_02_RADseq_files/OL_subset.vcf
##  --missing-indv
##  --out POP_02_RADseq_files/OL_subset
## 
## After filtering, kept 18 out of 18 Individuals
## Outputting Individual Missingness
## After filtering, kept 69989 out of a possible 69989 Sites
## Run Time = 0.00 seconds
## INDV N_DATA  N_GENOTYPES_FILTERED    N_MISS  F_MISS
## OR3_13_C3    69989   0   64227   0.917673
## CA4_12_C1    69989   0   62048   0.886539
## OR3_3_C7 69989   0   61122   0.873309
## CA4_13_C1    69989   0   48476   0.692623
## BC2_6_C2 69989   0   44443   0.635
## BC2_9_C4 69989   0   40611   0.580248
## CA4_8_C9 69989   0   25580   0.365486
## CA4_1_C4 69989   0   20456   0.292275
## OR3_1_C3 69989   0   17185   0.245539
## CA4_2_C3 69989   0   14823   0.21179
## CA4_14_C8    69989   0   14073   0.201074
## OR3_20_C4    69989   0   12651   0.180757
## BC2_10_C5    69989   0   11596   0.165683
## OR3_9_C8 69989   0   10823   0.154639
## OR3_5b_C6    69989   0   10613   0.151638
## BC2_17_C7    69989   0   9258    0.132278
## BC2_12_C6    69989   0   9099    0.130006
## BC2_16_C6    69989   0   8523    0.121776

You can open the file analysis/OL_subset.imiss in any text editor an look at the missingness values by eye.

To plot the missingness on the command line, you can use gnuplot:

#code from Jon Puritz
mawk '!/IN/' POP_02_RADseq_files/OL_subset.imiss | cut -f5 > totalmissing
gnuplot << \EOF
set terminal dumb size 120, 30
set autoscale
unset label
set title "Histogram of % missing data per individual"
set ylabel "Number of Occurrences"
set xlabel "% of missing data"
#set yr [0:100000]
binwidth=0.01
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes
pause -1
EOF

## /Users/jason/.bashrc: line 1: /Users/jason/perl5/perlbrew/etc/bashrc: No such file or directory
##                                                                                                                         
##                                                                                                                         
##                                          Histogram of % missing data per individual                                     
##        2 +----------------------------------------------------------------------------------------------------------+   
##          |   ** *    +           +           +           +          +           +           +           +           |   
##          |   ** *                                             'totalmissing' using (bin($1,binwidth)):(1.0) ******* |   
##          |   ** *                                                                                                   |   
##          |   ** *                                                                                                   |   
##          |   ** *                                                                                                   |   
##      1.5 |-+ ** *                                                                                                 +-|   
##          |   ** *                                                                                                   |   
##          |   ** *                                                                                                   |   
##          |   ** *                                                                                                   |   
##          |   ** *                                                                                                   |   
##        1 |-**** *******************************************************************************************       +-|   
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |   
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |   
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |   
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |   
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |   
##      0.5 |-* ** * * * * *    *      *                 *               *     *             *           * * *       +-|   
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |   
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |   
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |   
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |   
##          | * ** * * *+* *    *   +  *        +        *  +          + *     *   +         * +         * * *         |   
##        0 +----------------------------------------------------------------------------------------------------------+   
##         0.1         0.2         0.3         0.4         0.5        0.6         0.7         0.8         0.9          1   
##                                                       % of missing data                                                 
##

It looks like samples OR3_13_C3,CA4_12_C1, OR3_3_C7, CA4_13_C1, BC2_6_C2, BC2_9_C4 have a lot of missing data relative to other samples. We can remove them from our dataset for now, and just keep exploring our data, but if you decide to exclude them permanently from the analysis you should eventually rerun your genotyping pipeline without those samples, esp. if doing a de novo assembly.

Another great way to explore missingness with your data is to construct a heatmap of loci that are genoptyped between pairs of samples. Generally, samples that are more closely related will share more loci with each other due to cutsite dropout. Deviations from that pattern can pinpoint bad samples as well as batch effects.

ipyrad API tutorial for making a locus sharing heatmap, requires ipyrad output or ipyrad’s vcf2hdf5 conversion
code to generate locus sharing/missingness heatmap from a VCF

The power of PCA

One of the most powerful methods for exploring your data is a Principle Components Analysis. From this genetics tutorial: “To understand how PCA works, consider a single individual and its representation by its 593,124 markers. Formally, each individual is a point in a 593,124-dimensional space, where each dimension can take only the three possible genotypes indicated above, or have missing data. To visualize this high-dimensional dataset, we would like to project it down to two dimensions. But as there are many ways to project the shadow of a three-dimensional object on a two dimensional plane, there are many (and even more) ways to project a 593,124-dimensional cloud of points to two dimensions. What PCA does is figuring out the “best” way to do this project in order to visualise the major components of variance in the data.” See here for a linear algebra-based explanation of PCA.

When you have a VCF file with SNPs, use PCA before extensive filtering or playing with parameters to look at the data. Check which SNPs are associated with axes showing the most variation.

Here we use SeqArray and SNPRelate to run a PCA in R.

Reminder: Missing data is a feature of RAD. Be aware of how different analysis tools deal with missingness, especially PCA which will fill in all missing data with some values. Here, the PCA from SNPRelate imputes missing genotypes as the mean genotype across samples.

Doing a PCA with SeqArray and SNPRelate in R.

#open the gds file previously created, if not already open
#filename.gds = "POP_02_RADseq_files/OL_subset.gds" #replace with your file name 
#gdsin = SeqArray::seqOpen(filename.gds)
# exclude samples previously identified as having too much missing data
bad_samples = c("OR3_13_C3","CA4_12_C1","OR3_3_C7","CA4_13_C1","BC2_6_C2","BC2_9_C4")
sample.ids = seqGetData(gdsin, "sample.id")
keep = sample.ids[which(!sample.ids %in% bad_samples)]

Whether your data are denovo or reference-based, it is important to filter out linked sites before performing a PCA. Here’s how to do it in SeqArray. Note, if you have already filtered your VCF file to have only 1 SNP per RAD locus, you don’t need to do this:

snpset <- SNPRelate::snpgdsLDpruning(gdsin, ld.threshold=0.2, autosome.only = F, start.pos="random", num.thread=1, remove.monosnp = T, sample.id = keep)  
snpset.id <- unlist(unname(snpset))

Now we will actually run the PCA, again removing the samples with missing data, keeping unlinked SNPs, and removing SNPs with less than 5% minor allele frequency.

# PCA only on SNPs with a minor allele freq greater than 5%
pca.out = SNPRelate::snpgdsPCA(autosome.only = F, gdsin, num.thread=2, remove.monosnp = T, maf = 0.05,
                               snp.id=snpset.id,
                               sample.id = keep) # filtering for pruned SNPs

## Principal Component Analysis (PCA) on genotypes:
## Calculating allele counts/frequencies ...
## # of selected variants: 22,693
## Excluding 5,350 SNVs (monomorphic: TRUE, MAF: 0.05, missing rate: NaN)
##     # of samples: 12
##     # of SNVs: 22,693
##     using 2 threads
##     # of principal components: 32
## CPU capabilities:
## Sat Mar 18 13:57:03 2023    (internal increment: 40960)
## [..................................................]  0%, ETC: ---        [==================================================] 100%, completed, 0s
## Sat Mar 18 13:57:03 2023    Begin (eigenvalues and eigenvectors)
## Sat Mar 18 13:57:03 2023    Done.

#close the gds file (saves memory)
#seqClose(gdsin)

eig = pca.out$eigenval[!is.na(pca.out$eigenval)]
barplot(100*eig/sum(eig), main="PCA Eigenvalues")

First color/shape the individuals by the factor you expect to matter (eg, sampling site or region, family, ecotype).

#PLOT PCA
#PC1 v PC2 colored by collection location
id.order = sapply(keep, function(x,df){which(df$ID == x)}, df=sample.strata) #incase your strata file is not in the same order as your vcf
sample.strata.order = sample.strata[id.order,]
print(
  as.data.frame(pca.out$eigenvect) %>%
      tibble::add_column(., STRATA =  sample.strata.order$STRATA) %>%
      ggplot(., aes(x=V1, y=V2, color = STRATA)) + 
      geom_point(size=2) +
      stat_ellipse(level = 0.95, size = 1) +
      geom_hline(yintercept = 0) +
      geom_vline(xintercept = 0) +
      theme_bw() +
      xlab(paste0("PC1 [",paste0(round(eig[1], 2)), "%]")) +
      ylab(paste0("PC2 [",paste0(round(eig[2], 2)), "%]")) +
      ggtitle("PCA Colored by Collection Location")
)

## Warning: Using `size` aesthetic for lines was deprecated in
## ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

Questions to ask about your PCA:

Is there any clustering? If so, is it different than your expectation?
- If different than expectation, suggests either batch effects, cryptic variation among individuals, or systematic biological issues affecting assembly/mapping
Are there outlier samples driving a large amount of variation?
- If so, they may be cryptic species, clones/sample replicates, or they sequenced/genotyped poorly
- In our example we see 2 potential outlier samples in the bottom left corner. We can check which samples they are by plotting the PCA with sample ID and/or looking at the actual PC scores.

#PLOT PCA
#PC1 v PC2 with Sample Labels
print(
  as.data.frame(pca.out$eigenvect) %>%
      tibble::add_column(., ID =  sample.strata.order$ID) %>%
      ggplot(., aes(x=V1, y=V2, label = ID)) +
      geom_text(size =3) +
      geom_hline(yintercept = 0) +
      geom_vline(xintercept = 0) +
      theme_bw() +
      xlab(paste0("PC1 [",paste0(round(eig[1], 2)), "%]")) +
      ylab(paste0("PC2 [",paste0(round(eig[2], 2)), "%]")) +
      ggtitle("PCA with Labels")
)

# which samples have PC1 > -0.7 and PC2 < -0.3?
as.data.frame(pca.out$eigenvect) %>%
      tibble::add_column(., ID =  sample.strata.order$ID) %>%
  filter(V1 > 0.7 & V2 < -0.3) %>%
  select(ID,V1,V2)

Is there no clustering at all? If so, it may be due to:
- bioinformatic artifacts leading to noise, see Testing Parameters
- issues with how you are filtering the data (eg, too strict or too lax)
- issues with how your PCA is treating missing data
- or you actually have no structure in your dataset!

If you used sample replicates, first see if replicates cluster very close together. If not, then there may be batch effects or bioinformatic artifacts leading to genotyping error that must be addressed. Then remove replicates for a subsequent PCA evaluation.

When removing a sample replicate, you can choose one at random or pick the replicate with the least missing data.

# example code for removing samples from a SNPRelate PCA 
# assumes you have already loaded a vcf file as a GDS
samples.to.remove = c("SampleA_rep","SampleB_rep")
sample.ids = seqGetData(gdsin, "sample.id")
keep = sample.ids[which(!sample.ids %in% bad_samples)]

# PCA only on SNPs with a minor allele freq greater than 2.5%, only keeping samples in keep
pca.out.noreps= SNPRelate::snpgdsPCA(autosome.only = F, gdsin, num.thread=1, remove.monosnp = T, maf = 0.05,
                               snp.id=snpset.id,
                               sample.id = keep)

Next, color the individuals by potential sources of batch effects (sequencing lane, library prep batch, age of tissue sample, person doing the DNA extraction…).

# again, but colored by batch
print(
  as.data.frame(pca.out$eigenvect) %>%
      tibble::add_column(., PLATE =  as.factor(sample.strata.order$PLATE)) %>%
      ggplot(., aes(x=V1, y=V2, color = PLATE))  + #label = ID
      geom_point(size=2) +
      stat_ellipse(level = 0.95, size = 1) +
      geom_hline(yintercept = 0) +
      geom_vline(xintercept = 0) +
      theme_bw() +
      xlab(paste0("PC1 [",paste0(round(eig[1], 2)), "%]")) +
      ylab(paste0("PC2 [",paste0(round(eig[2], 2)), "%]")) +
      ggtitle("PCA colored by Batch/Sequencing Plate")
)

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## Warning: Removed 7 rows containing missing values (`geom_path()`).

Do any observed clusters match a source of batch effect? How to remove batch effects

Batch effects

As discussed in Considerations During Lab Work, batch effects can arise by minor (or not so minor) differences during the library prep and sequencing stage. Randomizing samples across libraries and sequencing lanes can help mitigate the influence of batch effects on your downstream analyses, but sometimes this isn’t possible, esp. when using pre-existing data. Even if you randomize your samples, it is a good idea to check for batch effects with a PCA and then try to mitigate them.

Nicolas Lou and Nina Therkildsen have a great article and Github repo discussing causes, detection, and mitigation of batch effects in low coverage WGS, where batch effects can have particularly large effects (@ Lou and Therkildsen 2021). While some of their recommendations are specific to reference-based analyses, others are relevant to RAD methods too (eg, batch effects due to different sequencing platforms, DNA degradation levels, and sequencing depth).

You can identify which loci are driving the batch effect by 1) identifying the top SNPs contributing to PC variation (termed “loadings”) and/or 2) do an Fst outlier analysis with the batches specified as populations (eg, Bayescan). Both of these methods are most appropriate if you’ve randomized your samples between batches. If you have not randomized samples but are certain there is a batch effect driving variation in your dataset, you can still remove loci this way but you may also be removing biologically informative loci.

In SeqArray/SNPRelate, if you have batch effects that are identifiable by PCA, you would extract the PC loadings for each SNP at the PC which represents the batch effect, plot a histogram to see the distribution, choose a cutoff, then filter those SNPs out and create a new VCF for downstream analyses. For this example, let’s pretend that we see a batch effect on PC1.

# assume you have a GDS object called gdsin and a PCA object called pca.out
# extract PCA loadings
SnpLoad <- snpgdsPCASNPLoading(pca.out, gdsin)

## SNP Loading:
##     # of samples: 12
##     # of SNPs: 22,693
##     using 1 thread
##     using the top 12 eigenvectors
## Sat Mar 18 13:57:04 2023    (internal increment: 65536)
## [..................................................]  0%, ETC: ---        [==================================================] 100%, completed, 0s
## Sat Mar 18 13:57:04 2023    Done.

names(SnpLoad)

## [1] "sample.id"  "snp.id"     "eigenval"   "snploading" "TraceXTX"  
## [6] "Bayesian"   "avgfreq"    "scale"

dim(SnpLoad$snploading)

## [1]    12 22693

#plot a histogram of absolute PC loading values for PC1, with larger loading meaning more effect on PC1
pc<-1
hist(sort(abs(SnpLoad$snploading[pc,]),decreasing=T,index.return=T)[[1]],breaks = 30,main="PC loadings: PC1",)

Let’s remove the largest effect loci over abs(0.010) and see if it changes our PCA.

#extract SNP.IDs for loadings on PC 1 > abs(0.010) 
batch.snp.ids <- SnpLoad$snp.id[which(abs(SnpLoad$snploading[1,]) >= 0.010)]

#rerun PCA
snp.id.allfilt <- setdiff(snpset.id,batch.snp.ids)
pca.out.pcfilt = SNPRelate::snpgdsPCA(autosome.only = F, gdsin, num.thread=2, remove.monosnp = T, maf = 0.05,
                               snp.id=snp.id.allfilt,
                               sample.id = keep) # filtering for pruned SNPs

## Principal Component Analysis (PCA) on genotypes:
## Calculating allele counts/frequencies ...
## # of selected variants: 19,885
## Excluding 5,350 SNVs (monomorphic: TRUE, MAF: 0.05, missing rate: NaN)
##     # of samples: 12
##     # of SNVs: 19,885
##     using 2 threads
##     # of principal components: 32
## CPU capabilities:
## Sat Mar 18 13:57:04 2023    (internal increment: 40960)
## [..................................................]  0%, ETC: ---        [==================================================] 100%, completed, 0s
## Sat Mar 18 13:57:04 2023    Begin (eigenvalues and eigenvectors)
## Sat Mar 18 13:57:04 2023    Done.

print(
  as.data.frame(pca.out.pcfilt$eigenvect) %>%
      tibble::add_column(., STRATA =  sample.strata.order$STRATA) %>%
      ggplot(., aes(x=V1, y=V2, color = STRATA)) + 
      geom_point(size=2) +
      stat_ellipse(level = 0.95, size = 1) +
      geom_hline(yintercept = 0) +
      geom_vline(xintercept = 0) +
      theme_bw() +
      xlab(paste0("PC1 [",paste0(round(eig[1], 2)), "%]")) +
      ylab(paste0("PC2 [",paste0(round(eig[2], 2)), "%]")) +
      ggtitle("PCA Colored by Collection Location")
)

Our PCA is now clustering better by population, so we will now export a new VCF that includes all the filtering we’ve done so far: 1) removed individuals with missing data, 2) thinned our SNPs for linkage, and 3) removed SNPs driving a batch effect.

To save your filtered GDS object as a VCF in SeqArray:

# set a filter to exclude samples with missing data, and SNPs in linkage or with a batch effect
seqSetFilter(gdsin, sample.id=keep,variant.id=snp.id.allfilt)

## # of selected samples: 12
## # of selected variants: 25,235

# convert to vcf
seqGDS2VCF(gdsin, "POP_02_RADseq_files/OL-miss-pc1-linkage-filt.vcf.gz")

## Loading required namespace: Rsamtools

## Hint: install Rsamtools to enable the BGZF-format output.

## Sat Mar 18 13:57:04 2023
## VCF Export: OL-miss-pc1-linkage-filt.vcf.gz
##     12 samples, 25,235 variants
##     INFO Field: NS, DP
##     FORMAT Field: DP, CATG
##     output to a general gzip file
## [..................................................]  0%, ETC: ---    [==================================================] 100%, completed, 1s
## Sat Mar 18 13:57:05 2023    Done.

#close the gds file to save memory
closefn.gds(gdsin)

Cryptic species/contamination/clones

One issue that is particularly common, especially for marine invertebrates, are cryptic, yet highly diverged lineages occurring in the same geographic location. If you are trying to do a landscape/seascape genetics study, cryptic species can dramatically throw off population genetic analyses. A PCA using all the samples can help identify cryptic species if samples are forming clusters that do not align with any other possible factor (eg, geography, sequencing batch).

You can also check heterozygosity. Large differences in heterozygosity between outliers can suggest different species or cryptic hybrids. Two outliers are obvious in the previous PCA. Large differences in heterozygosity between samples are clear, with one location showing two distinct clusters of samples. These samples were later determined using mtDNA and morphological analysis to be cryptic species. Unpublished data from a marine species in Australian waters.

You can then estimate genetic divergence between these clusters using pairwise Fst and/or phylogenetic inference to determine how diverged the groups are. If they are very diverged, it may be better to assemble each cryptic species separately, especially for denovo assemblies with dDocent and Stacks, especially since often times restriction cut-sites are not always conserved; see (Wagner et al. 2013; Rubin, Ree, and Moreau 2012). These two pipelines used a set of given samples to create a “reference” or “catalog” to which all other samples are mapped. If highly diverged samples are included in creating this reference, it can lead to oversplitting.

Cryptic clones can also dramatically throw off population genetic analyses. In a PCA, clones will often appear as their own very tight cluster, similar to sample replicates. A calculation of sample pairwise distance (e.g., Manhattan distance) can very duplicates/clones.

Test a range of key parameters

There are many parameter choices to make when assembling and genotyping, and those parameters vary slightly across pipelines. The parameters below are general across RADseq pipelines (although with different names), and are shown to have the largest impact on genoyping error rates.

Clustering threshold

(de novo assembly only)

During de novo RADseq assembly, reads are assembled into contigs (contiguous sequences), so that each contig represents a single locus. This assembly can occur separately for each individual (ipyrad) or jointly for multiple individuals (dDocent, Stacks). To call SNPs, in the case of ipyrad, contigs are then clustered and aligned across individuals, while in dDocent and Stacks, reads for each individual are mapped to the “reference” contigs. All of these pipelines use sequence similarity (or clustering threshold) to determine if fragments are homologous (from the same place in the genome) and therefore should be clustered together.

If the clustering threshold is too low, undersplitting can occur so that multiple loci are combined into a single contig. Undersplitting is common with repetitive regions and paralogs. It can inflate mean observed heterozygosity. If the threshold is too high, it can lead to oversplitting, where alleles of the same locus are split into two or more contigs. This can lead to reduced mean observed heterozygosity.

If severely off, both of these issues can bias the models used to identify SNPs. Undersplitting a little is better than oversplitting, as it is easier to identify lumped paralogs downstream and remove them.

What influences the “optimal” clustering threshold of your data?

Clustering threshold is usually determined by how much genetic diversity you have in your dataset. If your dataset has high genetic diversity (eg., a single species with high heterozygosity, multiple highly diverged populations, multiple species), then you will need a lower clustering threshold. If your dataset has low genetic diversity (a single species with lower heterozygosity, a single family), then you will need a higher clustering threshold. Polyploid or highly repetitive (i.e., large) genomes may also have different requirements than diploid species with smaller genomes. Other factors, such as sequencing error rate, may play a role as well. We recommend always testing a few clustering thresholds with a new dataset and/or new species. However, as long as your clustering threshold is in the ballpark it shouldn’t affect downstream analyses (according to Isaac Overcast).

There are a few strategies to identify a good clustering threshold for your data, all of which require assembling your data with a range of clustering thresholds.
1) Metrics developed by (McCartney-Melstad, Gidiş, and Shaffer 2019), based on popgen theory and landscape genetic expectations
2) Utilize sample replicates (either exact replicates or individuals from the same locality) to quantify genotyping error and minimize error while maximizing the number of informative loci (Mastretta-Yanes et al. 2015) 3) Evaluating the number of haplotypes across loci, with many 3+ haplotypes indicating undersplitting (Harvey et al. 2015; Ilut, Nydam, and Hare 2014) 4) Make assemblies at a range of thresholds and parameters, plot the # of SNPs or contigs per assembly, and make a judgement call example in dDocent tutorial. If most parameters give you about the same # of SNPs or contigs, you can feel confident that this parameter doesn’t matter much for your data and just choose a reasonable value.

Mapping parameters

(reference-based and some de novo assemblers)

When a sequencing read maps to the wrong assembled contig or place in a reference genome (ie, a non-homologous locus), this can result in erroneous SNPs. In many cases, these SNPs can be filtered out later (link to filtering). However, data should be checked to make sure there aren’t systematic, wide-scale mapping issues that need to be addressed.

For referenced-based assembly, both dDocent and ipyrad make use of bwa(link) to map reads, while Stacks allows the user to choose their aligner. dDocent also utilizes bwa for de novo assemblies.

Mapping defaults for popular RADseq pipelines:
L: clipping penalty for 5’ and 3’
A: matching score
O: gap open penalty
T: mapping quality filter

dDocent: bwa mem -L 20,5 -t 8 -a -M -T 10 -A1 -B 3 -O 5 -R

can change -A and -O interactively during the dDocent pipeline.

ipyrad: (defaults) -L 5,5 -B 4 -A1 -O 6 -T 30

requires changing the ipyrad source code to change parameters

Our whole genome resequencing tutorials] has some good content on mapping. General considerations with RAD data:

how many reads aligned for each sample? If many reads didn’t aligned, this could be an issue with the reference, read trimming/filtering, or the mapping parameters.
What is the distribution of mapping scores?
ratio of mean (across individuals) mapping quality scores for reference and alternate allele (O’Leary et al. 2018). 0.25-1.75 cut off implemented in dDocent.
- this info might not be coded in all VCFs

Read depth threshold

Mistakes that occur during sequencing can lead to a base-calling error in a read, which can then lead to an erroneous SNP. This type of error most commonly manifests as rare alleles (eg, only observed in one individual) or singletons (only one copy of the minor allele across all individuals). A conservative approach to deal with these errors is to filter out all loci with rare alleles (link filtering), but for some demographic analyses this filtering is inappropriate as they require the site frequency spectrum.
The best ways to minimize the impact of sequencing error during bioinformatic processing is to 1) properly trim and filter your data, and 2) require higher read depth thresholds for assembly and genotyping.

1) Filtering reads

FASTQ files that typically come from the sequencing center include a Phred scale quality score for each base, which indicates the probability of that base call being correct. This information can be used prior to assembly to trim low-quality sections from either end of reads or remove reads altogether. Some SNP callers, including FreeBayes which is part of dDocent, also use Phred scores when calling SNPs (ipyrad and Stacks do not). All three of the most popular RAD pipelines include steps to filter reads due to low quality bases low quality bases or adapter sequences, but vary slightly on what kinds of read trimming are allowed. Always make sure to filter out adapter sequences (this is not default with stacks), otherwise default parameters are usually adequate. If your initial fastqc indicates that you have many low quality bases in your reads, you may lose too much of your data during filtering. In this case, you may want to make your read filtering a little more permissible (keeping in mind that this may increase SNP error rate).
Always make sure to check with your sequencing center to see if they have done anything to the raw reads before you received them!

dDocent:

does not allow you to use reads that have already been trimmed and filtered for de novo assembly, only referenced-based.
Uses Trimmomatic to trim Illumina TruSeq adaters and low quality bases. Make sure to change the defaults if you used a different adapter/sequencing system.

Stacks:

requires your reads to be all the same length, so if they were previously trimmed of low quality bases or adapter sequence this must be specified and they will be truncated to the same length.

2) Require higher read depth thresholds for assembly and genotyping.

For RADseq, ideally you are generating an average of at least 10-20 sequence reads (10x-20x coverage) for each RAD locus in your library preparation (but this doesn’t always happen…). For diploid species, this would be sequencing each RAD haplotype at least 5x-10x. The SNP calling step of RAD pipelines all require setting a minimum number of reads to determine whether a site is homozygous or heterozygous. If this threshold is too low, sequencing errors are more likely to be included and the SNP caller will have a harder time accurately identifying SNPs (in which case, you’d be better off with genotype likelihood methods. If it’s too high, then real polymorphic loci will be excluded. The too high threshold for your dataset can usually best be determined by using a few different read depth thresholds and comparing how the results change downstream analyses.

Filtering SNPs

(and embrace the missingness!)

Tutorials for this section are still under development! Please contribute on Github or provide feedback in Discussions

Even with a reduced representation sequencing approach, your assembly and genotyping pipeline will generate data for thousands of SNPs across your individuals. These SNPs will always need to be filtered before doing downstream analyses.

Main reasons to filter your data:

Remove potentially erroneous SNPs that remain after optimizing your assembly pipeline
Remove SNPs that are uninformative for downstream analyses
Subsample to remove SNPs that are in linkage disequilibrium (required for analyses that assume independence like PCA, Structure)

Unique to RADseq approaches is the missingness of RAD data.

Don’t over filter for missingness
- Papers showing how too much filtering can skew results:
  - Empirical: (Eaton et al. 2017; Tripp et al. 2017; Martı́n-Hernanz et al. 2019; Silliman et al. 2021; Dı́az-Arce and Rodrı́guez-Ezpeleta 2019)
  - Simulation investigating effects and sources of missing data: (Huang and Knowles 2016; Leaché et al. 2015; Gautier et al. 2013; Rivera-Colón, Rochette, and Catchen 2021)
Some ways to account for missing data in downstream analyses:
- impute mising data (PCA), subsample (PCA, structure, pretty much anything with SNPs), try a few different missing filters and see what impact it has on results

Jon Puritz has a great tutorial on filtering SNPs from RADseq data assembled with dDocent. While it requires certain VCF fields to be present that aren’t generated by ipyrad or Stacks, it contains great concepts on how to filter.

Repeat as needed

Helpful tutorials

References

Andrews, Kimberly R, Jeffrey M Good, Michael R Miller, Gordon Luikart, and Paul A Hohenlohe. 2016. “Harnessing the Power of RADseq for Ecological and Evolutionary Genomics.” Nat. Rev. Genet. 17 (2): 81–92. https://doi.org/10.1038/nrg.2015.28.

Dı́az-Arce, Natalia, and Naiara Rodrı́guez-Ezpeleta. 2019. “Selecting RAD-Seq Data Analysis Parameters for Population Genetics: The More the Better?” Front. Genet. 10 (May): 533. https://doi.org/10.3389/fgene.2019.00533.

Eaton, Deren A R, Elizabeth L Spriggs, Brian Park, and Michael J Donoghue. 2017. “Misconceptions on Missing Data in RAD-seq Phylogenetics with a Deep-Scale Example from Flowering Plants.” Syst. Biol. 66 (3): 399–412. https://doi.org/10.1093/sysbio/syw092.

Elshire, Robert J, Jeffrey C Glaubitz, Qi Sun, Jesse A Poland, Ken Kawamoto, Edward S Buckler, and Sharon E Mitchell. 2011. “A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species.” PLoS One 6 (5): e19379. https://doi.org/10.1371/journal.pone.0019379.

Gautier, Mathieu, Karim Gharbi, Timothee Cezard, Julien Foucaud, Carole Kerdelhué, Pierre Pudlo, Jean-Marie Cornuet, and Arnaud Estoup. 2013. “The Effect of RAD Allele Dropout on the Estimation of Genetic Variation Within and Between Populations.” Mol. Ecol. 22 (11): 3165–78. https://doi.org/10.1111/mec.12089.

Harvey, Michael G, Caroline Duffie Judy, Glenn F Seeholzer, James M Maley, Gary R Graves, and Robb T Brumfield. 2015. “Similarity Thresholds Used in DNA Sequence Assembly from Short Reads Can Reduce the Comparability of Population Histories Across Species.” PeerJ 3 (April): e895. https://doi.org/10.7717/peerj.895.

Huang, Huateng, and L Lacey Knowles. 2016. “Unforeseen Consequences of Excluding Missing Data from Next-Generation Sequences: Simulation Study of RAD Sequences.” Syst. Biol. 65 (3): 357–65. https://doi.org/10.1093/sysbio/syu046.

Ilut, Daniel C, Marie L Nydam, and Matthew P Hare. 2014. “Defining Loci in Restriction-Based Reduced Representation Genomic Data from Nonmodel Species: Sources of Bias and Diagnostics for Optimal Clustering.” Biomed Res. Int. 2014 (June): 675158. https://doi.org/10.1155/2014/675158.

LaCava, Melanie E F, Ellen O Aikens, Libby C Megna, Gregg Randolph, Charley Hubbard, and C Alex Buerkle. 2020. “Accuracy of de Novo Assembly of DNA Sequences from Double-Digest Libraries Varies Substantially Among Software.” Mol. Ecol. Resour. 20 (2): 360–70. https://doi.org/10.1111/1755-0998.13108.

Leaché, Adam D, Andreas S Chavez, Leonard N Jones, Jared A Grummer, Andrew D Gottscho, and Charles W Linkem. 2015. “Phylogenomics of Phrynosomatid Lizards: Conflicting Signals from Sequence Capture Versus Restriction Site Associated DNA Sequencing.” Genome Biol. Evol. 7 (3): 706–19. https://doi.org/10.1093/gbe/evv026.

Lou, Runyang Nicolas, and Nina Overgaard Therkildsen. 2021. “Batch Effects in Population Genomic Studies with Low-Coverage Whole Genome Sequencing Data: Causes, Detection, and Mitigation.” Authorea Preprints, August. https://doi.org/10.22541/au.162791857.78788821/v2.

Martı́n-Hernanz, Sara, Abelardo Aparicio, Mario Fernández-Mazuecos, Encarnación Rubio, J Alfredo Reyes-Betancort, Arnoldo Santos-Guerra, Marı́a Olangua-Corral, and Rafael G Albaladejo. 2019. “Maximize Resolution or Minimize Error? Using Genotyping-By-Sequencing to Investigate the Recent Diversification of Helianthemum (Cistaceae).” Front. Plant Sci. 10 (November): 1416. https://doi.org/10.3389/fpls.2019.01416.

Mastretta-Yanes, A, N Arrigo, N Alvarez, T H Jorgensen, D Piñero, and B C Emerson. 2015. “Restriction Site-Associated DNA Sequencing, Genotyping Error Estimation and de Novo Assembly Optimization for Population Genetic Inference.” Mol. Ecol. Resour. 15 (1): 28–41. https://doi.org/10.1111/1755-0998.12291.

McCartney-Melstad, Evan, Müge Gidiş, and H Bradley Shaffer. 2019. “An Empirical Pipeline for Choosing the Optimal Clustering Threshold in RADseq Studies.” Mol. Ecol. Resour. 19 (5): 1195–1204. https://doi.org/10.1111/1755-0998.13029.

O’Leary, Shannon J, Jonathan B Puritz, Stuart C Willis, Christopher M Hollenbeck, and David S Portnoy. 2018. “These Aren’t the Loci You’e Looking for: Principles of Effective SNP Filtering for Molecular Ecologists.” Mol. Ecol., July. https://doi.org/10.1111/mec.14792.

Peterson, Brant K, Jesse N Weber, Emily H Kay, Heidi S Fisher, and Hopi E Hoekstra. 2012. “Double Digest RADseq: An Inexpensive Method for de Novo SNP Discovery and Genotyping in Model and Non-Model Species.” PLoS One 7 (5): e37135. https://doi.org/10.1371/journal.pone.0037135.

Rivera-Colón, Angel G, Nicolas C Rochette, and Julian M Catchen. 2021. “Simulation with RADinitio Improves RADseq Experimental Design and Sheds Light on Sources of Missing Data.” Mol. Ecol. Resour. 21 (2): 363–78. https://doi.org/10.1111/1755-0998.13163.

Rubin, Benjamin E R, Richard H Ree, and Corrie S Moreau. 2012. “Inferring Phylogenies from RAD Sequence Data.” PLoS One 7 (4): e33394. https://doi.org/10.1371/journal.pone.0033394.

Silliman, Katherine, Jane L Indorf, Nancy Knowlton, William E Browne, and Carla Hurt. 2021. “Base-Substitution Mutation Rate Across the Nuclear Genome of Alpheus Snapping Shrimp and the Timing of Isolation by the Isthmus of Panama.” BMC Ecol Evol 21 (1): 104. https://doi.org/10.1186/s12862-021-01836-3.

Tripp, Erin A, Yi-Hsin Erica Tsai, Yongbin Zhuang, and Kyle G Dexter. 2017. “RADseq Dataset with 90% Missing Data Fully Resolves Recent Radiation of Petalidium (Acanthaceae) in the Ultra-Arid Deserts of Namibia.” Ecol. Evol. 7 (19): 7920–36. https://doi.org/10.1002/ece3.3274.

Wagner, Catherine E, Irene Keller, Samuel Wittwer, Oliver M Selz, Salome Mwaiko, Lucie Greuter, Arjun Sivasundar, and Ole Seehausen. 2013. “Genome-Wide RAD Sequence Data Provide Unprecedented Resolution of Species Boundaries and Relationships in the Lake Victoria Cichlid Adaptive Radiation.” Mol. Ecol. 22 (3): 787–98. https://doi.org/10.1111/mec.12023.

Zheng, Xiuwen, Stephanie M Gogarten, Michael Lawrence, Adrienne Stilp, Matthew P Conomos, Bruce S Weir, Cathy Laurie, and David Levine. 2017. “SeqArray-a Storage-Efficient High-Performance Data Format for WGS Variant Calls.” Bioinformatics 33 (15): 2251–57. https://doi.org/10.1093/bioinformatics/btx145.

RADseq

Katherine Silliman, Danielle Davenport