If you would like to run the R code examples that are scattered throughout the guide (recommended but not required!), you will need to install some R packages:

install.packages("tidyverse")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("SeqArray")
BiocManager::install("SNPRelate")

Now load those packages, if using:

library(ggplot2)
library(SeqArray)
library(tidyverse)
library(SNPRelate)

We also provide alternative coding examples that are based only on the command line, requiring these software packages:

“Restriction-site Associated DNA sequencing” - RADseq - combines restriction enzymes with next-gen, massively parallel, short-read sequencing. RADseq involves the use of restriction enzymes which are used to shear DNA at restriction enzyme cutsite. RADseq comes in different flavors. Double-digest RADseq (ddRAD; Peterson et al.,2012) selects markers with two restriction enzymes with different cut frequencies. Genotype-by-Sequencing (GBS; (Elshire et al. 2011)) uses a frequent-cutting restriction enzyme with PCR size selection. There are a number of reviews comparing different RADseq.GBS methods (e.g., (Andrews et al. 2016)). In this guide we use the term “RADseq” to refer to any of these protocols, including those that don’t involve random shearing of data. When recommendations are specific to a certain type of RAD/GBS, we will explicitly say so.

Using RADseq to generate single nucleotide polymorphisms (SNPs) involves 1) library preparation in the lab, 2) bioinformatic processing through assembly and/or mapping to a reference, then 3) filtering of SNPs and individuals for quality. All of these steps can (and will) introduce some error, so the goal is to minimize this error through mitigation steps at all three parts of the process. Every dataset is different [add more about philosophy over best practices].
Much of this guide is directly inspired by this excellent review paper by (O’Leary et al. 2018), especially the section on minimizing errors due to library prep. We recommend reading and cross-referencing with this paper, and citing it if you follow its suggestions. Table 1 from this paper summarizes the various potential issues that can arise from RAD datasets and some mitigation steps. The goal of this guide is to expand on the O’Leary paper and provide some example code to help implement quality control and mitigation steps.

# Considerations During Lab Work

There are steps you can take before you even sequence RAD libraries that can help minimize issues downstream. Here, we use “library” to refer to a set of RADseq fragments from a group of individuals that are barcoded and sequenced together on a single lane or group of lanes. While specific RAD library prep methods have their own nuances for minimizing error, there are some steps you can take that are common across methods.

• If this RAD/GBS method has not been done in your species or in your molecular lab setup before, spend some time optimizing the protocol using a representative subset of individuals. Then try to keep everything about the library prep as consistent as possible across samples (eg, DNA extraction kit, PCR cycles, sequencing platform). This isn’t always possible, especially if you need to optimize the protocol for certain tricky samples. Just make sure to keep track of everything!

• Randomize samples across library prep batches and sequencing lanes! For example, if you are sequencing two different groups of samples on two different sequencing lanes, make sure they are randomized in respect to sample location or whatever your groups of interest are. If you are preparing groups of individuals in different batches to be pooled later, randomize across these batches. Alicia Mastretta-Yanes even recommends randomizing across DNA extractions, especially if the person doing the extractions is new to molecular work. This is to allow you to control for potential batch effects that are often observed with RAD data.

• Keep track of all potential batch effect sources in a Metadata file (eg, storage conditions of tissue/DNA, date/method of DNA extraction, date/method library prep batch, sequencing lane).

• Have a core set of 2-4 technical sample replicates across all libraries. Ideally these are true technical replicates from library preparation to sequencing, but even just sequencing replicates can be useful for downstream quality control.

# Principles for Analyzing Your Data

## Steps for a robust RAD analysis

This is just one approach for working through your data. Some people will prefer to run just a subset of samples through a pipeline at first and evaluate parameters, then run all the samples through. Either way, be prepared to make MULTIPLE assemblies and go through this process iteratively, especially if this is a new-to-you study system.

1. Look at your raw data.
2. Run an assembly pipeline through to a SNP dataset, using either all samples or a representative subset. Parameters can come from those used in a similar study, or default parameters.
3. Filter your data minimally, then evaluate for potential sources of error.
4. Subset or remove individuals based on initial evaluation.
5. Using a representative subset of samples, test key parameters to optimize your assembly.
6. Run your optimized assembly on all non-removed samples.
7. Evaluate the difference between multiple filtering schemes for your analyses of interest. Popgen analysis guide coming soon
8. Repeat as needed.

## First, look at the raw data!

Always look at your data with FastQC before starting an assembly. First, this is a good check to just make sure the sequencing worked. If you have demultiplexed data, you can use MultiQC to generate FastQC plots for all individuals and quickly identify ones that did not sequence well.
Check out this informative tutorial on running FastQC and interpreting the output.
Questions to ask: Do you have a lot of adapter sequences? Are the ends of your reads very low quality? If so, you should expect a fair amount of trimming and read filtering to occur prior to assembly. If that doesn’t occur or if too many reads are being filtered so as to only recover a small number of SNPs, something might need to be tweaked with your trimming and filtering step.

You should also look at the top few reads in the Terminal.

## zcat: unZip and conCATenate the file to the screen
## head -n 20: Just take the first 8 lines of input

$zcat raw-fastqs/BC2_10_C5.fastq.gz | head -n 20 @BC2_10_C5.1 1 length=96 CAGCGTGTAATAGTCACCGGCGGCTCCCTCTGGAGAATAGCACAAGTGATCATTTTGCTCATCTTCCGTCCACTGGTGATTGTGGACCAGCCTCAC +BC2_10_C5.1 1 length=96 <GGGGGA<GGGGIIIGIGGIGGGIIIIIGGGGGGGGGGGGGGGGGIIIIIIIIGGIIIGGGGIGIIGIGGGIIIIIAGGGGIIIGIIGGGGGAAGG @BC2_10_C5.2 2 length=96 CTGCTACATGCAGTGTTCTGTATTACTTTTATTGTACGTTGATATGAATGAATGAGTGTTTTGTATACTTAGAGTACAAGTTTGTCAGTCATATCG +BC2_10_C5.2 2 length=96 GIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIG @BC2_10_C5.3 3 length=96 CAGCACATGTTCCTGTGTAGAAAGCTTGTTAGTAGAATAAATAACACATGGCTGGTCAAACACAACACATGAAGAAACAACTTTCTGAACAGTTTT +BC2_10_C5.3 3 length=96 GIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIGGGGIIGIIIGGGII @BC2_10_C5.4 4 length=96 CAGCGATTCGGCCCAAATTTGCACCACATCAGGCCCTTGACAGGGCGCTTCGATGGTGCAAATTTGGTGCGATTCGCTGCGCACCTAGCATATATG +BC2_10_C5.4 4 length=96 AGGGIGIIIGIGGGIIGIGGGIIIGGGGAGGGGGAGGIGIGGIGGGGAGGGGGGGGGGGGGGGIIIII.G<GGIGGGIIIGGGGGIGGIIIGAGGI @BC2_10_C5.5 5 length=96 CAGCAGTTTGGTGGAGTTCTGCAACCTTCCATTTCCAAAGAATTACCCAGGAGCTCTTCCCAGTGAATTTCTTCGGCACTTTTCATTGACCTTTTA +BC2_10_C5.5 5 length=96 GGAGAAA.<AGAA.G.GGAGA<.GGA.<GAAGGAAGGIA<...<GA<..<G.<.<<.<AAAGGGG..<GGGGG<G.A.<GGGII.AG..<.GGGGG  If the sequencing center gave you one fastq file with all your samples, you should expect to see a barcode sequence, followed by the cutsite at the start of the read. If the Data are already demultiplexed (as the example is), you should only see the cutsite overhang (in this case, CWGC). Sometimes you can look at your fastq data files and see that there was a problem with the sequencing such that the cut site is either offset by one or more bases, or contains many errors. If this is not being addressed by the default filtering steps in your assembly pipeline, you can trim off N bases from the beginning or end of R1 and R2 reads in ipyrad (trim_reads param), or with cutadapt before using stacks, or customize the Trimmoatic step for dDocent. ## Run an assembly pipeline There are a number of freely available pipelines for processing RADseq data, with the most popular being dDocent, Stacks, and ipyrad. These pipelines vary slightly in their underlying methodologies, customization options, and additional included analyses. dDocent ipyrad Stacks2 Supported datatypes Paired-end: ddRAD, ezRAD, RAD (random shearing), data with large overlap between forward and reverse reads Single-end: any RADseq method de novo and reference-based. If doing de novo assembly, reads cannot be trimmed outside of dDocent | | | | Unique aspects | Novel data reduction approach used to inform coverage cutoffs | | | | Documentation | Very good (esp. tutorials), active community support on Google Groups | Excellent (esp. installation, parameter explanations, and tutorials), active community support on Gitter | Very good, active community support on Google Groups | | Speed/Accuracy | fastest and most accurate | close to dDocent accuracy, more over-splitting is possible if parameters are not tuned | produces some untrue genome fragments (esp. with higher levels of indel polymorphism), but can be addressed with downstream filtering | | Open source/development | Open source | Open source | no | | Filtering options | Minimal default filtering as implemented in VCFtools. Ideal for those who want full freedom in filtering their SNPs. | | | | Output options | Only produces a VCF, fasta of the de novo assembly, and individual BAM files from mapping to the reference | | | | Popgen analyses | No additional popgen analyses included | | | | | | | | Some groups have also developed pipelines for specific flavors of RAD (e.g., Matz lab for 2b-RAD) or proprietary software (eg, DARTseq), which we touch on below. In most cases, it is recommended that you use a pipeline developed and tested for RAD data, especially if you are making a de novo assembly. (LaCava et al. 2020) have an excellent study where they review various de novo assemblers that are used in these pipelines. Assemblers that were not explicitly developed for short reads (eg, Velvet, ABySS) performed very poorly, while CD-HIT (the assembler in dDocent) performed the best. Stacks/Stacks2 and vsearch (the assembler in ipyrad) both performed worse when analyzing simulated data with indels. It should be noted that all of these RAD pipelines have optimized the specific parameters of these de novo assemblers to work with RAD data of various flavors, which may not be fully reflected in the LaCava study. Based on the MarineOmics RAD panel, the general consensus is that all the popular pipelines can produce adequately accurate datasets (with appropriate parameter optimization and data filtering). ## Evaluate potential confounding factors Once you have processed your samples (or a subset of samples) through a genotyping pipeline, you will have a bunch of different output file options for genetic data. The VCF file is one of the most popular file formats, and is the most versatile for initial data exploration as many programs exist to filter and accept VCF files. Below are a list of potential confounding factors that may exist in your data, and how to tease them out. Using SeqArray in R If you are following along with the R code examples, we need to 1st read in our data (skip this section if not running R code). For the code examples in this section, we primarily make use of the R package SeqArray (Zheng et al. 2017), which can read and manipulate VCFs. If you’re familiar with R, SeqArray is simple to use. The package can efficiently store sequence variant calls with hundreds of thousands of individuals, variants (e.g., SNPs, indels and structural variation calls) and annotations. It employs a data container called a CoreArray Genomic Data Structure (GDS). It’s super-fast (5X faster than PLINK v1.9; 16X faster than vcftools) and it integrates well with other R packages you might use in your analysis pipeline. (i.e. SNPRelate, SeqVarTools). We also like it because you can filter your data before running certain analyses without 1st generating a separate filtered VCF file. First, we need to convert our VCF file into the GDS format. We will do this once here, then use the GDS file for subsequent code examples. filename = "OL_subset" #replace with your file name filename.gds = paste0("RAD_data/", paste0(filename, ".gds")) filename.vcf = paste0("RAD_data/", paste0(filename, ".vcf")) # 1 . Convert VCF to GDS SeqArray::seqVCF2GDS(vcf.fn = filename.vcf, out.fn = filename.gds, storage.option="ZIP_RA") ## Mon Nov 8 14:56:19 2021 ## Variant Call Format (VCF) Import: ## file(s): ## OL_subset.vcf (22.3M) ## file format: VCFv4.0 ## the number of sets of chromosomes (ploidy): 2 ## the number of samples: 18 ## genotype storage: bit2 ## compression method: ZIP_RA ## # of samples: 18 ## Output: ## RAD_data/OL_subset.gds ## Parsing 'OL_subset.vcf': ## + genotype/data { Bit2 2x18x70005 ZIP_ra(34.4%), 216.0K } ## Digests: ## sample.id [md5: dd28e5c928ffd0a817743a0e9447a808] ## variant.id [md5: 03df6156357de104368e6ed4694ebf92] ## position [md5: 690fe39440c87b7bfb5eeb07d7c0a310] ## chromosome [md5: 9f967c382b54d12060ab45d8c293652b] ## allele [md5: fb5bdfa95fb2448dd960bb41c96f7bff] ## genotype [md5: 25350d469a1102cc70b367663332282f] ## phase [md5: cd66242aebb89cfc3f2082c9413847dc] ## annotation/id [md5: c95d7c12f4fdae536da42bfec73942c9] ## annotation/qual [md5: 2864e4ded2a2cdc9dccfeb45c4fb3465] ## annotation/filter [md5: 2134c67ca1fdd51b7d3bf17ca1ca2c9e] ## annotation/info/NS [md5: 4ad6db26b594a567be24b0bff7f1f909] ## annotation/info/DP [md5: c0e5e2a0856c1f43788d2fa9842d2e7a] ## annotation/format/DP [md5: ae6991de5adbf082a733198faf0f11ca] ## annotation/format/CATG [md5: 71d62ffa848999a375e8bb5f329be275] ## Done. ## Mon Nov 8 14:56:20 2021 ## Optimize the access efficiency ... ## Clean up the fragments of GDS file: ## open the file 'RAD_data/OL_subset.gds' (2.8M) ## # of fragments: 214 ## save to 'RAD_data/OL_subset.gds.tmp' ## rename 'RAD_data/OL_subset.gds.tmp' (2.8M, reduced: 1.8K) ## # of fragments: 64 ## Mon Nov 8 14:56:20 2021 gdsin = SeqArray::seqOpen(filename.gds) print(paste0("The number of SAMPLES in data: ", length(c(SeqArray::seqGetData(gdsin, "sample.id"))))) ## [1] "The number of SAMPLES in data: 18" print(paste0("The number of SNPs in data: ", length(c(SeqArray::seqGetData(gdsin, "variant.id"))))) ## [1] "The number of SNPs in data: 69989" It is always helpful to have a metadata file with information for each sample, such as sampling site, sequencing library, etc. In our example, our metadata file (OL.popmap) is tab-delimited and has the column headers: ID: sample ID STRATA: sampling location/population PLATE: sequencing batch Next we read in our metadata file, and make sure the samples are in the same order as your VCF file: metafile = "RAD_data/OL.popmap" sample.ids = seqGetData(gdsin, "sample.id") sample.strata = read.table(metafile, header = T, sep = "\t") %>% dplyr::select(ID, STRATA, PLATE) Now, on to evaluating our data! ### “Bad” samples Sometimes a sample doesn’t sequence well (few sequencing reads, higher error rate). This can be due to DNA quality, an issue during library prep, or not enough sequencing depth (average # of reads per sample). Generally, it will lead to a sample with fewer sequencing reads, higher missing data in a SNP dataset, and fewer shared loci with other samples. Identifying and then removing these samples prior to the final RADseq assembly analysis can help minimize mis-assembled loci, genotyping errors, and excessive filtering of acceptable loci. The process of identifying low quality individuals is usually iterative, as the way you initially filter your SNPs will influence the amount of missing data and locus sharing among samples. This is why we recommend minimally filtering your SNPs for sample coverage (the # of individuals a locus is called in) when initially exploring your data. Some ways to identify bad samples: 1. For every SNP dataset you generate, it is a good idea is always evaluate the missingness per sample (and report this distribution in your manuscript!). Identify samples with way more missingness than the rest, and observe how they look in a PCA and locus sharing plot. If they stick out or all cluster together in the middle, then try removing them from the assembly and seeing if it changes downstream analyses. If so, you may want to specify a missingness cutoff for including samples in the final analysis. Missingness in R with SeqArray #using previously loaded gdsin object print("Per variant: ") ## [1] "Per variant: " summary(m1 <- SeqArray::seqMissing(gdsin, per.variant=TRUE)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0000 0.2778 0.3889 0.3855 0.4444 0.9444 print("Per sample: ") ## [1] "Per sample: " summary(m2 <- SeqArray::seqMissing(gdsin, per.variant=FALSE)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.1218 0.1574 0.2287 0.3855 0.6213 0.9177 samples <- SeqArray::seqGetData(gdsin, "sample.id") cbind(samples,m2)[order(-m2),] ## samples m2 ## [1,] "OR3_13_C3" "0.917672777150695" ## [2,] "CA4_12_C1" "0.886539313320665" ## [3,] "OR3_3_C7" "0.873308662789867" ## [4,] "CA4_13_C1" "0.692623126491306" ## [5,] "BC2_6_C2" "0.634999785680607" ## [6,] "BC2_9_C4" "0.580248324736744" ## [7,] "CA4_8_C9" "0.365486004943634" ## [8,] "CA4_1_C4" "0.292274500278615" ## [9,] "OR3_1_C3" "0.245538584634728" ## [10,] "CA4_2_C3" "0.211790424209519" ## [11,] "CA4_14_C8" "0.201074454557145" ## [12,] "OR3_20_C4" "0.180756976096244" ## [13,] "BC2_10_C5" "0.165683178785238" ## [14,] "OR3_9_C8" "0.154638586063524" ## [15,] "OR3_5b_C6" "0.15163811456086" ## [16,] "BC2_17_C7" "0.132277929388904" ## [17,] "BC2_12_C6" "0.130006143822601" ## [18,] "BC2_16_C6" "0.121776279129578" #plot histogram hist(m2,breaks=50) Another common method of filtering and evaluating you data without ever using R is VCFtools. Missingness with vcftools, on the command line vcftools --vcf RAD_data/OL_subset.vcf --missing-indv --out RAD_data/OL_subset # sort the file by most missing data and print the top 10 samples cat RAD_data/OL_subset.imiss | (read h; echo "$h"; sort -k5 -r) 
##
## VCFtools - 0.1.15
## (C) Adam Auton and Anthony Marcketta 2009
##
## Parameters as interpreted:
##  --missing-indv
##
## After filtering, kept 18 out of 18 Individuals
## Outputting Individual Missingness
## After filtering, kept 69989 out of a possible 69989 Sites
## Run Time = 0.00 seconds
## INDV N_DATA  N_GENOTYPES_FILTERED    N_MISS  F_MISS
## OR3_13_C3    69989   0   64227   0.917673
## CA4_12_C1    69989   0   62048   0.886539
## OR3_3_C7 69989   0   61122   0.873309
## CA4_13_C1    69989   0   48476   0.692623
## BC2_6_C2 69989   0   44443   0.635
## BC2_9_C4 69989   0   40611   0.580248
## CA4_8_C9 69989   0   25580   0.365486
## CA4_1_C4 69989   0   20456   0.292275
## OR3_1_C3 69989   0   17185   0.245539
## CA4_2_C3 69989   0   14823   0.21179
## CA4_14_C8    69989   0   14073   0.201074
## OR3_20_C4    69989   0   12651   0.180757
## BC2_10_C5    69989   0   11596   0.165683
## OR3_9_C8 69989   0   10823   0.154639
## OR3_5b_C6    69989   0   10613   0.151638
## BC2_17_C7    69989   0   9258    0.132278
## BC2_12_C6    69989   0   9099    0.130006
## BC2_16_C6    69989   0   8523    0.121776

You can open the file analysis/OL_subset.imiss in any text editor an look at the missingness values by eye.

To plot the missingness on the command line, you can use gnuplot:

#code from Jon Puritz
mawk '!/IN/' RAD_data/OL_subset.imiss | cut -f5 > totalmissing
gnuplot << \EOF
set terminal dumb size 120, 30
set autoscale
unset label
set title "Histogram of % missing data per individual"
set ylabel "Number of Occurrences"
set xlabel "% of missing data"
#set yr [0:100000]
binwidth=0.01
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes pause -1 EOF ## ## Histogram of % missing data per individual ## ## 2 +----------------------------------------------------------------------------------------------------------+ ## | ** * + + + + + + + + | ## | ** * 'totalmissing' using (bin($1,binwidth)):(1.0) ******* |
##          |   ** *                                                                                                   |
##          |   ** *                                                                                                   |
##          |   ** *                                                                                                   |
##      1.5 |-+ ** *                                                                                                 +-|
##          |   ** *                                                                                                   |
##          |   ** *                                                                                                   |
##          |   ** *                                                                                                   |
##          |   ** *                                                                                                   |
##        1 |-**** *******************************************************************************************       +-|
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |
##      0.5 |-* ** * * * * *    *      *                 *               *     *             *           * * *       +-|
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |
##          | * ** * * * * *    *      *                 *               *     *             *           * * *         |
##          | * ** * * *+* *    *   +  *        +        *  +          + *     *   +         * +         * * *         |
##        0 +----------------------------------------------------------------------------------------------------------+
##         0.1         0.2         0.3         0.4         0.5        0.6         0.7         0.8         0.9          1
##                                                       % of missing data
## 

It looks like samples X,Y,Z have a lot of missing data relative to other samples. We can remove them from our dataset for now, and just keep exploring our data, but if you decide to exclude them permanently from the analysis you should eventually rerun your genotyping pipeline without those samples, esp. if doing a de novo assembly.

1. Another great way to explore missingness with your data is to construct a heatmap of loci that are genoptyped between pairs of samples. Generally, samples that are more closely related will share more loci with each other due to cutsite dropout (cite). Deviations from that pattern can pinpoint bad samples as well as batch effects.

### The power of PCA

One of the most powerful methods for exploring your data is a Principle Components Analysis. From this genetics tutorial: “To understand how PCA works, consider a single individual and its representation by its 593,124 markers. Formally, each individual is a point in a 593,124-dimensional space, where each dimension can take only the three possible genotypes indicated above, or have missing data. To visualize this high-dimensional dataset, we would like to project it down to two dimensions. But as there are many ways to project the shadow of a three-dimensional object on a two dimensional plane, there are many (and even more) ways to project a 593,124-dimensional cloud of points to two dimensions. What PCA does is figuring out the “best” way to do this project in order to visualise the major components of variance in the data." See here for a linear algebra-based explanation of PCA).

When you have a VCF file with SNPs, use PCA before extensive filtering or playing with parameters to look at the data. Check which SNPs are associated with axes showing the most variation.

Here we use SeqArray and SNPRelate to run a PCA in R.

Reminder: Missing data is a feature of RAD. Be aware of how different analysis tools deal with missingness, especially PCA which will fill in all missing data with some values. Here, the PCA from SNPRelate imputes missing genotypes as the dosage mean.

##### Doing a PCA with SeqArray and SNPRelate in R.
#open the gds file previously created, if not already open
#gdsin = SeqArray::seqOpen(filename.gds)
# exclude samples previously identified as having too much missing data
sample.ids = seqGetData(gdsin, "sample.id")
keep = sample.ids[which(!sample.ids %in% bad_samples)]

Whether your data are denovo or reference-based, it is important to filter out linked sites before performing a PCA. Here’s how to do it in SeqArray. Note, if you have already filtered your VCF file to have only 1 SNP per RAD locus, you don’t need to do this:

snpset <- SNPRelate::snpgdsLDpruning(gdsin, ld.threshold=0.2, autosome.only = F, start.pos="random", num.thread=1, remove.monosnp = T, sample.id = keep)
snpset.id <- unlist(unname(snpset))

Now we will actually run the PCA, again removing the samples with missing data, keeping unlinked SNPs, and removing SNPs with less than 5% minor allele frequency.

# PCA only on SNPs with a minor allele freq greater than 5%
pca.out = SNPRelate::snpgdsPCA(autosome.only = F, gdsin, num.thread=2, remove.monosnp = T, maf = 0.05,
snp.id=snpset.id,
sample.id = keep) # filtering for pruned SNPs
## Principal Component Analysis (PCA) on genotypes:
## Calculating allele counts/frequencies ...
## Excluding 5,386 SNVs (monomorphic: TRUE, MAF: 0.05, missing rate: NaN)
## Working space: 12 samples, 22,645 SNVs
##     using 2 (CPU) cores
## CPU capabilities: Double-Precision SSE2
## Mon Nov  8 14:57:25 2021    (internal increment: 29696)
##
[..................................................]  0%, ETC: ---
[==================================================] 100%, completed, 0s
## Mon Nov  8 14:57:25 2021    Begin (eigenvalues and eigenvectors)
## Mon Nov  8 14:57:25 2021    Done.
#close the gds file (saves memory)
#seqClose(gdsin)

eig = pca.out$eigenval[!is.na(pca.out$eigenval)]
barplot(100*eig/sum(eig), main="PCA Eigenvalues")

First color/shape the individuals by the factor you expect to matter (eg, sampling site or region, family, ecotype).

#PLOT PCA
#PC1 v PC2 colored by collection location
id.order = sapply(keep, function(x,df){which(df$ID == x)}, df=sample.strata) #incase your strata file is not in the same order as your vcf sample.strata.order = sample.strata[id.order,] print( as.data.frame(pca.out$eigenvect) %>%
tibble::add_column(., STRATA =  sample.strata.order$STRATA) %>% ggplot(., aes(x=V1, y=V2, color = STRATA)) + geom_point(size=2) + stat_ellipse(level = 0.95, size = 1) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + theme_bw() + xlab(paste0("PC1 [",paste0(round(eig[1], 2)), "%]")) + ylab(paste0("PC2 [",paste0(round(eig[2], 2)), "%]")) + ggtitle("PCA Colored by Collection Location") ) #### Questions to ask about your PCA: • Is there any clustering? If so, is it different than your expectation? • If different than expectation, suggests either batch effects, cryptic variation among individuals, or systematic biological issues affecting assembly/mapping • Are there outlier samples driving a large amount of variation? • If so, they may be cryptic species, clones/sample replicates, or they sequenced/genotyped poorly • In our example we see 2 potential outlier samples in the bottom left corner. We can check which samples they are by plotting the PCA with sample ID and/or looking at the actual PC scores. #PLOT PCA #PC1 v PC2 with Sample Labels print( as.data.frame(pca.out$eigenvect) %>%
tibble::add_column(., ID =  sample.strata.order$ID) %>% ggplot(., aes(x=V1, y=V2, label = ID)) + geom_text(size =3) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + theme_bw() + xlab(paste0("PC1 [",paste0(round(eig[1], 2)), "%]")) + ylab(paste0("PC2 [",paste0(round(eig[2], 2)), "%]")) + ggtitle("PCA with Labels") ) # which samples have PC1 > -0.7 and PC2 < -0.3? as.data.frame(pca.out$eigenvect) %>%
tibble::add_column(., ID =  sample.strata.order$ID) %>% filter(V1 > 0.7 & V2 < -0.3) %>% select(ID,V1,V2) • Is there no clustering at all? If so, it may be due to: • bioinformatic artifacts leading to noise, see Testing Parameters • issues with how you are filtering the data (eg, too strict or too lax) • issues with how your PCA is treating missing data • or you actually have no structure in your dataset! • If you used sample replicates, first see if replicates cluster very close together. If not, then there may be batch effects or bioinformatic artifacts leading to genotyping error that must be addressed. Then remove replicates for a subsequent PCA evaluation. When removing a sample replicate, you can choose one at random or pick the replicate with the least missing data. # example code for removing samples from a SNPRelate PCA # assumes you have already loaded a vcf file as a GDS samples.to.remove = c("SampleA_rep","SampleB_rep") sample.ids = seqGetData(gdsin, "sample.id") keep = sample.ids[which(!sample.ids %in% bad_samples)] # PCA only on SNPs with a minor allele freq greater than 2.5%, only keeping samples in keep pca.out.noreps= SNPRelate::snpgdsPCA(autosome.only = F, gdsin, num.thread=1, remove.monosnp = T, maf = 0.05, snp.id=snpset.id, sample.id = keep) Next, color the individuals by potential sources of batch effects (sequencing lane, library prep batch, age of tissue sample, person doing the DNA extraction…). # again, but colored by batch print( as.data.frame(pca.out$eigenvect) %>%
tibble::add_column(., PLATE =  as.factor(sample.strata.order$PLATE)) %>% ggplot(., aes(x=V1, y=V2, color = PLATE)) + #label = ID geom_point(size=2) + stat_ellipse(level = 0.95, size = 1) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + theme_bw() + xlab(paste0("PC1 [",paste0(round(eig[1], 2)), "%]")) + ylab(paste0("PC2 [",paste0(round(eig[2], 2)), "%]")) + ggtitle("PCA colored by Batch/Sequencing Plate") ) ## Too few points to calculate an ellipse ## Too few points to calculate an ellipse ## Too few points to calculate an ellipse ## Too few points to calculate an ellipse ## Too few points to calculate an ellipse ## Too few points to calculate an ellipse ## Too few points to calculate an ellipse ## Warning: Removed 7 row(s) containing missing values (geom_path). ### Batch effects As discussed in Considerations During Lab Work, batch effects can arise by minor (or not so minor) differences during the library prep and sequencing stage. Randomizing samples across libraries and sequencing lanes can help mitigate the influence of batch effects on your downstream analyses, but sometimes this isn’t possible, esp. when using pre-existing data. Even if you randomize your samples, it is a good idea to check for batch effects with a PCA and then try to mitigate them. Nicolas Lou and Nina Therkildsen have a great preprint and Github repo discussing causes, detection, and mitigation of batch effects in low coverage WGS, where batch effects can have particularly large effects. While some of their recommendations are specific to reference-based analyses, others are relevant to RAD methods too (eg, batch effects due to different sequencing platforms, DNA degradation levels, and sequencing depth). You can identify which loci are driving the batch effect by 1) identifying the top SNPs contributing to PC variation (termed “loadings”) and/or 2) do an Fst outlier analysis with the batches specified as populations (eg, Bayescan). Both of these methods are most appropriate if you’ve randomized your samples between batches. If you have not randomized samples but are certain there is a batch effect driving variation in your dataset, you can still remove loci this way but you may also be removing biologically informative loci. In SeqArray/SNPRelate, if you have batch effects that are identifiable by PCA, you would extract the PC loadings for each SNP at the PC which represents the batch effect, plot a histogram to see the distribution, choose a cutoff, then filter those SNPs out and create a new VCF for downstream analyses. For this example, let’s pretend that we see a batch effect on PC1. # assume you have a GDS object called gdsin and a PCA object called pca.out # extract PCA loadings SnpLoad <- snpgdsPCASNPLoading(pca.out, gdsin) ## SNP loading: ## Working space: 12 samples, 22645 SNPs ## using 1 (CPU) core ## using the top 12 eigenvectors ## Mon Nov 8 14:57:26 2021 (internal increment: 65536) ## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s ## Mon Nov 8 14:57:26 2021 Done. names(SnpLoad) ## [1] "sample.id" "snp.id" "eigenval" "snploading" "TraceXTX" "Bayesian" ## [7] "avgfreq" "scale" dim(SnpLoad$snploading)
## [1]    12 22645
#plot a histogram of absolute PC loading values for PC1, with larger loading meaning more effect on PC1
pc<-1
hist(sort(abs(SnpLoad$snploading[pc,]),decreasing=T,index.return=T)[[1]],breaks = 30,main="PC loadings: PC1",) Let’s remove the largest effect loci over abs(0.010) and see if it changes our PCA. #extract SNP.IDs for loadings on PC 1 > abs(0.010) batch.snp.ids <- SnpLoad$snp.id[which(abs(SnpLoad$snploading[1,]) >= 0.010)] #rerun PCA snp.id.allfilt <- setdiff(snpset.id,batch.snp.ids) pca.out.pcfilt = SNPRelate::snpgdsPCA(autosome.only = F, gdsin, num.thread=2, remove.monosnp = T, maf = 0.05, snp.id=snp.id.allfilt, sample.id = keep) # filtering for pruned SNPs ## Principal Component Analysis (PCA) on genotypes: ## Calculating allele counts/frequencies ... ## Excluding 5,386 SNVs (monomorphic: TRUE, MAF: 0.05, missing rate: NaN) ## Working space: 12 samples, 19,854 SNVs ## using 2 (CPU) cores ## CPU capabilities: Double-Precision SSE2 ## Mon Nov 8 14:57:27 2021 (internal increment: 29696) ## [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s ## Mon Nov 8 14:57:27 2021 Begin (eigenvalues and eigenvectors) ## Mon Nov 8 14:57:27 2021 Done. print( as.data.frame(pca.out.pcfilt$eigenvect) %>%
ggplot(., aes(x=V1, y=V2, color = STRATA)) +
geom_point(size=2) +
stat_ellipse(level = 0.95, size = 1) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
theme_bw() +
xlab(paste0("PC1 [",paste0(round(eig[1], 2)), "%]")) +
ylab(paste0("PC2 [",paste0(round(eig[2], 2)), "%]")) +
ggtitle("PCA Colored by Collection Location")
)

Our PCA is now clustering better by population, so we will now export a new VCF that includes all the filtering we’ve done so far: 1) removed individuals with missing data, 2) thinned our SNPs for linkage, and 3) removed SNPs driving a batch effect.

To save your filtered GDS object as a VCF in SeqArray:

# set a filter to exclude samples with missing data, and SNPs in linkage or with a batch effect
seqSetFilter(gdsin, sample.id=keep,variant.id=snp.id.allfilt)
## # of selected samples: 12
## # of selected variants: 25,240
# convert to vcf
seqGDS2VCF(gdsin, "RAD_data/OL-miss-pc1-linkage-filt.vcf.gz")
## Loading required namespace: Rsamtools
## Hint: install Rsamtools to enable the bgzf output.
## Mon Nov  8 14:57:28 2021
##     12 samples, 25,240 variants
##     INFO Field: NS, DP
##     FORMAT Field: DP, CATG
##     output to a general gzip file
##
[..................................................]  0%, ETC: ---
[==================================================] 100%, completed, 1s
## Mon Nov  8 14:57:29 2021    Done.
#close the gds file to save memory
closefn.gds(gdsin)

### Cryptic species/contamination/clones

One issue that is particularly common, especially for marine invertebrates, are cryptic, yet highly diverged lineages occurring in the same geographic location. If you are trying to do a landscape/seascape genetics study, cryptic species can dramatically throw off population genetic analyses. A PCA using all the samples can help identify cryptic species if samples are forming clusters that do not align with any other possible factor (eg, geography, sequencing batch).

You can also check heterozygosity. Large differences in heterozygosity between outliers can suggest different species or cryptic hybrids.

You can then estimate genetic divergence between these clusters using pairwise Fst and/or phylogenetic inference to determine how diverged the groups are. If they are very diverged, it may be better to assemble each cryptic species separately, especially for denovo assemblies with dDocent and Stacks, especially since often times restriction cut-sites are not always conserved; see Rubin, Ree, and Moreau (2012). These two pipelines used a set of given samples to create a “reference” or “catalog” to which all other samples are mapped. If highly diverged samples are included in creating this reference, it can lead to oversplitting.

Cryptic clones can also dramatically throw off population genetic analyses. In a PCA, clones will often appear as their own very tight cluster, similar to sample replicates. A calculation of sample pairwise distance (e.g., Manhattan distance) can very duplicates/clones.

## Test a range of key parameters

There are many parameter choices to make when assembling and genotyping, and those parameters vary slightly across pipelines. The parameters below are general across RADseq pipelines (although with different names), and are shown to have the largest impact on genoyping error rates. ### Clustering threshold {#cluster} #### (de novo assembly only) During de novo RADseq assembly, reads are assembled into contigs (contiguous sequences), so that each contig represents a single locus. This assembly can occur separately for each individual (ipyrad) or jointly for multiple individuals (dDocent, Stacks). To call SNPs, in the case of ipyrad, contigs are then clustered and aligned across individuals, while in dDocent and Stacks, reads for each individual are mapped to the “reference” contigs. All of these pipelines use sequence similarity (or clustering threshold) to determine if fragments are homologous (from the same place in the genome) and therefore should be clustered together.

If the clustering threshold is too low, undersplitting can occur so that multiple loci are combined into a single contig. Undersplitting is common with repetitive regions and paralogs. It can inflate mean observed heterozygosity. If the threshold is too high, it can lead to oversplitting, where alleles of the same locus are split into two or more contigs. This can lead to reduced mean observed heterozygosity.

If severely off, both of these issues can bias the models used to identify SNPs. Undersplitting a little is better than oversplitting, as it is easier to identify lumped paralogs downstream and remove them.

#### What influences the “optimal” clustering threshold of your data?

Clustering threshold is usually determined by how much genetic diversity you have in your dataset. If your dataset has high genetic diversity (eg., a single species with high heterozygosity, multiple highly diverged populations, multiple species), then you will need a lower clustering threshold. If your dataset has low genetic diversity (a single species with lower heterozygosity, a single family), then you will need a higher clustering threshold. Polyploid or highly repetitive (i.e., large) genomes may also have different requirements than diploid species with smaller genomes. Other factors, such as sequencing error rate, may play a role as well. We recommend always testing a few clustering thresholds with a new dataset and/or new species. However, as long as your clustering threshold is in the ballpark it shouldn’t affect downstream analyses (according to Isaac Overcast).

There are a few strategies to identify a good clustering threshold for your data, all of which require assembling your data with a range of clustering thresholds.
1) Metrics developed by (McCartney-Melstad, Gidiş, and Shaffer 2019), based on popgen theory and landscape genetic expectations
2) Utilize sample replicates (either exact replicates or individuals from the same locality) to quantify genotyping error and minimize error while maximizing the number of informative loci (Mastretta-Yanes et al. 2015) 3) Evaluating the number of haplotypes across loci, with many 3+ haplotypes indicating undersplitting (Harvey et al. 2015; Ilut, Nydam, and Hare 2014) 4) Make assemblies at a range of thresholds and parameters, plot the # of SNPs or contigs per assembly, and make a judgement call example in dDocent tutorial. If most parameters give you about the same # of SNPs or contigs, you can feel confident that this parameter doesn’t matter much for your data and just choose a reasonable value.

### Mapping parameters

#### (reference-based and some de novo assemblers)

When a sequencing read maps to the wrong assembled contig or place in a reference genome (ie, a non-homologous locus), this can result in erroneous SNPs. In many cases, these SNPs can be filtered out later (link to filtering). However, data should be checked to make sure there aren’t systematic, wide-scale mapping issues that need to be addressed.

For referenced-based assembly, both dDocent and ipyrad make use of bwa(link) to map reads, while Stacks allows the user to choose their aligner. dDocent also utilizes bwa for de novo assemblies.

Mapping defaults for popular RADseq pipelines: L: clipping penalty for 5’ and 3’ A: matching score O gap open penalty T: mapping quality filter

dDocent: bwa mem -L 20,5 -t 8 -a -M -T 10 -A1 -B 3 -O 5 -R

• can change -A and -O during the dDocent pipeline.
ipyrad: (defaults) -L 5,5 -B 4 -A1 -O 6 -T 30

• requires changing the ipyrad source code to change parameters

Our whole genome resequencing section has some good content on mapping(link). General considerations with RAD data:

• how many reads aligned for each sample? If many reads didn’t aligned, this could be an issue with the reference, read trimming/filtering, or the mapping parameters.
• What is the distribution of mapping scores?
• ratio of mean (across individuals) mapping quality scores for reference and alternate allele (O’Leary et al. 2018). 0.25-1.75 cut off implemented in dDocent.
• this info might not be coded in all VCFs

Mistakes that occur during sequencing can lead to a base-calling error in a read, which can then lead to an erroneous SNP. This type of error most commonly manifests as rare alleles (eg, only observed in one individual) or singletons (only one copy of the minor allele across all individuals). A conservative approach to deal with these errors is to filter out all loci with rare alleles (link filtering), but for some demographic analyses this filtering is inappropriate as they require the site frequency spectrum.
The best ways to minimize the impact of sequencing error during bioinformatic processing is to 1) properly trim and filter your data, and 2) require higher read depth thresholds for assembly and genotyping.

Always make sure to check with your sequencing center to see if they have done anything to the raw reads before you received them!

dDocent:

• does not allow you to use reads that have already been trimmed and filtered for de novo assembly, only referenced-based.
• Uses Trimmomatic to trim Illumina TruSeq adaters and low quality bases. Make sure to change the defaults if you used a different adapter/sequencing system.

Stacks:

• requires your reads to be all the same length, so if they were previously trimmed of low quality bases or adapter sequence this must be specified and they will be truncated to the same length.
##### 2) Require higher read depth thresholds for assembly and genotyping.

For RADseq, ideally you are generating an average of at least 10-20 sequence reads (10x-20x coverage) for each RAD locus in your library preparation (but this doesn’t always happen…). For diploid species, this would be sequencing each RAD haplotype at least 5x-10x. The SNP calling step of RAD pipelines all require setting a minimum number of reads to determine whether a site is homozygous or heterozygous. If this threshold is too low, sequencing errors are more likely to be included and the SNP caller will have a harder time accurately identifying SNPs (in which case, you’d be better off with genotype likelihood methods. If it’s too high, then real polymorphic loci will be excluded. All of the three major pipelines allow users to implement read depth thresholds

Generating multiple sequencing reads per locus is helpful when calling SNPs in order to distinguishing between reads with sequencing errors and.

## Filtering SNPs

### (and embrace the missingness!)

This section is still under development! Please contribute on Github or provide feedback in Discussions Even with a reduced representation sequencing approach, your assembly and genotyping pipeline will generate data for thousands of SNPs across your individuals. These SNPs will always need to be filtered before doing downstream analyses.
#### Main reasons to filter your data:

• Remove potentially erroneous SNPs that remain after optimizing your assembly pipeline
• Remove SNPs that are uninformative for downstream analyses
• Subsample to remove SNPs that are in linkage disequilibrium (required for analyses that assume independence like PCA, Structure)

• Don’t over filter for missingness

• papers showing how too much filtering can skew results

• Empirical: Diaz-Arce 2019, Tripp 2017, Silliman 2021, Eaton 2017, Martin-Hernanz (2019)

• Simulations: Huang & Knowles 2016, Leaché et al., 2015, Gautier et al., 2012, Rivera-Colón et al., 2021

• some ways to account for missing data in downstream analyses

• impute (PCA), subsample (PCA, structure, pretty much anything with SNPs), try a few different missing filters and see what impact it has on results

# References

Andrews, Kimberly R, Jeffrey M Good, Michael R Miller, Gordon Luikart, and Paul A Hohenlohe. 2016. “Harnessing the Power of RADseq for Ecological and Evolutionary Genomics.” Nat. Rev. Genet. 17 (2): 81–92. https://doi.org/10.1038/nrg.2015.28.
Elshire, Robert J, Jeffrey C Glaubitz, Qi Sun, Jesse A Poland, Ken Kawamoto, Edward S Buckler, and Sharon E Mitchell. 2011. “A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species.” PLoS One 6 (5): e19379. https://doi.org/10.1371/journal.pone.0019379.
Harvey, Michael G, Caroline Duffie Judy, Glenn F Seeholzer, James M Maley, Gary R Graves, and Robb T Brumfield. 2015. “Similarity Thresholds Used in DNA Sequence Assembly from Short Reads Can Reduce the Comparability of Population Histories Across Species.” PeerJ 3 (April): e895. https://doi.org/10.7717/peerj.895.
Ilut, Daniel C, Marie L Nydam, and Matthew P Hare. 2014. “Defining Loci in Restriction-Based Reduced Representation Genomic Data from Nonmodel Species: Sources of Bias and Diagnostics for Optimal Clustering.” Biomed Res. Int. 2014 (June): 675158. https://doi.org/10.1155/2014/675158.
LaCava, Melanie E F, Ellen O Aikens, Libby C Megna, Gregg Randolph, Charley Hubbard, and C Alex Buerkle. 2020. “Accuracy of de Novo Assembly of DNA Sequences from Double-Digest Libraries Varies Substantially Among Software.” Mol. Ecol. Resour. 20 (2): 360–70. https://doi.org/10.1111/1755-0998.13108.
Mastretta-Yanes, A, N Arrigo, N Alvarez, T H Jorgensen, D Piñero, and B C Emerson. 2015. “Restriction Site-Associated DNA Sequencing, Genotyping Error Estimation and de Novo Assembly Optimization for Population Genetic Inference.” Mol. Ecol. Resour. 15 (1): 28–41. https://doi.org/10.1111/1755-0998.12291.
McCartney-Melstad, Evan, Müge Gidiş, and H Bradley Shaffer. 2019. “An Empirical Pipeline for Choosing the Optimal Clustering Threshold in RADseq Studies.” Mol. Ecol. Resour. 19 (5): 1195–1204. https://doi.org/10.1111/1755-0998.13029.
O’Leary, Shannon J, Jonathan B Puritz, Stuart C Willis, Christopher M Hollenbeck, and David S Portnoy. 2018. “These Aren’t the Loci You’e Looking for: Principles of Effective SNP Filtering for Molecular Ecologists.” Mol. Ecol., July. https://doi.org/10.1111/mec.14792.
Rubin, Benjamin E R, Richard H Ree, and Corrie S Moreau. 2012. “Inferring Phylogenies from RAD Sequence Data.” PLoS One 7 (4): e33394. https://doi.org/10.1371/journal.pone.0033394.
Wagner, Catherine E, Irene Keller, Samuel Wittwer, Oliver M Selz, Salome Mwaiko, Lucie Greuter, Arjun Sivasundar, and Ole Seehausen. 2013. “Genome-Wide RAD Sequence Data Provide Unprecedented Resolution of Species Boundaries and Relationships in the Lake Victoria Cichlid Adaptive Radiation.” Mol. Ecol. 22 (3): 787–98. https://doi.org/10.1111/mec.12023.
Zheng, Xiuwen, Stephanie M Gogarten, Michael Lawrence, Adrienne Stilp, Matthew P Conomos, Bruce S Weir, Cathy Laurie, and David Levine. 2017. SeqArray-a Storage-Efficient High-Performance Data Format for WGS Variant Calls.” Bioinformatics 33 (15): 2251–57. https://doi.org/10.1093/bioinformatics/btx145.