Initial publication year: 2022

fastq2bam

Trimming adapters

Before aligning your sequences to a reference genome, you will need to trim off the adapters. We use FastP in our pipeline as below, but there are other programs that could be suitable (e.g., Trimmomatic, TrimGalore/Cutadapt, BBDuk).

fastp --in1 {r1} --in2 {r2} --out1 {r1_trimmed} --out2 {r2_trimmed} --detect_adapter_for_pe 2> {summary.out}

Where
--in1/--in2 and --out1/--out2 are the flags for the input and trimmed output reads
--detect_adapter_for_pe enables auto-detection of adapters for paired-end data (auto-detection is only enabled for single-end data by default).

fastp (and other trimming software) can also be used for quality trimming before aligning your sequences to the reference genome, but there are arguments for and against quality-trimming your reads at this stage.
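
If you do decide to quality-trim at this point, fastp's sliding-window options are one approach. A minimal sketch (the thresholds and the trimmed-output file names are illustrative, not pipeline defaults):

# --cut_right moves a sliding window along each read and trims from the first
#   low-quality window to the end of the read
# --qualified_quality_phred sets the per-base quality threshold
# --length_required drops reads shorter than this after trimming
fastp --in1 {r1} --in2 {r2} --out1 {r1_trimmed} --out2 {r2_trimmed} \
    --detect_adapter_for_pe --cut_right --qualified_quality_phred 15 --length_required 36 \
    2> {summary.out}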

Mapping

Our pipeline uses BWA, which is based on the Burrows-Wheeler transform (Li and Durbin 2010), to align reads to the reference genome, but there are many programs that can be used for aligning short reads to a reference (e.g., Bowtie2, STAR). To make the alignment process more efficient, you need to index the reference genome.

bwa index {reference.fa}

Where reference.fa is your reference genome. Once your reference has been indexed, you can do a local alignment of your FASTQs and pipe the output directly into samtools to sort the alignment by coordinate and convert it to BAM (Binary Alignment/Map) format.

bwa mem -M -R {string} {ref} {r1} {r2} | samtools sort -u - > {output.bam}

Where
-M marks shorter split hits as secondary (required for Picard compatibility downstream)
-R defines a read group header e.g., @RG\tID:{sample}\tSM:{sample}\tPL:ILLUMINA
{ref} is the indexed reference genome
{r1} and {r2} are the first and second reads you are aligning
-u tells samtools sort to write uncompressed output
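
The deduplication step below assumes a sorted and indexed alignment; if your sorted BAM has not been indexed yet, that is a one-liner:

samtools index {output.bam}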

Once you have a sorted and indexed alignment, you can locate and tag duplicate records, i.e., reads that originate from a single fragment of DNA. Duplicates can arise during sample preparation (e.g., during PCR) or as sequencing artifacts (e.g., optical duplicates). If you have multiple runs for the same sample, you will want to merge and index the BAMs before the deduplication tagging, which you can do with samtools merge (integrated in snpArcher under the merge_bams rule).

sambamba markdup {input} {output}
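
If you do have multiple runs per sample, the merge mentioned above happens before the markdup step; a minimal samtools sketch (file names here are hypothetical):

samtools merge {merged.bam} {run1.bam} {run2.bam}
samtools index {merged.bam}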

Quality control and filtering

It’s important to take a look at your alignments to check for quality, potentially filter out sites or samples that don’t meet adequate thresholds, etc. The pipeline outputs various statistics automatically along the way, but we have collected some of the QC commands here.

We use samtools to compute the depth at each position or region. The tabulated output reports, for each region, the chromosome, start position, end position, number of reads aligned in that region after filtering, number of covered bases (depth greater than or equal to 1) in that region, the proportion of covered bases, the mean depth of coverage, the mean base quality, and the mean mapping quality.

samtools coverage --output {output} {input.bam}

Where
--output is the file name you want to write the output to
input.bam is the input BAM file

And to produce a summary of the alignment metrics:

samtools flagstat -O tsv {input.bam} > {output}

Where
-O sets the output format (in this case, as a TSV)
input.bam is the input BAM file
output is the file name you want to write the output to
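
With many samples, it can be convenient to aggregate these per-sample reports into a single overview. One option (not a required part of the pipeline; the directory names here are hypothetical) is MultiQC:

multiqc {qc_dir} --outdir {multiqc_dir}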

bam2vcf

Variant Calling

Two of the major variant callers you can use (at the time of writing) are Genome Analysis Toolkit (GATK) and freebayes. You can call variants with either of these software in our pipeline and the downstream steps are the same, so here we present parameters for variant calling with both GATK and freebayes. Regardless of the software used, variant calling is computationally intensive but can be parallelized for more efficient resource use by splitting the reference genome into intervals. The interval creation is automated in our pipeline using Picard but parameter guidelines will differ depending on your reference genome.
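
Before running GATK, the reference also needs a FASTA index and a sequence dictionary, and the intervals used below are derived from the reference. A rough sketch of what that preparation can look like outside the pipeline (the interval-creation tool and parameters shown here are an assumption; snpArcher configures this step automatically):

samtools faidx {reference.fa}
picard CreateSequenceDictionary R={reference.fa} O={reference.dict}
# one way to create intervals is to split the reference at runs of Ns
picard ScatterIntervalsByNs R={reference.fa} OT=ACGT O={intervals.interval_list}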

To call germline single nucleotide polymorphisms (SNPs) and insertion/deletions (indels) via local re-assembly of haplotypes using GATK:

gatk HaplotypeCaller -R {reference.fa} -I {input.bam} -O {output.gvcf} -L {interval.list} --emit-ref-confidence GVCF --min-pruning {params} --min-dangling-branch-length {params}

Where
-R is the reference genome
-I is the input BAM
-O is the output gVCF
-L is the interval file list
--emit-ref-confidence yields the reference confidence scores as gVCF
--min-pruning sets the minimum support to not prune paths in the graph (low coverage option: 1 / high coverage option: 2)
--min-dangling-branch-length sets the minimum length of a dangling branch to attempt recovery (low coverage option: 1 / high coverage option: 4)
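
For the freebayes route, a per-interval call might look roughly like the following (a simplified sketch; the ploidy, population-prior, and other parameters actually used are set in the pipeline configuration):

freebayes -f {reference.fa} -r {region} {input.bams} > {output.vcf}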

There are some constraints when using clusters and job scheduling managers, including limits on command length. When you have many samples and are trying to run GATK on all of them, the command may get too long, causing SLURM (or another job scheduler) to throw an error. To get around this, you can create DB map files (which is done automatically in our pipeline). Then we can import many gVCFs into a GenomicsDB for a given scaffold:

export TILEDB_DISABLE_FILE_LOCKING=1
gatk GenomicsDBImport --genomicsdb-shared-posixfs-optimizations true --batch-size 25 --genomicsdb-workspace-path {output.db} -L {input.l} --sample-name-map {input.db}

Where
the export command disables TileDB file locking, which can remedy sluggish performance on some shared filesystems
--genomicsdb-shared-posixfs-optimizations enables optimizations when using shared POSIX filesystems (like Lustre)
--batch-size is used in conjunction with the above argument to control the memory consumption
--genomicsdb-workspace-path is the working directory to point to for GenomicsDB
-L is the interval file list
--sample-name-map is the path to a file that contains a sample to file map in a tab delimited format
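
The sample map passed to --sample-name-map is a plain, tab-delimited, two-column text file with one sample per line (sample name, then path to its gVCF); the paths below are only illustrative:

sample_1	results/gvcfs/sample_1.g.vcf.gz
sample_2	results/gvcfs/sample_2.g.vcf.gz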

Now you can use the genomic databases from gatk GenomicsDBImport to create VCF files (one per interval file):

gatk GenotypeGVCFs -R {reference.fa} --heterozygosity {params} --genomicsdb-shared-posixfs-optimizations true -V {input.DB} -O {output.vcf}

Where
-R is the reference genome
--heterozygosity is the value used to compute prior likelihoods for any locus; we use 0.005 as a default, but this can be changed in the config
-V is the DB map file
-O is the output VCF

Then, before combining all the VCFs for the intervals into one final VCF, it is computationally more efficient to filter each of the VCFs, then gather them together.

gatk VariantFiltration -R {reference.fa} -V {input.vcf} --output {output.vcf} \
--filter-name "RPRS_filter" \
--filter-expression "(vc.isSNP() && (vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -8.0)) || ((vc.isIndel() || vc.isMixed()) && (vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -20.0)) || (vc.hasAttribute('QD') && QD < 2.0)" \
--filter-name "FS_SOR_filter" \
--filter-expression "(vc.isSNP() && ((vc.hasAttribute('FS') && FS > 60.0) || (vc.hasAttribute('SOR') &&  SOR > 3.0))) || ((vc.isIndel() || vc.isMixed()) && ((vc.hasAttribute('FS') && FS > 200.0) || (vc.hasAttribute('SOR') &&  SOR > 10.0)))" \
--filter-name "MQ_filter" \
--filter-expression "vc.isSNP() && ((vc.hasAttribute('MQ') && MQ < 40.0) || (vc.hasAttribute('MQRankSum') && MQRankSum < -12.5))" \
--filter-name "QUAL_filter" \
--filter-expression "QUAL < 30.0" \
--invalidate-previous-filters

gatk GatherVcfs {input.vcfs} -O {output.vcf}

Where
-R is the reference genome
-V is the input VCF (per interval)
--output is the filtered VCF (per interval)
{input.vcfs} is the list of filtered per-interval VCFs, which the pipeline assembles into a single input command using the intervals list
-O is the final output VCF
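
Expanded, the gather step ends up looking something like this (file names are illustrative; GatherVcfs expects the inputs in genomic order):

gatk GatherVcfs -I interval_0.filtered.vcf.gz -I interval_1.filtered.vcf.gz -I interval_2.filtered.vcf.gz -O {output.vcf}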

The filter expressions more or less follow the GATK hard-filtering recommendations, but they are split so that SNPs and indels are treated with different thresholds within each filter expression, and different quality annotations are kept in separate filters to make downstream filtering easier.


Quality control and statistics

There are a number of statistics generated by snpArcher at various points of the workflow, as well as a QC workflow that is run after the final VCF is produced. This QC workflow produces an interactive dashboard for the user; here we have grouped some of the QC commands by type/purpose rather than presenting each rule separately.

This bcftools query pulls the designated fields in a tab-separated format that can be easily read into R (or other software) for interrogation.

bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/AF\t%QUAL\t%INFO/ReadPosRankSum\t%INFO/FS\t%INFO/SOR\t%INFO/MQ\t%INFO/MQRankSum\n' {input.vcf} > {output}
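
The query output has no header line; if one would make downstream reading easier, you can prepend it yourself (a small sketch, not part of the pipeline):

echo -e "CHROM\tPOS\tID\tAF\tQUAL\tReadPosRankSum\tFS\tSOR\tMQ\tMQRankSum" > {output}
bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/AF\t%QUAL\t%INFO/ReadPosRankSum\t%INFO/FS\t%INFO/SOR\t%INFO/MQ\t%INFO/MQRankSum\n' {input.vcf} >> {output}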


This set of commands will output statistics based on the VCF:

vcftools --gzvcf {input.vcf} --FILTER-summary --out {output.prefix}
vcftools --gzvcf {input.vcf} --depth --out {output.prefix}
vcftools --gzvcf {input.vcf} --missing-indv --out {output.prefix}

Where
--gzvcf is the input VCF, gzipped (use --vcf if working with an uncompressed VCF)
--FILTER-summary generates a summary of the FILTER field, --depth reports the mean depth per individual, and --missing-indv generates a file reporting the missingness per individual
--out assigns a prefix to each output, e.g., the filter summary output would be titled “prefix.FILTER.summary”



Create a PCA with PLINK, using biallelic variants only

# First, remove filtered sites and keep only the biallelic SNPs
bcftools view -v snps -m2 -M2 -f .,PASS -e 'AF==1 | AF==0 | ALT="*" | TYPE~"indel" | ref="N"' {input.vcf} -O z -o {output.filtered}
bcftools index {output.filtered}

# Then, if there are fewer than 200,000 SNPs, just take all of them
ALLSITES=`bcftools query -f '%CHROM\t%POS\n' {output.filtered} | wc -l`
SITES=`echo $(( ${ALLSITES} / 100000 ))`
if [[ $SITES -gt 1 ]]
then
    bcftools +prune -w $SITES -n 1 -N rand -O z -o {output.pruned} {output.filtered}
else
    bcftools view -O z -o {output.pruned} {output.filtered}
fi

# Use PLINK 2 for the PCA and the KING relatedness matrix ...
plink2 --vcf {input.vcf} --pca 2 --out {params.prefix} --allow-extra-chr --autosome-num 95 --make-bed --make-king square --const-fid --bad-freqs

# and PLINK 1.9 for the distance matrix 
plink --vcf {input.vcf} --out {params.prefix} --allow-extra-chr --autosome-num 95 --distance square --const-fid



snpArcher also runs ADMIXTURE at k=2 and k=3:

# First, make a BIM without any non-numeric characters in the chromosome names
mv {input.bim} {input.bim}.orig
paste <(cut -f 1 {input.bim}.orig | sed 's/[^0-9]//g') <(cut -f 2,3,4,5,6 {input.bim}.orig) >  {input.bim}

# Then run ADMIXTURE
admixture {input.bed} 2
admixture {input.bed} 3
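
If you want to explore values of K beyond the two run by the pipeline, ADMIXTURE's cross-validation mode is one option (a sketch, not part of snpArcher):

for K in 2 3 4 5; do
    admixture --cv {input.bed} $K | tee admixture_K${K}.log
done
grep -h "CV error" admixture_K*.log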