This tutorial accompanies the paper “The paradox of adaptive trait clines with non-clinal patterns in the underlying genes” published in PNAS.
Multivariate climate change presents an urgent need to understand how species adapt to complex environments. Population genetic theory predicts that loci under selection will form monotonic allele frequency clines with their selective environment, which has led to the wide use of genotype-environment associations (GEAs). However, the accuracy of GEA methods to identify adapted loci is limited, as shown in the main paper.
This tutorial shows how to apply a novel extension of multivariate ordination, which accurately predicted individual multivariate traits from genotype and environmental data on simulated data regardless of whether inference from GEAs was accurate.
If the following packages are not installed, be sure to install them first:
install.packages("vegan")
install.packages("lfmm")
install.packages("gplots")
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("LEA")
libraries_needed <- c("vegan", "LEA", "lfmm", "gplots")
for (i in 1:length(libraries_needed)){
library(libraries_needed[i],character.only = TRUE) #laptop
}
knitr::opts_chunk$set(message = FALSE, warning = FALSE, cache = FALSE)
Don’t forget to set your working directory!
This data was simulated in SLiM and is associated with the complex multivariate simulation presented in Figure 6 in the paper. Briefly, a non-Wright-Fisher model was simulated on a landscape with 6 environmental variables that reflect different aspects of thermal stress and precipitation in British Columbia. The simulation included 6 environmental traits, each of which adapted to a different environmental variable.
The six environmental variables are based on real data from western Canada and are shown below, clockwise from upper right: Clockwise from upper left: precipitation of driest month, precipitation of warmest quarter, mean annual temperature, precipitation of wettest month, mean temperature of wettest quarter, mean temperature of driest quarter. Background colors correspond to the optimum trait value on each landscape, and each small square is a simulated individual, with its color representing its trait value in that environment.
The ind
table includes the xy location for each
individual, the 6 exact trait values (note that these won’t exactly
equal the trait value calculated from the genotype matrix because of MAF
filtering), and the 6 environmental values at their xy location.
The muts
table includes the linkage group
LG
, the position of the mutation on the genetic map
pos_pyslim
, a unique ID mutname
, the allele
frequency based on the 1000 sampled individuals
a_freq_subset
, and whether or not it had effects on one or
more phenotypes causal
.
Note that you will have to change the working directory to where the data is stored on your computer.
G <- read.table(unz("POP_05_RDAtraitPredictionTutorial_files/Genotypes.txt.zip", "Genotypes.txt"))
dim(G) # 1000 individuals and 26371 loci
## [1] 1000 26371
ind <- read.table("POP_05_RDAtraitPredictionTutorial_files/Individuals.txt", header=TRUE)
dim(ind) #corresponds to rows in G
## [1] 1000 15
head(ind)
muts <- read.table("POP_05_RDAtraitPredictionTutorial_files/SNPs.txt", header=TRUE)
dim(muts) #corresponds to columns in G
## [1] 26371 5
head(muts)
rownames(G) <- as.character(paste0("i_",ind$ind_index))
colnames(G) <- as.character(muts$mutname)
#G <- as.matrix(G)
head(G[,1:10])
This function predicts an environmental trait through the back-transformation of the RDA “site score” of an individual to a chosen environmental variable (Equation 1 in the manuscript). It makes the prediction for all the individuals that were used to run the RDA.
rda_trait_pred <- function(rdaobj, env_row, K){
#rdaobj is RDA object
#env_row is the row of the environment in the biplot output
#K is the number of RDA axes
scores <- scores(rdaobj, choices=1:K)
ind.sc <- scores$sites
pred <- matrix(NA, nrow=nrow(ind.sc), ncol=K)
for (k in 1:K){
pred[,k] <- ind.sc[,k]*eigenvals(rdaobj)[k]*summary(rdaobj)$biplot[env_row,k]
}
trait_pred <- scale(rowSums(pred))
return(trait_pred)
}
Scale the environmental variables to have a mean of 0 and standard deviation of 1.
ind$env1_mat <- scale(ind$env1_mat)
ind$env2_MTWetQ <- scale(ind$env2_MTWetQ)
ind$env3_MTDQ <- scale(ind$env3_MTDQ)
ind$env4_PDM <- scale(ind$env4_PDM)
ind$env5_PwarmQ <- scale(ind$env5_PwarmQ)
ind$env6_PWM <- scale(ind$env6_PWM)
# Run the RDA
rdaout <- rda(G ~ ind$env1_mat +
ind$env2_MTWetQ +
ind$env3_MTDQ +
ind$env4_PDM +
ind$env5_PwarmQ +
ind$env6_PWM
)
# Check the biplot output
rdaout$CCA$biplot
## RDA1 RDA2 RDA3 RDA4 RDA5
## ind$env1_mat -0.5004451 0.01863416 0.5476147 -0.54418964 -0.37298079
## ind$env2_MTWetQ 0.4523477 0.19158505 0.8296969 0.11392353 0.03132509
## ind$env3_MTDQ -0.7128197 -0.22757653 -0.1200761 -0.48309488 0.33816827
## ind$env4_PDM -0.3843437 0.04904761 -0.8297919 0.24744529 0.09707124
## ind$env5_PwarmQ 0.6524152 0.56221376 -0.4505148 -0.02427266 -0.04630498
## ind$env6_PWM 0.3209795 -0.05176393 -0.7346369 -0.19495732 -0.08780176
## RDA6
## ind$env1_mat -0.1186113
## ind$env2_MTWetQ -0.2373183
## ind$env3_MTDQ -0.2791780
## ind$env4_PDM -0.3011107
## ind$env5_PwarmQ -0.2292885
## ind$env6_PWM -0.5557731
# Decide how many RDA axes to use in calculation
a<- screeplot(rdaout)
str(a)
## List of 4
## $ x : num [1:6] 0.7 1.9 3.1 4.3 5.5 6.7
## $ y : num [1:6] 189 103.1 66.1 37.8 30.9 ...
## $ xlab: NULL
## $ ylab: NULL
a$y # save this it's the eigenvalues
## [1] 189.02481 103.10740 66.10749 37.79033 30.86247 23.24983
prop_var <- round(a$y[1:6]/sum(a$y),3)
cumsum(prop_var)
## [1] 0.420 0.649 0.796 0.880 0.949 1.001
plot(cumsum(prop_var), xlab="Number of RDA axes",
ylab="Cumulative percent of variation explained", ylim=c(0,1))
# Make the trait prediction for MAT (1st row in biplot output)
K = 3 # use 3 RDA axes to make the trait prediction
MATtraitPredict <- rda_trait_pred(rdaout, 1, K)
# Since this is a simulation, we can compare the prediction to the true value
# Similarly, an empirical study could compare an empirically measured trait value
# to the RDA-predicted trait value to test how well landscape genomic data
# can predict functional traits
plot(scale(ind$phenotype1_mat), MATtraitPredict, xlab="Evolved trait value in simulations",
ylab="RDA trait prediction")
abline(0,1)
#Correlation between the prediction and the true value:
cor(ind$phenotype1_mat, MATtraitPredict)
## [,1]
## [1,] 0.6461756
Note that the predict
function and it’s variations in
the R package vegan
do not make the same kind of
predictions as rda_trait_pred
. Here are the types of
outputs produced by the function predict
and its
variations:
# This option in the `predict` function outputs the scores for each locus in RDA space
loci_scores_predict <- predict(rdaout, type="sp", newdata=G, scaling=2)
str(loci_scores_predict)
## num [1:26371, 1:6] -0.00957 -0.01685 -0.00659 0.08155 0.03396 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:26371] "1-8" "1-27" "1-34" "1-81" ...
## ..$ : chr [1:6] "RDA1" "RDA2" "RDA3" "RDA4" ...
# This option in the `predict` function outputs the fitted values from the multiple regression
# performed on each locus within each individual
fitted_values_predict <- predict(rdaout, newdata=G, type="response")
str(fitted_values_predict)
## num [1:1000, 1:26371] 0.1031 0.0345 -0.0436 0.1821 0.1753 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1000] "i_33" "i_34" "i_44" "i_45" ...
## ..$ : chr [1:26371] "1-8" "1-27" "1-34" "1-81" ...
# As a side note, it outputs the same thing as the `fitted` function
fitted_values_predict2 <- fitted(rdaout)
str(fitted_values_predict2)
## num [1:1000, 1:26371] 0.1031 0.0345 -0.0436 0.1821 0.1753 ...
## - attr(*, "METHOD")= chr "PCA"
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1000] "i_33" "i_34" "i_44" "i_45" ...
## ..$ : chr [1:26371] "1-8" "1-27" "1-34" "1-81" ...
# This option in the `predict` function outputs the individual scores in RDA space
# based on a linear combination of the predictor variables
X <- data.frame(ind$env1_mat ,
ind$env2_MTWetQ ,
ind$env3_MTDQ ,
ind$env4_PDM ,
ind$env5_PwarmQ ,
ind$env6_PWM)
ind_scores_predict <- predict(rdaout, type="lc", new=X, scal=2)
str(ind_scores_predict)
## num [1:1000, 1:6] -2.034 -0.19 0.543 0.253 -0.508 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1000] "1" "2" "3" "4" ...
## ..$ : chr [1:6] "RDA1" "RDA2" "RDA3" "RDA4" ...
The predict
function and its variations make predictions
in RDA space, and therefore do not output the same kind of predictions
as rda_trait_predict
and Equation 1 in the paper.
Prior to ordination in the RDA, each locus is used in a multiple regression model with the environmental variables to produce fitted values for that locus across individuals.
SNP Genotype ~ Env1 + Env2 + Env3 etc.
For example for the first SNP in the data:
# multiple regression of 1st locus
mod <- lm(G[,1] ~ ind$env1_mat + ind$env2_MTWetQ + ind$env3_MTDQ + ind$env4_PDM +
ind$env5_PwarmQ + ind$env6_PWM)
coef(summary(mod))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.035000000 0.007404071 4.7271290 2.605782e-06
## ind$env1_mat -0.035729076 0.011338501 -3.1511287 1.675100e-03
## ind$env2_MTWetQ -0.062538545 0.012165353 -5.1407094 3.295576e-07
## ind$env3_MTDQ 0.007471565 0.013224098 0.5649962 5.722040e-01
## ind$env4_PDM -0.088670803 0.014835808 -5.9768099 3.169042e-09
## ind$env5_PwarmQ -0.069588746 0.014273932 -4.8752331 1.264746e-06
## ind$env6_PWM 0.047203246 0.013582893 3.4751982 5.325440e-04
Although multiple regression is a linear combination of multiple variables, it is able to model complex multivariate responses that appear to be non-monotonic in any one dimension. For example, let’s look at a the relationship between explanatory variable temperature and the response variable genotype, across decreasing and increasing values of the other explanatory variables:
otherenv <- c(seq(1,0,length.out=100), seq(0,1,length.out=101))
newdata=data.frame(ind.env1_mat = seq(-1,1, by=0.01),
ind.env2_MTWetQ = otherenv,
ind.env3_MTDQ = otherenv,
ind.env4_PDM = otherenv,
ind.env5_PwarmQ =otherenv,
ind.env6_PWM = otherenv)
pred <- t(newdata)*(coef(summary(mod))[2:7,1]) + coef(summary(mod))[1,1]
plot(seq(-1,1, by=0.01), colSums(pred), xlab="Temperature", ylab="Genotype prediction")
Thus, there is flexibility with the RDA to capture the way environmental variables may influence the patterns at one locus in a different way than at another locus, which may not correlate with the relationship between the environment and population structure.
It may be interesting for some studies to understand how each locus is shaped by the environment - in other words, what are the slopes associated with the environmental variables in the multiple regression model for each locus?
Unfortunately there is not a way to output these slopes in the R
package vegan
, but we can reproduce the first step of the
RDA to get the regression coefficients: (vegan source code at https://github.com/cran/vegan/blob/master/R/simpleRDA2.R)
X <- data.frame(ind$env1_mat ,
ind$env2_MTWetQ ,
ind$env3_MTDQ ,
ind$env4_PDM ,
ind$env5_PwarmQ ,
ind$env6_PWM)
# Perform qr decomposition to do the regression for all SNPs at the same time
Q <- qr(X, tol=1e-6)
# str(Q) run this line if you want to understand the structure of Q
# Get the matrix of regression coefficients
Qr.coeff <- qr.coef(Q, G)
# This matrix has each SNP in a column and the regression coefficients
# for that SNP corresponds to each environmental variable.
# This is the step that is not performed in the `vegan` package -
# the package skips directly to predicting the fitted values,
# on which the ordination is performed.
# Here is an example of regression coefficients for the first 10 SNPs:
head(Qr.coeff[,1:10])
## [,1] [,2] [,3] [,4]
## ind.env1_mat -0.035729076 0.007247705 -0.001047802 -6.104895e-02
## ind.env2_MTWetQ -0.062538545 0.039624398 0.101004048 9.680467e-05
## ind.env3_MTDQ 0.007471565 -0.005698865 0.039132724 7.995858e-02
## ind.env4_PDM -0.088670803 0.017639011 0.027586476 -1.237919e-01
## ind.env5_PwarmQ -0.069588746 -0.043489428 0.005967464 1.262363e-01
## ind.env6_PWM 0.047203246 -0.012944110 -0.061687810 -6.657131e-03
## [,5] [,6] [,7] [,8] [,9]
## ind.env1_mat -0.063099759 0.02064869 0.02539415 0.002292975 -0.07305243
## ind.env2_MTWetQ 0.023736921 -0.03019428 0.01360083 0.112457789 -0.02354021
## ind.env3_MTDQ -0.019240464 -0.02199593 -0.06875513 0.043179895 0.10418350
## ind.env4_PDM -0.003946225 -0.11767281 0.05887330 0.032681938 -0.09642371
## ind.env5_PwarmQ -0.032384913 0.05976386 -0.07535789 0.008011003 -0.03393128
## ind.env6_PWM 0.018107020 0.06710304 0.06788231 -0.068827598 0.02928042
## [,10]
## ind.env1_mat 0.03694418
## ind.env2_MTWetQ -0.04964375
## ind.env3_MTDQ -0.05660635
## ind.env4_PDM 0.01149236
## ind.env5_PwarmQ -0.01743073
## ind.env6_PWM -0.01118472
# Note that the regression coefficients for the first SNP from this
# approach is exactly the same as from our model above:
Qr.coeff[,1]
## ind.env1_mat ind.env2_MTWetQ ind.env3_MTDQ ind.env4_PDM ind.env5_PwarmQ
## -0.035729076 -0.062538545 0.007471565 -0.088670803 -0.069588746
## ind.env6_PWM
## 0.047203246
coef(summary(mod))[,1]
## (Intercept) ind$env1_mat ind$env2_MTWetQ ind$env3_MTDQ ind$env4_PDM
## 0.035000000 -0.035729076 -0.062538545 0.007471565 -0.088670803
## ind$env5_PwarmQ ind$env6_PWM
## -0.069588746 0.047203246
We can visualize the regression coefficients with a heatmap. In this case, we know the causal loci in the simulations, so we will just visualize those loci.
This visualization illustrates how there are unique ways in which environments are combined in the model to predict the pattern at each SNP.
dim(Qr.coeff)
## [1] 6 26371
colnames(Qr.coeff) <- muts$mutname
# look at the range of coefficients in the multiple regression model
summary(as.numeric(Qr.coeff[,which(muts$causal)]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.657086 -0.054464 0.001058 0.001041 0.056986 0.447236
brks <- seq(-0.7, 0.7, by=0.05) #set the color scale
heatmap.2(t(Qr.coeff[,which(muts$causal)]),
scale="none",
col = cm.colors(length(brks)-1),
breaks=brks,
dendrogram = "column",
Rowv=FALSE, #set this to "TRUE" if you would like to see which groups
trace="none",
key.title = "Coefficient in multiple\nregression model",
ylab="SNPs",
cexCol=1)
In the above heatmap, each row is a SNP. The SNPs are named according to their linkage group (1 through 10) and cumulative position in the genome (e.g. 9-448632 is on the 9th linkage group). Each linkage group was 50,000 bases long, so the cumulative position ranges from 1 to 500,000 over the 10 linkage groups. Each column in the heatmap is an environment, which is an explanatory variable in the model.
The color of the heatmap cells for a SNP shows the coefficients in the multiple regression model for each corresponding environment. In other words, the colors show how environments are combined in a multiple regression model to predict the patterns at that SNP on the landscape.
If we want to visualize clusters of SNPs that have similar coefficients in the multiple regression model, we can allow for clustering in the heatmap:
heatmap.2(t(Qr.coeff[,which(muts$causal)]),
scale="none",
col = cm.colors(length(brks)-1),
breaks=brks,
dendrogram = "both",
Rowv=TRUE,
trace="none",
key.title = "Coefficient in multiple\nregression model",
ylab="SNPs",
cexCol=1)
Here is the information about the session when the tutorial was built:
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Pacific/Honolulu
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gplots_3.1.3.1 lfmm_1.1 LEA_3.16.0
## [4] vegan_2.6-8 lattice_0.22-6 permute_0.9-7
## [7] reticulate_1.39.0 lubridate_1.9.3 forcats_1.0.0
## [10] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
## [13] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [16] ggplot2_3.5.1 tidyverse_2.0.0 kableExtra_1.4.0
## [19] knitcitations_1.0.12 knitr_1.48
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 xfun_0.47 bslib_0.8.0 caTools_1.18.3
## [5] tzdb_0.4.0 bitops_1.0-8 vctrs_0.6.5 tools_4.4.1
## [9] generics_0.1.3 parallel_4.4.1 fansi_1.0.6 highr_0.11
## [13] cluster_2.1.6 RefManageR_1.4.0 pkgconfig_2.0.3 KernSmooth_2.23-24
## [17] Matrix_1.7-0 lifecycle_1.0.4 compiler_4.4.1 munsell_0.5.1
## [21] codetools_0.2-20 htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
## [25] pillar_1.9.0 jquerylib_0.1.4 MASS_7.3-61 cachem_1.1.0
## [29] iterators_1.0.14 foreach_1.5.2 nlme_3.1-166 gtools_3.9.5
## [33] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4 splines_4.4.1
## [37] bibtex_0.5.1 fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
## [41] cli_3.6.3 magrittr_2.0.3 utf8_1.2.4 withr_3.0.1
## [45] scales_1.3.0 backports_1.5.0 timechange_0.3.0 rmarkdown_2.28
## [49] httr_1.4.7 png_0.1-8 hms_1.1.3 evaluate_1.0.0
## [53] viridisLite_0.4.2 mgcv_1.9-1 rlang_1.1.4 Rcpp_1.0.13
## [57] glue_1.7.0 xml2_1.3.6 svglite_2.1.3 rstudioapi_0.16.0
## [61] jsonlite_1.8.8 R6_2.5.1 plyr_1.8.9 systemfonts_1.1.0