By Angus Ball
Alpha diversity just got COMPLICATED Ill be merging these two tutorials, and honestly, they are going to be alot better than I could ever be so go give them a run through! I’m still gonna explain here so we have a S.O.P but when the author of the code offers to explain how to use it, don’t turn your nose https://github.com/adw96/DivNet/blob/main/vignettes/getting-started.Rmd https://adw96.github.io/breakaway/articles/breakaway.html https://github.com/adw96/breakaway/blob/main/vignettes/diversity-hypothesis-testing.Rmd
Irregardless lets go! start by uploading your data!
filepulling <- "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\physeq_Key.rds" #add your file location here
physeq_Key <- readRDS(file = filepulling)
library(DivNet)
library(phyloseq)
library(breakaway)
library(speedyseq)
library(magrittr)
library(tibble)
library(tidyverse)
This alpha diversity program assumes samples are from the same population of species (as they often are!) to estimate the true abundance of species and have an altogether more accurate alpha diversity metric. First things first, this is a taxing program to run so we are going to look at the order level, you can change this just as easy.
physeq_order <- tax_glom(physeq_Key, taxrank="Order") #conglomerates everything to an order level
physeq_order %>% sample_variables()
[1] "Location" "Layer" "Treatment" "Site"
We just combined all the taxonomy so the lowest level of depth is at an order level Now lets run the stats!
#divnet_order <- physeq_order %>% divnet(tuning = "careful")
#divnet_order_Treatment <- physeq_order %>% divnet(tuning = "careful", formula = ~ Treatment)
#divnet_order_Location <- physeq_order %>% divnet(tuning = "careful", formula = ~ Location)
#divnet_order_Treatment_Location <- physeq_order %>% divnet(tuning = "careful", formula = ~ Treatment + Location)
set.seed(6458740)
divnet_order_Treatment_Location_Layer <- physeq_order %>% divnet(tuning = "careful", formula = ~ Treatment + Location + Layer) #only running this one because its the only one we think matters
Removing absent taxa!
|
| | 0%
Attaching package: ‘foreach’
The following objects are masked from ‘package:purrr’:
accumulate, when
|
|=========== | 10%
|
|===================== | 20%
|
|================================ | 30%
|
|========================================== | 40%
|
|===================================================== | 50%
|
|================================================================ | 60%
|
|========================================================================== | 70%
|
|===================================================================================== | 80%
|
|=============================================================================================== | 90%
|
|==========================================================================================================| 100%
|
| | 0%
|
|============ | 10%
|
|======================= | 20%
|
|================================== | 30%
|
|============================================== | 40%
|
|========================================================== | 50%
|
|===================================================================== | 60%
|
|================================================================================ | 70%
|
|============================================================================================ | 80%
|
|======================================================================================================== | 90%
|
|===================================================================================================================| 100%
|
| | 0%
|
|============ | 10%
|
|======================= | 20%
|
|================================== | 30%
|
|============================================== | 40%
|
|========================================================== | 50%
|
|===================================================================== | 60%
|
|================================================================================ | 70%
|
|============================================================================================ | 80%
|
|======================================================================================================== | 90%
|
|===================================================================================================================| 100%
|
| | 0%
|
|============ | 10%
|
|======================= | 20%
|
|================================== | 30%
|
|============================================== | 40%
|
|========================================================== | 50%
|
|===================================================================== | 60%
|
|================================================================================ | 70%
|
|============================================================================================ | 80%
|
|======================================================================================================== | 90%
|
|===================================================================================================================| 100%
|
| | 0%
|
|============ | 10%
|
|======================= | 20%
|
|================================== | 30%
|
|============================================== | 40%
|
|========================================================== | 50%
|
|===================================================================== | 60%
|
|================================================================================ | 70%
|
|============================================================================================ | 80%
|
|======================================================================================================== | 90%
|
|===================================================================================================================| 100%
|
| | 0%
|
|============ | 10%
|
|======================= | 20%
|
|================================== | 30%
|
|============================================== | 40%
|
|========================================================== | 50%
|
|===================================================================== | 60%
|
|================================================================================ | 70%
|
|============================================================================================ | 80%
|
|======================================================================================================== | 90%
|
|===================================================================================================================| 100%
#divnet_order_Location_Layer <- physeq_order %>% divnet(tuning = "careful", formula = ~ Location + Layer)
Okay I just did alot of things, I get that, First things first. we are taking our physeq object and applying (%>%) the divnet command to it. We are apply this code at a “careful” level. i.e. this means more computing power is going into the equation and you’ll get a more accurate representation of the stats being done (you can also do this “fast” but uhh dont?). next is this formula stuff. Within data sets we have covariates (go search up and read what a covariate is now), and when microbial communities covary with say a location each sample reflects a part of the population that can be “combined” statistically so that we can estimate the total population diversity. I’m pretty sure the covariates within this dataset are treatment AND location AND Layer; however, we’re gonna run all of them just to see what happens! fun right?
PS. I set a seed because I’ll likely be rerunning code to make sure it works, and I dont want to go around changing my conclusions!
These take FIVE-EVER to run so Imma make and save copies of the objects
#saveRDS(divnet_order, file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order.rds")
#saveRDS(divnet_order_Location, file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order_Location.rds")
#saveRDS(divnet_order_Treatment, file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order_Treatment.rds")
#saveRDS(divnet_order_Treatment_Location, file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order_Treatment_Location.rds")
#saveRDS(divnet_order_Treatment_Location_Layer, file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order_Treatment_Location_Layer.rds")
#saveRDS(divnet_order_Location_Layer, file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order_Location_Layer.rds")
divnet_order_Location_Layer <- readRDS(file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order_Location_Layer.rds")
divnet_order_Treatment_Location <- readRDS(file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order_Treatment_Location.rds")
divnet_order_Treatment <- readRDS(file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order_Treatment.rds")
divnet_order_Location <- readRDS(file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order_Location.rds")
divnet_order <- readRDS(file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order.rds")
divnet_order_Treatment_Location_Layer <- readRDS(file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\divnet_order_Treatment_Location_Layer.rds")
Now lets plot it out! This is a plot of the raw diversity estimates for each sample with no covariate information shared across samples.
divnet_order$shannon %>%
plot(physeq_order)
look at that graph that means absolutely nothing to me! oh yeah
Now lets add some covariate info, treatments
divnet_order_Treatment$shannon %>%
plot(physeq_order, color = "Treatment")
okay something more, how about location covariate info instead?
divnet_order_Location$shannon %>%
plot(physeq_order, color = "Location")
okay…okay, adding them together
divnet_order_Treatment_Location$shannon %>%
plot(physeq_order, color = "Location", shape = "Treatment")
and finally with all of the covariate information
divnet_order_Treatment_Location_Layer$shannon %>%
plot(physeq_order, color = "Location", shape = "Layer")
But perhaps this is too much covariate information does fallow versus replicate really reflect unqiue and different covariates? perhaps no. This inlies a very important part of this kind of analysis. Since the answers you get are partially based on the covariates you choose, you need to be damned sure you choose the correct ones.
divnet_order_Location_Layer$shannon %>%
plot(physeq_order, color = "Location", shape = "Layer")
Based on that our analysis we be on data that includes location and layer covariates
this creates an object thats pretty much the key from phyloseq creation, we need it to hold on to metadata
sample_data_key <- physeq_order %>%
sample_data %>%
tibble::as_tibble() %>%
dplyr::mutate("sample_names" = physeq_order %>% sample_names )
lets run the test!, we need to make three objects to use this function, theyre just more objects that betta expects and we just need to bend to its will
estimates <- divnet_order_Location_Layer$shannon %>% summary %$% estimate #this is where our shannon data is stored
ses <- sqrt(divnet_order_Location_Layer$`shannon-variance`) #take variance and turns it into standard error
X <- model.matrix(~Treatment, data = sample_data_key) #this creates a matrix that the model can use, its the sample sample_data_key but with numbers instead of words, also were we specify that we want to compare treatment
TLL_Treatment_betta <- betta(estimates, ses, X) # run the command
TLL_Treatment_betta$table #whats the p values
Estimates Standard Errors p-values
(Intercept) 1.83359422 0.04690832 0.000
TreatmentReplicate 0.01543417 0.05422932 0.776
Here we imputed the estimates of diversity, the error of these estimates (ses) and a table of covarites (X) and than ran a hypothesis test. intercept is the initial covatiate, in this case fallow, and all the tests are compared off of this covariate. It’s dumb but the estimates (the average diversity across all samples of a type) are given as values compared to the intercept estimate, not their true value. ei total diversity of replicate is (0.015 + 1.83). Based on this, the fallowed land does not have a significantly lower diversity than the replicate land. This is fine. but it only works with one comparison and we have three (although i think we can only do two using this program. Hey factorial anovas are hard yo, but also we saw that this factor is unimportant)
TLL_Location_betta_random <-
betta_random(estimates, ses, X, groups = sample_data_key$Treatment )
TLL_Location_betta_random$table
Estimates Standard Errors p-values
(Intercept) 1.83752775 0.04663094 0.000
TreatmentReplicate 0.01091204 0.05539918 0.844
Hey whats betta_random and whys it different from betta/which one should you use? betta is said to be “A simple plotting interface for comparing total diversity across samples or a covariate gradient.” betta_random is “This function extends betta() to permit random effects modelling.” so what is fixed and random effects? well read: https://web.pdx.edu/~newsomj/mlrclass/ho_randfixd.pdf but briefly, fixed effects are independent variables that are measured without error, for example whether a sample was measured in a forest or a plantation. Random effects are independent variables that are drawn from a population thus representing a population and therefore random values around the population mean i.e. this dataset should use a fixed effect model as it is valid for betta which is a stronger model statistical than betta_random
Either way lets look at how the the model responds to the layer information
estimates <- divnet_order_Location_Layer$shannon %>% summary %$% estimate #this is where our shannon data is stored
ses <- sqrt(divnet_order_Location_Layer$`shannon-variance`) #take variance and turns it into standard error
X <- model.matrix(~Layer, data = sample_data_key) #this creates a matrix that the model can use, its the sample sample_data_key but with numbers instead of words, also were we specify that we want to compare layers
TLL_Layer_betta <- betta(estimates, ses, X) # run the command
TLL_Layer_betta$table #whats the p values
Estimates Standard Errors p-values
(Intercept) 1.5275885 0.02168746 0
LayerOrganic 0.6134102 0.02903266 0
This means that the organic layer has a significantly higher diversity than the mineral layer across all samples its total amount is (0.6+1.5).
Okay what if we wanted to compare multiple things at once? we know replicate versus fallow doesn’t matter, so lets just compare location and layer
estimates <- divnet_order_Location_Layer$shannon %>% summary %$% estimate #this is where our shannon data is stored
ses <- sqrt(divnet_order_Location_Layer$`shannon-variance`) #take variance and turns it into standard error
X <- model.matrix(~Location*Layer, data = sample_data_key)#this is where we specify location and layer
TLL_Layer_Location_betta <- betta(estimates, ses, X, p.digits = 5) # run the command
TLL_Layer_Location_betta$table #whats the p values
Estimates Standard Errors p-values
(Intercept) 1.51525525 0.01263369 0.00000
LocationFallow -0.25929688 0.08586338 0.00253
LocationForest -0.00332354 0.01622067 0.83765
LocationPlantation 0.08221351 0.03220665 0.01069
LayerOrganic 0.38163180 0.01601571 0.00000
LocationFallow:LayerOrganic 0.53081318 0.09211521 0.00000
LocationForest:LayerOrganic 0.18812608 0.02348142 0.00000
LocationPlantation:LayerOrganic 0.46859478 0.03880079 0.00000
okay WHAT the fuck does this all mean? well fun fact this is the worst way to make a p value table known to man. Look at all the names in the intercept, which ones from your dataset are missing? Edge:mineral. That means edge:mineral is the intercept. so locationfallow is actually fallow:mineral, and layerorganic is actually edge:organic and so on. Estimates work the same way as before AND its still dumb. But regardless you can* use this table to make comparisons except there are two very important notes.
Firstly, why is edge:mineral first? its because its alphabetically first…. great. There is no easy way to change this besides renaming your location data. Yup its bs, if you wanted to compare forest to plantation rename forest to aforest so it’ll pop up first alphabetically and be the intercept. run the below commands and then rerun the betta exercise
physeq_order <- tax_glom(physeq_Key, taxrank="Order")
sample_data(physeq_order) <- data.frame(sample_data(physeq_order)) %>%
mutate(Location = case_when(Location == "Forest" ~ "aForest",
Location == "Plantation" ~ "Plantation",
Location == "Edge" ~ "Edge",
Location == "Fallow" ~ "Fallow"
#yes you need to specify each condition otherwise some samples might be removed
))
sample_data_key <- physeq_order %>%
sample_data %>%
tibble::as_tibble() %>%
dplyr::mutate("sample_names" = physeq_order %>% sample_names )
estimates <- divnet_order_Location_Layer$shannon %>% summary %$% estimate #this is where our shannon data is stored
ses <- sqrt(divnet_order_Location_Layer$`shannon-variance`) #take variance and turns it into standard error
X <- model.matrix(~Location*Layer, data = sample_data_key)#this is where we specify location and layer
TLL_Layer_Location_betta_forest <- betta(estimates, ses, X, p.digits = 5) # run the command
TLL_Layer_Location_betta_forest$table #whats the p values
Estimates Standard Errors p-values
(Intercept) 1.51193171 0.01263369 0.00000
LocationEdge 0.00332354 0.02706791 0.90228
LocationFallow -0.25597334 0.08586338 0.00287
LocationPlantation 0.08553705 0.03220665 0.00791
LayerOrganic 0.56975789 0.01601571 0.00000
LocationEdge:LayerOrganic -0.18812608 0.02770446 0.00000
LocationFallow:LayerOrganic 0.34268709 0.09211521 0.00020
LocationPlantation:LayerOrganic 0.28046870 0.03880079 0.00000
Secondly, betta does not account for the family wise error rate. ALSO great This is great but unfortunately, betta does account for family wise error rate https://github.com/adw96/breakaway/issues/84
so whats there to be done?
ptable <- TLL_Layer_Location_betta$table #start by separating the ptable from the weird betta object
ptable_adj <- ptable %>% #moving it to a new table bc reasons
as.data.frame() %>%
rename(p_values = "p-values") #removing the p-vlaues names because dashes mess stuff up later on
ptable_adj$p_values <- p.adjust(ptable_adj$p_values, method = "bonferroni") #apply whatever pvalue correction you'd like, bonferroni isn't a recommendation, but the first one I remembered off hand
ptable_adj
Sweet! now the p values have been adjusted and are valid as comparisons! But be very careful. Since the cut off earlier was at five digits some of these values were reported as zeros. These zeros will of course pass the bonferroni adjustment! so make sure you have enough zeros you’re comfortable with the adjusted p-table. PS. the intercept P value IS actually zero, so don’t worry about that (and technically you should remove it from this table before running the padjust, bc it isnt a real comparison)
dont forget to save your p-table!!!
Lets go ahead and fix the zeros and reomve the intercept…
ptable <- TLL_Layer_Location_betta$table #start by separating the ptable from the weird betta object
ptable_adj <- ptable %>% #moving it to a new table bc reasons
as.data.frame() %>%
rename(p_values = "p-values") #removing the p-vlaues names because dashes mess stuff up later on
ptable_adj <- ptable_adj %>% filter(!row_number() %in% c(1))
ptable_adj$p_values <- p.adjust(ptable_adj$p_values, method = "bonferroni") #apply whatever pvalue correction you'd like, bonferroni isn't a recommendation, but the first one I remembered off hand
ptable_adj
yknow what? if there wasnt a value 8 digits under I think it might just be signifigant…
PS. heres the original issues page where I finally figured out how to read the intercept tables This makes the table 10x more unreadable and took me a whole ass day to find out https://github.com/adw96/DivNet/issues/34