By Angus Ball
Hi, by now you’ve processed your data through DADA2 and now it’s time
for some data analysis!!! haha sike, we have to format it first, get
rekt
These are the packages you’ll need
library(phyloseq)
library(ggplot2)
library(janitor)
library(dplyr)
library(tidyverse)
library(ggpubr)
library(ape)
library(splitstackshape)
library(Biostrings)
Now it’s time to import our data from DADA2. Unfortunately we need to
work with a .csv file and our data come off as text files. Fun fact: .csv
files are just plain text files with comma-separated values, which Excel
happily opens and saves
To turn your text files into a .csv file, open Excel and then try to
open your .txt file. Excel will go “hey…” and try to convert your txt
file into an Excel file. This is perfect! Our file is delimited (the
autopicked option), via tab (the autopicked option after you click next
once), so all you have to do is meander through the pop-up and Excel
will do all the heavy lifting for you! Simply save your file as a .csv
file wherever you’d like!
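If you'd rather skip Excel, the same conversion can be sketched in R. This is only a rough sketch, assuming the DADA2 text file is tab-delimited with a header row; the file here is a throwaway temp file with made-up contents (the "SV" header in particular is a guess), so swap in your own paths.

```r
# Hypothetical demo file standing in for a DADA2 tab-delimited .txt output
txt_path <- tempfile(fileext = ".txt")
csv_path <- tempfile(fileext = ".csv")
writeLines(c("SV\tSampleA\ttax.vector",
             "SV_0\t12\tk__Bacteria:p__Firmicutes",
             "SV_1\t3\tk__Bacteria:p__Proteobacteria"),
           txt_path)

# Read the tab-delimited text; check.names = FALSE leaves sample names untouched
tab <- read.delim(txt_path, sep = "\t", header = TRUE,
                  check.names = FALSE, stringsAsFactors = FALSE)

# Write it back out as comma-separated values
write.csv(tab, csv_path, row.names = FALSE)

# Read the .csv again to confirm the table survived the trip
csv <- read.csv(csv_path, check.names = FALSE, stringsAsFactors = FALSE)
```

Doing it in code also means the conversion is written down, which helps when you have to redo it for a second sequencing run.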
Now go off into the world and create a csv file of your
16S_R[name]_dada2_nochim_tax.txt file! Okay sweet, on the left should be
a column of SV_0..SV_10 etc., in the middle are your sample names with all
the count data per SV, and on the right is tax.vector.
Fun fact: backups! Hey, I’ve noticed that you didn’t immediately
create a backup and a working csv file. Shame on you! When you get a new
file you plan on modifying, it’s important to always create a backup copy
that you NEVER modify. It’d be a shame if you were to delete something
you shouldn’t have and then had to restart from the beginning.
Your 16S_R[name]_dada2_nochim_tax.csv file is now ready to be
imported into R
Let’s import this table
MetaG_1 <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\16S-dada2_nochim_tax.csv")
MetaG_2 <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\Redone mastermix stuff\\16S_RefInoculum_dada2_nochim_tax.csv")
Seq_1 <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\16S-sv_seqs.csv")
Seq_2 <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\Redone mastermix stuff\\16S_RefInoculum_sv_seqs.csv")
Congrats! We now have the data in R. You may have noticed I’ve
actually read 2 separate sets of data, 1 and 2. You might not have two
sets of data but I do! We will then create a phyloseq object from both
of these datasets
This is what the data looks like (the second set looks the same)
head(MetaG_1)
head(Seq_1)
We’ll start by combining all these tables! The MetaG tables are
OTU tables, with count and taxonomy data; the Seq tables hold the exact
ASV sequence for each SV. Since we want to bind these tables side by
side, we must make sure the rows of each pair are in exactly the same
SV order
So,
MetaG_1_total <- cbind(MetaG_1, Seq_1$Seq)
MetaG_2_total <- cbind(MetaG_2, Seq_2$Seq)
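cbind() glues columns together purely by row position, so a quick sanity check on the IDs is cheap insurance. A minimal sketch on toy data; the column names "SV" and "Seq" are assumptions about how your files are laid out, so check yours with head() first.

```r
# Toy stand-ins for a MetaG table and its Seq table (column names are assumptions)
counts <- data.frame(SV = c("SV_0", "SV_1", "SV_2"),
                     SampleA = c(12, 3, 0),
                     stringsAsFactors = FALSE)
seqs <- data.frame(SV = c("SV_0", "SV_1", "SV_2"),
                   Seq = c("ACGT", "GGCA", "TTAG"),
                   stringsAsFactors = FALSE)

# TRUE only if both tables list the same IDs in the same order
ids_match <- identical(counts$SV, seqs$SV)
stopifnot(ids_match)

# Safe to bind now: row i of counts and row i of seqs describe the same SV
combined <- cbind(counts, Seq = seqs$Seq)
```

If the check fails, sort both tables by the ID column (or use merge() on it) before binding.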
Then we have to remove the SV column so there aren’t multiple copies of
the same value in the final table
#remove the SV column
MetaG_1_total <- select(MetaG_1_total, -(1))
MetaG_2_total <- select(MetaG_2_total, -(1))
Then we can combine these two tables based on the ASV sequences in Seq,
BUT to make sure we don’t lose anything let’s find the total counts
per sample. I’m just going to spot check two
sum(MetaG_1_total$`Municipal compost-10-Fertalized-Inoculated`)
[1] 20690
sum(MetaG_2_total$MasterMix)
[1] 22651
Okay, time to combine them
MetaG <- merge(MetaG_1_total, MetaG_2_total, by.x = "Seq_1$Seq", by.y = "Seq_2$Seq", all.x = TRUE, all.y = TRUE)
Let’s check those sums again!
sum(MetaG$`Municipal compost-10-Fertalized-Inoculated`)
[1] NA
sum(MetaG$MasterMix)
[1] NA
That’s because in the merge step a bunch of NAs were added on rows
not seen in both datasets; we can just set those NAs to 0
MetaG[is.na(MetaG)]<-0
sum(MetaG$`Municipal compost-10-Fertalized-Inoculated`)
[1] 20690
sum(MetaG$MasterMix)
[1] 22651
Perfect, same as before, so there was no data loss!
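Here is the whole merge-then-check dance on a toy example, in case the NA business is still fuzzy. The table and column names are made up; the pattern (outer merge, NA to 0, compare column sums) is the same one used above.

```r
# Two toy count tables keyed on sequence; only "ACGT" appears in both
run1 <- data.frame(Seq = c("ACGT", "GGCA"), SampleA = c(10, 5),
                   stringsAsFactors = FALSE)
run2 <- data.frame(Seq = c("ACGT", "TTAG"), SampleB = c(7, 2),
                   stringsAsFactors = FALSE)

# Totals per sample before merging
before <- c(SampleA = sum(run1$SampleA), SampleB = sum(run2$SampleB))

# all = TRUE keeps every sequence from both tables (a full outer join)
both <- merge(run1, run2, by = "Seq", all = TRUE)

# A sequence missing from one run shows up as NA; a missing count is a zero count
both[is.na(both)] <- 0

# Totals per sample after merging should be unchanged
after <- colSums(both[, c("SampleA", "SampleB")])
stopifnot(all(before == after))
```

Checking every sample with colSums() instead of spot checking two costs nothing extra.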
Let’s just clean up the table a bit
#first separate the ASVs into a new table
asvs <- select(MetaG, "Seq_1$Seq")
MetaG <- select(MetaG, -"Seq_1$Seq") #then remove them from the table
#let's separate the taxa values
Tax_temp <- select(MetaG, c("tax.vector.x", "tax.vector.y"))
#then let's put them in the same column, so that tax.vector.x contains no zeros and holds all the data
Tax_fixed <- Tax_temp %>%
  mutate(tax.vector.x = ifelse(tax.vector.x == 0, tax.vector.y, tax.vector.x)) #spoken: if tax.x contains a zero, replace it with the value from tax.y (which we know has to have a value if tax.x doesn't)
#drop tax.vector.y, everything we need is in tax.vector.x now
Tax_fixed <- select(Tax_fixed, -"tax.vector.y")
#then we can fully clean up the MetaG table
MetaG <- MetaG %>%
  select(-"tax.vector.x") %>%
  select(-"tax.vector.y") %>%
  select(-("CHB1P1":"FNP9B3")) #removing the tax columns because they are in Tax_fixed now, and removing samples not applicable to this dataset
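A side note on the "take x, else take y" trick. The version above compares against 0 because the NAs were already turned into zeros. If you ever do this before that step, the same idea works directly on NA. A small base R sketch on made-up taxonomy strings (dplyr's coalesce() does the same job in one call):

```r
# Toy taxonomy columns as they might look straight after an outer merge
tax <- data.frame(
  tax.vector.x = c("k__Bacteria:p__Firmicutes", NA, NA),
  tax.vector.y = c("k__Bacteria:p__Firmicutes", "k__Archaea", NA),
  stringsAsFactors = FALSE
)

# Keep x where it exists, otherwise fall back to y
tax$tax.vector <- ifelse(is.na(tax$tax.vector.x),
                         tax$tax.vector.y,
                         tax$tax.vector.x)
```

Row three stays NA because neither column had a value, which is worth knowing about before moving on.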
First things first, we have to format the tables in the way that
phyloseq is expecting them and unfortunately, this is a bit of a pain.
Phyloseq expects a couple of tables: first a count table (this is the
MetaG table), a taxonomy table (this is tax.vector), and a sample data
table (this one is based on the groups that exist in your samples)
Let’s start by creating the taxonomy table from the taxonomy column
we pulled out of the MetaG tables!
Tax<-select(Tax_fixed, ("tax.vector.x"))
Oh bother! Phyloseq expects the taxonomy to be separated by rank,
i.e. one column for kingdom, one for family etc. But everything is under
one column. This means we have to delimit again, and looking at how the
column is created from DADA2, we need to delimit by colon.
cSplit splits a column on a delimiter and then the rename function gives
the columns the names phyloseq expects!
Tax <- cSplit(Tax, "tax.vector.x", ":")
Warning: 'as.is' should be specified by the caller; using TRUE (repeated 7 times)
Tax <- dplyr::rename(Tax,
  Kingdom = tax.vector.x_1,
  Phylum = tax.vector.x_2,
  Class = tax.vector.x_3,
  Order = tax.vector.x_4,
  Family = tax.vector.x_5,
  Genus = tax.vector.x_6,
  Species = tax.vector.x_7
)
#our dataset's kinda wibbly wobbly because we've been adding things together, so sometimes if there isn't a designated taxon it writes NA and sometimes it writes s__ or f__, let's fix that so it always just writes s__ etc.
#first set the NAs to zero
Tax[is.na(Tax)] <- 0
#then find and replace with a similar command as above
Tax <- Tax %>%
  mutate(Kingdom = ifelse(Kingdom == 0, "k__", Kingdom)) %>%
  mutate(Phylum = ifelse(Phylum == 0, "p__NA", Phylum)) %>%
  mutate(Class = ifelse(Class == 0, "c__", Class)) %>%
  mutate(Order = ifelse(Order == 0, "o__", Order)) %>%
  mutate(Family = ifelse(Family == 0, "f__", Family)) %>%
  mutate(Genus = ifelse(Genus == 0, "g__", Genus)) %>%
  mutate(Species = ifelse(Species == 0, "s__", Species))
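If you're curious what the split-and-fill is doing under the hood, here is a base R sketch of the same idea. It is not how cSplit is implemented; the taxonomy strings are made up and only three ranks are used to keep it short.

```r
# Made-up colon-delimited taxonomy strings; the second one stops at phylum
tax_strings <- c("k__Bacteria:p__Firmicutes:c__Bacilli",
                 "k__Bacteria:p__Proteobacteria")
ranks    <- c("Kingdom", "Phylum", "Class")
prefixes <- c("k__", "p__", "c__")

# Split each string on ":"; indexing past the end pads short ones with NA
pieces  <- strsplit(tax_strings, ":", fixed = TRUE)
tax_mat <- t(vapply(pieces,
                    function(p) p[seq_along(ranks)],
                    character(length(ranks))))
colnames(tax_mat) <- ranks

# Replace each missing rank with the bare prefix for that rank
for (j in seq_along(ranks)) {
  tax_mat[is.na(tax_mat[, j]), j] <- prefixes[j]
}
```

The result is already a character matrix with one column per rank, which is the shape a taxonomy table needs.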
We have to take our tables (which exist as data frames) and convert
them into matrices. Why? Because the phyloseq constructors we use below
work with matrices and not data frames (same idea as the Java analogy
further down: a class is picky about types, so you can’t feed it two int
values when it wants an int and a char!)
#now real stuff
#convert these bad boys to matrices
Tax <- as.matrix(Tax)
MetaG <- as.matrix(MetaG)
You can check the class of the objects with a simple command (oh yes,
the tables have been objects all along, we’re creating objects within
classes within classes! Neat, isn’t it?)
#check
class(Tax)
[1] "matrix" "array"
class(MetaG)
[1] "matrix" "array"
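One thing class() won't tell you is what type ended up inside the matrix. A matrix holds a single type, so one leftover text column turns every count into text. A toy sketch (made-up tables) of what to look out for:

```r
# A counts-only table converts to a numeric matrix
counts_only <- data.frame(SampleA = c(10, 5), SampleB = c(7, 0))
m_num <- as.matrix(counts_only)

# Leave one text column in and the whole matrix becomes character
with_text <- data.frame(SampleA = c(10, 5),
                        tax = c("k__Bacteria", "k__Archaea"),
                        stringsAsFactors = FALSE)
m_chr <- as.matrix(with_text)

is.numeric(m_num)    # the count matrix should pass this check
is.character(m_chr)  # this one is text all the way through
```

So a quick is.numeric(MetaG) after the conversion is a decent way to confirm that no taxonomy or sequence column snuck through the clean-up.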
Phyloseq matches the count table and the taxonomy table by their row
names, so both need the same labels. This is how to do it; I’m numbering
the OTUs the same way (OTU1, OTU2, ...) in both tables because of how
phyloseq merges on row names later on
#call row values OTU
rownames(MetaG) <- paste0("OTU", 1:nrow(MetaG))
rownames(Tax) <- paste0("OTU", 1:nrow(Tax))
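Since the two tables get paired up by row name, it doesn't hurt to confirm the labels agree before going further. A small sketch on toy matrices (names and values made up):

```r
# Toy count and taxonomy matrices with the same number of rows
counts <- matrix(c(10, 5, 7, 0), nrow = 2,
                 dimnames = list(NULL, c("SampleA", "SampleB")))
tax <- matrix(c("k__Bacteria", "k__Archaea", "p__Firmicutes", "p__"), nrow = 2,
              dimnames = list(NULL, c("Kingdom", "Phylum")))

# Same labelling scheme on both, so row i of one describes row i of the other
rownames(counts) <- paste0("OTU", seq_len(nrow(counts)))
rownames(tax)    <- paste0("OTU", seq_len(nrow(tax)))

# TRUE only if the labels are identical and in the same order
names_agree <- identical(rownames(counts), rownames(tax))
```

This only works because both tables came from the same merged table and kept the same row order; if you reorder or filter one of them, do the same to the other before labelling.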
Okay, hopefully you know a bit about objects/classes. If not, Funless
fact: please go read the first couple chapters of a coding in Java
textbook. Funner fact: data structure is neat. Say you want to store a
number, “1”. You can do so in Java by telling it a datatype (int for
integer) and then a value, for example int x = 1; this tells the computer
that x is equal to one. Great, but int only works for whole numbers (and
not even all of them ;), so what if you wanted to store a letter? Well,
you use char (for character), so char y = 'b'; this is saying that y is
equal to the letter “b”. Obviously this quick example loses a lot of the
subtlety and complexity of the system (re: go read the first couple
chapters of a Java textbook), but alas.
So now let’s say you wanted a thing, let’s call it a class, that
contained a number and a character. You could create a class that used x
and y to create an object. This is a poor explanation that has
likely made several computer scientists sad but whatever. Phyloseq is a
class that creates phyloseq objects, each object having at a minimum an
abundance table (MetaG) and a taxonomy table (Tax). Excellent, you are
now equally confused as earlier but I feel like I’ve done my due
diligence, onward.
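You can poke at the same idea in R itself. This is a minimal sketch using R's S4 class system, which is what phyloseq objects are built with; the class name Thing and its slots x and y are made up to mirror the Java example and have nothing to do with phyloseq's real slots.

```r
library(methods)

# A made-up class with one numeric slot and one character slot
setClass("Thing", representation(x = "numeric", y = "character"))

# An object is one concrete instance of that class
obj <- new("Thing", x = 1, y = "b")

# Slots are read with @, the same operator used on phyloseq objects
obj@x
obj@y

# The class enforces its slot types: text in the numeric slot is an error
bad <- tryCatch(new("Thing", x = "oops", y = "b"),
                error = function(e) "rejected")
```

That pickiness about types is the reason for all the converting and renaming in this walkthrough.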
Now phyloseq wants us to turn these matrices into its own classes
(this isn’t a phyloseq object yet)
OTU = otu_table(MetaG, taxa_are_rows = TRUE)
TAX = tax_table(Tax)
Now we’ll create a phyloseq object from the OTU table and the TAXonomy
table
#combine that data
physeq = phyloseq(OTU, TAX)
Side point time! Hey, do you need to add sequence information to your
phyloseq object? Follow these next couple of commands. First import your
sequence information (we already have ours in asvs from the clean-up
step).
Then we will pull the sequences from that table using Biostrings, then
the taxon names from the phyloseq object, then combine this information
and add it to the phyloseq object
sequences <- Biostrings::DNAStringSet(asvs$`Seq_1$Seq`)
names(sequences) <- taxa_names(physeq)
physeq <- merge_phyloseq(physeq, sequences)
You can see it worked by printing the object
physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 48054 taxa and 52 samples ]
tax_table() Taxonomy Table: [ 48054 taxa by 7 taxonomic ranks ]
refseq() DNAStringSet: [ 48054 reference sequences ]
The otu and tax tables were added earlier; this step added the refseq()
table
Excellent, almost done!
Now if you inspect the sample names, they encode a variety of
things
physeq@otu_table
Things like “municipal compost”, “fertalized” etc. This is metadata
that is in the sample names, but the phyloseq object doesn’t know about
it yet. Fun fact: if you want to know what this actually is or what the
names mean, ask to read my manuscript. Either way, we must import this
metadata.
Key <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\ultra to categories key.csv")
Rows: 52 Columns: 5
── Column specification ──
Delimiter: ","
chr (5): Name, Amd, Conc, Fertalized, Inoculum
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you explore this document, you’ll realize it has all the samples
in the order they appear as column names in the MetaG table, but here
they are in rows instead. The metadata are in the other columns.
For the same reasons we set the OTU row names earlier (phyloseq being
fickle), we have to make the sample names the row names
Key <- data.frame(Key[,-1], row.names=Key$Name)
sampledata = sample_data(Key)
Now we will add the key data
physeq_Key = merge_phyloseq(physeq, sampledata)
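The key only lines up with the counts if the names match exactly, and a stray space or typo will make a sample's metadata go missing. Here is a quick base R check you could run before building the sample data. The toy tables are made up, with a deliberate mismatch so the check has something to find; the "Name" column mirrors the key file above.

```r
# Toy count matrix: samples are the columns
counts <- matrix(0, nrow = 2, ncol = 3,
                 dimnames = list(c("OTU1", "OTU2"), c("S1", "S2", "S3")))

# Toy key: one row per sample, sample names in a "Name" column
key <- data.frame(Name = c("S1", "S2", "S4"),
                  Amd = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# Samples in the count table that have no metadata row
missing_from_key <- setdiff(colnames(counts), key$Name)

# Metadata rows that match no sample in the count table
missing_from_counts <- setdiff(key$Name, colnames(counts))
```

Both results should be empty (character(0)) on real data; here they flag "S3" and "S4".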
Congrats, you are done (lie)!!! You now have a phyloseq object and can
move on to transformations and data analysis
Let’s save it
#saveRDS(physeq_Key, file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\ULTRA\\physeq_Key_merged.rds")
physeq_Key <- readRDS(file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\ULTRA\\physeq_Key_merged.rds")
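If saveRDS()/readRDS() are new to you: they write one R object to disk and read it back exactly as it was, class and all. A tiny round trip on a made-up object and a temp file:

```r
# Any R object works; this toy list stands in for the phyloseq object
obj <- list(counts = matrix(1:4, nrow = 2), note = "toy object")

# Write it to a temporary .rds file and read it straight back
rds_path <- tempfile(fileext = ".rds")
saveRDS(obj, file = rds_path)
restored <- readRDS(rds_path)

# The restored copy is the same object, structure and values included
same <- identical(obj, restored)
```

That is why the save line above can stay commented out once the file exists: later sessions only need the readRDS() line.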
