---
title: "Importing data from DADA2 via Phyloseq"
output: html_notebook
---
By Angus Ball

Hi, by now you've processed your data through DADA2 and now it's time for some data analysis!!! Haha sike, we have to format it first, get rekt.

These are the packages you'll need:
```{r, echo = T, message=FALSE}
library(phyloseq)
library(ggplot2)
library(janitor)
library(dplyr)
library(tidyverse)
library(ggpubr)
library(ape)
library(splitstackshape)
library(Biostrings)
```




Now it's time to import our data from DADA2. Unfortunately, we need to work with a .csv file and our data come off the pipeline as text files.
Fun fact: .csv files are just plain-text tables with commas between the values, which is why Excel opens them so happily.

To turn your text file into a .csv file, open Excel and then try to open your .txt file from inside it. Excel will go "hey..." and offer to convert the file for you. This is perfect! Our file is delimited (the autopicked option) by tabs (the autopicked option after you click Next once), so all you have to do is meander through the pop-up and Excel will do all the heavy lifting for you! Then simply save the file as a .csv wherever you'd like.
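
If the Excel detour feels like one click too many, R can also read the tab-delimited text file directly. Here's a minimal sketch using base R's `read.delim()`. The file contents and the names `txt_path`, `csv_path`, `tab`, and `back` are made up for illustration, so swap in your own path.

```r
# Toy stand-in for a DADA2 text file (hypothetical contents)
txt_path <- tempfile(fileext = ".txt")
writeLines(c(
  "SV\tSample-A\tSample B\ttax.vector",
  "SV_0\t12\t0\tk__Bacteria:p__Firmicutes",
  "SV_1\t3\t7\tk__Bacteria:p__Proteobacteria"
), txt_path)

# Read the tab-delimited file; check.names = FALSE keeps the sample names as written
tab <- read.delim(txt_path, check.names = FALSE, stringsAsFactors = FALSE)

# Write it back out as a csv without adding a row-number column
csv_path <- tempfile(fileext = ".csv")
write.csv(tab, csv_path, row.names = FALSE)

# Read the csv again to confirm nothing changed on the way
back <- read.csv(csv_path, check.names = FALSE, stringsAsFactors = FALSE)
names(back)
```

`check.names = FALSE` is what stops R from rewriting sample names that contain spaces or hyphens.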

Now go off into the world and create a csv file of your 16S_R[name]_dada2_nochim_tax.txt file!
Okay sweet, on the left should be a column of SV_0..SV_10 etc., in the middle are your sample names with the count data for each SV, and on the right is tax.vector.



Fun fact: backups! Hey, I've noticed that you didn't immediately create a backup and a working csv file. Shame on you! Whenever you get a new file you plan on modifying, always create a backup copy that you NEVER touch in Excel. It'd be a shame if you were to delete something you shouldn't have and had to restart from the beginning.
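
If you'd rather let R make the backup, something like this works. It's only a sketch: a temp file stands in for your real csv, and the `_BACKUP` suffix and the variable names are my own invention.

```r
# Hypothetical working file; a temp file stands in for your real csv
working <- tempfile(fileext = ".csv")
writeLines(c("SV,Sample1", "SV_0,12"), working)

# Put the backup next to it with an obvious suffix
backup <- sub("\\.csv$", "_BACKUP.csv", working)

# overwrite = FALSE means an existing backup is never clobbered
copied <- file.copy(working, backup, overwrite = FALSE)

# A second attempt returns FALSE because the backup already exists
copied_again <- file.copy(working, backup, overwrite = FALSE)
```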

Your 16S_R[name]_dada2_nochim_tax.csv file is now ready to be imported into R.

Let's import this table:

```{r, echo = T, message=FALSE}
MetaG_1 <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\16S-dada2_nochim_tax.csv")
MetaG_2 <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\Redone mastermix stuff\\16S_RefInoculum_dada2_nochim_tax.csv")
Seq_1 <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\16S-sv_seqs.csv")
Seq_2 <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\Redone mastermix stuff\\16S_RefInoculum_sv_seqs.csv")

```

Congrats! We now have the data in R. You may have noticed I've actually read in 2 separate sets of data, 1 and 2. You might not have two sets of data, but I do! We will merge them into one table and then build a single phyloseq object from it.

This is what the data looks like (the second set has the same layout)
```{r}
head(MetaG_1)
```
```{r}
head(Seq_1)
```




We'll start by concatenating all these tables! The MetaG tables are count tables with the counts per sample and the taxonomy for each SV, and the Seq tables hold the exact sequence of each ASV. Since we want to combine these tables column-wise, we must make sure that the ASVs in each pair of tables match each other exactly, row for row.

So, 

```{r}
MetaG_1_total <- cbind(MetaG_1, Seq_1$Seq)
MetaG_2_total <- cbind(MetaG_2, Seq_2$Seq)

```
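
`cbind()` glues columns together purely by row position, so it's worth checking that both tables list the same SVs in the same order before trusting it. A minimal sketch on toy data, assuming each table has an ID column (the names `counts`, `seqs`, `SV`, and `SampleA` are made up; only `Seq` comes from the real tables):

```r
# Toy count table and sequence table (hypothetical values)
counts <- data.frame(SV = c("SV_0", "SV_1", "SV_2"),
                     SampleA = c(5, 0, 2),
                     stringsAsFactors = FALSE)
seqs <- data.frame(SV = c("SV_0", "SV_1", "SV_2"),
                   Seq = c("ACGT", "GGCA", "TTAG"),
                   stringsAsFactors = FALSE)

# Stop right here if the two tables disagree on length or on ID order
stopifnot(nrow(counts) == nrow(seqs))
stopifnot(identical(counts$SV, seqs$SV))

# Safe to bind now; naming the new column avoids the odd "seqs$Seq" header
combined <- cbind(counts, Seq = seqs$Seq)
```

If the IDs don't line up, merging the two tables on the ID column is the safer route than `cbind()`.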

Then we have to remove the SV IDs (the first column) so the final table doesn't end up with two competing sets of identifiers; the sequence itself is the identifier from here on

```{r}
# remove the SV ID column (column 1)
MetaG_1_total <- select(MetaG_1_total, -(1))
MetaG_2_total <- select(MetaG_2_total, -(1))

```

Then we can combine these two tables using the ASV sequences in the Seq column, BUT to make sure we don't lose anything, let's find the total counts per sample first. I'm just going to spot check two samples:

```{r}
sum(MetaG_1_total$`Municipal compost-10-Fertalized-Inoculated`)
sum(MetaG_2_total$MasterMix)
```
Okay, time to combine them
```{r}
MetaG <- merge(MetaG_1_total, MetaG_2_total, by.x = "Seq_1$Seq", by.y = "Seq_2$Seq", all.x = TRUE, all.y = TRUE)
```

Let's check those sums again!

```{r}
sum(MetaG$`Municipal compost-10-Fertalized-Inoculated`)
sum(MetaG$MasterMix)
```
Both sums come back as NA. That's because the merge step added a bunch of NAs to rows (ASVs) that weren't seen in both datasets. We can just set those NAs to 0:

```{r}
MetaG[is.na(MetaG)]<-0
```
```{r}
sum(MetaG$`Municipal compost-10-Fertalized-Inoculated`)
sum(MetaG$MasterMix)
```
Perfect, same as before (20690 and 22651 in my run), so there was no data loss!

Let's just clean up the table a bit
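
Here's the same merge, NA fill, and sum check on a toy pair of tables, in case the real ones are too big to eyeball. Everything in it (`a`, `b`, `S1`, `S2`, the sequences) is made up for illustration.

```r
# Two toy count tables that share only one sequence ("CCC")
a <- data.frame(Seq = c("AAA", "CCC"), S1 = c(10, 5), stringsAsFactors = FALSE)
b <- data.frame(Seq = c("CCC", "GGG"), S2 = c(4, 6), stringsAsFactors = FALSE)

# Full outer merge: keep every sequence from both tables
merged <- merge(a, b, by = "Seq", all = TRUE)

# Sequences missing from one table get NA counts, which poisons the sum
sum_before_fill <- sum(merged$S1)

# A sequence that was never seen in a sample has a count of zero
merged[is.na(merged)] <- 0

# The per-sample totals should match the originals exactly
totals_match <- sum(merged$S1) == sum(a$S1) && sum(merged$S2) == sum(b$S2)
```

If you only want to peek at a total without changing the table, `sum(x, na.rm = TRUE)` gives the same number.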
```{r}
# first separate the ASV sequences into a new table
asvs <- select(MetaG, "Seq_1$Seq")
MetaG <- select(MetaG, -"Seq_1$Seq") # then remove them from the table
# let's separate the taxonomy values
Tax_temp <- select(MetaG, c("tax.vector.x", "tax.vector.y"))
# then put them in the same column, so that tax.vector.x contains no zeros and holds all the data
# spoken: if tax.x contains a zero, replace it with the value from tax.y (which we know has a value if tax.x doesn't)
Tax_fixed <- Tax_temp %>%
  mutate(tax.vector.x = ifelse(tax.vector.x == 0, tax.vector.y, tax.vector.x))
# drop the now-redundant tax.vector.y
Tax_fixed <- select(Tax_fixed, -"tax.vector.y")
```
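
The zero trick works because the NAs were already set to 0 earlier. If you ever need to do the same thing on a table that still has real NAs in it, a sketch like this does the job. The vectors `tax_x` and `tax_y` are toy stand-ins for the two tax.vector columns.

```r
# Toy taxonomy columns from two merged tables; NA means "not in that table"
tax_x <- c("k__Bacteria:p__Firmicutes", NA, "k__Bacteria:p__Proteobacteria")
tax_y <- c("k__Bacteria:p__Firmicutes", "k__Archaea:p__Euryarchaeota", NA)

# Take tax_x where it exists, otherwise fall back to tax_y
tax_combined <- ifelse(is.na(tax_x), tax_y, tax_x)

# Sanity check: where both tables have a value, they should agree
both <- !is.na(tax_x) & !is.na(tax_y)
conflicts <- sum(tax_x[both] != tax_y[both])
```

`dplyr::coalesce(tax_x, tax_y)` gives the same result in one call. The `conflicts` count is worth a look too, since the same sequence could in principle be classified differently in two separate runs.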


```{r}
# Then we can fully clean up the MetaG table
# removing the tax columns because they are in Tax_fixed now, and removing samples not applicable to this dataset
MetaG <- MetaG %>%
  select(-"tax.vector.x") %>%
  select(-"tax.vector.y") %>%
  select(-("CHB1P1":"FNP9B3"))

```






Now we have to format the tables the way phyloseq expects them and, unfortunately, this is a bit of a pain. Phyloseq expects a few tables: a count table (this is the MetaG table), a taxonomy table (this comes from tax.vector), and a sample data table (this one is based on the groupings that exist in your samples).

Let's start by creating the taxonomy table from the taxonomy column we pulled out of MetaG!

```{r}
Tax<-select(Tax_fixed, ("tax.vector.x")) 
```
Oh bother! Phyloseq expects the taxonomy to be separated by rank, i.e. one column for kingdom, one for family, etc. But everything is under one column. This means we have to split on a delimiter again, and looking at how the column is created by DADA2, we need to split on the colon.

cSplit splits a column on a delimiter, and then the rename function gives the columns the names phyloseq expects!

```{r}
Tax <- cSplit(Tax, "tax.vector.x", ":")
Tax <- dplyr::rename(Tax,
  Kingdom = tax.vector.x_1,
  Phylum = tax.vector.x_2,
  Class = tax.vector.x_3,
  Order = tax.vector.x_4,
  Family = tax.vector.x_5,
  Genus = tax.vector.x_6,
  Species = tax.vector.x_7
)

# our dataset is kinda wibbly wobbly because we've been adding things together, so sometimes if there isn't a designated taxon it writes NA and sometimes it writes s__ or f__; let's fix that so it always just writes s__ etc.
# first set the NAs to zero
Tax[is.na(Tax)]<-0
#then find and replace with a similar command as above
Tax <- Tax %>%
  mutate(Kingdom = ifelse(Kingdom == 0, "k__",Kingdom))%>%
  mutate(Phylum = ifelse(Phylum == 0, "p__",Phylum))%>%
  mutate(Class = ifelse(Class == 0, "c__",Class))%>%
  mutate(Order = ifelse(Order == 0, "o__",Order))%>%
  mutate(Family = ifelse(Family == 0, "f__",Family))%>%
  mutate(Genus = ifelse(Genus == 0, "g__",Genus))%>%
  mutate(Species = ifelse(Species == 0, "s__",Species))

```
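
If you want to see what the split and the fill are doing without the extra package, here is roughly the same idea in base R on two toy taxonomy strings. This is a sketch, not what cSplit does internally; the strings, `ranks`, and `prefixes` are made up and cut down to three ranks to keep it short.

```r
# Toy taxonomy strings; the second one has no class assigned
tax_strings <- c(
  "k__Bacteria:p__Firmicutes:c__Bacilli",
  "k__Bacteria:p__Proteobacteria"
)
ranks    <- c("Kingdom", "Phylum", "Class")
prefixes <- c("k__", "p__", "c__")

# Split every string on the literal colon
parts <- strsplit(tax_strings, ":", fixed = TRUE)

# Pad (or truncate) each row to one entry per rank; missing ranks become NA
padded <- lapply(parts, function(p) {
  length(p) <- length(ranks)
  p
})
tax_mat <- do.call(rbind, padded)
colnames(tax_mat) <- ranks

# Replace each missing rank with its bare prefix, e.g. "c__"
for (j in seq_along(ranks)) {
  tax_mat[is.na(tax_mat[, j]), j] <- prefixes[j]
}
tax_mat
```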



We have to take our tables (which exist as data frames) and convert them into matrices. Why? Because phyloseq builds its tables from matrices and not data frames (like in the Java example further down, where you can't hand a class two ints when it wants an int and a char!)

```{r}
# now the real stuff
# convert these bad boys to matrices
Tax <- as.matrix(Tax)
MetaG <- as.matrix(MetaG)


```

You can check the class of the objects with a simple command (oh yes, the tables have been objects all along, we're creating objects within classes within classes! Neat, isn't it?)

```{r}
#check
class(Tax)
class(MetaG)

```


Phyloseq matches the count table to the taxonomy table by row name, so both need the same row names. This is how to do it; I'm naming the rows OTU1, OTU2, and so on in both tables so that they line up when phyloseq combines them later on.
```{r}
# name the rows OTU1, OTU2, ... identically in both tables
rownames(MetaG) <- paste0("OTU", seq_len(nrow(MetaG)))
rownames(Tax) <- paste0("OTU", seq_len(nrow(Tax)))


```
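
Two things can quietly go wrong in the matrix and row-name steps, so here's a toy sketch of what to check. All of the names (`counts_df`, `tax_df`, `mixed`, and so on) are made up.

```r
# Toy count and taxonomy tables for two taxa
counts_df <- data.frame(S1 = c(3, 0), S2 = c(1, 8))
tax_df <- data.frame(
  Kingdom = c("k__Bacteria", "k__Bacteria"),
  Phylum  = c("p__Firmicutes", "p__"),
  stringsAsFactors = FALSE
)

counts_mat <- as.matrix(counts_df)
tax_mat    <- as.matrix(tax_df)

# Give both matrices the same row names so rows can be matched by name
ids <- paste0("OTU", seq_len(nrow(counts_mat)))
rownames(counts_mat) <- ids
rownames(tax_mat)    <- ids

# Gotcha: one leftover text column turns the whole count matrix into text
mixed <- as.matrix(data.frame(S1 = c(3, 0), tax = c("a", "b")))

is.numeric(counts_mat)   # the clean count matrix is still numeric
is.character(mixed)      # the mixed one is now all strings
identical(rownames(counts_mat), rownames(tax_mat))
```

So if `class()` looks fine but the counts later behave like text, check that every non-count column really was removed before `as.matrix()`.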


Okay, hopefully you know a bit about objects/classes. If not:
Funless fact: please go read the first couple chapters of an intro-to-Java textbook.
Funner fact: data structures are neat. Say you want to store a number, 1. You can do so in Java by giving it a datatype (int for integer) and then a value, for example int x = 1; tells the computer that x is equal to one. Great, but int only works for whole numbers (and not even all of them ;), so what if you wanted to store a letter? Well, you use char (for character), so char y = 'b'; says that y is equal to the letter b. Obviously this quick example loses a lot of the subtlety and complexity of the system (re: go read the first couple chapters of a Java textbook), but alas.

So now let's say you wanted a thing, let's call it a class, that contained a number and a character. You could create a class that uses x and y to create an object. This is a poor explanation that has likely made several computer scientists sad, but whatever: phyloseq is a class that creates phyloseq objects, each object having at a minimum an abundance table (MetaG) and a taxonomy table (Tax). Excellent, you are now as confused as you were earlier, but I feel like I've done my due diligence. Onward.
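
R has its own version of this idea, and phyloseq is built on it (phyloseq objects are S4 objects). Here's a toy S4 class with one number slot and one text slot. The class name `Thing` and its slots are invented for illustration and have nothing to do with phyloseq's real slots.

```r
library(methods)

# A hypothetical class with one numeric slot and one character slot
setClass("Thing", slots = c(x = "numeric", y = "character"))

# The right types go in without complaint
ok <- new("Thing", x = 1, y = "b")

# Handing it two numbers fails the slot type check; catch the error to look at it
bad <- tryCatch(new("Thing", x = 1, y = 2), error = function(e) e)

is(ok, "Thing")         # TRUE: ok is an object of class Thing
inherits(bad, "error")  # TRUE: the second attempt was rejected
ok@y                    # slots are read with @
```

That `@` is the same operator used on phyloseq objects.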


Now phyloseq wants us to turn these matrices into its own table classes (this isn't a phyloseq object yet)

```{r}

OTU = otu_table(MetaG, taxa_are_rows = TRUE)
TAX = tax_table(Tax)

```

Now we'll create a phyloseq object from the OTU table and the TAXonomy table

```{r}
#combine that data
physeq = phyloseq(OTU, TAX)
```

Side point time! Hey, do you need to add sequence information to your phyloseq object? Follow these next couple of commands.
We already have the sequence information: it's the asvs table we set aside earlier.


We will turn those sequences into a DNAStringSet using Biostrings, name them with the taxon names from the phyloseq object, and then add them to the phyloseq object
```{r}
sequences <- Biostrings::DNAStringSet(asvs$`Seq_1$Seq`)
names(sequences) <- taxa_names(physeq)
physeq <- merge_phyloseq(physeq, sequences)
```

You can see it worked by printing the object
```{r}
physeq
```
The OTU and tax tables were added earlier; this step added the refseq() table (48054 reference sequences in my run)

Excellent, almost done!

Now if you inspect the sample names, you'll see they encode a variety of meanings
```{r}
sample_names(physeq) # just the sample names, rather than the whole count table
```
Things like "Municipal compost", "Fertalized", etc. are metadata baked into the sample names, but the phyloseq object doesn't know about them yet.
Fun fact: if you want to know what these actually are or what the names mean, ask to read my manuscript.
Either way, we must import this metadata.

```{r}
Key <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\ultra to categories key.csv")
```

If you explore this file, you'll see it lists all the samples in the order they appear as column names in the MetaG table, but here they are in rows instead. The metadata are in the other columns.


For the same reason we set the OTU row names earlier (phyloseq being fickle), we have to make the sample names the row names
```{r}
Key <- data.frame(Key[,-1], row.names=Key$Name)
sampledata = sample_data(Key)
```
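
Phyloseq matches the sample data to the count table by name, and as far as I know it quietly drops samples that don't match instead of complaining, so it's worth comparing the two sets of names yourself. A base R sketch on toy data (the sample names and the `key` table here are made up; only the `Name` and `Amd` column names come from the real key):

```r
# Toy sample names as they appear in the count table columns
sample_cols <- c("Compost-10", "Compost-20", "MasterMix")

# Toy key file, deliberately in a different order
key <- data.frame(
  Name = c("Compost-10", "MasterMix", "Compost-20"),
  Amd  = c("Compost", "None", "Compost"),
  stringsAsFactors = FALSE
)

# Both of these should be empty if the key covers exactly the same samples
missing_from_key <- setdiff(sample_cols, key$Name)
extra_in_key     <- setdiff(key$Name, sample_cols)

# Move the names into the row names, then put the rows in count-table order
key_df <- data.frame(key[, -1, drop = FALSE], row.names = key$Name)
key_df <- key_df[sample_cols, , drop = FALSE]
```

A typo in even one sample name shows up in `missing_from_key` right away, which beats discovering a missing sample halfway through the analysis.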



Now we will add the key data to the phyloseq object
```{r}
physeq_Key = merge_phyloseq(physeq, sampledata)
```


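Congrats, you are done (lie)!!! You now have a phyloseq object and can move on to transformations and data analysis.



Let's save it. Run the saveRDS line once (it's commented out so a re-run doesn't overwrite the file), then load the object back with readRDS whenever you need it.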

```{r}
#saveRDS(physeq_Key, file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\ULTRA\\physeq_Key_merged.rds")

physeq_Key <- readRDS(file = "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\ULTRA\\physeq_Key_merged.rds")
```
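
If you haven't met saveRDS/readRDS before, here's the round trip on a throwaway object. The object and the file name are made up, and a temp folder is used so nothing real gets overwritten.

```r
# A toy object standing in for the phyloseq object
obj <- list(counts = matrix(1:4, nrow = 2), note = "toy object")

# file.path() builds the path without hand-typed backslashes
rds_path <- file.path(tempdir(), "toy_object.rds")

# Save one R object to disk, then read it back under a new name
saveRDS(obj, file = rds_path)
restored <- readRDS(rds_path)

# The restored copy should be exactly what was saved
identical(obj, restored)
```

Unlike a csv, an .rds file keeps the object's class and structure, which is why it suits something as layered as a phyloseq object.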
