By Angus Ball

Hi, by now you’ve processed your data through DADA2, and now it’s time for some data analysis!!! Haha, sike: we have to format it first. Get rekt.

BiocManager is a name that’s going to come up a lot, but what is BiocManager? It’s the R package that installs software from Bioconductor (love the naming scheme), a repository that hosts all the big boys in bioinformatics software. So when we want to use specific packages, we’ll have to install them from Bioconductor through BiocManager.

This is how to do that:



# BiocManager and phyloseq
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install()
BiocManager::install("phyloseq")

# microbiomeMarker
BiocManager::install("microbiomeMarker")

# dada2 (version = "3.17" pins the Bioconductor release, not the dada2
# version; drop the argument to use the release that matches your R)
BiocManager::install("dada2", version = "3.17")
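Before moving on, it can help to confirm that everything actually installed. The following is a minimal sketch in base R; the helper name `missing_packages` is made up for this example, and the package list simply mirrors the packages used in this tutorial (including splitstackshape, which provides `cSplit`).

```r
# Hypothetical helper (not part of any package): returns the names of
# packages that cannot be found in the current library paths.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs,
                  function(p) nzchar(system.file(package = p)),
                  logical(1))
  pkgs[!found]
}

wanted <- c("phyloseq", "dada2", "ggplot2", "janitor", "dplyr",
            "tidyverse", "ggpubr", "ape", "splitstackshape")

# Anything printed here still needs to be installed
print(missing_packages(wanted))
```

An empty result (`character(0)`) means every package on the list was found.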

These are the packages you’ll need:

library(phyloseq)
library(ggplot2)
library(janitor)
library(dplyr)
library(tidyverse)
library(ggpubr)
library(ape)
library(dada2)
library(splitstackshape)  # provides cSplit(), used further down

Now it’s time to import our data from DADA2. Unfortunately, we need to work with a .csv file and our data come off in text files. Fun fact: .csv files are just plain text files with comma-separated values, which Excel can open and save.

To turn your text file into a .csv file, open Excel and then try to open your .txt file. Excel will go “hey…” and offer to convert it. This is perfect! Our file is delimited (the autopicked option) by tabs (the autopicked option after you click Next once), so all you have to do is meander through the pop-up and Excel will do all the heavy lifting for you! Simply save your file as a .csv file wherever you’d like!
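If you would rather skip Excel, R can read tab-delimited text directly. This is a minimal sketch with base R's `read.delim()`; the file is a tiny made-up stand-in written to a temporary location, not real DADA2 output, and its column names are assumptions based on the layout described in this tutorial.

```r
# Write a tiny, made-up tab-delimited file to a temporary path
# (a stand-in for your real ..._dada2_nochim_tax.txt file).
toy_txt <- tempfile(fileext = ".txt")
writeLines(c("SV\tSampleA\tSampleB\ttax.vector",
             "SV_0\t10\t0\tBacteria:Proteobacteria",
             "SV_1\t5\t7\tBacteria:Firmicutes"),
           toy_txt)

# read.delim() assumes tab-separated values with a header row
toy_table <- read.delim(toy_txt, stringsAsFactors = FALSE)

# Save it as a .csv, roughly what Excel's "Save as" would give you
toy_csv <- sub("\\.txt$", ".csv", toy_txt)
write.csv(toy_table, toy_csv, row.names = FALSE)

print(toy_table)
```

With a real file you would swap `toy_txt` for the path to your own .txt file.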

Now go off into the world and create a csv file of your 16S_R[name]_dada2_nochim_tax.txt file! Okay, sweet: on the left should be a column of SV_0..SV_10 etc., in the middle are your sample names with all the count data for each SV, and on the right is tax.vector.

Fun fact: backups! Hey, I’ve noticed that you didn’t immediately create a backup and a working csv file. Shame on you! Whenever you get a new file you plan on modifying, it’s important to create a backup copy that you NEVER modify in Excel. It’d be a shame if you were to delete something you shouldn’t have and had to restart from the beginning.
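The backup can also be made from R. This is a minimal sketch using base R's `file.copy()`; the file here is a made-up temporary stand-in, and the `_backup` suffix is just one possible naming choice.

```r
# Made-up stand-in for a file you are about to edit
working_file <- tempfile(fileext = ".csv")
writeLines("SV,SampleA,tax.vector", working_file)

# Copy it next to the original with a "_backup" suffix;
# overwrite = FALSE refuses to clobber an existing backup
backup_file <- sub("\\.csv$", "_backup.csv", working_file)
copied <- file.copy(working_file, backup_file, overwrite = FALSE)

# TRUE means the copy was made
print(copied)
```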

Your 16S_R[name]_dada2_nochim_tax.csv file is now ready to be imported into R.

Let’s import this table:

MetaG_1 <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\16S-dada2_nochim_tax.csv")
MetaG_2 <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\Redone mastermix stuff\\16S_RefInoculum_dada2_nochim_tax.csv")

Congrats! We now have the data in R. You may have noticed I’ve actually read in two separate sets of data, 1 and 2. You might not have two sets of data, but I do! We will create phyloseq objects from both of these datasets.

First things first: we have to format the tables the way phyloseq expects them and, unfortunately, this is a bit of a pain. Phyloseq expects a few tables: a count table (this is the MetaG table), a taxonomy table (this is tax.vector), and a sample data table (this one describes the groups your samples belong to).
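To make the three tables concrete, here is a tiny made-up version of each in base R. All names and values are invented for illustration; the point is only how the pieces line up with one another.

```r
# Count table: one row per sequence variant, one column per sample
toy_counts <- data.frame(SampleA = c(10, 5), SampleB = c(0, 7),
                         row.names = c("SV_0", "SV_1"))

# Taxonomy table: the same rows, one column per taxonomic rank
toy_tax <- data.frame(Kingdom = c("Bacteria", "Bacteria"),
                      Phylum  = c("Proteobacteria", "Firmicutes"),
                      row.names = c("SV_0", "SV_1"))

# Sample data table: one row per sample, one column per variable
toy_samples <- data.frame(Inoculum = c("yes", "no"),
                          row.names = c("SampleA", "SampleB"))

# Taxa line up by row name, samples by column name vs. row name
rows_match    <- identical(rownames(toy_counts), rownames(toy_tax))
samples_match <- identical(colnames(toy_counts), rownames(toy_samples))
print(c(rows_match, samples_match))
```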

Let’s start by creating the taxonomy table from the MetaG tables!

Tax_1 <- select(MetaG_1, "tax.vector")
Tax_2 <- select(MetaG_2, "tax.vector")

Oh bother! Phyloseq expects the taxonomy to be separated by rank, i.e. one column for kingdom, one for family, etc., but everything is in one column. This means we have to split on a delimiter again and, looking at how the column is created by DADA2, we need to split on the colon.

cSplit (from the splitstackshape package) splits the values within a column, and the rename calls then give the columns the names phyloseq expects!

Tax_1 <- cSplit(Tax_1, "tax.vector", ":")
Warning: 'as.is' should be specified by the caller; using TRUE (repeated 7 times)
Tax_1 <- rename(Tax_1,
  Kingdom = tax.vector_1,
  Phylum = tax.vector_2,
  Class = tax.vector_3,
  Order = tax.vector_4,
  Family = tax.vector_5,
  Genus = tax.vector_6,
  Species = tax.vector_7
)
Tax_2 <- cSplit(Tax_2, "tax.vector", ":")
Warning: 'as.is' should be specified by the caller; using TRUE (repeated 7 times)
Tax_2 <- rename(Tax_2,
  Kingdom = tax.vector_1,
  Phylum = tax.vector_2,
  Class = tax.vector_3,
  Order = tax.vector_4,
  Family = tax.vector_5,
  Genus = tax.vector_6,
  Species = tax.vector_7
)
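If the splitting step feels like magic, the same idea can be sketched in base R without cSplit. This is an illustration on made-up taxonomy strings, not a replacement for the code above; the three rank names are an assumption chosen to keep the example short.

```r
# Made-up taxonomy strings in the colon-separated style described above
tax_strings <- c("Bacteria:Proteobacteria:Gammaproteobacteria",
                 "Bacteria:Firmicutes")
ranks <- c("Kingdom", "Phylum", "Class")

# Split each string on ":" and pad short entries with NA so every
# row ends up with the same number of columns
pieces <- strsplit(tax_strings, ":", fixed = TRUE)
padded <- lapply(pieces, function(p) p[seq_along(ranks)])

# Stack the rows into a character matrix and name the columns
split_tax <- do.call(rbind, padded)
colnames(split_tax) <- ranks
print(split_tax)
```

The second string has no class, so its Class entry comes out as NA rather than shifting the other columns.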
# remove the SV ID column (column 1) and the tax.vector column
MetaG_1 <- select(MetaG_1, -(1))
MetaG_1 <- select(MetaG_1, -("tax.vector"))
MetaG_2 <- select(MetaG_2, -(1))
MetaG_2 <- select(MetaG_2, -("tax.vector"))

What we did was remove the first column (column 1, the SV IDs) from the dataset; I’ve also included the code to remove the tax.vector column here.

Now, this is real data, and for paper reasons I want to remove some samples from MetaG_1. We can use a similar command to do this:

#ULTRA
MetaG_1 <- select(MetaG_1, -("CHB1P1":"FNP9B3"))

We have to take our tables (which exist as data frames) and convert them into matrices. Why? Because the phyloseq class uses matrices and not data frames (as in the class example further down, where we can’t feed our class two int values: it wants an int and a char!).

#now real stuff
#convert these bad boys to matrices
Tax_1 <- as.matrix(Tax_1)
MetaG_1 <- as.matrix(MetaG_1)

Tax_2 <- as.matrix(Tax_2)
MetaG_2 <- as.matrix(MetaG_2)

You can check the class of the objects with a simple command (oh yes, the tables have been objects all along; we’re creating objects within classes within classes! Neat, isn’t it?):

#check
class(Tax_1)
[1] "matrix" "array" 
class(MetaG_1)
[1] "matrix" "array" 
class(Tax_2)
[1] "matrix" "array" 
class(MetaG_2)
[1] "matrix" "array" 
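One thing worth knowing about this conversion: a matrix can only hold a single data type. The sketch below uses made-up data to show what happens if a text column is still present when `as.matrix()` is called; it is an illustration, not output from the real tables.

```r
# A data frame can mix column types...
mixed <- data.frame(SV = c("SV_0", "SV_1"), SampleA = c(10, 5))

# ...but a matrix holds one type, so the counts are turned into text
as_text <- as.matrix(mixed)
print(class(as_text[, "SampleA"]))

# Dropping the text column first keeps the counts numeric
counts_only <- as.matrix(mixed[, "SampleA", drop = FALSE])
print(is.numeric(counts_only))
```

So if your count matrix unexpectedly contains text, a leftover ID or taxonomy column is a likely suspect.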

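Phyloseq matches the count table to the taxonomy table by row name, so both need the same row names, here OTU1, OTU2 and so on. This is how to do it. I’m giving the two datasets different prefixes (OTU and 2_OTU) because of how phyloseq merges row names later on.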

#call row values OTU
rownames(MetaG_1) <- paste0("OTU", 1:nrow(MetaG_1))
rownames(Tax_1) <- paste0("OTU", 1:nrow(Tax_1))

rownames(MetaG_2) <- paste0("2_OTU", 1:nrow(MetaG_2))
rownames(Tax_2) <- paste0("2_OTU", 1:nrow(Tax_2))
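A quick sanity check on the naming can be sketched as follows, using made-up stand-ins rather than the real matrices: the count and taxonomy tables of one dataset should share row names, and the two datasets should share none.

```r
# Made-up stand-ins for a count matrix and a taxonomy matrix
toy_counts <- matrix(c(10, 5, 0, 7), nrow = 2,
                     dimnames = list(NULL, c("SampleA", "SampleB")))
toy_tax <- matrix(c("Bacteria", "Bacteria", "Proteobacteria", "Firmicutes"),
                  nrow = 2, dimnames = list(NULL, c("Kingdom", "Phylum")))

# Same naming pattern as above, with a different prefix per dataset
rownames(toy_counts) <- paste0("OTU", seq_len(nrow(toy_counts)))
rownames(toy_tax)    <- paste0("OTU", seq_len(nrow(toy_tax)))
second_names <- paste0("2_OTU", seq_len(2))

# The two tables of one dataset line up...
tables_line_up <- identical(rownames(toy_counts), rownames(toy_tax))
# ...and the two datasets have no names in common
shared <- intersect(rownames(toy_counts), second_names)

print(tables_line_up)
print(shared)
```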

If a very scary-looking message appears here, don’t panic: it only says that we deleted and replaced the previous row names.

Okay, hopefully you know a bit about objects/classes. If not, funless fact: please go read the first couple of chapters of a coding-in-Java textbook. Funner fact: data structure is neat. Say you want to store a number, 1. You can do so in Java by giving it a datatype (int for integer) and then a value, for example int x = 1; this tells the computer that x is equal to one. Great, but int only works for whole numbers (and not even all of them ;), so what if you wanted to store a letter? Well, you use char (for character), so char y = 'b'; says that y is equal to the letter b. Obviously this quick example loses a lot of the subtlety and complexity of the system (re: go read the first couple of chapters of a Java textbook), but alas.

So now let’s say you wanted a thing, let’s call it a class, that contained a number and a character. You could create a class that uses x and y to create an object. This is a poor explanation that has likely made several computer scientists sad, but whatever: phyloseq is a class that creates phyloseq objects, each object having at a minimum an abundance table (MetaG_) and a taxonomy table (Tax_). Excellent, you are now just as confused as earlier, but I feel like I’ve done my due diligence. Onward.
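The same idea can be sketched in R itself, since phyloseq objects are S4 objects (which is why `@` is used to reach inside them). The class below is made up purely for illustration: it has one numeric slot and one character slot, mirroring the x and y from the Java example.

```r
library(methods)

# A made-up class with one numeric slot and one character slot
setClass("Thing", representation(x = "numeric", y = "character"))

# Creating an object works when each slot gets the type it expects
thing <- new("Thing", x = 1, y = "b")
print(thing@x)
print(thing@y)

# Handing a slot the wrong type is refused with an error,
# which we catch here so the script keeps running
bad <- tryCatch(new("Thing", x = 1, y = 2),
                error = function(e) "rejected")
print(bad)
```

This is the same kind of strictness phyloseq shows when it is handed a data frame where it wants a matrix.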

Now phyloseq wants us to turn these matrices into classes of its own (note: these aren’t phyloseq objects yet):


OTU_1 = otu_table(MetaG_1, taxa_are_rows = TRUE)
TAX_1 = tax_table(Tax_1)

OTU_2 = otu_table(MetaG_2, taxa_are_rows = TRUE)
TAX_2 = tax_table(Tax_2)

Now we’ll create a phyloseq object with each OTU table and TAXonomy table

#combine that data
physeq_1 = phyloseq(OTU_1, TAX_1)
physeq_2 = phyloseq(OTU_2, TAX_2)

Now we’ll merge the two phyloseq objects. Because we named the OTU rows differently, there will be no overlap between the two datasets.

physeq<-merge_phyloseq(physeq_1,physeq_2)

Now we have to conglomerate our data so that OTUs specifying the same thing are combined. This code collapses everything below the family level: if an OTU belongs to a family, its counts are added to that family’s total. If you are working on a species-level comparison, do not do this, as it throws away the species-level information; however, for this dataset we are looking at the order and family level, so it does not lose sensitivity. There are better ways to merge data, but those are done within the dada2 pipeline before taxonomy assignment. I did not do the dada2 analysis on these samples, and therefore I have to use this (poorer) way of merging. Maybe I’ll find a better way to do it later. If you do, tell me.

physeq <- tax_glom(physeq, taxrank = rank_names(physeq)[5], NArm = FALSE)

Fun fact: 5 corresponds to family, just like 6 is genus and 1 is kingdom. If you inspect the physeq object, you can see that the tax table now ends at family, but rows with NA values were not deleted (NArm = FALSE; read: NA remove = false).

physeq@tax_table
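To see what the agglomeration does on something small, here is a sketch with a made-up three-OTU dataset. It assumes phyloseq is installed; the taxa, counts, and sample names are invented. Two of the OTUs share a family, so they should be merged into one row.

```r
library(phyloseq)

# Made-up counts: 3 OTUs (rows) in 2 samples (columns)
counts <- matrix(c(10, 5, 1,
                   0, 7, 3),
                 nrow = 3,
                 dimnames = list(paste0("OTU", 1:3),
                                 c("SampleA", "SampleB")))

# Made-up taxonomy: OTU1 and OTU2 share a family, OTU3 does not
taxa <- rbind(
  OTU1 = c("Bacteria", "Proteobacteria", "Gammaproteobacteria",
           "Enterobacterales", "Enterobacteriaceae", "Escherichia", "coli"),
  OTU2 = c("Bacteria", "Proteobacteria", "Gammaproteobacteria",
           "Enterobacterales", "Enterobacteriaceae", "Salmonella", NA),
  OTU3 = c("Bacteria", "Firmicutes", "Bacilli",
           "Lactobacillales", "Lactobacillaceae", "Lactobacillus", NA)
)
colnames(taxa) <- c("Kingdom", "Phylum", "Class", "Order",
                    "Family", "Genus", "Species")

toy <- phyloseq(otu_table(counts, taxa_are_rows = TRUE), tax_table(taxa))

# Collapse to family level, keeping taxa with NA entries
toy_family <- tax_glom(toy, taxrank = "Family", NArm = FALSE)

print(ntaxa(toy))          # number of taxa before
print(ntaxa(toy_family))   # number of taxa after
print(sample_sums(toy_family))
```

The total count per sample is unchanged; only the number of rows shrinks.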

Excellent, almost done!

Now, if you inspect the sample names, they encode a variety of things:

physeq@otu_table

Things like “municipal compost”, “fertalized”, etc. This is metadata that lives in the sample names but that the phyloseq object doesn’t know about yet, so we must import it. Fun fact: if you want to know what this actually is or what the names mean, ask to read my manuscript.

Key <- read_csv("C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\ULTRA\\They call me... data\\Angus-16S\\ultra to categories key.csv")
Rows: 52 Columns: 5
── Column specification ───────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Name, Amd, Fertalized, Inoculum
dbl (1): Conc
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If you explore this document, you’ll see it lists all the samples in the order they appear as column names in the MetaG table, but as rows instead. The metadata are in the other columns.

For the same reason we renamed the rows to OTU earlier (phyloseq being fickle), we have to turn the sample names into row names:

Key <- data.frame(Key[,-1], row.names=Key$Name)
sampledata = sample_data(Key)
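Because the metadata is matched to the counts by sample name, it is worth checking that the names agree before merging. This is a sketch on made-up data: the key mimics the shape of the one read in above, and `count_samples` stands in for the column names of the count table.

```r
# Made-up key in the same shape as the one read in above
toy_key <- data.frame(Name = c("SampleA", "SampleB"),
                      Inoculum = c("yes", "no"),
                      Conc = c(1, 10))

# Move the Name column into the row names, as done for Key
toy_key <- data.frame(toy_key[, -1], row.names = toy_key$Name)

# Made-up sample names standing in for the count table's columns
count_samples <- c("SampleA", "SampleB", "SampleC")

# Samples that have counts but no metadata row
no_metadata <- setdiff(count_samples, rownames(toy_key))
print(no_metadata)
```

With the real objects, the equivalent check would compare `sample_names(physeq)` against `rownames(Key)`.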

Now we will add the key data

physeq_Key = merge_phyloseq(physeq, sampledata)

Congrats, you are done (lie)!!! You now have a phyloseq object and can move on to transformations and data analysis.

