By Angus Ball
Welcome to learn how to make relative abundance graphs! Some say my
codes actually pretty good so this should be a treat! This first graph
will be a generic bar graph with all samples.
I’m working with FUNGuild data in this example. (This explains why I
have more options to choose from in my taxVal)
First things first get some packages
library(phyloseq)
library(speedyseq)
library(ggplot2)
library(tidyverse)
Second choose what you want to plot!
taxVal <- "Phylum" #What taxa do you want
samdata1 <- "Location" #First sample metadata to facet_grid
samdata2 <- "Layer" #second sample metadata to facet_grid
#filterVal <- "p__Ascomycota"
#filterVal <- ""
filepathing <- "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\test.pdf" #this is where you want your plots to be stored on your computer, remember it has to change for each graph!, don't make fun of my folder structure...
This is where the glory of this code really starts, you NEVER need to
modify code below this and it should just work! (The obvious exception
being the ggplot function which you may want to modify for your own
pastel driven dreams!)
Eitherway, you will put whatever taxa or FUNGuild column name in
taxVal youd like What are your options? try:
physeq_Key %>% rank_names
[1] "Kingdom" "Phylum" "Class" "Order" "Family"
[6] "Genus" "Species" "taxon" "guid" "mbNumber"
[11] "taxonomicLevel" "trophicMode" "guild" "confidenceRanking" "growthForm"
[16] "trait" "notes" "citationSource"
Any of the above options are valid.
The same goes for sample data (samdata)
physeq_Key %>% sample_variables()
[1] "Location" "Layer" "Treatment" "Site" "FullName"
Any of these are valid choices!, They do matter which order you put
them in, ie samdata1 will facet_grid the data before samdata2 but just
run the script once or twice and it’ll make sense.
What about filterVal? Say you only wanted to plot the p__Ascomycota
species in your graph? to do a bulk comparison of just them. Well
uncomment filterVal and put p__Ascomycota in the quotes. This can be
done with any species taxonomy or any information/subsample from
funguild. PS. you’ll also have to uncomment a line of code before
plotting. I’ll include a graph of what this looks like at the very
end
Quick side point: wanna separate your data into separate graphs? Use
this command to choose what samples you want based on metadata for this
run and then change it later on
#physeq_Subset <- subset_samples(physeq_Key, Location != "Fallow")
This removes all the samples that are from Fallow land
What if you only wanted specific taxa associated with trophic
mode?
#physeq_Subset<-subset_taxa(physeq_Key, trophicMode == "Symbiotroph")
Note that these are sending to physeq_test and not _key or _Bar as
expected in a second, this is to make sure you don’t accidentally throw
these commands in your code and destroy something
And honestly, you can just run the code from here on out and it’ll
just work, but for the fool who wants to understand~…
This command physically changes the column names of your phyloseq
object, The names that are choosen are based off of what input you
provided above. Notice that the first command reads from physeq_Key, if
you wanted to subset any of your samples as above, replace this with
physeq_Subset
physeq_Bar<- rename_tax_table(physeq_Key, TaxChoice = taxVal)
physeq_Bar <- rename_sample_data(physeq_Bar, SamChoice1 = samdata1)
physeq_Bar <- rename_sample_data(physeq_Bar, SamChoice2 = samdata2)
This commander transforms our samples to relative abundance Note this
isn’t clr transformed, but I think thanks okay, we are accounting for
compositionality with relative abundance sorta, and we arent using this
data for statistical tests, just visualization.
physeq_rel = transform_sample_counts(physeq_Bar, function(x) x/sum(x)*100)
we first con(glom)erate the taxa to a specific taxrank dictated by
your choice above, NArm means NA remove, don’t remove your NAs!!!! this
looses data especially if you are using environmental data were not
every microbe in the world has been categorized. The other two commands
are modifying the tables so they can be read by ggplot easily
glom <- tax_glom(physeq_rel, taxrank = 'TaxChoice', NArm = FALSE)
physeq_melt <- psmelt(glom)
physeq_melt$TaxChoice <- as.character(physeq_melt$TaxChoice)
This code takes the groups we’ve chosen before and figures out the
median abundance of each OTU in each sample
physeq_melt <- physeq_melt %>%
group_by(SamChoice1, SamChoice2,TaxChoice) %>%
mutate(median=median(Abundance))
Which is immediately useful, because we remove OTUs that are less
than 1% in a sample. They are still represented in the graph, but they
don’t get their own category. You Don’t Have to do this, or you can make
the value higher, its just to make the graph a bit more readable
keep <- unique(physeq_melt$TaxChoice[physeq_melt$median > 1])
physeq_melt$TaxChoice[!(physeq_melt$TaxChoice %in% keep)] <- "< 1%"
Similar command as above, but summarizes all the information
physeq_melt_sum <- physeq_melt %>%
group_by(Sample,SamChoice1, SamChoice2,TaxChoice) %>%
summarise(Abundance=sum(Abundance))
`summarise()` has grouped output by 'Sample', 'SamChoice1', 'SamChoice2'. You can override using the `.groups` argument.
Hey I said filterVal would come up again! Uncomment this code if you
want to apply the filter!
#physeq_melt_sum <- filter(physeq_melt_sum, TaxChoice == filterVal)
Lets plot it!
ggplot(physeq_melt_sum, aes(x = Sample, y = Abundance, fill = TaxChoice)) +
geom_bar(stat = "identity", aes(fill=TaxChoice)) +
labs(x="", y="%") +
facet_wrap(SamChoice1~SamChoice2, scales= "free_x", nrow=1) +
scale_fill_discrete(name = taxVal)+
theme_classic() +
#scale_x_discrete(guide=guide_axis(n.dodge=3))+
theme(strip.background = element_blank(),
axis.text.x.bottom = element_text(angle = -90))
ggsave(filepathing)
Saving 7.29 x 4.5 in image

You’ll have to decide what your ggsave location is.
Looking at the graph, one of the important, non-normal things added
is scale_fill_discrete, we need this to rename the the legend because of
the object, taxVal, shenanigans earlier.
Looks great right? well, sorta I guess, see how the sample type text
is crumpled on the screen, well try expanding your plots view in R to
expand the spacing in the sample (you can also do this in ggsave by
setting sizes), or playing around with the n.dodge function I’ve
commented out in the gg plot formula (I dont like the look of it tho).
You can also plot some samples by themselves to manage images.
Hey hey HEY, whats NA, and why are there phylum’s under <1%? NA
stands for not available, and are reflective of sequences that exist but
we havent categorized them into species yet. but there being phylum’s
under <1%?, rerun the same code without the <1% modification.
ggplot(physeq_melt_sum, aes(x = Sample, y = Abundance, fill = TaxChoice)) +
geom_bar(stat = "identity", aes(fill=TaxChoice)) +
labs(x="", y="%") +
facet_wrap(SamChoice1~SamChoice2, scales= "free_x", nrow=1) +
scale_fill_discrete(name = taxVal)+
theme_classic() +
#scale_x_discrete(guide=guide_axis(n.dodge=3))+
theme(strip.background = element_blank(),
axis.text.x.bottom = element_text(angle = -90))

what the frick, why are there anthropod samples in my fungal
data!
This is a good example of why you must always review your data! and
relative abundance plots are a good way to do it! I’m going to go back
and remove these phylums so they dont affect downstream analysis
As promised heres a graph of only the Ascomycota samples
ggplot(physeq_melt_sum, aes(x = Sample, y = Abundance, fill = TaxChoice)) +
geom_bar(stat = "identity", aes(fill=TaxChoice)) +
labs(x="", y="%") +
facet_wrap(SamChoice1~SamChoice2, scales= "free_x", nrow=1) +
scale_fill_discrete(name = taxVal)+
theme_classic() +
#scale_x_discrete(guide=guide_axis(n.dodge=3))+
theme(strip.background = element_blank(),
axis.text.x.bottom = element_text(angle = -90))

---
title: "Relative Abundance Graphs"
output: html_notebook
---

By Angus Ball


Welcome to learn how to make relative abundance graphs!
Some say my codes actually pretty good so this should be a treat!
This first graph will be a generic bar graph with all samples.

I'm working with FUNGuild data in this example. (This explains why I have more options to choose from in my taxVal)


First things first get some packages
```{r, echo = T, message=FALSE}
library(phyloseq)
library(speedyseq)
library(ggplot2)
library(tidyverse)
```

Second choose what you want to plot!
```{r}
taxVal <- "Phylum" #What taxa level do you care about
samdata1 <- "Location" #First sample metadata to facet_grid
samdata2 <- "Layer" #second sample metadata to facet_grid
#filterVal <- "p__Ascomycota" #filter to only this group, if you want
filepathing <- "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\Bioinformatics\\Kenzies Data\\test.pdf" #this is where you want your plots to be stored on your computer, remember it has to change for each graph!, don't make fun of my folder structure...
```
This is where the glory of this code really starts, you NEVER need to modify code below this and it should just work! (The obvious exception being the ggplot function which you may want to modify for your own pastel driven dreams!)

Eitherway, you will put whatever taxa or FUNGuild column name in taxVal youd like
What are your options? try:
```{r}
physeq_Key %>% rank_names
```
Any of the above options are valid.

The same goes for sample data (samdata)
```{r}
physeq_Key %>% sample_variables()
```
Any of these are valid choices!, They do matter which order you put them in, ie samdata1 will facet_grid the data before samdata2 but just run the script once or twice and it'll make sense.

What about filterVal? Say you only wanted to plot the p__Ascomycota species in your graph? to do a bulk comparison of just them. Well uncomment filterVal and put p__Ascomycota in the quotes. This can be done with any species taxonomy or any information/subsample from funguild. PS. you'll also have to uncomment a line of code before plotting. I'll include a graph of what this looks like at the very end




Quick side point: wanna separate your data into separate graphs? Use this command to choose what samples you want based on metadata for this run and then change it later on

```{r}
#physeq_Subset <- subset_samples(physeq_Key, Location != "Fallow")
```
This removes all the samples that are from Fallow land

What if you only wanted specific taxa associated with trophic mode?

```{r}
#physeq_Subset<-subset_taxa(physeq_Key, trophicMode == "Symbiotroph")
```

Note that these are sending to physeq_test and not _key or _Bar as expected in a second, this is to make sure you don't accidentally throw these commands in your code and destroy something





And honestly, you can just run the code from here on out and it'll just work, but for the fool who wants to understand~...

This command physically changes the column names of your phyloseq object, The names that are choosen are based off of what input you provided above. Notice that the first command reads from physeq_Key, if you wanted to subset any of your samples as above, replace this with physeq_Subset
```{r}
physeq_Bar<- rename_tax_table(physeq_Key, TaxChoice = taxVal)
physeq_Bar <- rename_sample_data(physeq_Bar, SamChoice1 = samdata1)
physeq_Bar <- rename_sample_data(physeq_Bar, SamChoice2 = samdata2)
```




This commander transforms our samples to relative abundance 
Note this isn't clr transformed, but I think thanks okay, we are accounting for compositionality with relative abundance sorta, and we arent using this data for statistical tests, just visualization.
```{r}
physeq_rel = transform_sample_counts(physeq_Bar, function(x) x/sum(x)*100)
```

we first con(glom)erate the taxa to a specific taxrank dictated by your choice above, NArm means NA remove, don't remove your NAs!!!! this looses data especially if you are using environmental data were not every microbe in the world has been categorized.
The other two commands are modifying the tables so they can be read by ggplot easily
```{r}
glom <- tax_glom(physeq_rel, taxrank = 'TaxChoice', NArm = FALSE)
physeq_melt <- psmelt(glom)
physeq_melt$TaxChoice <- as.character(physeq_melt$TaxChoice)
```

This code takes the groups we've chosen before and figures out the median abundance of each OTU in each sample 
```{r}
physeq_melt <- physeq_melt %>%
  group_by(SamChoice1, SamChoice2,TaxChoice) %>%
  mutate(median=median(Abundance))
```

Which is immediately useful, because we remove OTUs that are less than 1% in a sample. They are still represented in the graph, but they don't get their own category. You Don't Have to do this, or you can make the value higher, its just to make the graph a bit more readable

```{r}
keep <- unique(physeq_melt$TaxChoice[physeq_melt$median > 1])
physeq_melt$TaxChoice[!(physeq_melt$TaxChoice %in% keep)] <- "< 1%"
```

Similar command as above, but summarizes all the information
```{r}
physeq_melt_sum <- physeq_melt %>%
  group_by(Sample,SamChoice1, SamChoice2,TaxChoice) %>%
  summarise(Abundance=sum(Abundance))
```

Hey I said filterVal would come up again!
Uncomment this code if you want to apply the filter!
```{r}
#physeq_melt_sum <- filter(physeq_melt_sum, TaxChoice == filterVal)
```



Lets plot it!

```{r}
ggplot(physeq_melt_sum, aes(x = Sample, y = Abundance, fill = TaxChoice)) + 
  geom_bar(stat = "identity", aes(fill=TaxChoice)) + 
  labs(x="", y="%") +
  facet_wrap(SamChoice1~SamChoice2, scales= "free_x", nrow=1) +
  scale_fill_discrete(name = taxVal)+
  theme_classic() + 
  #scale_x_discrete(guide=guide_axis(n.dodge=3))+
  theme(strip.background = element_blank(), 
        axis.text.x.bottom = element_text(angle = -90))        
ggsave(filepathing)

```


You'll have to decide what your ggsave location is.

Looking at the graph, one of the important, non-normal things added is scale_fill_discrete, we need this to rename the the legend because of the object, taxVal, shenanigans earlier.  



Looks great right? well, sorta I guess, see how the sample type text is crumpled on the screen, well try expanding your plots view in R to expand the spacing in the sample (you can also do this in ggsave by setting sizes), or playing around with the n.dodge function I've commented out in the gg plot formula (I dont like the look of it tho). You can also plot some samples by themselves to manage images. 


Hey hey HEY, whats NA, and why are there phylum's under <1%?
NA stands for not available, and are reflective of sequences that exist but we havent categorized them into species yet. 
but there being phylum's under <1%?, rerun the same code without the <1% modification. 


```{r}
ggplot(physeq_melt_sum, aes(x = Sample, y = Abundance, fill = TaxChoice)) + 
  geom_bar(stat = "identity", aes(fill=TaxChoice)) + 
  labs(x="", y="%") +
  facet_wrap(SamChoice1~SamChoice2, scales= "free_x", nrow=1) +
  scale_fill_discrete(name = taxVal)+
  theme_classic() + 
  #scale_x_discrete(guide=guide_axis(n.dodge=3))+
  theme(strip.background = element_blank(), 
        axis.text.x.bottom = element_text(angle = -90)) 
```


what the frick, why are there anthropod samples in my fungal data!

This is a good example of why you must always review your data! and relative abundance plots are a good way to do it! I'm going to go back and remove these phylums so they dont affect downstream analysis




As promised heres a graph of only the Ascomycota samples

```{r}
ggplot(physeq_melt_sum, aes(x = Sample, y = Abundance, fill = TaxChoice)) + 
  geom_bar(stat = "identity", aes(fill=TaxChoice)) + 
  labs(x="", y="%") +
  facet_wrap(SamChoice1~SamChoice2, scales= "free_x", nrow=1) +
  scale_fill_discrete(name = taxVal)+
  theme_classic() + 
  #scale_x_discrete(guide=guide_axis(n.dodge=3))+
  theme(strip.background = element_blank(), 
        axis.text.x.bottom = element_text(angle = -90)) 
```
