By Angus Ball
This tutorial is based primarily on: https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html
Note that this tutorial only works for 16S data (there are extra fun considerations for fungal data)
First load up the packages
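A minimal sketch, assuming dada2 and phyloseq are already installed (e.g. via BiocManager::install()):
library(dada2)    #primer removal, filtering, error learning, denoising, taxonomy
library(phyloseq) #community data objects for downstream analysis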
Then pull the location of your files. This lets us work with all of them without having to fully import them into R
path <- "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\The Nautilus\\Data\\On metabarcoding\\Zymo pretrials" #location of data
list.files(path)
## [1] "graphs"
## [2] "in00298.241004.zymo"
## [3] "in00298.241004.zymo.zip"
## [4] "key.csv"
## [5] "key.xlsx"
## [6] "noprimers"
## [7] "physeq_50.rds"
## [8] "physeq_80.rds"
## [9] "QLDc1TL4.fastq.gz"
## [10] "QLDc2NS1.fastq.gz"
## [11] "QLNc1LN4.fastq.gz"
## [12] "QLNc1SH2.fastq.gz"
## [13] "silva_nr99_v138.1_wSpecies_train_set.fa.gz"
## [14] "st.rds"
## [15] "tax50.rds"
## [16] "tax80.rds"
## [17] "zymo_err.rds"
fnraw <- sort(list.files(path, pattern = ".fastq", full.names = TRUE))
My samples are from Quesnel Lake (QL), in either the disaster (D) or natural (N) area, from cores (c) 1 or 2. They are also from various depths down the column (TL4, NS1, LN4, SH2)
Just to start, let's look at what the reads look like!
plotQualityProfile(fnraw)
So how do you read these? The green line is the median quality score, the
greyscale is a heatmap of the frequency of each quality score, the
quartiles of the quality distribution are shown with the orange lines, and
the red line at the bottom (with a separate scale) shows the proportion
of reads that extend to at least that length. What is a quality score?
Check out https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html
and https://drive5.com/usearch/manual/exp_errs.html but
briefly: a quality score of Q10 means the reported base is wrong 10% of the
time, Q20 means 1%, and Q30 means 0.1%. So in theory anything under Q20
should be removed; in practice, though, it is better to
filter reads by their number of expected errors rather than by a per-base
quality cutoff (this still uses the quality score information): https://academic.oup.com/bioinformatics/article/31/21/3476/194979?login=false
PS. I don't know why the x-axis says "cycle"; it just means position along the read in bp (sequence length)
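To make the expected-errors idea concrete, here's a toy calculation (not part of the pipeline): a read's expected errors are just the sum of its per-base error probabilities, p = 10^(-Q/10).
q <- c(40, 30, 20, 10) #quality scores of a hypothetical 4 bp read
10^(-q / 10)           #per-base error probabilities: 0.0001 0.001 0.01 0.1
sum(10^(-q / 10))      #expected errors = 0.1111; maxEE filters on this sum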
Either way, that data is RAW: reads go all the way up to 3000 bp, but only ~1500 bp of the 16S region is actually valuable, so we'll be trimming to that
But first we should remove the primers
nopsraw <- file.path(path, "noprimers", basename(fnraw)) #build the output file paths; trimmed reads will land in a noprimers/ subfolder
primraw <- removePrimers(fnraw, #input data
nopsraw, #output location
primer.fwd="AGRGTTYGATYMTGGCTCAG", #the forward primer used was 27F
primer.rev=dada2:::rc("RGYTACCTTGTTACGACTT"), #the reverse primer (1492R), reverse-complemented since all reads should end up in the forward orientation
orient=TRUE) #re-orient any reads that are in the reverse direction
Just for fun we're also gonna look at a histogram of the read lengths; most should be around 1500 bp
lens.fn <- lapply(nopsraw, function(fn) nchar(getSequences(fn)))
lens <- do.call(c, lens.fn)
hist(lens, 100)
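If you want numbers rather than a picture, the same vector works with summary():
summary(lens) #min / quartiles / median / max read length after primer removal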
Excellent!
Now we can filter our data!
filts <- file.path(path, "noprimers", "filtered", basename(fnraw)) #creating our output folder and file locations
track <- filterAndTrim(nopsraw, #input data
filts, #output data
minQ=3, #minimum quality score
minLen=1000, #minimum length (don't do this with fungal data: ITS length is highly variable)
maxLen=1600, #maximum length (don't do this with fungal data either, for the same reason)
maxN=0, #no ambiguous bases here please!
rm.phix=FALSE, #no Illumina PhiX spike-in to worry about in PacBio data
maxEE=2) #each read can have 2 expected errors as a treat, but that's IT
track
## reads.in reads.out
## QLDc1TL4.fastq.gz 9236 7666
## QLDc2NS1.fastq.gz 5436 4461
## QLNc1LN4.fastq.gz 3527 2913
## QLNc1SH2.fastq.gz 7868 6469
I'm losing reads, but not massive amounts, so it's okayyy
Lets take another look at the qualities of those sequences!
plotQualityProfile(filts)
drp <- derepFastq(filts, verbose=TRUE) #collapse identical reads into unique sequences with counts
## Dereplicating sequence entries in Fastq file: C:\Users\angus\OneDrive - UNBC\Angus Ball\Lab work\The Nautilus\Data\On metabarcoding\Zymo pretrials/noprimers/filtered/QLDc1TL4.fastq.gz
## Encountered 5703 unique sequences from 7666 total sequences read.
## Dereplicating sequence entries in Fastq file: C:\Users\angus\OneDrive - UNBC\Angus Ball\Lab work\The Nautilus\Data\On metabarcoding\Zymo pretrials/noprimers/filtered/QLDc2NS1.fastq.gz
## Encountered 3536 unique sequences from 4461 total sequences read.
## Dereplicating sequence entries in Fastq file: C:\Users\angus\OneDrive - UNBC\Angus Ball\Lab work\The Nautilus\Data\On metabarcoding\Zymo pretrials/noprimers/filtered/QLNc1LN4.fastq.gz
## Encountered 2253 unique sequences from 2913 total sequences read.
## Dereplicating sequence entries in Fastq file: C:\Users\angus\OneDrive - UNBC\Angus Ball\Lab work\The Nautilus\Data\On metabarcoding\Zymo pretrials/noprimers/filtered/QLNc1SH2.fastq.gz
## Encountered 5231 unique sequences from 6469 total sequences read.
dada2 needs replication in sequences to function properly. The rough rubric is that if the number of unique sequences exceeds 90% of the total reads (i.e. #(unique sequences) > 0.9 * #(reads), meaning less than ~10% of reads are duplicates), dada2 isn't appropriate: https://github.com/benjjneb/dada2/issues/1877
My samples are getting close, but I'm under that level, thank goodness!
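You can check that rubric directly from the derep objects rather than squinting at the log; each derep object carries a named count vector in $uniques, so this sketch computes the unique fraction per sample (values creeping above ~0.9 are the red flag):
sapply(drp, function(x) length(x$uniques) / sum(x$uniques)) #fraction of reads that are unique in each sample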
Then we will learn the error rates from these samples
err <- learnErrors(drp, #input files
errorEstimationFunction=PacBioErrfun, #the PacBio-specific error function, for PacBio CCS data like mine
BAND_SIZE=32, #limits how many insertions/deletions the banded alignment allows; 16 is the default, 32 is recommended for PacBio
multithread=FALSE, verbose = TRUE) #I'm on Windows...
## 31398313 total bases in 21509 reads from 4 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 ....
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## selfConsist step 8
## Convergence after 8 rounds.
That took a hot second, so let's save it
saveRDS(err, file.path(path, "zymo_err.rds"))
and inspect it!
plotErrors(err)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
This is mostly okay. Things get a bit weird on the far right of panels
like G2T etc. This is likely just because there aren't very many
nucleotides at that high a quality score to base the estimate on.
To fix this you can enforce monotonicity (the fitted error rate can only
decrease as quality increases, not bounce up and down). See the dada2 NextSeq
tutorial if you want a worked example; I'm not going to
bother on this dataset, but a rough sketch of the idea follows.
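A minimal sketch, my own and untested on this dataset: it assumes getErrors() hands back the 16-row transition matrix (rownames like "A2A", "A2C", ...) and that dada() accepts a plain error-rate matrix for err=. Mismatch rows are clamped to be non-increasing with quality, and the self-transition rows are rebuilt so each base's four rates still sum to 1.
err_mat <- getErrors(err) #16 x (maxQ + 1) matrix of transition probabilities
from <- substr(rownames(err_mat), 1, 1)
to <- substr(rownames(err_mat), 3, 3)
mism <- from != to #the 12 mismatch transitions, e.g. "A2G"
err_mat[mism, ] <- t(apply(err_mat[mism, , drop=FALSE], 1, cummin)) #clamp: rates may never increase with quality
for (b in c("A","C","G","T")) { #self-transition = 1 - sum of that base's mismatch rates
  err_mat[paste0(b, "2", b), ] <- 1 - colSums(err_mat[from == b & mism, , drop=FALSE])
}
# dd <- dada(drp, err=err_mat, BAND_SIZE=32, multithread=FALSE) #then denoise with the fixed matrix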
We can then denoise the samples, separating the reads into the true underlying sequence variants
dd <- dada(drp, err=err, BAND_SIZE=32, multithread=FALSE)
## Sample 1 - 7666 reads in 5703 unique sequences.
## Sample 2 - 4461 reads in 3536 unique sequences.
## Sample 3 - 2913 reads in 2253 unique sequences.
## Sample 4 - 6469 reads in 5231 unique sequences.
For fun, we can track where reads are lost throughout the pipeline
cbind(ccs=primraw[,1], primers=primraw[,2], filtered=track[,2], denoised=sapply(dd, function(x) sum(x$denoised)))
## ccs primers filtered denoised
## QLDc1TL4.fastq.gz 11435 9236 7666 4610
## QLDc2NS1.fastq.gz 6668 5436 4461 2365
## QLNc1LN4.fastq.gz 4315 3527 2913 1689
## QLNc1SH2.fastq.gz 9774 7868 6469 3406
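Raw counts are hard to eyeball, so here's the same table as percentages of the starting CCS reads (just reusing the objects above):
trk <- cbind(ccs=primraw[,1], primers=primraw[,2], filtered=track[,2], denoised=sapply(dd, function(x) sum(x$denoised)))
round(100 * trk / trk[,1], 1) #each stage as a % of the raw CCS reads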
Then make a sequence table, essentially an OTU table (rows are samples, columns are unique sequence variants)
st <- makeSequenceTable(dd); dim(st)
## [1] 4 1212
Which we'll use to assign taxonomy. I'm going to assign it at both an 80% and a 50% minimum bootstrap value and see what the difference is. 50% is the norm for short reads, but it's been suggested (see below) that 80% should be the minimum for long-read sequencing, since the reads are, uhhh, long. Understanding k-mers, bootstrapping, and taxonomy assignment is actually really important; maybe go read the paper?
tax80 <- assignTaxonomy(st,
"C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\The Nautilus\\Data\\On metabarcoding\\Zymo pretrials\\silva_nr99_v138.1_wSpecies_train_set.fa.gz",
multithread=FALSE,
minBoot = 80, #how much confidence is required for dada2 to assign a taxon at each rank; 50 is the baseline, but it's been suggested to go up to 80 for PacBio data
verbose = TRUE) # talkie
## Finished processing reference fasta.
head(unname(tax80))
## [,1] [,2] [,3] [,4]
## [1,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"
## [2,] "Bacteria" "Actinobacteriota" "Thermoleophilia" "Gaiellales"
## [3,] "Bacteria" "Actinobacteriota" "Thermoleophilia" "Gaiellales"
## [4,] "Bacteria" "Nitrospirota" "Nitrospiria" "Nitrospirales"
## [5,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"
## [6,] "Bacteria" "Methylomirabilota" "Methylomirabilia" "Methylomirabilales"
## [,5] [,6] [,7]
## [1,] "Xanthobacteraceae" "Bradyrhizobium" NA
## [2,] "Gaiellaceae" "Gaiella" NA
## [3,] "Gaiellaceae" "Gaiella" NA
## [4,] "Nitrospiraceae" "Nitrospira" NA
## [5,] "Xanthobacteraceae" NA NA
## [6,] "Methylomirabilaceae" "Sh765B-TzT-35" NA
Here's the reference for minBoot = 80: https://github.com/benjjneb/dada2/issues/990
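By the way, if you'd rather see the bootstrap values themselves instead of just picking a cutoff, assignTaxonomy() can return them with outputBootstraps=TRUE; you could then apply any minBoot threshold after the fact (by NA-ing out assignments below it) rather than running the classifier twice like I'm about to. A sketch (one classifier run, same cost as each call here):
taxb <- assignTaxonomy(st,
file.path(path, "silva_nr99_v138.1_wSpecies_train_set.fa.gz"),
minBoot = 50, outputBootstraps = TRUE, verbose = TRUE)
head(taxb$boot) #bootstrap percentages, one column per taxonomic rank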
tax50 <- assignTaxonomy(st,
"C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\The Nautilus\\Data\\On metabarcoding\\Zymo pretrials\\silva_nr99_v138.1_wSpecies_train_set.fa.gz",
multithread=FALSE,
minBoot = 50, #how much confidence is required for dada2 to assign a taxon at each rank; 50 is the baseline, but it's been suggested to go up to 80 for PacBio data
verbose = TRUE) # talkie
## Finished processing reference fasta.
head(unname(tax50))
## [,1] [,2] [,3] [,4]
## [1,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"
## [2,] "Bacteria" "Actinobacteriota" "Thermoleophilia" "Gaiellales"
## [3,] "Bacteria" "Actinobacteriota" "Thermoleophilia" "Gaiellales"
## [4,] "Bacteria" "Nitrospirota" "Nitrospiria" "Nitrospirales"
## [5,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"
## [6,] "Bacteria" "Methylomirabilota" "Methylomirabilia" "Methylomirabilales"
## [,5] [,6] [,7]
## [1,] "Xanthobacteraceae" "Bradyrhizobium" NA
## [2,] "Gaiellaceae" "Gaiella" NA
## [3,] "Gaiellaceae" "Gaiella" NA
## [4,] "Nitrospiraceae" "Nitrospira" NA
## [5,] "Xanthobacteraceae" NA NA
## [6,] "Methylomirabilaceae" "Sh765B-TzT-35" NA
Let's flag the bimeras/chimeras. In long reads these can be genuine PCR artifacts, not just merging artifacts as they typically are in short-read sequencing. Only one gets flagged here
bim <- isBimeraDenovo(st, minFoldParentOverAbundance=3.5, multithread=FALSE)
table(bim)
## bim
## FALSE TRUE
## 1211 1
And when we look at its share of the total reads, it's very small, so we can feel confident it's okay to remove this sequence
sum(st[,bim])/sum(st)
## [1] 0.001325601
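One gotcha: isBimeraDenovo() only flags the chimera, it doesn't drop it. A small sketch to actually remove it before the phyloseq handoff (the taxonomy tables are rownamed by sequence, so they can be kept in sync by name); if you do this, pass the .nochim objects to phyloseq() below instead:
st.nochim <- st[, !bim, drop = FALSE] #drop the flagged sequence(s)
tax50.nochim <- tax50[colnames(st.nochim), ] #subset taxonomy to match
tax80.nochim <- tax80[colnames(st.nochim), ]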
Save objects
saveRDS(st, file.path(path, "st.rds"))
saveRDS(tax80, file.path(path, "tax80.rds"))
saveRDS(tax50, file.path(path, "tax50.rds"))
and then hand off the objects to phyloseq!
physeq_50 <- phyloseq(otu_table(st, taxa_are_rows = FALSE), tax_table(tax50))
physeq_80 <- phyloseq(otu_table(st, taxa_are_rows = FALSE), tax_table(tax80))
Save objects
saveRDS(physeq_50, file.path(path, "physeq_50.rds"))
saveRDS(physeq_80, file.path(path, "physeq_80.rds"))
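And a quick sanity check that the handoff worked:
physeq_80 #prints the object summary
ntaxa(physeq_80) #should be 1212 (or 1211 if you dropped the bimera)
nsamples(physeq_80) #should be 4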