By Angus Ball
This tutorial is based primarily on: https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html
Note that this tutorial only works for 16S data (there are extra fun considerations for fungal data)
First load up the packages
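A minimal sketch, assuming dada2 and phyloseq are already installed (e.g. via BiocManager::install()):
library(dada2)    #primer removal, filtering, error learning, denoising, taxonomy
library(phyloseq) #community data objects for downstream analysis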
Then pull the location of your files. This lets us work with all of them without having to fully import them into R
path <- "C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\The Nautilus\\Data\\On metabarcoding\\Zymo pretrials" #location of data
list.files(path)
## [1] "graphs"
## [2] "in00298.241004.zymo"
## [3] "in00298.241004.zymo.zip"
## [4] "key.csv"
## [5] "key.xlsx"
## [6] "noprimers"
## [7] "physeq_50.rds"
## [8] "physeq_80.rds"
## [9] "QLDc1TL4.fastq.gz"
## [10] "QLDc2NS1.fastq.gz"
## [11] "QLNc1LN4.fastq.gz"
## [12] "QLNc1SH2.fastq.gz"
## [13] "silva_nr99_v138.1_wSpecies_train_set.fa.gz"
## [14] "st.rds"
## [15] "tax50.rds"
## [16] "tax80.rds"
## [17] "zymo_err.rds"
fnraw <- sort(list.files(path, pattern = ".fastq", full.names = TRUE))
My samples are from Quesnel Lake (QL), in either the disaster (D) or natural (N) area, from cores (c) 1 or 2. They are also from various depths down the column (TL4, NS1, LN4, SH2)
Just to start, let's look at what the reads look like!
plotQualityProfile(fnraw)
So how do you read these? The green line is the median quality score, the
greyscale is a heatmap of the frequency of each quality score, the
quartiles of the quality distribution are shown with the orange lines, and
the red line at the bottom (with a separate scale) shows the proportion
of reads that extend to at least that length. What is a quality score?
Check out https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html
and https://drive5.com/usearch/manual/exp_errs.html but
briefly: a quality score of Q10 means the reported base is wrong 10% of the
time, Q20 means 1%, and Q30 means 0.1%. So in theory anything under Q20
should be removed; in practice, though, it is better to
filter reads by their number of expected errors rather than by a per-base
quality cutoff (this still uses the quality score information): https://academic.oup.com/bioinformatics/article/31/21/3476/194979?login=false
PS. I don't know why the x-axis says "cycle"; it just means position along the read in bp (sequence length)
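To make the expected-errors idea concrete, here's a toy calculation (not part of the pipeline): a read's expected errors are just the sum of its per-base error probabilities, p = 10^(-Q/10).
q <- c(40, 30, 20, 10) #quality scores of a hypothetical 4 bp read
10^(-q / 10)           #per-base error probabilities: 0.0001 0.001 0.01 0.1
sum(10^(-q / 10))      #expected errors = 0.1111; maxEE filters on this sum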
Either way, that data is RAW: reads go all the way up to 3000 bp, but only ~1500 bp of the 16S region is actually valuable, so we'll be trimming to that
But first we should remove the primers
nopsraw <- file.path(path, "noprimers", basename(fnraw)) #build the output file paths; trimmed reads will land in a noprimers/ subfolder
primraw <- removePrimers(fnraw, #input data
nopsraw, #output location
primer.fwd="AGRGTTYGATYMTGGCTCAG", #the forward primer used was 27F
primer.rev=dada2:::rc("RGYTACCTTGTTACGACTT"), #the reverse primer (1492R), reverse-complemented since all reads should end up in the forward orientation
orient=TRUE) #re-orient any reads that are in the reverse direction
Just for fun we're also gonna look at a histogram of the read lengths; most should be around 1500 bp
lens.fn <- lapply(nopsraw, function(fn) nchar(getSequences(fn)))
lens <- do.call(c, lens.fn)
hist(lens, 100)
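If you want numbers rather than a picture, the same vector works with summary():
summary(lens) #min / quartiles / median / max read length after primer removal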
Excellent!
Now we can filter our data!
filts <- file.path(path, "noprimers", "filtered", basename(fnraw)) #creating our output folder and file locations
track <- filterAndTrim(nopsraw, #input data
filts, #output data
minQ=3, #minimum quality score
minLen=1000, #minimum length (don't do this with fungal data: ITS length is highly variable)
maxLen=1600, #maximum length (don't do this with fungal data either, for the same reason)
maxN=0, #no ambiguous bases here please!
rm.phix=FALSE, #no Illumina PhiX spike-in to worry about in PacBio data
maxEE=2) #each read can have 2 expected errors as a treat, but that's IT
track
## reads.in reads.out
## QLDc1TL4.fastq.gz 9236 7666
## QLDc2NS1.fastq.gz 5436 4461
## QLNc1LN4.fastq.gz 3527 2913
## QLNc1SH2.fastq.gz 7868 6469
I'm losing reads, but not massive amounts, so it's okayyy
Lets take another look at the qualities of those sequences!
plotQualityProfile(filts)
drp <- derepFastq(filts, verbose=TRUE) #collapse identical reads into unique sequences with counts
## Dereplicating sequence entries in Fastq file: C:\Users\angus\OneDrive - UNBC\Angus Ball\Lab work\The Nautilus\Data\On metabarcoding\Zymo pretrials/noprimers/filtered/QLDc1TL4.fastq.gz
## Encountered 5703 unique sequences from 7666 total sequences read.
## Dereplicating sequence entries in Fastq file: C:\Users\angus\OneDrive - UNBC\Angus Ball\Lab work\The Nautilus\Data\On metabarcoding\Zymo pretrials/noprimers/filtered/QLDc2NS1.fastq.gz
## Encountered 3536 unique sequences from 4461 total sequences read.
## Dereplicating sequence entries in Fastq file: C:\Users\angus\OneDrive - UNBC\Angus Ball\Lab work\The Nautilus\Data\On metabarcoding\Zymo pretrials/noprimers/filtered/QLNc1LN4.fastq.gz
## Encountered 2253 unique sequences from 2913 total sequences read.
## Dereplicating sequence entries in Fastq file: C:\Users\angus\OneDrive - UNBC\Angus Ball\Lab work\The Nautilus\Data\On metabarcoding\Zymo pretrials/noprimers/filtered/QLNc1SH2.fastq.gz
## Encountered 5231 unique sequences from 6469 total sequences read.
dada2 needs replication in sequences to function properly. The rough rubric is that if the number of unique sequences exceeds 90% of the total reads (i.e. #(unique sequences) > 0.9 * #(reads), meaning less than ~10% of reads are duplicates), dada2 isn't appropriate: https://github.com/benjjneb/dada2/issues/1877
My samples are getting close, but I'm under that level, thank goodness!
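You can check that rubric directly from the derep objects rather than squinting at the log; each derep object carries a named count vector in $uniques, so this sketch computes the unique fraction per sample (values creeping above ~0.9 are the red flag):
sapply(drp, function(x) length(x$uniques) / sum(x$uniques)) #fraction of reads that are unique in each sample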
Then we will learn the error rates from these samples
err <- learnErrors(drp, #input files
errorEstimationFunction=PacBioErrfun, #the PacBio-specific error function, for PacBio CCS data like mine
BAND_SIZE=32, #limits how many insertions/deletions the banded alignment allows; 16 is the default, 32 is recommended for PacBio
multithread=FALSE, verbose = TRUE) #I'm on Windows...
## 31398313 total bases in 21509 reads from 4 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 ....
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## selfConsist step 8
## Convergence after 8 rounds.
That took a hot second, so let's save it
saveRDS(err, file.path(path, "zymo_err.rds"))
and inspect it!
plotErrors(err)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
This is mostly okay. Things get a bit weird on the far right of panels
like G2T etc. This is likely just because there aren't very many
nucleotides at that high a quality score to base the estimate on.
To fix this you can enforce monotonicity (the fitted error rate can only
decrease as quality increases, not bounce up and down). See the dada2 NextSeq
tutorial if you want a worked example; I'm not going to
bother on this dataset, but a rough sketch of the idea follows.
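A minimal sketch, my own and untested on this dataset: it assumes getErrors() hands back the 16-row transition matrix (rownames like "A2A", "A2C", ...) and that dada() accepts a plain error-rate matrix for err=. Mismatch rows are clamped to be non-increasing with quality, and the self-transition rows are rebuilt so each base's four rates still sum to 1.
err_mat <- getErrors(err) #16 x (maxQ + 1) matrix of transition probabilities
from <- substr(rownames(err_mat), 1, 1)
to <- substr(rownames(err_mat), 3, 3)
mism <- from != to #the 12 mismatch transitions, e.g. "A2G"
err_mat[mism, ] <- t(apply(err_mat[mism, , drop=FALSE], 1, cummin)) #clamp: rates may never increase with quality
for (b in c("A","C","G","T")) { #self-transition = 1 - sum of that base's mismatch rates
  err_mat[paste0(b, "2", b), ] <- 1 - colSums(err_mat[from == b & mism, , drop=FALSE])
}
# dd <- dada(drp, err=err_mat, BAND_SIZE=32, multithread=FALSE) #then denoise with the fixed matrix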
We can then denoise the samples, separating the reads into the true underlying sequence variants
dd <- dada(drp, err=err, BAND_SIZE=32, multithread=FALSE)
## Sample 1 - 7666 reads in 5703 unique sequences.
## Sample 2 - 4461 reads in 3536 unique sequences.
## Sample 3 - 2913 reads in 2253 unique sequences.
## Sample 4 - 6469 reads in 5231 unique sequences.
For fun, we can track where reads are lost throughout the pipeline
cbind(ccs=primraw[,1], primers=primraw[,2], filtered=track[,2], denoised=sapply(dd, function(x) sum(x$denoised)))
## ccs primers filtered denoised
## QLDc1TL4.fastq.gz 11435 9236 7666 4610
## QLDc2NS1.fastq.gz 6668 5436 4461 2365
## QLNc1LN4.fastq.gz 4315 3527 2913 1689
## QLNc1SH2.fastq.gz 9774 7868 6469 3406
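Raw counts are hard to eyeball, so here's the same table as percentages of the starting CCS reads (just reusing the objects above):
trk <- cbind(ccs=primraw[,1], primers=primraw[,2], filtered=track[,2], denoised=sapply(dd, function(x) sum(x$denoised)))
round(100 * trk / trk[,1], 1) #each stage as a % of the raw CCS reads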
Then make a sequence table, essentially an OTU table (rows are samples, columns are unique sequence variants)
st <- makeSequenceTable(dd); dim(st)
## [1] 4 1212
Which we'll use to assign taxonomy. I'm going to assign it at both an 80% and a 50% minimum bootstrap value and see what the difference is. 50% is the norm for short reads, but it's been suggested (see below) that 80% should be the minimum for long-read sequencing, since the reads are, uhhh, long. Understanding k-mers, bootstrapping, and taxonomy assignment is actually really important; maybe go read the paper?
tax80 <- assignTaxonomy(st,
"C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\The Nautilus\\Data\\On metabarcoding\\Zymo pretrials\\silva_nr99_v138.1_wSpecies_train_set.fa.gz",
multithread=FALSE,
minBoot = 80, #how much confidence is required for dada2 to assign a taxon at each rank; 50 is the baseline, but it's been suggested to go up to 80 for PacBio data
verbose = TRUE) # talkie
## Finished processing reference fasta.
head(unname(tax80))
## [,1] [,2] [,3] [,4]
## [1,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"
## [2,] "Bacteria" "Actinobacteriota" "Thermoleophilia" "Gaiellales"
## [3,] "Bacteria" "Actinobacteriota" "Thermoleophilia" "Gaiellales"
## [4,] "Bacteria" "Nitrospirota" "Nitrospiria" "Nitrospirales"
## [5,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"
## [6,] "Bacteria" "Methylomirabilota" "Methylomirabilia" "Methylomirabilales"
## [,5] [,6] [,7]
## [1,] "Xanthobacteraceae" "Bradyrhizobium" NA
## [2,] "Gaiellaceae" "Gaiella" NA
## [3,] "Gaiellaceae" "Gaiella" NA
## [4,] "Nitrospiraceae" "Nitrospira" NA
## [5,] "Xanthobacteraceae" NA NA
## [6,] "Methylomirabilaceae" "Sh765B-TzT-35" NA
Here's the reference for minBoot = 80: https://github.com/benjjneb/dada2/issues/990
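By the way, if you'd rather see the bootstrap values themselves instead of just picking a cutoff, assignTaxonomy() can return them with outputBootstraps=TRUE; you could then apply any minBoot threshold after the fact (by NA-ing out assignments below it) rather than running the classifier twice like I'm about to. A sketch (one classifier run, same cost as each call here):
taxb <- assignTaxonomy(st,
file.path(path, "silva_nr99_v138.1_wSpecies_train_set.fa.gz"),
minBoot = 50, outputBootstraps = TRUE, verbose = TRUE)
head(taxb$boot) #bootstrap percentages, one column per taxonomic rank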
tax50 <- assignTaxonomy(st,
"C:\\Users\\angus\\OneDrive - UNBC\\Angus Ball\\Lab work\\The Nautilus\\Data\\On metabarcoding\\Zymo pretrials\\silva_nr99_v138.1_wSpecies_train_set.fa.gz",
multithread=FALSE,
minBoot = 50, #how much confidence is required for dada2 to assign a taxon at each rank; 50 is the baseline, but it's been suggested to go up to 80 for PacBio data
verbose = TRUE) # talkie
## Finished processing reference fasta.
head(unname(tax50))
## [,1] [,2] [,3] [,4]
## [1,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"
## [2,] "Bacteria" "Actinobacteriota" "Thermoleophilia" "Gaiellales"
## [3,] "Bacteria" "Actinobacteriota" "Thermoleophilia" "Gaiellales"
## [4,] "Bacteria" "Nitrospirota" "Nitrospiria" "Nitrospirales"
## [5,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales"
## [6,] "Bacteria" "Methylomirabilota" "Methylomirabilia" "Methylomirabilales"
## [,5] [,6] [,7]
## [1,] "Xanthobacteraceae" "Bradyrhizobium" NA
## [2,] "Gaiellaceae" "Gaiella" NA
## [3,] "Gaiellaceae" "Gaiella" NA
## [4,] "Nitrospiraceae" "Nitrospira" NA
## [5,] "Xanthobacteraceae" NA NA
## [6,] "Methylomirabilaceae" "Sh765B-TzT-35" NA
Let's flag the bimeras/chimeras. In long reads these can be genuine PCR artifacts, not just merging artifacts as they typically are in short-read sequencing. Only one gets flagged here
bim <- isBimeraDenovo(st, minFoldParentOverAbundance=3.5, multithread=FALSE)
table(bim)
## bim
## FALSE TRUE
## 1211 1
And when we look at its share of the total reads, it's very small, so we can feel confident it's okay to remove this sequence
sum(st[,bim])/sum(st)
## [1] 0.001325601
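One gotcha: isBimeraDenovo() only flags the chimera, it doesn't drop it. A small sketch to actually remove it before the phyloseq handoff (the taxonomy tables are rownamed by sequence, so they can be kept in sync by name); if you do this, pass the .nochim objects to phyloseq() below instead:
st.nochim <- st[, !bim, drop = FALSE] #drop the flagged sequence(s)
tax50.nochim <- tax50[colnames(st.nochim), ] #subset taxonomy to match
tax80.nochim <- tax80[colnames(st.nochim), ]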
Save objects
saveRDS(st, file.path(path, "st.rds"))
saveRDS(tax80, file.path(path, "tax80.rds"))
saveRDS(tax50, file.path(path, "tax50.rds"))
and then hand off the objects to phyloseq!
physeq_50 <- phyloseq(otu_table(st, taxa_are_rows = FALSE), tax_table(tax50))
physeq_80 <- phyloseq(otu_table(st, taxa_are_rows = FALSE), tax_table(tax80))
Save objects
saveRDS(physeq_50, file.path(path, "physeq_50.rds"))
saveRDS(physeq_80, file.path(path, "physeq_80.rds"))
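And a quick sanity check that the handoff worked:
physeq_80 #prints the object summary
ntaxa(physeq_80) #should be 1212 (or 1211 if you dropped the bimera)
nsamples(physeq_80) #should be 4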