by Angus Ball
Hi! This is the introductory tutorial on DADA2 using a single sample, so you can do it on your own computer before going and using the HPC. This was stolen from: https://benjjneb.github.io/dada2/ITS_workflow.html
As always, packages first:
library(dada2)
library(ShortRead)
library(Biostrings)
Start by importing your one sample. There'll be four files: two index files and two read files, one pair for the forward reads (I1, R1) and one for the reverse reads (I2, R2). That being said, the indexes are only needed to demultiplex your files, and since that's already been done we will just be leaving them out and sorting our files into a forward-reads category (fnFs) and a reverse-reads category (fnRs).
path <- "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset" #location of data
list.files(path)
[1] "cutadapt" "filtN" "LW01_I1.fastq.gz" "LW01_I2.fastq.gz" "LW01_R1.fastq.gz" "LW01_R2.fastq.gz"
#organize files
fnFs <- sort(list.files(path, pattern = "R1.fastq.gz", full.names = TRUE)) #Forward reads
fnRs <- sort(list.files(path, pattern = "R2.fastq.gz", full.names = TRUE)) #Reverse reads
Then we have to identify the primers used. In this study the ITS86F forward and ITS4 reverse primers were used, so…
FWD <- "ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA"
REV <- "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC"
Let's confirm we have the correct orientation of the primers and verify their presence. This code generates all the orientations (complement, reverse, and reverse complement) of our primer DNA:
allOrients <- function(primer) {
# Create all orientations of the input sequence
require(Biostrings)
dna <- DNAString(primer) # Biostrings works with DNAString objects rather than character vectors
orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna),
RevComp = Biostrings::reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
Forward
"ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA"
Complement
"TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGACACTTAGTAGCTTAGAAACTT"
Reverse
"AAGTTTCTAAGCTACTAAGTGTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA"
RevComp
"TTCAAAGATTCGATGATTCACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
Now we will pre-filter the sequences to remove any reads containing ambiguous bases (Ns)
fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) # Put N-filtered files in filtN/ subdirectory
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = FALSE)
Now this chunk of code goes through the reads and counts the number of reads in which each primer appears, in all possible orientations, for both the forward and reverse reads.
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
Forward Complement Reverse RevComp
FWD.ForwardReads 0 0 0 0
FWD.ReverseReads 0 0 0 19
REV.ForwardReads 0 0 0 21
REV.ReverseReads 0 0 0 0
Sweet, so the primers have already been removed from the start of each sequence. But since some short ITS sequences bled through in this study, there are some hits of the reverse complement of the opposite primer at the ends of the respective strands. To cut that read-through off the ends of the sequences we will be using a program called cutadapt. Cutadapt isn't an R package so it's kind of a pain, but alas. Go here to download it: https://cutadapt.readthedocs.io/en/stable/installation.html#installation-on-windows
cutadapt <- "C:\\Users\\angus\\AppData\\Local\\Programs\\Python\\Python312\\Scripts\\cutadapt" #location of cutadapt on your computer
system2(cutadapt, args = "--version") # Run shell commands from R
4.6
This cuts the primers (and any read-through of the opposite primer) out of the reads:
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut) #this creates the cutadapt/ subdirectory if it doesn't already exist
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))
#Hey, this whole thing breaks if you have spaces in your path names, so uhhh don't do that
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)
# Run Cutadapt
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
fnFs.filtN[i], fnRs.filtN[i])) # input files
}
This is cutadapt 4.6 with Python 3.12.1
Command line parameters: -g ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA -a GCATATCAATAAGCGGAGGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -G GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC -A TTCAAAGATTCGATGATTCACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -n 2 -o C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/cutadapt/LW01_R1.fastq.gz -p C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/cutadapt/LW01_R2.fastq.gz C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/filtN/LW01_R1.fastq.gz C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/filtN/LW01_R2.fastq.gz
Processing paired-end reads on 1 core ...
Finished in 6.079 s (98.672 µs/read; 0.61 M reads/minute).
=== Summary ===
Total read pairs processed: 61,606
Read 1 with adapter: 61,279 (99.5%)
Read 2 with adapter: 61,466 (99.8%)
Pairs written (passing filters): 61,606 (100.0%)
Total basepairs processed: 30,802,152 bp
Read 1: 15,401,500 bp
Read 2: 15,400,652 bp
Total written (filtered): 28,233,750 bp (91.7%)
Read 1: 14,086,562 bp
Read 2: 14,147,188 bp
=== First read: Adapter 1 ===
Sequence: ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA; Type: regular 5'; Length: 54; Trimmed: 61257 times
Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5
Overview of removed sequences
length count expect max.err error counts
15 2 0.0 1 2
19 23 0.0 1 5 7 11
20 606 0.0 2 109 481 16
21 60562 0.0 2 59743 781 38
22 64 0.0 2 5 55 4
=== First read: Adapter 2 ===
Sequence: GCATATCAATAAGCGGAGGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 3'; Length: 54; Trimmed: 853 times
Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5
Bases preceding removed adapters:
A: 86.5%
C: 1.3%
G: 2.6%
T: 9.6%
none/other: 0.0%
WARNING:
The adapter is preceded by 'A' extremely often.
The provided adapter sequence could be incomplete at its 5' end.
Ignore this warning when trimming primers.
Overview of removed sequences
length count expect max.err error counts
3 110 962.6 0 110
4 6 240.6 0 6
5 2 60.2 0 2
7 4 3.8 0 4
9 1 0.2 0 1
10 1 0.1 1 0 1
13 8 0.0 1 6 2
14 9 0.0 1 8 1
15 8 0.0 1 8
16 1 0.0 1 1
17 2 0.0 1 2
18 4 0.0 1 4
19 2 0.0 1 2
20 1 0.0 2 0 1
21 2 0.0 2 1 1
23 2 0.0 2 2
24 2 0.0 2 2
25 1 0.0 2 1
27 1 0.0 2 1
28 2 0.0 2 2
30 1 0.0 3 0 0 0 1
31 1 0.0 3 1
32 1 0.0 3 1
33 595 0.0 3 547 35 9 4
34 12 0.0 3 11 0 0 1
43 2 0.0 4 2
49 1 0.0 4 1
50 1 0.0 5 1
51 27 0.0 5 21 5 1
52 1 0.0 5 1
53 3 0.0 5 3
57 1 0.0 5 1
59 1 0.0 5 1
91 1 0.0 5 0 0 0 0 1
95 10 0.0 5 0 0 0 2 5 3
96 1 0.0 5 0 0 0 0 1
128 2 0.0 5 0 0 0 0 0 2
133 1 0.0 5 1
134 1 0.0 5 1
176 1 0.0 5 0 0 1
187 1 0.0 5 0 0 0 1
205 6 0.0 5 5 0 0 1
206 13 0.0 5 12 1
=== Second read: Adapter 3 ===
Sequence: GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC; Type: regular 5'; Length: 54; Trimmed: 61449 times
Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5
Overview of removed sequences
length count expect max.err error counts
14 1 0.0 1 0 1
16 1 0.0 1 0 1
17 6 0.0 1 2 4
18 38 0.0 1 3 18 17
19 358 0.0 1 47 293 18
20 60985 0.0 2 60034 872 79
21 57 0.0 2 5 47 5
26 1 0.0 2 0 0 1
61 1 0.0 5 0 0 0 0 0 1
64 1 0.0 5 0 0 0 0 0 1
=== Second read: Adapter 4 ===
Sequence: TTCAAAGATTCGATGATTCACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT; Type: regular 3'; Length: 54; Trimmed: 719 times
Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5
Bases preceding removed adapters:
A: 1.4%
C: 0.3%
G: 94.9%
T: 3.5%
none/other: 0.0%
WARNING:
The adapter is preceded by 'G' extremely often.
The provided adapter sequence could be incomplete at its 5' end.
Ignore this warning when trimming primers.
Overview of removed sequences
length count expect max.err error counts
3 25 962.6 0 25
4 12 240.6 0 12
6 1 15.0 0 1
8 2 0.9 0 2
9 1 0.2 0 1
10 3 0.1 1 1 2
11 1 0.0 1 1
14 7 0.0 1 6 1
15 9 0.0 1 8 1
16 8 0.0 1 8
17 1 0.0 1 1
18 2 0.0 1 2
19 4 0.0 1 4
20 2 0.0 2 2
22 1 0.0 2 1
23 1 0.0 2 1
24 2 0.0 2 2
25 2 0.0 2 1 0 1
26 1 0.0 2 1
28 1 0.0 2 1
29 2 0.0 2 0 1 1
32 1 0.0 3 1
33 1 0.0 3 1
34 567 0.0 3 398 126 30 13
35 2 0.0 3 2
36 1 0.0 3 0 1
37 1 0.0 3 1
44 2 0.0 4 2
50 1 0.0 5 0 0 1
51 1 0.0 5 1
52 27 0.0 5 8 17 0 2
53 1 0.0 5 1
54 3 0.0 5 1 1 1
58 1 0.0 5 1
60 1 0.0 5 1
92 1 0.0 5 0 0 1
96 12 0.0 5 11 1
106 2 0.0 5 1 1
128 1 0.0 5 1
134 1 0.0 5 1
135 1 0.0 5 0 0 1
177 1 0.0 5 1
188 2 0.0 5 1 0 0 1
WARNING:
One or more of your adapter sequences may be incomplete.
Please see the detailed output above.
So two adapters had warning messages saying that one specific base appears suspiciously often right before the trimmed sequence. Option 1: you're a fool! and didn't include the full length of your primer, so go add the bases you missed and reassess. Option 2 (this is us): only the reverse complements of the primers threw warning messages, which means this probably isn't an artifact of your primers and more an artifact of natural diversity putting an A or G next to the cut site. (If you'd missed a nucleotide on your primer, you'd see warnings on both the primers themselves and their reverse complements, I believe.)
Let's double-check that it worked!
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
Forward Complement Reverse RevComp
FWD.ForwardReads 0 0 0 0
FWD.ReverseReads 0 0 0 0
REV.ForwardReads 0 0 0 0
REV.ReverseReads 0 0 0 0
Hey, it worked!! There are no more primer sequences (or their reverse complements) within our reads.
# Forward and reverse fastq filenames have the format SAMPLENAME_R1.fastq.gz and SAMPLENAME_R2.fastq.gz:
cutFs <- sort(list.files(path.cut, pattern = "R1.fastq.gz", full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "R2.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have the format SAMPLENAME_XXX.fastq.gz
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][1]
sample.names <- unname(sapply(cutFs, get.sample.name))
head(sample.names)
[1] "LW01"
Now we want to inspect the read quality profiles. First the forward reads:
plotQualityProfile(cutFs)
Warning: 'package:stats' may not be available when loading
Then the reverse
plotQualityProfile(cutRs)
Warning: 'package:stats' may not be available when loading
So how do you read these? The green line is the median quality score at each position, the greyscale is a heatmap of the frequency of each quality score, the quartiles of the quality score distribution are shown by the orange lines, and the red line at the bottom (with a separate scale) shows the proportion of reads that extend to at least that position. What is a quality score? Check out https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html and https://drive5.com/usearch/manual/exp_errs.html, but briefly: a quality score of 10 means the base call is wrong 10% of the time, Q20 means 1%, and Q30 means 0.1%. So, in theory, anything under Q20 should be removed; in practice it is better to discard reads based on their number of expected errors rather than on a hard quality-score cutoff (this still uses the quality score information): https://academic.oup.com/bioinformatics/article/31/21/3476/194979?login=false
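To make the expected-errors idea concrete, here's a throwaway sketch (not part of the workflow; the quality scores are made up) showing how a read's per-base Q scores turn into an expected number of errors, which is the quantity maxEE filters on below.
q <- c(38, 37, 35, 30, 25, 20, 15, 2) # hypothetical per-base quality scores for one short read
p_error <- 10^(-q / 10) # Q20 = 1% chance the base is wrong, Q30 = 0.1%, etc.
sum(p_error) # expected number of wrong bases in this read; filterAndTrim discards the read if this exceeds maxEE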
This sets the paths for the filtered files that will be created by the filterAndTrim command below
filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))
And now we can enter dada2 fully!!
out <- filterAndTrim(
cutFs, #location of cut forward reads
filtFs, #location of where filtered forward reads will go
cutRs, #location of cut reverse reads
filtRs, #location of where filtered reverse reads will go
maxN = 0, #max number of ambiguous bases allowed in a read (dada2 REQUIRES 0 ambiguous bases)
maxEE = c(2, 2), #maxEE discards reads whose expected number of errors exceeds this value (rather than filtering on Q scores directly); 2 is the recommended value. Why two numbers? One for the forward reads, one for the reverse
truncQ = 2, #truncates each read at the first base with a quality score of 2 or below. Why 2? Illumina reports a 2 on really bad stretches of a read
minLen = 50, #minimum length of read, note for ITS region there is no max length
rm.phix = TRUE, #removes sequences of the bacteriophage PhiX which commonly contaminates NGS data
compress = TRUE, #compresses your files when its done
multithread = FALSE) #multithreading can't be done on a Windows machine
head(out)
reads.in reads.out
LW01_R1.fastq.gz 61606 38621
If you want to learn more about expected errors and truncQ, read this: https://drive5.com/usearch/manual/exp_errs.html
OUCH! That removed over a third of all the reads!!!!
Let's take a look at the reads again
plotQualityProfile(filtFs)
Warning: 'package:stats' may not be available when loading
plotQualityProfile(filtRs)
Warning: 'package:stats' may not be available when loading
Fewer reads, but they are certainly of a much higher quality!! In the DADA2 tutorial only about 80% of the reads survived this step, so honestly this loss sounds like it sucks, but looking at other people's samples it's not too bad in the grand scheme of things; we still have 38 thousand reads in this sample!! Feel free to ask me (Angus) about your specific loss rates when you come to this step!
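If you want to put an exact number on your own loss rate, here's a quick one-liner using the out matrix that filterAndTrim() returned above (column names as printed by head(out)):
out[, "reads.out"] / out[, "reads.in"] # fraction of reads kept by filterAndTrim; ~0.63 for this sample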
Okay! We are out of ITS-specific commands now; everything from here on out (excluding which database you compare against) is the same for 16S and ITS sequencing. This also means we might get a bunch of warning messages saying our reads are different lengths; that's okay, we know! DADA2 now tries to learn the error rates of the different nucleotide transitions (don't ask me how), so…
errF <- learnErrors(filtFs, multithread=FALSE) #remember Windows machines can't multithread, maybe yours can?
8825172 total bases in 38621 reads from 1 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread=FALSE)
8859689 total bases in 38621 reads from 1 samples will be used for learning the error rates.
Then let's take a look…
plotErrors(errF, nominalQ=TRUE)
plotErrors(errR, nominalQ=TRUE)
Looking at these two graphs, they look pretty good: the observed error rates (grey dots) track the fitted error rates (black lines) well and decrease with increasing quality score, as they should (the red lines show what the nominal Q scores alone would predict), for each of the different nucleotide transitions (e.g. A2C is an adenine -> cytosine substitution).
Now it's time to run the dada algorithm to find our unique sequences
dadaFs <- dada(filtFs, err=errF, multithread=FALSE)
Sample 1 - 38621 reads in 12833 unique sequences.
dadaRs <- dada(filtRs, err=errR, multithread=FALSE)
Sample 1 - 38621 reads in 11222 unique sequences.
Sweet, loads of unique sequences! Let's merge the forward and reverse reads:
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
27156 paired-reads (in 168 unique pairings) successfully merged out of 37747 (in 524 pairings) input.
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
[1] "CGCACCTTGCGCTCTTTGGTATTCCGAAGAGCATGCCTGTTTGAGTGTCATGAAAATATCAACCTTGACTTGGGTTTAGTGCTCTTGTCTTGGCTTGGATTTGGCTGTTTGCCGCTCGAAAGAGTCGGCTCAGCTTAAAAGTATTAGCTGGATCTGTCTTTGAGACTTGGTTTGACTTGGCGTAATAAGTTATTTCGCTGAGGACAATCTTCGGATTGGCCGAGTTTCTGGGACGTTTGTCCGCTTTCTAATACAAGTTCTAGCTTGCTAGACATGACTTTTTTATTATCTGGCCTCAAATCAGGTAGGACTACCCGCTGAACTTAA"
[2] "CGCACATTGCGCCCCTTGGTATTCCATGGGGCATGCCTGTTCGAGCGTCATTTGTACCTTCAAGCTTTGCTTGGTGTTGGGTGTTTGTCTCGCCTTGCGTGTAGACTCGCCTTAAAATAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCACTCTGTCCTCACAACGACGACGTCCAAAAGTACTTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA"
[3] "CGCACCTTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTTGAGTGTCATGAAACCTCACCCCACTTGGGTTTTTGCCTGAGCGGTGGTGTATTGGGTGTTGCCTTGCTAAAGGCTCGCCTTAAAGACATAAGCACCTTGGATGTAATACGTTTCATCCTTCTGGGTGGCTGATAACCCCACATATTCATGATCTGGCCTCAAATCAGGTAGGGCTACCCGCTGAACTTAA"
[4] "CGCACATTGCACTCTCTGGTATTCCGGAGAGTATGCTTGTTTGAGTATCTGTAAACATCTCAAATCTCTCGACTTTCGAGTTTGAGAGAATTGGACTTGAGCAATCCCAACTCCTTCGTCAGAAAGAGGAGGGTTGCTTGAAATGCAGGTGCAGCTGGGCATTCTTCTGAGCTAAAAGCATATTTATTTAGTCCCGTCAAACGGATTATTATTCTTGCTTCAGCTTACATAAAGGTTGAATGACTACCGAGGCTGATTGCAGGTAAACTCGTTCCGTTAAAAAGAACTTGTGAACTCGATCTCAAATCAAGTAAGACTACCCGCTGAACTTAA"
[5] "CGCACATTGCGCTCTCTAGTATTCTGGAGAGCATGCTTGTTTGAGTATCAGTAAACACCTCAATCCCCTAAATCTTGAAAAAGAAAGGGAATTGGACTTGAGCTATCCCAACAAACGTAAAACTTTGGCGGGTCGCTTGAAATGCAGGTGCAGCTGGAACTACTCTTCTGAGCTATAAGCATATCTATTTAGTCCCGTCAAACGGATTATTATTTTTTGCTGCAGCTAACATAAAGGAAAAGTAGATCCGCTATGCTGACTGATGCAGGAATTAGCAGTGAACGTCTCTTTGAGGCGCTTACTGTGAAACTCGATCTCAAATCAAGTAAGACTACCCGCTGAACTTAA"
[6] "CGCACCTTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTTGAGTGTCATGAAACCTCACCCCACTTGGGTTTTTACCTGAGCGGTGGTGTATTGGGTGTTGCCTTGCTAAAGGCTCGCCTTAAAGATATAAGCACCTTGGATGTAATACGTTTCATCCTTCTGGGTGGCTAATAACCCCACATATTCATGATCTGGCCTCAAATCAGGTAGGGCTACCCGCTGAACTTAA"
Sweet, we only lost about 10,000 reads from that! Now, with fully merged sequences, we can construct a sequence table and determine our amplicon sequence variants (ASVs)
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
[1] 1 168
That means 168 unique ASVs!
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))
196 211 214 215 216 222 229 232 233 234 236 237 238 239 240 241 242 243 244 245 246 247 248 249 252 254 256 258 259 262 265 266 270 272 273
1 1 1 2 1 1 1 8 1 3 5 5 12 4 2 3 1 1 1 3 3 4 3 1 1 1 1 1 1 1 1 1 1 2 1
275 280 283 284 286 287 292 298 303 304 305 306 311 312 313 316 324 325 326 327 328 329 330 331 332 333 334 335 336 338 339 341 343 345 347
1 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 2 2 11 9 5 1 3 1 6 3 3 3 2 1 1 1 2 2
348 349 356 370 373
2 2 2 1 2
We have so much variation in sequence length because it's the ITS region we are sequencing, but always check that the sequence lengths make sense, as weird lengths can result from non-specific merging. The ITS2 region is between 160 and 390 bp, so this all checks out: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7162488/#:~:text=In%20spite%20of%20its%20short,across%20all%20land%20plants%20(e.
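If you ever did see lengths that don't make sense (more of an issue for 16S, where the amplicon length is tight), one way you could drop out-of-range ASVs is sketched below; the 160-390 window just mirrors the ITS2 range above, and the object name seqtab.lenfilt is just something I made up, so for this dataset this step wouldn't remove anything.
in.range <- nchar(getSequences(seqtab)) %in% 160:390 # flag ASVs with plausible ITS2 lengths
seqtab.lenfilt <- seqtab[, in.range, drop = FALSE] # keep only those columns of the sequence table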
But life is never perfectly easy; we must remove chimeras (i.e. hybrid sequences formed during PCR/sequencing that don't reflect real biological sequences). Chimeras are detected by looking for low-abundance ASVs that can be reconstructed from pieces of more abundant ASVs. Why and how this happens is well explained here: https://forum.qiime2.org/t/dada2-merging-plates-and-chimera-removal/21030/4
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
Identified 41 bimeras out of 168 input sequences.
dim(seqtab.nochim)
[1] 1 127
41 chimeric sequences! That's 24.4% of the total number of ASVs!!
sum(seqtab.nochim)/sum(seqtab)
[1] 0.9531595
However, removing these sequences only resulted in a loss of less than 5% of the reads in total! These are normal numbers, and anything under 25% needs no second thought. Chimeras can make up more or less of a sample depending on the specific experimental procedures.
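If you want to reproduce the percentages quoted above for yourself, a couple of throwaway lines:
1 - ncol(seqtab.nochim) / ncol(seqtab) # fraction of ASVs flagged as chimeric, ~24% here
1 - sum(seqtab.nochim) / sum(seqtab) # fraction of reads removed with them, under 5% here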
Now let's look at how many reads made it through each step of the pipeline
getN <- function(x) sum(getUniques(x))
track <- cbind(out, getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
input filtered denoisedF denoisedR merged nonchim
LW01 61606 38621 38058 38045 27156 25884
PS. I modified the above code to work on a single sample; you'll have to change it for multiple samples, see the ITS big data tutorial (and the sketch just below).
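For reference, this is roughly what the multi-sample version of the tracking table looks like in the main DADA2 tutorial; it assumes dadaFs, dadaRs and mergers are lists with one element per sample, which they will be once you process several files at once:
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names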
Then let's assign taxonomy! First we need to download the ITS database from UNITE: https://unite.ut.ee/repository.php
You'll want to download one of the general FASTA releases. The difference between the two files isn't well explained. I have the beans on the HPC to do what I want, so I'll be using the slightly larger database, and perhaps that's how you should make your decision as well, but if you have strong opinions tell me.
unite.ref <- "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\UNITE_database.fastq.gz" # CHANGE ME to location on your machine
taxa <- assignTaxonomy(seqtab.nochim, #sequences
                       unite.ref, #database
                       multithread = FALSE,
                       tryRC = FALSE) #try the reverse complement of the sequences as well?
Error: Memory allocation failed.
My baby computer doesn't have the stuff to process even one sample! But alas, we can see what inspecting the result would theoretically look like with this
taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)
Then you can take your data and save it! seqtab.nochim and taxa are the important objects: seqtab.nochim is your otu_table, and taxa is your tax_table. To save…
saveRDS(taxa, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\taxa.rds" )#add your file location here
saveRDS(seqtab.nochim, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\seqtabnochim.rds" )#add your file location here
Then to make a phyloseq object later on…
library(phyloseq)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE), tax_table(taxa))
saveRDS(ps, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\ps.rds" )#add your file location here
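Later on, when you want to pick the analysis back up, you can just read those files back in rather than re-running the whole pipeline (swap in wherever you saved yours):
ps <- readRDS("C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\ps.rds") #reload the saved phyloseq object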