by Angus Ball

Hi This is the introductory tutorial on DADA2 using a single sample so you can do it on your computer before going and using the HPC This was stolen from: https://benjjneb.github.io/dada2/ITS_workflow.html

As a heads up this tutorial is exactly the same as the Miseq one EXCEPT when you start looking at error modeling

As always packages

library(dada2)
library(ShortRead)
library(Biostrings)

Start by importing your one sample there’ll be 4 files, 2 index files and 2 read files, one for the forward (I1, R1), and one for the reverse reads (I2, R2). That being said the indexes are only needed to demultiplex your files and since you’ve already done that we will just be filtering them out from our forward reads caragory (fnFs) and reverse reads category (fnRs)

path <- "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset" #location of data
list.files(path)
 [1] "cutadapt"                                                      "filtN"                                                        
 [3] "MI.M03992_0831.001.FLD_ill_017_i7---IDT_i5_7.LW03_R1.fastq.gz" "MI.M03992_0831.001.FLD_ill_017_i7---IDT_i5_7.LW03_R2.fastq.gz"
 [5] "MI.M03992_0831.001.FLD_ill_022_i7---IDT_i5_7.LW67_R1.fastq.gz" "MI.M03992_0831.001.FLD_ill_022_i7---IDT_i5_7.LW67_R2.fastq.gz"
 [7] "plotqualitybeforefilter.rds"                                   "seqtab.nochim.rds"                                            
 [9] "UNITE.tgz"                                                     "UNITE_database.fastq"                                         
[11] "UNITE_database.fastq.gz"                                      
#organize files
fnFs <- sort(list.files(path, pattern = "R1.fastq.gz", full.names = TRUE)) #F orward
fnRs <- sort(list.files(path, pattern = "R2.fastq.gz", full.names = TRUE)) #R everse

Then we have to identify the primers used. In this study the ITS86F and ITS4 reverse primers were used so…

FWD <- "ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA"
REV <- "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC"

Lets confirm we have the correct orientation of the primers and verify their presence This code makes reverse compliments of our dna

allOrients <- function(primer) {
    # Create all orientations of the input sequence
    require(Biostrings)
    dna <- DNAString(primer)  # The Biostrings works w/ DNAString objects rather than character vectors
    orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna),
        RevComp = Biostrings::reverseComplement(dna))
    return(sapply(orients, toString))  # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
                                                 Forward                                               Complement 
"ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA" "TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGACACTTAGTAGCTTAGAAACTT" 
                                                 Reverse                                                  RevComp 
"AAGTTTCTAAGCTACTAAGTGTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA" "TTCAAAGATTCGATGATTCACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" 

Now we will prefilter the sequences to remove ambiguous bases (Ns)

fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) # Put N-filtered files in filtN/ subdirectory
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = FALSE)

Now this chunk of code goes through the sequences and counts the number of times the primers appear in the forward and reverse reads at all possible orientations.

primerHits <- function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), FWD.ReverseReads = sapply(FWD.orients,
    primerHits, fn = fnRs.filtN[[1]]), REV.ForwardReads = sapply(REV.orients, primerHits,
    fn = fnFs.filtN[[1]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       3
REV.ForwardReads       0          0       0       4
REV.ReverseReads       0          0       0       0

Sweet so the primers have already been removed from the sample at the start of each sequence, but since there is some bleed through of short its sequences measured in this study there are some hits of the reverse compliment in the respective strands.To cut the ends off the sequences from this bleed through we will be using a program called cut adapt. Cutadapt isnt an r package so its kinda a pain but alas. go here https://cutadapt.readthedocs.io/en/stable/installation.html#installation-on-windows to download it

cutadapt <- "C:\\Users\\angus\\AppData\\Local\\Programs\\Python\\Python312\\Scripts\\cutadapt" #location of cutadapt on you computer
system2(cutadapt, args = "--version") # Run shell commands from R
4.6

this cuts the sequences with the primers in them

path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)#this creates a cutadapt file
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))

#Hey this whole thing breaks if you have alot of spaces in your path names so uhhh don't do that

FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC) 
# Run Cutadapt
for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
                             fnFs.filtN[i], fnRs.filtN[i])) # input files
}
This is cutadapt 4.6 with Python 3.12.1
Command line parameters: -g ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA -a GCATATCAATAAGCGGAGGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -G GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC -A TTCAAAGATTCGATGATTCACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -n 2 -o C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/cutadapt/MI.M03992_0831.001.FLD_ill_017_i7---IDT_i5_7.LW03_R1.fastq.gz -p C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/cutadapt/MI.M03992_0831.001.FLD_ill_017_i7---IDT_i5_7.LW03_R2.fastq.gz C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/filtN/MI.M03992_0831.001.FLD_ill_017_i7---IDT_i5_7.LW03_R1.fastq.gz C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/filtN/MI.M03992_0831.001.FLD_ill_017_i7---IDT_i5_7.LW03_R2.fastq.gz
Processing paired-end reads on 1 core ...
Finished in 7.008 s (89.923 µs/read; 0.67 M reads/minute).

=== Summary ===

Total read pairs processed:             77,936
  Read 1 with adapter:                  77,544 (99.5%)
  Read 2 with adapter:                  77,750 (99.8%)
Pairs written (passing filters):        77,936 (100.0%)

Total basepairs processed:    38,967,374 bp
  Read 1:    19,484,000 bp
  Read 2:    19,483,374 bp
Total written (filtered):     35,782,993 bp (91.8%)
  Read 1:    17,854,923 bp
  Read 2:    17,928,070 bp

=== First read: Adapter 1 ===

Sequence: ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA; Type: regular 5'; Length: 54; Trimmed: 77540 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5

Overview of removed sequences
length  count   expect  max.err error counts
5   1   76.1    0   1
17  1   0.0 1   0 1
18  1   0.0 1   0 0 1
19  38  0.0 1   6 5 27
20  733 0.0 2   102 607 24
21  76685   0.0 2   75705 939 41
22  79  0.0 2   2 75 2
23  1   0.0 2   0 1
24  1   0.0 2   0 0 1


=== First read: Adapter 2 ===

Sequence: GCATATCAATAAGCGGAGGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 3'; Length: 54; Trimmed: 78 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5

Bases preceding removed adapters:
  A: 44.9%
  C: 17.9%
  G: 9.0%
  T: 28.2%
  none/other: 0.0%

Overview of removed sequences
length  count   expect  max.err error counts
3   38  1217.8  0   38
4   6   304.4   0   6
5   6   76.1    0   6
6   1   19.0    0   1
7   1   4.8 0   1
10  1   0.1 1   0 1
11  4   0.0 1   4
13  2   0.0 1   2
15  2   0.0 1   2
17  1   0.0 1   0 1
19  5   0.0 1   4 1
20  1   0.0 2   0 0 1
22  1   0.0 2   1
30  1   0.0 3   0 0 0 1
52  1   0.0 5   1
55  1   0.0 5   1
95  3   0.0 5   0 0 0 2 0 1
206 3   0.0 5   3


=== Second read: Adapter 3 ===

Sequence: GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC; Type: regular 5'; Length: 54; Trimmed: 77746 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5

Overview of removed sequences
length  count   expect  max.err error counts
3   1   1217.8  0   1
14  2   0.0 1   1 1
16  7   0.0 1   5 2
17  4   0.0 1   2 2
18  27  0.0 1   4 9 14
19  482 0.0 1   45 402 35
20  77161   0.0 2   75962 1104 95
21  62  0.0 2   5 53 4


=== Second read: Adapter 4 ===

Sequence: TTCAAAGATTCGATGATTCACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT; Type: regular 3'; Length: 54; Trimmed: 47 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5

Bases preceding removed adapters:
  A: 6.4%
  C: 10.6%
  G: 61.7%
  T: 21.3%
  none/other: 0.0%

Overview of removed sequences
length  count   expect  max.err error counts
3   21  1217.8  0   21
4   1   304.4   0   1
6   5   19.0    0   5
8   1   1.2 0   1
12  3   0.0 1   2 1
14  2   0.0 1   2
16  2   0.0 1   1 1
20  4   0.0 2   3 1
23  1   0.0 2   1
53  1   0.0 5   1
56  1   0.0 5   0 1
96  4   0.0 5   3 0 0 0 1
130 1   0.0 5   0 1
This is cutadapt 4.6 with Python 3.12.1
Command line parameters: -g ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA -a GCATATCAATAAGCGGAGGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -G GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC -A TTCAAAGATTCGATGATTCACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -n 2 -o C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/cutadapt/MI.M03992_0831.001.FLD_ill_022_i7---IDT_i5_7.LW67_R1.fastq.gz -p C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/cutadapt/MI.M03992_0831.001.FLD_ill_022_i7---IDT_i5_7.LW67_R2.fastq.gz C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/filtN/MI.M03992_0831.001.FLD_ill_022_i7---IDT_i5_7.LW67_R1.fastq.gz C:\Users\angus\OneDrive\Documents\Story\lisatestdataset/filtN/MI.M03992_0831.001.FLD_ill_022_i7---IDT_i5_7.LW67_R2.fastq.gz
Processing paired-end reads on 1 core ...
Finished in 6.826 s (85.348 µs/read; 0.70 M reads/minute).

=== Summary ===

Total read pairs processed:             79,983
  Read 1 with adapter:                  79,594 (99.5%)
  Read 2 with adapter:                  79,789 (99.8%)
Pairs written (passing filters):        79,983 (100.0%)

Total basepairs processed:    39,990,603 bp
  Read 1:    19,995,750 bp
  Read 2:    19,994,853 bp
Total written (filtered):     36,719,697 bp (91.8%)
  Read 1:    18,321,996 bp
  Read 2:    18,397,701 bp

=== First read: Adapter 1 ===

Sequence: ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA; Type: regular 5'; Length: 54; Trimmed: 79586 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5

Overview of removed sequences
length  count   expect  max.err error counts
5   2   78.1    0   2
17  1   0.0 1   0 1
18  3   0.0 1   1 2
19  25  0.0 1   8 5 12
20  770 0.0 2   134 614 22
21  78693   0.0 2   77637 1002 54
22  92  0.0 2   1 86 5


=== First read: Adapter 2 ===

Sequence: GCATATCAATAAGCGGAGGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 3'; Length: 54; Trimmed: 136 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5

Bases preceding removed adapters:
  A: 88.2%
  C: 3.7%
  G: 5.9%
  T: 2.2%
  none/other: 0.0%
WARNING:
    The adapter is preceded by 'A' extremely often.
    The provided adapter sequence could be incomplete at its 5' end.
    Ignore this warning when trimming primers.

Overview of removed sequences
length  count   expect  max.err error counts
3   12  1249.7  0   12
4   9   312.4   0   9
5   3   78.1    0   3
7   12  4.9 0   12
8   15  1.2 0   15
9   1   0.3 0   1
10  2   0.1 1   2
11  1   0.0 1   1
13  16  0.0 1   12 4
14  26  0.0 1   26
15  5   0.0 1   5
16  11  0.0 1   11
20  3   0.0 2   3
29  2   0.0 2   2
34  4   0.0 3   4
38  1   0.0 3   1
54  1   0.0 5   1
55  5   0.0 5   4 1
197 1   0.0 5   1
206 5   0.0 5   3 1 1
219 1   0.0 5   1


=== Second read: Adapter 3 ===

Sequence: GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC; Type: regular 5'; Length: 54; Trimmed: 79788 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5

Overview of removed sequences
length  count   expect  max.err error counts
3   4   1249.7  0   4
15  1   0.0 1   0 1
16  1   0.0 1   1
17  5   0.0 1   3 2
18  26  0.0 1   2 6 18
19  433 0.0 1   58 352 23
20  79257   0.0 2   77967 1131 159
21  58  0.0 2   6 49 3
23  1   0.0 2   0 0 1
24  1   0.0 2   0 1
64  1   0.0 5   0 0 0 0 1


=== Second read: Adapter 4 ===

Sequence: TTCAAAGATTCGATGATTCACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT; Type: regular 3'; Length: 54; Trimmed: 112 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-54 bp: 5

Bases preceding removed adapters:
  A: 4.5%
  C: 2.7%
  G: 88.4%
  T: 4.5%
  none/other: 0.0%
WARNING:
    The adapter is preceded by 'G' extremely often.
    The provided adapter sequence could be incomplete at its 5' end.
    Ignore this warning when trimming primers.

Overview of removed sequences
length  count   expect  max.err error counts
3   21  1249.7  0   21
8   8   1.2 0   8
9   14  0.3 0   14
10  2   0.1 1   2
11  1   0.0 1   1
12  1   0.0 1   1
14  12  0.0 1   6 6
15  23  0.0 1   21 2
16  4   0.0 1   4
17  9   0.0 1   9
21  3   0.0 2   3
30  2   0.0 3   2
35  4   0.0 3   4
39  1   0.0 3   1
55  1   0.0 5   1
56  5   0.0 5   3 0 1 1
197 1   0.0 5   0 0 0 0 1


WARNING:
    One or more of your adapter sequences may be incomplete.
    Please see the detailed output above.

so two primers had warning messages saying that there was a weird amount of one specific base after the primer. Option 1: you’re a fool! and didnt include the full length of your primer, so go add the bases you missed and reassess. Option 2 (this is us): Only the compliments of the primers threw warning messages which means that this probably isnt an artifact of your primers and more an artifact of natural diversity putting an A or G next.(if you’d missed a nucleotide on your primer, you’d see error messages on both the reverse and normal compliments i believe)

Lets double check if it worked!

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]), FWD.ReverseReads = sapply(FWD.orients,
    primerHits, fn = fnRs.cut[[1]]), REV.ForwardReads = sapply(REV.orients, primerHits,
    fn = fnFs.cut[[1]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0       0
REV.ReverseReads       0          0       0       0

Hey it worked!! there are no more primer sequences of the reverse compliments within our sequences.

This is some spicy ways to extract the names so we can use them (and name them to something not stupid)

# Forward and reverse fastq filenames have the format:
cutFs <- sort(list.files(path.cut, pattern = "R1.fastq.gz", full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "R2.fastq.gz", full.names = TRUE))

You’re samples will have different naming schemes, therefore you’ll have to change this code to fit your data. To rename these files we use regex expressions which are hard and complicated :(.

here is an unadultered sample name: MI.M03992_0831.001.FLD_ill_028_i7—IDT_i5_8.LW01_R1.fastq.gz

We want our name to be Lw43 only so…

the first function…

get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][7] #"_"<-split value, [7]<-save this one
sample.names <- unname(sapply(cutFs, get.sample.name))

can be read as create an object with the function split the string (the name) at every _ and keep the 7th one i.e. you run the command and

MI.M03992_0831.001.FLD_ill_028_i7—IDT_i5_8.LW03_R1.fastq.gz

becomes

“MI.M03992” “0831.001.FLD” “ill” “028” “i7—IDT” “i5” “8.LW03” “R1.fastq.gz”

And since only the 7th is kept it becomes

“8.LW03”

Then we repeat just to save the LW43 part of the name, because thats what we care about

get.sample.name.2 <- function(fname) strsplit(basename(fname), "[.]")[[1]][2]
sample.names <- unname(sapply(sample.names, get.sample.name.2))

This takes

“8.LW03”

splits it by the period

Note: the period is special in regex expressions so I have to enclose it in square brackets (the square brackets are a special form of an escape character, this sentence won’t make sense unless you code in other languages tho :/ )

“8” “LW03”

Then we save the 2nd instance and the sample name becomes

“LW03”

Huzzah! actual names that are good

head(sample.names)
[1] NA     "LW03" "LW67"

(You can ignore the NA, its a byproduct of my creating these tutorials haphazardly)

Now we want to inspect the read quality profiles first the forward read

Then the reverse

SO how do you read these? the green line is the median quality score, the greyscale is a heatmap of the frequency of each quality score, the quartiles of the average quality are shown with the orange lines, and the read line at the bottom (with a seperate scale) shows the proportion of reads a that extend to at least that length. What is quality score? check out https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html and https://drive5.com/usearch/manual/exp_errs.html but briefly, Quality score = 10 means a wrong base is reported 10% of the time. Q20 is 1% and Q30 is 0.1% of the time. So anything under q20 should be removed. This is in theory correct; however it is better to delete samples based on the number of expected errors rather than error chance. (this still uses quality score information) https://academic.oup.com/bioinformatics/article/31/21/3476/194979?login=false

This sets some path directories that will be created within the filterandtrim command below

filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))

And now we can enter dada2 fully!!

out <- filterAndTrim(
  cutFs, #location of cut forward reads
  filtFs, #location of where filtered forward reads will go
  cutRs, #location of cut reverse reads
  filtRs, #location of where filtered reverse reads will go
  maxN = 0, #max number of ambiguous bases allowed in a read (dada2 REQUIRES 0 ambiguous bases)
  maxEE = c(2, 2), #maxEE is the command that uses expected error rates to determine if a sample should be deleted (rather than using Q scores absolutely), 2 errors is the recommended value. Why is it a list? one for the forward reads one for the reverse
  truncQ = 2, #truncates reads when any base pair quality score is 2 or below, why 2? illumina sequencing reports a 2 on really bad reads
  minLen = 50, #minimum length of read, note for ITS region there is no max length
  rm.phix = TRUE, #removes sequences of the bacteriophage PhiX which commonly contaminates NGS data
  compress = TRUE, #compresses your files when its done
  multithread = FALSE)  #multithreading cant be done on a windows machine
head(out)
                                                              reads.in reads.out
LW01_R1.fastq.gz                                                 61606     38621
MI.M03992_0831.001.FLD_ill_017_i7---IDT_i5_7.LW03_R1.fastq.gz    77936     42442
MI.M03992_0831.001.FLD_ill_022_i7---IDT_i5_7.LW67_R1.fastq.gz    79983     56495

if you want to learn more about truncQ read this: https://drive5.com/usearch/manual/exp_errs.html

OUCH! That removed two thirds of all the reads!!!!

lets take a look at the reads again

Less samples but they are certainly of a much higher quality!! in the dada2 tutorial only about 80% of the reads survived this step, Honestly this sounds like it sucks but looking at other peoples samples it’s not too bad in the grand scheme of things, we still have 38 thousands reads in this samples!!. Feel free to ask me (angus) about your specific loss rates when you finish the pipeline if thinks look bleak, there are some small things that we can do to up the number (or maybe some hidden error is eating your reads and who knows)!

oKAY! We are out of ITS specific commands now, everything from here on out (excluding what database you compare to) is the same for 16S and ITS sequencing. This also means we might get a bunch of warnings messages saying our reads are different lengths, thats okay we know it! DADA2 tries to learn the error rates of different nucleotide transitions (dont ask me how) so…

HEY, because you read my ENTIRE preamble on the dada2 git page you know Miseq and nextseq data needs to learn errors a bit differently. This is that step! Lets say you didn’t know this and ran your data like normal this’d what your quality plots would look like

#lets say you didn't know this and ran your data like normal this'd what your quality plots would look like
plot(ploterrF_PreLOESS)

plot(ploterrR_PreLOESS)

Brother ewww. Look at C2G or T2A (eg A2C is an adenine -> Cytosine transition)! what the hell is that squiggly ass line!?! These trend lines don’t match up with the datapoints or the red lines at all!plus them falling into levels is what I mean when i say plots fall into stepwise patterns (see documentation on the git).

lets run error modeling the correct way

This code is from https://github.com/benjjneb/dada2/issues/1307, it forces the trend lines to have monotonicity (i.e. they can only tren din one direction and not just up and down), and alters the loess function to make it run closer to the data points. Whats loess?? its kinda like linear regression trying to find a trend line between your data points but more complicated.

Just so we’re on the same page this isn’t changing the data, just the trend line through the data, so as long as you look and it better fits the data then its okay to use. There are some other options (i.e. only enforcing monotonicity or only fixing loess) that we can try but based on my research this seems like the best general solution

loessErrfun_mod4 <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 psuedocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)
        
        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        # mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)
        
        # jonalim's solution
        # https://github.com/benjjneb/dada2/issues/938
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot),degree = 1, span = 0.95)
        
        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))
  
  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE
  
  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    mutate_all(funs(case_when(. < X40 ~ X40,
                              . >= X40 ~ .))) %>% as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)
  
  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

This above code just makes an error model. Now we can actually run our error modeling approach

#note to run the error matrix with this new model, the three samples i've given the program aren't enough, you can only really run this with a full data set
errF <- learnErrors(
  filtFs,
  multithread = FALSE, #remember multitread = TRUE on big computers
  nbases = 1e9,
  errorEstimationFunction = loessErrfun_mod4,
  verbose = TRUE
)

errR <- learnErrors(
  filtRs,
  multithread = FALSE,
  nbases = 1e9,
  errorEstimationFunction = loessErrfun_mod4,
  verbose = TRUE
)

Then lets take a look…

plot(ploterrR_postloess)

Looking at these two graphs, they look pretty good, the actual values (grey dots) match up with the expected (red lines) pretty well for the different nucleotide transitions especially compared to the earlier ones. we see that A2G and T2C transitions are still kinda straight (especially compared to the red line), this is due to the nextseq itself and (the global) we don’t really know why yet. dada2 is still developing solutions to this issue (and these solutions have been in development since 2019…)

Now its time to run dada2 to find our unqiue sequences

dadaFs <- dada(filtFs, err=errF, multithread=FALSE)
#Sample 1 - 38621 reads in 12833 unique sequences.

(I’m lazy and wont rerun the whole thing just so the outputs are here on the bottom in the comments)

dadaRs <- dada(filtRs, err=errR, multithread=FALSE)
#Sample 1 - 38621 reads in 11222 unique sequences.

Sweet loads of unique sequences lets merge the forward and reverse reads

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
#27156 paired-reads (in 168 unique pairings) successfully merged out of 37747 (in 524 pairings) input.


# Inspect the merger data.frame from the first sample
head(mergers[[1]])

#[1] "CGCACCTTGCGCTCTTTGGTATTCCGAAGAGCATGCCTGTTTGAGTGTCATGAAAATATCAACCTTGACTTGGGTTTAGTGCTCTTGTCTTGGCTTGGATTTGGCTGTTTGCCGCTCGAAAGAGTCGGCTCAGCTTAAAAGTATTAGCTGGATCTGTCTTTGAGACTTGGTTTGACTTGGCGTAATAAGTTATTTCGCTGAGGACAATCTTCGGATTGGCCGAGTTTCTGGGACGTTTGTCCGCTTTCTAATACAAGTTCTAGCTTGCTAGACATGACTTTTTTATTATCTGGCCTCAAATCAGGTAGGACTACCCGCTGAACTTAA"                     
#[2] "CGCACATTGCGCCCCTTGGTATTCCATGGGGCATGCCTGTTCGAGCGTCATTTGTACCTTCAAGCTTTGCTTGGTGTTGGGTGTTTGTCTCGCCTTGCGTGTAGACTCGCCTTAAAATAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCACTCTGTCCTCACAACGACGACGTCCAAAAGTACTTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA"                                                                                                       
#[3] "CGCACCTTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTTGAGTGTCATGAAACCTCACCCCACTTGGGTTTTTGCCTGAGCGGTGGTGTATTGGGTGTTGCCTTGCTAAAGGCTCGCCTTAAAGACATAAGCACCTTGGATGTAATACGTTTCATCCTTCTGGGTGGCTGATAACCCCACATATTCATGATCTGGCCTCAAATCAGGTAGGGCTACCCGCTGAACTTAA"                                                                                                                    
#[4] "CGCACATTGCACTCTCTGGTATTCCGGAGAGTATGCTTGTTTGAGTATCTGTAAACATCTCAAATCTCTCGACTTTCGAGTTTGAGAGAATTGGACTTGAGCAATCCCAACTCCTTCGTCAGAAAGAGGAGGGTTGCTTGAAATGCAGGTGCAGCTGGGCATTCTTCTGAGCTAAAAGCATATTTATTTAGTCCCGTCAAACGGATTATTATTCTTGCTTCAGCTTACATAAAGGTTGAATGACTACCGAGGCTGATTGCAGGTAAACTCGTTCCGTTAAAAAGAACTTGTGAACTCGATCTCAAATCAAGTAAGACTACCCGCTGAACTTAA"               
#[5] "CGCACATTGCGCTCTCTAGTATTCTGGAGAGCATGCTTGTTTGAGTATCAGTAAACACCTCAATCCCCTAAATCTTGAAAAAGAAAGGGAATTGGACTTGAGCTATCCCAACAAACGTAAAACTTTGGCGGGTCGCTTGAAATGCAGGTGCAGCTGGAACTACTCTTCTGAGCTATAAGCATATCTATTTAGTCCCGTCAAACGGATTATTATTTTTTGCTGCAGCTAACATAAAGGAAAAGTAGATCCGCTATGCTGACTGATGCAGGAATTAGCAGTGAACGTCTCTTTGAGGCGCTTACTGTGAAACTCGATCTCAAATCAAGTAAGACTACCCGCTGAACTTAA"
#[6] "CGCACCTTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTTGAGTGTCATGAAACCTCACCCCACTTGGGTTTTTACCTGAGCGGTGGTGTATTGGGTGTTGCCTTGCTAAAGGCTCGCCTTAAAGATATAAGCACCTTGGATGTAATACGTTTCATCCTTCTGGGTGGCTAATAACCCCACATATTCATGATCTGGCCTCAAATCAGGTAGGGCTACCCGCTGAACTTAA"   

sweet only lost 10’000 reads from that! Now with full sequences we can construct sequence tables and determine our amplicon sequence variants (ASV’s)

seqtab <- makeSequenceTable(mergers)
dim(seqtab)
#[1]   1 168

That means 168 unqiue ASV’s!

# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))

#196 211 214 215 216 222 229 232 233 234 236 237 238 239 240 241 242 243 244 245 246 247 248 249 252 254 256 258 259 262 265 266 270 272 273 
#  1   1   1   2   1   1   1   8   1   3   5   5  12   4   2   3   1   1   1   3   3   4   3   1   1   1   1   1   1   1   1   1   1   2   1 
#275 280 283 284 286 287 292 298 303 304 305 306 311 312 313 316 324 325 326 327 328 329 330 331 332 333 334 335 336 338 339 341 343 345 347 
#  1   2   1   1   1   1   1   1   2   1   1   2   2   1   1   1   1   2   2  11   9   5   1   3   1   6   3   3   3   2   1   1   1   2   2 
#348 349 356 370 373 
#  2   2   2   1   2 

We have so much variation in the amount of sequence lengths because its the ITS sequence we are using, but always check that the sequence length makes sense as weird sequence lengths can result from non-specific merging. The ITS2 region is between 160-390 bps so this all checks out https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7162488/#:~:text=In%20spite%20of%20its%20short,across%20all%20land%20plants%20(e.

But life is never perfectly easy, we must remove chimeras (i.e. sequences that have been merged but dont reflect biologically species). Chimeras are checked for by looking at ASVs with low abundance that can be made of combinations of highly abundant ASV’s. Why and how this happens is well explained here https://forum.qiime2.org/t/dada2-merging-plates-and-chimera-removal/21030/4

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
#Identified 41 bimeras out of 168 input sequences.

dim(seqtab.nochim)
#[1]   1 127

41 chimeric sequences! This is 24.45% of the total amount of ASV’s!!

sum(seqtab.nochim)/sum(seqtab)
#[1] 0.9531595

However, removing these samples only resulted in a 4% loss of reads in total! These are normal numbers, and anything under 25% is no need for a second thought. Chimeras can be more or less of a sample based on the specific experimental procedures

Now lets do a whole look through all the reads throughout the pipeline

getN <- function(x) sum(getUniques(x))
track <- cbind(out, getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)



#     input filtered denoisedF denoisedR merged nonchim
#LW01 61606    38621     38058     38045  27156   25884

PS. I modified the above code to work on a single sample. you’ll have to change it for multiple samples, see the its big data tutorial

Then lets assign taxonomy! first we need to download the ITS database from UNITE https://unite.ut.ee/repository.php

you’ll want to download one of the general fasta releases. The differences between these files isn’t well explained. There seems to be two fungi and two eukaryote databases to download

I have the beans on the HPC to do what I want, so I’ll be using the slightly larger database, and perhaps thats how you should make your decision aswell, but if you have strong opinions tell me.

unite.ref <- "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\UNITE_database.fastq.gz"  # CHANGE ME to location on your machine
taxa <- assignTaxonomy(seqtab.nochim, #Sequnces
                       unite.ref, #databse
                       multithread = FALSE, 
                       tryRC = TRUE) #try reverse compliment of sequences

My baby computer doesnt have the stuffs to process even one sample! But alas, we can see what it theoretically would look like with this

taxa.print <- taxa  # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)

Then you can take your data and save it! Your seqtab.nochim, and taxa are the important files. seqtab.nochim is your otu_table, and taxa is your tax_table to save…

saveRDS(taxa, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\taxa.rds" )#add your file location here
saveRDS(seqtab.nochim, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\seqtabnochim.rds" )#add your file location here

Then to make a phyloseq object later on…

library(phyloseq)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE), tax_table(taxa))
saveRDS(ps, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\ps.rds" )#add your file location here
---
title: "DADA2 with Nextseq data"
output: html_notebook
---
by Angus Ball

Hi This is the introductory tutorial on DADA2 using a single sample so you can do it on your computer before going and using the HPC
This was stolen from: https://benjjneb.github.io/dada2/ITS_workflow.html

As a heads up this tutorial is exactly the same as the Miseq one EXCEPT when you start looking at error modeling


As always packages
```{r, echo = T, message=FALSE}
library(dada2)
library(ShortRead)
library(Biostrings)
```

Start by importing your one sample
there'll be 4 files, 2 index files and 2 read files, one for the forward (I1, R1), and one for the reverse reads (I2, R2). That being said the indexes are only needed to demultiplex your files and since you've already done that we will just be filtering them out from our forward reads caragory (fnFs) and reverse reads category (fnRs)
```{r}
path <- "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset" #location of data
list.files(path)

#organize files
fnFs <- sort(list.files(path, pattern = "R1.fastq.gz", full.names = TRUE)) #F orward
fnRs <- sort(list.files(path, pattern = "R2.fastq.gz", full.names = TRUE)) #R everse
```

Then we have to identify the primers used. In this study the ITS86F and ITS4 reverse primers were used so...
```{r}
FWD <- "ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTGAATCATCGAATCTTTGAA"
REV <- "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCCTCCGCTTATTGATATGC"
```

Lets confirm we have the correct orientation of the primers and verify their presence
This code makes reverse compliments of our dna
```{r}
allOrients <- function(primer) {
    # Create all orientations of the input sequence
    require(Biostrings)
    dna <- DNAString(primer)  # The Biostrings works w/ DNAString objects rather than character vectors
    orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna),
        RevComp = Biostrings::reverseComplement(dna))
    return(sapply(orients, toString))  # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
```
Now we will prefilter the sequences to remove ambiguous bases (Ns)
```{r}
fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) # Put N-filtered files in filtN/ subdirectory
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = FALSE)
```

Now this chunk of code goes through the sequences and counts the number of times the primers appear in the forward and reverse reads at all possible orientations.
```{r}
primerHits <- function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), FWD.ReverseReads = sapply(FWD.orients,
    primerHits, fn = fnRs.filtN[[1]]), REV.ForwardReads = sapply(REV.orients, primerHits,
    fn = fnFs.filtN[[1]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
```

Sweet so the primers have already been removed from the sample at the start of each sequence, but since there is some bleed through of short its sequences measured in this study there are some hits of the reverse compliment in the respective strands.To cut the ends off the sequences from this bleed through we will be using a program called cut adapt. Cutadapt isnt an r package so its kinda a pain but alas. go here https://cutadapt.readthedocs.io/en/stable/installation.html#installation-on-windows to download it

```{r}
cutadapt <- "C:\\Users\\angus\\AppData\\Local\\Programs\\Python\\Python312\\Scripts\\cutadapt" #location of cutadapt on you computer
system2(cutadapt, args = "--version") # Run shell commands from R
```
this cuts the sequences with the primers in them
```{r, echo = T, message=FALSE}
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)#this creates a cutadapt file
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))

#Hey this whole thing breaks if you have alot of spaces in your path names so uhhh don't do that

FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC) 
# Run Cutadapt
for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
                             fnFs.filtN[i], fnRs.filtN[i])) # input files
}
```
so two primers had warning messages saying that there was a weird amount of one specific base after the primer. Option 1: you're a fool! and didnt include the full length of your primer, so go add the bases you missed and reassess. Option 2 (this is us): Only the compliments of the primers threw warning messages which means that this probably isnt an artifact of your primers and more an artifact of natural diversity putting an A or G next.(if you'd missed a nucleotide on your primer, you'd see error messages on both the reverse and normal compliments i believe)



Lets double check if it worked!
```{r}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]), FWD.ReverseReads = sapply(FWD.orients,
    primerHits, fn = fnRs.cut[[1]]), REV.ForwardReads = sapply(REV.orients, primerHits,
    fn = fnFs.cut[[1]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
```
Hey it worked!! there are no more primer sequences of the reverse compliments within our sequences.


This is some spicy ways to extract the names so we can use them (and name them to something not stupid)
```{r}
# Forward and reverse fastq filenames have the format:
cutFs <- sort(list.files(path.cut, pattern = "R1.fastq.gz", full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "R2.fastq.gz", full.names = TRUE))

```

You're samples will have different naming schemes, therefore you'll have to change this code to fit your data. To rename these files we use regex expressions which are hard and complicated :(.

here is an unadultered sample name: MI.M03992_0831.001.FLD_ill_028_i7---IDT_i5_8.LW01_R1.fastq.gz

We want our name to be Lw43 only so...

the first function...

```{r}
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][7] #"_"<-split value, [7]<-save this one
sample.names <- unname(sapply(cutFs, get.sample.name))
```

can be read as create an object with the function split the string (the name) at every _ and keep the 7th one i.e. you run the command and

MI.M03992_0831.001.FLD_ill_028_i7---IDT_i5_8.LW03_R1.fastq.gz

becomes

"MI.M03992" "0831.001.FLD" "ill" "028" "i7---IDT" "i5" "8.LW03" "R1.fastq.gz"

And since only the 7th is kept it becomes

"8.LW03"

Then we repeat just to save the LW43 part of the name, because thats what we care about

```{r}
get.sample.name.2 <- function(fname) strsplit(basename(fname), "[.]")[[1]][2]
sample.names <- unname(sapply(sample.names, get.sample.name.2))
```

This takes

"8.LW03"

splits it by the period

Note: the period is special in regex expressions so I have to enclose it in square brackets (the square brackets are a special form of an escape character, this sentence won't make sense unless you code in other languages tho :/ )

"8" "LW03"

Then we save the 2nd instance and the sample name becomes

"LW03"

Huzzah! actual names that are good

```{r}
head(sample.names)
```
(You can ignore the NA, its a byproduct of my creating these tutorials haphazardly)


Now we want to inspect the read quality profiles first the forward read

```{r}
plotQualityProfile(cutFs)
```
Then the reverse
```{r}
plotQualityProfile(cutRs)
```
SO how do you read these? the green line is the median quality score, the greyscale is a heatmap of the frequency of each quality score, the quartiles of the average quality are shown with the orange lines, and the read line at the bottom (with a seperate scale) shows the proportion of reads a that extend to at least that length. What is quality score? check  out
https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html
and 
https://drive5.com/usearch/manual/exp_errs.html
but briefly, Quality score = 10 means a wrong base is reported 10% of the time. Q20 is 1% and Q30 is 0.1% of the time. So anything under q20 should be removed*. *This is in theory correct; however it is better to delete samples based on the number of expected errors rather than error chance. (this still uses quality score information)
https://academic.oup.com/bioinformatics/article/31/21/3476/194979?login=false

This sets some path directories that will be created within the filterandtrim command below
```{r}
filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))
```

And now we can enter dada2 fully!!
```{r}
out <- filterAndTrim(
  cutFs, #location of cut forward reads
  filtFs, #location of where filtered forward reads will go
  cutRs, #location of cut reverse reads
  filtRs, #location of where filtered reverse reads will go
  maxN = 0, #max number of ambiguous bases allowed in a read (dada2 REQUIRES 0 ambiguous bases)
  maxEE = c(2, 2), #maxEE is the command that uses expected error rates to determine if a sample should be deleted (rather than using Q scores absolutely), 2 errors is the recommended value. Why is it a list? one for the forward reads one for the reverse
  truncQ = 2, #truncates reads when any base pair quality score is 2 or below, why 2? illumina sequencing reports a 2 on really bad reads
  minLen = 50, #minimum length of read, note for ITS region there is no max length
  rm.phix = TRUE, #removes sequences of the bacteriophage PhiX which commonly contaminates NGS data
  compress = TRUE, #compresses your files when its done
  multithread = FALSE)  #multithreading cant be done on a windows machine
head(out)
```



if you want to learn more about truncQ read this: https://drive5.com/usearch/manual/exp_errs.html

OUCH! That removed two thirds of all the reads!!!!

lets take a look at the reads again
```{r}
plotQualityProfile(filtFs)
```



```{r}
plotQualityProfile(filtRs)
```
Less samples but they are certainly of a much higher quality!! in the dada2 tutorial only about 80% of the reads survived this step, Honestly this sounds like it sucks but looking at other peoples samples it's not too bad in the grand scheme of things, we still have 38 thousands reads in this samples!!. Feel free to ask me (angus) about your specific loss rates when you finish the pipeline if thinks look bleak, there are some small things that we can do to up the number (or maybe some hidden error is eating your reads and who knows)!


oKAY! We are out of ITS specific commands now, everything from here on out (excluding what database you compare to) is the same for 16S and ITS sequencing. This also means we might get a bunch of warnings messages saying our reads are different lengths, thats okay we know it!
DADA2 tries to learn the error rates of different nucleotide transitions (dont ask me how) so...


HEY, because you read my ENTIRE preamble on the dada2 git page you know Miseq and nextseq data needs to learn errors a bit differently. This is that step! Lets say you didn't know this and ran your data like normal this'd what your quality plots would look like
```{r}
plot(ploterrF_PreLOESS)
```
```{r}
plot(ploterrR_PreLOESS)
```

Brother ewww. Look at C2G or T2A (eg A2C is an adenine -> Cytosine transition)! what the hell is that squiggly ass line!?! These trend lines don't match up with the datapoints or the red lines at all!plus them falling into levels is what I mean when i say plots fall into stepwise patterns (see documentation on the git). 

lets run error modeling the correct way

This code is from https://github.com/benjjneb/dada2/issues/1307, it forces the trend lines to have monotonicity (i.e. they can only tren din one direction and not just up and down), and alters the loess function to make it run closer to the data points. Whats loess?? its kinda like linear regression trying to find a trend line between your data points but more complicated. 

Just so we're on the same page this isn't changing the data, just the trend line through the data, so as long as you look and it better fits the data then its okay to use. There are some other options (i.e. only enforcing monotonicity or only fixing loess) that we can try but based on my research this seems like the best general solution
```{r}
loessErrfun_mod4 <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 psuedocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)
        
        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        # mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)
        
        # jonalim's solution
        # https://github.com/benjjneb/dada2/issues/938
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot),degree = 1, span = 0.95)
        
        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))
  
  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE
  
  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    mutate_all(funs(case_when(. < X40 ~ X40,
                              . >= X40 ~ .))) %>% as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)
  
  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

```
This above code just makes an error model. Now we can actually run our error modeling approach

```{r}
#note to run the error matrix with this new model, the three samples i've given the program aren't enough, you can only really run this with a full data set
errF <- learnErrors(
  filtFs,
  multithread = FALSE, #remember multitread = TRUE on big computers
  nbases = 1e9,
  errorEstimationFunction = loessErrfun_mod4,
  verbose = TRUE
)

errR <- learnErrors(
  filtRs,
  multithread = FALSE,
  nbases = 1e9,
  errorEstimationFunction = loessErrfun_mod4,
  verbose = TRUE
)

```



Then lets take a look...
```{r}
plotErrors(errF, nominalQ=TRUE)
```
```{r}
plotErrors(errR, nominalQ=TRUE)
```


Looking at these two graphs, they look pretty good, the actual values (grey dots) match up with the expected (red lines) pretty well for the different nucleotide transitions especially compared to the earlier ones. we see that A2G and T2C transitions are still kinda straight (especially compared to the red line), this is due to the nextseq itself and (the global) we don't really know why yet. dada2 is still developing solutions to this issue (and these solutions have been in development since 2019...)

Now its time to run dada2 to find our unqiue sequences
```{r}
dadaFs <- dada(filtFs, err=errF, multithread=FALSE)
#Sample 1 - 38621 reads in 12833 unique sequences.
```


(I'm lazy and wont rerun the whole thing just so the outputs are here on the bottom in the comments)

```{r}
dadaRs <- dada(filtRs, err=errR, multithread=FALSE)
#Sample 1 - 38621 reads in 11222 unique sequences.
```




Sweet loads of unique sequences
lets merge the forward and reverse reads
```{r}
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
#27156 paired-reads (in 168 unique pairings) successfully merged out of 37747 (in 524 pairings) input.


# Inspect the merger data.frame from the first sample
head(mergers[[1]])

#[1] "CGCACCTTGCGCTCTTTGGTATTCCGAAGAGCATGCCTGTTTGAGTGTCATGAAAATATCAACCTTGACTTGGGTTTAGTGCTCTTGTCTTGGCTTGGATTTGGCTGTTTGCCGCTCGAAAGAGTCGGCTCAGCTTAAAAGTATTAGCTGGATCTGTCTTTGAGACTTGGTTTGACTTGGCGTAATAAGTTATTTCGCTGAGGACAATCTTCGGATTGGCCGAGTTTCTGGGACGTTTGTCCGCTTTCTAATACAAGTTCTAGCTTGCTAGACATGACTTTTTTATTATCTGGCCTCAAATCAGGTAGGACTACCCGCTGAACTTAA"                     
#[2] "CGCACATTGCGCCCCTTGGTATTCCATGGGGCATGCCTGTTCGAGCGTCATTTGTACCTTCAAGCTTTGCTTGGTGTTGGGTGTTTGTCTCGCCTTGCGTGTAGACTCGCCTTAAAATAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCACTCTGTCCTCACAACGACGACGTCCAAAAGTACTTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA"                                                                                                       
#[3] "CGCACCTTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTTGAGTGTCATGAAACCTCACCCCACTTGGGTTTTTGCCTGAGCGGTGGTGTATTGGGTGTTGCCTTGCTAAAGGCTCGCCTTAAAGACATAAGCACCTTGGATGTAATACGTTTCATCCTTCTGGGTGGCTGATAACCCCACATATTCATGATCTGGCCTCAAATCAGGTAGGGCTACCCGCTGAACTTAA"                                                                                                                    
#[4] "CGCACATTGCACTCTCTGGTATTCCGGAGAGTATGCTTGTTTGAGTATCTGTAAACATCTCAAATCTCTCGACTTTCGAGTTTGAGAGAATTGGACTTGAGCAATCCCAACTCCTTCGTCAGAAAGAGGAGGGTTGCTTGAAATGCAGGTGCAGCTGGGCATTCTTCTGAGCTAAAAGCATATTTATTTAGTCCCGTCAAACGGATTATTATTCTTGCTTCAGCTTACATAAAGGTTGAATGACTACCGAGGCTGATTGCAGGTAAACTCGTTCCGTTAAAAAGAACTTGTGAACTCGATCTCAAATCAAGTAAGACTACCCGCTGAACTTAA"               
#[5] "CGCACATTGCGCTCTCTAGTATTCTGGAGAGCATGCTTGTTTGAGTATCAGTAAACACCTCAATCCCCTAAATCTTGAAAAAGAAAGGGAATTGGACTTGAGCTATCCCAACAAACGTAAAACTTTGGCGGGTCGCTTGAAATGCAGGTGCAGCTGGAACTACTCTTCTGAGCTATAAGCATATCTATTTAGTCCCGTCAAACGGATTATTATTTTTTGCTGCAGCTAACATAAAGGAAAAGTAGATCCGCTATGCTGACTGATGCAGGAATTAGCAGTGAACGTCTCTTTGAGGCGCTTACTGTGAAACTCGATCTCAAATCAAGTAAGACTACCCGCTGAACTTAA"
#[6] "CGCACCTTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTTGAGTGTCATGAAACCTCACCCCACTTGGGTTTTTACCTGAGCGGTGGTGTATTGGGTGTTGCCTTGCTAAAGGCTCGCCTTAAAGATATAAGCACCTTGGATGTAATACGTTTCATCCTTCTGGGTGGCTAATAACCCCACATATTCATGATCTGGCCTCAAATCAGGTAGGGCTACCCGCTGAACTTAA"   


```




sweet only lost 10'000 reads from that!
Now with full sequences we can construct sequence tables and determine our amplicon sequence variants (ASV's) 
```{r}
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
#[1]   1 168

```
That means 168 unqiue ASV's!
```{r}
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))

#196 211 214 215 216 222 229 232 233 234 236 237 238 239 240 241 242 243 244 245 246 247 248 249 252 254 256 258 259 262 265 266 270 272 273 
#  1   1   1   2   1   1   1   8   1   3   5   5  12   4   2   3   1   1   1   3   3   4   3   1   1   1   1   1   1   1   1   1   1   2   1 
#275 280 283 284 286 287 292 298 303 304 305 306 311 312 313 316 324 325 326 327 328 329 330 331 332 333 334 335 336 338 339 341 343 345 347 
#  1   2   1   1   1   1   1   1   2   1   1   2   2   1   1   1   1   2   2  11   9   5   1   3   1   6   3   3   3   2   1   1   1   2   2 
#348 349 356 370 373 
#  2   2   2   1   2 
```
We have so much variation in the amount of sequence lengths because its the ITS sequence we are using, but always check that the sequence length makes sense as weird sequence lengths can result from non-specific merging. The ITS2 region is between 160-390 bps so this all checks out 
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7162488/#:~:text=In%20spite%20of%20its%20short,across%20all%20land%20plants%20(e.

But life is never perfectly easy, we must remove chimeras (i.e. sequences that have been merged but dont reflect biologically species). Chimeras are checked for by looking at ASVs with low abundance that can be made of combinations of highly abundant ASV's. Why and how this happens is well explained here
https://forum.qiime2.org/t/dada2-merging-plates-and-chimera-removal/21030/4
```{r}
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
#Identified 41 bimeras out of 168 input sequences.

dim(seqtab.nochim)
#[1]   1 127

```
41 chimeric sequences! This is 24.45% of the total amount of ASV's!! 
```{r}
sum(seqtab.nochim)/sum(seqtab)
#[1] 0.9531595
```
However, removing these samples only resulted in a 4% loss of reads in total! These are normal numbers, and anything under 25% is no need for a second thought. Chimeras can be more or less of a sample based on the specific experimental procedures

Now lets do a whole look through all the reads throughout the pipeline
```{r}
getN <- function(x) sum(getUniques(x))
track <- cbind(out, getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)



#     input filtered denoisedF denoisedR merged nonchim
#LW01 61606    38621     38058     38045  27156   25884

```
PS. I modified the above code to work on a single sample. you'll have to change it for multiple samples, see the its big data tutorial





Then lets assign taxonomy!
first we need to download the ITS database from UNITE 
https://unite.ut.ee/repository.php

you'll want to download one of the general fasta releases. The differences between these files isn't well explained. There seems to be two fungi and two eukaryote databases to download

I have the beans on the HPC to do what I want, so I'll be using the slightly larger database, and perhaps thats how you should make your decision aswell, but if you have strong opinions tell me.
```{r}
unite.ref <- "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\UNITE_database.fastq.gz"  # CHANGE ME to location on your machine
taxa <- assignTaxonomy(seqtab.nochim, #Sequnces
                       unite.ref, #databse
                       multithread = FALSE, 
                       tryRC = TRUE) #try reverse compliment of sequences
```
My baby computer doesnt have the stuffs to process even one sample! But alas, we can see what it theoretically would look like with this 
```{r}
taxa.print <- taxa  # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)
```

Then you can take your data and save it! Your seqtab.nochim, and taxa are the important files. seqtab.nochim is your otu_table, and taxa is your tax_table
to save...
```{r}
saveRDS(taxa, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\taxa.rds" )#add your file location here
saveRDS(seqtab.nochim, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\seqtabnochim.rds" )#add your file location here
```

Then to make a phyloseq object later on...
```{r}
library(phyloseq)
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE), tax_table(taxa))
saveRDS(ps, file = "C:\\Users\\angus\\OneDrive\\Documents\\Story\\lisatestdataset\\ps.rds" )#add your file location here

```




