Dependencies
guppy_basecaller
filtlong
minimap2
samtools
pgap
blastn
Basecalling
In this experiment, we used MinION Mk1B sequencing device with FLO-MIN106 flow cell and SQK-RAD004 kit. Be careful about the configuration file because it is different for different flow cell and kits. Sequencing output created 1.63 GB data with 815.22 k reads and 368.15 Mb bases. The estimated N50 was 623. Guppy was used as basecaller with the command below:
guppy_basecaller --input_path fast5 --save_path basecalled_output --config dna_r9.4.1_450bps_hac.cfg --cpu_threads_per_caller 24 --num_callers 12
Then we need to combine all fastq outputs to the single file:
cd basecalled_output
cat *.fastq > combined.fastq
Filtering
Filtering is a crucial step before starting the analysis because there will be low quality reads and very short reads. There are different needs for different projects, so determining the filtering factors is not easy and sharp. In our analysis, we are looking for longer reads than 1-5 kb to capture plasmids or tranposon activity, and the quality is not main focus. Therefore, we just filtered reads longer than 5 kb.
filtlong --min_length 5000 combined.fastq | gzip > filtered_5kb.fastq
Then we can start the mapping to see what is going on there :)
Mapping
Our aim is to find reads that matches with the reference query sequences which are (oxyBAC gene), full genome of a strain (VD2) and a plasmid (pMAT). So, we mapped our filtered long reads to these three reference sequence with minimap2.
minimap2 -ax map-ont query.fasta filtered_5kb.fastq > aln.sam
The output of minimap2 is sam file, so we used samtools for sorting and eliminating unmapped reads with command below. Bam file format has same information with sam file, it is just compressed version of sam.
samtools view -S -b aln.sam > bamfile.bam
samtools sort bamfile.bam -o sorted.bam
samtools view -b -F 4 sorted.bam > mapped.bam
In the output file, there are just mapped alignments. Then, we created another fastq file which has just mapped reads with the command below:
samtools bam2fq mapped.bam > mapped.fastq
For the sake of ease (it doesn’t create reliable outputs, but gives and intuition to see what is in there), we converted fastq to fasta with an one line command (Be careful!, Nanopore reads mostly have indels and errors, so without polishing using this data directly as fasta is not good):
sed -n '1~4s/^@/>/p;2~4p' mapped.fastq > mapped.fasta
Indexing
samtools index mapped.bam
Seeing the alignment
COLUMNS=1000 samtools tview mapped.bam reference.fasta -d H > a.html
Functional Analysis
Conversion to the fasta is making easier the functional analysis although it has some errors. We first annotated the sequences with PGAP pipeline:
./pgap.py -r -o out input.yaml
Then, we made blast to see coordinates:
makeblastdb -in reference.fasta -dbtype nucl -out reference_db
blastn -db reference_db -query mapped.fasta -out blast_out.txt -outfmt 6
It can be seen that this step is very similar to the mapping step, actually, yes. It will be good to see similar things. Therefore, blasting is not necessary because we made alignment already.