BIOMIG

Dependencies

guppy_basecaller
filtlong
minimap2
samtools
pgap
blastn

Basecalling

In this experiment, we used MinION Mk1B sequencing device with FLO-MIN106 flow cell and SQK-RAD004 kit. Be careful about the configuration file because it is different for different flow cell and kits. Sequencing output created 1.63 GB data with 815.22 k reads and 368.15 Mb bases. The estimated N50 was 623. Guppy was used as basecaller with the command below:

guppy_basecaller --input_path fast5 --save_path basecalled_output --config dna_r9.4.1_450bps_hac.cfg --cpu_threads_per_caller 24 --num_callers 12

Then we need to combine all fastq outputs to the single file:

cd basecalled_output
cat *.fastq > combined.fastq

Filtering

Filtering is a crucial step before starting the analysis because there will be low quality reads and very short reads. There are different needs for different projects, so determining the filtering factors is not easy and sharp. In our analysis, we are looking for longer reads than 1-5 kb to capture plasmids or tranposon activity, and the quality is not main focus. Therefore, we just filtered reads longer than 5 kb.

filtlong --min_length 5000 combined.fastq | gzip > filtered_5kb.fastq

Then we can start the mapping to see what is going on there :)

Mapping

Our aim is to find reads that matches with the reference query sequences which are (oxyBAC gene), full genome of a strain (VD2) and a plasmid (pMAT). So, we mapped our filtered long reads to these three reference sequence with minimap2.

minimap2 -ax map-ont query.fasta filtered_5kb.fastq > aln.sam

The output of minimap2 is sam file, so we used samtools for sorting and eliminating unmapped reads with command below. Bam file format has same information with sam file, it is just compressed version of sam.

samtools view -S -b aln.sam > bamfile.bam

samtools sort bamfile.bam -o sorted.bam

samtools view -b -F 4 sorted.bam > mapped.bam

In the output file, there are just mapped alignments. Then, we created another fastq file which has just mapped reads with the command below:

samtools bam2fq mapped.bam > mapped.fastq

For the sake of ease (it doesn’t create reliable outputs, but gives and intuition to see what is in there), we converted fastq to fasta with an one line command (Be careful!, Nanopore reads mostly have indels and errors, so without polishing using this data directly as fasta is not good):

sed -n '1~4s/^@/>/p;2~4p' mapped.fastq > mapped.fasta

Indexing

samtools index mapped.bam

Seeing the alignment

COLUMNS=1000 samtools tview mapped.bam reference.fasta -d H > a.html

Functional Analysis

Conversion to the fasta is making easier the functional analysis although it has some errors. We first annotated the sequences with PGAP pipeline:

./pgap.py -r -o out input.yaml

Then, we made blast to see coordinates:

makeblastdb -in reference.fasta -dbtype nucl -out reference_db

blastn -db reference_db -query mapped.fasta -out blast_out.txt -outfmt 6

It can be seen that this step is very similar to the mapping step, actually, yes. It will be good to see similar things. Therefore, blasting is not necessary because we made alignment already.

Examining Nanopore Data

Basecalling, Filtering, Mapping, Aligning and Analyzing the Nanopore Long Read Data

Dependencies

Basecalling

Filtering

Mapping

Functional Analysis

Polishing the Reads

Reviewing the Outputs