Dependencies

blast
python
pandas
logging

Installing Blast

You need to install Blast in your environment. Also, you can use executables of Blast directly. For this tutorial, you need to have blastn and makeblastdb. Open the terminal and test your blast installation:

./blastn -h
./makeblastdb -h

Importing Libraries

import subprocess
import os
import pandas as pd
import shutil
import logging

Writing the Blast functions

Logging is not functional for now, but it can be useful for later if needed. The name of our class is OxyBlast and we wrote the init function like below:

class OxyBlast:

    def __init__(self, log):
        self.log = log

Before making blast queries, it is needed to create database. So, we need to run makeblastdb executable. We use Python subprocess module for this purpose. Also, we use gunzip for unzipping the sequences (If you want to work with uncompressed files, you can remove gunzip command)

def make_blast_db(self, db_input, db_output):
        args = f"gunzip -c {db_input} | ./makeblastdb -in - -title {db_input} -dbtype nucl -parse_seqids -out {db_output} "
        my_process = subprocess.run(args, shell=True, executable='/bin/bash', text=True, check=True,  stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

Second function is for query, we use -outfmt 6 for output format. It creates tabular output with column values like below:

#qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
def run_blast(self, query, database, output):
        args = f"./blastn -db {database} -query {query} -out {output} -outfmt 6"
        my_process = subprocess.run(args, shell=True, executable='/bin/bash', text=True, check=True,  stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

Necessary output files can be generated with this functions, now it is time to write function to parse and filter outputs. In this code, it eliminate if percent idendity is lower than 80.0 and e_value is less than 0.1. You can change this or you can give them as arguments to the function. Also, this function returns a Python dictionary with start, end, contig_name and search keys. If search is False, then there is no significant similarity. If it is True, then start and end will be the start and end values of similarty in the contig which name is given in contig_name.

def is_gene_exist(self, blast_result):
        return_dict = {
                    "start" : 0,
                    "end" : 0,
                    "contig_name": None,
                    "search": False
        }
        
        try:
            blast_result = pd.read_table(blast_result, header=None)
        except pd.errors.EmptyDataError:
            self.log.debug(" is empty and has been skipped. %s" % blast_result)
            return return_dict

        else:
            default_outfmt6_cols = 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'.strip().split(' ')
            blast_result.columns = default_outfmt6_cols
            df_filtered = blast_result[(blast_result['pident'] >= 80.0) & (blast_result['evalue'] <= 0.1)]
            self.log.info(df_filtered)
            if len(df_filtered) > 0:
                return_dict = {
                    "start" : int(df_filtered.iloc[0]['sstart']),
                    "end" : int(df_filtered.iloc[0]['send']),
                    "contig_name": df_filtered.iloc[0]['sseqid'],
                    "search": search_value
                }
            return return_dict

Now, we need a driver function to use these functions:

def blast_driver(self, assembly_id, assembly_path, blast_query, md5_path):
        db_output = os.path.join("blast_db", assembly_id, assembly_id)
        db_dir = os.path.join("blast_db", assembly_id)
        db_input = assembly_path
        result_path = os.path.join("blast_result", assembly_id + "_" + os.path.basename(blast_query) + ".out")

        if not os.path.exists("blast_result"):
            os.mkdir("blast_result")

        if not os.path.exists(blast_query):
            self.log.warning('No available Blast Query file')
            return False

        try:
            self.make_blast_db(db_input, db_output)
            self.run_blast(blast_query, db_output, result_path)
        except Exception as e:
            self.log.warning('Blast Error')
            raise UserWarning('Blast Error')
        

        search_d_value = self.is_gene_exist(result_path)
        if search_d_value["search"]:
            self.log.info('Gene found in %s' % assembly_id)
            self.delete_blast_db(db_dir)
            return search_d_value
        else:
            if os.path.exists(result_path):
                os.remove(result_path)
            self.log.debug('Deleting the assembly')
            # delete assembly file and md5 file
            os.remove(md5_path)
            os.remove(assembly_path)
            # update searched txt file

            self.delete_blast_db(db_dir)
            return search_d_value

Now, we can use this class for blast:

assembly_id="Your sequence id"
sq_path="Your sequence path"
blast_query="Your query path"
md5_path="Your sequence md5 path"

logging.basicConfig(filename='test.log', level=logging.INFO, format='%(asctime)s:%(levelname)s:%(message)s')
log = logging.getLogger(__name__)
oxy_blast = OxyBlast(log)
blast_res = oxy_blast.blast_driver(assembly_id, sq_path, blast_query, md5_path)
if blast_res["search"]:
    print("There is similarty in ", blast_res["contig_name"], " between ", blast_res["start"], " - ", blast_res["end"])