Dependencies
Installing Blast
You need to install Blast in your environment. Also, you can use executables of Blast directly. For this tutorial, you need to have blastn and makeblastdb. Open the terminal and test your blast installation:
./blastn -h
./makeblastdb -h
Importing Libraries
import subprocess
import os
import pandas as pd
import shutil
import logging
Writing the Blast functions
Logging is not functional for now, but it can be useful for later if needed. The name of our class is OxyBlast and we wrote the init function like below:
class OxyBlast:
def __init__(self, log):
self.log = log
Before making blast queries, it is needed to create database. So, we need to run makeblastdb executable. We use Python subprocess module for this purpose. Also, we use gunzip for unzipping the sequences (If you want to work with uncompressed files, you can remove gunzip command)
def make_blast_db(self, db_input, db_output):
args = f"gunzip -c {db_input} | ./makeblastdb -in - -title {db_input} -dbtype nucl -parse_seqids -out {db_output} "
my_process = subprocess.run(args, shell=True, executable='/bin/bash', text=True, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
Second function is for query, we use -outfmt 6 for output format. It creates tabular output with column values like below:
#qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
def run_blast(self, query, database, output):
args = f"./blastn -db {database} -query {query} -out {output} -outfmt 6"
my_process = subprocess.run(args, shell=True, executable='/bin/bash', text=True, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
Necessary output files can be generated with this functions, now it is time to write function to parse and filter outputs. In this code, it eliminate if percent idendity is lower than 80.0 and e_value is less than 0.1. You can change this or you can give them as arguments to the function. Also, this function returns a Python dictionary with start, end, contig_name and search keys. If search is False, then there is no significant similarity. If it is True, then start and end will be the start and end values of similarty in the contig which name is given in contig_name.
def is_gene_exist(self, blast_result):
return_dict = {
"start" : 0,
"end" : 0,
"contig_name": None,
"search": False
}
try:
blast_result = pd.read_table(blast_result, header=None)
except pd.errors.EmptyDataError:
self.log.debug(" is empty and has been skipped. %s" % blast_result)
return return_dict
else:
default_outfmt6_cols = 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore'.strip().split(' ')
blast_result.columns = default_outfmt6_cols
df_filtered = blast_result[(blast_result['pident'] >= 80.0) & (blast_result['evalue'] <= 0.1)]
self.log.info(df_filtered)
if len(df_filtered) > 0:
return_dict = {
"start" : int(df_filtered.iloc[0]['sstart']),
"end" : int(df_filtered.iloc[0]['send']),
"contig_name": df_filtered.iloc[0]['sseqid'],
"search": search_value
}
return return_dict
Now, we need a driver function to use these functions:
def blast_driver(self, assembly_id, assembly_path, blast_query, md5_path):
db_output = os.path.join("blast_db", assembly_id, assembly_id)
db_dir = os.path.join("blast_db", assembly_id)
db_input = assembly_path
result_path = os.path.join("blast_result", assembly_id + "_" + os.path.basename(blast_query) + ".out")
if not os.path.exists("blast_result"):
os.mkdir("blast_result")
if not os.path.exists(blast_query):
self.log.warning('No available Blast Query file')
return False
try:
self.make_blast_db(db_input, db_output)
self.run_blast(blast_query, db_output, result_path)
except Exception as e:
self.log.warning('Blast Error')
raise UserWarning('Blast Error')
search_d_value = self.is_gene_exist(result_path)
if search_d_value["search"]:
self.log.info('Gene found in %s' % assembly_id)
self.delete_blast_db(db_dir)
return search_d_value
else:
if os.path.exists(result_path):
os.remove(result_path)
self.log.debug('Deleting the assembly')
# delete assembly file and md5 file
os.remove(md5_path)
os.remove(assembly_path)
# update searched txt file
self.delete_blast_db(db_dir)
return search_d_value
Now, we can use this class for blast:
assembly_id="Your sequence id"
sq_path="Your sequence path"
blast_query="Your query path"
md5_path="Your sequence md5 path"
logging.basicConfig(filename='test.log', level=logging.INFO, format='%(asctime)s:%(levelname)s:%(message)s')
log = logging.getLogger(__name__)
oxy_blast = OxyBlast(log)
blast_res = oxy_blast.blast_driver(assembly_id, sq_path, blast_query, md5_path)
if blast_res["search"]:
print("There is similarty in ", blast_res["contig_name"], " between ", blast_res["start"], " - ", blast_res["end"])