MPIBLAST
Description: Parallel implementation of NCBI BLAST
SHARCNET Package information: see MPIBLAST software page in web portal
Full list of SHARCNET supported software
GETTING STARTED
The mpiblast module must be loaded manually before submitting any mpiblast jobs. The two examples below demonstrate how to set up and submit jobs to the mpi queue on those SHARCNET clusters where mpiblast is installed (as shown in the Availability Table on the mpiblast software page https://www.sharcnet.ca/my/software/show/55).
module unload openmpi intel; module load intel/11.0.083 openmpi/intel/1.4.2 mpiblast/1.6.0
EXAMPLE1 - DROSOPH
Copy the sample problem files (fasta database and input) from the /opt/sharcnet examples directory to a directory under work as shown here. The fasta database used in this example can be obtained as a guest from NCBI at http://www.ncbi.nlm.nih.gov/guide/all/#downloads_ by clicking "FTP: FASTA BLAST Databases".
mkdir -p /work/$USER/samples/mpiblast/test1; rm /work/$USER/samples/mpiblast/test1/*
cd /work/$USER/samples/mpiblast/test1
cp /opt/sharcnet/mpiblast/1.6.0/examples/drosoph.in drosoph.in
gunzip -c /opt/sharcnet/mpiblast/1.6.0/examples/drosoph.nt.gz > drosoph.nt
Create a hidden configuration file that defines a Shared storage location visible to all nodes and a Local storage directory available on each compute node. Replace YourUserName with your username as shown here:
[username@orc-login1:/work/roberpj/samples/mpiblast/test1] vi .ncbirc
[BLAST]
BLASTDB=/scratch/YourUserName/mpiblasttest1
BLASTMAT=/work/YourUserName/samples/mpiblast/test1
[mpiBLAST]
Shared=/scratch/YourUserName/mpiblasttest1
Local=/tmp
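As an alternative to editing the file by hand, the configuration above can be generated from a small shell function so that the username is substituted automatically. This is only a sketch: write_ncbirc is a hypothetical helper name, not part of mpiBLAST, and it assumes the same four settings shown above are all that is needed.

```shell
#!/bin/bash
# write_ncbirc is a hypothetical helper (not an mpiBLAST tool).
# It writes a .ncbirc with the same keys as the hand-edited file above.
write_ncbirc() {
    workdir="$1"            # directory the job is submitted from
    shared="$2"             # fragment directory visible to all nodes
    localdir="${3:-/tmp}"   # node-local storage, /tmp by default
    cat > "$workdir/.ncbirc" <<EOF
[BLAST]
BLASTDB=$shared
BLASTMAT=$workdir
[mpiBLAST]
Shared=$shared
Local=$localdir
EOF
}

# Example call for this walkthrough (run on the cluster):
# write_ncbirc "/work/$USER/samples/mpiblast/test1" "/scratch/$USER/mpiblasttest1"
```

Because the shell expands $USER when the function is called, the resulting file contains literal paths, which is what the hand-edited version requires.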
Format the database into 8 fragments under the following shared scratch directory location:
mkdir /scratch/$USER/mpiblasttest1; rm -f /scratch/$USER/mpiblasttest1/*
cd /work/$USER/samples/mpiblast/test1
mpiformatdb -N 8 -i drosoph.nt -o T -p F -n /scratch/$USER/mpiblasttest1
For example ...
[roberpj@hnd20:/work/roberpj/samples/mpiblast/test1] mpiformatdb -N 8 -i drosoph.nt -o T -p F -n /scratch/roberpj/mpiblasttest1
Reading input file
Done, read 1534943 lines
Breaking drosoph.nt into 8 fragments
Executing: formatdb -p F -i drosoph.nt -N 8 -n /scratch/roberpj/mpiblasttest1/drosoph.nt -o T
Created 8 fragments.
<<< Please make sure the formatted database fragments are placed in /scratch/roberpj/mpiblasttest1/ before executing mpiblast. >>>
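Since mpiformatdb asks you to confirm that the fragments are in the shared directory, a quick count before submitting can catch a wrong path early. The sketch below is illustrative only: count_fragments is a hypothetical helper, and the file-name pattern (one numbered .nsq sequence file per nucleotide fragment) is an assumption that may need adjusting for your installation.

```shell
#!/bin/bash
# count_fragments is a hypothetical helper (not an mpiBLAST tool).
# Assumption: each nucleotide fragment has one file named
# <db>.<digits>.nsq in the shared directory; change the pattern
# if your fragments are named differently.
count_fragments() {
    dir="$1"
    db="$2"
    n=0
    for f in "$dir/$db".*[0-9].nsq; do
        # The glob stays literal when nothing matches, so test for existence.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example call for this walkthrough (run on the cluster):
# count_fragments "/scratch/$USER/mpiblasttest1" drosoph.nt
```

If the naming assumption holds, the count should match the value passed to -N.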
Submit a short job with a 15m time limit on 10 cores (8 plus 2). If all goes well, the output results will be written to drosoph.out and the execution time will appear in ofile%J, where %J is the job number:
cd /work/$USER/samples/mpiblast/test1
sqsub -r 15m -n 10 -q mpi --mpp=1G -o ofile%J mpiblast -d drosoph.nt -i drosoph.in -p blastn -o drosoph.out --use-parallel-write --use-virtual-frags
For example ...
[roberpj@hnd20:/work/roberpj/samples/mpiblast/test1] sqsub -r 15m -n 10 -q mpi --mpp=1G -o ofile%J mpiblast -d drosoph.nt -i drosoph.in -p blastn -o drosoph.out --use-parallel-write --use-virtual-frags
submitted as jobid 6966896
[roberpj@hnd20:/work/roberpj/samples/mpiblast/test1] cat ofile6966896.hnd50
Total Execution Time: 1.80031
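The core count in the submission above follows the "fragments plus 2" pattern used throughout this page (8 fragments with -n 10 here, 16 fragments with -n 18 in Example2). A small sketch of that arithmetic is given below; mpiblast_ranks is a hypothetical helper name, and the offset of 2 is taken from the examples on this page rather than from any tool output.

```shell
#!/bin/bash
# mpiblast_ranks is a hypothetical helper (not an mpiBLAST or sqsub tool).
# Assumption: the job needs 2 ranks beyond the number of fragments,
# as in the examples on this page.
mpiblast_ranks() {
    echo $(( $1 + 2 ))
}

FRAGMENTS=8
NPROCS=$(mpiblast_ranks "$FRAGMENTS")

# Dry run: print the submission command so it can be reviewed first.
echo "sqsub -r 15m -n $NPROCS -q mpi --mpp=1G -o ofile%J" \
     "mpiblast -d drosoph.nt -i drosoph.in -p blastn -o drosoph.out" \
     "--use-parallel-write --use-virtual-frags"
```

Printing the command rather than running it keeps the sketch safe to try on a login node; copy the printed line to submit the job.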
When submitting an mpiblast job on a cluster such as goblin that does not have an InfiniBand interconnect, better performance (at least a twofold speedup) will be achieved by running the MPI job on a single compute node. Regular users of non-contributed hardware would typically specify "-n 8" to reflect the maximum number of cores on a single node:
sqsub -r 15m -n 8 -N 1 -q mpi --mpp=4G -o ofile%J mpiblast -d drosoph.nt -i drosoph.in -p blastn -o drosoph.out --use-parallel-write --use-virtual-frags
Sample output computed previously with BLASTN 2.2.15 [Oct-15-2006] is included in /opt/sharcnet/mpiblast/1.6.0/examples/ROSOPH.out for comparison with your newly generated drosoph.out file.
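The comparison described above can be scripted with diff. The following is a minimal sketch; compare_out is a hypothetical helper name, and some differences (for example in version banner lines) may be legitimate given that the reference file was produced by an earlier BLASTN release.

```shell
#!/bin/bash
# compare_out is a hypothetical helper (not an mpiBLAST tool).
# It reports whether two output files are identical and, if not,
# how many lines diff marks as changed on either side.
compare_out() {
    ref="$1"    # previously computed reference output
    new="$2"    # newly generated output
    if diff -q "$ref" "$new" > /dev/null; then
        echo "identical"
    else
        changed=$(diff "$ref" "$new" | grep -c '^[<>]')
        echo "differs: $changed changed lines"
    fi
}

# Example call (run on the cluster, passing the reference file named above):
# compare_out <reference file> drosoph.out
```

When the files differ, running diff directly on the two files shows which lines are affected.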
EXAMPLE2 - UNIGENE
The main purpose of this example is to illustrate some additional options and switches that may be useful for debugging and for dealing with larger databases, as described in detail in the official guide at http://www.mpiblast.org/Docs/Guide. The fasta database used in this example can also be downloaded from http://www.ncbi.nlm.nih.gov/guide/all/#downloads_ as a guest by clicking "FTP: UniGene" then entering the "Homo_sapiens" sub-directory. More information about UniGene alignments can be found at https://cgwb.nci.nih.gov/cgi-bin/hgTrackUi?hgsid=95443&c=chr1&g=uniGene_3 . As with Example1 above, for convenience all required files can simply be copied from the /opt/sharcnet examples subdirectory to work as shown here:
mkdir /work/$USER/samples/mpiblast/test2; rm /work/$USER/samples/mpiblast/test2/*
cd /work/$USER/samples/mpiblast/test2
cp /opt/sharcnet/mpiblast/1.6.0/examples/il2ra.in il2ra.in
gunzip -c /opt/sharcnet/mpiblast/1.6.0/examples/Hs.seq.uniq.gz > Hs.seq.uniq
Create a hidden configuration file using the vi editor to define a Shared storage location between nodes and a Local storage directory available on each compute node as follows, where $USER should be replaced with your username. The Data directory is not yet populated or used in this example and hence can be omitted. If the Local and Shared directories are meant to be the same, replace --copy-via=mpi with --copy-via=none, as demonstrated in the sqsub commands below.
[username@orc-login1:/work/roberpj/samples/mpiblast/test2] vi .ncbirc
[NCBI]
Data=/opt/sharcnet/mpiblast/1.6.0/data
[BLAST]
BLASTDB=/work/$USER/mpiblasttest2
BLASTMAT=/work/$USER/samples/mpiblast/test2
[mpiBLAST]
Shared=/work/$USER/mpiblasttest2
Local=/tmp
Partition the database into 16 fragments under the following work directory location:
mkdir -p /work/$USER/mpiblasttest2; rm -f /work/$USER/mpiblasttest2/*
cd /work/$USER/samples/mpiblast/test2
mpiformatdb -N 16 -i Hs.seq.uniq -o T -p F
Submit a couple of short jobs with a 15m time limit. If all goes well, the output results will be written to biobrew.out and the execution time will appear in the corresponding ofile%J, where %J is the job number as usual:
A) In this job submission fragment files are first copied from work to local /tmp before being used (appropriate if work is slow). Usage of the profile option is also shown in this example:
cd /work/$USER/samples/mpiblast/test2; rm -f oTime*
sqsub -r 15m -n 18 -q mpi -o ofile%J mpiblast --use-parallel-write --copy-via=mpi -d Hs.seq.uniq -i il2ra.in -p blastn -o biobrew.out --time-profile=oTime
B) In this job submission fragment files are used in place on work. Usage of the debug option is also shown in this example:
cd /work/$USER/samples/mpiblast/test2; rm -f oLog*
sqsub -r 15m -n 18 -q mpi -o ofile%J mpiblast --use-parallel-write --copy-via=none -d Hs.seq.uniq -i il2ra.in -p blastn -o biobrew.out --debug=oLog
Finally compare /opt/sharcnet/mpiblast/1.6.0/examples/BIOBREW.out computed previously with BLASTN 2.2.15 [Oct-15-2006] with your newly generated biobrew.out output file to verify the results and submit a ticket if there are any problems!
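To see whether copying fragments to /tmp (job A) or using them in place (job B) was faster, the execution time can be pulled out of each job's ofile. The sketch below assumes the ofile contains a line of the form "Total Execution Time: 1.80031", as in the Example1 output; exec_time is a hypothetical helper name.

```shell
#!/bin/bash
# exec_time is a hypothetical helper (not an mpiBLAST tool).
# Assumption: the job output file contains a line such as
#   Total Execution Time: 1.80031
exec_time() {
    # Split on the colon (and any following spaces), print the value.
    awk -F': *' '/Total Execution Time/ { print $2 }' "$1"
}

# Example use (run in the job directory on the cluster):
# for f in ofile*; do echo "$f $(exec_time "$f")"; done
```

The function prints nothing for a file that lacks the timing line, which can indicate a job that did not finish.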
SUPPORTED PROGRAMS IN MPIBLAST
As described in http://www.mpiblast.org/Docs/FAQ mpiblast supports the standard blast programs http://www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml which are reproduced here for reference:
blastp: Compares an amino acid query sequence against a protein sequence database.
blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
MPIBLAST BINARIES OPTIONS
[roberpj@orc-login1:/opt/sharcnet/mpiblast/1.6.0/bin] ./mpiblast -help
mpiBLAST requires the following options: -d [database] -i [query file] -p [blast program name]
[roberpj@orc-login1:/opt/sharcnet/mpiblast/1.6.0/bin] ./mpiformatdb --help
Executing: formatdb -
formatdb 2.2.20 arguments:
  -t  Title for database file [String] Optional
  -i  Input file(s) for formatting [File In] Optional
  -l  Logfile name: [File Out] Optional
      default = formatdb.log
  -p  Type of file
      T - protein
      F - nucleotide [T/F] Optional
      default = T
  -o  Parse options
      T - True: Parse SeqId and create indexes.
      F - False: Do not parse SeqId. Do not create indexes.
      [T/F] Optional
      default = F
  -a  Input file is database in ASN.1 format (otherwise FASTA is expected)
      T - True,
      F - False.
      [T/F] Optional
      default = F
  -b  ASN.1 database in binary mode
      T - binary,
      F - text mode.
      [T/F] Optional
      default = F
  -e  Input is a Seq-entry [T/F] Optional
      default = F
  -n  Base name for BLAST files [String] Optional
  -v  Database volume size in millions of letters [Integer] Optional
      default = 4000
  -s  Create indexes limited only to accessions - sparse [T/F] Optional
      default = F
  -V  Verbose: check for non-unique string ids in the database [T/F] Optional
      default = F
  -L  Create an alias file with this name
      use the gifile arg (below) if set to calculate db size
      use the BLAST db specified with -i (above) [File Out] Optional
  -F  Gifile (file containing list of gi's) [File In] Optional
  -B  Binary Gifile produced from the Gifile specified above [File Out] Optional
  -T  Taxid file to set the taxonomy ids in ASN.1 deflines [File In] Optional
  -N  Number of database volumes [Integer] Optional
      default = 0
References
o MPIBLAST Homepage
http://www.mpiblast.org/