[GE users] Sample Grid Applications

Chris Dagdigian dag at sonsorol.org
Tue Apr 12 15:31:27 BST 2005

Yep. If you want an example application from the life sciences then the 
BLAST suite of algorithms is the stereotypical app for people who do 
bioinformatics or are interested in sequence comparison searches.

The NCBI version of the binary is called "blastall" and it is used to 
answer the following biology question:

  "I have some protein or DNA sequences that I want to compare to a 
database of known protein or DNA sequences. I wish to know if there are 
any statistically *and biologically* significant homologies (matches) 
between my query sequence(s) and the target database(s)."

This is not simple pattern matching.  The algorithms have to account for 
evolutionary divergence and sequencing/lab errors. Practically speaking 
this means the program has to find matches between sequence patterns 
that may have gaps, insertions or deletions as well as bad data. Tough 
problem but it boils down to methods long known by CS and statistics folk.
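
To make the "gaps, insertions or deletions" point concrete, here is a toy 
sketch of gapped local alignment scoring (Smith-Waterman style dynamic 
programming) done with awk from the shell. This is NOT what blastall does 
internally; BLAST uses much faster heuristics. The function name and the 
scoring values (match +2, mismatch -1, gap -2) are my own arbitrary 
assumptions, picked only for illustration.

```shell
#!/bin/sh
# Toy gapped local-alignment scorer (Smith-Waterman style).
# Illustration only: scoring values are arbitrary assumptions, and this
# is not the heuristic that BLAST itself uses.
local_align_score() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        match_s = 2; mismatch_s = -1; gap_s = -2
        n = length(a); m = length(b); best = 0
        # First row and column are zero: an alignment may start anywhere.
        for (i = 0; i <= n; i++) H[i, 0] = 0
        for (j = 0; j <= m; j++) H[0, j] = 0
        for (i = 1; i <= n; i++) {
            for (j = 1; j <= m; j++) {
                s = (substr(a, i, 1) == substr(b, j, 1)) ? match_s : mismatch_s
                h = H[i-1, j-1] + s                                # match or mismatch
                if (H[i-1, j] + gap_s > h) h = H[i-1, j] + gap_s   # gap in b
                if (H[i, j-1] + gap_s > h) h = H[i, j-1] + gap_s   # gap in a
                if (h < 0) h = 0                                   # local: never go negative
                H[i, j] = h
                if (h > best) best = h
            }
        }
        print best
    }'
}

# Identical sequences, then the same sequence with one base deleted.
local_align_score ACGTACGT ACGTACGT
local_align_score ACGTACGT ACGTCGT
```

The second call still scores well because the alignment is allowed to 
open a gap where the base went missing, which plain pattern matching 
cannot do.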

BLAST is a hardcore consumer of physical memory and after that it is 
performance bound by the speed of your storage system (how fast can your 
system stream the human genome from disk through your system bus). 
Biologists store sequence databases as massive textfiles with associated 
binary lookup indices (often confusing non-biologists who hear 
'database' and think RDBMS).
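
For anyone who has not seen one of these "databases", here is a rough 
sketch of what the text side looks like. The sequence names and bases 
are made up, and the helper function names are hypothetical. The 
formatdb call (the legacy NCBI indexing tool that ships alongside 
blastall) only runs if it happens to be installed.

```shell
#!/bin/sh
# Write a tiny nucleotide FASTA file; names and bases are invented.
make_toy_fasta() {
    cat > "$1" <<'EOF'
>toy_seq_1 hypothetical example record
ACGTACGTTAGCTAGCTAGGATCC
>toy_seq_2 hypothetical example record
TTGACCGATAGGCTAACGT
EOF
}

# Every FASTA record starts with a ">" header line, so counting
# those lines counts the sequences.
count_fasta_records() {
    grep -c '^>' "$1"
}

make_toy_fasta toy.fasta
echo "records: $(count_fasta_records toy.fasta)"

# Build the binary lookup indices next to the text file, if possible.
# -p F tells formatdb the input is nucleotide rather than protein.
if command -v formatdb >/dev/null 2>&1; then
    formatdb -i toy.fasta -p F
fi
```

The real databases are the same idea, just many gigabytes of it.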

This makes BLAST a good test if you want to run a memory- and I/O-bound 
application on your cluster. Large sequence databases also tend to break 
the 2GB filesize limit so back in the day it was a good test of Linux's 
ability to handle largefile support. Not as big of a deal now though.

The blastall binary contains several algorithms for "dna vs dna", 
"translated dna vs protein" etc. etc. Look for info on 
"blastn", "blastx", "blastp", "tblastx", etc.

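As a quick cheat sheet for which program name goes with which kind of 
search, here is a small sketch. The function name and the "nucl"/"prot" 
labels are my own hypothetical conventions; the program names are the 
ones passed to blastall with -p.

```shell
#!/bin/sh
# Map (query type, database type) to a blastall program name.
# Types are "nucl" or "prot". An optional third argument "yes" asks for
# a nucleotide vs nucleotide search compared at the translated level.
pick_blast_program() {
    query=$1
    db=$2
    translated=${3:-no}
    case "$query:$db:$translated" in
        nucl:nucl:no)  echo blastn ;;   # dna vs dna
        prot:prot:*)   echo blastp ;;   # protein vs protein
        nucl:prot:*)   echo blastx ;;   # translated dna query vs protein db
        prot:nucl:*)   echo tblastn ;;  # protein query vs translated dna db
        nucl:nucl:yes) echo tblastx ;;  # translated dna vs translated dna
        *) echo "unknown combination: $query vs $db" >&2
           return 1 ;;
    esac
}

pick_blast_program nucl nucl
pick_blast_program nucl prot
```
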
The simple usage for searching a dna sequence against a dna (nucleotide) 
database is this:

  $ blastall -p blastn -i ./my-input-sequences.fasta -d /data/nt

("nt" is a popular freely available, very large database of GenBank 
known DNA sequences. Its protein counterpart, "nr", is the one in which 
duplicate/redundant sequences have been removed. These databases are 
updated nightly as data from individual researchers and large sequencing 
labs are fed into the curation system. )
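
Since the original question was about batch jobs, here is a minimal 
sketch of wrapping that kind of blastall command in a Grid Engine job 
script. The function name, job name, log and result file names, and the 
/data/toydb path are all hypothetical placeholders; adjust them for your 
site. The script is only written out, not submitted.

```shell
#!/bin/sh
# Write an SGE job script that wraps a blastall nucleotide search.
# All file names and the job name below are placeholder assumptions.
write_blast_job() {
    query=$1
    db=$2
    script=$3
    cat > "$script" <<EOF
#!/bin/sh
#\$ -S /bin/sh
#\$ -cwd
#\$ -N blast_demo
#\$ -j y
#\$ -o blast_demo.log
blastall -p blastn -i "$query" -d "$db" -o blast_demo.out
EOF
}

write_blast_job ./my-input-sequences.fasta /data/toydb blast_job.sh
cat blast_job.sh
# Submit by hand once the paths are right:  qsub blast_job.sh
```

The "#$" lines are read by qsub as embedded options: run from the 
current directory, name the job, and merge stderr into a single log. 
The -o on the blastall line is blastall's own flag for its report file.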

You can get blast from here:


You can get free pre-built binary genome/dna/protein databases here:


The raw text files used to build the blast databases are available in a 
format called FASTA here:


{ Yep. US government sites. The NCBI is one of the places where I'm glad 
my tax dollars go ... }


Ron Chen wrote:

> There are many applications that can be used under SGE.
> How about bioinformatics applications such as BLAST?
> People use it for DNA sequencing everyday.
> However, I don't know too much about blast. Chris may be able to
> provide more detail. Nevertheless, google is always your friend :)
>
>  -Ron
>
> --- Arati Kadav <aratik at cse.iitk.ac.in> wrote:
>>  I needed some sample applications (basically batch jobs) that can be
>> used over grids. This I require to demonstrate usability of Grids. I
>> want these application binaries and want them to be as simple as
>> possible (minimum dependencies) but essentially solve some problems in
>> a particular domain.  I want standalone binaries of the nature that
>> work on input file and produces output. Also if there are more than one
>> binaries in the same domain so that I can show that output of one is
>> fed as input to other, it will be really helpful.
>> Any guidance as to from where to take them will be helpful. If anyone
>> of you have such applications that can be used for this purpose then if
>> you share their binaries, it will be helpful for me.
>> With Best Regards,
> ---------------------------------------------------------------------
>>To unsubscribe, e-mail:
>>users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail:
>>users-help at gridengine.sunsource.net

Chris Dagdigian, <dag at sonsorol.org>
BioTeam  - Independent life science IT & informatics consulting
Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E iChat/AIM: bioteamdag  Web: http://bioteam.net


More information about the gridengine-users mailing list