[GE users] Sample Grid Applications

Chris Dagdigian dag at sonsorol.org
Tue Apr 12 15:38:10 BST 2005

Ack brain dead.

Had a typo in my examples:

  "nr" is the non-redundant *protein* database
  "nt" is the non-redundant *nucleotide* database

So DNA vs DNA would be:

  $ blastall -p blastn -i ./my-input-sequences.fasta -d /data/nt
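
If you want to script this, here is a rough sketch of the nt/nr rule above as a small shell helper. It is only an illustration: the function names blast_db_for and blast_cmd and the DB_DIR variable are made up for this example, and /data is just the path used in my command line. It prints the blastall command rather than running it, so you can check what would be executed first.

```shell
#!/bin/sh
# Map a blastall program to the database type its *target* side needs.
# nt = non-redundant nucleotide, nr = non-redundant protein.
blast_db_for() {
    case "$1" in
        blastn|tblastn|tblastx) echo "nt" ;;   # nucleotide target database
        blastp|blastx)          echo "nr" ;;   # protein target database
        *) echo "unknown program: $1" >&2; return 1 ;;
    esac
}

# Build (but do not run) the blastall command line.
# DB_DIR is a hypothetical setting; it defaults to /data as in the example.
blast_cmd() {
    program=$1
    input=$2
    db=$(blast_db_for "$program") || return 1
    echo "blastall -p $program -i $input -d ${DB_DIR:-/data}/$db"
}

# DNA vs DNA, as in the corrected example
blast_cmd blastn ./my-input-sequences.fasta
```

On a Grid Engine cluster the printed line could go into a small job script and be handed to qsub; how you split the input file across jobs is up to you.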


Chris Dagdigian wrote:

> Yep. If you want an example application from the lifesciences then the 
> BLAST suite of algorithms is the stereotypical app for people who do 
> bioinformatics or are interested in sequence comparison searches.
> The NCBI version of the binary is called "blastall" and it is used to 
> answer the following biology question:
>  "I have some protein or DNA sequences that I want to compare to a 
> database of known protein or DNA sequences. I wish to know if there are 
> any statistically *and biologically* significant homologies (matches) 
> between my query sequence(s) and the target database(s)."
> This is not simple pattern matching.  The algorithms have to account for 
> evolutionary divergence and sequencing/lab errors. Practically speaking 
> this means the program has to find matches between sequence patterns 
> that may have gaps, insertions or deletions as well as bad data. Tough 
> problem but it boils down to methods long known by CS and statistics folk.
> BLAST is a hardcore consumer of physical memory and after that it is 
> performance bound by the speed of your storage system (how fast can your 
> system stream the human genome from disk through your system bus). 
> Biologists store sequence databases as massive textfiles with associated 
> binary lookup indices (often confusing non-biologists who hear 
> 'database' and think RDBMS).
> This is a good test to run if you want to run a memory and I/O bound 
> application on your cluster. Large sequence databases also tend to break 
> the 2GB filesize limit so back in the day it was a good test of Linux's 
> ability to handle largefile support. Not as big of a deal now though.
> The blastall binary contains several algorithms for "dna vs dna", 
> "translated dna vs protein" etc. etc. Look for info on 
> "blastn", "blastx", "blastp", "tblastx", etc.
> The simple usage for searching a dna sequence against a dna (nucleotide) 
> database is this:
>  $ blastall -p blastn -i ./my-input-sequences.fasta -d /data/nr
> ("nr" is a popular freely available, very large database of GenBank 
> known DNA sequences in which duplicate/redundant sequences have been 
> removed. It is updated nightly as data from individual researchers and 
> large sequencing labs are fed into the curation system. )
> You can get blast from here:
> http://ncbi.nlm.nih.gov/BLAST/
> You can get free pre-built binary genome/dna/protein databases here:
> ftp://ftp.ncbi.nlm.nih.gov/blast/db/
> The raw text files used to build the blast databases are available in a 
> format called FASTA here:
> ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
> { Yep. US government sites. The NCBI is one of the places where I'm glad 
> my tax dollars go ... }
> -Chris
> Ron Chen wrote:
>> There are many applications that can be used under
>> SGE.
>> How about bioinformatics applications such as BLAST?
>> People use it for DNA sequencing every day.
>> However, I don't know too much about blast. Chris may
>> be able to provide more detail. Nevertheless, google
>> is always your friend :)
>>  -Ron
>> --- Arati Kadav <aratik at cse.iitk.ac.in> wrote:
>>> Hello,
>>>  I needed some sample applications (basically batch
>>> jobs) that can be used over grids. This I require to demonstrate
>>> usability of Grids. I want these application binaries and want them to be
>>> as simple as possible (minimum dependencies) but essentially
>>> solve some problems in a particular domain.  I want standalone binaries of
>>> the nature that work on input file and produces output. Also if there are
>>> more than one binaries in the same domain so that I can show that
>>> output of one is fed as input to other, it will be really helpful.
>>> Any guidance as to from where to take them will be
>>> helpful. If anyone of you have such applications that can be used for this
>>> purpose then if you share their binaries, it will be helpful for me.
>>> With Best Regards,
