[GE users] ncbi blast on SGE

foisysdiploide sylvain.foisy at diploide.net
Wed May 26 14:12:52 BST 2010


On 25/05/10 22:41, "[NAME]" <[ADDRESS]> wrote:

> hmmm alright, masters, i give up the wrong expectation on SGE/blast. very
> appreciated you sharing your knowledge and exp.

Actually, SGE and blast work very well together. You just need to know how
to make them shake hands ;-)

> the reason we try SGE & cluster is to submit millions of sequence for
> blasting. it will take a few month if we use normal single cpu workstation.

It depends on which blast program you are using and which DB you want to
blast against. A million blastp runs against UniProt would probably not take
more than a few weeks on a 12-core cluster. A typical blastp run needs about
15-20 seconds, but a complex tblastn/tblastx run might need 12-15 minutes...
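
As a back-of-the-envelope check on those figures, here is a small Python
sketch. It assumes perfect scaling across cores and no queueing or I/O
overhead, which a real cluster will not give you, so treat the result as a
lower bound. The function name is made up for this example.

```python
def wall_clock_days(n_queries, secs_per_query, cores):
    """Idealized wall-clock time in days.

    Assumes every core is busy all the time and that queries are
    independent: no scheduler latency, no I/O stalls, no swapping.
    """
    total_cpu_seconds = n_queries * secs_per_query
    return total_cpu_seconds / cores / 86400.0  # 86400 seconds per day


if __name__ == "__main__":
    # Per-query times are the rough estimates quoted above, not measurements.
    cases = [("blastp at 15 s", 15), ("blastp at 20 s", 20),
             ("tblastn/tblastx at 12 min", 12 * 60)]
    for label, secs in cases:
        days = wall_clock_days(1_000_000, secs, 12)
        print(f"{label}: {days:.1f} days on 12 cores")
```

Under these assumptions a million blastp runs come out at roughly 14 to 19
days on 12 cores, while a million queries at 12 minutes each would need close
to two years, so the choice of program dominates everything else.
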
> i think i should take the way to blastall with "-a $NSLOTS"(this para should
> be wonderful). so SGE only take my job to run on a --single-- exec node (with
> NSLOTS cpus/cores), i should do a small script to cut my million seq to a few
> thousands seq pieces and submit quite a few times to different nodes, am i
> going to right way now?

See reuti's mail about that.

> when we are talking about thousands of sequence blasting, i test against small
> swissprot db (only about 200MB), my memory is 1GB/3.5GB, it take up only like
> 300MB RAM. even with the 7GB NR db(i format it by 1G volume), only 600MB RAM
> is used. according blast manual, it should read the whole db into memory if
> there is enough RAM. something wrong? so in my scenario, the CPUs seems to be
> the only bottleneck?

Like I wrote earlier, blast performance is bound by RAM and I/O. The worst
thing that can happen is to start running off swap; that will literally
kill your throughput. The key here is to format your DB the proper way. Look
at the material about using formatdb's -v flag to chunk a large database
into pieces that can fit in memory. One other point is that you might want
all databases to live locally on each node instead of distributing them
over NFS, to minimize I/O bottlenecks. Unless you have 10Gb Ethernet or
InfiniBand, that is ;-)
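
To make the two points concrete, here is a hedged Python sketch. The first
function only builds a formatdb command line; the flags (-i, -n, -p, -v) are
as I remember them from the legacy NCBI toolkit, with -v giving the volume
size in millions of letters, so check them against formatdb's own help. The
second copies a formatted DB from a shared directory to local disk. All
paths, the volume size and the function names are assumptions for
illustration.

```python
import glob
import os
import shutil


def formatdb_command(fasta, db_name, volume_millions, protein=True):
    """Build (but do not run) a formatdb call that splits the DB into volumes.

    volume_millions is passed to -v; pick it so one volume fits in RAM.
    """
    return ["formatdb", "-i", fasta, "-n", db_name,
            "-p", "T" if protein else "F",
            "-v", str(volume_millions)]


def stage_db_locally(shared_dir, local_dir, db_name):
    """Copy every file named db_name.* from shared_dir into local_dir.

    Files whose size already matches are skipped, so repeated jobs on the
    same node do not hit NFS again. Returns the list of files copied.
    """
    os.makedirs(local_dir, exist_ok=True)
    pattern = os.path.join(glob.escape(shared_dir), db_name + ".*")
    copied = []
    for src in sorted(glob.glob(pattern)):
        dst = os.path.join(local_dir, os.path.basename(src))
        if not os.path.exists(dst) or os.path.getsize(dst) != os.path.getsize(src):
            shutil.copy2(src, dst)
            copied.append(dst)
    return copied


if __name__ == "__main__":
    # Hypothetical file name and volume size; tune the size to your nodes.
    print(" ".join(formatdb_command("nr.fasta", "nr", 500)))
```

Comparing file sizes is a cheap staleness check, not a safe one; if the
shared copy gets rebuilt often, a checksum or a version file would be better.
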
> so what i need to do, find as many CPU as possible, add them into my cluster,
> feed them the same number of jobs. correct?

First look at maximizing blast performance on a single node with the
threads, DB pieces and DB formatting. After that, integrate with SGE.
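
For the SGE step, one possible layout (a sketch, not the only way to do it)
is to cut the query file into chunks and run one chunk per array task, with
blastall's -a set from the NSLOTS variable that SGE exports to the job. The
chunk size, file names and function names below are assumptions; the
blastall flags (-p, -d, -i, -o, -a) are the legacy ones.

```python
import os


def split_fasta(lines, seqs_per_chunk):
    """Group FASTA lines into chunks of at most seqs_per_chunk records."""
    chunks, current, count = [], [], 0
    for line in lines:
        if line.startswith(">"):
            if count == seqs_per_chunk:
                # Current chunk is full: close it before starting a new record.
                chunks.append(current)
                current, count = [], 0
            count += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks


def write_chunks(chunks, prefix):
    """Write chunks to prefix.1.fa, prefix.2.fa, ... and return the paths.

    Numbering starts at 1 so that it lines up with SGE_TASK_ID.
    """
    paths = []
    for i, chunk in enumerate(chunks, start=1):
        path = f"{prefix}.{i}.fa"
        with open(path, "w") as fh:
            fh.writelines(line.rstrip("\n") + "\n" for line in chunk)
        paths.append(path)
    return paths


def blastall_command(chunk_path, db, program="blastp", env=None):
    """Build the blastall call for one chunk, using NSLOTS threads."""
    env = os.environ if env is None else env
    threads = env.get("NSLOTS", "1")  # falls back to 1 outside of SGE
    return ["blastall", "-p", program, "-d", db,
            "-i", chunk_path, "-o", chunk_path + ".out",
            "-a", str(threads)]
```

Inside the job script, the chunk to process would be picked with
SGE_TASK_ID (for example queries.<SGE_TASK_ID>.fa), and the job submitted as
an array job with qsub -t plus a parallel environment that keeps all slots
on one node; the PE name depends on your site's configuration.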

Best regards


