[GE users] SGE 6.2: qsub -sync y option for large number of jobs

reuti reuti at staff.uni-marburg.de
Thu Dec 17 12:51:26 GMT 2009

On 14.12.2009 at 03:25, elauzier wrote:

> Reuti, thanks for the feedback...
> I'll try to be clearer with my inquiries...
> Here are a couple work flows that I am working with...
> (1)
> Simple work flow in the foreground:
> ========================================
> ./setup.sh
> ./do_something.sh
> qsub ... -sync y -t 1-1000 ./fan_out.sh
> ./do_something_else.sh
> ./cleanup.sh
> ========================================
> (2)
> Alternative simple work flow using pure batch:
> ==================================
> qsub ... -N Setup_unique_name ./setup.sh
> qsub ... -t 1-1000 -hold_jid "Setup_unique_name" -N fan_out_unique_name ./fanout.sh
> qsub ... -N do_something_else_unique_name -hold_jid "fan_out_unique_name" ./do_something_else.sh
> qsub ... -N cleanup_unique_name -hold_jid "do_something_else_unique_name" ./cleanup.sh
> =================================
> I guess you can say that the first flow is more of an interactive  
> flow and the second one is a pure batch flow.
> Considering scalability of say 500 people running similar flows at  
> the same time, I would tend to go with (2) especially if the flows  
> are large and long, where (1) can be used for smaller and shorter  
> flows.

Agreed. With (1), every waiting qsub gets the information about its
finished jobs from the qmaster, hence many of them will stay connected
to it, which leads to some load.

> The main reason I would choose (2) over (1) is for stability of the  
> system, but I'm still looking into the pros and cons of such flows  
> and how they are implemented.
> For example, what happens if the SGE system becomes unresponsive  
> with users using flows as in (1)?  How will the system behave?

Just try it ;-) It will make efforts to keep the workflow alive, as
you expect it to work.

>   Will these flows break?  Likewise for (2), if the SGE system  
> becomes unresponsive, will the flows in (2) better handle a  
> relatively short SGE interruption?

A short interruption will be handled well by both. But with (2) you  
can get all information about your workflow from a `qstat` and you  
don't have to check various windows on your system or reconnect to  
them with `screen` from home.

The only disadvantage of (2) is that you have to issue several qdel
commands if you want to cancel the workflow. There is an RFE to
shorten it up:
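
Independent of that RFE, here is a minimal sketch of one way to get the cancellation down to a single qdel call: record the job IDs when the flow is submitted. It assumes that `qsub -terse` is available in your version (it prints only the job ID) and that qdel takes the IDs as separate arguments. The file name `flow_job_ids.txt` and the function names are made up for this illustration, and the `...` options from the flows above are left out.

```shell
#!/bin/sh
# Sketch only: FLOW_IDS_FILE, submit_step, submit_flow and cancel_flow
# are hypothetical names, not part of SGE.
FLOW_IDS_FILE=${FLOW_IDS_FILE:-./flow_job_ids.txt}

# Submit one job and append its numeric job ID to the ID file.
# With -terse, qsub prints only the job ID; for an array job this looks
# something like "4711.1-1000:1", so everything from the first dot on
# is cut off.
submit_step() {
    id=$(qsub -terse "$@") || return 1
    echo "${id%%.*}" >> "$FLOW_IDS_FILE"
}

# Submit the four steps of flow (2), chained by job name via -hold_jid.
# $1 is a suffix which makes the job names unique, e.g. "${USER}_$$".
submit_flow() {
    tag=$1
    : > "$FLOW_IDS_FILE"
    submit_step -N "setup_$tag" ./setup.sh &&
    submit_step -N "fan_out_$tag" -hold_jid "setup_$tag" -t 1-1000 ./fanout.sh &&
    submit_step -N "do_something_else_$tag" -hold_jid "fan_out_$tag" ./do_something_else.sh &&
    submit_step -N "cleanup_$tag" -hold_jid "do_something_else_$tag" ./cleanup.sh
}

# Cancel all recorded jobs with one qdel call.
cancel_flow() {
    [ -s "$FLOW_IDS_FILE" ] || return 0
    # Unquoted on purpose: each recorded ID becomes its own argument.
    qdel $(cat "$FLOW_IDS_FILE")
}
```

Usage would be something like `submit_flow "${USER}_$$"` to start the flow and `cancel_flow` later from the same directory. Jobs which have already finished are simply reported as unknown by qdel.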

There are also projects which decide at execution time which jobs
should be run, like http://wildfire.bii.a-star.edu.sg/, but their
daemon needs to run during the complete execution of the workflow,
and jobs are only submitted once it's clear that they should be
executed, i.e. they have to wait in the queue from that time on. IMO
it's much better to have this either directly as part of SGE, or to
code the workflow into the submitted jobs by holds, which are
conditionally removed. But my group wasn't interested in it and so I
didn't follow it up (meta language):

qsub job1.sh
+ qsub job2.sh
qsub job3.sh
if qdecide ~/mydecision.sh
     qsub job4a.sh
     qsub job4b.sh

All 5+1 jobs are submitted at once. job1.sh and job2.sh can run at  
the same time, but job3.sh only when both have finished. The qdecide  
(which will qsub ~/mydecision.sh) will either qdel job4b.sh and  
release the hold of job4a.sh or vice versa.
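
The same idea can be approximated with commands which exist today. This is a rough sketch, not the qdecide described above: both branches are submitted with a user hold (`qsub -h`), and a small decision job releases one with qrls and removes the other with qdel. It assumes that the decision should run after job3.sh and that the execution hosts are allowed to call qrls and qdel. The names `submit_conditional`, `decide_branch` and the wrapper `decide_job.sh` are hypothetical; the wrapper would just call `decide_branch ~/mydecision.sh "$1" "$2"`.

```shell
#!/bin/sh
# Sketch only: submit_conditional, decide_branch and decide_job.sh are
# hypothetical names; qdecide itself does not exist in SGE.

# Submit all 5+1 jobs at once. $1 is a suffix which makes the job names
# unique. job3 waits for job1 and job2; job4a and job4b get a user hold
# (-h); the decision job waits for job3 and receives the names of the
# two held branches as arguments.
submit_conditional() {
    tag=$1
    qsub -N "job1_$tag" job1.sh &&
    qsub -N "job2_$tag" job2.sh &&
    qsub -N "job3_$tag" -hold_jid "job1_$tag,job2_$tag" job3.sh &&
    qsub -h -N "job4a_$tag" job4a.sh &&
    qsub -h -N "job4b_$tag" job4b.sh &&
    qsub -N "decide_$tag" -hold_jid "job3_$tag" decide_job.sh "job4a_$tag" "job4b_$tag"
}

# Run inside the decision job.
# $1: command whose exit status decides (0 selects the first branch)
# $2, $3: names (or IDs) of the two held jobs
decide_branch() {
    decision_cmd=$1
    job_a=$2
    job_b=$3
    if "$decision_cmd"; then
        qrls "$job_a" && qdel "$job_b"
    else
        qrls "$job_b" && qdel "$job_a"
    fi
}
```

Compared to a daemon which submits jobs late, both branches are already waiting in the queue from the start, and the complete workflow is visible in `qstat`.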

-- Reuti

