[GE users] SGE 6.2: qsub -sync y option for large number of jobs
reuti at staff.uni-marburg.de
Thu Dec 17 12:51:26 GMT 2009
On 14.12.2009 at 03:25, elauzier wrote:
> Reuti, thanks for the feedback...
> I'll try to be clearer with my inquiries...
> Here are a couple work flows that I am working with...
> Simple work flow in the foreground:
> qsub ... -sync y -t 1-1000 ./fan_out.sh
> Alternative simple work flow using pure batch:
> qsub ... -N Setup_unique_name ./setup.sh
> qsub ... -t 1-1000 -hold_jid "Setup_unique_name" -N fan_out_unique_name ./fanout.sh
> qsub ... -N do_something_else_unique_name -hold_jid "fan_out_unique_name" ./do_something_else.sh
> qsub ... -N cleanup_unique_name -hold_jid "do_something_else_unique_name" ./cleanup.sh
> I guess you can say that the first flow is more of an interactive
> flow and the second one is a pure batch flow.
> Considering scalability of say 500 people running similar flows at
> the same time, I would tend to go with (2) especially if the flows
> are large and long, where (1) can be used for smaller and shorter
Agreed. With (1), each qsub gets the information about the finished jobs
from the qmaster, so many clients will be connected to it at the same
time, which puts some load on it.
> The main reason I would choose (2) over (1) is for stability of the
> system, but I'm still looking into the pros and cons of such flows
> and how they are implemented.
> For example, what happens if the SGE system becomes unresponsive
> with users using flows as in (1)? How will the system behave?
Just try it ;-) It will make an effort to keep the workflow alive, as
you expect it to work.
> Will these flows break? Likewise for (2), if the SGE system
> becomes unresponsive, will the flows in (2) better handle a
> relatively short SGE interruption?
A short interruption will be handled well by both. But with (2) you
can get all information about your workflow from a `qstat` and you
don't have to check various windows on your system or reconnect to
them with `screen` from home.
The only disadvantage of (2) is that you have to issue several qdel
commands if you want to cancel the workflow. There is an RFE to shorten it up:
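Until something like that exists, the several qdel calls can be wrapped in a small shell function. A minimal sketch, assuming the job names from the batch flow quoted above; the `run` and `cancel_workflow` helpers and the `DRY_RUN` switch are made up for illustration and are not part of SGE:

```shell
#!/bin/sh
# Sketch only: run and cancel_workflow are hypothetical helpers, not SGE commands.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Issue one qdel per job name, in the order given.
# Pass the names last-to-first, so that a later job is not started
# just because the job it was waiting for has been removed.
cancel_workflow() {
    for name in "$@"; do
        run qdel "$name"
    done
}

# Example call for the flow quoted above:
# cancel_workflow cleanup_unique_name do_something_else_unique_name \
#                 fan_out_unique_name Setup_unique_name
```

Setting DRY_RUN=1 first shows what would be deleted without touching the queue.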
There are also projects that decide at execution time which jobs to
run, like http://wildfire.bii.a-star.edu.sg/ but their daemon needs to
run during the complete execution of the workflow - and jobs will only
be submitted once it's clear that they should be executed, i.e. they
have to wait in the queue from that time on. IMO it's much better to
have this either
directly as part of SGE, or to code the workflow into the submitted
jobs by holds, which are conditionally removed. But my group wasn't
interested in it and so I didn't follow it up (meta language):
+ qsub job2.sh
if qdecide ~/mydecision.sh
All 5+1 jobs are submitted at once. job1.sh and job2.sh can run at
the same time, but job3.sh only when both have finished. The qdecide
(which will qsub ~/mydecision.sh) will either qdel job4b.sh and
release the hold of job4a.sh or vice versa.
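With the commands SGE already has, the same idea could look roughly like the following. This is only a sketch under assumptions: the job names, the script paths other than ~/mydecision.sh, the placement of the decision job after job3, and the `run`, `submit_flow` and `decide` helpers are all made up for illustration. It also assumes that the execution hosts are allowed to call SGE client commands:

```shell
#!/bin/sh
# Sketch only: names, paths and helpers are assumptions, not an SGE feature.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Submit all 5+1 jobs at once.
submit_flow() {
    run qsub -N job1 ./job1.sh
    run qsub -N job2 ./job2.sh
    # job3 waits until both job1 and job2 have finished.
    run qsub -N job3 -hold_jid job1,job2 ./job3.sh
    # Both alternatives get a user hold (-h) on top of waiting for job3.
    run qsub -N job4a -h -hold_jid job3 ./job4a.sh
    run qsub -N job4b -h -hold_jid job3 ./job4b.sh
    # The decision job, assumed here to run once job3 has finished.
    run qsub -N decide -hold_jid job3 "$HOME/mydecision.sh"
}

# To be called at the end of mydecision.sh: keep one branch, drop the other.
decide() {
    keep=$1
    drop=$2
    run qalter -h U "$keep"   # remove the user hold of the branch to keep
    run qdel "$drop"          # delete the other branch
}
```

mydecision.sh would then end with either `decide job4a job4b` or `decide job4b job4a`, depending on its result. In real use the job names would have to be unique per workflow, as in the flows quoted earlier.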