[GE users] SGE 5.3 p1

Reuti reuti at staff.uni-marburg.de
Wed Sep 1 16:58:47 BST 2004



Hello,

Paul Lyons wrote:

> Have a few problems/questions, the Qmaster is an Ultra 10 running 
> Solaris 8 and the exec hosts are a mixture of Solaris and Red Hat 
> Linux, we are running sge 5.3p1...
>
5.3p1 is not the latest patch release of the 5.3 series. Is there any 
chance you could upgrade to 5.3p6 or to SGE 6.0?

> We have changed slightly the way we work and what's happening is that 
> there are periodically hundreds of jobs being submitted at the same 
> time by one person (quite legitimately).
>
> The problems I'm seeing are:
>
> 1) Qmaster falls over frequently > 2x per week
> - we have how/where should I look to debug what's happening here? 
> (have looked in the log files but can't see anything obvious, some 
> hints would be good)
>
> 2) When there is a flood of jobs, what I'd like to see is that 
> users with one job on the queue take precedence over the person with 
> 20 jobs running and 200 waiting. Is there any mechanism for doing this 
> sort of "fair share", other than by fiddling with the priority of the 
> submitted jobs?
>
In 5.3 you can set "user_sort true" in the scheduler configuration. 
Have a look at `man sched_conf`.
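
A minimal sketch of how you could check the current setting, assuming 
qconf is in your PATH and that the scheduler configuration is printed 
as "name value" pairs. The helper name user_sort_state is made up for 
this example and is not part of SGE:

```shell
# Hypothetical helper: print the value of user_sort from a scheduler
# configuration dump read on stdin, or "unset" if the line is missing.
user_sort_state() {
    awk '$1 == "user_sort" { print $2; found = 1 }
         END { if (!found) print "unset" }'
}

# Only query the live system where qconf is actually installed.
if command -v qconf >/dev/null 2>&1; then
    # qconf -ssconf shows the scheduler configuration.
    qconf -ssconf | user_sort_state
    # To change it, open the configuration in your editor and set
    # "user_sort true":
    #   qconf -msconf
fi
```

If this prints "false" or "unset", the setting is not active yet.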

> 3) During this peak season some of the submitted jobs end up in status 
> Eqw (error). What has happened is that they have been refused on all 
> hosts and fallen off the stack; these jobs never get re-submitted. This 
> is very bad.
>
> Why does this happen (the job not being re-submitted)? The jobs in 
> question are a script that has been run successfully many dozens of times.
>
> Snippet of one such job > qstat -f -j <> ....
>      .
>      .
>      .
>      scheduling info:            queue "dashpot.q" dropped because it is temporarily not available
>                                  queue "gudgeon.q" dropped because it is temporarily not available
>                                  queue "joint.q" dropped because it is temporarily not available
>                                  queue "pedal.q" dropped because it is temporarily not available
>                                  queue "plate.q" dropped because it is temporarily not available
>                                  queue "press.q" dropped because it is temporarily not available
>                                  queue "shim.q" dropped because it is temporarily not available
>                                  queue "welt.q" dropped because it is temporarily not available
>                                  queue "whiffle.q" dropped because it is temporarily not available
>                                  queue "bobbin.q" dropped because it is full
>                                  queue "gear.q" dropped because it is full
>                                  queue "sump.q" dropped because it is full
>                                  queue "washer.q" dropped because it is full
>                                  queue "idler.q" dropped because it is full
>                                  queue "tenon.q" dropped because it is full
>                                  queue "dynamo.q" dropped because it is full
>                                  queue "dibble.q" dropped because it is full
>                                  queue "fulcrum.q" dropped because it is full
>                                  queue "eye.q" dropped because it is full
>                                  queue "governer.q" dropped because it is full
>                                  queue "ferrule.q" dropped because it is full
>                                  queue "fascia.q" dropped because it is full
>                                  queue "blade.q" dropped because it is full
>                                  queue "gasket.q" dropped because it is full
>                                  job is in error state
>      
> 4) Last but not least, a small question: if I rotate the logs using 
> "logchecker.sh", will it reset the job counter as well? Otherwise, how 
> do I reset the counter? A job number of 1,2xx,xxx is currently what we 
> are running at!
>
No. Have you tried simply shutting down the qmaster, editing the file 
$SGE_ROOT/default/spool/qmaster/jobseqnum, and restarting SGE?
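
A rough sketch of that procedure, under the assumption that jobseqnum 
holds a single integer. The helper reset_jobseqnum is made up for this 
example and is not an SGE command; it keeps a backup copy so the old 
value can be restored:

```shell
# Hypothetical helper: back up a job sequence file and write a new value.
# Usage: reset_jobseqnum FILE NEW_NUMBER
reset_jobseqnum() {
    seqfile=$1
    newval=$2
    [ -f "$seqfile" ] || { echo "no such file: $seqfile" >&2; return 1; }
    # Accept only a non-empty string of digits as the new counter value.
    case $newval in
        ''|*[!0-9]*) echo "not a number: $newval" >&2; return 1 ;;
    esac
    # Keep the old counter next to the file before overwriting it.
    cp -p "$seqfile" "$seqfile.bak" || return 1
    echo "$newval" > "$seqfile"
}

# Intended sequence, run as the SGE admin user:
#   qconf -km     # shut down the qmaster
#   reset_jobseqnum "$SGE_ROOT/default/spool/qmaster/jobseqnum" 1
#   ...then restart the qmaster with your usual startup script
```

The old value stays in jobseqnum.bak in the same directory.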

Cheers - Reuti

