[GE users] SGE 5.3 p1

Paul Lyons paul at dsiweb.co.uk
Wed Sep 1 16:24:56 BST 2004


Hi,


I have a few problems and questions. The qmaster is an Ultra 10 running
Solaris 8, and the exec hosts are a mixture of Solaris and Red Hat Linux.
We are running SGE 5.3p1.


We have slightly changed the way we work, and as a result one person now
periodically submits hundreds of jobs at the same time (quite
legitimately).

The problems I'm seeing are:

1) The qmaster falls over frequently, more than twice per week.
- How/where should I look to debug what's happening here? (I have looked in
the log files but can't see anything obvious; some hints would be good.)

2) When there is a flood of jobs, what I'd like to see is that users with
one job in the queue take precedence over the person with 20 jobs running
and 200 waiting. Is there any mechanism for doing this sort of "fair
share", other than by fiddling with the priority of the submitted jobs?
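
To make the desired ordering concrete, here is a minimal Python sketch. It
only illustrates the policy described above and is not anything SGE itself
provides; the function name fair_order and the input shapes (a list of
pending jobs and a per-user count of running jobs) are assumptions made for
the example.

```python
from collections import defaultdict


def fair_order(pending, running):
    """Order pending jobs so users with the fewest jobs in flight go first.

    pending: list of (job_id, user) tuples in submission order.
    running: dict mapping user -> number of jobs currently running.
    """
    in_flight = defaultdict(int, running)
    remaining = list(pending)
    ordered = []
    while remaining:
        # Choose the job whose owner has the fewest jobs counted so far;
        # ties are broken by submission order (the list index).
        idx = min(
            range(len(remaining)),
            key=lambda i: (in_flight[remaining[i][1]], i),
        )
        job_id, user = remaining.pop(idx)
        ordered.append((job_id, user))
        # Count the chosen job as dispatched so the same user does not
        # immediately win the next slot as well.
        in_flight[user] += 1
    return ordered


if __name__ == "__main__":
    # One user already has 20 jobs running; another has none.
    print(fair_order([(1, "big"), (2, "big"), (3, "small")], {"big": 20}))
```

With these inputs the single job from the light user is placed ahead of both
jobs from the heavy user, which is the behaviour being asked for.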

3) During these peak periods some of the submitted jobs end up in state Eqw
(error). What has happened is that they have been refused on all hosts and
have fallen off the stack; these jobs never get re-submitted. This is very
bad.

Why does this happen (the job not being re-submitted)? The jobs in question
are runs of a script that has been run successfully many dozens of times.

Snippet of one such job > qstat -f -j <> ....
     .
     .
     .
     scheduling info:            queue "dashpot.q" dropped because it is temporarily not available
                                 queue "gudgeon.q" dropped because it is temporarily not available
                                 queue "joint.q" dropped because it is temporarily not available
                                 queue "pedal.q" dropped because it is temporarily not available
                                 queue "plate.q" dropped because it is temporarily not available
                                 queue "press.q" dropped because it is temporarily not available
                                 queue "shim.q" dropped because it is temporarily not available
                                 queue "welt.q" dropped because it is temporarily not available
                                 queue "whiffle.q" dropped because it is temporarily not available
                                 queue "bobbin.q" dropped because it is full
                                 queue "gear.q" dropped because it is full
                                 queue "sump.q" dropped because it is full
                                 queue "washer.q" dropped because it is full
                                 queue "idler.q" dropped because it is full
                                 queue "tenon.q" dropped because it is full
                                 queue "dynamo.q" dropped because it is full
                                 queue "dibble.q" dropped because it is full
                                 queue "fulcrum.q" dropped because it is full
                                 queue "eye.q" dropped because it is full
                                 queue "governer.q" dropped because it is full
                                 queue "ferrule.q" dropped because it is full
                                 queue "fascia.q" dropped because it is full
                                 queue "blade.q" dropped because it is full
                                 queue "gasket.q" dropped because it is full
                                 job is in error state
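
Since the listing is long, a small script can tally why the queues were
dropped. The following is a rough Python sketch that works on saved qstat
text like the snippet above; the function name tally_drop_reasons is made up
for the example, and the parsing assumes the exact wording shown here
(queue "..." dropped because it is ...), including reasons that the mailer
wrapped onto the next line.

```python
import re
from collections import Counter

# Matches a complete entry: queue "name.q" dropped because it is <reason>
PATTERN = re.compile(r'queue "([^"]+)" dropped because it is (.+)$')
PREFIX = "scheduling info:"


def tally_drop_reasons(text):
    """Count the reasons queues were dropped in 'scheduling info' output."""
    counts = Counter()
    pending = None  # an entry whose reason wrapped onto the following line
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(PREFIX):
            line = line[len(PREFIX):].strip()
        if pending is not None:
            # Rejoin the wrapped reason with the entry it belongs to.
            line = pending + " " + line
            pending = None
        if line.endswith("dropped because it is"):
            pending = line
            continue
        match = PATTERN.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): qstat -f -j <> > job.txt; python tally.py < job.txt
    for reason, count in tally_drop_reasons(sys.stdin.read()).most_common():
        print(f"{count:4d}  {reason}")
```

Lines that are not queue entries, such as "job is in error state", are
simply ignored. On the snippet above this gives 15 queues dropped as full
and 9 as temporarily not available.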
      
4) Last but not least, a small question: if I rotate the logs using
"logchecker.sh", will it reset the job counter as well? Otherwise, how do I
reset the counter? A job number of 1,2xx,xxx is currently what we are
running at!


Thanks,
Paul



