[GE users] qstat help with lost computers

Bevan C. Bennett bevan at fulcrummicro.com
Wed Feb 21 00:12:12 GMT 2007

Brett_W_Grant at raytheon.com wrote:
> I have sge 6.9 running on a number of different computer networks.  One
> of these networks had the physical computers moved to a different room,
> which apparently has inadequate A/C, so a random number of machines go
> down at random times.  I am interested in only knowing what is actually
> running at the moment, but if box9 goes down, the qstat command simply
> shows a job is running on box9.  It seems like the qstat -qs a will tell
> me what is in an alarm state, but is there a way to see just what is
> running?   In this example, box9 would not even show up in the list of
> running jobs.

I think you'd want "qstat -s r" to list only the running jobs.
To show all queue instances that are unreachable, try this:

"qstat -f -qs au"

You need to add the "-f" to trigger qstat to "full" mode, where it lists all
queue instances instead of just those with jobs in them.
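
If you only want the names of the unreachable queue instances, a small
filter over that output might help. This is a rough sketch, not something
I have run against your cluster: the function name unreachable_queues is
made up, and it assumes the usual "qstat -f" layout, where queue instance
lines start with "queue@host" and the states are the 6th column. Check
the column positions against your own version's output first.

```shell
#!/bin/sh
# Print queue instances whose state column contains "u" (unknown).
# Assumption: stdin looks like "qstat -f" output, i.e. queue instance
# lines begin with "queue@host" and carry the states in column 6.
# Header, separator, and indented job lines have no "@" in field 1,
# so they are skipped.
unreachable_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /u/ { print $1 }'
}
```

Usage would be something like: qstat -f -qs au | unreachable_queues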

> Also, the web page shows that for the qs flag, one has several  options
> {a|c|d|o|s|u|A|C|D|E|S}, but it doesn't explain what they are or where
> to find out what they are.  I think that they are:
> a - alarm
> c - calendar
> d - disabled
> o - ?
> s - suspended
> u - unknown?
> and not sure of what the Caps mean.

The answers are in the 'qstat' man page under "Full Format":

       ·  the  state  of  the  queue  -  one of u(nknown) if the corresponding
          sge_execd(8) cannot be contacted, a(larm), A(larm),  C(alendar  sus-
          pended), s(uspended), S(ubordinate), d(isabled), D(isabled), E(rror)
          or combinations thereof.

       If the state is a(larm) at least one of the load thresholds defined  in
       the load_thresholds list of the queue configuration (see queue_conf(5))
       is currently exceeded, which prevents from scheduling further  jobs  to
       that queue.

       As  opposed  to  this, the state A(larm) indicates that at least one of
       the suspend thresholds of the queue (see  queue_conf(5))  is  currently
       exceeded.  This will result in jobs running in that queue being succes-
       sively suspended until no threshold is violated.

       The states s(uspended) and d(isabled) can be  assigned  to  queues  and
       released  via  the  qmod(1)  command. Suspending a queue will cause all
       jobs executing in that queue to be suspended.

       The states D(isabled) and C(alendar suspended) indicate that the  queue
       has  been disabled or suspended automatically via the calendar facility
       of N1 Grid Engine (see calendar_conf(5)), while the S(ubordinate) state
       indicates that the queue has been suspended via subordination to another
       queue (see queue_conf(5) for details). When suspending a queue (regard-
       less  of the cause) all jobs executing in that queue are suspended too.

       If an E(rror) state is displayed for a queue, sge_execd(8) on that host
       was  unable  to  locate  the sge_shepherd(8) executable on that host in
       order  to  start  a  job.  Please  check  the  error  logfile  of  that
       sge_execd(8) for leads on how to resolve the problem. Please enable the
       queue afterwards via the -c option of the qmod(1) command manually.

       If the  c(onfiguration  ambiguous)  state  is  displayed  for  a  queue
       instance this indicates that the configuration specified for this queue
       instance in sge_conf(5) is ambiguous. The state vanishes when the  con-
       figuration   becomes  un-ambiguous  again.  This  state  prevents  from
       scheduling further jobs to that queue instance. Detailed reasons why  a
       queue instance entered the c(onfiguration ambiguous) state can be found
       in the sge_qmaster(8) messages file and are shown by the qstat -explain
       switch.  For  queue instances in this state the cluster queue's default
       settings are used for the ambiguous attribute.

       If an o(rphaned) state is displayed for a queue instance this indicates
       that  the current cluster queue's configuration and host group
       configuration does not any longer demand this queue instance. The queue
       instance is kept because not yet finished jobs are still associated and
       it will vanish from qstat output  when  these  jobs  are  finished.  To
       quicken  vanishing  of an orphaned queue instance associated job(s) can
       be deleted using qdel(1).  A queue instance in (o)rphaned state can  be
       revived  by  changing  the  cluster  queue configuration accordingly to
       cover that queue instance. This state prevents from scheduling  further
       jobs to that queue instance.
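
If it helps to have the letters spelled out next to the qstat output, the
list above can be turned into a small lookup. This is only a sketch: the
helper name decode_states and the short labels are my own, and the
letter meanings are taken from the man page text quoted above, nothing
more.

```shell
#!/bin/sh
# Expand a queue state string such as "au" into readable labels.
# decode_states is a hypothetical helper; the labels paraphrase the
# qstat man page excerpt quoted above.
decode_states() {
    states=$1
    out=""
    while [ -n "$states" ]; do
        rest=${states#?}            # everything after the first character
        letter=${states%"$rest"}    # the first character itself
        case $letter in
            u) name="unknown" ;;
            a) name="load alarm" ;;
            A) name="suspend alarm" ;;
            C) name="calendar suspended" ;;
            s) name="suspended" ;;
            S) name="subordinate" ;;
            d) name="disabled" ;;
            D) name="calendar disabled" ;;
            E) name="error" ;;
            c) name="configuration ambiguous" ;;
            o) name="orphaned" ;;
            *) name="unrecognized ($letter)" ;;
        esac
        out="${out:+$out, }$name"   # comma-separate after the first label
        states=$rest
    done
    printf '%s\n' "$out"
}
```

For example, "decode_states au" walks the string one letter at a time and
prints the label for a followed by the label for u.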
