[GE users] qstat help with lost computers

Reuti reuti at staff.uni-marburg.de
Wed Feb 21 00:24:29 GMT 2007


On 21.02.2007 at 00:16, Brett_W_Grant at raytheon.com wrote:

> I have sge 6.9 running on a number of different computer networks.   
> One of these networks had the physical computers moved to a  
> different room, which apparently has inadequate A/C, so a random  
> number of machines go down at random times.  I am only interested  
> in knowing what is actually running at the moment, but if box9 goes  
> down, the qstat command simply shows a job is running on box9.  It  
> seems like the qstat -qs a will tell me what is in an alarm state,  
> but is there a way to see just what is running?   In this example,  
> box9 would not even show up in the list of running jobs.

This is a feature, as the node might regain its network connection at  
a later point in time. SGE can't know that this node is already in the  
land of lost bytes. But you could activate the reschedule feature for  
the queue, so that the job will go into the pending state again and  
wait for a free slot: see `man sge_conf`, section "reschedule_unknown".
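
As a rough sketch of what that could look like (the 10-minute timeout and the queue name "all.q" are assumptions for illustration, not values from your setup; as far as I recall from the man page, rescheduling only applies to jobs flagged as rerunnable):

```
# Cluster configuration (edit with: qconf -mconf)
# Assumed value: reschedule jobs once their host has been unknown for 10 minutes.
# Format is hh:mm:ss; 00:00:00 means jobs are not rescheduled.
reschedule_unknown   00:10:00

# Queue configuration (edit with: qconf -mq all.q; "all.q" is an assumed name)
rerun                TRUE

# Alternatively, mark a single job as rerunnable at submission time:
#   qsub -r y job.sh
```

Please check the "reschedule_unknown" section of `man sge_conf` for the restrictions on parallel, checkpointing and interactive jobs before relying on this.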

> Also, the web page shows that for the qs flag, one has several   
> options {a|c|d|o|s|u|A|C|D|E|S}, but it doesn't explain what they  
> are or where to find out what they are.  I think that they are:
> a - alarm
> c - calendar
> d - disabled
> o - ?
> s - suspended
> u - unknown?

Please have a look at `man qstat`, section "Full Format (with -f and -F)".

-- Reuti
