[GE users] SGE jobs in "qw" state

Chris Dagdigian dag at sonsorol.org
Mon May 22 20:04:27 BST 2006


Hi Mark,

Send us the output of "qstat -f" and also "qstat -j <jobID>", using the  
jobID of a job that is pending in state 'qw'.

The usual causes are:

- SGE is down cluster-wide, resulting in no free execution hosts (if  
your "qstat -f" shows 'au' in the state column then this is the cause)

- SGE queues have all been knocked into a persistent error (E) state  
(will show up in "qstat -f")

- most other causes will be revealed in the scheduler_info line of  
"qstat -j <jobID>" output
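
As a rough illustration of the first two checks, here is a minimal Python
sketch. It assumes you have already collected the state column of
"qstat -f" into a dict (by hand or with your own parsing); the function
names and the dict layout are hypothetical and are not part of SGE.

```python
def diagnose_state(state):
    """Map the state letters discussed above to a likely cause.

    `state` is the text of the state column for one queue instance,
    for example "au" or "E". An empty string means no flags are set.
    """
    findings = []
    # 'au' is the alarm/unreachable combination described above.
    if "a" in state and "u" in state:
        findings.append("alarm/unreachable: restart SGE on this node")
    # Upper-case 'E' is the persistent queue error state.
    if "E" in state:
        findings.append("error state: queue was knocked into E")
    return findings


def report(states):
    """Return only the entries that have at least one finding.

    `states` maps a queue or host name to its state-column text
    (a hypothetical layout chosen for this sketch).
    """
    result = {}
    for name, state in states.items():
        findings = diagnose_state(state)
        if findings:
            result[name] = findings
    return result
```

For example, report({"compute-0-157.local": "au", "compute-0-179.local": ""})
would keep only the first host. Anything this does not flag still needs
the scheduler_info line from "qstat -j <jobID>".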

Judging by the output below, I would not be surprised to see your  
"qstat -f" output full of "au" states, which means alarm/unreachable.  
You need to restart SGE on any node showing 'au' in the state column.
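
To get a quick list of the nodes the qmaster could not reach, something
like the following Python sketch may help. It assumes each log entry is
on a single line in the pipe-separated layout quoted below
(timestamp|daemon|host|level|message); the function name and the regular
expression are my own and only match the "unheard timeout" wording seen
in that excerpt.

```python
import re
from collections import Counter

# Matches the error text seen in the quoted qmaster log excerpt.
TIMEOUT_RE = re.compile(
    r'got max\. unheard timeout for target "execd" on host "([^"]+)"'
)


def unheard_hosts(lines):
    """Count 'unheard timeout' errors per execution host.

    `lines` is any iterable of log lines, e.g. an open file object.
    Lines that do not have five pipe-separated fields are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue
        _timestamp, _daemon, _master, level, message = fields
        if level != "E":  # only error-level entries are of interest
            continue
        match = TIMEOUT_RE.search(message)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing it the open qmaster log file gives a Counter keyed by host name.
The hosts it reports are the first candidates for an SGE restart, though
"qstat -f" remains the authoritative view of which nodes are in 'au'.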

Regards,
Chris


On May 22, 2006, at 2:50 PM, Mark_Johnson at URSCorp.com wrote:

> I have built a Rocks 4.1 cluster and am trying to resolve a problem
> with SGE.
>
> I can submit jobs to the queue, but once submitted they just sit there
> in the "qw" state.  I have received good help from the Rocks community,
> but am still unable to get the jobs to start.  Below are a few lines
> from /opt/gridengine/default/spool/qmaster/message.  It looks like the
> qmaster cannot contact the "execd" on the nodes and times out?
>
> Any thoughts or ideas are appreciated.
>
> ps...dumb it down for me as I have a Windows Handicap...
>
> Mark,
>
> 05/22/2006 10:39:06|qmaster|medusa|I|execd on compute-0-179.local registered
> 05/22/2006 10:39:06|qmaster|medusa|I|execd on compute-0-178.local registered
> 05/22/2006 10:39:07|qmaster|medusa|I|execd on compute-0-180.local registered
> 05/22/2006 10:40:11|qmaster|medusa|E|got max. unheard timeout for target "execd" on host "compute-0-157.local", can't delivering job "42"
> 05/22/2006 10:40:11|qmaster|medusa|W|rescheduling job 42.1
> 05/22/2006 10:40:11|qmaster|medusa|E|failed delivering job 42.1
> 05/22/2006 10:40:26|qmaster|medusa|E|got max. unheard timeout for target "execd" on host "compute-0-156.local", can't delivering job "42"
> 05/22/2006 10:40:26|qmaster|medusa|W|rescheduling job 42.1
> 05/22/2006 10:40:26|qmaster|medusa|E|failed delivering job 42.1
> 05/22/2006 10:40:32|qmaster|medusa|I|urs1 has deleted job 42
> [urs1 at medusa qmaster]$
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
