[GE users] SGE jobs in "qw" state

Mark_Johnson at URSCorp.com Mark_Johnson at URSCorp.com
Mon May 22 20:13:27 BST 2006


The job I submitted is the simple.sh script.  Below is the output of
qstat -j 43 and part of qstat -f (only part, as I have 250 nodes).

[urs1 at medusa ~]$ qstat -j 43
job_number:                 43
exec_file:                  job_scripts/43
submission_time:            Mon May 22 15:09:53 2006
owner:                      urs1
uid:                        500
group:                      urs1
gid:                        500
sge_o_home:                 /home/urs1
sge_o_log_name:             urs1
sge_o_path:                 /opt/gridengine/bin/lx26-x86:/usr/kerberos/bin:/usr/java/jdk1.5.0_05/bin:/opt/intel/itc60/bin:/opt/intel/ita60/bin:/opt/intel/fc/9.0/bin:/opt/intel/idb/9.0/bin:/opt/intel/cc/9.0/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/mpich/intel/bin:/opt/chromium/bin/Linux:/opt/ganglia/bin:/opt/lam/gnu/bin:/usr/share/pvm3/lib:/usr/share/pvm3/lib/LINUX:/usr/share/pvm3/bin/LINUX:/opt/rocks/bin:/opt/rocks/sbin:/home/urs1/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/urs1
sge_o_host:                 medusa
account:                    sge
mail_list:                  urs1 at medusa.ursdcmetro.com
notify:                     FALSE
job_name:                   simple.sh
jobshare:                   0
shell_list:                 /bin/sh
env_list:
script_file:                simple.sh
scheduling info:            queue instance "all.q at compute-0-193.local" dropped because it is temporarily not available
                            queue instance "all.q at compute-0-194.local" dropped because it is temporarily not available
                            queue instance "all.q at compute-0-192.local" dropped because it is full

[urs1 at medusa ~]$

all.q at compute-0-92.local       BIP   0/1       0.53     lx26-x86
----------------------------------------------------------------------------
all.q at compute-0-93.local       BIP   0/1       0.68     lx26-x86
----------------------------------------------------------------------------
all.q at compute-0-94.local       BIP   0/1       0.59     lx26-x86
----------------------------------------------------------------------------
all.q at compute-0-95.local       BIP   0/1       0.70     lx26-x86
----------------------------------------------------------------------------
all.q at compute-0-96.local       BIP   0/1       0.64     lx26-x86
----------------------------------------------------------------------------
all.q at compute-0-97.local       BIP   0/1       0.66     lx26-x86
----------------------------------------------------------------------------
all.q at compute-0-98.local       BIP   0/1       0.77     lx26-x86
----------------------------------------------------------------------------
all.q at compute-0-99.local       BIP   0/1       0.64     lx26-x86

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     43 0.55500 simple.sh  urs1         qw    05/22/2006 15:09:53     1
[urs1 at medusa ~]$




                                                                           
From: Chris Dagdigian <dag at sonsorol.org>
Date: 05/22/2006 03:04 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] SGE jobs in "qw" state
Please respond to: users at gridengine.sunsource.net





Hi Mark,

Send us the output of "qstat -f" and also "qstat -j <jobID>" using a
jobID of a job that is pending in state 'qw'

The usual causes are:

- SGE is down cluster-wide, resulting in no free execution hosts (if
your "qstat -f" shows 'au' in the state column then this is the cause)

- SGE queues have all been knocked into a persistent error (E) state
(will show up in "qstat -f")

- most other causes will be revealed in the "scheduling info" section
of the "qstat -j <jobID>" output
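
With 250 nodes, scanning "qstat -f" by eye for the first two causes is tedious. The following is a minimal sketch, not a definitive tool: it assumes the SGE 6.x "qstat -f" layout in which each queue instance line has five columns (queuename, qtype, used/tot., load_avg, arch) plus an optional sixth states column. The function name flagged_queues is made up for illustration.

```shell
# List queue instances whose states column is non-empty (for example au or E).
# Reads `qstat -f` text on stdin and prints "queue_instance state" pairs.
# Assumption: a queue instance line has "@" in its first field and the
# state, when present, is the sixth whitespace-separated field.
flagged_queues() {
    awk '$1 ~ /@/ && NF >= 6 { print $1, $6 }'
}

# Typical use on the head node:
#   qstat -f | flagged_queues
```

An empty result would suggest that no queue instance is in an alarm, unreachable or error state, in which case the "scheduling info" lines of "qstat -j <jobID>" are the next place to look.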

Judging by the output below, I would not be surprised to see your
"qstat -f" output full of "au" states, which means alarm/unreachable.
You need to restart SGE on any node showing 'au' in the state column.
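
As a rough sketch of that restart step, the function below only prints one ssh command per affected host so the list can be reviewed before anything is run. It rests on several assumptions that are not confirmed anywhere in this thread: that the execution daemon is controlled by an init script at /etc/init.d/sgeexecd accepting stop and start, that passwordless ssh to the compute nodes works, and that the state is the sixth column of "qstat -f". The names restart_commands and EXECD_INIT are hypothetical.

```shell
# Print (do not run) a restart command for every host whose queue
# instance state contains "u" (unreachable). Reads `qstat -f` on stdin.
# EXECD_INIT is an assumed init-script path; override it for your install.
EXECD_INIT="${EXECD_INIT:-/etc/init.d/sgeexecd}"

restart_commands() {
    awk -v init="$EXECD_INIT" '
        $1 ~ /@/ && NF >= 6 && $6 ~ /u/ {
            split($1, parts, "@")      # parts[2] is the host name
            printf "ssh %s \"%s stop; %s start\"\n", parts[2], init, init
        }'
}

# Typical use on the head node, reviewing the output before running it:
#   qstat -f | restart_commands
```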

Regards,
Chris


On May 22, 2006, at 2:50 PM, Mark_Johnson at URSCorp.com wrote:

> I have built a Rocks 4.1 cluster and am trying to resolve a problem
> with SGE.
>
> I can submit jobs to the queue, but once submitted they just sit
> there in the "qw" state.  I have received good help from the Rocks
> community, but am still unable to get the jobs to start.  Below are a
> few lines from /opt/gridengine/default/spool/qmaster/message.  It
> looks like the qmaster cannot contact the "execd" on the nodes and
> times out?
>
> Any thoughts or ideas are appreciated.
>
> P.S. Dumb it down for me, as I have a Windows handicap...
>
> Mark,
>
> 05/22/2006 10:39:06|qmaster|medusa|I|execd on compute-0-179.local registered
> 05/22/2006 10:39:06|qmaster|medusa|I|execd on compute-0-178.local registered
> 05/22/2006 10:39:07|qmaster|medusa|I|execd on compute-0-180.local registered
> 05/22/2006 10:40:11|qmaster|medusa|E|got max. unheard timeout for target "execd" on host "compute-0-157.local", can't delivering job "42"
> 05/22/2006 10:40:11|qmaster|medusa|W|rescheduling job 42.1
> 05/22/2006 10:40:11|qmaster|medusa|E|failed delivering job 42.1
> 05/22/2006 10:40:26|qmaster|medusa|E|got max. unheard timeout for target "execd" on host "compute-0-156.local", can't delivering job "42"
> 05/22/2006 10:40:26|qmaster|medusa|W|rescheduling job 42.1
> 05/22/2006 10:40:26|qmaster|medusa|E|failed delivering job 42.1
> 05/22/2006 10:40:32|qmaster|medusa|I|urs1 has deleted job 42
> [urs1 at medusa qmaster]$
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net







