[GE users] SGE jobs in "qw" state

Chris Dagdigian dag at sonsorol.org
Mon May 22 20:37:25 BST 2006


The output looks pretty normal to me: you have idle nodes, and the job
status is not showing anything particularly alarming.

At this point I'd do the following:

(1) As root, run the sorta-under-documented command "qconf -tsm" on
your head node. That will cause a one-time scheduler profiling run to
be dumped to a text file called "schedd_runlog" in your
$SGE_ROOT/$SGE_CELL/common/ directory (see the command sketch after
this list). It may explain why your job is stuck in "qw" despite
having available queue instances.

(2) Double-check the qmaster/messages and qmaster/schedd/messages
files for any odd error states.

(3) Does the command "qrsh hostname" even work?

(4) Check /tmp on the compute nodes, as the execd daemon will log
there in panic situations when it can't get to its spool directory.
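
A rough sketch of (1), (2) and (4), run as root on the head node. I'm
assuming the default classic spooling layout under $SGE_ROOT/$SGE_CELL
(on Rocks that is /opt/gridengine/default) and using compute-0-99 from
your qstat output as the example node -- adjust to your setup:

   # (1) trigger a one-time scheduler profiling run, then read the result
   qconf -tsm
   cat $SGE_ROOT/$SGE_CELL/common/schedd_runlog

   # (2) look for recent errors in the qmaster and scheduler logs
   tail -n 50 $SGE_ROOT/$SGE_CELL/spool/qmaster/messages
   tail -n 50 $SGE_ROOT/$SGE_CELL/spool/qmaster/schedd/messages

   # (4) see whether execd has dumped anything into /tmp on a node
   ssh compute-0-99 'ls -l /tmp'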

Does this cluster share a $SGE_ROOT over NFS?
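
(If you're not sure, a quick check -- assuming /opt/gridengine is your
$SGE_ROOT and compute-0-99 is a reachable node -- is:

   ssh compute-0-99 'df -hT /opt/gridengine'

An NFS-shared $SGE_ROOT will show an nfs filesystem type; a local copy
will show ext3 or similar.)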

This does not seem like a network timeout or firewall/routing issue,
as you'd clearly see SGE alarm states in your qstat output showing
that nodes are unreachable. Each node regularly reports its load data,
and SGE will notice when those reports stop coming in.
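
(For what it's worth, a quick way to double-check that is something
like:

   qstat -f | grep -w au    # 'au' = alarm/unreachable queue instances
   qhost                    # hosts that stopped reporting show "-" for load

Both commands are standard SGE; the grep pattern is just a convenience.)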

Regards,
Chris




On May 22, 2006, at 3:13 PM, Mark_Johnson at URSCorp.com wrote:

> The job I have submitted is the simple.sh script.  Below is
> qstat -j 43, and qstat -f (part of it... as I have 250 nodes).
>
> [urs1 at medusa ~]$ qstat -j 43
> job_number:                 43
> exec_file:                  job_scripts/43
> submission_time:            Mon May 22 15:09:53 2006
> owner:                      urs1
> uid:                        500
> group:                      urs1
> gid:                        500
> sge_o_home:                 /home/urs1
> sge_o_log_name:             urs1
> sge_o_path:                 /opt/gridengine/bin/lx26-x86:/usr/kerberos/bin:/usr/java/jdk1.5.0_05/bin:/opt/intel/itc60/bin:/opt/intel/ita60/bin:/opt/intel/fc/9.0/bin:/opt/intel/idb/9.0/bin:/opt/intel/cc/9.0/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/mpich/intel/bin:/opt/chromium/bin/Linux:/opt/ganglia/bin:/opt/lam/gnu/bin:/usr/share/pvm3/lib:/usr/share/pvm3/lib/LINUX:/usr/share/pvm3/bin/LINUX:/opt/rocks/bin:/opt/rocks/sbin:/home/urs1/bin
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /home/urs1
> sge_o_host:                 medusa
> account:                    sge
> mail_list:                  urs1 at medusa.ursdcmetro.com
> notify:                     FALSE
> job_name:                   simple.sh
> jobshare:                   0
> shell_list:                 /bin/sh
> env_list:
> script_file:                simple.sh
> scheduling info:            queue instance "all.q at compute-0-193.local" dropped because it is temporarily not available
>                             queue instance "all.q at compute-0-194.local" dropped because it is temporarily not available
>                             queue instance "all.q at compute-0-192.local" dropped because it is full
>
> [urs1 at medusa ~]$
>
> all.q at compute-0-92.local       BIP   0/1       0.53     lx26-x86
> ----------------------------------------------------------------------------
> all.q at compute-0-93.local       BIP   0/1       0.68     lx26-x86
> ----------------------------------------------------------------------------
> all.q at compute-0-94.local       BIP   0/1       0.59     lx26-x86
> ----------------------------------------------------------------------------
> all.q at compute-0-95.local       BIP   0/1       0.70     lx26-x86
> ----------------------------------------------------------------------------
> all.q at compute-0-96.local       BIP   0/1       0.64     lx26-x86
> ----------------------------------------------------------------------------
> all.q at compute-0-97.local       BIP   0/1       0.66     lx26-x86
> ----------------------------------------------------------------------------
> all.q at compute-0-98.local       BIP   0/1       0.77     lx26-x86
> ----------------------------------------------------------------------------
> all.q at compute-0-99.local       BIP   0/1       0.64     lx26-x86
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>      43 0.55500 simple.sh  urs1         qw    05/22/2006 15:09:53     1
> [urs1 at medusa ~]$
>
> On May 22, 2006, at 3:04 PM, Chris Dagdigian <dag at sonsorol.org> wrote:
>
> Hi Mark,
>
> Send us the output of "qstat -f" and also "qstat -j <jobID>" using a
> jobID of a job that is pending in state 'qw'
>
> The usual causes are:
>
> - sge is down cluster wide, resulting in no free execution hosts (if
> your qstat -f shows 'au' in the state column then this is the cause)
>
> - sge queues have all been knocked into a persistent error (E) state
> (will show up in "qstat -f")
>
> - most other causes will be revealed in the "scheduling info" line of
> "qstat -j <jobID>" output
>
> Judging by the output below I would not be surprised to see your
> "qstat -f" output full of "au" states, which means alarm/unreachable.
> You need to restart SGE on any node showing 'au' in the state column.
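>
> (A rough sketch of that restart, assuming the stock sgeexecd init
> script that the SGE install drops into /etc/init.d, with compute-0-99
> as the example node:
>
>    ssh compute-0-99 '/etc/init.d/sgeexecd stop; /etc/init.d/sgeexecd start'
>
> Adjust the script name/path if your install put it elsewhere.)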
>
> Regards,
> Chris
>
>
> On May 22, 2006, at 2:50 PM, Mark_Johnson at URSCorp.com wrote:
>
>> I have built a Rocks 4.1 cluster, and am trying to resolve a
>> problem with SGE.
>>
>> I can submit jobs to the queue, but once submitted they just sit
>> there in the "qw" state.  I have received good help from the Rocks
>> community, but am still unable to get the jobs to start.  Below are
>> a few lines from /opt/gridengine/default/spool/qmaster/messages.
>> It looks like the qmaster cannot contact the "execd" on the nodes
>> and times out?
>>
>> Any thoughts or ideas are appreciated.
>>
>> P.S. Dumb it down for me, as I have a Windows handicap...
>>
>> Mark,
>>
>> 05/22/2006 10:39:06|qmaster|medusa|I|execd on compute-0-179.local registered
>> 05/22/2006 10:39:06|qmaster|medusa|I|execd on compute-0-178.local registered
>> 05/22/2006 10:39:07|qmaster|medusa|I|execd on compute-0-180.local registered
>> 05/22/2006 10:40:11|qmaster|medusa|E|got max. unheard timeout for target "execd" on host "compute-0-157.local", can't delivering job "42"
>> 05/22/2006 10:40:11|qmaster|medusa|W|rescheduling job 42.1
>> 05/22/2006 10:40:11|qmaster|medusa|E|failed delivering job 42.1
>> 05/22/2006 10:40:26|qmaster|medusa|E|got max. unheard timeout for target "execd" on host "compute-0-156.local", can't delivering job "42"
>> 05/22/2006 10:40:26|qmaster|medusa|W|rescheduling job 42.1
>> 05/22/2006 10:40:26|qmaster|medusa|E|failed delivering job 42.1
>> 05/22/2006 10:40:32|qmaster|medusa|I|urs1 has deleted job 42
>> [urs1 at medusa qmaster]$

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list