[GE users] SGE jobs in "qw" state

Mark_Johnson at URSCorp.com Mark_Johnson at URSCorp.com
Mon May 22 21:09:05 BST 2006


Chris,

I've got some results on 1 and 2, but I'm not sure what to look for on 3. I
can use PuTTY to get to one of the compute nodes and can see the .debug
file, but I don't know how to view it. :-(
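
My guess is that it's just a plain-text log and something like this would let
me page through it over the PuTTY session (the path below is only a guess at
where the .debug file actually sits):

# guessed path; substitute wherever the .debug file really lives on the node
less /tmp/execd.debug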

1. I've run "qconf -tsm"; the result is below. It says "full"? But there are
no other jobs in the queue, and it still says full after re-installing the
nodes and sending a job. (What I plan to check next is sketched after item 4
below.)
Mon May 22 15:49:27 2006|-------------START-SCHEDULER-RUN-------------
Mon May 22 15:49:27 2006|queue instance "all.q at compute-0-19.local" dropped
because it is temporarily not available
Mon May 22 15:49:27 2006|queues dropped because they are temporarily not
available: all.q at compute-0-19.local
Mon May 22 15:49:27 2006|queue instance "all.q at compute-0-27.local" dropped
because it is full
Mon May 22 15:49:27 2006|queues dropped because they are full:
all.q at compute-0-27.local

2.
05/22/2006 10:04:54|schedd|medusa|I|starting up 6.0u6
05/22/2006 10:04:56|schedd|medusa|I|using "default" as algorithm
05/22/2006 10:04:56|schedd|medusa|I|using "0:0:15" for schedule_interval
05/22/2006 10:04:56|schedd|medusa|I|using 0 for maxujobs
05/22/2006 10:04:56|schedd|medusa|I|using 0 for queue_sort_method
05/22/2006 10:04:56|schedd|medusa|I|using 0 for flush_submit_sec
05/22/2006 10:04:56|schedd|medusa|I|using 0 for flush_finish_sec
05/22/2006 10:04:56|schedd|medusa|I|using "np_load_avg=0.50" for
job_load_adjustments
05/22/2006 10:04:56|schedd|medusa|I|using "0:7:30" for
load_adjustment_decay_time
05/22/2006 10:04:56|schedd|medusa|I|using "np_load_avg" for load_formula
05/22/2006 10:04:56|schedd|medusa|I|using "true" for schedd_job_info
05/22/2006 10:04:56|schedd|medusa|I|using param: "none"
05/22/2006 10:04:56|schedd|medusa|I|using "0:0:0" for reprioritize_interval
05/22/2006 10:04:56|schedd|medusa|I|using 168 for halftime
05/22/2006 10:04:56|schedd|medusa|I|using "cpu=1,mem=0,io=0" for
usage_weight_list
05/22/2006 10:04:56|schedd|medusa|I|using 5 for compensation_factor
05/22/2006 10:04:56|schedd|medusa|I|using 0.25 for weight_user
05/22/2006 10:04:56|schedd|medusa|I|using 0.25 for weight_project
05/22/2006 10:04:56|schedd|medusa|I|using 0.25 for weight_department
05/22/2006 10:04:56|schedd|medusa|I|using 0.25 for weight_job
05/22/2006 10:04:56|schedd|medusa|I|using 0 for weight_tickets_functional
05/22/2006 10:04:56|schedd|medusa|I|using 0 for weight_tickets_share
05/22/2006 10:04:56|schedd|medusa|I|using 1 for share_override_tickets
05/22/2006 10:04:56|schedd|medusa|I|using 1 for share_functional_shares
05/22/2006 10:04:56|schedd|medusa|I|using 200 for
max_functional_jobs_to_schedule
05/22/2006 10:04:56|schedd|medusa|I|using 1 for report_pjob_tickets
05/22/2006 10:04:56|schedd|medusa|I|using 50 for max_pending_tasks_per_job
05/22/2006 10:04:56|schedd|medusa|I|using "none" for halflife_decay_list
05/22/2006 10:04:56|schedd|medusa|I|using "OFS" for policy_hierarchy
05/22/2006 10:04:56|schedd|medusa|I|using 0.01 for weight_ticket
05/22/2006 10:04:56|schedd|medusa|I|using 0 for weight_waiting_time
05/22/2006 10:04:56|schedd|medusa|I|using 3.6e+06 for weight_deadline
05/22/2006 10:04:56|schedd|medusa|I|using 0.1 for weight_urgency
05/22/2006 10:04:56|schedd|medusa|I|using 1 for weight_priority
05/22/2006 10:04:56|schedd|medusa|I|using 0 for max_reservation

3. I ran "qrsh medusa.ursdcmetro.com", but I'm not sure what should happen
here? (My best guess at what a working test should look like is in the
sketch after item 4.)

4.
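
For reference, this is roughly what I was planning to try next for items 1
to 3. The expected qrsh output is only my guess, not something I have
captured yet:

# item 1: what slot count does SGE think all.q has, and what does "qstat -f"
# show for the instance it reported as "full"?
qconf -sq all.q | grep slots
qstat -f | grep compute-0-27

# item 2: the same scheduler settings can be dumped at any time with
qconf -ssconf

# item 3: a plain interactive test; it should just print a compute node's
# hostname and exit, something like compute-0-27.local
qrsh hostname

If the slots value is 1 and "qstat -f" shows 1/1 used on compute-0-27, that
would at least explain the "full" message.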


                                                                           
Chris Dagdigian <dag at sonsorol.org>
To: users at gridengine.sunsource.net
Date: 05/22/2006 03:37 PM
Subject: Re: [GE users] SGE jobs in "qw" state
(Please respond to users at gridengine.sunsource.net)

The output looks pretty normal to me: you have idle nodes, and the job
status is not showing anything particularly alarming.

At this point I'd do the following:

(1) As root, run the sorta-under-documented command "qconf -tsm" on
your head node. That will cause a one-time scheduler profiling run to
be dumped to a text file called "schedd_runlog" in your
$SGE_ROOT/$SGE_CELL/common/ directory. It may explain why your job is
stuck in "qw" despite having available queue instances. (Steps 1, 2
and 4 are sketched right after this list.)

(2) Double-check the qmaster/messages and qmaster/schedd/messages files
for any odd error states.

(3) Does the command "qrsh hostname" even work?

(4) Check /tmp on the compute nodes, as the execd daemon will log there
in panic situations when it can't get to its spool directory.
Does this cluster share a $SGE_ROOT over NFS?
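
If you are not sure, something like this run against any compute node will
show whether /opt/gridengine (the usual $SGE_ROOT on Rocks) is local or an
NFS mount:

ssh compute-0-27 'df -h /opt/gridengine; mount | grep gridengine'
# an NFS mount shows a host:/path source; a local directory shows a /dev/... device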

This does not seem like a network timeout or firewall/routing issue
as you'd clearly see SGE alarm states in your qstat output showing
that nodes are unreachable. Each node regularly reports its load data
and SGE will notice when these are not coming in.
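
A quick way to double-check that from the head node, using nothing beyond
the stock tools:

qhost                   # hosts that have stopped reporting load show "-" in the load/memory columns
qstat -f | grep " au"   # rough check for queue instances flagged "au" (alarm/unreachable)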

Regards,
Chris




On May 22, 2006, at 3:13 PM, Mark_Johnson at URSCorp.com wrote:

> The job I have submitted is the simple.sh script.  Below is qstat -j 43,
> and qstat -f (part of it....as I have 250 nodes)
>
> [urs1 at medusa ~]$ qstat -j 43
> job_number:                 43
> exec_file:                  job_scripts/43
> submission_time:            Mon May 22 15:09:53 2006
> owner:                      urs1
> uid:                        500
> group:                      urs1
> gid:                        500
> sge_o_home:                 /home/urs1
> sge_o_log_name:             urs1
> sge_o_path:
> /opt/gridengine/bin/lx26-x86:/usr/kerberos/bin:/usr/java/jdk1.5.0
> _05/bin:/opt/intel/itc60/bin:/opt/intel/ita60/bin:/opt/intel/fc/9.0/
> bin:/opt/intel/idb/9.0/bin:/opt/intel/cc/9.0/bin:/usr/local/bin:/
> bin:/usr/bin:/usr/X11R6/bin:/opt/mpich/intel/bin:/opt/chromium/bin/
> Linux:/opt/ganglia/bin:/opt/lam/gnu/bin:/usr/share/pvm3/lib:/usr/
> share/pvm3/lib/LINUX:/usr/share/pvm3/bin/LINUX:/opt/rocks/bin:/opt/
> rocks/sbin:/home/urs1/bin
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /home/urs1
> sge_o_host:                 medusa
> account:                    sge
> mail_list:                  urs1 at medusa.ursdcmetro.com
> notify:                     FALSE
> job_name:                   simple.sh
> jobshare:                   0
> shell_list:                 /bin/sh
> env_list:
> script_file:                simple.sh
> scheduling info:            queue instance "all.q at compute-0-193.local"
> dropped because it is temporarily not available
>                             queue instance "all.q at compute-0-194.local"
> dropped because it is temporarily not available
>                             queue instance "all.q at compute-0-192.local"
> dropped because it is full
>
> [urs1 at medusa ~]$
>
> all.q at compute-0-92.local       BIP   0/1       0.53     lx26-x86
> ----------------------------------------------------------------------
> ------
> all.q at compute-0-93.local       BIP   0/1       0.68     lx26-x86
> ----------------------------------------------------------------------
> ------
> all.q at compute-0-94.local       BIP   0/1       0.59     lx26-x86
> ----------------------------------------------------------------------
> ------
> all.q at compute-0-95.local       BIP   0/1       0.70     lx26-x86
> ----------------------------------------------------------------------
> ------
> all.q at compute-0-96.local       BIP   0/1       0.64     lx26-x86
> ----------------------------------------------------------------------
> ------
> all.q at compute-0-97.local       BIP   0/1       0.66     lx26-x86
> ----------------------------------------------------------------------
> ------
> all.q at compute-0-98.local       BIP   0/1       0.77     lx26-x86
> ----------------------------------------------------------------------
> ------
> all.q at compute-0-99.local       BIP   0/1       0.64     lx26-x86
>
> ######################################################################
> ######
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS -
> PENDING JOBS
> ######################################################################
> ######
>      43 0.55500 simple.sh  urs1         qw    05/22/2006
> 15:09:53     1
> [urs1 at medusa ~]$
>
>
>
>
>
> Chris Dagdigian <dag at sonsorol.org>
> To: users at gridengine.sunsource.net
> Date: 05/22/2006 03:04 PM
> Subject: Re: [GE users] SGE jobs in "qw" state
> (Please respond to users at gridengine.sunsource.net)
>
> Hi Mark,
>
> Send us the output of "qstat -f" and also "qstat -j <jobID>" using a
> jobID of a job that is pending in state 'qw'
>
> The usual causes are:
>
> - sge is down cluster wide, resulting in no free execution hosts (if
> your qstat -f shows 'au' in the state column then this is the cause)
>
> - sge queues have all been knocked into a persistent error (E) state
> (will show up in "qstat -f")
>
> - most other causes will be revealed in the scheduler_info line of
> "qstat -j <jobID>" output
>
> Judging by the output below I would not be surprised to see your
> "qstat -f" output full of "au" states, which means alarm/unreachable.
> You need to restart SGE on any node showing 'au' in the state column.
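>
> For a node stuck in 'au', something like this run as root on that node
> should bring the execution daemon back. Treat it as a sketch; the paths
> match the /opt/gridengine install mentioned further down in this thread:
>
> . /opt/gridengine/default/common/settings.sh
> pkill sge_execd                          # stop any wedged execd first
> /opt/gridengine/bin/lx26-x86/sge_execd   # start a fresh one (it daemonizes itself)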
>
> Regards,
> Chris
>
>
> On May 22, 2006, at 2:50 PM, Mark_Johnson at URSCorp.com wrote:
>
>> I have built a Rocks 4.1 Cluster, and am trying to resolve a
>> problem with
>> the SGE.
>>
>> I can submit jobs to the queue, but once submitted they just sit
>> there in
>> the "qw" state.  I have received good help from the Rocks
>> community, but am
>> still unable to get the jobs to start.  Below are a few lines from
>> the
>> /opt/gridengine/default/spool/qmaster/messages.  It looks like the
>> qmaster cannot contact the "execd" on the nodes and times out?
>>
>> Any thoughts or ideas are appreciated..
>>
>> ps...dumb it down for me as I have a Windows Handicap...
>>
>> Mark,
>>
>> 05/22/2006 10:39:06|qmaster|medusa|I|execd on compute-0-179.local
>> registered
>> 05/22/2006 10:39:06|qmaster|medusa|I|execd on compute-0-178.local
>> registered
>> 05/22/2006 10:39:07|qmaster|medusa|I|execd on compute-0-180.local
>> registered
>> 05/22/2006 10:40:11|qmaster|medusa|E|got max. unheard timeout for
>> target
>> "execd" on host "compute-0-157.local", can't delivering job "42"
>> 05/22/2006 10:40:11|qmaster|medusa|W|rescheduling job 42.1
>> 05/22/2006 10:40:11|qmaster|medusa|E|failed delivering job 42.1
>> 05/22/2006 10:40:26|qmaster|medusa|E|got max. unheard timeout for
>> target
>> "execd" on host "compute-0-156.local", can't delivering job "42"
>> 05/22/2006 10:40:26|qmaster|medusa|W|rescheduling job 42.1
>> 05/22/2006 10:40:26|qmaster|medusa|E|failed delivering job 42.1
>> 05/22/2006 10:40:32|qmaster|medusa|I|urs1 has deleted job 42
>> [urs1 at medusa qmaster]$
>>
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



