[GE users] Cannot start SGE on master

Richard Bohn rxbeee at rit.edu
Thu Mar 15 21:34:48 GMT 2007


Hi Daniel,

Here is the output from qstat

qstat -j 9
job_number:                 9
exec_file:                  job_scripts/9
submission_time:            Thu Mar 15 17:10:30 2007
owner:                      rxb
uid:                        123
group:                      rxb
gid:                        123
sge_o_home:                 /home/rxb
sge_o_log_name:             rxb
sge_o_path:
/opt/gridengine/bin/lx26-x86:/opt/mpich/gnu/bin:/usr/java/jdk1.5.0_05/bi
n:/opt/intel/idb/9.0/bin:/opt/intel/cc/9.0/bin:/opt/intel/fc/9.0/bin:/op
t/gridengine/bin/lx26-x86:/opt/mpich/gnu/bin:/usr/kerberos/bin:/usr/java
/jdk1.5.0_05/bin:/opt/intel/idb/9.0/bin:/opt/intel/cc/9.0/bin:/opt/intel
/fc/9.0/bin:/usr/local/bin:/bin:/usr/bin:/opt/blast-2.2.10/bin:/opt/mpib
last-1.4.0/bin:/opt/condor-6.8.2/bin:/opt/condor-6.8.2/sbin:/opt/ganglia
/bin:/opt/Mathematica/bin:/opt/Mathematica/bin/bin/Linux:/opt/matlab/bin
:/opt/mpich/gnu/lib:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/usr/
share/pvm3/lib:/usr/share/pvm3/lib/LINUXI386:/usr/share/pvm3/bin/LINUXI3
86:/opt/rocks/bin:/opt/rocks/sbin:/usr/X11R6/bin:/opt/blast-2.2.10/bin:/
opt/mpiblast-1.4.0/bin:/opt/condor-6.8.2/bin:/opt/condor-6.8.2/sbin:/opt
/ganglia/bin:/opt/Mathematica/bin:/opt/Mathematica/bin/bin/Linux:/opt/ma
tlab/bin:/opt/mpich/gnu/lib:/opt/maui/bin:/opt/torque/bin:/opt/torque/sb
in:/usr/share/pvm3/lib:/usr/share/pvm3/lib/LINUXI386:/usr/share/pvm3/bin
/LINUXI386:/opt/rocks/bin:/opt/rocks/sbin:/home/rxb/bin:/home/rxb/script
s
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/rxb
sge_o_host:                 cluster
account:                    sge
mail_list:                  rxb at cluster.rit.edu
notify:                     FALSE
job_name:                   hostname
jobshare:                   0
env_list:                   
script_file:                /bin/hostname
scheduling info:            queue instance "all.q at compute-1-7.local" dropped because it is temporarily not available
                            queue instance "all.q at compute-0-2.local" dropped because it is temporarily not available
                            queue instance "mpiqueue.q at compute-1-7.local" dropped because it is temporarily not available
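
A minimal Python sketch (not part of the original message) for pulling the dropped queue instances out of the scheduling info shown above. The function names are hypothetical, and the regular expression assumes only the message wording seen in this output ("queue instance "..." dropped because ..."); other scheduler messages would need their own patterns. SAMPLE reproduces the three entries from this message, wrapped the way a mail client might wrap them.

```python
import re
from collections import Counter

# The three scheduling info entries from the qstat -j output in this message.
SAMPLE = '''scheduling info:            queue instance "all.q at compute-1-7.local"
dropped because it is temporarily not available
                            queue instance "all.q at compute-0-2.local"
dropped because it is temporarily not available
                            queue instance
"mpiqueue.q at compute-1-7.local" dropped because it is temporarily not
available
'''

# Assumed message shape: queue instance "<name>" dropped because <reason>
ENTRY = re.compile(
    r'queue instance "([^"]+)" dropped because (.+?)(?= queue instance "|$)'
)


def dropped_queue_instances(text):
    """Return (queue_instance, reason) pairs from possibly wrapped text."""
    flat = " ".join(text.split())  # collapse hard wraps and padding
    return [(m.group(1), m.group(2).strip()) for m in ENTRY.finditer(flat)]


def host_of(queue_instance):
    """Return the host part of a queue instance name.

    The list archive rewrites "@" as " at ", so both forms are accepted.
    """
    return re.split(r"\s+at\s+|@", queue_instance, maxsplit=1)[-1]


if __name__ == "__main__":
    pairs = dropped_queue_instances(SAMPLE)
    for queue_instance, reason in pairs:
        print(f"{host_of(queue_instance)}: {queue_instance} ({reason})")
    print("reasons:", dict(Counter(reason for _, reason in pairs)))
```

The three entries name only two distinct hosts, so on a 47-node cluster the scheduling info by itself does not account for the remaining queue instances.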

I don't see anything helpful in the output.
I have 47 nodes in the cluster so I have a few available.

Rick

-----Original Message-----
From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
Sent: Thursday, March 15, 2007 5:27 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Cannot start SGE on master

What does qstat -j <jobid> tell you about why the job isn't being
scheduled?

Daniel

Richard Bohn wrote:
> Hi Reuti,
>
> I solved the original problem and have the master and scheduler running.
> But now when I submit a job it never gets executed because the scheduler
> cannot connect to the compute nodes. I see messages like the following
> in the log:
>
> qmaster|cluster|E|got max. unheard timeout for target "execd" on host
> "compute-17.local", can't delivering job "3"
>
> I can do a qstat on the remote node and see the job waiting in the
> queue.
>
> We did upgrade the local LAN switch during the move and set the local
> machines and head node to use jumbo frames which the switch supports. I
> don't know if SGE would be sensitive to this. I haven't seen any other
> problems with the local LAN and talking to the remote nodes.
>
> I rebooted the head node, and the qmaster log has a number of entries
> saying the execd on the various compute nodes was registered. Doing a
> qstat -f shows nothing in the status column. When I do the
> qsub -b y /bin/hostname
> the job stays pending, a few nodes show "au" in the status field, and
> I see the above error in the qmaster log.
>
> I also have restarted sge execd on all the compute nodes.
>
> Thanks for the help.
>
> Rick
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: Thursday, March 15, 2007 1:44 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Cannot start SGE on master
>
> Hi,
>
> Am 15.03.2007 um 17:44 schrieb Richard Bohn:
>
>   
>> Hello All,
>>
>>
>>
>> I'm running SGE (version 6 u6) under the ROCKS clustering software.
>> It had been working fine until we moved the cluster, which meant
>> changing the IP address of the head node's public interface. Now when
>> I try to start SGE I get the following error:
>>
>>
>>
>> ./sgemaster start
>>
>>    starting sge_qmaster
>>
>>    starting sge_schedd
>>
>> error: unable to read qmaster name: qmaster hostname in
>> "/opt/gridengine/default/common/act_qmaster" has zero length
>>
>> critical error: unable to read qmaster name:
>> /opt/gridengine/default/common/act_qmaster
>>
>>
>>
>> Indeed the act_qmaster file is zero length, but if I set it to the
>> FQDN of the machine and then restart SGE, the file gets reset to
>> zero length. The configuration file in the same directory is also
>> zero length.
>>     
> Did you also adjust the /etc/hosts file and/or DNS entry to reflect the
> new (or old) name under the changed TCP/IP address? You can check
> this with the tools in $SGE_ROOT/utilbin/<your_arch>/
>
> -- Reuti
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>   
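
Reuti's suggestion in the quoted thread, to confirm that /etc/hosts or DNS matches the changed address, can be approximated with a forward and reverse lookup. The sketch below is an assumption-laden illustration, not the SGE tooling he refers to: it uses Python's standard socket module instead of the programs under $SGE_ROOT/utilbin/<your_arch>/, and the function name check_name_resolution is hypothetical. The resolver functions are parameters so the logic can be exercised without touching the network.

```python
import socket


def check_name_resolution(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an address and back, then compare the names.

    forward(name) must return an address string; reverse(address) must
    return a (primary_name, alias_list, address_list) tuple, which is the
    shape socket.gethostbyaddr returns.
    """
    address = forward(hostname)
    primary, aliases, _addresses = reverse(address)
    names = {primary.lower()} | {alias.lower() for alias in aliases}
    return {
        "hostname": hostname,
        "address": address,
        "reverse_names": sorted(names),
        # True only if the reverse lookup leads back to the same name.
        "consistent": hostname.lower() in names,
    }


if __name__ == "__main__":
    name = socket.gethostname()
    try:
        print(check_name_resolution(name))
    except OSError as exc:  # covers socket.gaierror and socket.herror
        print(f"lookup failed for {name}: {exc}")
```

Running it on the head node and on a compute node and comparing the results is one way to spot a stale entry left over from the old address. A result with "consistent" set to False only says that the reverse lookup returned different names; whether that matters for SGE would still need checking with its own tools.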




