[GE users] SGE jobs in "qw" state

Chris Dagdigian dag at sonsorol.org
Tue Jun 6 21:21:41 BST 2006


A well behaved Grid Engine should never lose its configuration this
way -- with classic spooling there is a human-readable text file in
$SGE_ROOT/<cell>/common/schedd_configuration that contains the data,
and with binary Berkeley DB spooling the config lives in the spooldb
files.  I've never seen a partially broken configuration unless it
was my own human error -- the times I've seen the SGE configuration
totally hosed, either all the files were messed up or the Berkeley DB
database was totally unrecoverable.  In the cases of total
configuration loss we've usually tracked the problem down to bad
behavior by SAN client software or some other OS or file-server
related issue that was external to Grid Engine.
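
A quick way to confirm the config is still on disk (a rough sketch,
assuming classic spooling and the default cell name "default"):

  ls -l $SGE_ROOT/default/common/   # the scheduler config should be a plain text file in here
  qconf -ssconf                     # ask qmaster for the scheduler configuration it has loaded

If the files look fine on disk but "qconf -ssconf" comes back empty,
the problem is on the qmaster side rather than in the spool files.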

In case this is a new problem, it would be helpful if you could reply  
with the following:

o what version of SGE did this happen on
o what OS/architecture
o what spooling method was in use
o any other files missing or corrupt?
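
Most of that can be gathered in a few seconds (again a sketch,
assuming the default cell name "default"):

  qstat -help | head -1                                     # the first line is the SGE version string
  uname -a                                                  # OS and architecture
  grep spooling_method $SGE_ROOT/default/common/bootstrap   # "classic" or "berkeleydb"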

... that way we can see if others report the same thing.  Spontaneous
loss of the scheduler configuration would be a big deal if it turns
out to be reproducible in SGE 6.x.

Regards,
Chris



On Jun 6, 2006, at 4:03 PM, Adam Brust wrote:

> Hi.
>
> I just spent an entire day troubleshooting what seems to be a very
> similar problem. I finally found a resolution and perhaps it can
> help someone else.  Basically, somehow my scheduler config file got
> blown away (maybe after I rebooted the system?).  The output of
> "qconf -sconf" displayed nothing.  I re-created this config file
> and I am able to run jobs again.
>
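> In case it saves someone else the digging, the re-creation itself
> was just a qconf call, roughly (this is from memory, so treat it as
> a sketch):
>
>   qconf -msconf   # opens the scheduler configuration in $EDITOR;
>                   # saving the file writes the config back to qmaster
>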
> The symptoms in this thread were nearly identical to mine, most
> notably the 'got max. unheard timeout for target "execd" on
> host...' in the qmaster message log, which led me to believe there
> was a communications problem from the qmaster to the sgeexecd
> on the nodes.  Unfortunately the error logs in this instance
> weren't very helpful.
>
> Hope this helps.
>
> -adam
>
> Joe Landman wrote:
>
>> Chris Dagdigian wrote:
>>
>>>
>>> Sensible error messages at least.
>>>
>>> (1) Are sge_qmaster and sge_schedd daemons running OK on the master?
>>>
>>> (2) Are there any firewalls blocking TCP port 536? Grid Engine  
>>> requires 2 TCP ports, one used by sge_qmaster and the other used  
>>> for sge_execd communication.
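>>>
>>> A quick sanity check on both the qmaster host and a compute node
>>> (a sketch -- the actual port numbers depend on how SGE was
>>> installed):
>>>
>>>   grep sge /etc/services   # typically sge_qmaster 536/tcp and sge_execd 537/tcp
>>>   iptables -L -n           # make sure nothing is dropping traffic to those ports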
>>>
>>> (3) I've seen qrsh errors similar to this when $SGE_ROOT was being
>>> shared cluster-wide via NFS with extremely locked-down export
>>> permissions that forbid suid operations or remap the root UID to a
>>> different, non-privileged user account.  Grid Engine has some
>>> setuid binaries that should not be blocked or remapped, and odd
>>> permissions will certainly break qrsh commands and sometimes other
>>> things as well.  You may want to look at file permissions and how
>>> they appear from the head (qmaster) node versus how they look when
>>> you log in to a compute node.
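>>>
>>> Roughly, something like this on a compute node (the utilbin path
>>> is from memory, so double-check it against your install):
>>>
>>>   mount | grep sge                  # a "nosuid" mount option here will break the setuid helpers
>>>   ls -l $SGE_ROOT/utilbin/*/rlogin  # qrsh helper binaries; these should be setuid root
>>>
>>> and "exportfs -v" on the file server to see whether root is being
>>> squashed (root_squash vs. no_root_squash).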
>>>
>>> I'm not familiar with recent ROCKS so I can't say for sure how  
>>> the SGE rocks-roll is deployed or even if it uses a shared NFS  
>>> $SGE_ROOT by default. Sorry about that.
>>>
>>> { Just noticed Joe replying, he knows ROCKS far far better than  
>>> I !! }
>>
>>
>> Hi Chris :)
>>
>>   Usually I see name service issues, but more often than not, I  
>> see iptables get in the way.
>>
>>   If you look on the head node with lsof (lsof is one of your many  
>> friends)
>>
>> [root at minicc ~]# lsof -i | grep -i sge
>> sge_qmast  3072     sge    3u  IPv4   6914       TCP *:536 (LISTEN)
>> sge_qmast  3072     sge    4u  IPv4   6934       TCP minicc.scalableinformatics.com:536->minicc.scalableinformatics.com:32781 (ESTABLISHED)
>> sge_qmast  3072     sge    5u  IPv4 497728       TCP minicc.scalableinformatics.com:536->compute-0-0.local:33254 (ESTABLISHED)
>> sge_sched  3091     sge    3u  IPv4   6933       TCP minicc.scalableinformatics.com:32781->minicc.scalableinformatics.com:536 (ESTABLISHED)
>>
>>
>> You will see that it happily talks on port 536.  This is good, we  
>> will play with this in a second.
>>
>> On the compute node, you will see something like this
>>
>> [root at compute-0-0 ~]# lsof  -i | grep -i sge
>> sge_execd  3034     sge    3u  IPv4   6255       TCP *:537 (LISTEN)
>> sge_execd  3034     sge    4u  IPv4  96002       TCP compute-0-0.local:33254->minicc.scalableinformatics.com:536 (ESTABLISHED)
>>
>> where the execd is in listen mode on port 537.  Now to check  
>> connectivity.
>>
>> [root at compute-0-0 ~]# telnet minicc.local 536
>> Trying 10.1.0.1...
>> Connected to minicc.local (10.1.0.1).
>> Escape character is '^]'.
>>
>> Yup, we can get through from the compute node to the head node.
>> This means that the connection is not being blocked by iptables on
>> either node.  Let's try the other way:
>>
>> [root at minicc ~]# telnet c0-0 537
>> Trying 10.1.255.254...
>> Connected to compute-0-0.local (10.1.255.254).
>> Escape character is '^]'.
>>
>> That also worked.  They should both work; if they don't, that is a
>> problem.
>>
>> As for qrsh working, the default install of Rocks 4.1 does not  
>> have a working qrsh.  I usually install my own SGE if I want a  
>> working qrsh (which I usually do).
>>
>> [landman at minicc ~]$ qrsh uname -a
>> poll: protocol failure in circuit setup
>>
>> You should be able to run the following job like this:
>>
>> [landman at minicc ~]$ cat > e
>> #!/bin/tcsh
>> #$ -S /bin/tcsh
>> uname -a
>> date
>> cat /proc/cpuinfo
>> [landman at minicc ~]$ chmod +x e
>> [landman at minicc ~]$ qsub e
>> Your job 4 ("e") has been submitted.
>> [landman at minicc ~]$ qstat
>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>>       4 0.00000 e          landman      qw    05/22/2006 12:16:29                                    1
>> [landman at minicc ~]$ qstat
>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>>       4 0.00000 e          landman      qw    05/22/2006 12:16:29                                    1
>> [landman at minicc ~]$ qstat
>> [landman at minicc ~]$
>> [landman at minicc ~]$ cat e.o4
>> Warning: no access to tty (Bad file descriptor).
>> Thus no job control in this shell.
>> Linux compute-0-0.local 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36  
>> BST 2005 x86_64 x86_64 x86_64 GNU/Linux
>> Mon May 22 12:16:37 EDT 2006
>> processor       : 0
>> vendor_id       : AuthenticAMD
>> cpu family      : 15
>> model           : 37
>> model name      : AMD Opteron(tm) Processor 252
>> stepping        : 1
>> cpu MHz         : 2592.694
>> cache size      : 1024 KB
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 1
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr  
>> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext  
>> lm 3dnowext 3dnow pni ts
>> bogomips        : 5095.42
>> TLB size        : 1088 4K pages
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 40 bits physical, 48 bits virtual
>> power management: ts fid vid ttp
>>
>> processor       : 1
>> vendor_id       : AuthenticAMD
>> cpu family      : 15
>> model           : 37
>> model name      : AMD Opteron(tm) Processor 252
>> stepping        : 1
>> cpu MHz         : 2592.694
>> cache size      : 1024 KB
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 1
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr  
>> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext  
>> lm 3dnowext 3dnow pni ts
>> bogomips        : 5177.34
>> TLB size        : 1088 4K pages
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 40 bits physical, 48 bits virtual
>> power management: ts fid vid ttp
>>
>> Joe
>>
>>
>>>
>>>
>>> -Chris
>>>
>>>
>>>
>>>
>>> On May 22, 2006, at 4:52 PM, Mark_Johnson at URSCorp.com wrote:
>>>
>>>> Kickstarted 16:21 27-Mar-2006
>>>> [urs1 at medusa ~]$ qrsh hostname
>>>> error: error waiting on socket for client to connect: Interrupted system call
>>>> error: unable to contact qmaster using port 536 on host "medusa.ursdcmetro.com"
>>>> [urs1 at medusa ~]$
>>>>
>>>> Mark A. Johnson
>>>> URS Network Administrator
>>>> Gaithersburg, MD
>>>> Ph:  301-721-2231
>>>
>>>
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



