[GE users] SGE jobs in "qw" state

Adam Brust abrust at csag.ucsd.edu
Tue Jun 6 22:03:58 BST 2006



Hi Chris,

Thanks for your reply.  The events that led up to my problem were: 1)
ganglia had filled up my / partition with httpd error logs, which of
course led to several problems, most notably 2) autofs wasn't behaving
well.  Users were getting intermittent mount failures all over.  For a
couple of weeks we lived with this autofs problem and SGE ran along
just fine.  I finally decided that a reboot was in order to get autofs
out of this "funky" state.  After that reboot the SGE jobs started
getting stuck in the "qw" state.  So, perhaps the full root partition
was the cause of the file corruption and the reboot just brought it to
the surface.
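
(For anyone who runs into the same thing: a quick way to confirm the
full-partition problem -- the paths are from memory of a stock Rocks
box, so adjust as needed -- is just:

  # df -h /
  # du -sk /var/log/* /var/log/httpd/* 2>/dev/null | sort -n | tail

which at least points a finger at whichever log file is eating the
root partition.)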

To answer your other questions:

> o what version of SGE did this happen on

SGE 6.0

> o what OS/architecture

Rocks 4.1/i386  ...SGE was installed using the "SGE roll" provided by 
the Rocks developers

> o what spooling method was in use

classic
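
(With classic spooling everything is in flat files, which makes it
easy to eyeball.  If I remember the layout right -- this is from
memory, so the paths may be slightly off -- the interesting ones on a
default install are:

  $ ls -l $SGE_ROOT/default/common/configuration
  $ ls -l $SGE_ROOT/default/common/sched_configuration
  $ ls -l $SGE_ROOT/default/common/local_conf/

A missing or zero-length file there would line up with the empty qconf
output I was seeing.)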

> o any other files missing or corrupt ?

I certainly hope not :)  The system has been used pretty heavily today 
and so far no complaints.
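
(The quick sanity checks I've been running, in case they're useful to
anyone, are just the read-only qconf/qstat calls:

  $ qconf -sconf     # global configuration
  $ qconf -ssconf    # scheduler configuration
  $ qconf -sel       # execution host list
  $ qstat -f         # queue/host state

If any of those come back empty or error out, I'll assume something
else got corrupted too.)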

> .. this way we can see if others report the same thing.  Spontaneous  
> loss of scheduler configuration would be a big deal if reproducible  
> in SGE 6.x .

Yeah, I'm curious whether the person who initiated this thread had the
same problem.
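
For what it's worth, now that things are healthy again I'm keeping a
dump of the configuration around so that next time it's a one-liner to
put back.  I'm going from memory on the exact flags, so double-check
against the qconf man page, but roughly:

  $ qconf -sconf  > /root/global_conf.backup
  $ qconf -ssconf > /root/sched_conf.backup
  ...
  $ qconf -Mconf  /root/global_conf.backup
  $ qconf -Msconf /root/sched_conf.backup

(the backup paths are just an example, of course).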

Thanks,

-adam

> On Jun 6, 2006, at 4:03 PM, Adam Brust wrote:
>
>> Hi.
>>
>> I just spent an entire day troubleshooting what seems to be a very  
>> similar problem. I finally found a resolution and perhaps it can  
>> help someone else.  Basically, somehow my scheduler config file got  
>> blown away. (maybe after I rebooted the system?)  The output of  
>> "qconf -sconf" displayed nothing.  I re-created this config file  and 
>> I am able to run jobs again.
>>
>> The symptoms in this thread were nearly identical to mine, most
>> notably the 'got max. unheard timeout for target "execd" on host...'
>> in the qmaster message log, which led me to believe there was a
>> communications problem from the qmaster to the sge_execd on the
>> nodes.  Unfortunately the error logs in this instance weren't very
>> helpful.
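>>
>> (If anyone else hits that "max. unheard timeout" message, the two
>> things I'd check first -- assuming the default classic-spooling spool
>> directories, so adjust the paths to taste -- are the daemon message
>> files and qping:
>>
>>   $ tail $SGE_ROOT/default/spool/qmaster/messages
>>   $ tail $SGE_ROOT/default/spool/<nodename>/messages
>>   $ qping -info <nodename> 537 execd 1
>>
>> That at least tells you whether the execd is up and reachable before
>> you start blaming the network.)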
>>
>> Hope this helps.
>>
>> -adam
>>
>> Joe Landman wrote:
>>
>>> Chris Dagdigian wrote:
>>>
>>>>
>>>> Sensible error messages at least.
>>>>
>>>> (1) Are sge_qmaster and sge_schedd daemons running OK on the master?
>>>>
>>>> (2) Are there any firewalls blocking TCP port 536? Grid Engine  
>>>> requires 2 TCP ports, one used by sge_qmaster and the other used  
>>>> for sge_execd communication.
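>>>>
>>>> A quick way to see which ports a given install is actually using,
>>>> assuming the usual /etc/services-based setup, is something like:
>>>>
>>>>   $ grep sge /etc/services
>>>>   $ env | grep SGE
>>>>
>>>> and then check those port numbers against any iptables rules.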
>>>>
>>>> (3) I've seen qrsh errors similar to this when the $SGE_ROOT was  
>>>> being shared cluster wide via NFS yet with extremely locked down  
>>>> export permissions that forbid suid operations or remapped the  
>>>> root user UID to a different, non-privileged user account.  Grid  
>>>> Engine has some setuid binaries that should not be blocked or  
>>>> remapped and odd permissions will certainly break qrsh commands  
>>>> and sometimes other things as well. You may want to look at file  
>>>> permissions and how they appear from the head (qmaster ) node  
>>>> versus how they look when you login to a compute node.
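>>>>
>>>> One rough way to compare, assuming the usual layout where the
>>>> setuid helpers live under $SGE_ROOT/utilbin/<arch>/, is to run the
>>>> same ls on both sides and look at the export options:
>>>>
>>>>   headnode$ ls -l $SGE_ROOT/utilbin/*/rsh $SGE_ROOT/utilbin/*/rlogin
>>>>   compute$  ls -l $SGE_ROOT/utilbin/*/rsh $SGE_ROOT/utilbin/*/rlogin
>>>>   headnode$ grep -i sge /etc/exports    # watch for root_squash/nosuid
>>>>
>>>> If the setuid bit shows up on the qmaster side but not on the
>>>> compute node, the export or mount options are the likely culprit.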
>>>>
>>>> I'm not familiar with recent ROCKS so I can't say for sure how  the 
>>>> SGE rocks-roll is deployed or even if it uses a shared NFS  
>>>> $SGE_ROOT by default. Sorry about that.
>>>>
>>>> { Just noticed Joe replying, he knows ROCKS far far better than  I 
>>>> !! }
>>>
>>>
>>>
>>> Hi Chris :)
>>>
>>>   Usually I see name service issues, but more often than not, I  see 
>>> iptables get in the way.
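>>>
>>>   A quick way to rule iptables in or out on a RHEL-ish system like
>>> Rocks is:
>>>
>>> [root at minicc ~]# iptables -L -n | grep -E '536|537'
>>> [root at minicc ~]# service iptables status
>>>
>>> (or, brute force, temporarily flush the rules with "iptables -F"
>>> while you test).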
>>>
>>>   If you look on the head node with lsof (lsof is one of your many  
>>> friends)
>>>
>>> [root at minicc ~]# lsof -i | grep -i sge
>>> sge_qmast  3072     sge    3u  IPv4   6914       TCP *:536 (LISTEN)
>>> sge_qmast  3072     sge    4u  IPv4   6934       TCP minicc.scalableinformatics.com:536->minicc.scalableinformatics.com:32781 (ESTABLISHED)
>>> sge_qmast  3072     sge    5u  IPv4 497728       TCP minicc.scalableinformatics.com:536->compute-0-0.local:33254 (ESTABLISHED)
>>> sge_sched  3091     sge    3u  IPv4   6933       TCP minicc.scalableinformatics.com:32781->minicc.scalableinformatics.com:536 (ESTABLISHED)
>>>
>>>
>>> You will see that it happily talks on port 536.  This is good, we  
>>> will play with this in a second.
>>>
>>> On the compute node, you will see something like this
>>>
>>> [root at compute-0-0 ~]# lsof -i | grep -i sge
>>> sge_execd  3034     sge    3u  IPv4   6255       TCP *:537 (LISTEN)
>>> sge_execd  3034     sge    4u  IPv4  96002       TCP compute-0-0.local:33254->minicc.scalableinformatics.com:536 (ESTABLISHED)
>>>
>>> where the execd is in listen mode on port 537.  Now to check  
>>> connectivity.
>>>
>>> [root at compute-0-0 ~]# telnet minicc.local 536
>>> Trying 10.1.0.1...
>>> Connected to minicc.local (10.1.0.1).
>>> Escape character is '^]'.
>>>
>>> Yup, we can get through from the compute node to the head node.
>>> This means the connection is not being blocked by iptables on either
>>> node.  Let's try the other way:
>>>
>>> [root at minicc ~]# telnet c0-0 537
>>> Trying 10.1.255.254...
>>> Connected to compute-0-0.local (10.1.255.254).
>>> Escape character is '^]'.
>>>
>>> That also worked.  They should both work; if they don't, that is a
>>> problem.
>>>
>>> As for qrsh working, the default install of Rocks 4.1 does not  have 
>>> a working qrsh.  I usually install my own SGE if I want a  working 
>>> qrsh (which I usually do).
>>>
>>> [landman at minicc ~]$ qrsh uname -a
>>> poll: protocol failure in circuit setup
>>>
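>>> If you do want to chase the qrsh problem, the relevant knobs live in
>>> the global configuration (rsh_command, rsh_daemon, rlogin_daemon and
>>> friends), so a quick look is:
>>>
>>> [landman at minicc ~]$ qconf -sconf | grep -E 'rsh|rlogin|qlogin'
>>>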
>>> You should be able to run the following job like this:
>>>
>>> [landman at minicc ~]$ cat > e
>>> #!/bin/tcsh
>>> #$ -S /bin/tcsh
>>> uname -a
>>> date
>>> cat /proc/cpuinfo
>>> [landman at minicc ~]$ chmod +x e
>>> [landman at minicc ~]$ qsub e
>>> Your job 4 ("e") has been submitted.
>>> [landman at minicc ~]$ qstat
>>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>>       4 0.00000 e          landman      qw    05/22/2006 12:16:29                                    1
>>> [landman at minicc ~]$ qstat
>>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>>       4 0.00000 e          landman      qw    05/22/2006 12:16:29                                    1
>>> [landman at minicc ~]$ qstat
>>> [landman at minicc ~]$
>>> [landman at minicc ~]$ cat e.o4
>>> Warning: no access to tty (Bad file descriptor).
>>> Thus no job control in this shell.
>>> Linux compute-0-0.local 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64 x86_64 GNU/Linux
>>> Mon May 22 12:16:37 EDT 2006
>>> processor       : 0
>>> vendor_id       : AuthenticAMD
>>> cpu family      : 15
>>> model           : 37
>>> model name      : AMD Opteron(tm) Processor 252
>>> stepping        : 1
>>> cpu MHz         : 2592.694
>>> cache size      : 1024 KB
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 1
>>> wp              : yes
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr  
>>> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext  
>>> lm 3dnowext 3dnow pni ts
>>> bogomips        : 5095.42
>>> TLB size        : 1088 4K pages
>>> clflush size    : 64
>>> cache_alignment : 64
>>> address sizes   : 40 bits physical, 48 bits virtual
>>> power management: ts fid vid ttp
>>>
>>> processor       : 1
>>> vendor_id       : AuthenticAMD
>>> cpu family      : 15
>>> model           : 37
>>> model name      : AMD Opteron(tm) Processor 252
>>> stepping        : 1
>>> cpu MHz         : 2592.694
>>> cache size      : 1024 KB
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 1
>>> wp              : yes
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr  
>>> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext  
>>> lm 3dnowext 3dnow pni ts
>>> bogomips        : 5177.34
>>> TLB size        : 1088 4K pages
>>> clflush size    : 64
>>> cache_alignment : 64
>>> address sizes   : 40 bits physical, 48 bits virtual
>>> power management: ts fid vid ttp
>>>
>>> Joe
>>>
>>>
>>>>
>>>>
>>>> -Chris
>>>>
>>>>
>>>>
>>>>
>>>> On May 22, 2006, at 4:52 PM, Mark_Johnson at URSCorp.com wrote:
>>>>
>>>>> Kickstarted 16:21 27-Mar-2006
>>>>> [urs1 at medusa ~]$ qrsh hostname
>>>>> error: error waiting on socket for client to connect: Interrupted system call
>>>>> error: unable to contact qmaster using port 536 on host "medusa.ursdcmetro.com"
>>>>> [urs1 at medusa ~]$
>>>>>
>>>>> Mark A. Johnson
>>>>> URS Network Administrator
>>>>> Gaithersburg, MD
>>>>> Ph:  301-721-2231
>>>>
>>>>
>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



