[GE users] SGE jobs in "qw" state

Adam Brust abrust at csag.ucsd.edu
Tue Jun 6 21:03:43 BST 2006


Hi.

I just spent an entire day troubleshooting what seems to be a very
similar problem, and I finally found a resolution that may help
someone else. Somehow my scheduler configuration file got blown away
(possibly when I rebooted the system?). The output of "qconf -sconf"
displayed nothing. After I re-created the configuration file, I was
able to run jobs again.
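
For anyone who hits the same thing: it is worth dumping the configs to
flat files so they can be restored later. A minimal sketch, assuming a
stock SGE install (the backup paths are my own choice):

  qconf -sconf  > /root/sge-global-conf.bak    # global configuration
  qconf -ssconf > /root/sge-sched-conf.bak     # scheduler configuration

If a config is already gone, "qconf -mconf" (global) and
"qconf -msconf" (scheduler) open an editor so you can re-enter the
values by hand.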

The symptoms in this thread were nearly identical to mine, most
notably the 'got max. unheard timeout for target "execd" on host...'
messages in the qmaster log, which led me to believe there was a
communications problem between the qmaster and the sge_execd daemons
on the nodes. Unfortunately, the error logs in this instance weren't
very helpful.
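
In case it saves someone time, the logs I was grepping live under the
qmaster and execd spool directories. The paths below assume the
default cell name and spool layout; adjust for your site:

  tail $SGE_ROOT/default/spool/qmaster/messages    # qmaster log
  tail $SGE_ROOT/default/spool/<node>/messages     # execd log, per node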

Hope this helps.

-adam

Joe Landman wrote:

> Chris Dagdigian wrote:
>
>>
>> Sensible error messages at least.
>>
>> (1) Are sge_qmaster and sge_schedd daemons running OK on the master?
>>
>> (2) Are there any firewalls blocking TCP port 536? Grid Engine
>> requires two TCP ports: one used by sge_qmaster and the other used
>> for sge_execd communication. (A quick check for this is sketched
>> below.)
>>
>> (3) I've seen qrsh errors similar to this when $SGE_ROOT was being
>> shared cluster-wide via NFS with extremely locked-down export
>> permissions that forbade suid operations or remapped the root UID
>> to a different, non-privileged user account. Grid Engine has some
>> setuid binaries that should not be blocked or remapped; odd
>> permissions will certainly break qrsh commands and sometimes other
>> things as well. You may want to look at file permissions and how
>> they appear from the head (qmaster) node versus how they look when
>> you log in to a compute node. (Also sketched below.)
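>>
>> A couple of quick checks for (2) and (3). These are sketches only:
>> the port numbers come from this thread, and the paths assume a
>> classic shared $SGE_ROOT layout.
>>
>> # (2) older installs resolve the ports via /etc/services
>> grep -i sge /etc/services
>> #   sge_qmaster   536/tcp
>> #   sge_execd     537/tcp
>>
>> # (3) the setuid helpers should look identical from the qmaster
>> # and from a compute node (the arch directory varies per platform)
>> ls -l $SGE_ROOT/utilbin/*/rsh $SGE_ROOT/utilbin/*/rlogin
>> mount    # check the $SGE_ROOT mount for a "nosuid" option
>> # on the NFS server, the export should not squash root, e.g.:
>> #   /opt/gridengine  10.1.0.0/255.255.0.0(rw,no_root_squash)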
>>
>> I'm not familiar with recent ROCKS, so I can't say for sure how the
>> SGE rocks-roll is deployed, or even whether it uses a shared NFS
>> $SGE_ROOT by default. Sorry about that.
>>
>> { Just noticed Joe replying, he knows ROCKS far far better than I !! }
>
>
> Hi Chris :)
>
>   Usually I see name service issues, but more often than not, I see 
> iptables get in the way.
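>
>   A quick sanity check on the name-service side (host names are the
> ones from this cluster; substitute your own). Forward and reverse
> lookups should agree on every node:
>
> getent hosts compute-0-0.local
> getent hosts 10.1.255.254
> hostname --fqdn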
>
>   If you look on the head node with lsof (lsof is one of your many
> friends):
>
> [root at minicc ~]# lsof -i | grep -i sge
> sge_qmast  3072     sge    3u  IPv4   6914       TCP *:536 (LISTEN)
> sge_qmast  3072     sge    4u  IPv4   6934       TCP minicc.scalableinformatics.com:536->minicc.scalableinformatics.com:32781 (ESTABLISHED)
> sge_qmast  3072     sge    5u  IPv4 497728       TCP minicc.scalableinformatics.com:536->compute-0-0.local:33254 (ESTABLISHED)
> sge_sched  3091     sge    3u  IPv4   6933       TCP minicc.scalableinformatics.com:32781->minicc.scalableinformatics.com:536 (ESTABLISHED)
>
>
> You will see that it happily talks on port 536. This is good; we
> will play with this in a second.
>
> On the compute node, you will see something like this
>
> [root at compute-0-0 ~]# lsof  -i | grep -i sge
> sge_execd  3034     sge    3u  IPv4   6255       TCP *:537 (LISTEN)
> sge_execd  3034     sge    4u  IPv4  96002       TCP compute-0-0.local:33254->minicc.scalableinformatics.com:536 (ESTABLISHED)
>
> where the execd is in listen mode on port 537.  Now to check 
> connectivity.
>
> [root at compute-0-0 ~]# telnet minicc.local 536
> Trying 10.1.0.1...
> Connected to minicc.local (10.1.0.1).
> Escape character is '^]'.
>
> Yup, we can get through from the compute node to the head node. This
> means that the traffic is not being blocked by iptables on either
> node. Let's try the other way:
>
> [root at minicc ~]# telnet c0-0 537
> Trying 10.1.255.254...
> Connected to compute-0-0.local (10.1.255.254).
> Escape character is '^]'.
>
> That also worked. They should both work. If they don't, this is a
> problem.
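>
> If one direction fails, look at the packet filter on the node you
> could not reach. A minimal sketch; the rule placement is
> illustrative and should be folded into your real firewall policy:
>
> [root at compute-0-0 ~]# iptables -L INPUT -n | grep -E '536|537'
> [root at compute-0-0 ~]# iptables -I INPUT -p tcp --dport 537 -j ACCEPT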
>
> As for qrsh working, the default install of Rocks 4.1 does not have a 
> working qrsh.  I usually install my own SGE if I want a working qrsh 
> (which I usually do).
>
> [landman at minicc ~]$ qrsh uname -a
> poll: protocol failure in circuit setup
>
> You should still be able to run a batch job like this:
>
> [landman at minicc ~]$ cat > e
> #!/bin/tcsh
> #$ -S /bin/tcsh
> uname -a
> date
> cat /proc/cpuinfo
> [landman at minicc ~]$ chmod +x e
> [landman at minicc ~]$ qsub e
> Your job 4 ("e") has been submitted.
> [landman at minicc ~]$ qstat
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>       4 0.00000 e          landman      qw    05/22/2006 12:16:29                                    1
> [landman at minicc ~]$ qstat
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>       4 0.00000 e          landman      qw    05/22/2006 12:16:29                                    1
> [landman at minicc ~]$ qstat
> [landman at minicc ~]$
> [landman at minicc ~]$ cat e.o4
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> Linux compute-0-0.local 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64 x86_64 GNU/Linux
> Mon May 22 12:16:37 EDT 2006
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 15
> model           : 37
> model name      : AMD Opteron(tm) Processor 252
> stepping        : 1
> cpu MHz         : 2592.694
> cache size      : 1024 KB
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 1
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow pni ts
> bogomips        : 5095.42
> TLB size        : 1088 4K pages
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 40 bits physical, 48 bits virtual
> power management: ts fid vid ttp
>
> processor       : 1
> vendor_id       : AuthenticAMD
> cpu family      : 15
> model           : 37
> model name      : AMD Opteron(tm) Processor 252
> stepping        : 1
> cpu MHz         : 2592.694
> cache size      : 1024 KB
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 1
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow pni ts
> bogomips        : 5177.34
> TLB size        : 1088 4K pages
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 40 bits physical, 48 bits virtual
> power management: ts fid vid ttp
>
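> If a job sits in "qw" and never leaves it, it is usually faster to
> ask the scheduler why than to guess. A sketch, assuming scheduler
> job info has been enabled ("schedd_job_info true" via
> "qconf -msconf"):
>
> [landman at minicc ~]$ qstat -j 4     # job id from the qsub above
>
> The "scheduling info:" lines at the bottom of the output report why
> each queue was skipped.
>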
> Joe
>
>
>>
>>
>> -Chris
>>
>>
>>
>>
>> On May 22, 2006, at 4:52 PM, Mark_Johnson at URSCorp.com wrote:
>>
>>> Kickstarted 16:21 27-Mar-2006
>>> [urs1 at medusa ~]$ qrsh hostname
>>> error: error waiting on socket for client to connect: Interrupted 
>>> system
>>> call
>>> error: unable to contact qmaster using port 536 on host
>>> "medusa.ursdcmetro.com"
>>> [urs1 at medusa ~]$
>>>
>>> Mark A. Johnson
>>> URS Network Administrator
>>> Gaithersburg, MD
>>> Ph:  301-721-2231
>>
>>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
