[GE users] SGE jobs in "qw" state

Joe Landman landman at scalableinformatics.com
Mon May 22 22:21:56 BST 2006


Chris Dagdigian wrote:
> 
> Sensible error messages at least.
> 
> (1) Are sge_qmaster and sge_schedd daemons running OK on the master?
> 
> (2) Are there any firewalls blocking TCP port 536? Grid Engine requires 
> 2 TCP ports, one used by sge_qmaster and the other used for sge_execd 
> communication.
> 
> (3) I've seen qrsh errors similar to this when the $SGE_ROOT was being 
> shared cluster wide via NFS yet with extremely locked down export 
> permissions that forbid suid operations or remapped the root user UID to 
> a different, non-privileged user account.  Grid Engine has some setuid 
> binaries that should not be blocked or remapped and odd permissions will 
> certainly break qrsh commands and sometimes other things as well. You 
> may want to look at file permissions and how they appear from the head 
> (qmaster ) node versus how they look when you login to a compute node.
> 
> I'm not familiar with recent ROCKS so I can't say for sure how the SGE 
> rocks-roll is deployed or even if it uses a shared NFS $SGE_ROOT by 
> default. Sorry about that.
> 
> { Just noticed Joe replying, he knows ROCKS far far better than I !! }

Hi Chris :)

   I do see name service issues fairly often, but more often than not it 
is iptables getting in the way.
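
Both suspects are quick to rule out by hand.  A minimal sketch, assuming 
getent is installed on the nodes (it ships with glibc); the helper name 
"resolves" is my own and not part of SGE or Rocks:

```shell
# resolves HOST: succeed if the system resolver (whatever nsswitch.conf
# points at: files, NIS, DNS) can map HOST to an address.
resolves() {
    getent hosts "$1" >/dev/null
}

# As root, list the INPUT chain numerically and look for a REJECT or DROP
# rule that would cover the SGE ports:
#   iptables -L INPUT -n --line-numbers
```

Run "resolves minicc.local" on a compute node and "resolves 
compute-0-0.local" on the head node.  A non-zero exit status points at name 
service rather than the firewall.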

   If you look on the head node with lsof (lsof is one of your many 
friends), you should see something like this:

[root@minicc ~]# lsof -i | grep -i sge
sge_qmast  3072     sge    3u  IPv4   6914       TCP *:536 (LISTEN)
sge_qmast  3072     sge    4u  IPv4   6934       TCP minicc.scalableinformatics.com:536->minicc.scalableinformatics.com:32781 (ESTABLISHED)
sge_qmast  3072     sge    5u  IPv4 497728       TCP minicc.scalableinformatics.com:536->compute-0-0.local:33254 (ESTABLISHED)
sge_sched  3091     sge    3u  IPv4   6933       TCP minicc.scalableinformatics.com:32781->minicc.scalableinformatics.com:536 (ESTABLISHED)


You will see that the qmaster is listening on port 536 and happily 
talking to both the scheduler and the execd on compute-0-0.  This is 
good; we will use this port in a second.

On the compute node, you will see something like this:

[root@compute-0-0 ~]# lsof  -i | grep -i sge
sge_execd  3034     sge    3u  IPv4   6255       TCP *:537 (LISTEN)
sge_execd  3034     sge    4u  IPv4  96002       TCP compute-0-0.local:33254->minicc.scalableinformatics.com:536 (ESTABLISHED)

where the execd is in listen mode on port 537.  Now to check connectivity.

[root@compute-0-0 ~]# telnet minicc.local 536
Trying 10.1.0.1...
Connected to minicc.local (10.1.0.1).
Escape character is '^]'.

Yup, we can get through from the compute node to the head node.  This 
means that the connection is not being blocked by iptables on either 
node.  Let's try the other way:

[root@minicc ~]# telnet c0-0 537
Trying 10.1.255.254...
Connected to compute-0-0.local (10.1.255.254).
Escape character is '^]'.

That also worked.  Both directions should work.  If either one fails, 
that is a problem.
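
If you would rather not sit in telnet, the same check can be scripted.  
This is only a sketch: check_port is a name I made up, and it assumes bash 
(for its /dev/tcp redirection) and the coreutils timeout command are 
installed on both nodes.

```shell
# check_port HOST PORT: print "open" if a TCP connection to HOST:PORT
# succeeds within 3 seconds, otherwise print "unreachable".
# "unreachable" covers both a firewall block and nothing listening.
check_port() {
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null; then
        echo "open"
    else
        echo "unreachable"
    fi
}
```

On the compute node, "check_port minicc.local 536" mirrors the first telnet 
test; on the head node, "check_port compute-0-0.local 537" mirrors the 
second.  Both should print "open".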

As for qrsh working, the default install of Rocks 4.1 does not have a 
working qrsh.  I usually install my own SGE if I want a working qrsh 
(which I usually do).

[landman@minicc ~]$ qrsh uname -a
poll: protocol failure in circuit setup

You should still be able to run a batch job like this:

[landman@minicc ~]$ cat > e
#!/bin/tcsh
#$ -S /bin/tcsh
uname -a
date
cat /proc/cpuinfo
[landman@minicc ~]$ chmod +x e
[landman@minicc ~]$ qsub e
Your job 4 ("e") has been submitted.
[landman@minicc ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
       4 0.00000 e          landman      qw    05/22/2006 12:16:29                                1
[landman@minicc ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
       4 0.00000 e          landman      qw    05/22/2006 12:16:29                                1
[landman@minicc ~]$ qstat
[landman@minicc ~]$
[landman@minicc ~]$ cat e.o4
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Linux compute-0-0.local 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64 x86_64 x86_64 GNU/Linux
Mon May 22 12:16:37 EDT 2006
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 37
model name      : AMD Opteron(tm) Processor 252
stepping        : 1
cpu MHz         : 2592.694
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow pni ts
bogomips        : 5095.42
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 37
model name      : AMD Opteron(tm) Processor 252
stepping        : 1
cpu MHz         : 2592.694
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow pni ts
bogomips        : 5177.34
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
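
Rather than re-running qstat by hand as I did above, the wait can be 
scripted.  A rough sketch, assuming the default qstat column order shown in 
the listing (state is the fifth field); job_state is a made-up helper name:

```shell
# job_state JOBID: read qstat output on stdin and print the state column
# (qw, r, ...) for JOBID.  Prints nothing if the job is no longer listed.
job_state() {
    awk -v id="$1" '$1 == id { print $5; exit }'
}
```

For example:

    while [ -n "$(qstat | job_state 4)" ]; do sleep 5; done; cat e.o4

An empty result means the job has left the queue, which matches the empty 
qstat output above.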

Joe


> 
> 
> -Chris
> 
> 
> 
> 
> On May 22, 2006, at 4:52 PM, Mark_Johnson at URSCorp.com wrote:
> 
>> Kickstarted 16:21 27-Mar-2006
>> [urs1@medusa ~]$ qrsh hostname
>> error: error waiting on socket for client to connect: Interrupted system call
>> error: unable to contact qmaster using port 536 on host "medusa.ursdcmetro.com"
>> [urs1@medusa ~]$
>>
>> Mark A. Johnson
>> URS Network Administrator
>> Gaithersburg, MD
>> Ph:  301-721-2231
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net


-- 

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615
