[GE users] SGE jobs in "qw" state

Chris Dagdigian dag at sonsorol.org
Mon May 22 22:01:52 BST 2006


Sensible error messages at least.

(1) Are sge_qmaster and sge_schedd daemons running OK on the master?
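
A rough sketch of that first check, in case it helps. It assumes a Unix-like master where `ps -e -o comm=` prints one command name per line. The helper names are made up for illustration and are not part of Grid Engine.

```python
import os
import subprocess

# Daemon names taken from the question above.
REQUIRED = ("sge_qmaster", "sge_schedd")


def list_process_names():
    """Return the command name of every process visible to ps."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def missing_daemons(process_names, required=REQUIRED):
    """Return the required daemons that do not appear in process_names."""
    # Some platforms print a full path, so compare on the basename only.
    running = {os.path.basename(name.strip()) for name in process_names}
    return [daemon for daemon in required if daemon not in running]
```

Running `missing_daemons(list_process_names())` on the master should return an empty list when both daemons are up. A daemon that is running but hung will still pass, so this is only a first filter.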

(2) Are there any firewalls blocking TCP port 536? Grid Engine  
requires 2 TCP ports, one used by sge_qmaster and the other used for  
sge_execd communication.
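
One way to test reachability from a compute node is a plain TCP connect. This is a minimal sketch: the port number 536 and the host name come from the quoted error message, and `port_open` is a made-up helper. The sge_execd port is not given anywhere in this thread, so you would have to substitute whatever your site uses.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call, using the values from the quoted error message:
#   port_open("medusa.ursdcmetro.com", 536)
```

A successful connect only shows that something is listening and that no firewall dropped the packets; it does not prove the listener is sge_qmaster. A refusal or timeout is the more informative result.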

(3) I've seen qrsh errors similar to this when $SGE_ROOT was shared
cluster-wide via NFS with extremely locked-down export permissions
that forbade suid operations or remapped the root user's UID to a
different, non-privileged account. Grid Engine has some setuid
binaries that must not be blocked or remapped; odd permissions will
certainly break qrsh commands and sometimes other things as well.
You may want to compare how file permissions appear from the head
(qmaster) node with how they look when you log in to a compute node.
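
To make that comparison less tedious, something like the following could be run on both the head node and a compute node and the results diffed. It is only a sketch: `find_setuid` and `mounted_nosuid` are made-up helpers, and I am deliberately not naming which Grid Engine binaries are setuid, since that is best read off your own install.

```python
import os
import stat


def find_setuid(root):
    """List regular files under root that carry the setuid bit.

    Each entry is (path relative to root, mode string, owner UID), so the
    output from two hosts can be compared directly.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_mode & stat.S_ISUID:
                found.append(
                    (os.path.relpath(path, root), stat.filemode(st.st_mode), st.st_uid)
                )
    return sorted(found)


def mounted_nosuid(path):
    """Return True if the filesystem holding path is mounted with nosuid."""
    return bool(os.statvfs(path).f_flag & os.ST_NOSUID)


# Example, run on each node:
#   root = os.environ["SGE_ROOT"]
#   print(mounted_nosuid(root), find_setuid(root))
```

If the owner UID differs between the two hosts, or `mounted_nosuid` is true on the compute node, that would point at the export or mount options rather than at Grid Engine itself.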

I'm not familiar with recent ROCKS so I can't say for sure how the  
SGE rocks-roll is deployed or even if it uses a shared NFS $SGE_ROOT  
by default. Sorry about that.

{ Just noticed Joe replying; he knows ROCKS far, far better than I do! }


-Chris




On May 22, 2006, at 4:52 PM, Mark_Johnson at URSCorp.com wrote:

> Kickstarted 16:21 27-Mar-2006
> [urs1@medusa ~]$ qrsh hostname
> error: error waiting on socket for client to connect: Interrupted system call
> error: unable to contact qmaster using port 536 on host "medusa.ursdcmetro.com"
> [urs1@medusa ~]$
>
> Mark A. Johnson
> URS Network Administrator
> Gaithersburg, MD
> Ph:  301-721-2231

