[GE users] SGE noobie

craffi dag at sonsorol.org
Wed May 27 01:33:57 BST 2009


Is there a technical or other reason why you have "/SGE6" and "/SGE6b" ?

Ideally you need a consistent and shared path so that it is "/SGE6"  
everywhere.

- NFS export "/SGE6" on your master host
- Mount machine1:/SGE6 to /SGE6 on all your other hosts
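Before running ./install_execd on machine2, a quick pre-flight check can confirm that the mount is actually in place. This is a minimal sketch that assumes Python 3 is available on the execution host. `check_shared_root` is a hypothetical helper and is not part of SGE:

```python
import os

def check_shared_root(sge_root="/SGE6", cell="default"):
    """Hypothetical pre-flight check for an execution host.

    Returns a list of human-readable problems. An empty list means the
    shared SGE_ROOT looks mounted and the cell directory is visible.
    """
    problems = []
    if not os.path.isdir(sge_root):
        problems.append(f"{sge_root} does not exist")
        return problems
    # On execution hosts /SGE6 is expected to be the NFS mount from the
    # master rather than a local copy, which could hold stale data.
    if not os.path.ismount(sge_root):
        problems.append(f"{sge_root} is not a mount point")
    common = os.path.join(sge_root, cell, "common")
    if not os.path.isdir(common):
        problems.append(f"cell directory {common} not found")
    return problems
```

On the master host /SGE6 is a local directory, so the "not a mount point" warning only means something on the other hosts.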

Then your qmaster will write its hostname to
/SGE6/default/common/act_qmaster. When you run ./install_execd on
machine2 from /SGE6, it will detect the presence of the cell directory
("default/"), automatically read in default/common/act_qmaster, and
register with the qmaster.
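You can reproduce the lookup the installer relies on with a few lines of Python. This is a minimal sketch that assumes the cell is named "default". When no arguments are given, it falls back to the standard SGE_ROOT and SGE_CELL environment variables. `read_act_qmaster` is a hypothetical helper, not an SGE tool:

```python
import os

def read_act_qmaster(sge_root=None, cell=None):
    """Return the qmaster hostname recorded in $SGE_ROOT/$SGE_CELL/common/act_qmaster."""
    # Fall back to the usual SGE environment variables, then to the paths in this thread.
    sge_root = sge_root or os.environ.get("SGE_ROOT", "/SGE6")
    cell = cell or os.environ.get("SGE_CELL", "default")
    path = os.path.join(sge_root, cell, "common", "act_qmaster")
    with open(path) as f:
        # The file holds the qmaster's hostname on a single line.
        return f.read().strip()
```

On machine2, read_act_qmaster("/SGE6") should return machine1's hostname. If the root you point the installer at names machine2 instead, the installer will try to contact the wrong host, as in the error quoted below.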

The "connection refused" error usually indicates a firewall problem or
a problem with multiple NICs on the master host, but as you mentioned,
this error does not matter here, since the installer is trying to
contact the wrong host anyway. There is a chance that you have old or
incorrect data in /SGE6b and that the installer is somehow picking up
the wrong master from there.
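You can check both suspicions by hand. The sketch below is hypothetical Python, not part of SGE. It does two things:

- It attempts a plain TCP connection to the qmaster port and separates "refused" from "timeout". This is only a heuristic: a refusal means the host answered but nothing accepted on that port, and a timeout often points to a firewall silently dropping packets.
- It reads act_qmaster under each candidate root, so stale data in /SGE6b stands out.

```python
import os
import socket

def probe_qmaster(host, port, timeout=5.0):
    """Attempt a plain TCP connect to host:port and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on that port
        # (qmaster not listening there, or a firewall actively rejecting).
        return "refused"
    except socket.timeout:
        # No answer within the timeout: often a firewall silently dropping packets.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and so on.
        return f"error: {exc}"

def act_qmaster_by_root(roots, cell="default"):
    """Map each candidate SGE_ROOT to the qmaster host its act_qmaster file names."""
    found = {}
    for root in roots:
        path = os.path.join(root, cell, "common", "act_qmaster")
        try:
            with open(path) as f:
                found[root] = f.read().strip()
        except FileNotFoundError:
            found[root] = None  # no cell directory under this root
    return found
```

On machine2, you might first call act_qmaster_by_root(["/SGE6", "/SGE6b"]). If the two roots name different hosts, the /SGE6b copy is a likely culprit. Then call probe_qmaster("machine1", 288), using the port from the error message, to see whether the qmaster port is reachable from machine2.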

Again, unless there is a technical, security or political reason for
doing things differently, you really should have paths that are the
same regardless of host. This means shared home directories for users
and a shared $SGE_ROOT at a consistent path.

-Chris




On May 26, 2009, at 5:17 PM, biostat wrote:

> I have installed the qmaster on my master host computer, machine1,
> under the directory /SGE6. I then NFS-exported that folder to
> machine 2. On machine 2, I ran ./install_execd (I did not run the
> install_execd script from the NFS-mounted directory; I ran it from a
> local directory, /SGE6b). During the execd install, it asked for my
> root path, and I gave it the NFS-mounted folder. However, a few steps
> later, I ran into this problem:
>
> Checking hostname resolving
> ---------------------------
>
> Cannot contact qmaster. The command failed:
>
>   ./bin/darwin-x86/qconf -sh
>
> The error message was:
>
>   error: commlib error: can't connect to service (Connection refused)
> ERROR: unable to contact qmaster using port 288 on host "machine2.xxx.yyy.edu"
>
> You can fix the problem now or abort the installation  procedure.
> The problem can be:
>
>   - the qmaster is not running
>   - the qmaster host is down
>   - an active firewall blocks your request
>
> Contact qmaster again (y/n) ('n' will abort) [y] >>
>
>
> The qmaster *is* running on machine1, but it doesn't seem like it is
> looking at machine1. It's looking at machine2.
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=199058
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=199091
