[GE users] qlogin fails from ROCKS compute node to external SGE exec host

reuti reuti at staff.uni-marburg.de
Wed Nov 25 23:44:35 GMT 2009


On 25.11.2009, at 23:50, bergman wrote:

> I'm having trouble getting qlogin to work from compute nodes in a ROCKS
> (Platform ROCKS 4.1.1.1) cluster to SGE (6.2u3) execution hosts outside
> the cluster.
>
> SGE batch jobs and ssh sessions can be successfully launched on the
> compute node and executed on the external server, but qlogin and qsh
> jobs fail. The following messages appear when launching an interactive
> job:
>
> 	local configuration compute-1-10.local not defined - using global configuration
> 	Your job 791494 ("QLOGIN") has been submitted
> 	waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 4
> 	Could not start interactive job.
>
> I'm using SSH integration with a qlogin_wrapper script, as described in:
> 	http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
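
(For reference, the wrapper from that howto is essentially a small shell
script that SGE invokes with the target host and port. A minimal sketch;
the install path /usr/local/bin/qlogin_wrapper and the sshd location are
assumptions:

	#!/bin/sh
	# qlogin_wrapper: SGE calls this as <wrapper> <host> <port> ...
	HOST=$1
	PORT=$2
	shift 2
	# connect to the one-shot sshd that SGE starts on the execution host
	exec /usr/bin/ssh -X -p $PORT $HOST

with the matching entries in the global configuration, "qconf -mconf":

	qlogin_command   /usr/local/bin/qlogin_wrapper
	qlogin_daemon    /usr/sbin/sshd -i
)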
>
> A crude ASCII diagram of the network configuration is:
>
>              [server1]
>                |
>                |
>              [headnode]
>                  ||
>                  ||
>                  NAT -- all connections from compute nodes are NAT'ed
>                  ||     to appear to come from the headnode.
>                  ||
>              [compute node]
>
>
> The execution host "server1" provides the resource "GPU". This is
> defined in a resource complex, with a boolean value (not consumable).
> Jobs that require the resource are launched with:
> 	qsub -l GPU
> or
> 	qlogin -l GPU
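
(Such a boolean complex is typically defined with "qconf -mc" and then
attached to the host. A sketch, assuming the names used above; the column
values follow the usual complex layout:

	#name  shortcut  type  relop  requestable  consumable  default  urgency
	GPU    gpu       BOOL  ==     YES          NO          0        0

	# attach it to the execution host offering the GPU:
	qconf -aattr exechost complex_values GPU=TRUE server1
)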
>
> When jobs are launched, here's a list of the status
> of different connections:
>
> 	"qlogin -l GPU" headnode => server1:    OK
> 	ssh headnode => server1:                OK
> 	"qsub -l GPU" headnode => server1:      OK
> 	ssh compute => server1:                 OK
> 	"qsub -l GPU" compute => server1:       OK
> 	"qlogin" compute => other nodes:        OK
> 	"qlogin -l GPU" compute => server1:     FAILURE
>
> The status is the same with or without iptables firewall enabled on
> server1 and headnode (there's no firewall on the compute nodes).
>
> The SGE shepherd connection from the compute node to server1 takes place
> (as shown by network packet traces, the "lastcomm" command, etc.), but I
> don't see any indication that SGE is launching an sshd process on server1
> to listen for the connection being launched from the qlogin_wrapper
> script.
>
> Changing the "qlogin_wrapper" to call ssh with verbose debugging and
> other output produces nothing...implying that the "qlogin_wrapper" is
> never being called since the first phase of qlogin (starting sshd on
> "server1") fails.
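
(One way to verify that the wrapper really never runs is to have it log
unconditionally before doing anything else. A sketch, with
/tmp/qlogin_wrapper.log as an assumed scratch path:

	#!/bin/sh
	# leave a trace of every invocation, even if ssh never starts
	echo "`date` on `hostname`: args=$*" >> /tmp/qlogin_wrapper.log
	HOST=$1
	PORT=$2
	shift 2
	exec /usr/bin/ssh -v -p $PORT $HOST 2>> /tmp/qlogin_wrapper.log

An empty log after a failed qlogin would support the diagnosis below.)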
>
> The log file ($SGE_ROOT/default/spool/$HOSTNAME/messages) shows:
>
> 	server1: no entries in the "messages" file about either the initial
> 		sshd connection or the ssh connection that's supposed to be
> 		run from qlogin_wrapper
>
> 	compute: no entries in the "messages" file about either the initial
> 		sshd connection or the ssh connection that's supposed to be
> 		run from qlogin_wrapper
>
> 	qmaster: entries in the "qmaster" log show:
> 		worker|qmaster|W|job 791495.1 failed on host server1 assumedly after job because: job 791495.1 died through signal KILL (9)
>
> Does this diagnosis--that sshd is never being launched on the node where
> the ssh connection should happen--make sense?
>
> Any suggestions for how to debug this further?

So, you log in to a compute node and then issue qlogin from there to
an outside server - an unusual setup. Anyway:

Does server1 know the hosts inside the cluster, i.e. does compute-0-0
resolve to something? AFAIK SGE will check that the address of the
incoming connection (for the rsh or builtin method) originates from the
issuing machine, which would fail due to the NAT. As you are using SSH,
that check shouldn't apply - but the shepherd startup will still try to
resolve compute-0-0 even though it's not needed, and it might hang at
that point.
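
A quick way to test that on server1 (the node names are just the example
above, and the utilbin architecture directory is an assumption):

	# does the compute node's name resolve at all on server1?
	getent hosts compute-0-0 compute-0-0.local

	# what does SGE itself resolve the name to?
	$SGE_ROOT/utilbin/lx24-amd64/gethostbyname compute-0-0.local

If resolution is the problem, adding the cluster-internal names to
/etc/hosts on server1 (or mapping them in
$SGE_ROOT/default/common/host_aliases) would be the usual fix.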

-- Reuti

PS: qsub is different, as it doesn't need a direct connection between
the issuing and the executing machine at any point.


> Thanks,
>
> Mark
>
>
> ----
> Mark Bergman                              voice: 215-662-7310
> mark.bergman at uphs.upenn.edu                 fax: 215-614-0266
> System Administrator     Section of Biomedical Image Analysis
> Department of Radiology            University of Pennsylvania
>       PGP Key: https://www.rad.upenn.edu/sbia/bergman
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=229412

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list