[GE users] qlogin fails from ROCKS compute node to external SGE exec host

bergman mark.bergman at uphs.upenn.edu
Wed Nov 25 22:50:44 GMT 2009


I'm having trouble getting qlogin to work from compute nodes in a ROCKS
(Platform ROCKS 4.1.1.1) cluster to SGE (6.2u3) execution hosts outside the
cluster.

SGE batch jobs submitted from a compute node run successfully on the external
server, and ssh from the compute node to that server works, but qlogin and qsh
jobs fail. The following messages appear when launching an interactive job:

	local configuration compute-1-10.local not defined - using global configuration
	Your job 791494 ("QLOGIN") has been submitted
	waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 4
	Could not start interactive job.

I'm using SSH integration with a qlogin_wrapper script, as described in:
	http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
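
A minimal sketch of such a wrapper, with logging added for debugging, might
look like the following. It assumes SGE passes the target host and port as the
first two arguments, as in the howto; the function names and the
QLOGIN_WRAPPER_LOG variable are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Sketch of a qlogin wrapper with logging; names and log path are hypothetical.
# Assumption: SGE invokes qlogin_command as: <wrapper> <host> <port>

# Build the ssh command line for the given host ($1) and port ($2).
build_ssh_cmd() {
    printf '%s\n' "/usr/bin/ssh -v -p $2 $1"
}

# Record that the wrapper was called, then hand over to ssh.
log_and_connect() {
    logfile=${QLOGIN_WRAPPER_LOG:-/tmp/qlogin_wrapper.log}
    echo "$(date): wrapper called with host=$1 port=$2" >> "$logfile"
    # ssh -v writes its debugging output to stderr, which is appended to the log
    exec $(build_ssh_cmd "$1" "$2") 2>> "$logfile"
}

# Only connect when called with both arguments, as SGE would call it.
if [ "$#" -ge 2 ]; then
    log_and_connect "$@"
fi
```

If the log file stays empty after a failed qlogin, the wrapper was never
invoked at all.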

A crude ASCII diagram of the network configuration is:
 
             [server1] 
               |
               | 
             [headnode]
                 ||
                 ||
                 NAT -- all connections from compute nodes are NAT'ed to
                 ||     appear to come from the headnode.
                 ||
             [compute node]


The execution host "server1" provides the resource "GPU". This is defined in a
resource complex, with a boolean value (not consumable). Jobs that require the
resource are launched with:
	qsub -l GPU
or
	qlogin -l GPU

Here is the status of each type of connection when jobs are launched:

	"qlogin -l GPU" headnode => server1:    OK
	ssh headnode => server1:                OK
	"qsub -l GPU" headnode => server1:      OK
	ssh compute => server1:                 OK
	"qsub -l GPU" compute => server1:       OK
	"qlogin" compute => other nodes:        OK
	"qlogin -l GPU" compute => server1:     FAILURE

The status is the same with or without the iptables firewall enabled on
server1 and the headnode (there's no firewall on the compute nodes).

The SGE shepherd connection from the compute node to server1 does take place
(as shown by network packet traces, the "lastcomm" command, etc.), but I don't
see any indication that SGE is launching an sshd process on server1 to
listen for the connection initiated by the qlogin_wrapper script.
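
One way to look for that sshd process is to filter "ps" output on server1
while the qlogin is pending. The sketch below assumes qlogin_daemon is set to
"/usr/sbin/sshd -i" as in the howto; the helper name find_inetd_sshd is
hypothetical.

```shell
# Keep only lines of ps output that show an sshd started with the -i flag.
# Assumption: qlogin_daemon is "/usr/sbin/sshd -i", as in the howto.
find_inetd_sshd() {
    grep -E '(^|[ /])sshd( .*)? -i( |$)'
}

# On server1, while the interactive job is pending:
#   ps -eo pid,ppid,user,args | find_inetd_sshd
```

No output means no sshd with -i was running at the moment of the check; the
regular system sshd (started without -i) is deliberately not matched.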

Changing qlogin_wrapper to call ssh with verbose debugging and to write other
output produces nothing, which implies that qlogin_wrapper is never called
because the first phase of qlogin (starting sshd on "server1") fails.

The log file ($SGE_ROOT/default/spool/$HOSTNAME/messages) shows:
	
	server1: no entries in the "messages" file about either the initial
		sshd connection or the ssh connection that's supposed to be
		run from qlogin_wrapper

	
	compute: no entries in the "messages" file about either the initial
		sshd connection or the ssh connection that's supposed to be
		run from qlogin_wrapper

	qmaster: entries in the "qmaster" log show:
		worker|qmaster|W|job 791495.1 failed on host server1 assumedly after job because: job 791495.1 died through signal KILL (9)
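
To collect every entry for one job from a messages file, a small filter such
as the following can help. It is a sketch only: the helper name job_log_lines
is hypothetical, and it assumes log entries name the job as "job <id>.<task>",
as in the line above.

```shell
# Print lines from messages file $2 that mention job id $1.
# Assumption: entries name the job as "job <id>.<task>", e.g. "job 791495.1".
job_log_lines() {
    jobid=$1
    logfile=$2
    # -F treats the pattern as a fixed string; the trailing dot keeps
    # 79149 from matching 791495
    grep -F "job ${jobid}." "$logfile"
}

# Example: job_log_lines 791495 "$SGE_ROOT/default/spool/$HOSTNAME/messages"
```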
		

Does this diagnosis--that sshd is never being launched on the node where the
ssh connection should happen--make sense?

Any suggestions for how to debug this further?

Thanks,

Mark


----
Mark Bergman                              voice: 215-662-7310
mark.bergman at uphs.upenn.edu                 fax: 215-614-0266
System Administrator     Section of Biomedical Image Analysis
Department of Radiology            University of Pennsylvania
      PGP Key: https://www.rad.upenn.edu/sbia/bergman

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=229401
