No subject


Wed Jan 12 20:38:46 GMT 2011


I can do the same from any worker node to any worker node.
But I cannot run jobs like "qrsh -l h=norbert hostname" from
a worker node (it does work from norbert to norbert, however).

The firewall is off, I have checked permissions, and all nodes
are exec and submit hosts; norbert is the admin host.
ssh is used for remote login.
I have spent time going through debug traces and logs
and cannot find anything obvious.

Does anyone have any ideas on what to check next?
I am wondering if the head node has some security
feature preventing qrsh from working, but I cannot
think of anything.

Here are some data:

CASE 1: From worker node to head node (UNSUCCESSFUL)
====================================================

"qrsh  -l h=norbert  hostname"

Example ps output on head node (norbert)
----------------------------------------
14231 ?        Sl     0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
14257 ?        S      0:00  \_ sge_shepherd-490 -bg
14258 ?        Ss     0:00      \_ sge_shepherd-490 -bg

Trace from qrsh on node (dl=3)
------------------------------
    18   6647         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
    19   6647         main     ============================================
    20   6647         main     random polling set to 3
    21   6647         main     ---- got NO valid socket! ----
    22   6647         main     sge_set_auth_info: username(uid) = deadline(500), groupname = deadline(500)


Relevant parts of execd trace file for job
------------------------------------------
08/26/2010 08:31:32 [0:14441]: now running with uid=0, euid=0
08/26/2010 08:31:32 [0:14441]: start qlogin
08/26/2010 08:31:32 [0:14441]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1, /usr/sbin/sshd -i);
08/26/2010 08:31:32 [0:14441]: uid = 0, euid = 0, gid = 0, egid = 0
08/26/2010 08:31:32 [0:14441]: using sfd 0
08/26/2010 08:31:32 [0:14441]: bound to port 39897
08/26/2010 08:31:32 [0:14441]: write_to_qrsh - data = 0:39897:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1:norbert
08/26/2010 08:31:32 [0:14441]: write_to_qrsh - address = n2.cluster:56747
08/26/2010 08:31:32 [0:14441]: write_to_qrsh - host = n2.cluster, port = 56747
08/26/2010 08:31:32 [0:14440]: parent: forked "job" with pid 14441
08/26/2010 08:31:32 [0:14440]: parent: job-pid: 14441
08/26/2010 08:32:17 [0:14440]: wait3 returned -1
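
In the trace above, the child process logs the write_to_qrsh lines and
then nothing else, while qrsh on the worker reports "got NO valid socket".
One thing that might be worth ruling out (this is a guess, the traces do
not prove it) is whether norbert resolves the submitting host's name to
the right address and can open a TCP connection back to it. The sketch
below is a hypothetical helper, not part of Grid Engine: it pulls the
host and port out of a "write_to_qrsh - address" line and tries to
connect. The function names are made up for illustration.

```python
import re
import socket

# Matches e.g. "write_to_qrsh - address = n2.cluster:56747"
ADDRESS_RE = re.compile(
    r"write_to_qrsh - address = (?P<host>[^\s:]+):(?P<port>\d+)"
)


def parse_callback_address(trace_line):
    """Return (host, port) from a 'write_to_qrsh - address' line, or None."""
    match = ADDRESS_RE.search(trace_line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


def check_callback(host, port, timeout=5.0):
    """Resolve host, then try a TCP connect.

    Returns (resolved_addresses, error), where error is None on success
    and a short description of the failure otherwise.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [], "resolution failed: %s" % exc
    addresses = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return addresses, None
    except OSError as exc:
        return addresses, "connect failed: %s" % exc
```

The port is different on every run, so a connect test only means
something while a qrsh is still waiting on the worker. Even without
that, running check_callback("n2.cluster", 56747) on norbert shows which
addresses the name resolves to there, which can be compared with what
the same call returns on a worker node.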

CASE 2: From worker node to worker node (SUCCESSFUL)
====================================================

"qrsh  -l h=n0  hostname"

Example ps output on worker node (n0)
-------------------------------------
 3875 ?        Sl     0:11 /opt/gridengine/bin/lx26-amd64/sge_execd
 4083 ?        S      0:00  \_ sge_shepherd-483 -bg
 4084 ?        Ss     0:00      \_ sshd: deadline [priv]
 4086 ?        S      0:00          \_ sshd: deadline@notty
 4087 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool/sge_execd/n0/active_jobs/483.1
 4100 ?        S      0:00                  \_ sleep 500

Trace from qrsh on node (dl=3)
------------------------------
    18   6640         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
    19   6640         main     ============================================
    20   6640         main     random polling set to 3
    21   6640         main     accepted client connection, fd = 3
    22   6640         main     qlogin_starter sent: 0:58576:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/509.1:n0
    23   6640         main     accepted client connection, fd = 3
    24   6640         main     exit_status = 0

Relevant parts of execd trace file for job on worker node
---------------------------------------------------------
08/26/2010 08:39:57 [0:4355]: now running with uid=0, euid=0
08/26/2010 08:39:57 [0:4355]: start qlogin
08/26/2010 08:39:57 [0:4355]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1, /usr/sbin/sshd -i);
08/26/2010 08:39:57 [0:4355]: uid = 0, euid = 0, gid = 0, egid = 0
08/26/2010 08:39:57 [0:4355]: using sfd 0
08/26/2010 08:39:57 [0:4355]: bound to port 48105
08/26/2010 08:39:57 [0:4355]: write_to_qrsh - data = 0:48105:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1:n0
08/26/2010 08:39:57 [0:4355]: write_to_qrsh - address = n2.cluster:54407
08/26/2010 08:39:57 [0:4355]: write_to_qrsh - host = n2.cluster, port = 54407
08/26/2010 08:39:57 [0:4355]: waiting for connection.
08/26/2010 08:39:57 [0:4355]: accepted connection on fd 1
08/26/2010 08:39:57 [0:4355]: daemon to start: |/usr/sbin/sshd -i|
08/26/2010 08:39:57 [0:4354]: wait3 returned 4355 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
08/26/2010 08:39:57 [0:4354]: job exited with exit status 0
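
To make the comparison between the two execd traces less error-prone
than reading them side by side, here is a rough sketch that lines up the
messages of one process from each trace and reports the first step where
they differ. It assumes each trace entry is on a single line in the
"MM/DD/YYYY HH:MM:SS [uid:pid]: message" form shown above. The function
names and the idea of masking host names and numbers are my own
illustration, not a Grid Engine tool.

```python
import re
from itertools import zip_longest

# "08/26/2010 08:39:57 [0:4355]: message" -> pid and message
LINE_RE = re.compile(
    r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} \[\d+:(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def messages_for_pid(lines, pid, hosts=()):
    """Return the messages logged by one pid, with run-specific details masked.

    Host names listed in `hosts` become "HOST" and every run of digits
    becomes "N", so that ports, job ids and node names do not show up as
    differences between two runs.
    """
    host_re = None
    if hosts:
        host_re = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(h) for h in hosts))
    messages = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None or int(match.group("pid")) != pid:
            continue
        msg = match.group("msg")
        if host_re is not None:
            msg = host_re.sub("HOST", msg)
        messages.append(re.sub(r"\d+", "N", msg))
    return messages


def first_divergence(good_messages, bad_messages):
    """Return (index, good, bad) for the first differing step, or None.

    A value of None in the good or bad slot means that trace had already
    ended at that step.
    """
    pairs = zip_longest(good_messages, bad_messages)
    for index, (good, bad) in enumerate(pairs):
        if good != bad:
            return index, good, bad
    return None
```

For the traces in this message the call would look like
first_divergence(messages_for_pid(good, 4355, ["norbert", "n0"]),
messages_for_pid(bad, 14441, ["norbert", "n0"])), where good and bad are
the lines of the two trace files. In the successful run pid 4355 goes on
to log "waiting for connection." after the write_to_qrsh lines; in the
failing run pid 14441 logs nothing after them.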

--
Doug

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277191
