[GE users] qrsh problem

reuti reuti at staff.uni-marburg.de
Fri Aug 27 12:13:15 BST 2010


Hi,

On 26.08.2010 at 20:37, deadline wrote:

> OS: Scientific Linux 5.4
> Hardware: Intel x86_64, GigE
> GE version 6.2u5
> 
> Problem: I have a smallish cluster and I want to use the head node to run
> jobs. When I run parallel jobs, the nodes try to use the head node, but
> they time out. Sequential jobs run fine on all nodes (because they are
> launched from the head node). I narrowed it down using qrsh on the worker
> nodes.

Does the head node have two network interfaces?
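
If it does, the name that norbert resolves for a worker node (or that a worker resolves for norbert) may point at an address on the wrong interface. A minimal sketch for checking this from either side, assuming Python is available on the nodes. The function names are illustrative and not part of Grid Engine:

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses a name resolves to."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable names.
        return False
```

Calling resolve_ipv4("n2.cluster") on norbert and resolve_ipv4("norbert") on n2, then comparing the results with the addresses actually configured on each interface, should show whether the two sides agree. can_connect() can be pointed at any port known to be listening on the other host.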

-- Reuti


> "norbert" is head node running sge_qmaster and sge_execd
> worker nodes are "n0" and "n2"
> 
> From norbert I can "qrsh -l h=X hostname" where X is any node.
> I can do the same from any worker node to any worker node.
> But I cannot run jobs like "qrsh -l h=norbert hostname" from
> a worker node (it does work norbert -> norbert, however).
> 
> Firewall is off, I checked permissions, all nodes
> are exec and submit hosts, norbert is admin host.
> ssh is used for remote login.
> I spent time going through debug traces and logs
> and cannot find anything obvious.
> 
> Does anyone have any ideas on what to check next?
> I am wondering if the head node has some security
> feature preventing qrsh from working, but I cannot
> think of anything.
> 
> Here are some data:
> 
> CASE 1: From worker node to head node (UNSUCCESSFUL)
> ====================================================
> 
> "qrsh  -l h=norbert  hostname"
> 
> Example ps output on head node (norbert)
> ----------------------------------------
> 14231 ?        Sl     0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
> 14257 ?        S      0:00  \_ sge_shepherd-490 -bg
> 14258 ?        Ss     0:00      \_ sge_shepherd-490 -bg
> 
> Trace from qrsh on node (dl=3)
> ==============================
>    18   6647         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>    19   6647         main     ============================================
>    20   6647         main     random polling set to 3
>    21   6647         main     ---- got NO valid socket! ----
>    22   6647         main     sge_set_auth_info: username(uid) = deadline(500), groupname = deadline(500)
> 
> 
> Relevant parts of execd trace file for job
> ------------------------------------------
> 08/26/2010 08:31:32 [0:14441]: now running with uid=0, euid=0
> 08/26/2010 08:31:32 [0:14441]: start qlogin
> 08/26/2010 08:31:32 [0:14441]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1, /usr/sbin/sshd -i);
> 08/26/2010 08:31:32 [0:14441]: uid = 0, euid = 0, gid = 0, egid = 0
> 08/26/2010 08:31:32 [0:14441]: using sfd 0
> 08/26/2010 08:31:32 [0:14441]: bound to port 39897
> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - data = 0:39897:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1:norbert
> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - address = n2.cluster:56747
> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - host = n2.cluster, port = 56747
> 08/26/2010 08:31:32 [0:14440]: parent: forked "job" with pid 14441
> 08/26/2010 08:31:32 [0:14440]: parent: job-pid: 14441
> 08/26/2010 08:32:17 [0:14440]: wait3 returned -1
> 
> CASE 2: From worker node to worker node (SUCCESSFUL)
> ====================================================
> 
> "qrsh  -l h=n0  hostname"
> 
> Example ps output on worker node (n0)
> -------------------------------------
> 3875 ?        Sl     0:11 /opt/gridengine/bin/lx26-amd64/sge_execd
> 4083 ?        S      0:00  \_ sge_shepherd-483 -bg
> 4084 ?        Ss     0:00      \_ sshd: deadline [priv]
> 4086 ?        S      0:00          \_ sshd: deadline@notty
> 4087 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool/sge_execd/n0/active_jobs/483.1
> 4100 ?        S      0:00                  \_ sleep 500
> 
> Trace from qrsh on node (dl=3)
> ==============================
>    18   6640         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>    19   6640         main     ============================================
>    20   6640         main     random polling set to 3
>    21   6640         main     accepted client connection, fd = 3
>    22   6640         main     qlogin_starter sent: 0:58576:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/509.1:n0
>    23   6640         main     accepted client connection, fd = 3
>    24   6640         main     exit_status = 0
> 
> Relevant parts of execd trace file for job on worker node
> ---------------------------------------------------------
> 08/26/2010 08:39:57 [0:4355]: now running with uid=0, euid=0
> 08/26/2010 08:39:57 [0:4355]: start qlogin
> 08/26/2010 08:39:57 [0:4355]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1, /usr/sbin/sshd -i);
> 08/26/2010 08:39:57 [0:4355]: uid = 0, euid = 0, gid = 0, egid = 0
> 08/26/2010 08:39:57 [0:4355]: using sfd 0
> 08/26/2010 08:39:57 [0:4355]: bound to port 48105
> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - data = 0:48105:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1:n0
> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - address = n2.cluster:54407
> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - host = n2.cluster, port = 54407
> 08/26/2010 08:39:57 [0:4355]: waiting for connection.
> 08/26/2010 08:39:57 [0:4355]: accepted connection on fd 1
> 08/26/2010 08:39:57 [0:4355]: daemon to start: |/usr/sbin/sshd -i|
> 08/26/2010 08:39:57 [0:4354]: wait3 returned 4355 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 08/26/2010 08:39:57 [0:4354]: job exited with exit status 0
> 08/26/2010 08:39:57 [0:4354]: job exited with exit status 0
> 
> --
> Doug
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277191
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

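The two execd trace excerpts quoted above differ after the write_to_qrsh lines: the n0 trace goes on to "waiting for connection" and "accepted connection on fd 1", while the norbert excerpt shows neither and ends with "wait3 returned -1" 45 seconds later. Together with "got NO valid socket" on the qrsh side, that would fit norbert being unable to reach n2.cluster on the port qrsh is listening on, although the excerpts alone do not prove it. A rough sketch for pulling these two facts out of a trace file, assuming the line format shown in the excerpts. The function name and the returned keys are made up for illustration:

```python
import re

# Matches e.g. "... write_to_qrsh - address = n2.cluster:56747"
ADDRESS_RE = re.compile(r"write_to_qrsh - address = (\S+):(\d+)")


def summarize_trace(lines):
    """Report the qrsh address a trace mentions and whether a connection was accepted."""
    summary = {"qrsh_host": None, "qrsh_port": None, "accepted": False}
    for line in lines:
        match = ADDRESS_RE.search(line)
        if match:
            summary["qrsh_host"] = match.group(1)
            summary["qrsh_port"] = int(match.group(2))
        elif "accepted connection" in line:
            summary["accepted"] = True
    return summary
```

Feeding it the lines of each trace file gives the host and port that were written to qrsh and a flag saying whether the connection was ever accepted, which makes it easy to compare a failing run against a working one.
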
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277396
