[GE users] qrsh problem

laotsao laotsao at gmail.com
Fri Aug 27 13:41:14 BST 2010


Assuming that the head node has two NICs, please check the howto
on multi-homed setups:
http://gridengine.sunsource.net/howto/multi_intrfcs.html
Regards
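
As a rough sketch of what to look for (this is not taken from the howto, and the function names below are made up for illustration): on a multi-homed host, a name can resolve to an address whose reverse lookup returns a different name. As far as I understand, that is the kind of inconsistency the howto addresses. The Python below uses only the standard socket module. It lists each IPv4 address of a name with its reverse-lookup name and flags the ones whose short name differs. It only looks at the system resolver and does not read any Grid Engine configuration.

```python
import socket


def forward_reverse(hostname,
                    lookup=socket.gethostbyname_ex,
                    reverse=socket.gethostbyaddr):
    """Return (address, reverse_name) pairs for every IPv4 address of hostname.

    lookup and reverse are injectable so the logic can be exercised
    without touching the real resolver.
    """
    _, _, addresses = lookup(hostname)
    pairs = []
    for addr in addresses:
        try:
            name = reverse(addr)[0]
        except OSError:
            # No reverse entry for this address.
            name = None
        pairs.append((addr, name))
    return pairs


def mismatches(hostname, pairs):
    """Return the pairs whose reverse name has a different short name."""
    short = hostname.split(".")[0]
    return [(addr, name) for addr, name in pairs
            if name is None or name.split(".")[0] != short]
```

Running mismatches("norbert", forward_reverse("norbert")) on norbert and on each worker, and the same for "n2.cluster" on norbert, would show whether the nodes agree about each other's names. An empty list means forward and reverse lookups agree for that name.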


On 8/27/2010 7:13 AM, reuti wrote:
> Hi,
>
> On 26.08.2010 at 20:37, deadline wrote:
>
>> OS: Scientific Linux 5.4
>> Hardware: Intel x86_64, GigE
>> GE version 6.2u5
>>
>> Problem: I have a smallish cluster and I want to use the head node to run
>> jobs. When I run parallel jobs, the nodes try to use the head node, but
>> they time out. Sequential jobs run fine on all nodes (because they are
>> launched from the head node). I narrowed it down using qrsh on the worker
>> nodes.
> Does the head node have two network interfaces?
>
> -- Reuti
>
>
>> "norbert" is head node running sge_qmaster and sge_execd
>> worker nodes are "n0" and "n2"
>>
>> From norbert I can run "qrsh -l h=X hostname" where X is any node.
>> I can do the same from any worker node to any worker node.
>> But I cannot run jobs like "qrsh -l h=norbert hostname" from
>> a worker node (it does work norbert -> norbert, however).
>>
>> Firewall is off, I checked permissions, all nodes
>> are exec and submit hosts, norbert is admin host.
>> ssh is used for remote login.
>> I spent time going through debug traces and logs
>> and cannot find anything obvious.
>>
>> Does anyone have any ideas on what to check next?
>> I am wondering if the head node has some security
>> feature preventing qrsh from working, but I cannot
>> think of anything.
>>
>> Here are some data:
>>
>> CASE 1: From worker node to head node (UNSUCCESSFUL)
>> ====================================================
>>
>> "qrsh  -l h=norbert  hostname"
>>
>> Example ps output on head node (norbert)
>> ----------------------------------------
>> 14231 ?        Sl     0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
>> 14257 ?        S      0:00  \_ sge_shepherd-490 -bg
>> 14258 ?        Ss     0:00      \_ sge_shepherd-490 -bg
>>
>> Trace from qrsh on node (dl=3)
>> ==============================
>>     18   6647         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>>     19   6647         main     ============================================
>>     20   6647         main     random polling set to 3
>>     21   6647         main     ---- got NO valid socket! ----
>>     22   6647         main     sge_set_auth_info: username(uid) = deadline(500), groupname = deadline(500)
>>
>>
>> Relevant parts of execd trace file for job
>> ------------------------------------------
>> 08/26/2010 08:31:32 [0:14441]: now running with uid=0, euid=0
>> 08/26/2010 08:31:32 [0:14441]: start qlogin
>> 08/26/2010 08:31:32 [0:14441]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1, /usr/sbin/sshd -i);
>> 08/26/2010 08:31:32 [0:14441]: uid = 0, euid = 0, gid = 0, egid = 0
>> 08/26/2010 08:31:32 [0:14441]: using sfd 0
>> 08/26/2010 08:31:32 [0:14441]: bound to port 39897
>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - data = 0:39897:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1:norbert
>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - address = n2.cluster:56747
>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - host = n2.cluster, port = 56747
>> 08/26/2010 08:31:32 [0:14440]: parent: forked "job" with pid 14441
>> 08/26/2010 08:31:32 [0:14440]: parent: job-pid: 14441
>> 08/26/2010 08:32:17 [0:14440]: wait3 returned -1
>>
>> CASE 2: From worker node to worker node (SUCCESSFUL)
>> ====================================================
>>
>> "qrsh  -l h=n0  hostname"
>>
>> Example ps output on worker node (n0)
>> -----------------------------
>> 3875 ?        Sl     0:11 /opt/gridengine/bin/lx26-amd64/sge_execd
>> 4083 ?        S      0:00  \_ sge_shepherd-483 -bg
>> 4084 ?        Ss     0:00      \_ sshd: deadline [priv]
>> 4086 ?        S      0:00          \_ sshd: deadline@notty
>> 4087 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool/sge_execd/n0/active_jobs/483.1
>> 4100 ?        S      0:00                  \_ sleep 500
>>
>> Trace from qrsh on node (dl=3)
>> ==============================
>>     18   6640         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>>     19   6640         main     ============================================
>>     20   6640         main     random polling set to 3
>>     21   6640         main     accepted client connection, fd = 3
>>     22   6640         main     qlogin_starter sent: 0:58576:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/509.1:n0
>>     23   6640         main     accepted client connection, fd = 3
>>     24   6640         main     exit_status = 0
>>
>> Relevant parts of execd trace file for job on worker node
>> ---------------------------------------------------------
>> 08/26/2010 08:39:57 [0:4355]: now running with uid=0, euid=0
>> 08/26/2010 08:39:57 [0:4355]: start qlogin
>> 08/26/2010 08:39:57 [0:4355]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1, /usr/sbin/sshd -i);
>> 08/26/2010 08:39:57 [0:4355]: uid = 0, euid = 0, gid = 0, egid = 0
>> 08/26/2010 08:39:57 [0:4355]: using sfd 0
>> 08/26/2010 08:39:57 [0:4355]: bound to port 48105
>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - data = 0:48105:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1:n0
>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - address = n2.cluster:54407
>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - host = n2.cluster, port = 54407
>> 08/26/2010 08:39:57 [0:4355]: waiting for connection.
>> 08/26/2010 08:39:57 [0:4355]: accepted connection on fd 1
>> 08/26/2010 08:39:57 [0:4355]: daemon to start: |/usr/sbin/sshd -i|
>> 08/26/2010 08:39:57 [0:4354]: wait3 returned 4355 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
>> 08/26/2010 08:39:57 [0:4354]: job exited with exit status 0
>>
>> --
>> Doug
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277191
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277396
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277410

