[GE users] qrsh problem

deadline deadline at basement-supercomputing.com
Sat Aug 28 19:07:16 BST 2010


> Hi,
>
> Am 27.08.2010 um 22:51 schrieb deadline:
>
>>> Am 27.08.2010 um 14:54 schrieb deadline:
>>>
>>>>> Hi,
>>>>>
>>>>> Am 26.08.2010 um 20:37 schrieb deadline:
>>>>>
>>>>>> OS: Scientific Linux 5.4
>>>>>> Hardware: Intel x86_64, GigE
>>>>>> GE version 6.2u5
>>>>>>
>>>>>> Problem: I have a smallish cluster and I want to use the head node
>>>>>> to run jobs. When I run parallel jobs, the nodes will try to use the
>>>>>> head node, but they will time out. Sequential jobs run fine on all
>>>>>> nodes (because they are launched from the head node). I narrowed it
>>>>>> down using qrsh on the worker nodes.
>>>>>
>>>>> does the headnode have two network interfaces?
>>>>
>>>> Yes
>>>
>>> Then you will just need one entry in the host_aliases file for the
>>> headnode, per the explanation LaoTsao pointed to. The SGE daemons
>>> should then operate solely on the internal interface, to which the
>>> nodes are also connected.
>>>
>>> There are tools in $SGE_ROOT/utilbin/$ARC/ (`gethostbyname`,
>>> `gethostbyaddr`, ...) to check the result with the internal name.
>>> The headnode should be accessible by its internal name from the nodes,
>>> and that name should also be the entry in
>>> $SGE_ROOT/default/common/act_qmaster.
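For reference, each line of $SGE_ROOT/default/common/host_aliases lists the name SGE should use first, followed by the aliases that map to it; a hypothetical entry for a dual-homed headnode (the external name here is invented for illustration) could look like:

```
# host_aliases: first column = the unique name SGE uses,
# remaining columns = aliases that resolve to it
norbert  norbert.cluster  norbert-ext.example.com
```

The daemons read this file at startup, so the qmaster and execds would need a restart after changing it.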
>>
>> I tried the host_aliases file and it did not seem to change anything.
>> The job still hangs as described. I'll keep looking for some hints.
>> Any other suggestions?
>>
>> Looking at the code, it seems that everything
>> is fine until this point in qlogin_starter.c:
>>
>>  529    if (write_to_qrsh(buffer) != 0) {
>>
>> I get the correct trace information from write_to_qrsh
>> (no errors) but then it never makes it to
>>
>>  537    shepherd_trace("waiting for connection.");
>>
>> but tells me
>>
>>  parent: forked "job" with pid 14441
>>
>> which is coming from start_child() in  shepherd.c
>>
>>  1126    shepherd_trace("parent: forked \"%s\" with pid %d", childname, pid);
>>
>> See below. The only thing I can think of is that it is talking
>> to the wrong daemon or starting the wrong daemon?
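For orientation, the payload that write_to_qrsh hands back to the waiting qrsh client is the colon-separated string visible in the traces below; a small sketch of its fields (the field names are my guesses from the trace output, not from SGE documentation):

```python
# Parse the write_to_qrsh payload from the CASE 1 trace.
# Field meanings are inferred from the trace, not from SGE docs.
payload = ("0:39897:/opt/gridengine/utilbin/lx26-amd64:"
           "/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1:"
           "norbert")

status, port, utilbin_dir, job_dir, host = payload.split(":")

# After sending this, qlogin_starter logs "waiting for connection." and
# expects a connection back on `port`; with a dual-homed head node, how
# `host` resolves on each side is exactly where that handshake can stall.
print(port, host)
```

In the failing case the trace stops right after this payload is written, which is consistent with the reverse connection never arriving.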
>
> then the next step could be to fall back to the -builtin- communication
> method (as I see "sshd" below). Is this also not working?

I assume you mean rsh? That was my thought as well.
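For completeness, 6.2u5 also ships a builtin interactive-job mechanism that needs no external rshd/sshd at all; switching the global configuration over for a test (via `qconf -mconf`) would look something like this (a sketch, assuming the defaults are currently set to ssh):

```
# global cluster configuration -- hypothetical test fallback to builtin
qlogin_command   builtin
qlogin_daemon    builtin
rlogin_command   builtin
rlogin_daemon    builtin
rsh_command      builtin
rsh_daemon       builtin
```

If qrsh to the headnode works with builtin but not with sshd, that would isolate the problem to the ssh leg of the handshake.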

--
Doug


>
> -- Reuti
>
>
>> --
>> Doug
>>
>>
>>
>>>
>>> -- Reuti
>>>
>>>
>>>> --
>>>> Doug
>>>>
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> "norbert" is the head node, running sge_qmaster and sge_execd;
>>>>>> the worker nodes are "n0" and "n2".
>>>>>>
>>>>>> From norbert I can run "qrsh -l h=X hostname" where X is any node,
>>>>>> and I can do the same from any worker node to any worker node. But I
>>>>>> cannot run jobs like "qrsh -l h=norbert hostname" from a worker node
>>>>>> (it does work from norbert to norbert, however).
>>>>>>
>>>>>> The firewall is off, I checked permissions, all nodes
>>>>>> are exec and submit hosts, and norbert is the admin host.
>>>>>> ssh is used for remote login.
>>>>>> I spent time going through debug traces and logs
>>>>>> and cannot find anything obvious.
>>>>>>
>>>>>> Does anyone have any ideas on what to check next?
>>>>>> I am wondering if the head node has some security
>>>>>> feature preventing qrsh from working, but I cannot
>>>>>> think of anything.
>>>>>>
>>>>>> Here are some data:
>>>>>>
>>>>>> CASE 1: From worker node to head node (UNSUCCESSFUL)
>>>>>> ====================================================
>>>>>>
>>>>>> "qrsh  -l h=norbert  hostname"
>>>>>>
>>>>>> Example ps output on head node (norbert)
>>>>>> ----------------------------------------
>>>>>> 14231 ?        Sl     0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>>>> 14257 ?        S      0:00  \_ sge_shepherd-490 -bg
>>>>>> 14258 ?        Ss     0:00      \_ sge_shepherd-490 -bg
>>>>>>
>>>>>> Trace from qrsh on node (dl=3)
>>>>>> ==============================
>>>>>>  18   6647         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>>>>>>  19   6647         main     ============================================
>>>>>>  20   6647         main     random polling set to 3
>>>>>>  21   6647         main     ---- got NO valid socket! ----
>>>>>>  22   6647         main     sge_set_auth_info: username(uid) = deadline(500), groupname = deadline(500)
>>>>>>
>>>>>>
>>>>>> Relevant parts of execd trace file for job
>>>>>> ------------------------------------------
>>>>>> 08/26/2010 08:31:32 [0:14441]: now running with uid=0, euid=0
>>>>>> 08/26/2010 08:31:32 [0:14441]: start qlogin
>>>>>> 08/26/2010 08:31:32 [0:14441]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1, /usr/sbin/sshd -i);
>>>>>> 08/26/2010 08:31:32 [0:14441]: uid = 0, euid = 0, gid = 0, egid = 0
>>>>>> 08/26/2010 08:31:32 [0:14441]: using sfd 0
>>>>>> 08/26/2010 08:31:32 [0:14441]: bound to port 39897
>>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - data = 0:39897:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1:norbert
>>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - address = n2.cluster:56747
>>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - host = n2.cluster, port = 56747
>>>>>> 08/26/2010 08:31:32 [0:14440]: parent: forked "job" with pid 14441
>>>>>> 08/26/2010 08:31:32 [0:14440]: parent: job-pid: 14441
>>>>>> 08/26/2010 08:32:17 [0:14440]: wait3 returned -1
>>>>>>
>>>>>> CASE 2: From worker node to worker node (SUCCESSFUL)
>>>>>> ====================================================
>>>>>>
>>>>>> "qrsh  -l h=n0  hostname"
>>>>>>
>>>>>> Example ps output on worker node (n0)
>>>>>> -------------------------------------
>>>>>> 3875 ?        Sl     0:11 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>>>> 4083 ?        S      0:00  \_ sge_shepherd-483 -bg
>>>>>> 4084 ?        Ss     0:00      \_ sshd: deadline [priv]
>>>>>> 4086 ?        S      0:00          \_ sshd: deadline@notty
>>>>>> 4087 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool/sge_execd/n0/active_jobs/483.1
>>>>>> 4100 ?        S      0:00                  \_ sleep 500
>>>>>>
>>>>>> Trace from qrsh on node (dl=3)
>>>>>> ==============================
>>>>>>  18   6640         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>>>>>>  19   6640         main     ============================================
>>>>>>  20   6640         main     random polling set to 3
>>>>>>  21   6640         main     accepted client connection, fd = 3
>>>>>>  22   6640         main     qlogin_starter sent: 0:58576:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/509.1:n0
>>>>>>  23   6640         main     accepted client connection, fd = 3
>>>>>>  24   6640         main     exit_status = 0
>>>>>>
>>>>>> Relevant parts of execd trace file for job on worker node
>>>>>> ---------------------------------------------------------
>>>>>> 08/26/2010 08:39:57 [0:4355]: now running with uid=0, euid=0
>>>>>> 08/26/2010 08:39:57 [0:4355]: start qlogin
>>>>>> 08/26/2010 08:39:57 [0:4355]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1, /usr/sbin/sshd -i);
>>>>>> 08/26/2010 08:39:57 [0:4355]: uid = 0, euid = 0, gid = 0, egid = 0
>>>>>> 08/26/2010 08:39:57 [0:4355]: using sfd 0
>>>>>> 08/26/2010 08:39:57 [0:4355]: bound to port 48105
>>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - data = 0:48105:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1:n0
>>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - address = n2.cluster:54407
>>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - host = n2.cluster, port = 54407
>>>>>> 08/26/2010 08:39:57 [0:4355]: waiting for connection.
>>>>>> 08/26/2010 08:39:57 [0:4355]: accepted connection on fd 1
>>>>>> 08/26/2010 08:39:57 [0:4355]: daemon to start: |/usr/sbin/sshd -i|
>>>>>> 08/26/2010 08:39:57 [0:4354]: wait3 returned 4355 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
>>>>>> 08/26/2010 08:39:57 [0:4354]: job exited with exit status 0
>>>>>>
>>>>>> --
>>>>>> Doug
>>>>>>
>>>>>> ------------------------------------------------------
>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277191
>>>>>>
>>>>>> To unsubscribe from this discussion, e-mail:
>>>>>> [users-unsubscribe at gridengine.sunsource.net].
>>>>>
>>>>>
>>>>> --
>>>>> This message has been scanned for viruses and
>>>>> dangerous content by MailScanner, and is
>>>>> believed to be clean.
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Doug
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Doug
>>
>>
>>
>>
>>
>
>
>
>


--
Doug


------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277740

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


