[GE users] qrsh problem

deadline deadline at basement-supercomputing.com
Fri Aug 27 21:51:55 BST 2010


> On 27.08.2010 at 14:54, deadline wrote:
>
>>> Hi,
>>>
>>> On 26.08.2010 at 20:37, deadline wrote:
>>>
>>>> OS: Scientific Linux 5.4
>>>> Hardware: Intel x86_64, GigE
>>>> GE version 6.2u5
>>>>
>>>> Problem: I have a smallish cluster and I want to use the head node to
>>>> run jobs. When I run parallel jobs, the nodes
>>>> will try to use the head node, but they will time out.
>>>> Sequential jobs run fine on all nodes (because
>>>> they are launched from the head node).
>>>> I narrowed it down using qrsh on the worker
>>>> nodes.
>>>
>>> does the headnode have two network interfaces?
>>
>> Yes
>
> Then you will just need one entry in the host_aliases file for the
> headnode, as described in the explanation LaoTsao pointed to. The SGE
> daemons should operate solely on the internal interface, to which the
> nodes are also connected.
>
> There are tools in $SGE_ROOT/utilbin/$ARC/: `gethostbyname`,
> `gethostbyaddr`, ... to check the result with the internal name then.
> The headnode should be accessible by its internal name from the nodes,
> and that name should also be the entry in
> $SGE_ROOT/default/common/act_qmaster.
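
To make the host_aliases suggestion concrete, here is a minimal Python sketch of how such a file maps several names to one unique name. It assumes the format as I understand it from host_aliases(5): one host per line, names separated by blanks, commas or semicolons, with the first name being the unique one. The function names and the external name "norbert.example.com" are hypothetical, and this is not how SGE itself implements the lookup.

```python
import re


def parse_host_aliases(text):
    """Map every name on a line to the first (unique) name on that line.

    Assumed format: one host per line; names separated by blanks,
    commas or semicolons; the first name is the unique host name.
    """
    mapping = {}
    for line in text.splitlines():
        names = [n for n in re.split(r"[\s,;]+", line.strip()) if n]
        if not names:
            continue  # skip empty lines
        unique = names[0]
        for name in names:
            # Host names are compared case-insensitively in this sketch.
            mapping[name.lower()] = unique
    return mapping


def unique_name(mapping, host):
    """Return the unique name for host, or host itself if it has no alias."""
    return mapping.get(host.lower(), host)


if __name__ == "__main__":
    # Hypothetical file content: internal name first, external name as alias.
    example = "norbert norbert.example.com\n"
    aliases = parse_host_aliases(example)
    print(unique_name(aliases, "norbert.example.com"))
    print(unique_name(aliases, "n2.cluster"))
```

With a single line for the head node, both of its names collapse to the internal one, while names that are not listed pass through unchanged.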

I tried the host_aliases file and it did not seem to change anything. The job
still hangs as described. I'll keep looking for
some hints. Any other suggestions?

In looking at the code, it seems that everything
is fine until qlogin_starter.c

  529    if (write_to_qrsh(buffer) != 0) {

I get the correct trace information from write_to_qrsh
(no errors) but then it never makes it to

  537    shepherd_trace("waiting for connection.");

but tells me

  parent: forked "job" with pid 14441

which is coming from start_child() in shepherd.c

  1126    shepherd_trace("parent: forked \"%s\" with pid %d", childname, pid);

See below. The only thing I can think of is that it is talking
to the wrong daemon or starting the wrong daemon.
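
One way to test the "wrong daemon" idea outside of SGE is to check which address a name such as n2.cluster actually leads to when the head node opens a TCP connection to it. The following is a minimal Python sketch of such a probe, not SGE code. The function name probe() is hypothetical, and because qrsh listens on a short-lived port, a listener of your own would have to be started on the node first.

```python
import socket


def probe(host, port, timeout=5.0):
    """Try a TCP connection and report which peer was actually reached.

    Returns (ip, port) of the remote end on success, or None if the
    name does not resolve, the connection is refused, or it times out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            return sock.getpeername()[:2]
    except OSError:
        return None


if __name__ == "__main__":
    # Loopback demonstration: start a listener, then probe it.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    print(probe("127.0.0.1", port))
    listener.close()
```

Run from norbert against a listener on a worker node, the returned IP shows whether the node's name leads to the interface you expect. A None result points to a resolution or routing problem rather than to the shepherd itself.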

--
Doug



>
> -- Reuti
>
>
>> --
>> Doug
>>
>>>
>>> -- Reuti
>>>
>>>
>>>> "norbert" is the head node, running sge_qmaster and sge_execd;
>>>> the worker nodes are "n0" and "n2".
>>>>
>>>> From norbert I can "qrsh -l h=X hostname" where X is any node. I can
>>>> do the same from any worker node to any worker node.
>>>> But I cannot run jobs like "qrsh -l h=norbert hostname" from a
>>>> worker node (it does work norbert -> norbert, however).
>>>>
>>>> Firewall is off, I checked permissions, all nodes
>>>> are exec and submit hosts, norbert is admin host.
>>>> ssh is used for remote login.
>>>> I spent time going through debug traces and logs
>>>> and cannot find anything obvious.
>>>>
>>>> Does anyone have any ideas on what to check next?
>>>> I am wondering if the host node has some security
>>>> feature preventing qrsh from working, but I cannot
>>>> think of anything.
>>>>
>>>> Here are some data:
>>>>
>>>> CASE 1: From worker node to head node (UNSUCCESSFUL)
>>>> ====================================================
>>>>
>>>> "qrsh  -l h=norbert  hostname"
>>>>
>>>> Example ps output on head node (norbert)
>>>> ----------------------------------------
>>>> 14231 ?        Sl     0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>> 14257 ?        S      0:00  \_ sge_shepherd-490 -bg
>>>> 14258 ?        Ss     0:00      \_ sge_shepherd-490 -bg
>>>>
>>>> Trace from qrsh on node (dl=3)
>>>> ------------------------------
>>>>   18   6647         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>>>>   19   6647         main     ============================================
>>>>   20   6647         main     random polling set to 3
>>>>   21   6647         main     ---- got NO valid socket! ----
>>>>   22   6647         main     sge_set_auth_info: username(uid) = deadline(500), groupname = deadline(500)
>>>>
>>>>
>>>> Relevant parts of execd trace file for job
>>>> ------------------------------------------
>>>> 08/26/2010 08:31:32 [0:14441]: now running with uid=0, euid=0
>>>> 08/26/2010 08:31:32 [0:14441]: start qlogin
>>>> 08/26/2010 08:31:32 [0:14441]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1, /usr/sbin/sshd -i);
>>>> 08/26/2010 08:31:32 [0:14441]: uid = 0, euid = 0, gid = 0, egid = 0
>>>> 08/26/2010 08:31:32 [0:14441]: using sfd 0
>>>> 08/26/2010 08:31:32 [0:14441]: bound to port 39897
>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - data = 0:39897:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1:norbert
>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - address = n2.cluster:56747
>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - host = n2.cluster, port = 56747
>>>> 08/26/2010 08:31:32 [0:14440]: parent: forked "job" with pid 14441
>>>> 08/26/2010 08:31:32 [0:14440]: parent: job-pid: 14441
>>>> 08/26/2010 08:32:17 [0:14440]: wait3 returned -1
>>>> CASE 2: From worker node to worker node (SUCCESSFUL)
>>>> ====================================================
>>>>
>>>> "qrsh  -l h=n0  hostname"
>>>>
>>>> Example ps output on worker node (n0)
>>>> -----------------------------
>>>> 3875 ?        Sl     0:11 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>> 4083 ?        S      0:00  \_ sge_shepherd-483 -bg
>>>> 4084 ?        Ss     0:00      \_ sshd: deadline [priv]
>>>> 4086 ?        S      0:00          \_ sshd: deadline@notty
>>>> 4087 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool/sge_execd/n0/active_jobs/483.1
>>>> 4100 ?        S      0:00                  \_ sleep 500
>>>>
>>>> Trace from qrsh on node (dl=3)
>>>> ------------------------------
>>>>   18   6640         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>>>>   19   6640         main     ============================================
>>>>   20   6640         main     random polling set to 3
>>>>   21   6640         main     accepted client connection, fd = 3
>>>>   22   6640         main     qlogin_starter sent: 0:58576:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/509.1:n0
>>>>   23   6640         main     accepted client connection, fd = 3
>>>>   24   6640         main     exit_status = 0
>>>>
>>>> Relevant parts of execd trace file for job on worker node
>>>> ---------------------------------------------------------
>>>> 08/26/2010 08:39:57 [0:4355]: now running with uid=0, euid=0
>>>> 08/26/2010 08:39:57 [0:4355]: start qlogin
>>>> 08/26/2010 08:39:57 [0:4355]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1, /usr/sbin/sshd -i);
>>>> 08/26/2010 08:39:57 [0:4355]: uid = 0, euid = 0, gid = 0, egid = 0
>>>> 08/26/2010 08:39:57 [0:4355]: using sfd 0
>>>> 08/26/2010 08:39:57 [0:4355]: bound to port 48105
>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - data = 0:48105:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1:n0
>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - address = n2.cluster:54407
>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - host = n2.cluster, port = 54407
>>>> 08/26/2010 08:39:57 [0:4355]: waiting for connection.
>>>> 08/26/2010 08:39:57 [0:4355]: accepted connection on fd 1
>>>> 08/26/2010 08:39:57 [0:4355]: daemon to start: |/usr/sbin/sshd -i|
>>>> 08/26/2010 08:39:57 [0:4354]: wait3 returned 4355 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
>>>> 08/26/2010 08:39:57 [0:4354]: job exited with exit status 0
>>>>
>>>> --
>>>> Doug
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277191
>>>>
>>>> To unsubscribe from this discussion, e-mail:
>>>> [users-unsubscribe at gridengine.sunsource.net].


--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277537

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


