[GE users] qrsh problem

reuti reuti at staff.uni-marburg.de
Fri Aug 27 21:58:56 BST 2010


Hi,

Am 27.08.2010 um 22:51 schrieb deadline:

>> Am 27.08.2010 um 14:54 schrieb deadline:
>> 
>>>> Hi,
>>>> 
>>>> Am 26.08.2010 um 20:37 schrieb deadline:
>>>> 
>>>>> OS: Scientific Linux 5.4
>>>>> Hardware: Intel x86_64, GigE
>>>>> GE version 6.2u5
>>>>> 
>>>>> Problem: I have a smallish cluster and I want to use the head
>>>>> node to run jobs. When I run parallel jobs, the nodes
>>>>> will try to use the head node, but they will time out.
>>>>> Sequential jobs run fine on all nodes (because
>>>>> they are launched from the head node).
>>>>> I narrowed it down using qrsh on the worker
>>>>> nodes.
>>>> 
>>>> does the headnode have two network interfaces?
>>> 
>>> Yes
>> 
>> Then you will just need one entry in the host_aliases file for the
>> headnode, to which explanation LaoTsao pointed. The SGE daemons will
>> then believe they are working solely on the internal interface, to
>> which the nodes are also connected.
>> 
>> There are tools in $SGE_ROOT/utilbin/$ARC/ (`gethostbyname`,
>> `gethostbyaddr`, ...) to check the resolution of the internal name.
>> The headnode should be reachable from the nodes by its internal name,
>> and that name should also be the entry in $SGE_ROOT/default/common/act_qmaster.
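
For illustration, a host_aliases line for such a headnode puts the cluster-internal name first, followed by the names on the other interface (the external names here are made up):

```
norbert norbert-ext norbert-ext.example.com
```

The resolution can then be checked from a node with e.g. `$SGE_ROOT/utilbin/$ARC/gethostbyname -aname norbert`, which should print the internal name.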
> 
> I tried the host_aliases file and it did not seem to change anything.
> The job still hangs as described. I'll keep looking for
> some hints. Any other suggestions?
> 
> In looking at the code, it seems that everything
> is fine until qlogin_starter.c
> 
>  529    if (write_to_qrsh(buffer) != 0) {
> 
> I get the correct trace information from write_to_qrsh
> (no errors) but then it never makes it to
> 
>  537    shepherd_trace("waiting for connection.");
> 
> but tells me
> 
>  parent: forked "job" with pid 14441
> 
> which is coming from start_child() in  shepherd.c
> 
>  1126    shepherd_trace("parent: forked \"%s\" with pid %d", childname, pid);
> 
> See below. The only thing I can think of is that it is talking
> to the wrong daemon or starting the wrong daemon.
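
To illustrate what the traces describe: qlogin_starter binds a port for the job, reports it to the waiting qrsh client ("write_to_qrsh"), and then blocks in accept() ("waiting for connection."). In CASE 1 the report is written but the connect-back never arrives. A rough sketch of that rendezvous, with all names, ports, and payload handling purely illustrative (this is not SGE source code):

```python
# Illustrative sketch of the qrsh <-> qlogin_starter rendezvous suggested
# by the traces above; names and payload format are made up, not SGE source.
import socket
import threading

def qrsh_client():
    """Client side: listen for the starter's report, then connect back
    to the advertised job port (what a working CASE 2 does)."""
    ctrl = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    ctrl.bind(("127.0.0.1", 0))
    ctrl.listen(1)
    result = {}

    def run():
        conn, _ = ctrl.accept()              # "accepted client connection"
        info = conn.recv(1024).decode()      # "qlogin_starter sent: ..."
        conn.close()
        _, port, payload = info.split(":", 2)
        with socket.create_connection(("127.0.0.1", int(port))) as job:
            job.sendall(b"hello from qrsh")
        result["payload"] = payload

    t = threading.Thread(target=run)
    t.start()
    return ctrl.getsockname(), t, result

def qlogin_starter(qrsh_addr, payload):
    """Starter side: bind a job port, report it to the qrsh client, then
    block in accept() -- the step CASE 1 never reaches."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))               # "bound to port NNNNN"
    srv.listen(1)
    job_port = srv.getsockname()[1]
    with socket.create_connection(qrsh_addr) as c:   # "write_to_qrsh"
        c.sendall(f"0:{job_port}:{payload}".encode())
    conn, _ = srv.accept()                   # "waiting for connection."
    data = conn.recv(1024)
    conn.close()
    srv.close()
    return data

addr, t, result = qrsh_client()
data = qlogin_starter(addr, "norbert")
t.join()
```

If the connect-back step fails, the starter hangs exactly between write_to_qrsh and "waiting for connection.", which is what the CASE 1 trace shows.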

Then the next step could be to fall back to the builtin communication method (as I see "sshd" below). Is this also not working?
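
For reference, that would mean changing the remote-startup entries in the cluster configuration (`qconf -mconf`) from the ssh-based values to "builtin", assuming no per-host overrides:

```
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
```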

-- Reuti


> --
> Doug
> 
> 
> 
>> 
>> -- Reuti
>> 
>> 
>>> --
>>> Doug
>>> 
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> "norbert" is head node running sge_qmaster and sge_execd
>>>>> worker nodes are "n0" an "n2"
>>>>> 
>>>>> From norbert I can "qrsh -l h=X hostname" where X is any node. I can
>>>>> do the same from any worker node to any worker node.
>>>>> But, I cannot run jobs like "qrsh -l h=norbert hostname" from a
>>>>> worker node (it does work norbert -> norbert, however).
>>>>> 
>>>>> Firewall is off, I checked permissions, all nodes
>>>>> are exec and submit hosts, norbert is admin host.
>>>>> ssh is used for remote login.
>>>>> I spent time going through debug traces and logs
>>>>> and cannot find anything obvious.
>>>>> 
>>>>> Does anyone have any ideas on what to check next?
>>>>> I am wondering if the host node has some security
>>>>> feature preventing qrsh from working, but I cannot
>>>>> think of anything.
>>>>> 
>>>>> Here are some data:
>>>>> 
>>>>> CASE 1: From worker node to head node (UNSUCCESSFUL)
>>>>> ====================================================
>>>>> 
>>>>> "qrsh  -l h=norbert  hostname"
>>>>> 
>>>>> Example ps output on head node (norbert)
>>>>> ----------------------------------------
>>>>> 14231 ?        Sl     0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>>> 14257 ?        S      0:00  \_ sge_shepherd-490 -bg
>>>>> 14258 ?        Ss     0:00      \_ sge_shepherd-490 -bg
>>>>> 
>>>>> Trace from qrsh on node (dl=3)
>>>>> ==============================
>>>>>  18   6647         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! !
>>>>>  19   6647         main     ============================================
>>>>>  20   6647         main     random polling set to 3
>>>>>  21   6647         main     ---- got NO valid socket! ----
>>>>>  22   6647         main     sge_set_auth_info: username(uid) = deadline(500), groupname = deadline(500)
>>>>> 
>>>>> 
>>>>> Relevant parts of execd trace file for job
>>>>> ------------------------------------------
>>>>> 08/26/2010 08:31:32 [0:14441]: now running with uid=0, euid=0
>>>>> 08/26/2010 08:31:32 [0:14441]: start qlogin
>>>>> 08/26/2010 08:31:32 [0:14441]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1, /usr/sbin/sshd -i);
>>>>> 08/26/2010 08:31:32 [0:14441]: uid = 0, euid = 0, gid = 0, egid = 0
>>>>> 08/26/2010 08:31:32 [0:14441]: using sfd 0
>>>>> 08/26/2010 08:31:32 [0:14441]: bound to port 39897
>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - data = 0:39897:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1:norbert
>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - address = n2.cluster:56747
>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - host = n2.cluster, port = 56747
>>>>> 08/26/2010 08:31:32 [0:14440]: parent: forked "job" with pid 14441
>>>>> 08/26/2010 08:31:32 [0:14440]: parent: job-pid: 14441
>>>>> 08/26/2010 08:32:17 [0:14440]: wait3 returned -1
>>>>> 
>>>>> CASE 2: From worker node to worker node (SUCCESSFUL)
>>>>> ====================================================
>>>>> 
>>>>> "qrsh  -l h=n0  hostname"
>>>>> 
>>>>> Example ps output on worker node (n0)
>>>>> -----------------------------
>>>>> 3875 ?        Sl     0:11 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>>> 4083 ?        S      0:00  \_ sge_shepherd-483 -bg
>>>>> 4084 ?        Ss     0:00      \_ sshd: deadline [priv]
>>>>> 4086 ?        S      0:00          \_ sshd: deadline at notty
>>>>> 4087 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool/sge_execd/n0/active_jobs/483.1
>>>>> 4100 ?        S      0:00                  \_ sleep 500
>>>>> 
>>>>> Trace from qrsh on node (dl=3)
>>>>> ==============================
>>>>>  18   6640         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! !
>>>>>  19   6640         main     ============================================
>>>>>  20   6640         main     random polling set to 3
>>>>>  21   6640         main     accepted client connection, fd = 3
>>>>>  22   6640         main     qlogin_starter sent: 0:58576:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/509.1:n0
>>>>>  23   6640         main     accepted client connection, fd = 3
>>>>>  24   6640         main     exit_status = 0
>>>>> 
>>>>> Relevant parts of execd trace file for job on worker node
>>>>> ---------------------------------------------------------
>>>>> 08/26/2010 08:39:57 [0:4355]: now running with uid=0, euid=0
>>>>> 08/26/2010 08:39:57 [0:4355]: start qlogin
>>>>> 08/26/2010 08:39:57 [0:4355]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1, /usr/sbin/sshd -i);
>>>>> 08/26/2010 08:39:57 [0:4355]: uid = 0, euid = 0, gid = 0, egid = 0
>>>>> 08/26/2010 08:39:57 [0:4355]: using sfd 0
>>>>> 08/26/2010 08:39:57 [0:4355]: bound to port 48105
>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - data = 0:48105:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1:n0
>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - address = n2.cluster:54407
>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - host = n2.cluster, port = 54407
>>>>> 08/26/2010 08:39:57 [0:4355]: waiting for connection.
>>>>> 08/26/2010 08:39:57 [0:4355]: accepted connection on fd 1
>>>>> 08/26/2010 08:39:57 [0:4355]: daemon to start: |/usr/sbin/sshd -i|
>>>>> 08/26/2010 08:39:57 [0:4354]: wait3 returned 4355 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
>>>>> 08/26/2010 08:39:57 [0:4354]: job exited with exit status 0
>>>>> 
>>>>> --
>>>>> Doug
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> This message has been scanned for viruses and
>>>> dangerous content by MailScanner, and is
>>>> believed to be clean.
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Doug
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> -- 
> Doug
> 
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277539

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list