[GE users] qrsh problem

reuti reuti at staff.uni-marburg.de
Sat Aug 28 19:53:27 BST 2010


On 28.08.2010 at 20:07, deadline wrote:

>>> <snip>
>>> See below. The only thing I can think of is that it is talking
>>> to the wrong daemon or starting the wrong daemon.
>> 
>> Then the next step could be to fall back to the -builtin- communication
>> method (as I see "sshd" below). Does this also not work?
> 
> I assume you mean rsh? That was my thought as well.

No. SGE also has a -builtin- startup method. Just put a plain "builtin" (without the quotes) in all six *_command and *_daemon entries (`man sge_conf`). Unless you need X11 forwarding, there is no need to use SSH. It is also possible to use SSH for the rlogin_* and qlogin_* entries while startup between nodes (the rsh_* entries) uses the "builtin" method.
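
As a rough sketch (typed from memory, not copied from a real cluster), the six entries in the global configuration could look like this once they are all set to builtin. The entry names are the ones listed in `man sge_conf`; the configuration can be inspected with `qconf -sconf` and edited with `qconf -mconf`:

```
qlogin_command   builtin
qlogin_daemon    builtin
rlogin_command   builtin
rlogin_daemon    builtin
rsh_command      builtin
rsh_daemon       builtin
```

For the mixed setup, only rsh_command and rsh_daemon would be set to builtin, and the qlogin_* and rlogin_* entries would keep their SSH settings.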

If you want to fall back to true `rsh`, that is not covered in the man page, but it is described here:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=2757

-- Reuti


>>>>>>> "norbert" is the head node running sge_qmaster and sge_execd;
>>>>>>> the worker nodes are "n0" and "n2".
>>>>>>> 
>>>>>>> From norbert I can run "qrsh -l h=X hostname" where X is any node. I can
>>>>>>> do the same from any worker node to any worker node.
>>>>>>> But I cannot run jobs like "qrsh -l h=norbert hostname" from a
>>>>>>> worker node (it does work norbert -> norbert, however).
>>>>>>> 
>>>>>>> Firewall is off, I checked permissions, all nodes
>>>>>>> are exec and submit hosts, norbert is admin host.
>>>>>>> ssh is used for remote login.
>>>>>>> I spent time going through debug traces and logs
>>>>>>> and cannot find anything obvious.
>>>>>>> 
>>>>>>> Does anyone have any ideas on what to check next?
>>>>>>> I am wondering if the host node has some security
>>>>>>> feature preventing qrsh from working, but I cannot
>>>>>>> think of anything.
>>>>>>> 
>>>>>>> Here are some data:
>>>>>>> 
>>>>>>> CASE 1: From worker node to head node (UNSUCCESSFUL)
>>>>>>> ====================================================
>>>>>>> 
>>>>>>> "qrsh  -l h=norbert  hostname"
>>>>>>> 
>>>>>>> Example ps output on head node (norbert)
>>>>>>> ----------------------------------------
>>>>>>> 14231 ?        Sl     0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>>>>> 14257 ?        S      0:00  \_ sge_shepherd-490 -bg
>>>>>>> 14258 ?        Ss     0:00      \_ sge_shepherd-490 -bg
>>>>>>> 
>>>>>>> Trace from qrsh on node (dl=3)
>>>>>>> ==============================
>>>>>>> 18   6647         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>>>>>>> 19   6647         main     ============================================
>>>>>>> 20   6647         main     random polling set to 3
>>>>>>> 21   6647         main     ---- got NO valid socket! ----
>>>>>>> 22   6647         main     sge_set_auth_info: username(uid) = deadline(500), groupname = deadline(500)
>>>>>>> 
>>>>>>> 
>>>>>>> Relevant parts of execd trace file for job
>>>>>>> ------------------------------------------
>>>>>>> 08/26/2010 08:31:32 [0:14441]: now running with uid=0, euid=0
>>>>>>> 08/26/2010 08:31:32 [0:14441]: start qlogin
>>>>>>> 08/26/2010 08:31:32 [0:14441]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1, /usr/sbin/sshd -i);
>>>>>>> 08/26/2010 08:31:32 [0:14441]: uid = 0, euid = 0, gid = 0, egid = 0
>>>>>>> 08/26/2010 08:31:32 [0:14441]: using sfd 0
>>>>>>> 08/26/2010 08:31:32 [0:14441]: bound to port 39897
>>>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - data = 0:39897:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1:norbert
>>>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - address = n2.cluster:56747
>>>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - host = n2.cluster, port = 56747
>>>>>>> 08/26/2010 08:31:32 [0:14440]: parent: forked "job" with pid 14441
>>>>>>> 08/26/2010 08:31:32 [0:14440]: parent: job-pid: 14441
>>>>>>> 08/26/2010 08:32:17 [0:14440]: wait3 returned -1
>>>>>>> 
>>>>>>> CASE 2: From worker node to worker node (SUCCESSFUL)
>>>>>>> ====================================================
>>>>>>> 
>>>>>>> "qrsh  -l h=n0  hostname"
>>>>>>> 
>>>>>>> Example ps output on worker node (n0)
>>>>>>> -----------------------------
>>>>>>> 3875 ?        Sl     0:11 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>>>>> 4083 ?        S      0:00  \_ sge_shepherd-483 -bg
>>>>>>> 4084 ?        Ss     0:00      \_ sshd: deadline [priv]
>>>>>>> 4086 ?        S      0:00          \_ sshd: deadline@notty
>>>>>>> 4087 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool/sge_execd/n0/active_jobs/483.1
>>>>>>> 4100 ?        S      0:00                  \_ sleep 500
>>>>>>> 
>>>>>>> Trace from qrsh on node (dl=3)
>>>>>>> ==============================
>>>>>>> 18   6640         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>>>>>>> 19   6640         main     ============================================
>>>>>>> 20   6640         main     random polling set to 3
>>>>>>> 21   6640         main     accepted client connection, fd = 3
>>>>>>> 22   6640         main     qlogin_starter sent: 0:58576:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/509.1:n0
>>>>>>> 23   6640         main     accepted client connection, fd = 3
>>>>>>> 24   6640         main     exit_status = 0
>>>>>>> 
>>>>>>> Relevant parts of execd trace file for job on worker node
>>>>>>> ---------------------------------------------------------
>>>>>>> 08/26/2010 08:39:57 [0:4355]: now running with uid=0, euid=0
>>>>>>> 08/26/2010 08:39:57 [0:4355]: start qlogin
>>>>>>> 08/26/2010 08:39:57 [0:4355]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1, /usr/sbin/sshd -i);
>>>>>>> 08/26/2010 08:39:57 [0:4355]: uid = 0, euid = 0, gid = 0, egid = 0
>>>>>>> 08/26/2010 08:39:57 [0:4355]: using sfd 0
>>>>>>> 08/26/2010 08:39:57 [0:4355]: bound to port 48105
>>>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - data = 0:48105:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1:n0
>>>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - address = n2.cluster:54407
>>>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - host = n2.cluster, port = 54407
>>>>>>> 08/26/2010 08:39:57 [0:4355]: waiting for connection.
>>>>>>> 08/26/2010 08:39:57 [0:4355]: accepted connection on fd 1
>>>>>>> 08/26/2010 08:39:57 [0:4355]: daemon to start: |/usr/sbin/sshd -i|
>>>>>>> 08/26/2010 08:39:57 [0:4354]: wait3 returned 4355 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
>>>>>>> 08/26/2010 08:39:57 [0:4354]: job exited with exit status 0
>>>>>>> 
>>>>>>> --
>>>>>>> Doug
>>>>>>> 
>>>>>>> ------------------------------------------------------
>>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277191
>>>>>>> 
>>>>>>> To unsubscribe from this discussion, e-mail:
>>>>>>> [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277752

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


