[GE users] qrsh problem

deadline deadline at basement-supercomputing.com
Mon Aug 30 19:11:00 BST 2010


> On 30.08.2010 at 18:59, deadline wrote:
>
>>> On 30.08.2010 at 15:09, deadline wrote:
>>>
>>>>> On 28.08.2010 at 20:07, deadline wrote:
>>>>>
>>>>>>>> <snip>
>>>>>>>> See below. The only thing I can think of is that it is talking
>>>>>>>> to the wrong daemon or starting the wrong daemon.
>>>>>>>
>>>>>>> Then the next step could be to fall back to the -builtin-
>>>>>>> communication method (as I see "sshd" below). Is this also not
>>>>>>> working?
>>>>>>
>>>>>> I assume you mean rsh? That was my thought as well.
>>>>>
>>>>> No. SGE also has a -builtin- startup method. Just put a plain
>>>>> "builtin" (without the quotes) in all 6 entries for *_command and
>>>>> *_daemon (`man sge_conf`). Unless you need X11 forwarding, there is
>>>>> no need to use SSH. It's also possible to use SSH for the rlogin_*
>>>>> and qlogin_* entries, while a startup between nodes uses the
>>>>> "builtin" method.
>>>>>
>>>>> If you want to fall back to true `rsh`, it's not in the man page but
>>>>> here:
>>>>>
>>>>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=2757
>>>>
>>>> I switched to built-in and the behavior is the same.
>>>> Several things I noticed. First, the file "addgrpid"
>>>> is not written to the active_jobs sub folder.
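
One way to double-check that the switch to builtin really took effect
is to list any of the six entries that are still set to something else.
The snippet below is only a sketch: the function name
non_builtin_entries is made up here, and it assumes the input has the
two-column "name value" layout that `qconf -sconf` prints.

```shell
# Print the name of every *_command / *_daemon entry that is NOT "builtin".
# Reads configuration text (as printed by `qconf -sconf`) on stdin.
non_builtin_entries() {
    awk '$1 ~ /^(qlogin|rlogin|rsh)_(command|daemon)$/ && $2 != "builtin" { print $1 }'
}

# Possible use on a live cluster (assumes qconf is on PATH):
#   qconf -sconf | non_builtin_entries           # global configuration
#   qconf -sconf norbert | non_builtin_entries   # host-local overrides, if any
```

No output would mean that all six entries present in the input are set
to builtin.
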
>>>
>>> Is there anything in the messages file of the qmaster or the node,
>>> i.e. $SGE_ROOT/default/spool/qmaster/messages?
>>
>> I submit the job:
>>
>> qrsh  -l h=norbert  hostname
>>
>> qstat shows:
>>
>> 584 0.50000 hostname   deadline     r     08/30/2010 12:45:53 cluster@norbert
>>
>> Then I qdel it:
>>
>> qmaster messages:
>>
>> 08/30/2010 12:46:55|worker|norbert|W|job 584.1 failed on host norbert assumedly after job because: job 584.1 died through signal HUP (1)
>>
>> execd messages on norbert:
>>
>> 08/30/2010 12:46:55|  main|norbert|W|reaping job "584" ptf complains: Job does not exist
>> 08/30/2010 12:46:55|  main|norbert|E|can't open file active_jobs/584.1/error: No such file or directory
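
To pull only the warnings and errors for one job out of a messages
file, something like the following could be used. It is a sketch, not
an SGE tool: the function name job_problems is made up, and it assumes
the "|"-separated layout seen in the lines above (date, thread, host,
level, text).

```shell
# Print warning (W) and error (E) lines whose text mentions a given job id.
# Usage: job_problems JOBID MESSAGES_FILE
job_problems() {
    jobid=$1
    file=$2
    # Field 4 is the level, field 5 is the message text.
    awk -F'|' -v id="$jobid" '($4 == "W" || $4 == "E") && index($5, id) { print }' "$file"
}

# Possible use with the qmaster file mentioned above:
#   job_problems 584 "$SGE_ROOT/default/spool/qmaster/messages"
```

Note that this is a plain substring match on the text field, so a
short job id may also match longer ids that contain it.
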
>
> As other jobs are working, that spool directory doesn't seem to be full or
> write protected.

No, there is plenty of space. The job writes other data, like the trace
file, and local jobs work fine and write to the directory.
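
For anyone who wants to repeat that check, a rough sketch follows. The
function name check_spool_dir is made up for illustration, and the
directory to test is whatever execd spool directory is in use. It
should be run as the user that writes the spool files, since root
passes the write test regardless of permissions.

```shell
# Report whether a directory is writable and how much space is free
# on the filesystem that holds it.
check_spool_dir() {
    dir=$1
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "writable: yes"
    else
        echo "writable: no"
    fi
    # Column 4 of POSIX-format df output is the available space in 1K blocks.
    df -Pk "$dir" 2>/dev/null | awk 'NR == 2 { print "free_kb: " $4 }'
}

# Possible use (path taken from the execd trace quoted further down):
#   check_spool_dir /opt/gridengine/default/spool/sge_execd/norbert/active_jobs
```
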
>
> Can you qrsh "local" - from the headnode to itself?

yes, works fine.


>
> -- Reuti
>
>
>>
>>> To avoid a mismatch of programs: did you check with `qconf -sconfl`
>>> that there is no custom definition for the headnode?
>>>
>>
>> yes, "no config defined"
>>
>>> A `qsub`ed job works fine on the headnode?
>>
>> yes
>>
>> --
>> Doug
>>
>>
>>>
>>> -- Reuti
>>>
>>>
>>>> Second, examining the shepherd trace, I see that both
>>>> the threads are created without an error, but fail at
>>>>
>>>> 888    *exit_status = wait_my_child(job_pid, ...
>>>>
>>>> in sge_shepherd_ijs.c.  Here is the trace
>>>>
>>>> 08/29/2010 19:46:47 [0:19972]: parent: creating pty_to_commlib thread
>>>> 08/29/2010 19:46:47 [0:19972]: parent: creating commlib_to_pty thread
>>>> 08/29/2010 19:46:47 [0:19972]: parent: created both worker threads, now waiting for jobs end
>>>> 08/29/2010 19:46:59 [0:19972]: wait3 returned -1
>>>>
>>>> It is odd because the threads seem to start fine (no errors), but then
>>>> go away. As I said, it works fine if I run it on the host itself;
>>>> it only happens when I qrsh to the host from a worker node.
>>>> When I think about it, it must be some configuration
>>>> issue on the host (Scientific Linux 5.4), but I am not sure
>>>> where to look. Firewall is off.
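
To compare shepherd traces from the failing and the working case, a
small helper along these lines might help. It is only a sketch: the
function name classify_trace is made up, and the two patterns are
simply the marker lines that appear in the traces quoted in this
thread, not an official list of shepherd states.

```shell
# Classify a shepherd trace file by the marker lines seen in this thread.
# Usage: classify_trace TRACE_FILE
classify_trace() {
    trace=$1
    if grep -q 'wait3 returned -1' "$trace"; then
        # Seen in the failing worker -> head node case.
        echo "failed: wait3 returned -1"
    elif grep -q 'accepted connection' "$trace"; then
        # Seen in the working worker -> worker case.
        echo "ok: client connected"
    else
        echo "unknown"
    fi
}
```

Running it over the trace file of each job gives a quick overview of
which jobs never got a connection from the qrsh client.
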
>>>>
>>>> --
>>>> Doug
>>>>
>>>>
>>>>>>>>>>>> "norbert" is the head node running sge_qmaster and sge_execd;
>>>>>>>>>>>> the worker nodes are "n0" and "n2".
>>>>>>>>>>>>
>>>>>>>>>>>> From norbert I can "qrsh -l h=X hostname" where X is any node.
>>>>>>>>>>>> I can do the same from any worker node to any worker node.
>>>>>>>>>>>> But I cannot run jobs like "qrsh -l h=norbert hostname" from
>>>>>>>>>>>> a worker node (it does work on norbert -> norbert, however).
>>>>>>>>>>>>
>>>>>>>>>>>> Firewall is off, I checked permissions, all nodes
>>>>>>>>>>>> are exec and submit hosts, norbert is admin host.
>>>>>>>>>>>> ssh is used for remote login.
>>>>>>>>>>>> I spent time going through debug traces and logs
>>>>>>>>>>>> and cannot find anything obvious.
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have any ideas on what to check next?
>>>>>>>>>>>> I am wondering if the host node has some security
>>>>>>>>>>>> feature preventing qrsh from working, but I cannot
>>>>>>>>>>>> think of anything.
>>>>>>>>>>>>
>>>>>>>>>>>> Here are some data:
>>>>>>>>>>>>
>>>>>>>>>>>> CASE 1: From worker node to head node (UNSUCCESSFUL)
>>>>>>>>>>>> ====================================================
>>>>>>>>>>>>
>>>>>>>>>>>> "qrsh  -l h=norbert  hostname"
>>>>>>>>>>>>
>>>>>>>>>>>> Example ps output on head node (norbert)
>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>> 14231 ?        Sl     0:00 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>>>>>>>>>> 14257 ?        S      0:00  \_ sge_shepherd-490 -bg
>>>>>>>>>>>> 14258 ?        Ss     0:00      \_ sge_shepherd-490 -bg
>>>>>>>>>>>>
>>>>>>>>>>>> Trace from qrsh on node (dl=3)
>>>>>>>>>>>> ==============================
>>>>>>>>>>>> 18   6647         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>>>>>>>>>>>> 19   6647         main     ============================================
>>>>>>>>>>>> 20   6647         main     random polling set to 3
>>>>>>>>>>>> 21   6647         main     ---- got NO valid socket! ----
>>>>>>>>>>>> 22   6647         main     sge_set_auth_info: username(uid) = deadline(500), groupname = deadline(500)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Relevant parts of execd trace file for job
>>>>>>>>>>>> ------------------------------------------
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14441]: now running with uid=0, euid=0
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14441]: start qlogin
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14441]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1, /usr/sbin/sshd -i);
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14441]: uid = 0, euid = 0, gid = 0, egid = 0
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14441]: using sfd 0
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14441]: bound to port 39897
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - data = 0:39897:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/norbert/active_jobs/493.1:norbert
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - address = n2.cluster:56747
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14441]: write_to_qrsh - host = n2.cluster, port = 56747
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14440]: parent: forked "job" with pid 14441
>>>>>>>>>>>> 08/26/2010 08:31:32 [0:14440]: parent: job-pid: 14441
>>>>>>>>>>>> 08/26/2010 08:32:17 [0:14440]: wait3 returned -1
>>>>>>>>>>>>
>>>>>>>>>>>> CASE 2: From worker node to worker node (SUCCESSFUL)
>>>>>>>>>>>> ====================================================
>>>>>>>>>>>>
>>>>>>>>>>>> "qrsh  -l h=n0  hostname"
>>>>>>>>>>>>
>>>>>>>>>>>> Example ps output on worker node (n0)
>>>>>>>>>>>> -----------------------------
>>>>>>>>>>>> 3875 ?        Sl     0:11 /opt/gridengine/bin/lx26-amd64/sge_execd
>>>>>>>>>>>> 4083 ?        S      0:00  \_ sge_shepherd-483 -bg
>>>>>>>>>>>> 4084 ?        Ss     0:00      \_ sshd: deadline [priv]
>>>>>>>>>>>> 4086 ?        S      0:00          \_ sshd: deadline@notty
>>>>>>>>>>>> 4087 ?        Ss     0:00              \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool/sge_execd/n0/active_jobs/483.1
>>>>>>>>>>>> 4100 ?        S      0:00                  \_ sleep 500
>>>>>>>>>>>>
>>>>>>>>>>>> Trace from qrsh on node (dl=3)
>>>>>>>>>>>> ==============================
>>>>>>>>>>>> 18   6640         main     R E A D I N G    J O B ! ! ! ! ! ! ! ! ! ! !
>>>>>>>>>>>> 19   6640         main     ============================================
>>>>>>>>>>>> 20   6640         main     random polling set to 3
>>>>>>>>>>>> 21   6640         main     accepted client connection, fd = 3
>>>>>>>>>>>> 22   6640         main     qlogin_starter sent: 0:58576:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/509.1:n0
>>>>>>>>>>>> 23   6640         main     accepted client connection, fd = 3
>>>>>>>>>>>> 24   6640         main     exit_status = 0
>>>>>>>>>>>>
>>>>>>>>>>>> Relevant parts of execd trace file for job on worker node
>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: now running with uid=0, euid=0
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: start qlogin
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: calling qlogin_starter(/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1, /usr/sbin/sshd -i);
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: uid = 0, euid = 0, gid = 0, egid = 0
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: using sfd 0
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: bound to port 48105
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - data = 0:48105:/opt/gridengine/utilbin/lx26-amd64:/opt/gridengine/default/spool/sge_execd/n0/active_jobs/499.1:n0
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - address = n2.cluster:54407
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: write_to_qrsh - host = n2.cluster, port = 54407
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: waiting for connection.
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: accepted connection on fd 1
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4355]: daemon to start: |/usr/sbin/sshd -i|
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4354]: wait3 returned 4355 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
>>>>>>>>>>>> 08/26/2010 08:39:57 [0:4354]: job exited with exit status 0
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Doug
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------
>>>>>>>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=277191
>>>>>>>>>>>>
>>>>>>>>>>>> To unsubscribe from this discussion, e-mail:
>>>>>>>>>>>> [users-unsubscribe at gridengine.sunsource.net].
>>>>>>>>>>> --
>>>>>>>>>>> This message has been scanned for viruses and
>>>>>>>>>>> dangerous content by MailScanner, and is
>>>>>>>>>>> believed to be clean.


--
Doug

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=278288