[GE users] Can't connect to shepherd error

Heywood, Todd heywood at cshl.edu
Tue Jun 5 19:32:12 BST 2007


P.s.

The hung job eventually disappeared from "ps", and now there is the
following in /var/spool/sge/blade444/messages:

06/05/2007 12:27:30|execd|blade444|W|reaping job "1674155" ptf complains: Job does not exist
06/05/2007 14:13:39|execd|blade444|E|slave shepherd of job 1674155.1 exited with exit status = 11
06/05/2007 14:13:39|execd|blade444|E|slave shepherd of job 1674155.1 exited with exit status = 11
06/05/2007 14:24:54|execd|blade444|W|reaping job "1674431" ptf complains: Job does not exist
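
To see at a glance which messages repeat in a file like this, the entries can be tallied. The following is only a rough sketch: it assumes each entry is a single line with five "|"-separated fields, as in the excerpt above, and the field names (timestamp, daemon, host, level, text) are my own labels, not taken from the Grid Engine documentation. The helper names parse_line and count_messages are hypothetical.

```python
from collections import Counter
from typing import NamedTuple, Optional


class ExecdMessage(NamedTuple):
    # Field names are assumed labels for the five "|"-separated columns.
    timestamp: str
    daemon: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[ExecdMessage]:
    """Split one log line into its five fields, or return None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # e.g. a wrapped continuation line
    return ExecdMessage(*parts)


def count_messages(lines):
    """Count how often each (level, text) pair occurs."""
    counts = Counter()
    for line in lines:
        msg = parse_line(line)
        if msg is not None:
            counts[(msg.level, msg.text)] += 1
    return counts


# The four entries quoted above, each joined onto one line.
SAMPLE = [
    '06/05/2007 12:27:30|execd|blade444|W|reaping job "1674155" ptf complains: Job does not exist',
    "06/05/2007 14:13:39|execd|blade444|E|slave shepherd of job 1674155.1 exited with exit status = 11",
    "06/05/2007 14:13:39|execd|blade444|E|slave shepherd of job 1674155.1 exited with exit status = 11",
    '06/05/2007 14:24:54|execd|blade444|W|reaping job "1674431" ptf complains: Job does not exist',
]

for (level, text), n in count_messages(SAMPLE).most_common():
    print(f"{n:3d} {level} {text}")
```

For a real file, an open file object (for example for /var/spool/sge/blade444/messages) can be passed in place of SAMPLE. Lines that do not split into five fields are skipped rather than guessed at.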

Todd
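
For reference, the check Reuti asked for in the quoted message below (which add_grp_ids are in use on the node, compared against gid_range) could be scripted roughly as follows. This is a sketch under assumptions: it takes gid_range as a single "low-high" string as quoted in this thread, it assumes every addgrpid file under active_jobs holds one integer, and it only reads files, so it says nothing about how execd itself picks an ID. The function names are hypothetical, and the demo runs against a throwaway copy of the directory layout shown in the thread, not a real spool directory.

```python
import tempfile
from pathlib import Path


def parse_gid_range(spec):
    """Expand a 'low-high' string such as '20000-20100' into a set of ints."""
    low, high = (int(x) for x in spec.split("-", 1))
    return set(range(low, high + 1))


def used_add_grp_ids(active_jobs):
    """Map each directory below active_jobs that has an addgrpid file to its ID."""
    used = {}
    for path in sorted(Path(active_jobs).rglob("addgrpid")):
        key = str(path.parent.relative_to(active_jobs))
        used[key] = int(path.read_text().strip())
    return used


def report(spec, active_jobs):
    """Compare the IDs found on disk with the configured range."""
    pool = parse_gid_range(spec)
    in_use = set(used_add_grp_ids(active_jobs).values())
    return {
        "pool_size": len(pool),
        "in_use": sorted(in_use),
        "free": len(pool - in_use),
        "outside_range": sorted(in_use - pool),
    }


# Demo: rebuild the layout from the thread in a temporary directory.
demo_root = Path(tempfile.mkdtemp())
for task, gid in (("2.blade444", 20002), ("3.blade444", 20003), ("4.blade444", 20004)):
    task_dir = demo_root / "1674155.1" / task
    task_dir.mkdir(parents=True)
    (task_dir / "addgrpid").write_text(f"{gid}\n")

result = report("20000-20100", demo_root)
print(result)
```

On a node, demo_root would be replaced by the real active_jobs path, for example /var/spool/sge/blade444/active_jobs as shown below.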


On 6/5/07 2:11 PM, "Heywood, Todd" <heywood at cshl.edu> wrote:

> Reuti,
> 
> There's been a delay answering your question, since I had to wait for the
> error ("error: cannot  get connection to "shepherd" at host XXX") to happen
> again.
> 
> This time there is no message (yet) in /var/spool/sge/.../messages, but the
> job is hung on the node in question, after the user got the error message.
> 
> Here's how things are:
> 
> [root@blade444 1674155.1]# ps -ef |grep sge
> sgeadmin  3728     1  0 Jun04 ?        00:00:21 /opt/n1ge6/bin/lx24-amd64/sge_execd
> sgeadmin 20535  3728  0 12:49 ?        00:00:00 sge_shepherd-1674155 -bg
> delabast 20539 20538  0 12:49 ?        00:00:00 /opt/n1ge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/blade444/active_jobs/1674155.1/2.blade444 noshell
> sgeadmin 20569  3728  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
> delabast 20570 20569  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
> sgeadmin 20571  3728  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
> delabast 20572 20571  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
> root     22024 21723  0 14:06 pts/0    00:00:00 grep sge
> 
> [root@blade444 1674155.1]# pwd
> /var/spool/sge/blade444/active_jobs/1674155.1
> 
> [root@blade444 1674155.1]# ls
> 2.blade444  3.blade444  4.blade444
> 
> [root@blade444 1674155.1]# cat 2.blade444/addgrpid
> 20002
> [root@blade444 1674155.1]# cat 3.blade444/addgrpid
> 20003
> [root@blade444 1674155.1]# cat 4.blade444/addgrpid
> 20004
> [root@blade444 1674155.1]#
> 
> 
> Any ideas as to what is causing this issue?
> 
> 
> Thanks,
> 
> Todd
> 
> 
> 
> 
> On 5/27/07 3:25 PM, "Reuti" <reuti at staff.uni-marburg.de> wrote:
> 
>> On 25.05.2007 at 15:43, Heywood, Todd wrote:
>> 
>>> Hi,
>>> 
>>> We have a gid_range of 20000-20100, and allow only 4 jobs per node.
>>> 
>>> Does the execd look for an unused add_grp_id locally, or does it need to
>>> contact the master host? My guess is that this error is a function of
>>> certain loads on the cluster.
>> 
>> Can you check on the node which add_grp_ids are used by the
>> currently running jobs? They are recorded in the active_jobs directory
>> in the spool directory for this node.
>> 
>> -- Reuti
>> 
>>> Thanks,
>>> 
>>> Todd
>>> 
>>> 
>>> On 5/25/07 9:19 AM, "Fred Youhanaie" <fly at anydata.co.uk> wrote:
>>> 
>>>> Hi Todd,
>>>> 
>>>> It appears that your group ID range is too short for the number of jobs
>>>> you are running on the individual nodes; see the sge_conf man page,
>>>> parameter gid_range.
>>>> 
>>>> You need to increase gid_range to a larger value; it should be greater
>>>> than the number of concurrent jobs on a single node.
>>>> 
>>>> HTH
>>>> 
>>>> Cheers
>>>> f.
>>>> 
>>>> Heywood, Todd wrote:
>>>>> Hi,
>>>>> 
>>>>> A user is getting this sporadic error:
>>>>> 
>>>>>    error: cannot  get connection to "shepherd" at host "blade97"
>>>>> 
>>>>> When I look in /var/spool/sge/blade97/messages, I see this:
>>>>> 
>>>>> 05/24/2007 12:20:04|execd|blade97|W|reaping job "1407319" ptf complains: Job does not exist
>>>>> 05/24/2007 12:20:05|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>>>> 05/24/2007 12:20:05|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>>>> 05/24/2007 12:20:06|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>>>> 05/24/2007 12:20:13|execd|blade97|W|reaping job "1407319" ptf complains: Job does not exist
>>>>> 
>>>>> 
>>>>> Can anyone explain what this means and how it might be avoided?
>>>>> Thanks.
>>>>> 
>>>>> (On a related note, the message "reaping job... ptf complains: Job
>>>>> does not exist" is very common in the message files... why is this?)
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Todd
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>> 



