[GE users] Can't connect to shepherd error

Heywood, Todd heywood at cshl.edu
Tue Jun 5 19:11:57 BST 2007


Reuti,

There's been a delay answering your question, since I had to wait for the
error ("error: cannot  get connection to "shepherd" at host XXX") to happen
again.

This time there is no message (yet) in /var/spool/sge/.../messages, but the
job is hung on the node in question, after the user got the error message.

Here's how things are:

[root@blade444 1674155.1]# ps -ef |grep sge
sgeadmin  3728     1  0 Jun04 ?        00:00:21 /opt/n1ge6/bin/lx24-amd64/sge_execd
sgeadmin 20535  3728  0 12:49 ?        00:00:00 sge_shepherd-1674155 -bg
delabast 20539 20538  0 12:49 ?        00:00:00 /opt/n1ge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/blade444/active_jobs/1674155.1/2.blade444 noshell
sgeadmin 20569  3728  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
delabast 20570 20569  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
sgeadmin 20571  3728  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
delabast 20572 20571  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
root     22024 21723  0 14:06 pts/0    00:00:00 grep sge

[root@blade444 1674155.1]# pwd
/var/spool/sge/blade444/active_jobs/1674155.1

[root@blade444 1674155.1]# ls
2.blade444  3.blade444  4.blade444

[root@blade444 1674155.1]# cat 2.blade444/addgrpid
20002
[root@blade444 1674155.1]# cat 3.blade444/addgrpid
20003
[root@blade444 1674155.1]# cat 4.blade444/addgrpid
20004
[root@blade444 1674155.1]#
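
The same check can be done for every job on the node in one pass. What
follows is only a minimal sketch, assuming each running job or task keeps
its additional group ID in a file named addgrpid somewhere below the
active_jobs directory, as in the listing above. The function names
list_addgrpids and count_in_range are hypothetical helpers, not part of
Grid Engine.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with Grid Engine.

# Print "<gid> <path>" for every addgrpid file below the given directory.
# Assumes spool paths contain no newlines.
list_addgrpids() {
    find "$1" -type f -name addgrpid | sort | while IFS= read -r f; do
        printf '%s %s\n' "$(cat "$f")" "$f"
    done
}

# Count how many of those group IDs fall inside the range lo..hi
# (the two bounds of gid_range, inclusive).
count_in_range() {
    list_addgrpids "$1" |
        awk -v lo="$2" -v hi="$3" \
            '$1+0 >= lo+0 && $1+0 <= hi+0 { n++ } END { print n+0 }'
}

# Example calls, using the path and gid_range quoted in this thread:
#   list_addgrpids /var/spool/sge/blade444/active_jobs
#   count_in_range /var/spool/sge/blade444/active_jobs 20000 20100
```

Comparing the count against the size of gid_range shows whether the range
is actually exhausted on the node at the moment the error appears.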


Any ideas as to what is causing this issue?


Thanks,

Todd




On 5/27/07 3:25 PM, "Reuti" <reuti at staff.uni-marburg.de> wrote:

> On 25.05.2007 at 15:43, Heywood, Todd wrote:
> 
>> Hi,
>> 
>> We have a gid_range of 20000-20100, and allow only 4 jobs per node.
>> 
>> Does the execd look for an unused add_grp_id locally, or does it need to
>> contact the master host? My guess is that this error is a function of
>> certain loads on the cluster.
> 
> Can you check on the node which add_grp_ids are used by the
> currently running jobs? They are in the active_jobs directory in the
> spool directory for this node.
> 
> -- Reuti
> 
>> Thanks,
>> 
>> Todd
>> 
>> 
>> On 5/25/07 9:19 AM, "Fred Youhanaie" <fly at anydata.co.uk> wrote:
>> 
>>> Hi Todd,
>>> 
>>> It appears that your group ID range is too small for the number of
>>> jobs you are running on the individual nodes; see the gid_range
>>> parameter in the sge_conf man page.
>>> 
>>> You need to increase gid_range so that it covers more IDs than the
>>> number of concurrent jobs on a single node.
>>> 
>>> HTH
>>> 
>>> Cheers
>>> f.
>>> 
>>> Heywood, Todd wrote:
>>>> Hi,
>>>> 
>>>> A user is getting this sporadic error:
>>>> 
>>>>    error: cannot  get connection to "shepherd" at host "blade97"
>>>> 
>>>> When I look in /var/spool/sge/blade97/messages, I see this:
>>>> 
>>>> 05/24/2007 12:20:04|execd|blade97|W|reaping job "1407319" ptf complains: Job does not exist
>>>> 05/24/2007 12:20:05|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>>> 05/24/2007 12:20:05|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>>> 05/24/2007 12:20:06|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>>> 05/24/2007 12:20:13|execd|blade97|W|reaping job "1407319" ptf complains: Job does not exist
>>>> 
>>>> 
>>>> Can anyone explain what this means and how it might be avoided?
>>>> Thanks.
>>>> 
>>>> (On a related note, the message "reaping job... ptf complains: Job
>>>> does not exist" is very common in the message files... why is this?)
>>>> 
>>>> Thanks,
>>>> 
>>>> Todd
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>> 
>>>> 
>>> 



