[GE users] Can't connect to shepherd error

Reuti reuti at staff.uni-marburg.de
Sun May 27 20:25:34 BST 2007


On 25.05.2007 at 15:43, Heywood, Todd wrote:

> Hi,
>
> We have a gid_range of 20000-20100, and allow only 4 jobs per node.
>
> Does the execd look for an unused add_grp_id locally, or does it need to
> contact the master host? My guess is that this error is a function of
> certain loads on the cluster.

Can you check on the node which add_grp_ids are used by the currently
running jobs? They are recorded in the active_jobs directory inside the
spool directory for this node.
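
One way to do that check, as a minimal sketch rather than an official SGE tool: the Python below walks an active_jobs directory and reports which IDs from the gid_range are taken. It assumes each job subdirectory holds a one-line file named "addgrpid" with the additional group ID; that file name is an assumption, so verify it on your installation. The function names are hypothetical, and the gid_range parser only handles comma-separated "n-m" expressions.

```python
import os


def parse_gid_range(spec):
    """Expand a gid_range string such as '20000-20100' into a list of IDs.

    Assumes comma-separated 'n-m' expressions with inclusive bounds.
    """
    gids = []
    for part in spec.split(","):
        low, high = (int(x) for x in part.strip().split("-"))
        gids.extend(range(low, high + 1))
    return gids


def used_add_grp_ids(active_jobs_dir):
    """Return {job directory name: group ID} for every running job.

    Assumption: each job directory contains a file called 'addgrpid'.
    Directories without a readable numeric value are skipped.
    """
    used = {}
    for entry in sorted(os.listdir(active_jobs_dir)):
        path = os.path.join(active_jobs_dir, entry, "addgrpid")
        if not os.path.isfile(path):
            continue
        with open(path) as fh:
            text = fh.read().strip()
        if text.isdigit():
            used[entry] = int(text)
    return used


def free_gids(gids, used):
    """Return the IDs from the configured range that no running job holds."""
    taken = set(used.values())
    return [gid for gid in gids if gid not in taken]


# Example call, using the spool path and range mentioned in this thread:
#   used = used_add_grp_ids("/var/spool/sge/blade97/active_jobs")
#   print(used, len(free_gids(parse_gid_range("20000-20100"), used)))
```

With 101 IDs in 20000-20100 and at most 4 jobs per node, the list of free IDs should normally be long; if it comes back empty or nearly so, IDs are probably being held by jobs that no longer exist.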

-- Reuti

> Thanks,
>
> Todd
>
>
> On 5/25/07 9:19 AM, "Fred Youhanaie" <fly at anydata.co.uk> wrote:
>
>> Hi Todd,
>>
>> It appears that your group ID range is too small for the number of jobs
>> you are running on the individual nodes; see the sge_conf man page,
>> parameter gid_range.
>>
>> You need to increase gid_range to a larger value. It should be greater
>> than the number of concurrent jobs on a single node.
>>
>> HTH
>>
>> Cheers
>> f.
>>
>> Heywood, Todd wrote:
>>> Hi,
>>>
>>> A user is getting this sporadic error:
>>>
>>>    error: cannot  get connection to "shepherd" at host "blade97"
>>>
>>> When I look in /var/spool/sge/blade97/messages, I see this:
>>>
>>> 05/24/2007 12:20:04|execd|blade97|W|reaping job "1407319" ptf complains: Job does not exist
>>> 05/24/2007 12:20:05|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>> 05/24/2007 12:20:05|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>> 05/24/2007 12:20:06|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>> 05/24/2007 12:20:13|execd|blade97|W|reaping job "1407319" ptf complains: Job does not exist
>>>
>>>
>>> Can anyone explain what this means and how it might be avoided? Thanks.
>>>
>>> (On a related note, the message "reaping job... ptf complains: Job does not
>>> exist" is very common in the message files... why is this?)
>>>
>>> Thanks,
>>>
>>> Todd
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>>