[GE users] Can't connect to shepherd error

Heywood, Todd heywood at cshl.edu
Wed Jun 6 17:09:24 BST 2007


Hi,

Unfortunately, administrator_mail was not configured (or rather, it is, but
sendmail is not).

I did look at the trace file when the job was hanging and saw nothing out of
the ordinary. The error file was empty. These disappeared when the job
finally exited by itself, at which point the "exit status = 11" messages
were written to the spool messages file.

This seems to be a sporadic error, occurring sometimes and sometimes not. Very
annoying to the user, though.

How about some conjectures? :-)

Todd


On 6/6/07 6:22 AM, "Ravi Chandra Nallan" <Ravichandra.Nallan at Sun.COM> wrote:

> Can you check what is in the trace and error files? Or did you set
> 'administrator_mail' in the global configuration? That generally reports
> the error and some of the trace before the error occurs.
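> 
> A rough sketch of how that could be set (the address is only a placeholder,
> and a working local MTA such as sendmail is assumed):
> 
>     qconf -sconf | grep administrator_mail   # show the current global setting
>     qconf -mconf    # edit the global configuration and set, e.g.: administrator_mail  sgeadmin@example.com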
> 
> It looks like the shepherd was unable to run the script for some reason,
> which might be logged in the error/trace files.
> (The trace/error files are in active_jobs/<job.taskid>/.)
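> 
> For example, something like this on the execution host, while the job is
> still active, should show them (<host>, <job>, and <task> are placeholders;
> parallel tasks keep their own copies in per-task subdirectories):
> 
>     cd /var/spool/sge/<host>/active_jobs/<job>.<task>
>     cat trace error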
> 
> The 'ptf complains' error message generally occurs during cleanup of a
> job, when the process (forked to execute the job) has exited even before
> the execd was able to collect the job details.
>    
> Ravi
> 
> Heywood, Todd wrote:
>> P.s.
>> 
>> The hung job eventually disappeared from "ps", and now there is the
>> following in /var/spool/sge/blade444/messages:
>> 
>> 06/05/2007 12:27:30|execd|blade444|W|reaping job "1674155" ptf complains: Job does not exist
>> 06/05/2007 14:13:39|execd|blade444|E|slave shepherd of job 1674155.1 exited with exit status = 11
>> 06/05/2007 14:13:39|execd|blade444|E|slave shepherd of job 1674155.1 exited with exit status = 11
>> 06/05/2007 14:24:54|execd|blade444|W|reaping job "1674431" ptf complains: Job does not exist
>> 
>> Todd
>> 
>> 
>> On 6/5/07 2:11 PM, "Heywood, Todd" <heywood at cshl.edu> wrote:
>> 
>>   
>>> Reuti,
>>> 
>>> There's been a delay answering your question, since I had to wait for the
>>> error ("error: cannot  get connection to "shepherd" at host XXX") to happen
>>> again.
>>> 
>>> This time there is no message (yet) in /var/spool/sge/.../messages, but the
>>> job is hung on the node in question, after the user got the error message.
>>> 
>>> Here's how things are:
>>> 
>>> [root@blade444 1674155.1]# ps -ef |grep sge
>>> sgeadmin  3728     1  0 Jun04 ?        00:00:21
>>> /opt/n1ge6/bin/lx24-amd64/sge_execd
>>> sgeadmin 20535  3728  0 12:49 ?        00:00:00 sge_shepherd-1674155 -bg
>>> delabast 20539 20538  0 12:49 ?        00:00:00
>>> /opt/n1ge6/utilbin/lx24-amd64/qrsh_starter
>>> /var/spool/sge/blade444/active_jobs/1674155.1/2.blade444 noshell
>>> sgeadmin 20569  3728  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
>>> delabast 20570 20569  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
>>> sgeadmin 20571  3728  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
>>> delabast 20572 20571  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
>>> root     22024 21723  0 14:06 pts/0    00:00:00 grep sge
>>> 
>>> [root@blade444 1674155.1]# pwd
>>> /var/spool/sge/blade444/active_jobs/1674155.1
>>> 
>>> [root@blade444 1674155.1]# ls
>>> 2.blade444  3.blade444  4.blade444
>>> 
>>> [root@blade444 1674155.1]# cat 2.blade444/addgrpid
>>> 20002
>>> [root@blade444 1674155.1]# cat 3.blade444/addgrpid
>>> 20003
>>> [root@blade444 1674155.1]# cat 4.blade444/addgrpid
>>> 20004
>>> [root@blade444 1674155.1]#
>>> 
>>> 
>>> Any ideas as to what is causing this issue?
>>> 
>>> 
>>> Thanks,
>>> 
>>> Todd
>>> 
>>> 
>>> 
>>> 
>>> On 5/27/07 3:25 PM, "Reuti" <reuti at staff.uni-marburg.de> wrote:
>>> 
>>>     
>>>> Am 25.05.2007 um 15:43 schrieb Heywood, Todd:
>>>> 
>>>>       
>>>>> Hi,
>>>>> 
>>>>> We have a gid_range of 20000-20100, and allow only 4 jobs per node.
>>>>> 
>>>>> Does the execd look for an unused add_grp_id locally, or does it
>>>>> need to
>>>>> contact the master host? My guess is that this error is a function of
>>>>> certain loads on the cluster.
>>>>>         
>>>> Can you check on the node which add_grp_ids are used by the
>>>> currently running jobs? They are in the active_jobs directory in the
>>>> spool directory for this node.
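>>>> 
>>>> Roughly, something like this on the node should list them all, including
>>>> the per-task subdirectories of parallel jobs (<node> is a placeholder):
>>>> 
>>>>     find /var/spool/sge/<node>/active_jobs -name addgrpid -exec grep -H . {} \;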
>>>> 
>>>> -- Reuti
>>>> 
>>>>       
>>>>> Thanks,
>>>>> 
>>>>> Todd
>>>>> 
>>>>> 
>>>>> On 5/25/07 9:19 AM, "Fred Youhanaie" <fly at anydata.co.uk> wrote:
>>>>> 
>>>>>         
>>>>>> Hi Todd,
>>>>>> 
>>>>>> It appears that your group ID range is too short for the number of
>>>>>> jobs you are running on the individual nodes; see the sge_conf man
>>>>>> page, parameter gid_range.
>>>>>> 
>>>>>> You need to increase gid_range to a larger value; it should be
>>>>>> greater than the number of concurrent jobs on a single node.
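>>>>>> 
>>>>>> A quick sketch of how to check and widen it (the new range below is
>>>>>> just an example):
>>>>>> 
>>>>>>     qconf -sconf | grep gid_range    # show the current range
>>>>>>     qconf -mconf                     # then set e.g.: gid_range  20000-20500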
>>>>>> 
>>>>>> HTH
>>>>>> 
>>>>>> Cheers
>>>>>> f.
>>>>>> 
>>>>>> Heywood, Todd wrote:
>>>>>>           
>>>>>>> Hi,
>>>>>>> 
>>>>>>> A user is getting this sporadic error:
>>>>>>> 
>>>>>>>    error: cannot  get connection to "shepherd" at host "blade97"
>>>>>>> 
>>>>>>> When I look in /var/spool/sge/blade97/messages, I see this:
>>>>>>> 
>>>>>>> 05/24/2007 12:20:04|execd|blade97|W|reaping job "1407319" ptf complains: Job does not exist
>>>>>>> 05/24/2007 12:20:05|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>>>>>> 05/24/2007 12:20:05|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>>>>>> 05/24/2007 12:20:06|execd|blade97|E|can't start job "1407319": can not find an unused add_grp_id
>>>>>>> 05/24/2007 12:20:13|execd|blade97|W|reaping job "1407319" ptf complains: Job does not exist
>>>>>>> 
>>>>>>> 
>>>>>>> Can anyone explain what this means and how it might be avoided?
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> (On a related note, the message "reaping job... ptf complains: Job does
>>>>>>> not exist" is very common in the message files... why is this?)
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Todd
>>>>>>> 

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
