[GE users] Can't connect to shepherd error

Heywood, Todd heywood at cshl.edu
Thu Jun 7 14:41:08 BST 2007


Hi,

I have another SGE error with this application, and it is getting more
confusing. To summarize, I have 3 cases (the first two already mentioned in
this exchange):

1. The user gets a stderr message: "error: cannot get connection to
"shepherd" at host "bladeXXX"". The exec host messages file complains about
not finding an unused add_grp_id, yet we only have 4 slots per node with a
gid range of 200 (how I have been checking this is sketched after the
summary below).

2. The user gets the same stderr message: "error: cannot get connection to
"shepherd" at host "bladeXXX"". The exec host messages file says: "|E|slave
shepherd of job 1674155.1 exited with exit status = 11".

3. The user gets the following stderr message:

Connection closed by 172.20.122.6
Connection closed by 172.20.128.6
Connection closed by 172.20.128.6
can't open file /tmp/1681869.1.public.q/pid.2.blade300: No such file or
directory
can't open file /tmp/1681869.1.public.q/pid.2.blade384: No such file or
directory
can't open file /tmp/1681869.1.public.q/pid.1.blade384: No such file or
directory
Write failed: Broken pipe
qmake[1]: *** [offsets1_0] Error 1
qmake[1]: *** Waiting for unfinished jobs....
qmake[1]: Write failed: Broken pipe
*** [offsets3_0] Error 1
qmake[1]: *** [offsets8_0] Error 1
can't open file /tmp/1681869.1.public.q/pid.4.blade300: No such file or
directory
can't open file /tmp/1681869.1.public.q/pid.3.blade300: No such file or
directory
qmake: *** [nonrecursive] Error 2

The exec host messages file says: "|E|slave shepherd of job 1674155.1 exited
with exit status = 11".

To recap: cases 1 and 2 show the same stderr message about not being able
to connect to the shepherd, while cases 2 and 3 show the same
exit_status=11 error in the exec host messages file.
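
For reference, here is roughly how I have been checking the group IDs on a
node. The spool path is our layout, bladeXXX is a placeholder, and the glob
assumes the addgrpid files sit one or two levels below active_jobs (as they
do for our tightly integrated parallel tasks), so adjust as needed:

# configured range, from the global configuration
qconf -sconf | grep gid_range

# additional group IDs currently held by active jobs on this node
cat /var/spool/sge/bladeXXX/active_jobs/*/addgrpid \
    /var/spool/sge/bladeXXX/active_jobs/*/*/addgrpid 2>/dev/null

With at most 4 slots per node this never turns up more than a handful of
IDs, which is why the "can not find an unused add_grp_id" complaint is so
puzzling.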

Exit status 11 apparently means this:

[root at bhmnode2 n1ge6]# perl -e 'die$!=11'
Resource temporarily unavailable at -e line 1.
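
(I am assuming here that the shepherd's exit status really is an errno,
which I am not sure of; errno 11 on Linux is EAGAIN. The same string can
also be looked up directly, e.g.:

perl -MPOSIX -e 'print strerror(11), "\n"'

If it really is EAGAIN, my first guess would be a fork() or resource-limit
failure on the exec host, but that is speculation.)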


This application has been running for a few months and only started acting
up in the last couple of weeks: the first case under SGE 6.0u8, and the
second two after I upgraded to 6.1. I don't think we have had any cluster
configuration changes or a different mix of load.

Any ideas on getting to the bottom of this would be greatly appreciated!

Thanks,

Todd


On 6/6/07 6:22 AM, "Ravi Chandra Nallan" <Ravichandra.Nallan at Sun.COM> wrote:

> Can you check what is in the trace and error files? Or did you set
> 'administrator_mail' in the global configuration? That generally reports
> the error and some trace before the error occurs.
> 
> It looks like the shepherd was unable to run the script for some reason,
> which might be logged in the error/trace files.
> (The trace/error files are in active_jobs/<job.taskid>/ )
> 
> The 'ptf complains' error msg generally occurs during cleanup of a job,
> when the process (forked while executing the job) has exited even before
> the execd was able to collect the job details.
>    
> Ravi
> 
> Heywood, Todd wrote:
>> P.s.
>> 
>> The hung job eventually disappeared from "ps", and now there is the
>> following in /var/spool/sge/blade444/messages:
>> 
>> 06/05/2007 12:27:30|execd|blade444|W|reaping job "1674155" ptf complains:
>> Job does not exist
>> 06/05/2007 14:13:39|execd|blade444|E|slave shepherd of job 1674155.1 exited
>> with exit status = 11
>> 06/05/2007 14:13:39|execd|blade444|E|slave shepherd of job 1674155.1 exited
>> with exit status = 11
>> 06/05/2007 14:24:54|execd|blade444|W|reaping job "1674431" ptf complains:
>> Job does not exist
>> 
>> Todd
>> 
>> 
>> On 6/5/07 2:11 PM, "Heywood, Todd" <heywood at cshl.edu> wrote:
>> 
>>   
>>> Reuti,
>>> 
>>> There's been a delay answering your question, since I had to wait for the
>>> error ("error: cannot  get connection to "shepherd" at host XXX") to happen
>>> again.
>>> 
>>> This time there is no message (yet) in /var/spool/sge/.../messages, but the
>>> job is hung on the node in question, after the user got the error message.
>>> 
>>> Here's how things are:
>>> 
>>> [root at blade444 1674155.1]# ps -ef |grep sge
>>> sgeadmin  3728     1  0 Jun04 ?        00:00:21
>>> /opt/n1ge6/bin/lx24-amd64/sge_execd
>>> sgeadmin 20535  3728  0 12:49 ?        00:00:00 sge_shepherd-1674155 -bg
>>> delabast 20539 20538  0 12:49 ?        00:00:00
>>> /opt/n1ge6/utilbin/lx24-amd64/qrsh_starter
>>> /var/spool/sge/blade444/active_jobs/1674155.1/2.blade444 noshell
>>> sgeadmin 20569  3728  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
>>> delabast 20570 20569  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
>>> sgeadmin 20571  3728  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
>>> delabast 20572 20571  0 12:50 ?        00:00:00 sge_shepherd-1674155 -bg
>>> root     22024 21723  0 14:06 pts/0    00:00:00 grep sge
>>> 
>>> [root at blade444 1674155.1]# pwd
>>> /var/spool/sge/blade444/active_jobs/1674155.1
>>> 
>>> [root at blade444 1674155.1]# ls
>>> 2.blade444  3.blade444  4.blade444
>>> 
>>> [root at blade444 1674155.1]# cat 2.blade444/addgrpid
>>> 20002
>>> [root at blade444 1674155.1]# cat 3.blade444/addgrpid
>>> 20003
>>> [root at blade444 1674155.1]# cat 4.blade444/addgrpid
>>> 20004
>>> [root at blade444 1674155.1]#
>>> 
>>> 
>>> Any ideas as to what is causing this issue?
>>> 
>>> 
>>> Thanks,
>>> 
>>> Todd
>>> 
>>> 
>>> 
>>> 
>>> On 5/27/07 3:25 PM, "Reuti" <reuti at staff.uni-marburg.de> wrote:
>>> 
>>>     
>>>> Am 25.05.2007 um 15:43 schrieb Heywood, Todd:
>>>> 
>>>>       
>>>>> Hi,
>>>>> 
>>>>> We have a gid_range of 20000-20100, and allow only 4 jobs per node.
>>>>> 
>>>>> Does the execd look for an unused add_grp_id locally, or does it
>>>>> need to
>>>>> contact the master host? My guess is that this error is a function of
>>>>> certain loads on the cluster.
>>>>>         
>>>> Can you check on the node which add_grp_id's are used by the
>>>> currently running jobs? They are in the active_jobs directory in the
>>>> spool directory for this node.
>>>> 
>>>> -- Reuti
>>>> 
>>>>       
>>>>> Thanks,
>>>>> 
>>>>> Todd
>>>>> 
>>>>> 
>>>>> On 5/25/07 9:19 AM, "Fred Youhanaie" <fly at anydata.co.uk> wrote:
>>>>> 
>>>>>         
>>>>>> Hi Todd,
>>>>>> 
>>>>>> It appears that your group id range is too short for the number of
>>>>>> jobs you are running on the individual nodes; see the sge_conf man
>>>>>> page, parameter gid_range.
>>>>>> 
>>>>>> You need to increase gid_range to a larger value; it should be
>>>>>> greater than the number of concurrent jobs on a single node.
>>>>>> 
>>>>>> HTH
>>>>>> 
>>>>>> Cheers
>>>>>> f.
>>>>>> 
>>>>>> Heywood, Todd wrote:
>>>>>>           
>>>>>>> Hi,
>>>>>>> 
>>>>>>> A user is getting this sporadic error:
>>>>>>> 
>>>>>>>    error: cannot  get connection to "shepherd" at host "blade97"
>>>>>>> 
>>>>>>> When I look in /var/spool/sge/blade97/messages, I see this:
>>>>>>> 
>>>>>>> 05/24/2007 12:20:04|execd|blade97|W|reaping job "1407319" ptf
>>>>>>> complains: Job
>>>>>>> does not exist
>>>>>>> 05/24/2007 12:20:05|execd|blade97|E|can't start job "1407319":
>>>>>>> can not find
>>>>>>> an unused add_grp_id
>>>>>>> 05/24/2007 12:20:05|execd|blade97|E|can't start job "1407319":
>>>>>>> can not find
>>>>>>> an unused add_grp_id
>>>>>>> 05/24/2007 12:20:06|execd|blade97|E|can't start job "1407319":
>>>>>>> can not find
>>>>>>> an unused add_grp_id
>>>>>>> 05/24/2007 12:20:13|execd|blade97|W|reaping job "1407319" ptf
>>>>>>> complains: Job
>>>>>>> does not exist
>>>>>>> 
>>>>>>> 
>>>>>>> Can anyone explain what this means and how it might be avoided?
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> (On a related note, the message "reaping job... ptf complains:
>>>>>>> Job does not
>>>>>>> exist" is very common in the message files... why is this?)
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Todd
>>>>>>> 
