[GE users] Can't connect to shepherd error

Fred Youhanaie fly at anydata.co.uk
Thu Jun 7 16:56:46 BST 2007



Heywood, Todd wrote:
> Hi,
> 
> I have another SGE error with this application, and it is getting more
> confusing. To summarize, I have 3 cases (the first two already mentioned in
> this exchange):
> 
> 1. The user gets an stderr message: " error: cannot  get connection to
> "shepherd" at host "bladeXXX" ". The exec host messages file has a complaint
> about not finding an unused add_grp_id. But we only have 4 slots per node
> with a gid range of 200.
> 
> 2. The user gets an stderr message: " error: cannot  get connection to
> "shepherd" at host "bladeXXX" ". The exec host messages file says: "|E|slave
> shepherd of job 1674155.1 exited with exit status = 11".
> 
> 3. The user gets the following stderr message:
> 
> Connection closed by 172.20.122.6
> Connection closed by 172.20.128.6
> Connection closed by 172.20.128.6
> can't open file /tmp/1681869.1.public.q/pid.2.blade300: No such file or
> directory
> can't open file /tmp/1681869.1.public.q/pid.2.blade384: No such file or
> directory
> can't open file /tmp/1681869.1.public.q/pid.1.blade384: No such file or
> directory
> Write failed: Broken pipe
> qmake[1]: *** [offsets1_0] Error 1
> qmake[1]: *** Waiting for unfinished jobs....
> qmake[1]: Write failed: Broken pipe
> *** [offsets3_0] Error 1
> qmake[1]: *** [offsets8_0] Error 1
> can't open file /tmp/1681869.1.public.q/pid.4.blade300: No such file or
> directory
> can't open file /tmp/1681869.1.public.q/pid.3.blade300: No such file or
> directory
> qmake: *** [nonrecursive] Error 2
> 
> The exec host message file says: "|E|slave shepherd of job 1674155.1 exited
> with exit status = 11".
> 
> To summarize: Cases 1 and 2 have the same stderr error message about not
> being able to connect to shepherd. Cases 2 and 3 have the same
> exit_status=11 error message in the exec host messages file.
> 
> Exit status 11 apparently means this:
> 
> [root@bhmnode2 n1ge6]# perl -e 'die$!=11'
> Resource temporarily unavailable at -e line 1.
> 
> 
> This application has been running for a few months, and only started acting
> up in the last couple of weeks, the first case for SGE 6.0u8 and the second
> two after I upgraded to 6.1. I don't think we have had any cluster
> configuration changes or different sets of loads.
> 
> Any ideas on getting to the bottom of this would be greatly appreciated!
> 
> Thanks,
> 
> Todd

Todd,

Just a few thoughts...

Can you run the application outside of SGE control?
Is it a serial program, or MPI, etc.?
If it is MPI, has anything changed in that environment?
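
On the "exit status = 11" reading: the Perl one-liner treats 11 as an errno
value, but a small integer like this could equally be a signal number. Which
of the two (if either) the shepherd actually reports is an assumption on my
part, not something the messages file confirms. A minimal Python sketch that
prints both readings for comparison (describe_status is a made-up helper
name, not part of SGE):

```python
import errno
import os
import signal


def describe_status(code: int) -> dict:
    """Return two possible readings of a small integer exit status.

    Assumption: the status is either an errno value or a signal number.
    The program that produced it may use its own private exit codes instead.
    """
    # Reading 1: an errno value, as the quoted Perl one-liner assumes.
    errno_name = errno.errorcode.get(code, "unknown")
    errno_text = os.strerror(code)

    # Reading 2: a signal number (the process was killed by that signal).
    try:
        signal_name = signal.Signals(code).name
    except ValueError:
        signal_name = "unknown"

    return {
        "errno_name": errno_name,
        "errno_text": errno_text,
        "signal_name": signal_name,
    }


if __name__ == "__main__":
    info = describe_status(11)
    print("as errno :", info["errno_name"], "-", info["errno_text"])
    print("as signal:", info["signal_name"])
```

The errno text is platform dependent, so run it on one of the exec hosts
rather than on a workstation. If the signal reading turns out to be the right
one, a crash in the slave shepherd would be worth looking into.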

Cheers
f.

