[GE users] Can't connect to shepherd error

Heywood, Todd heywood at cshl.edu
Thu Jun 7 17:58:11 BST 2007


> Todd,
> 
> Just a few thoughts...
> 
> Can you run the application outside of SGE control?
> Is it a serial program or MPI etc?
> if MPI has anything changed in that environment?
> 
> Cheers
> f.
> 

Fred,

This is a parallel application that uses parallel make, i.e. this method
(from the qmake man page):

       The shell script

              #!/bin/sh
              qmake -inherit --

       can be submitted by

              qsub -cwd -v PATH -pe make 1-10 [further sge options] <script>

       Qmake  will  inherit  the resources granted for the job submitted above
       under parallel environment "make".
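
For reference, the two pieces of that man page excerpt fit together roughly as in the sketch below. The file name qmake_wrapper.sh is made up for illustration, and the qsub line is only printed rather than executed, so the sketch can be tried on a machine without SGE.

```shell
#!/bin/sh
# Sketch only: "qmake_wrapper.sh" is a hypothetical file name.
wrapper=qmake_wrapper.sh

# Write the two-line wrapper script quoted from the qmake man page.
cat > "$wrapper" <<'EOF'
#!/bin/sh
qmake -inherit --
EOF
chmod +x "$wrapper"

# Assemble the submission command from the man page (slot range 1-10 in
# parallel environment "make"). It is echoed here, not run.
submit_cmd="qsub -cwd -v PATH -pe make 1-10 $wrapper"
echo "$submit_cmd"
```

On a real cluster the echoed line would be run as-is, with any further SGE options placed before the script name.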


It has run without problems for a few months (via SGE), until a couple of
weeks ago. 

I guess I am just asking SGE developers what conditions lead SGE to give the
error messages I am seeing.

Todd



> Heywood, Todd wrote:
>> Hi,
>> 
>> I have another SGE error with this application, and it is getting more
>> confusing. To summarize, I have 3 cases (the first two already mentioned in
>> this exchange):
>> 
>> 1. The user gets an stderr message: " error: cannot  get connection to
>> "shepherd" at host "bladeXXX" ". The exec host messages file has a complaint
>> about not finding an unused add_grp_id. But we only have 4 slots per node
>> with a gid range of 200.
>> 
>> 2. The user gets an stderr message: " error: cannot  get connection to
>> "shepherd" at host "bladeXXX" ". The exec host messages file says: "|E|slave
>> shepherd of job 1674155.1 exited with exit status = 11".
>> 
>> 3. The user gets the following stderr message:
>> 
>> Connection closed by 172.20.122.6
>> Connection closed by 172.20.128.6
>> Connection closed by 172.20.128.6
>> can't open file /tmp/1681869.1.public.q/pid.2.blade300: No such file or
>> directory
>> can't open file /tmp/1681869.1.public.q/pid.2.blade384: No such file or
>> directory
>> can't open file /tmp/1681869.1.public.q/pid.1.blade384: No such file or
>> directory
>> Write failed: Broken pipe
>> qmake[1]: *** [offsets1_0] Error 1
>> qmake[1]: *** Waiting for unfinished jobs....
>> qmake[1]: Write failed: Broken pipe
>> *** [offsets3_0] Error 1
>> qmake[1]: *** [offsets8_0] Error 1
>> can't open file /tmp/1681869.1.public.q/pid.4.blade300: No such file or
>> directory
>> can't open file /tmp/1681869.1.public.q/pid.3.blade300: No such file or
>> directory
>> qmake: *** [nonrecursive] Error 2
>> 
>> The exec host message file says: "|E|slave shepherd of job 1674155.1 exited
>> with exit status = 11".
>> 
>> To summarize: Cases 1 and 2 have the same stderr error message about not
>> being able to connect to shepherd. Cases 2 and 3 have the same
>> exit_status=11 error message in the exec host messages file.
>> 
>> Exit status 11 apparently means this:
>> 
>> [root@bhmnode2 n1ge6]# perl -e 'die$!=11'
>> Resource temporarily unavailable at -e line 1.
>> 
>> 
>> This application has been running for a few months and only started acting
>> up in the last couple of weeks: the first case under SGE 6.0u8, and the other
>> two after I upgraded to 6.1. I don't think we have had any cluster
>> configuration changes or a different load pattern.
>> 
>> Any ideas on getting to the bottom of this would be greatly appreciated!
>> 
>> Thanks,
>> 
>> Todd
> 
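
On the add_grp_id complaint in case 1: one quick sanity check is to compute how many ids a gid_range setting covers and compare that with the slot count. The sketch below assumes a single LOW-HIGH range. The value 20000-20199 is a made-up example chosen to be 200 wide, not the cluster's real setting (the real one appears in the output of qconf -sconf).

```shell
#!/bin/sh
# Number of group ids covered by a single range of the form LOW-HIGH.
gid_range_size() {
    low=${1%-*}     # text before the dash
    high=${1#*-}    # text after the dash
    echo $((high - low + 1))
}

gid_range="20000-20199"   # hypothetical example value
slots_per_node=4          # slot count mentioned in case 1

size=$(gid_range_size "$gid_range")
if [ "$size" -ge "$slots_per_node" ]; then
    echo "gid_range covers $size ids for $slots_per_node slots"
else
    echo "gid_range too small: $size ids for $slots_per_node slots"
fi
```

With numbers like these the range is far larger than the slot count, which is why the "no unused add_grp_id" complaint does not look like a simple sizing problem.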

> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 




