[GE users] Shepherd errors

Heywood, Todd heywood at cshl.edu
Fri Jun 29 14:40:52 BST 2007


They are local spool directories /var/spool/sge/...

I'm basically reduced to wild guesses, such as: perhaps heavy traffic over the
node's network interface makes the qmaster time out the connection to the
shepherd (the stderr message), and the other errors are residual effects of
that. I'm hoping an SGE developer might have insight into possible causes of
this!
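
One way to check whether the failures really are spread over random nodes
would be to tally the shepherd exit lines from each node's messages file. Here
is a minimal Python sketch, assuming every such line has the same format as the
single execd line quoted below; the function names are made up for
illustration and are not part of SGE:

```python
import re
from collections import Counter

# Pattern modeled on the one execd log line quoted in this thread.
# Assumption: other shepherd exit lines share this exact layout.
SHEPHERD_EXIT = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})"
    r"\|execd\|(?P<host>[^|]+)\|E\|"
    r"slave shepherd of job (?P<job>\d+\.\d+) "
    r"exited with exit status = (?P<status>\d+)"
)


def parse_shepherd_exits(lines):
    """Return (host, job, status, timestamp) for each matching log line."""
    events = []
    for line in lines:
        m = SHEPHERD_EXIT.match(line.strip())
        if m:
            timestamp = m["date"] + " " + m["time"]
            events.append((m["host"], m["job"], int(m["status"]), timestamp))
    return events


def tally_by_host(events):
    """Count shepherd exit events per host."""
    return Counter(host for host, _job, _status, _when in events)
```

Feeding it the lines of each node's messages file (for example via
open(path).readlines()) and comparing the per-host counts and timestamps
against periods of heavy network traffic might show whether the timeout guess
holds up.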

Thanks,

Todd




On 6/29/07 4:24 AM, "Schenker, Martin" <MSchenker at illumina.com> wrote:

> Are you using a central spool dir or a local one (on each node)? We had
> similar random occurrences on a Lustre/SFS system, but switching the spooling
> to a local dir (/var/spool/sge) seems to have cured this very odd behaviour...
> 
> Best, Martin
> 
> 
> 
> -----Original Message-----
> From: Heywood, Todd [mailto:heywood at cshl.edu]
> Sent: 28 June 2007 19:21
> To: users at gridengine.sunsource.net
> Subject: [GE users] Shepherd errors
> 
> 
> I have a recurrent error which is driving me nuts. It occurs for a pipeline
> application which submits thousands of jobs over a 6-hour period.
> Sometimes the pipeline finishes fine, and other times it stops with this
> error:
> 
> In stderr:
> 
> error: cannot get connection to "shepherd" at host "blade15"
> 
> In email sent to the SGE admin user:
> 
> failed before job:06/27/2007 02:57:34 [0:23701]: can't open file
> /tmp/1738723.1.solexa.q/pid.339.blade15: No such file
> 
> In /var/spool/sge/blade15/messages:
> 
> 06/27/2007 02:57:34|execd|blade15|E|slave shepherd of job 1738723.1 exited
> with exit status = 11
> 
> This doesn't happen only on blade15, but on some random node. Further, the
> node has been happily processing pipeline jobs for hours up until the
> failure.
> 
> Any ideas on how to diagnose this further?
> 
> Thanks,
> 
> Todd
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
