[GE users] Shepherd errors

Heywood, Todd heywood at cshl.edu
Thu Jun 28 19:21:23 BST 2007


I have a recurring error which is driving me nuts. It occurs in a pipeline
application which submits thousands of jobs over a 6-hour period.
Sometimes the pipeline finishes fine, and other times it stops with this
error:

In stderr:

error: cannot get connection to "shepherd" at host "blade15"

In email sent to the SGE admin user:

failed before job:06/27/2007 02:57:34 [0:23701]: can't open file
/tmp/1738723.1.solexa.q/pid.339.blade15: No such file

In /var/spool/sge/blade15/messages:

06/27/2007 02:57:34|execd|blade15|E|slave shepherd of job 1738723.1 exited
with exit status = 11

This doesn't happen only on blade15; it hits a seemingly random node each
time. Further, the affected node has been happily processing pipeline jobs
for hours up until the failure.
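
To check which nodes are affected and how often, the per-node messages
files could be scanned for this error. The following is only a minimal
sketch: the pipe-delimited field layout (timestamp, daemon, host, level,
message) is an assumption inferred from the single line quoted above, and
the function names parse_messages and failures_per_host are hypothetical.

```python
import re
from collections import Counter

# Assumed layout, inferred from the one log line quoted above:
#   timestamp|daemon|host|level|message
SHEPHERD_EXIT = re.compile(
    r"shepherd of job (\S+) exited\s+with exit status = (\d+)"
)


def parse_messages(text):
    """Return (host, job_id, exit_status) for each execd shepherd-exit error."""
    hits = []
    for line in text.splitlines():
        fields = line.split("|", 4)
        if len(fields) != 5:
            continue  # not in the assumed five-field format
        _timestamp, daemon, host, level, message = fields
        if daemon != "execd" or level != "E":
            continue  # only execd error-level entries are of interest
        match = SHEPHERD_EXIT.search(message)
        if match:
            hits.append((host, match.group(1), int(match.group(2))))
    return hits


def failures_per_host(text):
    """Count shepherd-exit errors per host."""
    return Counter(host for host, _job, _status in parse_messages(text))


# The log line from this report, joined onto one line.
sample = (
    "06/27/2007 02:57:34|execd|blade15|E|"
    "slave shepherd of job 1738723.1 exited with exit status = 11\n"
)

if __name__ == "__main__":
    for host, count in failures_per_host(sample).most_common():
        print(host, count)
```

Feeding it the concatenated contents of each node's messages file would
show whether the failures cluster on particular nodes or times.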

Any ideas on how to diagnose this further?

Thanks,

Todd
