[GE dev] crash in shepherd, gridengine 6.2u5

henk h.a.slim at durham.ac.uk
Fri May 28 15:36:38 BST 2010

Dear GridEngine developers,

I recently installed 6.2u5 on a new Nehalem-based cluster running SLES
11.0, combined with OpenMPI 1.4.2.

Jobs are failing erratically with a crash of the shepherd; an example is
attached. Often this happens immediately after startup, but it can also
happen after a job has run for 24 hours. If mpirun is used interactively
with a host file, everything works fine.

Is there a debugging option for this, or is it possible to build the
shepherd daemon locally? I do have the 6.2u5 source, but this requires a
complete build of the core package.

This is becoming a serious problem, as the cluster is ready for
hardware acceptance but cannot really be used because of this crash.

Thanks very much




    [ Part 2, "b_eff_sge64.e416.txt", Text/PLAIN, ~6.6 KB; attachment not shown. ]
