[GE users] here is a strange one: submitting to a PE reliably takes out sge_schedd
chris at bioteam.net
Wed Apr 2 17:23:34 BST 2008
Can anyone think of any sort of chain of events that would result in
sge_schedd crashing whenever a job requesting a parallel environment
is submitted?
Weird but true, and totally reproducible, although reproducing it
inconveniences users because this is a production system in a
university environment.
Using a totally vanilla loosely integrated mpich environment with the
latest 6.0ux binaries on OS X:
> pe_name mpich
> slots 514
> user_lists NONE
> xuser_lists NONE
> start_proc_args /common/sge/mpi/startmpi.sh $pe_hostfile
> stop_proc_args /common/sge/mpi/stopmpi.sh
> allocation_rule $fill_up
> control_slaves FALSE
> job_is_first_task TRUE
> urgency_slots min
... with "mpich" attached to all.q
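For anyone comparing a PE definition like the one above against a known-good system, here is a minimal sketch in Python that parses the listing into a dict and reports which keys are absent. The function names and the EXPECTED_KEYS set are hypothetical helpers, not part of SGE; the key list is taken only from the listing quoted in this post, and the parser assumes simple "key value" lines, optionally prefixed with "> ".

```python
# Keys taken from the PE listing quoted in this post (not an official schema).
EXPECTED_KEYS = {
    "pe_name", "slots", "user_lists", "xuser_lists",
    "start_proc_args", "stop_proc_args", "allocation_rule",
    "control_slaves", "job_is_first_task", "urgency_slots",
}


def parse_pe_config(text):
    """Parse 'key value' lines (optionally '> '-quoted) into a dict."""
    config = {}
    for raw in text.splitlines():
        # Drop email quote markers and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        config[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return config


def missing_keys(config):
    """Return the expected keys that are absent from the parsed config."""
    return sorted(EXPECTED_KEYS - set(config))
```

Feeding it the saved output of the PE listing from both the broken cluster and a working test system would make any dropped or mangled line stand out quickly.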
Whenever I build and submit the simple MPICH example code "cpi" I can
reliably take out the scheduler.
It seems to die almost instantly after job submission.
I've examined the startmpi.sh and stopmpi.sh scripts by hand and have
not seen any problems. Next step may be to switch them over to
"/bin/true" just to see if I can keep crashing the scheduler.
Debugging time is limited as this is an active cluster in use by many
users.
This cluster uses a SAN filesystem with classic spooling. They had a
nasty filesystem event that corrupted some files in the SGE spool
directory. We were able to find and fix those files by hand and later
on did a full upgrade install of the latest 6.0ux distribution.
My only guess at this point is that I missed a corrupted text file
somewhere in the spool/ directory that is causing the problem. I also
can't reproduce the crash on any other test SGE system.
I've eyeballed all of the standard sort of configuration and state
files and have not seen anything out of the ordinary.
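Since eyeballing can miss damage, a scripted pass over the spool may help. The following is a rough sketch in Python, under the assumption that classic spool files are plain text, so a file that is empty, contains NUL bytes, or fails to decode as UTF-8 deserves a closer look. The function name is hypothetical, the checks are heuristics rather than anything SGE defines, and some files may be legitimately empty.

```python
import os


def find_suspect_files(spool_dir):
    """Return sorted paths of files that look damaged for a text spool.

    A file is flagged if it cannot be read, is empty, contains NUL
    bytes, or is not valid UTF-8. These are heuristics only.
    """
    suspects = []
    for root, _dirs, files in os.walk(spool_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                # Unreadable files are worth a manual look too.
                suspects.append(path)
                continue
            if not data or b"\x00" in data:
                suspects.append(path)
                continue
            try:
                data.decode("utf-8")
            except UnicodeDecodeError:
                suspects.append(path)
    return sorted(suspects)
```

It would be pointed at the qmaster spool directory, wherever that lives on the installation in question (the exact path is site-specific and not stated in this post), and the resulting list reviewed by hand.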
Next steps I am considering:
- delete and recreate the MPICH PE
- move/copy the startmpi.sh and stopmpi.sh files just to rule out
bit rot or subtle corruption
- possibly: wipe the spool entirely and reinstall config from saved
Anyone see this before? Are there particular files in a classic
spooling environment related to sge_schedd that I should be paying
special attention to? This is a strange one.