[GE users] here is a strange one: submitting to a PE reliably takes out sge_schedd

Chris Dagdigian chris at bioteam.net
Wed Apr 2 17:23:34 BST 2008

Can anyone think of any sort of chain of events that would result in  
sge_schedd crashing whenever a job requesting a parallel environment  
is submitted?

Weird but true, and totally reproducible. This inconveniences  
users, as this is a production system in a university environment.

Using a totally vanilla loosely integrated mpich environment with the  
latest 6.0ux binaries on OS X:

> pe_name           mpich
> slots             514
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /common/sge/mpi/startmpi.sh $pe_hostfile
> stop_proc_args    /common/sge/mpi/stopmpi.sh
> allocation_rule   $fill_up
> control_slaves    FALSE
> job_is_first_task TRUE
> urgency_slots     min

... with "mpich" attached to all.q
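
For completeness, the PE definition and its queue attachment can be  
re-checked from the command line rather than from memory; a sketch,  
assuming the standard qconf client is on the PATH:

```shell
# Dump the PE definition and confirm it is attached to all.q.
# These are standard qconf queries; output should match the
# configuration quoted above.
qconf -spl                      # list all parallel environments
qconf -sp mpich                 # show the mpich PE definition
qconf -sq all.q | grep pe_list  # confirm mpich appears in all.q's pe_list
```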

Whenever I build and submit the simple MPICH example code "cpi" I can  
reliably take out the scheduler.

It seems to die almost instantly after job submission.

I've examined the startmpi.sh and stopmpi.sh scripts by hand and have  
not seen any problems. Next step may be to switch them over to  
/bin/true just to see if I can keep crashing the scheduler.
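
For reference, the reproducer is nothing exotic; a sketch of the kind  
of submit script involved, with the binary path and slot count being  
my assumptions rather than anything special:

```shell
#!/bin/sh
# Hypothetical submit script for the MPICH "cpi" reproducer.
#$ -N cpi_test
#$ -cwd
#$ -pe mpich 4
# With the loose integration, startmpi.sh writes the machine file
# into $TMPDIR/machines for mpirun to consume:
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./cpi
```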

Debugging time is limited as this is an active cluster in use by many  
users.
This cluster uses a SAN filesystem with classic spooling. They had a  
nasty filesystem event that corrupted some files in the SGE spool  
directory. We were able to find and fix those files by hand and later  
on did a full upgrade install of the latest 6.0ux distribution.

My only guess at this point is that I missed a corrupted text file  
somewhere in the spool/ directory that is causing the problem. I  
can't reproduce the crash on any other test SGE system, either.

I've eyeballed all of the standard sort of configuration and state  
files and have not seen anything out of the ordinary.
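
One way to go beyond eyeballing is a quick automated sweep for the  
two most obvious corruption signatures: zero-length files and  
embedded NUL bytes. A sketch; the throwaway demo directory below  
stands in for the real $SGE_ROOT/&lt;cell&gt;/spool path:

```shell
#!/bin/sh
# scan_spool reports files that are zero-length or contain NUL bytes,
# two common symptoms of the kind of filesystem event described above.
scan_spool() {
    find "$1" -type f | while read -r f; do
        if [ ! -s "$f" ]; then
            echo "EMPTY: $f"
        elif ! LC_ALL=C tr -d '\0' < "$f" | cmp -s - "$f"; then
            # stripping NULs changed the stream => file contains NULs
            echo "NULS:  $f"
        fi
    done
}

# Demo on a throwaway directory (point at the real spool path instead):
demo=$(mktemp -d)
printf 'pe_name mpich\n' > "$demo/ok_file"
: > "$demo/truncated_file"
printf 'sched\0conf' > "$demo/nul_file"

scan_spool "$demo" | sort
rm -rf "$demo"
```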

next steps:
  - delete and recreate the MPICH PE
  - move/copy the startmpi.sh and stopmpi.sh files just to rule out  
bit rot or subtle corruption
  - possibly: wipe the spool entirely and reinstall the config from a  
saved backup
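The list above translates into a handful of qconf operations; a  
sketch, with the /tmp file names being placeholders of my own:

```shell
# 1. Rule out the start/stop scripts: point the PE at /bin/true.
qconf -sp mpich > /tmp/mpich.pe          # save the current definition
sed -e 's|^start_proc_args .*|start_proc_args   /bin/true|' \
    -e 's|^stop_proc_args .*|stop_proc_args    /bin/true|' \
    /tmp/mpich.pe > /tmp/mpich-true.pe
qconf -Mp /tmp/mpich-true.pe             # modify the PE from file

# 2. If sge_schedd still dies, delete and recreate the PE outright:
qconf -dp mpich                          # delete the PE
qconf -Ap /tmp/mpich.pe                  # re-add from the saved dump
# (re-attach to all.q via "qconf -mq all.q" -> pe_list if needed)
```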
Anyone see this before? Are there particular files in a classic  
spooling environment related to sge_schedd that I should be paying  
special attention to? This is a strange one.


