[GE users] here is a strange one: submitting to a PE reliably takes out sge_schedd

Reuti reuti at staff.uni-marburg.de
Thu Apr 3 10:55:31 BST 2008


Hi,

Is there any file in /tmp left behind by the crashed scheduler process?
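
Something along these lines should show any leftovers; the exact
file names vary across versions, so treat the glob patterns as
guesses:

  # look for panic/trace files left in /tmp by the dying daemon
  ls -l /tmp/*sge* /tmp/*schedd* 2>/dev/null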

-- Reuti


On 02.04.2008 at 18:23, Chris Dagdigian wrote:
>
> Can anyone think of any sort of chain of events that would result  
> in sge_schedd crashing whenever a job requesting a parallel  
> environment is submitted?
>
> Weird but true, and totally reproducible. Unfortunately it also  
> inconveniences users, since this is a production system in a  
> university environment.
>
> We're using a totally vanilla, loosely integrated MPICH parallel  
> environment with the latest 6.0ux binaries on OS X:
>
>> pe_name           mpich
>> slots             514
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /common/sge/mpi/startmpi.sh $pe_hostfile
>> stop_proc_args    /common/sge/mpi/stopmpi.sh
>> allocation_rule   $fill_up
>> control_slaves    FALSE
>> job_is_first_task TRUE
>> urgency_slots     min
>
> ... with "mpich" attached to all.q
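>
> For anyone who wants to double-check the same wiring locally, the
> usual qconf commands would be something like:
>
>   qconf -sp mpich                   # dump the PE definition
>   qconf -sq all.q | grep pe_list    # confirm mpich is in all.q's pe_list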
>
> Whenever I build and submit the simple MPICH example program "cpi",  
> I can reliably take out the scheduler.
>
> It seems to die almost instantly after job submission.
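>
> If it is the PE request itself that matters, rather than anything
> in the job, something this minimal ought to reproduce it. A sketch,
> with /bin/hostname standing in for the real job script:
>
>   # hypothetical minimal reproducer: ask for 2 slots from the PE
>   qsub -pe mpich 2 -b y /bin/hostname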
>
> I've examined the startmpi.sh and stopmpi.sh scripts by hand and  
> have not seen any problems. The next step may be to switch them  
> both over to "/bin/true" just to see whether the scheduler still crashes.
>
> Debugging time is limited as this is an active cluster in use by  
> many people.
>
> This cluster uses a SAN filesystem with classic spooling. They had  
> a nasty filesystem event that corrupted some files in the SGE spool  
> directory. We were able to find and fix those files by hand and  
> later on did a full upgrade install of the latest 6.0ux distribution.
>
> My only guess at this point is that I missed a corrupted text file  
> somewhere in the spool/ directory and that it is causing the  
> problem. I can't reproduce the crash on any other test SGE system,  
> either.
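>
> Since classic spooling keeps everything as plain text, one crude way
> to hunt for leftover corruption might be (assuming $SGE_ROOT is set
> and the cell is named "default"):
>
>   # anything file(1) doesn't classify as text in the spool tree is suspect
>   find $SGE_ROOT/default/spool -type f -print0 | xargs -0 file | grep -v text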
>
> I've eyeballed all of the standard sort of configuration and state  
> files and have not seen anything out of the ordinary.
>
> next steps:
>  - delete and recreate the MPICH PE (sketch below)
>  - move/copy the startmpi.sh and stopmpi.sh files just to rule out  
> bit rot or subtle corruption
>  - possibly: wipe the spool entirely and reinstall the config from  
> saved templates
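>
> For the first item, the cycle would look roughly like the following,
> reusing the template saved above (qconf may refuse the delete while
> "mpich" is still referenced in all.q's pe_list):
>
>   qconf -dp mpich               # delete the PE
>   qconf -Ap /tmp/mpich_pe.txt   # recreate it from the saved template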
>
> Anyone see this before? Are there particular files in a classic  
> spooling environment related to sge_schedd that I should be paying  
> special attention to? This is a strange one.
>
>
> Regards,
> Chris
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net