[GE users] here is a strange one: submitting to a PE reliably takes out sge_schedd

Daniel Templeton Dan.Templeton at Sun.COM
Thu Apr 3 17:36:52 BST 2008



Chris,

Since the problem is so easily reproducible, consider getting an 
sge_schedd binary compiled with debugging information.  When it crashes, 
the core file will then tell us where, why, and how, and we can dissect 
it to figure out what's going wrong.  I know it's a production system, 
but since you can reproduce the problem on demand, gathering a core 
shouldn't have a big impact.
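
Something like the following would get us a usable backtrace (paths are 
illustrative; adjust $SGE_ROOT and the arch directory, e.g. darwin-ppc 
or darwin-x86, for your install):

  # allow the scheduler to dump core, then restart it from the same shell
  ulimit -c unlimited
  $SGE_ROOT/bin/darwin/sge_schedd

  # after the crash, open the core with the debug binary
  # (on OS X the core usually lands under /cores/; pid is illustrative)
  gdb $SGE_ROOT/bin/darwin/sge_schedd /cores/core.12345
  (gdb) bt    # backtrace showing where it died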

Daniel

Chris Dagdigian wrote:
>
> Can anyone think of any sort of chain of events that would result in 
> sge_schedd crashing whenever a job requesting a parallel environment 
> is submitted?
>
> Weird but true, and totally reproducible. Unfortunately this is a 
> production system in a university environment, so every crash 
> inconveniences real users.
>
> Using a totally vanilla, loosely integrated MPICH parallel environment 
> with the latest 6.0ux binaries on OS X:
>
>> pe_name           mpich
>> slots             514
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /common/sge/mpi/startmpi.sh $pe_hostfile
>> stop_proc_args    /common/sge/mpi/stopmpi.sh
>> allocation_rule   $fill_up
>> control_slaves    FALSE
>> job_is_first_task TRUE
>> urgency_slots     min
>
> ... with "mpich" attached to all.q
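>
> (Verified the attachment with qconf; output roughly from memory:)
>
>   $ qconf -sq all.q | grep pe_list
>   pe_list               mpich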
>
> Whenever I build and submit the simple MPICH example code "cpi", I can 
> reliably take out the scheduler.
>
> It seems to die almost instantly after job submission.
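>
> The submission itself is nothing exotic -- roughly this (script name 
> and slot count are illustrative):
>
>   $ qsub -pe mpich 4 ./run_cpi.sh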
>
> I've examined the startmpi.sh and stopmpi.sh scripts by hand and have 
> not seen any problems. Next step may be to switch them over to 
> "/bin/true" just to see if I can keep crashing the scheduler.
>
> Debugging time is limited as this is an active cluster in use by many 
> people.
>
> This cluster uses a SAN filesystem with classic spooling. They had a 
> nasty filesystem event that corrupted some files in the SGE spool 
> directory. We were able to find and fix those files by hand and later 
> on did a full upgrade install of the latest 6.0ux distribution.
>
> My only guess at this point is that I missed a corrupted text file 
> somewhere in the spool/ directory that is causing the problem. I can't 
> reproduce the crash on any other test SGE system either.
>
> I've eyeballed all of the standard sort of configuration and state 
> files and have not seen anything out of the ordinary.
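>
> (The sort of thing I've been running -- paths assume the default cell 
> and the classic-spooling directory layout, which I may be remembering 
> imperfectly:)
>
>   $ cd $SGE_ROOT/default/spool/qmaster
>   $ cat pe/mpich                # the spooled PE definition
>   $ LC_ALL=C grep -n '[^[:print:][:space:]]' pe/mpich   # hunt for junk bytes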
>
> next steps:
>  - delete and recreate the MPICH PE (see the sketch after this list)
>  - move/copy the startmpi.sh and stopmpi.sh files just to rule out bit 
> rot or subtle corruption
>  - possibly: wipe the spool entirely and reinstall config from saved 
> templates
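>
> For the first item, roughly this (untested; the PE may need to come 
> out of all.q's pe_list before qconf will let me delete it):
>
>   $ qconf -sp mpich > /tmp/mpich.pe   # save the current definition
>   $ qconf -dp mpich                   # delete the PE
>   $ qconf -Ap /tmp/mpich.pe           # re-add it from the saved file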
>
> Anyone see this before? Are there particular files in a classic 
> spooling environment related to sge_schedd that I should be paying 
> special attention to? This is a strange one.
>
>
> Regards,
> Chris
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



