[GE users] Mpich2 and SGE

Reuti reuti at staff.uni-marburg.de
Fri Jan 27 08:29:56 GMT 2006


On 27.01.2006 at 01:35, Raymond Chan wrote:

> Hi all,
> I've been perplexed by a recent problem I am encountering.  I've
> gotten tight integration w/ SGE6 and MPICH2 (daemonless smpd) to
> work phenomenally for a while now on my ROCKS cluster, but recently
> my MPICH2 app (mpiblast) isn't stable anymore.  With mpich2 I put
> a .smpd with a different passphrase in each user's home directory
> (phrase=<your phrase>) w/ rw access only to that user as instructed
> in the MPICH2 documentation.  This used to work great with multiple
> users, but now since I run into problems, I've been only testing w/
> two users so I can try to pinpoint the problem.
> The problem is that when I submit this parallel MPICH2 job script
> to SGE, sometimes it runs the job well, and qdel's of jobs or
> graceful terminations through completions never leave the cluster
> queue in an error state.  However, when I did run into problems
> with the jobs submitted I found that my .smpd file's passphrase
> that I explicitly put in disappears!  This would of course not
> allow the job to run because MPICH2 daemonless smpd needs the .smpd
> file in each user's home directory for contact to be made between
> the nodes.  If I would run the MPICH2 and app on the command line w/o
> SGE in the equation interactively then MPICH2 would of course
> prompt me for a passphrase if it did not find one in the .smpd
> file.  I believe w/ SGE when it can't find the phrase it throws up
> the prompt, but SGE doesn't know what to do with it so the job
> doesn't work.
> Not sure if this is a MPICH2 question or an SGE question
> specifically, but I could not reproduce the problem with MPICH2 and
> my app alone.  I was wondering if anyone encountered the
> passphrases just disappearing from the .smpd files.  This is
> definitely
As SGE isn't touching the .smpd file at all on its own, this looks
more like an MPICH2 issue. Maybe the difference from a start on the
command line is that two MPICH2 jobs might start up at nearly the
same time. So I can imagine a race condition in the MPICH2 code
that opens and closes this file.
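
To illustrate the kind of race meant here, below is a minimal Python
sketch. It is not MPICH2's actual code, and I don't know whether smpd
really rewrites the file this way. It only assumes, hypothetically, that
each job reads the phrase and then rewrites the file, and it replays one
unlucky interleaving of two such jobs in a single process. The file name
smpd_demo and all function names are made up for the example.

```python
import os
import tempfile


def read_phrase(path):
    """Return the value of the 'phrase=' line, or '' if none is present."""
    with open(path) as f:
        for line in f:
            if line.startswith("phrase="):
                return line.strip().split("=", 1)[1]
    return ""


def simulate_interleaved_rewrite(path):
    """Replay one unlucky interleaving of two read-then-rewrite cycles."""
    # Job A reads the phrase, then opens the file for writing.
    # Opening with mode "w" truncates the file immediately.
    phrase_a = read_phrase(path)
    writer_a = open(path, "w")

    # Job B starts at this moment, sees an empty file, and reads no phrase.
    phrase_b = read_phrase(path)

    # Job A finishes its rewrite and closes the file.
    writer_a.write("phrase=" + phrase_a + "\n")
    writer_a.close()

    # Job B now rewrites the file from what it read, which was nothing.
    with open(path, "w") as writer_b:
        writer_b.write("phrase=" + phrase_b + "\n")

    return read_phrase(path)


def demo():
    """Create a throwaway file with a phrase and run the interleaving on it."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "smpd_demo")  # hypothetical stand-in for ~/.smpd
        with open(path, "w") as f:
            f.write("phrase=secret\n")
        return simulate_interleaved_rewrite(path)
```

Calling demo() returns an empty string: the phrase is gone although
neither job ever meant to delete it. If something like this is what
happens, it would also explain why a single start from the command line
never shows the problem.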

Cheers - Reuti
> the root of the problem as w/ the passphrases intact, jobs run fine
> every time.  I've tried to bounce between my two test users running
> jobs successfully on both until one of the users can't run a job
> anymore to find some kind of causality.  However, I can't find any
> direct link to why this happens.
> Hope someone has an idea,
> Thank you,
> Ray