[GE users] Mpich2 and SGE
raychan at ucdavis.edu
Fri Jan 27 00:35:19 GMT 2006
I've been perplexed by a problem I've recently run into. I've had tight
integration between SGE6 and MPICH2 (daemonless smpd) working phenomenally
for a while now on my ROCKS cluster, but recently my MPICH2 app (mpiblast)
isn't stable anymore. With MPICH2 I put a .smpd file with a different
passphrase in each user's home directory (phrase=<your phrase>), with rw
access only for that user, as instructed in the MPICH2 documentation. This
used to work great with multiple users, but since I started running into
problems I've been testing with only two users so I can try to pinpoint the
cause.
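For illustration, the .smpd setup described above (a phrase=<your phrase>
line, file readable and writable by its owner only) could be verified with a
minimal Python sketch like the one below. The function name check_smpd_file
is hypothetical, and the sketch assumes the file holds plain key=value lines;
it is not part of MPICH2 or SGE.

```python
import os
import stat


def check_smpd_file(path):
    """Return a list of human-readable problems found with an .smpd file.

    Hypothetical helper; assumes the file contains "key=value" lines and
    that the passphrase is stored under the key "phrase".
    """
    if not os.path.isfile(path):
        return ["%s does not exist" % path]

    problems = []

    # The file should be accessible by its owner only (e.g. mode 0600).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append("%s is accessible by group/other (mode %o)" % (path, mode))

    # Look for a non-empty "phrase=..." entry.
    phrase = None
    with open(path) as handle:
        for line in handle:
            key, sep, value = line.strip().partition("=")
            if sep and key.strip() == "phrase":
                phrase = value.strip()
    if not phrase:
        problems.append("%s has no non-empty phrase= entry" % path)

    return problems
```

Calling check_smpd_file(os.path.expanduser("~/.smpd")) at the top of a job
and exiting non-zero when the list is not empty would make a missing phrase
show up as a clear error rather than a silent failure.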
The problem is that when I submit this parallel MPICH2 job script to SGE,
sometimes the job runs well, and neither a qdel nor a normal completion
leaves the cluster queue in an error state.
However, whenever I did run into problems with the submitted jobs, I found
that the passphrase I had explicitly put in my .smpd file had disappeared!
This would of course keep the job from running, because MPICH2 daemonless
smpd needs the .smpd file in each user's home directory for contact to be
made between the nodes. If I ran MPICH2 and the app interactively on the
command line, without SGE in the equation, MPICH2 would of course prompt me
for a passphrase if it did not find one in the .smpd file. I believe that
under SGE, MPICH2 still throws up the prompt when it can't find the phrase,
but SGE doesn't know what to do with it, so the job doesn't work.
I'm not sure whether this is an MPICH2 question or an SGE question
specifically, but I could not reproduce the problem with MPICH2 and my app
alone. I was wondering if anyone has encountered passphrases just
disappearing from the .smpd files. This is definitely the root of the
problem, since with the passphrases intact jobs run fine every time. I've
tried bouncing between my two test users, running jobs successfully as both
until one of them can't run a job anymore, to look for some kind of cause.
However, I can't find any direct link that explains why this happens.
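One way to narrow down when the passphrase goes missing might be to record
the state of the .smpd file at the start and end of every job. The Python
sketch below is only an illustration under stated assumptions: the function
name snapshot_smpd, the log location, and the labels are hypothetical, and
it assumes SGE exports JOB_ID into the job environment. It logs whether a
phrase is present, never the phrase itself.

```python
import os
import socket
import time


def snapshot_smpd(path, log_path, label):
    """Append one line describing the current state of an .smpd file.

    Hypothetical helper: records size, modification time and whether a
    non-empty phrase= line exists, without logging the passphrase.
    """
    try:
        info = os.stat(path)
        with open(path) as handle:
            stripped = [line.strip() for line in handle]
        has_phrase = any(
            s.startswith("phrase=") and s != "phrase=" for s in stripped
        )
        state = "size=%d mtime=%d has_phrase=%s" % (
            info.st_size, int(info.st_mtime), has_phrase)
    except OSError as exc:
        state = "unreadable (%s)" % exc.strerror

    record = "%s host=%s job=%s %s %s\n" % (
        time.strftime("%Y-%m-%d %H:%M:%S"),
        socket.gethostname(),
        os.environ.get("JOB_ID", "none"),  # assumed to be set by SGE
        label,
        state,
    )
    with open(log_path, "a") as log:
        log.write(record)
    return record
```

Calling it with a label such as "start" before the MPI launch and "end"
afterwards would show which job and host last saw the phrase intact.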
Hope someone has an idea,