[GE users] erratic mpich2-mx errors
ate9c at virginia.edu
Mon Dec 7 16:47:29 GMT 2009
Firstly, thanks so much for your time here; I realize my question was
pretty odd & vague!
On Dec 6, 2009, at 7:35 PM, reuti wrote:
> what is happening, when you try to submit a script which mimics the
> behavior of your software - is it also failing? Is the master task or
> the slaves failing?
I have a shell script (cfd.sh) with SGE options and an mpiexec
command. If I just run this script at the command line as "./cfd.sh" the
program executes just fine. If I submit it with "qsub ./cfd.sh" the
code fails on writing output to a text file (the output file is
created, just not written to). If those output text files already
exist when I submit the job with qsub, the code usually runs just fine
(occasionally it crashes on a "Stale NFS" error, but then I resubmit
and it goes just fine).
I'm not sure if the master or the slaves are failing. If I don't make
these output text files before the job starts, the job just fills up
my processors and then hangs. So I guess this is the slaves failing?
> When you have only a loose integration, the tasks on the slaves may
> not know the actual name of the $TMPDIR or other environment
I had thought about this, and I have:
in my shell script. Without it the code crashes no matter what.
> As you know the names of the necessary files, you can try to "touch"
> them in a queue prolog.
That's a good idea, I'll just add this to my script. It would be nice
to fix this problem at the root though. Any ideas of where to look
for this bug would be greatly appreciated. And thanks again for your
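
For reference, here is a minimal sketch of the "touch" workaround as it might sit near the top of cfd.sh, before the mpiexec line. The file names (residuals.txt, forces.txt) and the OUTPUT_FILES / OUTPUT_DIR variables are placeholders invented for illustration; the real output file names are not given in this thread. It only works around the symptom and does not explain why the files cannot be written when they do not already exist.

```shell
#!/bin/sh
# Sketch only: pre-create the output files so they exist before the MPI job starts.
# ASSUMPTION: file names and variables below are hypothetical placeholders.
OUTPUT_FILES="${OUTPUT_FILES:-residuals.txt forces.txt}"
OUTPUT_DIR="${OUTPUT_DIR:-.}"

precreate_outputs() {
    for f in $OUTPUT_FILES; do
        # touch creates an empty file if it is missing and leaves
        # the contents of an existing file untouched.
        touch "$OUTPUT_DIR/$f" || return 1
    done
}

if ! precreate_outputs; then
    echo "could not pre-create output files in $OUTPUT_DIR" >&2
    exit 1
fi

# The existing mpiexec command from cfd.sh would follow here, unchanged.
```

The same loop could go into a queue prolog instead, as suggested above, provided the prolog runs in (or is pointed at) the directory the job writes to.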