[GE users] erratic mpich2-mx errors

amato ate9c at virginia.edu
Mon Dec 7 16:47:29 GMT 2009


Hi Reuti,

Firstly, thanks so much for your time here; I realize my question was  
pretty odd & vague!

On Dec 6, 2009, at 7:35 PM, reuti wrote:
>
> what is happening, when you try to submit a script which mimics the
> behavior of your software - is it also failing? Is the master task or
> the slaves failing?

I have a shell script (cfd.sh) with SGE options and an mpiexec
command.  If I just run this script at the command line as "./cfd.sh",
the program executes just fine.  If I submit it with "qsub ./cfd.sh",
the code fails when writing output to a text file (the output file is
created, just not written to).  If those output text files already
exist when I submit the job with qsub, the code usually runs just fine
(occasionally it crashes on a "Stale NFS" error, but then I resubmit
and it goes through).

I'm not sure if the master or the slaves are failing. If I don't make  
these output text files before the job starts, the job just fills up  
my processors and then hangs.  So I guess this is the slaves failing?

> When you have only a loose integration, the tasks on the slaves may
> not know the actual name of the $TMPDIR or other environment  
> variables.

I had thought about this, and I have:

TMPDIR='.'

in my shell script.  Without it the code crashes no matter what.
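
For illustration, here is a minimal sketch of a more explicit variant of
that setting.  It is not what my script currently does.  It assumes a POSIX
shell and that the directory the script runs in is the one the temporary
files should go to.  It uses an absolute path and an explicit export, so
that child processes of the script see the same value whatever their own
working directory is:

```shell
#!/bin/sh
# Sketch only: use an absolute path instead of '.', so the value does not
# depend on the working directory of whichever process reads it.
TMPDIR="$(pwd)"

# Export it so that child processes of this script inherit the value.
export TMPDIR

echo "TMPDIR is $TMPDIR"
```

Whether processes started on other nodes see this value depends on how the
MPI launcher forwards the environment, which I have not checked.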

> As you know the names of the necessary files, you can try to "touch"
> them in a queue prolog.

That's a good idea; I'll just add this to my script.  It would be nice
to fix this problem at the root, though.  Any ideas on where to look
for this bug would be greatly appreciated.  And thanks again for your
time!
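
A rough sketch of the "touch" workaround as it could look at the top of the
job script, before the mpiexec line.  The file names below are hypothetical
placeholders, not the real names of the output files my code writes:

```shell
#!/bin/sh
# Hypothetical output file names; replace with the files the code writes.
OUTPUT_FILES="residuals.txt forces.txt"

# Create each output file if it does not exist yet.  touch leaves the
# content of an existing file alone and only updates its timestamp.
for f in $OUTPUT_FILES; do
    touch "$f" || { echo "cannot create $f" >&2; exit 1; }
done

# The existing mpiexec command from cfd.sh would follow here.
```

This only works around the symptom; it does not explain why the files
cannot be written to when the job itself creates them.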

Sincerely,
Amato

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232044
