[GE users] erratic mpich2-mx errors

reuti reuti at staff.uni-marburg.de
Mon Dec 7 23:39:37 GMT 2009


Hi Amato,

On 07.12.2009, at 17:47, amato wrote:

> Firstly, thanks so much for your time here; I realize my question was
> pretty odd & vague!
>
> On Dec 6, 2009, at 7:35 PM, reuti wrote:
>>
>> what is happening, when you try to submit a script which mimics the
>> behavior of your software - is it also failing? Is the master task or
>> the slaves failing?
>
> I have a shell script (cfd.sh) with sge options and an mpiexec
> command.  If I just run this script at the command line, like
> "./cfd.sh", the program executes just fine.  If I submit it with
> "qsub ./cfd.sh", the code fails when writing output to a text file
> (the output file is created, just not written to).  If those output
> text files already exist when I submit the job with qsub, the code
> usually runs just fine (occasionally it crashes with a "Stale NFS"
> error, but then I resubmit and it goes through fine).

Are you using any kind of automounter? That is a common source of
exactly this kind of error.
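
Also, just to be sure we are talking about the same kind of setup: I
assume your cfd.sh looks roughly like the sketch below. The parallel
environment name, slot count and binary are placeholders of mine, not
taken from your cluster:

   #!/bin/sh
   # hypothetical loose-integration job script - all names are examples
   #$ -S /bin/sh
   #$ -N cfd
   #$ -cwd
   #$ -pe mpich2 8
   #$ -o cfd.$JOB_ID.out
   #$ -e cfd.$JOB_ID.err

   # $NSLOTS is filled in by SGE for parallel jobs
   mpiexec -n $NSLOTS ./cfd_solver

With -cwd the job runs in the directory you submitted from, which
matters for where those output text files end up.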


> I'm not sure whether the master or the slaves are failing. If I don't make
> these output text files before the job starts, the job just fills up
> my processors and then hangs.  So I guess this is the slaves failing?
>
>> When you have only a loose integration, the tasks on the slaves may
>> not know the actual value of $TMPDIR or other environment
>> variables.
>
> I had thought about this, and I have:
>
> TMPDIR='.'

What is defined in your queue configuration for the entry "tmpdir"?
It's best to have a big /tmp partition on the nodes. If the job always
crashes, maybe the application is using $TMPDIR whenever it is set
(many programs nowadays honor this), and if that directory can't be
accessed it will of course crash.
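
You can check the current setting with something like this (the queue
name "all.q" is just an example):

   # show the tmpdir entry of a queue
   qconf -sq all.q | grep tmpdir

   # adjust it, if needed, via the queue configuration editor
   qconf -mq all.q

SGE then creates a per-job directory below that path on every node and
exports it to the job as $TMPDIR (though with a loose integration only
the master task will see it, as mentioned above).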

-- Reuti


> in my shell script.  Without it the code crashes no matter what.
>
>> As you know the names of the necessary files, you can try to "touch"
>> them in a queue prolog.
>
> That's a good idea; I'll just add this to my script.  It would be nice
> to fix this problem at the root, though.  Any ideas about where to look
> for this bug would be greatly appreciated.  And thanks again for your
> time!
>
> Sincerely,
> Amato
>
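
P.S. Regarding the prolog idea from my last mail, a minimal sketch of
what I meant is below. The file names and the use of $SGE_O_WORKDIR are
assumptions about your application, so adjust them to whatever your
code actually writes:

   #!/bin/sh
   # hypothetical queue prolog: pre-create the output files the solver
   # expects, so the tasks find existing files when they start writing
   cd "$SGE_O_WORKDIR" || exit 1
   for f in forces.txt residuals.txt; do    # example names only
       touch "$f"
   done
   exit 0

The script would be registered in the "prolog" entry of the queue
configuration (qconf -mq <queue>). But as you say, this only works
around the symptom rather than fixing the root cause.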
