[GE users] erratic mpich2-mx errors

reuti reuti at staff.uni-marburg.de
Mon Dec 7 00:35:27 GMT 2009


On 04.12.2009 at 21:02, amato wrote:

> I run an Apple Xserve cluster which uses Myrinet for MPI comms. I am
> trying to do a loose integration of an MPICH2-mx parallel
> computational fluid dynamics (CFD) program with SGE. Now, I can
> submit this CFD program to my cluster via mpiexec with no problem. And
> if I compile the code to run on one processor, I can submit it via
> qsub with no problem. Also, I can run some basic MPI benchmarking
> programs (IMB) via qsub, using a PE I created, with no problem.
> The problem is submitting my CFD program via qsub: Usually the  
> program fails with errors about not being able to find an output  
> file to write to (a file that the program should've written but  
> didn't). Or it actually writes those output files, but they stay  
> empty and the code just runs forever on my cluster.
> However, if I create those files (empty) that the code should be  
> writing, then the program will execute normally and exit when  
> finished.
> To complicate matters more, the files that the CFD program is
> writing go into a directory where the code is also creating
> new folders, so I think permissions are the issue here.
> Has anyone ever encountered anything similar?  Does anyone have a  
> hunch as to where I can look to fix this situation?  Thanks!

What happens when you submit a script which mimics the behavior of
your software: does it fail as well? And is it the master task that
fails, or the slaves?

With only a loose integration, the tasks on the slaves may not know
the actual value of $TMPDIR or of other environment variables set for
the job.
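
Such a test script could look roughly like the sketch below. It is only
an illustration: the function name "probe", the folder name and the file
name out.dat are made up here and are not taken from the CFD program. It
prints what each task sees (host, user, umask, $TMPDIR) and then tries
to create a new folder and write a file into it, which is the pattern
described in the question.

```shell
#!/bin/sh
# Hypothetical probe: mimic "create a folder, then write a file into it"
# and report the environment the task actually sees.

probe() {
    base="${1:-$PWD}"                  # directory the real code would write into
    dir="$base/probe_$(hostname)_$$"   # new folder, unique per host and process
    echo "host=$(hostname) user=$(id -un) umask=$(umask) TMPDIR=${TMPDIR:-unset}"
    if mkdir -p "$dir" && echo test > "$dir/out.dat"; then
        echo "write OK: $dir/out.dat"
    else
        echo "write FAILED in $base"
        return 1
    fi
}

probe "$PWD"
```

Started inside the job script with the same mpiexec command line as the
real program, the output of each task should show whether the master and
the slaves see the same directory, user and $TMPDIR.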

Since you know the names of the necessary files, you can try to "touch"
them in a queue prolog.
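
A minimal sketch of such a prolog follows, under some assumptions: the
subdirectory "results" and the two file names are placeholders (the
original post does not name the files), and the output directory is
assumed to live on a filesystem shared by all nodes. SGE_O_WORKDIR is
the directory the job was submitted from; the fallback to $PWD is only
there so the script can be tried outside of a job.

```shell
#!/bin/sh
# Hypothetical queue prolog: pre-create the output files the program expects.
# OUTDIR and FILES are placeholders and must be adapted to the real program.
OUTDIR="${OUTDIR:-${SGE_O_WORKDIR:-$PWD}/results}"
FILES="residuals.dat forces.dat"

mkdir -p "$OUTDIR"
for f in $FILES; do
    # touch creates an empty file, or only updates the timestamp if it exists
    touch "$OUTDIR/$f"
done
```

The script would then be entered as the "prolog" attribute of the queue
configuration, e.g. via qconf -mq <queue_name>.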

-- Reuti

