[GE users] erratic mpich2-mx errors
reuti at staff.uni-marburg.de
Mon Dec 7 00:35:27 GMT 2009
On 04.12.2009 at 21:02, amato wrote:
> I run an Apple Xserve cluster which has Myrinet MPI comms. I am
> trying to do a loose integration of an MPICH2-mx parallel
> computational fluid dynamics (CFD) program with SGE. Now, I can
> submit this CFD program to my cluster via mpiexec with no problem. And
> if I compile the code to run on one processor I can submit this via
> qsub with no problem. Also, I can run some basic MPI benchmarking
> programs (IMB) via qsub using a PE I created with no problem.
> The problem is submitting my CFD program via qsub: Usually the
> program fails with errors about not being able to find an output
> file to write to (a file that the program should've written but
> didn't). Or it actually writes those output files, but they stay
> empty and the code just runs forever on my cluster.
> However, if I create those files (empty) that the code should be
> writing, then the program will execute normally and exit when
> To complicate matters more, the files that the CFD program is
> writing go into a directory where the code is also creating
> new folders, so I think permissions are the issue here.
> Has anyone ever encountered anything similar? Does anyone have a
> hunch as to where I can look to fix this situation? Thanks!
What happens when you submit a script which mimics the behavior of
your software: does it fail as well? Is it the master task or the
slaves that fail?
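
Such a test script could look roughly like the sketch below. It only
assumes that the real program creates a new folder and writes a file into
it; the function name probe_write, the folder name and the file name
out.dat are all made up for illustration. Each task records its host,
user and $TMPDIR, so a failure can be traced to the master or to a slave.

```shell
#!/bin/sh
# Hypothetical probe job: mimics a program that creates a new folder and
# writes an output file into it. All names here are placeholders.

probe_write() {
    # $1: directory the real program would write into
    outdir="$1/probe_$(uname -n)_$$"
    mkdir -p "$outdir" || {
        echo "mkdir failed in $1 on $(uname -n)" >&2
        return 1
    }
    # Record where this task ran and what environment it saw.
    echo "host=$(uname -n) user=$(id -un) TMPDIR=${TMPDIR:-unset}" \
        > "$outdir/out.dat" || {
        echo "write failed in $outdir on $(uname -n)" >&2
        return 1
    }
    # Print the path of the file that was written.
    echo "$outdir/out.dat"
}

# Default to the current working directory when no argument is given.
probe_write "${1:-$PWD}"
```

Started through the same mpiexec line and the same PE as the CFD program,
it should leave one folder per task; a missing folder or an error message
would point to the host that cannot write.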
With only a loose integration, the tasks on the slaves may not know
the actual value of $TMPDIR or of other environment variables.
As you know the names of the necessary files, you can try to "touch"
them in a queue prolog.
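
A minimal sketch of such a prolog follows. The file names result.out and
residuals.log and the function name precreate_outputs are placeholders,
not the real names used by the CFD program. It also assumes that the
prolog sees SGE_O_WORKDIR as the job does and that the working directory
is on a filesystem shared by all nodes.

```shell
#!/bin/sh
# Hypothetical queue prolog: pre-create the output files the job expects.
# File names are placeholders; SGE_O_WORKDIR is assumed to be set.

precreate_outputs() {
    # $1: target directory; remaining arguments: file names to create
    dir="$1"
    shift
    for f in "$@"; do
        touch "$dir/$f" || echo "prolog: cannot create $dir/$f" >&2
    done
}

# Replace result.out and residuals.log with the real file names.
# Falls back to the current directory when SGE_O_WORKDIR is not set.
precreate_outputs "${SGE_O_WORKDIR:-$PWD}" result.out residuals.log
```

The script would be entered as the prolog of the queue used by the PE. If
the files then exist but stay empty, that hints at the tasks writing
somewhere else rather than at a permission problem.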