[GE users] determining when a mpirun has succeeded and failed

Adam Bruss adam.bruss at staarinc.com
Fri Jul 6 21:23:55 BST 2007


Yes, it came down to us looking into the source code and finding out
exactly what was being returned. The value returned in the source
propagates to MPI and to the exit_status in the accounting file. That's what
I need.
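
A minimal sketch of reading that value back, assuming the record printed by
"qacct -j" contains a line of the form "exit_status <number>". The helper
name exit_status_of and the variable JOB_ID are made up for illustration and
do not come from this thread:

```shell
#!/bin/sh
# Hypothetical helper: reads "qacct -j"-style output on stdin and prints
# the value of the exit_status field (assumed layout: "exit_status <n>").
exit_status_of() {
    awk '$1 == "exit_status" { print $2 }'
}

# Intended use on a submit host (JOB_ID is a placeholder):
#   qacct -j "$JOB_ID" | exit_status_of
```

A non-zero value printed here would then indicate that the solver reported
a failure, provided the solver really exits with a non-zero status.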

The thing is I didn't write the source code and I'm still somewhat new to
this kind of stuff.

Thanks for the help.



-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Friday, July 06, 2007 11:08 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] determining when a mpirun has succeeded and failed

On 06.07.2007 at 17:27, Adam Bruss wrote:
> The reason for "-pe mpi 1" is arbitrary. I could have "-pe mpi 16"  
> to have it run on all 16 of our processors. It doesn't matter for  
> what I'm trying to do.
>
> If I don't have the -D option I get this error message: "Error:Proc  
> 0 Err: RUNNING UNLICENSED VERSION!"
That is not a LAM license; LAM is open source. Maybe your program looks
for a special "license" file in the current working directory. But you
state below that you wrote the software yourselves?
> The -c option is concerned with how many executables to run and not  
> anything to do with parallel processing.
Yes, but since there is a separate LAM universe for each job, just use the
uppercase C to use all slots/nodes allocated to this job. That way
you will never have to change it when you decide to change the
number of slots to be used.
> I was wrong when I said mpirun failed. What fails is the solver 
> (dfem) that we wrote.
So, how do you currently return from this subroutine when it fails?

As you have the source code, you can easily put a return statement
there (as I outlined). The returned value can then be checked with $? on
the command line after running "mpirun..." interactively, and it should
also show up in the accounting file.

-- Reuti

> MPI didn't fail. I want to be able to tell when our solver(dfem)  
> fails through an error code. I was hoping this could be handled by  
> the exit_status variable in accounting rather than the output from  
> our solver.
>
>  Here again is the command:
>
>  qsub -N dfem -b y -V -pe mpi 1 "mpirun -D -c 1 /Analyst/v10dev/dfem -type rf3p /Analyst/v10dev/wgdblbnd.sup"
>
>
>
> -Adam
>
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Friday, July 06, 2007 5:55 AM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] determining when a mpirun has succeeded and  
> failed
>
>
>
> On 05.07.2007 at 19:02, Adam Bruss wrote:
>
> > -D is needed for the LAM licensing.
> > -c 1 tells mpirun to run one copy of the executable.
> > dfem is the executable and the stuff after it are arguments to dfem.
>
> I'm getting confused: what is the purpose of running a parallel job
> with only one CPU? After investigating, I did find these options for
> mpirun, but with a different explanation:
>
> -D Change current working directory of new processes to the directory
> where the executable resides
>
> I don't know whether this option really makes sense. So, just run
> with C as the only option and it should work, as the Tight Integration
> already sets up the right things for you.
>
> If you create an MPI error, e.g.:
>
> return(MPI_ERR_OTHER);
> MPI_Finalize();
>
> (hence before the closing of the MPI environment), you can test it on
> the command line with:
>
> echo $?
>
> after the mpirun. This return code should also appear in the SGE
> accounting file. What exactly do you mean by "mpirun failed"?
>
> -- Reuti
>
>
>
>
>
> > It works this way as far as running the job goes. In its current
> > state the exit_status of qacct is zero if the mpi run was a success
> > and zero if the mpi run failed. I want to have the exit_status from
> > qacct tell me if the mpi job failed or succeeded.
> >
> > According to a colleague of mine, SGE should be able to capture the
> > exit status of the mpirun.
> >
> > Adam
>
> >
>
> > -----Original Message-----
> > From: Reuti [mailto:reuti at staff.uni-marburg.de]
> > Sent: Thursday, July 05, 2007 11:03 AM
> > To: Adam Bruss
> > Subject: Re: [GE users] determining when a mpirun has succeeded and
> > failed
> >
> > On 05.07.2007 at 15:58, Adam Bruss wrote:
> >> I'm running the LAM implementation of MPI with tight integration
> >> into SGE.
> > Okay, what are the options in:
> >
> >   "mpirun -D -c 1 dfem -type rf3p wgdblbnd.sup"
> >
> > i.e. what are -D, -c 1 and -type rf3p good for?
> >
> > Just specifying "mpirun C wgdblbnd.sup" (if this is your program)
> > should work, as the Tight Integration will create a universe for each
> > job on its own.
> >
> > -- Reuti
>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net





