[GE users] SGE and MM5

Brian R Smith brian at cypher.acomp.usf.edu
Wed Apr 13 16:17:59 BST 2005


Martin,

We run MM5 over SGE on our cluster.  I'd be more than happy to share our
configuration parameters with you.

Our submit scripts are very simple, like so:

#!/bin/bash
#$ -V
#$ -N mm5-mpp
#$ -cwd
#$ -v MPIR_HOME=/usr/local/mpich-pgi/bin
#$ -pe mpich 12
#$ -j y

cat $TMPDIR/machines

$MPIR_HOME/mpirun -np $NSLOTS -machinefile $TMPDIR/machines -batch -jid \
    $JOB_ID mm5.mpp
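
A guard like the following could go just before the mpirun line to catch a
bad machine file early. It is only a sketch: check_machines is a hypothetical
helper name, not part of SGE or mpich, and it assumes that startmpi.sh writes
one line per granted slot to $TMPDIR/machines, which is worth confirming on
your installation.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the machine file exists and has
# exactly one non-empty line per requested slot.
check_machines() {
    local file="$1" slots="$2"
    if [ ! -r "$file" ]; then
        echo "machine file $file is missing or unreadable" >&2
        return 1
    fi
    local lines
    lines=$(grep -c . "$file")   # count non-empty lines
    if [ "$lines" -ne "$slots" ]; then
        echo "expected $slots entries in $file, found $lines" >&2
        return 1
    fi
}
```

In the submit script it would be called as
check_machines "$TMPDIR/machines" "$NSLOTS" || exit 1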

Our tight-integration (we use mpich) parallel environment looks like
this:

pe_name           mpich
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task TRUE
urgency_slots     min

As well, our queue, all.q, references this environment:

pe_list   make mpich
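
If you want to confirm that a queue really lists the PE, something like this
might help. It is a rough sketch: queue_has_pe is a hypothetical helper name,
and it assumes the pe_list entry fits on a single line of the text that
"qconf -sq <queue>" prints.

```shell
#!/bin/bash
# Hypothetical helper: read a queue definition on stdin and succeed if its
# pe_list line contains the PE name given as the first argument.
queue_has_pe() {
    local pe="$1"
    awk -v pe="$pe" '
        $1 == "pe_list" {
            for (i = 2; i <= NF; i++) {
                gsub(/,/, "", $i)        # tolerate comma-separated lists
                if ($i == pe) found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    '
}
```

On a live system it would be used as
qconf -sq all.q | queue_has_pe mpich || echo "mpich is not in pe_list"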

Also, what does the $SGE_ROOT/<cell_name>/spool/<node>/messages file
say, for the "master" node on those particular jobs?
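
To save some typing when looking for that file, a small helper along these
lines could build the path. This is a sketch under assumptions:
sge_messages_path is a hypothetical name, the cell is taken from $SGE_CELL
and falls back to "default", and the spool directory is assumed to live under
$SGE_ROOT (sites with local spooling keep it elsewhere).

```shell
#!/bin/bash
# Hypothetical helper: print the expected path of the execd messages file
# for the node name given as the first argument.
sge_messages_path() {
    local node="$1"
    if [ -z "$SGE_ROOT" ]; then
        echo "SGE_ROOT is not set" >&2
        return 1
    fi
    printf '%s/%s/spool/%s/messages\n' "$SGE_ROOT" "${SGE_CELL:-default}" "$node"
}
```

For example, tail -n 50 "$(sge_messages_path node01)" would show the latest
entries, where node01 stands in for the master node of the job.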



-Brian

On Tue, 2005-04-12 at 15:41 -0700, lukacm at pdx.edu wrote:
> Reuti,
> 
> Quoting Reuti <reuti at staff.uni-marburg.de>:
> 
> > Hi,
> >
> > To get a tight integration with SGE, did you set up a PE and request it
> > for the job? There is a sample MPI installation in the SGE distribution
> > and a Howto page available at sunsource.net. What was your submitted
> > script and qsub command?
> 
> I followed the manual, but no job can run with either the lamtight or the
> non-tight integration, and the error message is always:
> 
> Jobs cannot run because resources requested are not available for parallel job
> 
> However, all queues are empty and their respective load is 0.
> >
> > - Don't use -nolocal with SGE. You will get an uneven distribution.
> 
> I used this only because the person who installed the PG compiler set it up
> with no local settings, so MPI has to be told; however, I do not use it
> with qsub.
> 
> >
> > - You didn't specify a "-machinefile $TMPDIR/machines" for your mpirun, so
> > the nodes set up in "blabla/share/machines.LINUX" will be used, and not
> > the ones SGE selected for the parallel job.
> 
> this is the inside of the mm5_submit.sh:
> 
> #$ -V
> #$ -N mm5job
> #$ -o /home/submitter/mm5/sge-output.txt -j y
> #$ -pe mpi 4
> #$ -v MPIR_HOME=/opt/mpich/gnu/bin
> #$ -v MPICH_PROCESS_GROUP=no
> #$ -v CONV_RSH=ssh
> cd  /home/submitter/mm5
> #$ -cwd
> #$ -e ./
> #$ -o ./
> ###Remember only the home directory and /exports/visible are
> ###available throughout the cluster
> /opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile ./mmachine
> /home/visible/MM5/Run/mm5.mpp
> 
> 
> Depending on when I run the job, I can also get this error message:
> 
> 
> Jobs can not run because total slots of pe are not in range of job
> 
> Moreover, I managed to make the job run on two machines, but only on two.
> 
> Is this a problem with the SGE configuration or with MM5? Or does it
> require tight integration?
> 
> thank you
> 
> martin
> 
> 
> 
> >
> > CU - Reuti
> >
> >
> > Quoting lukacm at pdx.edu:
> >
> > > Hello list,
> > >
> > > I have the following problem. We installed the MM5 program and it runs
> > > using MPI. Thus if I do:
> > > /opt/mpich/gnu/bin/mpirun -nolocal -np 4 /home/visible/MM5/Run/mm5.mpp
> > > the program will start and run. However, when I submit it to SGE, the
> > > job stays in the waiting state and never goes to the run state. It
> > > stays there until it is removed.
> > > Is there a way I can debug it?
> > > Has anyone had the same problem?
> > >
> > > thank you
> > >
> > > martin
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > >
> >
> >
> >
> >
> >
> 
> 
> 





