[GE users] ge6 mpi job startup trouble

David S. dgs at gs.washington.edu
Tue Nov 2 21:56:14 GMT 2004

> > I'm having trouble starting mpi (MPICH) jobs on nodes with multiple slots 
> > under ge6.  I'm observing the grid attempting to qrsh from the master 
> > mpi_node to the machines in my hostfile, but if a machine has multiple 
> > slots it isn't utilized and the grid attempts to qrsh twice to nodes that 
> > are already fully occupied by the mpi job.  For example, if I try to 
> > start an MPICH job with 8 slots and the following machine file consisting 
> > of 7 single slot nodes, and one 2 slot node, the grid will qrsh from 
> > host1 (master) to host2 (rank=1), 3(2), 4(3), 5(4), 6(5), 7(6), then 
> > host2(7) again, at which point I get a p4_error.  If the job runs only on 
> > single slot nodes there isn't a problem.  Also, I think this is only a 
> > problem when the mpi job's master node has multiple slots.  I've tried 
> > switching control_slaves, job_is_first_task, and the allocation_rule to 
> > no avail.   What needs to be done to get both slots on host1 (or master 
> > job node) to be utilized?  I didn't have this problem with MPICH jobs 
> > under v5.3p5.  Here are my job-script, the grid generated machine_file, 
> > and the job output.

I had a similar problem with GE 6 and MPICH that I fixed by forcing the
MPICH PE to use fully-qualified domain names in the generated machine file:

	mesh:5% diff startmpi.sh startmpi.sh.orig
	<       #host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
	<       host=`echo $line|cut -f1 -d" "`
	>       host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
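
As a rough illustration of what that change does, here is a small sketch run
against a single made-up hostfile line. The host name, queue name, and the
four-column layout are assumptions for the example, not taken from the job
above; the two `cut` pipelines are the ones in the diff.

```shell
# Hypothetical PE hostfile line: host, slots, queue, processor range.
line="host1.example.com 2 all.q@host1.example.com UNDEFINED"

# Original startmpi.sh: take the first field, then keep only the part
# before the first dot, which yields the short host name.
short_host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`

# Patched startmpi.sh: take the first field unchanged, so the
# fully-qualified name ends up in the machine file.
fqdn_host=`echo $line | cut -f1 -d" "`

echo "short: $short_host"
echo "fqdn:  $fqdn_host"
```

With the sample line this prints `short: host1` and `fqdn:  host1.example.com`.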

David S.
