[GE users] ge6 mpi job startup trouble

Jason Crane Jason.Crane at mrsc.ucsf.edu
Tue Nov 2 21:01:54 GMT 2004


Hi,

I'm having trouble starting MPI (MPICH) jobs on nodes with multiple slots 
under ge6.  I see the grid qrsh from the master MPI node to the machines 
in my hostfile, but if a machine has multiple slots its extra slot isn't 
used, and the grid instead qrsh's a second time to a node that is already 
fully occupied by the MPI job.  For example, if I start an MPICH job with 
8 slots and the following machine file, consisting of 6 single-slot nodes 
and one 2-slot node, the grid will qrsh from host1 (master) to host2 
(rank=1), host3 (2), host4 (3), host5 (4), host6 (5), host7 (6), then 
host2 (7) again, at which point I get a p4_error.  If the job runs only 
on single-slot nodes there isn't a problem.  Also, I think this is only a 
problem when the MPI job's master node has multiple slots.  I've tried 
switching control_slaves, job_is_first_task, and the allocation_rule, to 
no avail.  What needs to be done to get both slots on host1 (or whichever 
node is the job's master) to be utilized?  I didn't have this problem 
with MPICH jobs under v5.3p5.  Here are my job script, the grid-generated 
machine file, and the job output.

job_script:
#$ -pe pe_mpich 8
#$ -q all.q
#$ -l arch=sol-sparc64
#$ -cwd
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./test_mpich

machine_file:
host1
host1
host2
host3
host4
host5
host6
host7
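
A quick way to confirm how many slots each host contributes to the machine 
file is to count repeated host names.  This is only a minimal sketch: 
count_slots is a hypothetical helper name, and it assumes the file has one 
host name per line, as in the listing above.

```shell
# Count how many slots each host contributes to a machine file.
# count_slots is a hypothetical helper, not part of Grid Engine or MPICH.
# Assumes one host name per line; a repeated line means an extra slot.
count_slots() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job script this could be called as:
#   count_slots "$TMPDIR/machines"
```

For the machine file above, host1 should be reported with 2 slots and each 
of the other hosts with 1.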

job_output:
qrsh -V -inherit -nostdin host2 test_mpich host1 63698 -p4amslave -p4yourname host2 -p4rmrank 1
qrsh -V -inherit -nostdin host3 test_mpich host1 63698 -p4amslave -p4yourname host3 -p4rmrank 2
qrsh -V -inherit -nostdin host4 test_mpich host1 63698 -p4amslave -p4yourname host3 -p4rmrank 3
qrsh -V -inherit -nostdin host5 test_mpich host1 63698 -p4amslave -p4yourname host4 -p4rmrank 4
qrsh -V -inherit -nostdin host6 test_mpich host1 63698 -p4amslave -p4yourname host5 -p4rmrank 5
qrsh -V -inherit -nostdin host7 test_mpich host1 63698 -p4amslave -p4yourname host6 -p4rmrank 6
qrsh -V -inherit -nostdin host2 test_mpich host1 63698 -p4amslave -p4yourname host2 -p4rmrank 7

p0_25930:  p4_error: Child process exited while making connection to remote process on host2: 0
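
To see at a glance which hosts were targeted more than once, the qrsh lines 
in the job output can be tallied by target host.  This is a rough sketch: 
count_qrsh_targets is a hypothetical helper, and it assumes the log format 
shown above, where the target host is the fifth whitespace-separated field 
of each line that starts with qrsh.

```shell
# Tally the target host of each logged qrsh launch.
# count_qrsh_targets is a hypothetical helper; it relies on the log layout
# "qrsh -V -inherit -nostdin <host> ..." seen in the job output above.
count_qrsh_targets() {
    awk '$1 == "qrsh" { n[$5]++ } END { for (h in n) print h, n[h] }' "$1" | sort
}

# Example call, assuming the job output was saved to a file named job.out:
#   count_qrsh_targets job.out
```

With the output above, host2 would be counted twice, and host1 would not 
appear at all, since no qrsh is issued to it.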


Here is my parallel environment configuration:
pe_name           pe_mpich
slots             200
user_lists        NONE
xuser_lists       NONE
start_proc_args   /netopt/sge/mpi/startmpi.sh -catch_rsh  $pe_hostfile
stop_proc_args    /netopt/sge/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
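
For reference, the machine file comes from the $pe_hostfile passed in 
start_proc_args.  The sketch below illustrates that kind of expansion.  It 
is not the actual startmpi.sh code: expand_pe_hostfile is a hypothetical 
helper, and it assumes each pe_hostfile line has the form "host nslots 
queue processor-range".

```shell
# Expand pe_hostfile-style lines into a machine file, one line per slot.
# expand_pe_hostfile is a hypothetical helper, a sketch only.
# Assumed input format per line: host nslots queue processor-range
expand_pe_hostfile() {
    while read -r host nslots _rest; do
        # Skip blank lines.
        [ -n "$host" ] || continue
        i=0
        while [ "$i" -lt "$nslots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done < "$1"
}
```

Under that assumption, a line such as "host1 2 all.q@host1 UNDEFINED" 
would yield two host1 entries, which matches the machine file shown 
earlier.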


Thank you -Jason


