[GE users] ge6 mpi job startup trouble

Reuti reuti at staff.uni-marburg.de
Tue Nov 2 21:46:29 GMT 2004



Hi,

The strange thing is that the machine_file seems to be okay, and the rank 1
process should already go there (to host1 via rsh). Can you add the option
-keep_pg to your mpirun command? This will preserve the PI..... file so you can
have a look at it - host1 should be mentioned in it twice. Is there perhaps a
-nolocal in your mpirun.args or mpirun.ch_p4.args which would prevent any
local process?
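
For example, something like this (just a sketch - adjust the MPICH
installation path, assumed here as $MPICH_HOME, to your setup):

# run once with -keep_pg so the ch_p4 procgroup file (PIxxxxx) survives the job
mpirun -np $NSLOTS -machinefile $TMPDIR/machines -keep_pg ./test_mpich

# afterwards inspect the preserved procgroup file in the working directory;
# host1 should show up twice (the local master plus one slave entry)
cat PI*

# and check whether a -nolocal is hard-wired into the mpirun helper scripts
grep -n nolocal $MPICH_HOME/bin/mpirun.args $MPICH_HOME/bin/mpirun.ch_p4.args
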
- Reuti

> Hi,
> 
> I'm having trouble starting MPI (MPICH) jobs on nodes with multiple slots
> under ge6.  I can see the grid qrsh-ing from the master MPI node to the
> machines in my hostfile, but if a machine has multiple slots the extra
> slots aren't used; instead the grid attempts to qrsh a second time to
> nodes that are already fully occupied by the MPI job.  For example, if I
> start an MPICH job with 8 slots and the machine file below, consisting of
> seven single-slot nodes and one two-slot node, the grid will qrsh from
> host1 (master) to host2 (rank 1), host3 (2), host4 (3), host5 (4),
> host6 (5), host7 (6), and then host2 (7) again, at which point I get a
> p4_error.  If the job runs only on single-slot nodes there isn't a
> problem.  Also, I think this only happens when the MPI job's master node
> has multiple slots.  I've tried switching control_slaves,
> job_is_first_task, and the allocation_rule, to no avail.  What needs to
> be done to get both slots on host1 (or the master job node) utilized?
> I didn't have this problem with MPICH jobs under v5.3p5.  Here are my
> job script, the grid-generated machine_file, and the job output.
> 
> job_script:
> #$ -pe pe_mpich 8
> #$ -q all.q
> #$ -l arch=sol-sparc64
> #$ -cwd
> mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./test_mpich
> 
> machine_file:
> host1
> host1
> host2
> host3
> host4
> host5
> host6
> host7
> 
> job_output:
> qrsh -V -inherit -nostdin host2 test_mpich host1 63698 -p4amslave 
> -p4yourname host2 -p4rmrank 1
> qrsh -V -inherit -nostdin host3 test_mpich host1 63698 -p4amslave 
> -p4yourname host3 -p4rmrank 2
> qrsh -V -inherit -nostdin host4 test_mpich host1 63698 -p4amslave 
> -p4yourname host3 -p4rmrank 3
> qrsh -V -inherit -nostdin host5 test_mpich host1 63698 -p4amslave 
> -p4yourname host4 -p4rmrank 4
> qrsh -V -inherit -nostdin host6 test_mpich host1 63698 -p4amslave 
> -p4yourname host5 -p4rmrank 5
> qrsh -V -inherit -nostdin host7 test_mpich host1 63698 -p4amslave 
> -p4yourname host6 -p4rmrank 6
> qrsh -V -inherit -nostdin host2 test_mpich host1 63698 -p4amslave 
> -p4yourname host2 -p4rmrank 7
> 
> p0_25930:  p4_error: Child process exited while making connection to 
> remote process on host2: 0
> 
> 
> Here is my parallel env conf:
> pe_name           pe_mpich
> slots             200
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /netopt/sge/mpi/startmpi.sh -catch_rsh  $pe_hostfile
> stop_proc_args    /netopt/sge/mpi/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
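
For reference, the machine_file above is what start_proc_args builds from
$pe_hostfile. A simplified sketch of that expansion - assuming the usual SGE 6
pe_hostfile format of one "hostname slots queue processor-range" line per
granted host, and leaving aside the rsh wrapper that -catch_rsh installs -
could look like this:

# hypothetical pe_hostfile for the 8-slot job above:
#   host1 2 all.q@host1 UNDEFINED
#   host2 1 all.q@host2 UNDEFINED
#   ...
# write each host into the MPICH machine file once per granted slot
while read host nslots queue rest; do
    i=1
    while [ $i -le $nslots ]; do
        echo $host
        i=`expr $i + 1`
    done
done < $pe_hostfile > $TMPDIR/machines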
> 
> 
> Thank you -Jason
> 
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



