[GE users] ge6 mpi job startup trouble

Jason Crane Jason.Crane at mrsc.ucsf.edu
Tue Nov 2 22:46:47 GMT 2004


Hi,

Strange, with the keep_pg flag the PI* file is as follows:

host1.domain.org 0 test_mpich
host2 1 test_mpich
host3 1 test_mpich
host4 1 test_mpich
host5 1 test_mpich
host6 1 test_mpich
host7 1 test_mpich
host2 1 test_mpich

I don't see any reference to nolocal being set to true in the mpirun*args files. And again, this same MPI installation worked under v5.3p5.
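
For reference, the mpirun call with the flag added was presumably just the job-script line quoted below with -keep_pg inserted, i.e. something like:

mpirun -np $NSLOTS -machinefile $TMPDIR/machines -keep_pg ./test_mpich

And for comparison -- this is only my expectation, based on the hostname / process-count / program layout of the PI file above and on the machine_file listing host1 twice -- I would have thought the preserved procgroup should come out as:

host1.domain.org 0 test_mpich
host1 1 test_mpich
host2 1 test_mpich
host3 1 test_mpich
host4 1 test_mpich
host5 1 test_mpich
host6 1 test_mpich
host7 1 test_mpich

i.e. host1 reused for rank 1 rather than host2 showing up twice.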


-Jason




>Hi,
>
>the strange thing is that the machine_file seems to be okay, and the
>rank 1 job should already go there (to host1 via rsh). Can you give
>the option -keep_pg to your mpirun command? This will preserve the
>PI..... file and you can have a look at it. host1 should be mentioned
>twice in it. There is no -nolocal in your mpirun.args or
>mpirun.ch_p4.args that would prevent any local process, is there? -
>Reuti
>
>> Hi,
>> 
>> I'm having trouble starting MPI (MPICH) jobs on nodes with multiple
>> slots under ge6.  I'm observing the grid attempting to qrsh from the
>> master MPI node to the machines in my hostfile, but if a machine has
>> multiple slots it isn't utilized, and the grid attempts to qrsh twice
>> to nodes that are already fully occupied by the mpi job.  For
>> example, if I try to start an MPICH job with 8 slots and the
>> following machine file consisting of 7 single-slot nodes and one
>> 2-slot node, the grid will qrsh from host1 (master) to host2(rank=1),
>> 3(2), 4(3), 5(4), 6(5), 7(6), then host2(7) again, at which point I
>> get a p4_error.  If the job runs only on single-slot nodes there
>> isn't a problem.  Also, I think this is only a problem when the mpi
>> job's master node has multiple slots.  I've tried switching
>> control_slaves, job_is_first_task, and the allocation_rule to no
>> avail.  What needs to be done to get both slots on host1 (or the
>> master job node) to be utilized?  I didn't have this problem with
>> MPICH jobs under v5.3p5.  Here are my job script, the grid-generated
>> machine_file, and the job output.
>> 
>> job_script:
>> #$ -pe pe_mpich 8
>> #$ -q all.q
>> #$ -l arch=sol-sparc64
>> #$ -cwd
>> mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./test_mpich
>> 
>> machine_file:
>> host1
>> host1
>> host2
>> host3
>> host4
>> host5
>> host6
>> host7
>> 
>> job_output:
>> qrsh -V -inherit -nostdin host2 test_mpich host1 63698 -p4amslave -p4yourname host2 -p4rmrank 1
>> qrsh -V -inherit -nostdin host3 test_mpich host1 63698 -p4amslave -p4yourname host3 -p4rmrank 2
>> qrsh -V -inherit -nostdin host4 test_mpich host1 63698 -p4amslave -p4yourname host3 -p4rmrank 3
>> qrsh -V -inherit -nostdin host5 test_mpich host1 63698 -p4amslave -p4yourname host4 -p4rmrank 4
>> qrsh -V -inherit -nostdin host6 test_mpich host1 63698 -p4amslave -p4yourname host5 -p4rmrank 5
>> qrsh -V -inherit -nostdin host7 test_mpich host1 63698 -p4amslave -p4yourname host6 -p4rmrank 6
>> qrsh -V -inherit -nostdin host2 test_mpich host1 63698 -p4amslave -p4yourname host2 -p4rmrank 7
>> 
>> p0_25930:  p4_error: Child process exited while making connection to remote process on host2: 0
>> 
>> 
>> Here is my parallel env conf:
>> pe_name           pe_mpich
>> slots             200
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /netopt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
>> stop_proc_args    /netopt/sge/mpi/stopmpi.sh
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>> 
>> 
>> Thank you -Jason
>> 
>> 
>> 
>> 
>
>
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



