[GE users] Re: mpich <-> sge --> controlling hosts machinefile

Reuti reuti at staff.uni-marburg.de
Thu Jul 5 12:15:22 BST 2007


On 05.07.2007 at 12:07, Gerolf Ziegenhain wrote:

> I was able to track it down even more to this:
>
> ssh lc12 cat /tmp/244224.1.q_mpich/machines
> lc12
> lc12
> lc19
> lc19
> lc14
> lc14
> lc13
> lc13
>
> The master node of the 8-processor job has a good-looking
> machinefile.
>
> From this, the currently running process creates the following
> machinefile:
> lc12.rhrk.uni-kl.de 0 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
> lc19 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
> lc19 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
> lc14 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
> lc14 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
> lc13 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
> lc13 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
> lc19 1 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel
>
> Where is the machinefile created? Is this done by the
> $SGE_ROOT/mpi/startmpi.sh -catch_rsh $pe_hostfile
> ?

Was this the same job? Earlier I saw three processes on node12 in your
output, and you stated node19. MPICH removes one line from the
machinefile, namely the one that matches the output of `hostname`. If
`hostname` returns something other than node12, this must be adjusted in
the startmpi.sh procedure so that one line can indeed be removed.
Otherwise the machinefile will not be scanned completely, or it will be
scanned more than once.
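
To illustrate the mismatch, here is a minimal Python sketch of the idea. It is only a simplified model of the "remove one line matching `hostname`" step, not MPICH's actual code; the helper names `remove_local_host` and `short_name` are made up for this example. The host names are taken from the machinefile quoted above.

```python
def remove_local_host(machines, local_hostname):
    """Drop the first machinefile entry equal to local_hostname.

    Returns (remaining_entries, removed_flag). Simplified model only:
    the comparison is an exact string match, which is the assumption
    that makes "lc12" and "lc12.rhrk.uni-kl.de" count as different hosts.
    """
    remaining = list(machines)
    try:
        remaining.remove(local_hostname)
    except ValueError:
        # No entry matched, so nothing was removed.
        return remaining, False
    return remaining, True


def short_name(host):
    """Strip the domain part, e.g. 'lc12.rhrk.uni-kl.de' -> 'lc12'."""
    return host.split(".", 1)[0]


# Machinefile as generated on the master node (short names).
machines = ["lc12", "lc12", "lc19", "lc19", "lc14", "lc14", "lc13", "lc13"]

# `hostname` returning the fully qualified name: no line matches.
rest_fqdn, removed_fqdn = remove_local_host(machines, "lc12.rhrk.uni-kl.de")
print(removed_fqdn, len(rest_fqdn))

# After reducing the name to its short form, exactly one line is removed.
rest_short, removed_short = remove_local_host(
    machines, short_name("lc12.rhrk.uni-kl.de")
)
print(removed_short, len(rest_short))
```

In this model the fully qualified name leaves all eight entries in place, while the short name removes exactly one `lc12` line. Making both sides use the same form of the name is the kind of adjustment meant above.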

Some details you may find here:

http://gridengine.sunsource.net/howto/mpich-integration.html

-- Reuti

