[GE users] puzzling MPICH behaviour with GE 5.3

Carlo Nardone carlo.nardone at sun.com
Fri Aug 27 11:09:10 BST 2004


Hi Andreas (and Reuti),
this workaround was successful in my case.
After spotting a trivial mistake
in some of the submission scripts, everything now
works fine.
Many thanks!
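
For the archive, the hostname mapping discussed in the quoted messages below can be sketched roughly as follows. This is only a minimal sketch, not the exact script I used: the function name qualify_machines is hypothetical, and it assumes the machines file lists one hostname per line, with the short names having the form compute-N-M.

```shell
#!/bin/sh
# Hypothetical helper (name and layout are assumptions, not part of GE):
# append ".local" to bare compute-N-M hostnames in a machines file.
# Names that already end in ".local" do not match the pattern and are
# copied through unchanged.
qualify_machines() {
    # $1: input machines file, $2: output machines file
    sed -e 's/^\(compute-[0-9][0-9]*-[0-9][0-9]*\)$/\1.local/' "$1" > "$2"
}
```

In a job script this might be called as qualify_machines "$TMPDIR/machines" "$TMPDIR/machines.local", with the new file then passed to mpirun via -machinefile. The same substitution could presumably be placed in startmpi.sh instead, as Andreas suggests.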

Andreas Haas wrote:

> This is the workaround I usually recommend. The transformation
> you describe can be made in $SGE_ROOT/mpi/startmpi.sh, which is
> run prior to the actual job. Look for the comment
> 
>    # add here code to map ..
> 
> within that script.
> 
> Cheers,
> Andreas
> 
> On Mon, 23 Aug 2004, Carlo Nardone wrote:
> 
> 
>>Hi all,
>>I have found a workaround: in my script I modify $TMPDIR/machines
>>by brute force with sed, creating compute-0-*.local
>>hostnames for mpirun.
>>I wonder if there is a more elegant solution, though.
>>
>>Carlo Nardone wrote:
>>
>>>Ciao,
>>>thanks for your hints. The problem is not solved yet,
>>>but clearly host_aliases is not working well with MPICH + ssh.
>>>Actually I have done an experiment with mpirun alone
>>>(bypassing GE): when logged into the cluster frontend,
>>>I find no discrepancies between the machinefile and the program
>>>output (there is a complete /etc/hosts with all nodes
>>>on the frontend).
>>>
>>>But when I launch mpirun from a compute node, where only a limited
>>>/etc/hosts exists (this is due to the NPACI Rocks approach),
>>>I find the same behaviour as when using qsub.
>>>
>>>I must say that I do have a host_aliases with
>>>compute-0-*  compute-0-*.local
>>>lines, and also that Rocks implements a strict ssh infrastructure
>>>between all compute nodes. I configured mpich for that
>>>as described in the mpich 1.2.5 manuals.
>>>I think that ssh is the culprit in one way or the other,
>>>since when I tried to launch mpirun from a compute node
>>>using a machinefile with complete host names,
>>>the system prompted something like
>>>Permanently added 'compute-0-*.local' (RSA1) to the list of known hosts
>>>
>>>What do you think?
>>>Thanks again!
>>>
>>>Reuti wrote:
>>>
>>>
>>>>Hi,
>>>>
>>>>two things I would look into are:
>>>>
>>>>
>>>>>compute-0-1
>>>>>compute-0-1
>>>>>compute-0-3
>>>>>Process 1 of 3 on compute-0-3.local
>>>>>Process 2 of 3 on compute-0-3.local
>>>>>Process 0 of 3 on compute-0-1.local
>>>>
>>>>
>>>>
>>>>if `hostname` gives compute-0-1.local, you should change one entry to
>>>>this full name, so that MPICH can remove it from the list during the
>>>>first scan of the machinefile (although this is at this point not the
>>>>reason for the strange distribution to the nodes). If you are not
>>>>using a host_aliases file at all, maybe you can adjust the /etc/hosts.
>>>>
>>>>
>>>>>Could not find enough machines for architecture LINUX
>>>>
>>>>
>>>>
>>>>For SGE the machine is free and you got two slots - so the job got
>>>>started. What is in /home/mpich/share/machines.LINUX? The error you
>>>>got comes from MPICH, I think. Is your mpirun the final script, or
>>>>does it link to something else?
>>>>
>>>
>>>
>>
>>--
>>========================================================================
>>Carlo Nardone                   Sun Microsystems Italia SpA
>>Technical Systems Ambassador    Client Services Organization
>>Grid and HPTC Specialist        Practice Data Center - Platform Design
>>
>>Tel. +39 06 36708 024           via G. Romagnosi, 4
>>Fax. +39 06 3221969             I-00196 Roma
>>Mob. +39 335 5828197            Italy
>>Email: carlo.nardone at sun.com
>>========================================================================
>>"From nothing to more than nothing."
>>(Brian Eno & Peter Schmidt, _Oblique Strategies_)
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>

-- 
Carlo Nardone
Sun Microsystems Italia Spa
carlo.nardone at sun.com
t: +39 06 36708 024
f: +39 06 3236860
m: +39 335 5828197
"From nothing to more than nothing"
(Brian Eno, Peter Schmidt, _Oblique Strategies_)

