[GE users] puzzling MPICH behaviour with GE 5.3
Carlo.Nardone at Sun.COM
Mon Aug 23 18:03:11 BST 2004
I have found a workaround: in my job script I rewrite $TMPDIR/machines
by brute force with sed, turning the short compute-0-* names into
compute-0-*.local hostnames for mpirun.
I wonder if there is a more elegant solution, though.
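
For the archive, a minimal sketch of that kind of sed rewrite. It assumes
the machines file is the one Grid Engine's MPICH parallel environment
writes under $TMPDIR, with one short host name per line and optionally
MPICH's ":n" processor count. The function name qualify_hosts and the
output file machines.local are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read a machinefile on stdin and append ".local"
# to short compute-0-N names. Names that already end in ".local" and
# any other lines pass through unchanged.
qualify_hosts() {
    sed 's/^\(compute-0-[0-9][0-9]*\)\(:[0-9][0-9]*\)\{0,1\}$/\1.local\2/'
}

# Inside a job script: write a rewritten copy next to the original.
# The guard keeps this a no-op when no machines file is present.
if [ -n "${TMPDIR:-}" ] && [ -f "$TMPDIR/machines" ]; then
    qualify_hosts < "$TMPDIR/machines" > "$TMPDIR/machines.local"
    # mpirun would then be pointed at the copy, for example:
    # mpirun -np "$NSLOTS" -machinefile "$TMPDIR/machines.local" ./my_program
fi
```

Writing to a copy rather than editing the file in place leaves the
original available for comparison if the job still misbehaves.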
Carlo Nardone wrote:
> thanks for your hints. The problem is not solved yet,
> but clearly host_aliases is not working well with MPICH + ssh.
> Actually I have done an experiment with mpirun alone
> (bypassing GE), when logging into the cluster frontend
> I find no discrepancies between machinefile and program
> output (there is a complete /etc/hosts with all nodes
> on the frontend).
> But when I launch mpirun from a compute node, where only a limited
> /etc/hosts exists (this is due to the NPACI Rocks approach),
> I find the same behaviour as when using qsub.
> I must say that I do have a host_aliases with
> compute-0-* compute-0-*.local
> lines, and also that Rocks implements a strict ssh infrastructure
> between all compute nodes. I configured mpich for that
> as described in the mpich 1.2.5 manuals.
> I think that ssh is the culprit in one way or the other,
> since when I tried to launch mpirun from a compute node
> using a machinefile with complete host names,
> the system prompted something like
> Permanently added 'compute-0-*.local' (RSA1) to the list of known hosts
> What do you think?
> Thanks again!
> Reuti wrote:
>> two things I would look into are:
>>> Process 1 of 3 on compute-0-3.local
>>> Process 2 of 3 on compute-0-3.local
>>> Process 0 of 3 on compute-0-1.local
>> if `hostname` gives compute-0-1.local, you should change one entry to
>> this full name, so that MPICH can remove it from the list during the
>> first scan of the machinefile (although this is at this point not the
>> reason for the strange distribution to the nodes). If you are not
>> using a host_aliases file at all, maybe you can adjust the /etc/hosts.
>>> Could not find enough machines for architecture LINUX
>> For SGE the machine is free and you got two slots - so the job got
>> started. What is in /home/mpich/share/machines.LINUX? The error you
>> got comes from MPICH I think. Is your mpirun the final script, or
>> does it link to something else?
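
The first point quoted above (whether the output of `hostname` appears
in the machinefile under exactly that name) can be checked with a small
helper. This is only a sketch: count_host_entries is a made-up name, and
it assumes one host per line, optionally followed by ":n".

```shell
#!/bin/sh
# Hypothetical helper: count the machinefile lines whose host part
# (the text before an optional ":n") is exactly the given name.
# $1 = machinefile path, $2 = host name to look for
count_host_entries() {
    awk -v h="$2" '{ split($1, a, ":"); if (a[1] == h) n++ } END { print n + 0 }' "$1"
}

# Example use on a node:
# count_host_entries "$TMPDIR/machines" "$(hostname)"
```

A count of 0 on the node that launches mpirun means the local host is
listed under a different name (or not at all), which is the mismatch
discussed in the quoted text.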
Carlo Nardone Sun Microsystems Italia SpA
Technical Systems Ambassador Client Services Organization
Grid and HPTC Specialist Practice Data Center - Platform Design
Tel. +39 06 36708 024 via G. Romagnosi, 4
Fax. +39 06 3221969 I-00196 Roma
Mob. +39 335 5828197 Italy
Email: carlo.nardone at sun.com
"From nothing to more than nothing."
(Brian Eno & Peter Schmidt, _Oblique Strategies_)