[GE users] puzzling MPICH behaviour with GE 5.3
Carlo.Nardone at Sun.COM
Mon Aug 23 17:13:31 BST 2004
Thanks for your hints. The problem is not solved yet,
but clearly host_aliases is not working well with MPICH + ssh.
I ran an experiment with mpirun alone (bypassing GE).
When I log into the cluster frontend and launch it there,
I find no discrepancies between the machinefile and the
program output (the frontend has a complete /etc/hosts
with all nodes).
But when I launch mpirun from a compute node, where only a
limited /etc/hosts exists (this is due to the NPACI Rocks
approach), I see the same behaviour as with qsub.
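
The frontend/compute-node comparison above can be repeated on any node
with a small check of which machinefile names the local resolver knows.
This is only a sketch: the function names are made up for illustration,
and it assumes the usual MPICH machinefile layout (one host per line,
optional ":n" processor count, "#" comments).

```python
import socket


def parse_machinefile(text):
    """Return the host names listed in MPICH-style machinefile text."""
    hosts = []
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # Keep only the host part of an optional "host:n" entry.
        hosts.append(line.split()[0].split(":", 1)[0])
    return hosts


def unresolved(hosts, resolve=socket.gethostbyname):
    """Return the hosts that the given resolver cannot look up."""
    bad = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            bad.append(host)
    return bad
```

Running unresolved(parse_machinefile(open(path).read())) on the frontend
and on a compute node should show whether the limited /etc/hosts is
missing any of the names in use.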
I must say that I do have a host_aliases with
lines, and also that Rocks implements a strict ssh infrastructure
between all compute nodes. I configured mpich for that
as described in the mpich 1.2.5 manuals.
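
To illustrate what host_aliases is meant to do, here is a rough sketch
of the mapping as I understand it: each line gives one unique host name
followed by its aliases, separated by whitespace. The function names and
the sample line in the usage note are hypothetical, not taken from my
cluster.

```python
def parse_host_aliases(text):
    """Map every name on a host_aliases line to the first (unique) name."""
    mapping = {}
    for raw in text.splitlines():
        names = raw.split()
        if not names:
            continue  # skip blank lines
        unique = names[0]
        for name in names:
            mapping[name] = unique
    return mapping


def canonical(name, mapping):
    """Return the unique name for a host, or the name itself if unknown."""
    return mapping.get(name, name)
```

With a hypothetical line "compute-0-1.local compute-0-1", both spellings
map to compute-0-1.local. MPICH and ssh do not read this file, so the
names they see still depend on /etc/hosts on the node where mpirun runs.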
I think that ssh is the culprit in one way or another,
since when I tried to launch mpirun from a compute node
using a machinefile with complete host names,
the system printed something like:
Permanently added 'compute-0-*.local' (RSA1) to the list of known hosts
What do you think?
> two things I look into are:
>> Process 1 of 3 on compute-0-3.local
>> Process 2 of 3 on compute-0-3.local
>> Process 0 of 3 on compute-0-1.local
> if `hostname` gives compute-0-1.local, you should change one entry to
> this full name, so that MPICH can remove it from the list during the
> first scan of the machinefile (although this is at this point not the
> reason for the strange distribution to the nodes). If you are not using
> a host_aliases file at all, maybe you can adjust the /etc/hosts.
>> Could not find enough machines for architecture LINUX
> For SGE the machine is free and you got two slots - so the job got
> started. What is in /home/mpich/share/machines.LINUX? The error you got
> comes from MPICH I think. Is your mpirun the final script, or does it
> link to something else?
"From nothing to more than nothing."