[GE users] puzzling MPICH behaviour with GE 5.3

Carlo Nardone Carlo.Nardone at Sun.COM
Mon Aug 23 17:13:31 BST 2004

Thanks for your hints. The problem is not solved yet,
but host_aliases is clearly not working well with MPICH + ssh.
I ran an experiment with mpirun alone (bypassing GE):
when I log into the cluster frontend and launch from there,
I find no discrepancies between the machinefile and the program
output (the frontend has a complete /etc/hosts listing all nodes).

But when I launch mpirun from a compute node, which has only a limited
/etc/hosts (this is due to the NPACI Rocks approach),
I see the same behaviour as with qsub.
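
To make the comparison between machinefile and program output concrete,
here is a minimal Python sketch. It is only an illustration: the function
names are hypothetical, it assumes a machinefile with one plain host name
per line, and it assumes the program prints lines of the form
"Process N of M on HOST" as in the output quoted below. Host names are
compared by their short form, so compute-0-1 and compute-0-1.local count
as the same node.

```python
import re
from collections import Counter


def short_name(host):
    """Strip the domain part, e.g. 'compute-0-1.local' -> 'compute-0-1'."""
    return host.strip().split(".", 1)[0]


def machinefile_counts(text):
    """Count requested slots per host (assumes one host name per line)."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments and blanks
        if line:
            hosts.append(short_name(line))
    return Counter(hosts)


def output_counts(text):
    """Count processes per host from 'Process N of M on HOST' lines."""
    pattern = re.compile(r"Process \d+ of \d+ on (\S+)")
    return Counter(short_name(m.group(1)) for m in pattern.finditer(text))


def placement_diff(machinefile_text, output_text):
    """Return {host: (requested, started)} for hosts where the two differ."""
    wanted = machinefile_counts(machinefile_text)
    actual = output_counts(output_text)
    return {
        host: (wanted.get(host, 0), actual.get(host, 0))
        for host in sorted(set(wanted) | set(actual))
        if wanted.get(host, 0) != actual.get(host, 0)
    }


if __name__ == "__main__":
    machinefile = "compute-0-1\ncompute-0-1\ncompute-0-3\n"
    output = (
        "Process 1 of 3 on compute-0-3.local\n"
        "Process 2 of 3 on compute-0-3.local\n"
        "Process 0 of 3 on compute-0-1.local\n"
    )
    print(placement_diff(machinefile, output))
```

An empty result means the processes landed where the machinefile asked;
any entry shows a host that got more or fewer processes than requested.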

I should add that I do have a host_aliases file with lines of the form
compute-0-*  compute-0-*.local
and that Rocks sets up a strict ssh infrastructure
between all compute nodes. I configured MPICH for ssh
as described in the MPICH 1.2.5 manuals.
I think that ssh is the culprit in one way or another,
since when I launched mpirun from a compute node
using a machinefile with complete host names,
the system printed something like
Permanently added 'compute-0-*.local' (RSA1) to the list of known hosts
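
For reference, a machinefile with complete host names can be derived
from one with short names along these lines. This is a minimal sketch,
not part of MPICH or GE: the function name is hypothetical, the ".local"
domain is taken from the host names above, and a trailing ":n" suffix
on an entry, if present, is simply passed through unchanged.

```python
def qualify_machinefile(text, domain="local"):
    """Append '.<domain>' to every host name that has no domain part."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            out.append(line)  # keep blank lines and comments as they are
            continue
        # Split off an optional ':n' suffix before looking at the name.
        host, sep, rest = stripped.partition(":")
        if "." not in host:
            host = f"{host}.{domain}"
        out.append(host + sep + rest)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(qualify_machinefile("compute-0-1\ncompute-0-1\ncompute-0-3\n"), end="")
```

Entries that already carry a domain are left alone, so running it twice
gives the same file.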

What do you think?
Thanks again!

Reuti wrote:
> Hi,
> two things I look into are:
>> compute-0-1
>> compute-0-1
>> compute-0-3
>> Process 1 of 3 on compute-0-3.local
>> Process 2 of 3 on compute-0-3.local
>> Process 0 of 3 on compute-0-1.local
> if `hostname` gives compute-0-1.local, you should change one entry to 
> this full name, so that MPICH can remove it from the list during the 
> first scan of the machinefile (although this is at this point not the 
> reason for the strange distribution to the nodes). If you are not using 
> a host_aliases file at all, maybe you can adjust the /etc/hosts.
>> Could not find enough machines for architecture LINUX 
> For SGE the machine is free and you got two slots - so the job got 
> started. What is in /home/mpich/share/machines.LINUX? The error you got 
> comes from MPICH I think. Is your mpirun the final script, or does it link 
> to something else?

"From nothing to more than nothing."
