[GE users] puzzling MPICH behaviour with GE 5.3

Andreas Haas Andreas.Haas at Sun.COM
Mon Aug 23 18:14:26 BST 2004


This is the workaround I usually recommend. The transformation
you describe can be made in $SGE_ROOT/mpi/startmpi.sh, which is
run prior to the actual job. Watch out for the comment

   # add here code to map ..

within that script.
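
Concretely, the kind of mapping Carlo describes could be dropped in
at that comment along these lines. This is only a sketch: the
$TMPDIR/machines path and the blanket ".local" suffix are taken from
this thread, not from the stock script, and a production version
should skip hosts that already carry a domain or per-host slot counts.

```shell
#!/bin/sh
# Sketch of the "add here code to map .." step in startmpi.sh.
# Assumption from this thread: SGE writes the PE hosts file to
# $TMPDIR/machines, and the Rocks nodes report `hostname` as
# compute-0-*.local while the machines file holds short names.
machines="${TMPDIR:-/tmp}/machines"

# Sample machines file as SGE might generate it (short hostnames):
cat > "$machines" <<'EOF'
compute-0-1
compute-0-1
compute-0-3
EOF

# Append .local to every line so mpirun/ssh resolve exactly the
# names that `hostname` reports on the compute nodes.
sed -e 's/$/.local/' "$machines" > "$machines.mapped"
mv "$machines.mapped" "$machines"

cat "$machines"
# -> compute-0-1.local
#    compute-0-1.local
#    compute-0-3.local
```

In the real script the here-document would of course be omitted; the
sed line alone rewrites the file SGE has already generated.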

Cheers,
Andreas

On Mon, 23 Aug 2004, Carlo Nardone wrote:

> Hi all,
> I have found a workaround: rewrite $TMPDIR/machines by brute
> force in my job script, using sed to create compute-0-*.local
> hostnames for mpirun.
> I wonder if there is a more elegant solution, though.
>
> Carlo Nardone wrote:
> > Ciao,
> > thanks for your hints. The problem is not solved yet,
> > but clearly host_aliases is not working well with MPICH + ssh.
> > Actually I have done an experiment with mpirun alone
> > (bypassing GE): when logged into the cluster frontend
> > I find no discrepancies between the machinefile and the
> > program output (the frontend has a complete /etc/hosts
> > with all nodes).
> >
> > But when I launch mpirun from a compute node, which has only
> > a limited /etc/hosts (this is due to the NPACI Rocks approach),
> > I see the same behaviour as with qsub.
> >
> > I must say that I do have a host_aliases with
> > compute-0-*  compute-0-*.local
> > lines, and also that Rocks implements a strict ssh infrastructure
> > between all compute nodes. I configured mpich for that
> > as described in the mpich 1.2.5 manuals.
> > I think that ssh is the culprit one way or the other,
> > since when I tried to launch mpirun from a compute node
> > using a machinefile with fully qualified host names,
> > the system printed something like
> > Permanently added 'compute-0-*.local' (RSA1) to the list of known hosts
> >
> > What do you think?
> > Thanks again!
> >
> > Reuti wrote:
> >
> >> Hi,
> >>
> >> two things I look into are:
> >>
> >>> compute-0-1
> >>> compute-0-1
> >>> compute-0-3
> >>> Process 1 of 3 on compute-0-3.local
> >>> Process 2 of 3 on compute-0-3.local
> >>> Process 0 of 3 on compute-0-1.local
> >>
> >>
> >>
> >> if `hostname` gives compute-0-1.local, you should change one entry to
> >> this full name, so that MPICH can remove it from the list during the
> >> first scan of the machinefile (although at this point this is not the
> >> reason for the strange distribution across the nodes). If you are not
> >> using a host_aliases file at all, maybe you can adjust /etc/hosts.
> >>
> >>> Could not find enough machines for architecture LINUX
> >>
> >>
> >>
> >> For SGE the machine is free and you got two slots - so the job got
> >> started. What is in /home/mpich/share/machines.LINUX? The error you
> >> got comes from MPICH, I think. Is your mpirun the final script, or
> >> does it link to something else?
> >>
> >
> >
>
>
> --
> ========================================================================
> Carlo Nardone                   Sun Microsystems Italia SpA
> Technical Systems Ambassador    Client Services Organization
> Grid and HPTC Specialist        Practice Data Center - Platform Design
>
> Tel. +39 06 36708 024           via G. Romagnosi, 4
> Fax. +39 06 3221969             I-00196 Roma
> Mob. +39 335 5828197            Italy
> Email: carlo.nardone at sun.com
> ========================================================================
> "From nothing to more than nothing."
> (Brian Eno & Peter Schmidt, _Oblique Strategies_)
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
