[GE users] Machine file and dedicated IB network
igardais at yahoo.fr
Fri May 21 08:47:52 BST 2010
I solved the DAT_INSUFFICIENT_RESOURCE error by setting [SH]_MEMORYLOCKED to "infinity".
We ran our tests with IntelMPI 4.0, and this version seems to be "integrated" with SGE: whatever the machinefile contains (ethernet or IB hostnames), it works.
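For anyone hitting the same DAT_INSUFFICIENT_RESOURCE error, a quick check from inside a job script can show whether the locked-memory limit really reached the job. This is only a sketch: the helper name `memlock_ok` and the 65536 kB threshold are illustrative assumptions, not values from SGE or Intel MPI.

```shell
#!/bin/bash
# Sketch: verify the locked-memory limit seen by the job.
# memlock_ok and the 65536 kB threshold are hypothetical, for illustration only.
memlock_ok() {
    # $1: value as printed by `ulimit -l` (kB, or "unlimited")
    # $2: minimum acceptable value in kB
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ] 2>/dev/null
}

limit=$(ulimit -l)
if memlock_ok "$limit" 65536; then
    echo "memlock limit OK: $limit"
else
    echo "memlock limit too low: $limit" >&2
fi
```

If the limit is still small inside the job while it is "unlimited" in an interactive shell, the setting has probably not been picked up by the execution daemon on that node.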
Thanks for your help everybody,
From: reuti <reuti at staff.uni-marburg.de>
To: users at gridengine.sunsource.net
Sent: Wed 19 May 2010, 16:08:11
Subject: Re: Re : [GE users] Machine file and dedicated IB network
On 19.05.2010 at 15:56, igardais wrote:
> Hi Reuti,
> Thanks for the answer.
> I did that but it did not help.
> I'm still having a (non-SGE-related) issue with DAPL.
> Here is one of the lines:
>  MPI startup(): DAPL provider <NULL on rank 0:mania-3.beicip.fr differs from <NULL string> on rank 1:mania-3.beicip.fr
Is it MPICH1-based? Then the back channel might still be wrong (http://gridengine.sunsource.net/howto/mpich-integration.html) and you need to either:
$ export MPI_HOST=`hostname | sed 's/[.]beicip[.]fr$/-ib.cluster/'`
or adjust the "$SGE_ROOT/mpi/hostname" script so that it also reports something like 'mania-X-ib.cluster' for a `hostname` command (the correctly adjusted name for each machine, of course) and activate it by adding the "-hostname" keyword to start_proc_args.
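As a rough illustration of the second option, the mapping such a wrapper has to perform could look like the sketch below. The function name `eth_to_ib` is hypothetical, and the domain names are simply the ones quoted in this thread; the real "$SGE_ROOT/mpi/hostname" script may be structured differently.

```shell
#!/bin/sh
# Sketch of a hostname wrapper: print the IB name instead of the ethernet name.
# eth_to_ib is a hypothetical helper; the domains come from this thread.
eth_to_ib() {
    # mania-3.beicip.fr -> mania-3-ib.cluster; any other name passes through unchanged
    printf '%s\n' "$1" | sed 's/[.]beicip[.]fr$/-ib.cluster/'
}

eth_to_ib "$(hostname)"
```

Note that a node reporting only its short name (without the domain) would pass through unmapped, so the pattern has to match whatever `hostname` really returns on the nodes.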
> It refers to 'mania-X.beicip.fr' which is the hostname of the ethernet interface.
> The IB interfaces are named 'mania-X-ib.cluster'.
> The machine file correctly reports hostname 'mania-X-ib.cluster' and I've forced the I_MPI_DEVICE to "rdma:ib0" to use the ib0 interface.
> I've read that RHEL 5.4 has a buggy DAPL layer. I'll try the MLNX-OFED-1.5.1 release for RHEL 5.4.
> Hope this will help.
> From: reuti <reuti at staff.uni-marburg.de>
> To: users at gridengine.sunsource.net
> Sent: Wed 19 May 2010, 13:27:17
> Subject: Re: [GE users] Machine file and dedicated IB network
> On 19.05.2010 at 12:51, igardais wrote:
> > Until now, we ran SGE only over Ethernet.
> > All was fine because the same single network was used for SGE and data.
> > We received a new cluster with InfiniBand and set it up as follows:
> > - ethernet network (172.31.0.0/16) for data and SGE
> > - IB network for MPI
> > IPoIB is set up on a dedicated, non-routed network (192.168.0.0/24).
> > The machine file generated by the PE contains "ethernet name" of the machines.
> > This machine file is used by MPI (IntelMPI) to start the MPI ring.
> > Shouldn't the hostnames in the machine file be the "infiniband name" (host-ib)?
> > I've read the "multiple interface howto" but I am not sure it applies to this case.
> As SGE is still running across the ethernet, you can just map the hostnames as outlined in $SGE_ROOT/mpi/startmpi.sh, where they are mapped to ATM hostnames. Depending on your applications, it may be necessary to supply -hostname to the startmpi.sh script, so that hostnames are mapped in general (also from the application's point of view).
> -- Reuti
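The mapping described in the reply above could be sketched as follows: take the PE hostfile (one "host slots queue processors" line per node) and write a machine file with one IB hostname per slot. The function name `pe_to_ib_machinefile` and the output file name `machines` are assumptions for illustration; the shipped startmpi.sh does its own parsing and may differ.

```shell
#!/bin/sh
# Sketch: convert PE hostfile lines into a machine file of IB hostnames.
# pe_to_ib_machinefile is a hypothetical helper; reads stdin, writes stdout.
pe_to_ib_machinefile() {
    awk '{
        host = $1
        sub(/\.beicip\.fr$/, "-ib.cluster", host)   # ethernet name -> IB name
        for (i = 0; i < $2; i++) print host          # one line per granted slot
    }'
}

# Inside a parallel job SGE sets PE_HOSTFILE; outside a job this step is skipped.
if [ -n "$PE_HOSTFILE" ] && [ -r "$PE_HOSTFILE" ]; then
    pe_to_ib_machinefile < "$PE_HOSTFILE" > "${TMPDIR:-/tmp}/machines"
fi
```

Because SGE itself keeps using the ethernet names, only the file handed to MPI is rewritten; the PE hostfile is left untouched.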
> > Any advice is welcome,
> > Thanks and regards,
> > Ionel
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].