[GE users] Re: SGE issues with LAM and LSDyna

Joe Landman landman at scalableinformatics.com
Thu Jan 27 21:11:39 GMT 2005


Hi Anthony:

   Thanks for the note.  We were able to get it working by installing the
LSTC-supplied LAM-7.0.3 and its shared objects.  The default installed
version of LAM on this system is 7.1.1-1.  There have been other naming
snafus, so this would not surprise me.

Joe

On Thu, 27 Jan 2005, Anthony J. Ciani wrote:

> Hello,
> 
> This looks like it might be a name resolution issue.  It would be nice if 
> more verbose output had been included from the SGE-submitted job, but it 
> looks as though you are booting LAM on two nodes (compute-1-6 plus
> another) and then trying to execute the job from "compute-1-6.local", 
> which, according to LAM, may not be the same host as "compute-1-6".
> 
> You may want to look into what name is returned by gethostname(), what 
> aliases are returned by gethostbyaddr(), or (and probably simplest) look 
> into /etc/hosts to see what aliases may exist within the cluster.
> 
> On Wed, 26 Jan 2005, Joe Landman wrote:
> > Hi Folks:
> >
> >  We are running SGE 5.3p6 for lx24_amd64 with LSDyna.  We are attempting to use
> > the LAM compilation due to issues with MPICH.  We are having a problem that
> > seems to show up only under SGE.  When we run the batch script by hand, it works
> > just fine.
> >
> >  In our batch job, LAM boots up fine, and I can tping it.
> >
> > /opt/lam/gnu/bin/tping -c1 N
> >  1 byte from 1 remote node and 1 local node: 0.000 secs
> >
> >  but mpirun complains that it cannot see the other lamd processes (which tping found).
> >
> > /opt/lam/gnu/bin/tping -c1 N
> >  1 byte from 1 remote node and 1 local node: 0.000 secs
> >
> > 1 message, 1 byte (0.001K), 0.000 secs ( infK/sec)
> > roundtrip min/avg/max: 0.000/0.000/0.000
> > /opt/lam/gnu/bin/mpirun -np 4 /apps/lsdyna/mpp970_s_5434a_amd64_linux_lam703 --
> > i=four_cpu.key memory=250000000
> > -----------------------------------------------------------------------------
> >
> > It seems that there is no lamd running on the host compute-1-6.local.
> >
> > This is odd, as it works by hand:
> >
> > [landman at compute-1-6 ~/test51]$ lamboot -v machines
> >
> > LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
> >
> > n-1<22349> ssi:boot:base:linear: booting n0 (compute-1-6)
> > n-1<22349> ssi:boot:base:linear: booting n1 (compute-1-22)
> > n-1<22349> ssi:boot:base:linear: finished
> > [landman at compute-1-6 ~/test51]$ tping -c2 N
> >  1 byte from 1 remote node and 1 local node: 0.000 secs
> >  1 byte from 1 remote node and 1 local node: 0.000 secs
> >
> > 2 messages, 2 bytes (0.002K), 0.000 secs ( infK/sec)
> > roundtrip min/avg/max: 0.000/0.000/0.000
> > [landman at compute-1-6 ~/test51]$ /opt/lam/gnu/bin/mpirun -np 4
> > /apps/lsdyna/mpp970_s_5434a_amd64_linux_lam703 --  i=four_cpu.key  memory=250000000
> >      Date: 01/26/2005      Time: 10:13:01
> > Executing with local workstation license
> >
> >     ___________________________________________________
> >     |                                                 |
> >     |  Livermore  Software  Technology  Corporation   |
> >     |                                                 |
> >     |  7374 Las Positas Road                          |
> >     |  Livermore, CA 94551                            |
> > ...
> >
> > Any thoughts?  If we have to go back to MPICH, then we need a reliable way to
> > kill hung MPICH processes (we set up tight integration, but MPICH issues seem
> > to be interfering with clean termination of processes).
> >
> > Joe
> >
> > --
> > Joseph Landman, Ph.D
> > Scalable Informatics LLC,
> > email: landman at scalableinformatics.com
> > web  : http://scalableinformatics.com
> > phone: +1 734 612 4615
> >
> >
> 
> ------------------------------------------------------------
>                Anthony Ciani (aciani1 at uic.edu)
>             Computational Condensed Matter Physics
>     Department of Physics, University of Illinois, Chicago
>                http://ciani.phy.uic.edu/~tony
> ------------------------------------------------------------
> 
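The diagnostic Anthony suggests — comparing what gethostname() returns against the canonical name and aliases that a reverse lookup yields — can be sketched with Python's socket module (a minimal illustration, not part of either poster's setup; LAM's actual node-matching logic is assumed, not confirmed here):

```python
import socket

# What the local host believes its name is (roughly what gethostname() returns).
name = socket.gethostname()
print("gethostname():", name)

try:
    # Forward-resolve that name, then reverse-resolve the address to see the
    # canonical name and aliases.  A mismatch such as "compute-1-6" vs
    # "compute-1-6.local" is the kind of inconsistency that can make LAM
    # think no lamd is running on a node it actually booted.
    addr = socket.gethostbyname(name)
    canonical, aliases, addrs = socket.gethostbyaddr(addr)
    print("address   :", addr)
    print("canonical :", canonical)
    print("aliases   :", aliases)
except OSError as err:
    # A failure here is itself a symptom of broken name resolution.
    print("lookup failed:", err)
```

Running this on each node, and comparing the output with the entries in /etc/hosts and the names used in the LAM machines file, should show whether the ".local" suffix is being introduced somewhere along the way.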

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



