[GE users] Re: SGE issues with LAM and LSDyna

Anthony J. Ciani aciani1 at uic.edu
Thu Jan 27 20:44:05 GMT 2005


Hello,

This looks like it might be a name resolution issue.  It would have helped 
to see more verbose output from the SGE-submitted job, but it looks as 
though you are booting LAM on two nodes (compute-1-6 plus another), and 
then trying to execute the job from "compute-1-6.local", which LAM may 
not consider to be the same host as "compute-1-6".

You may want to check what name is returned by gethostname(), what 
aliases are returned by gethostbyaddr(), or (probably simplest) look in 
/etc/hosts to see what aliases exist within the cluster.

On Wed, 26 Jan 2005, Joe Landman wrote:
> Hi Folks:
>
>  We are running SGE 5.3p6 for lx24_amd64 with LSDyna.  We are attempting to use
> the LAM compilation due to issues with MPICH.  We are having a problem that
> seems to show up only under SGE.  When we run the batch script by hand, it works
> just fine.
>
>  In our batch job, LAM boots up fine, and I can tping it.
>
> /opt/lam/gnu/bin/tping -c1 N
>  1 byte from 1 remote node and 1 local node: 0.000 secs
>
>  but, the mpirun complains that it cannot see the other lamd's (which tping found).
>
> /opt/lam/gnu/bin/tping -c1 N
>  1 byte from 1 remote node and 1 local node: 0.000 secs
>
> 1 message, 1 byte (0.001K), 0.000 secs ( infK/sec)
> roundtrip min/avg/max: 0.000/0.000/0.000
> /opt/lam/gnu/bin/mpirun -np 4 /apps/lsdyna/mpp970_s_5434a_amd64_linux_lam703 --
> i=four_cpu.key memory=250000000
> -----------------------------------------------------------------------------
>
> It seems that there is no lamd running on the host compute-1-6.local.
>
> This is odd, as it works by hand:
>
> [landman at compute-1-6 ~/test51]$ lamboot -v machines
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<22349> ssi:boot:base:linear: booting n0 (compute-1-6)
> n-1<22349> ssi:boot:base:linear: booting n1 (compute-1-22)
> n-1<22349> ssi:boot:base:linear: finished
> [landman at compute-1-6 ~/test51]$ tping -c2 N
>  1 byte from 1 remote node and 1 local node: 0.000 secs
>  1 byte from 1 remote node and 1 local node: 0.000 secs
>
> 2 messages, 2 bytes (0.002K), 0.000 secs ( infK/sec)
> roundtrip min/avg/max: 0.000/0.000/0.000
> [landman at compute-1-6 ~/test51]$ /opt/lam/gnu/bin/mpirun -np 4
> /apps/lsdyna/mpp970_s_5434a_amd64_linux_lam703 --  i=four_cpu.key  memory=250000000
>      Date: 01/26/2005      Time: 10:13:01
> Executing with local workstation license
>
>     ___________________________________________________
>     |                                                 |
>     |  Livermore  Software  Technology  Corporation   |
>     |                                                 |
>     |  7374 Las Positas Road                          |
>     |  Livermore, CA 94551                            |
> ...
>
> Any thoughts?  If we have to go back to MPICH, then we need a reliable way to
> kill hung MPICH processes (we set up tight integration, but it looks like MPICH
> issues are messing up clean kill of processes).
>
> Joe
>
> --
> Joseph Landman, Ph.D
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://scalableinformatics.com
> phone: +1 734 612 4615
>
>

------------------------------------------------------------
               Anthony Ciani (aciani1 at uic.edu)
            Computational Condensed Matter Physics
    Department of Physics, University of Illinois, Chicago
               http://ciani.phy.uic.edu/~tony
------------------------------------------------------------
