[GE users] SGE issues with LAM and LSDyna

Joe Landman landman at scalableinformatics.com
Wed Jan 26 15:22:00 GMT 2005



Hi Folks:

  We are running SGE 5.3p6 for lx24_amd64 with LSDyna.  We are trying to use
the LAM build of LSDyna because of issues with MPICH.  We are having a problem
that seems to show up only under SGE: when we run the batch script by hand, it
works just fine.

  In our batch job, LAM boots up fine, and I can tping it:

/opt/lam/gnu/bin/tping -c1 N
  1 byte from 1 remote node and 1 local node: 0.000 secs

  But mpirun then complains that it cannot see the other lamd's (which tping
found). From the job output:

/opt/lam/gnu/bin/tping -c1 N
  1 byte from 1 remote node and 1 local node: 0.000 secs

1 message, 1 byte (0.001K), 0.000 secs ( infK/sec)
roundtrip min/avg/max: 0.000/0.000/0.000
/opt/lam/gnu/bin/mpirun -np 4 /apps/lsdyna/mpp970_s_5434a_amd64_linux_lam703 -- i=four_cpu.key memory=250000000
-----------------------------------------------------------------------------

It seems that there is no lamd running on the host compute-1-6.local.

This is odd, as it works by hand:

[landman@compute-1-6 ~/test51]$ lamboot -v machines

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<22349> ssi:boot:base:linear: booting n0 (compute-1-6)
n-1<22349> ssi:boot:base:linear: booting n1 (compute-1-22)
n-1<22349> ssi:boot:base:linear: finished
[landman@compute-1-6 ~/test51]$ tping -c2 N
  1 byte from 1 remote node and 1 local node: 0.000 secs
  1 byte from 1 remote node and 1 local node: 0.000 secs

2 messages, 2 bytes (0.002K), 0.000 secs ( infK/sec)
roundtrip min/avg/max: 0.000/0.000/0.000
[landman@compute-1-6 ~/test51]$ /opt/lam/gnu/bin/mpirun -np 4 /apps/lsdyna/mpp970_s_5434a_amd64_linux_lam703 --  i=four_cpu.key  memory=250000000
      Date: 01/26/2005      Time: 10:13:01
 Executing with local workstation license

     ___________________________________________________
     |                                                 |
     |  Livermore  Software  Technology  Corporation   |
     |                                                 |
     |  7374 Las Positas Road                          |
     |  Livermore, CA 94551                            |
...

Any thoughts?  If we have to go back to MPICH, then we need a reliable way to
kill hung MPICH processes (we set up tight integration, but it looks like MPICH
issues are preventing the processes from being killed cleanly).
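
For comparison, something like the following could be run both inside the SGE
job script and from the interactive shell, and the two outputs diffed.  This is
only a sketch: lam_env_report is a made-up helper name, and the choice of what
to print (TMPDIR, SGE's JOB_ID, any variable starting with LAM, and which
lamboot/mpirun/tping the PATH resolves to) is an assumption about what might
differ between the two environments, not a confirmed cause.

```shell
#!/bin/sh
# Hypothetical helper: print the parts of the environment that the LAM
# commands are likely to depend on, so that the output from an SGE job
# and from a login shell can be compared side by side.
lam_env_report() {
    echo "HOST=$(uname -n)"
    echo "TMPDIR=${TMPDIR:-}"
    echo "JOB_ID=${JOB_ID:-}"
    # Report which binary each LAM command resolves to on this PATH.
    for tool in lamboot mpirun tping; do
        resolved=$(command -v "$tool" 2>/dev/null) || resolved="not found"
        echo "$tool=$resolved"
    done
    # Any LAM-related variables (for example LAMHOME), if set.
    env | grep '^LAM' || true
}

lam_env_report
```

Saving the output to a file in each case (for example report.sge and
report.shell, both hypothetical names) and running diff on the two files would
show whether the job and the interactive session see the same LAM installation
and temporary directory.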

Joe

--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615

