[GE users] lam_loose_rsh isn't lamhalt-ing

Mike Hanby mhanby at uab.edu
Fri Mar 9 17:31:18 GMT 2007


Howdy, I'm using Grid Engine 6.0u4 on a Rocks 4.0.0 cluster, 128 nodes.

 

I set up the parallel environment for lam_loose_rsh using the
instructions at:

http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html

 

I configured the cluster to distribute the /opt/gridengine/lam_loose_rsh
directory to each of the compute nodes, so they all have a copy of
startlam.sh and stoplam.sh (both of which are executable by everyone).

 

I'm running a simple hello world test that prints the name of the
compute node it is running on. The output correctly shows the name of
each node, so the job looks like it's working.

 

However, if I check the job's head node for processes under my name, I
see:

/opt/lam/intel/bin/lamd -H 172.20.5.166 -P 42681 -n 0 -o 0 -sessionsuffix sge-26884-undefined
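A minimal sketch for matching a leftover lamd to the job that started it. The helper names (session_suffix, suffix_job_id, list_my_lamd) are made up for illustration, and the sketch assumes the value after -sessionsuffix has the form sge-<job id>-<task id>, which is a guess based on the command line above rather than something confirmed against the LAM documentation.

```shell
#!/bin/sh
# Hypothetical helper: print the value that follows -sessionsuffix in a
# lamd command line, or print nothing if the flag is absent.
session_suffix() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i < NF; i++)
            if ($i == "-sessionsuffix") { print $(i + 1); exit }
    }'
}

# Hypothetical helper: ASSUMING the suffix looks like sge-<job id>-<task id>,
# print the job id field. Prints nothing for any other prefix.
suffix_job_id() {
    printf '%s\n' "$1" | awk -F- '$1 == "sge" { print $2 }'
}

# List the current user's lamd processes together with the job id that
# each one appears to belong to.
list_my_lamd() {
    ps -u "$(id -un)" -o pid= -o args= | grep '[l]amd' |
    while read -r pid args; do
        suffix=$(session_suffix "$args")
        echo "pid=$pid suffix=$suffix job=$(suffix_job_id "$suffix")"
    done
}
```

Running list_my_lamd on the job's head node and comparing the reported job ID with the one in the job's active_jobs spool path would show whether a daemon belongs to the job being inspected or to an earlier one.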

 

 

I added the -v -d switches to lamhalt in stoplam.sh and here's what I
see in the job log:

 

/opt/gridengine/default/spool/compute-2-39/active_jobs/26890.1/pe_hostfile

compute-2-39.local
compute-4-104.local
compute-2-64.local
compute-4-98.local

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

Using lamhalt: /opt/lam/intel/bin/lamhalt   on node compute-2-39.local

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

Shutting down LAM
hreq: sending HALT_PING to n1 (compute-4-104.local)
hreq: sending HALT_PING to n2 (compute-2-64.local)
hreq: sending HALT_PING to n3 (compute-4-98.local)
hreq: waiting for HALT ACKs from remote LAM daemons
hreq: received HALT_ACK from n1 (compute-4-104.local)
hreq: sending HALT_DIE to n1 (compute-4-104.local)
hreq: received HALT_ACK from n2 (compute-2-64.local)
hreq: sending HALT_DIE to n2 (compute-2-64.local)
hreq: received HALT_ACK from n3 (compute-4-98.local)
hreq: sending HALT_DIE to n3 (compute-4-98.local)
hreq: sending HALT_PING to n0 (compute-2-39.local)
hreq: received HALT_ACK from n0 (compute-2-39.local)
hreq: sending HALT_DIE to n0 (compute-2-39.local)
lamhalt: local LAM daemon halted
LAM halted
mkdir: No such file or directory
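
The halt step that produced the output above can be sketched roughly as follows. This is not the actual stoplam.sh from the howto: the halt_lam function and the exit-status line are additions for illustration, and the default path is simply the one shown in the log. Printing the exit status is meant to help narrow down whether a stray message is logged before or after lamhalt returns.

```shell
#!/bin/sh
# Path taken from the job log; override by setting LAMHALT in the environment.
LAMHALT=${LAMHALT:-/opt/lam/intel/bin/lamhalt}

# Hypothetical wrapper around the lamhalt call.
halt_lam() {
    if [ ! -x "$LAMHALT" ]; then
        echo "lamhalt not found at $LAMHALT" >&2
        return 127
    fi
    echo "Using lamhalt: $LAMHALT on node $(hostname)"
    # -v and -d are the two switches added for this test
    # (verbose and debug output, as far as I can tell).
    "$LAMHALT" -v -d
    status=$?
    echo "lamhalt exit status: $status"
    return "$status"
}
```

Calling halt_lam at the point where stoplam.sh currently calls lamhalt would leave one extra line in the job log with the exit status.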

 

 

I'm not sure where that "mkdir: No such file..." is coming from. However,
if I ssh to the head compute node and kill the lamd process, another
"mkdir: No such..." gets logged to the job log file.



