[GE users] lam_loose_rsh isn't lamhalt-ing

Reuti reuti at staff.uni-marburg.de
Fri Mar 9 20:04:39 GMT 2007


Hi,

On 09.03.2007 at 18:31, Mike Hanby wrote:

> Howdy, I'm using Grid Engine 6.0u4 on a Rocks 4.0.0 cluster, 128  
> nodes.
>
>
>
> I set up the parallel environment for lam_loose_rsh using the  
> instructions at:
>
> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
>
>
>
> I configured the cluster to distribute the
> /opt/gridengine/lam_loose_rsh directory to each of the compute nodes,
> so they all have a copy of startlam.sh and stoplam.sh (both of which
> are executable by everyone).
>
>
>
> I'm running a simple hello world test where it prints the name of  
> the compute node that it is running on. The output is correctly  
> printing the name of each node, so the job looks like it's working.
>
>
>
> However, if I check the job's head node for processes under my name,
> I see:
>
> /opt/lam/intel/bin/lamd -H 172.20.5.166 -P 42681 -n 0 -o 0 -sessionsuffix sge-26884-undefined
one after the other. Is the head-node also an exec-node, or do you
mean the head-node of the parallel job?

Seeing such a process while the job is running is okay (on any of the
cluster nodes). But it shouldn't survive the end of the job, i.e. the
lamhalt.
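
To make that check concrete, here is a minimal sketch that flags lamd
processes whose job has already ended. It assumes the session suffix has
the form sge-<job id>-<task id>, as the quoted ps line suggests; the
function name and the idea of passing in the set of running job ids are
hypothetical, not part of the howto scripts.

```python
import re

# Assumption: the suffix looks like "sge-<job id>-<task id>",
# as in the ps line quoted in this thread.
SUFFIX_RE = re.compile(r"-sessionsuffix\s+sge-(\d+)-(\S+)")


def leftover_lamds(ps_lines, running_job_ids):
    """Return (job_id, line) for each lamd whose SGE job is no longer running."""
    leftovers = []
    for line in ps_lines:
        if "lamd" not in line:
            continue
        match = SUFFIX_RE.search(line)
        if match is None:
            # lamd started outside of SGE, or with another suffix format
            continue
        job_id = int(match.group(1))
        if job_id not in running_job_ids:
            leftovers.append((job_id, line.strip()))
    return leftovers


if __name__ == "__main__":
    sample = [
        "/opt/lam/intel/bin/lamd -H 172.20.5.166 -P 42681 -n 0 -o 0 "
        "-sessionsuffix sge-26884-undefined"
    ]
    # Job 26884 is not in the set of running jobs, so its lamd is reported.
    print(leftover_lamds(sample, running_job_ids={26890}))
```

Feed it the ps output collected on a node together with the job ids that
are still running; anything it returns is a lamd that outlived its job.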

-- Reuti

> /opt/gridengine/default/spool/compute-2-39/active_jobs/26890.1/pe_hostfile
>
> compute-2-39.local
> compute-4-104.local
> compute-2-64.local
> compute-4-98.local
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> Using lamhalt: /opt/lam/intel/bin/lamhalt   on node compute-2-39.local
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> Shutting down LAM
> hreq: sending HALT_PING to n1 (compute-4-104.local)
> hreq: sending HALT_PING to n2 (compute-2-64.local)
> hreq: sending HALT_PING to n3 (compute-4-98.local)
> hreq: waiting for HALT ACKs from remote LAM daemons
> hreq: received HALT_ACK from n1 (compute-4-104.local)
> hreq: sending HALT_DIE to n1 (compute-4-104.local)
> hreq: received HALT_ACK from n2 (compute-2-64.local)
> hreq: sending HALT_DIE to n2 (compute-2-64.local)
> hreq: received HALT_ACK from n3 (compute-4-98.local)
> hreq: sending HALT_DIE to n3 (compute-4-98.local)
> hreq: sending HALT_PING to n0 (compute-2-39.local)
> hreq: received HALT_ACK from n0 (compute-2-39.local)
> hreq: sending HALT_DIE to n0 (compute-2-39.local)
> lamhalt: local LAM daemon halted
> LAM halted
> mkdir: No such file or directory
>
> I'm not sure where that "mkdir: No such file..." is coming from.
> However, if I ssh to the head compute node and kill the lamd
> process, another "mkdir: No such..." gets logged to the job log
> file.
>
>
