[GE users] Status of tight integration between LAM-MPI and v6.0

Anthony J. Ciani aciani1 at uic.edu
Sat Nov 13 02:05:41 GMT 2004


Hello,

I had Grid Engine v6.0 running with LAM 7.0.6, and now have it running 
with LAM 7.1.1.

To perform the integration, I added a script called sge-lamhalt to LAM 
(/usr/local/LAM/bin/sge-lamhalt) but did not alter any other part of LAM 
(i.e. I am not using qrsh, because it doesn't work for booting LAM).  I 
then set an environment variable system-wide:

LAM_MPI_SESSION_PREFIX=/tmp

sge-lamhalt:
-------------------------------------------
#!/bin/bash

#For some reason, sge_shepherd doesn't pass the full environment to
#stop_proc_args....

export LAM_MPI_SESSION_PREFIX=/tmp

#Not necessary with LAM >7.0.3
#export LAM_MPI_SESSION_SUFFIX=$JOB_ID

/usr/local/LAM/bin/lamhalt
-------------------------------------------


It is important to make certain that LAM always keeps its sockets in /tmp, 
because under SGE, LAM's default behavior is to put sockets in SGE's 
TMPDIR, which gets deleted before 'sge-lamhalt' is executed, thereby 
polluting nodes with a bunch of lamd's that lamhalt couldn't kill.  I 
believe this change (setting the session prefix to TMPDIR) was implemented 
after LAM 7.0.3.  This script will work for any LAM version.
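
As a sanity check on the point above, here is a minimal sketch of a guard 
that could be called from a wrapper or from sge-lamhalt.  The function name 
lam_prefix_survives_job is hypothetical (it is not part of LAM or Grid 
Engine); it only tests whether LAM_MPI_SESSION_PREFIX is set and lies 
outside SGE's per-job TMPDIR.

```shell
#!/bin/bash
# Hypothetical helper, not part of LAM or Grid Engine.
# Succeeds (returns 0) only when LAM_MPI_SESSION_PREFIX is set and is
# not located at or below SGE's per-job TMPDIR.
lam_prefix_survives_job() {
    local prefix="${LAM_MPI_SESSION_PREFIX:-}"
    local jobtmp="${TMPDIR:-}"

    # No prefix set: per the post, LAM would then use TMPDIR under SGE.
    [ -n "$prefix" ] || return 1

    # No TMPDIR to compare against, so there is nothing to collide with.
    [ -n "$jobtmp" ] || return 0

    # Drop one trailing slash from each so /tmp and /tmp/ compare equal.
    prefix="${prefix%/}"
    jobtmp="${jobtmp%/}"

    # Fail if the prefix is TMPDIR itself or a directory underneath it.
    case "$prefix/" in
        "$jobtmp"/*) return 1 ;;
    esac
    return 0
}
```

With LAM_MPI_SESSION_PREFIX=/tmp and a TMPDIR such as /tmp/123.1.all.q the 
check passes; with the prefix unset or pointing into TMPDIR it fails, 
which is the case where lamhalt would no longer find its sockets.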

Why does sge_shepherd not create TMPDIR before running the 
prologue/pe_start, and why does it destroy TMPDIR before running 
the pe_stop/epilogue?!

New versions of LAM (>7.0.3) automatically recognize the environment 
variables SGE_TASK_ID and JOB_ID and combine them into the session suffix:

LAM_MPI_SESSION_SUFFIX="sge-$JOB_ID-$SGE_TASK_ID"

which is nice, because you can have multiple lamd's running at the same 
time on the same node under the same user without the start or termination 
of one affecting the others.
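
To make the suffix format concrete, the following sketch just reproduces 
the string quoted above.  lam_sge_suffix is a hypothetical helper for 
illustration, not a LAM command; LAM builds the suffix internally.

```shell
#!/bin/bash
# Illustration only: mirrors the "sge-$JOB_ID-$SGE_TASK_ID" format
# quoted in the post.  lam_sge_suffix is a hypothetical name.
lam_sge_suffix() {
    local job_id="$1"
    local task_id="$2"
    printf 'sge-%s-%s\n' "$job_id" "$task_id"
}
```

For example, "lam_sge_suffix 4711 1" prints sge-4711-1 and 
"lam_sge_suffix 4711 2" prints sge-4711-2, so two tasks of the same job 
end up with separate LAM sessions.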


I created the PE with the following options:
-------------------------------------------
pe_name           lam
slots             16
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /usr/local/LAM/bin/sge-lamhalt
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task TRUE
urgency_slots     min
-------------------------------------------
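
One way to load a definition like the one above is sketched here, assuming 
the settings were saved to a file.  The file name lam.pe and the function 
name register_lam_pe are assumptions for illustration; "qconf -Ap" adds a 
PE from a file and "qconf -sp" prints it back.

```shell
#!/bin/bash
# Sketch: register a PE definition from a file, then show the result.
# "lam.pe" is an assumed file name holding the settings listed above;
# register_lam_pe is a hypothetical helper name.
register_lam_pe() {
    local pe_file="${1:-lam.pe}"
    # Add the PE from the file, and only print it back if that worked.
    qconf -Ap "$pe_file" && qconf -sp lam
}
```

In Grid Engine 6.0 the PE also has to be listed in the pe_list of a queue 
before jobs can use it; a job then requests it with something like 
"qsub -pe lam 8 job.sh" (the slot count and script name are placeholders).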


Notice that lamboot isn't called yet.  For some reason, sge_shepherd has a 
problem running lamboot, and this hasn't been fixed yet.  Maybe someday 
someone will make a modified lamboot to boot natively under SGE.  For now, 
we start LAM in a wrapper script around mpirun.

sge_mpirun:
-------------------------------------------
#!/bin/bash

#No longer necessary, it's already system wide
#export LAM_MPI_SESSION_PREFIX="/tmp"

#Not necessary with LAM >7.0.3
#export LAM_MPI_SESSION_SUFFIX=$JOB_ID

#Make the hostfile, truncating it first in case it already exists
lamhostfile="$TMPDIR/lamhostfile"
export lamhostfile
: > "$lamhostfile"

#Each PE_HOSTFILE line starts with a hostname and a slot count
while read -r host slots rest
do
         printf "%s cpu=%d\n" "$host" "$slots" >>"$lamhostfile"
done < "$PE_HOSTFILE"
#Finished making hostfile

lamboot "$lamhostfile" >/dev/null

#Just to see what machines we're running on
lamnodes

#Could use mpirun C, or mpiexec here I guess.
mpirun -np 0 "$@"

-------------------------------------------
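
The hostfile conversion in the wrapper above can be tried on its own, 
without a running job.  This is a minimal sketch; pe_to_lam_hostfile is a 
hypothetical name, and it assumes each PE_HOSTFILE line begins with a 
hostname followed by a slot count, with any further columns ignored.

```shell
#!/bin/bash
# Sketch of the PE_HOSTFILE -> LAM boot schema conversion.
# pe_to_lam_hostfile is a hypothetical helper name.
# Assumed input format: "<hostname> <slots> <anything else...>" per line.
pe_to_lam_hostfile() {
    local pe_hostfile="$1"
    local host slots rest
    while read -r host slots rest; do
        # Skip blank lines so they do not produce a bogus entry.
        [ -n "$host" ] || continue
        printf '%s cpu=%d\n' "$host" "$slots"
    done < "$pe_hostfile"
}
```

Given a file with the lines "node01 4 all.q@node01 UNDEFINED" and 
"node02 2 all.q@node02 UNDEFINED" (made-up hosts), the function prints 
"node01 cpu=4" and "node02 cpu=2", which is the form passed to lamboot.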



In summary, the "tight" integration doesn't exist yet.  The hostfile 
format of Grid Engine is not yet understood by lamboot, necessitating a 
wrapper.  For some reason sge_shepherd doesn't like executing lamboot. 
The only agents which work with lamboot are ssh and rsh; a qrsh boot 
will fail.

However, with the above scripts LAM and Grid Engine seem to work quite 
nicely.  If you want, you could even rewrite these scripts in Perl to 
improve the portability.  You need to be able to boot LAM outside of Grid 
Engine (which means crafty users can log on to compute nodes and run 
unscheduled tasks), and people need to use the wrapper instead of calling 
mpirun directly, but it is usable, and the wrapper can be run multiple 
times within a job script.  I "think" that even the accounting is right.


On Thu, 11 Nov 2004, Didier Contis wrote:
>
> I am trying to find out if any progress have been made
> regarding a tight integration between LAM-MPI and v6.0
>
> The current information posted at:
> http://gridengine.sunsource.net/project/gridengine/howto/lam/SGE_LAM_Integration.html
>
> are from July 2003 for Grid Engine v5.3
>
> I browsed the user mailing list and found some reference to some
> newer scripts, but I am not sure where they have been posted.
>
> Any help would be welcome.
>
> Thanks in advance.
>
> Didier.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>

------------------------------------------------------------
               Anthony Ciani (aciani1 at uic.edu)
            Computational Condensed Matter Physics
    Department of Physics, University of Illinois, Chicago
               http://ciani.phy.uic.edu/~tony
------------------------------------------------------------
