[GE users] Problems with LAM tight integration

slaton slaton at berkeley.edu
Wed Aug 2 07:43:52 BST 2006


I'm using SGE 6.0u8 and am following these directions for configuration of 
LAM tight integration (based on Reuti's previous work on the same):


Currently I cannot get the mpihello program (or the hello program included 
with LAM) to run under Grid Engine. It does run properly using the standard 
lamboot/mpirun/lamhalt procedure. Also, MPICH and OpenMP jobs are working 
fine in Grid Engine (using the appropriate parallel environments).

Running the mpihello example, I get the following errors from the job:

 error: ERROR! invalid option argument "-n"
 The lamboot agent timed out while waiting for the newly-booted process
 to call back and indicated that it had successfully booted.

 (...stock LAM/MPI admonishment to read the FAQ...)

 error: error reading returncode of remote command

...from the qmaster's logfile:

 E|invalid job object in job submission from user "slaton", commproc 
 "qsub" on host "qln01"
 E|can not remove file pe task spool file: 
 E|tightly integrated parallel task 50.1 task 1.qcn13 failed - killing job

...from the selected slave nodes' logfiles:

 W|reaping job "50" ptf complains: Job does not exist

...from the job's output file:

 -catch_rsh /opt/sge/default/spool/qcn13/active_jobs/50.1/pe_hostfile
 /opt/sge/bin/lx24-amd64/qrsh -V -inherit -n -p 32795 qcn13 exec 

I requested 4 processors, and the slave nodes SGE picked for this job were 
qcn13, qcn14, qcn15 and qcn16. However, the job only appears to run on one 
of these nodes (qcn13) before eventually erroring out. The other nodes 
remain idle. Perhaps the nodelist/machinefile is not being properly passed 
to startlam.sh?
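
To make the machinefile question concrete, here is a minimal sketch of the kind of conversion a start script has to do, assuming the usual pe_hostfile layout of one line per host (host name, slot count, queue instance, processor range) and LAM's "host cpu=N" boot schema syntax. The function name, the queue name all.q and the sample file are hypothetical; this is not the code from startlam.sh.

```shell
#!/bin/sh
# Hypothetical helper (not taken from startlam.sh): turn an SGE pe_hostfile
# into a LAM boot schema with one "host cpu=N" line per node.
# Assumed pe_hostfile line layout: <host> <slots> <queue> <processor range>
pe_hostfile_to_lam_schema() {
    while read -r host slots queue procs; do
        # Skip blank lines
        [ -n "$host" ] || continue
        echo "$host cpu=$slots"
    done < "$1"
}

# Example run with a made-up hostfile for the four nodes of this job.
# The queue name all.q is an assumption for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
qcn13 1 all.q@qcn13 UNDEFINED
qcn14 1 all.q@qcn14 UNDEFINED
qcn15 1 all.q@qcn15 UNDEFINED
qcn16 1 all.q@qcn16 UNDEFINED
EOF
pe_hostfile_to_lam_schema "$sample"
rm -f "$sample"
```

If the machines file that lamboot is given lists only one host, the problem would be in this conversion step rather than in qrsh.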

Output of ps -e f -o pid,ppid,pgrp,command --cols=80 on the one working 
node looks like this:
 1987  1296  1987  \_ in.rlogind 
 1988  1987  1988      \_ login -- root
 1989  1988  1989          \_ -bash
 2060  1989  2060              \_ ps -e f -o pid,ppid,pgrp,command
 1345     1  1345 /opt/sge/bin/lx24-amd64/sge_execd
 1932  1345  1932  \_ sge_shepherd-51 -bg
 1933  1932  1933  |   \_ /bin/sh /opt/sge/lam_tight_qrsh/startlam.sh -catch_rsh
 1973  1933  1933  |       \_ lamboot /tmp/51.1.testing/machines
 1978  1345  1978  \_ sge_shepherd-51 -bg
 1979  1978  1979      \_ sge_shepherd-51 -bg

Per the article, my PE is defined as follows (with startlam.sh and 
stoplam.sh being customized per the article):

 # qconf -sp lam_tight_qrsh
 pe_name           lam_tight_qrsh
 slots             18
 user_lists        NogalesLab
 xuser_lists       NONE
 start_proc_args   /opt/sge/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
 stop_proc_args    /opt/sge/lam_tight_qrsh/stoplam.sh
 allocation_rule   $round_robin
 control_slaves    TRUE
 job_is_first_task FALSE
 urgency_slots     min

The 'invalid option argument "-n"' error is being generated by qrsh. 
Somehow qrsh is being passed a '-n' arg, although a cursory look through 
the rsh wrapper, startlam.sh and stoplam.sh does not reveal a place where 
this would happen.
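
One way to narrow this down, sketched under the assumption that the rsh wrapper is a plain sh script, would be to log the argument vector the wrapper receives before it builds the qrsh command line. The helper name log_argv is hypothetical, and the sample arguments are simply the ones visible in the job's output file above.

```shell
#!/bin/sh
# Hypothetical debugging helper: print every argument on its own numbered
# line, so a stray option such as "-n" can be traced to its caller.
log_argv() {
    i=0
    for arg in "$@"; do
        i=$((i + 1))
        printf 'arg[%d]=%s\n' "$i" "$arg"
    done
}

# Example: the arguments seen on the qrsh command line in the job output
log_argv -V -inherit -n -p 32795 qcn13 exec
```

Calling such a helper at the top of the wrapper with "$@", and redirecting its output to a file, would show whether the "-n" arrives from lamboot or is added inside the wrapper.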

Any suggestions would be greatly appreciated.


Slaton Lipscomb
Nogales Lab, Howard Hughes Medical Institute
