[GE users] Problems with LAM tight integration

slaton slaton at berkeley.edu
Wed Aug 2 07:43:52 BST 2006


Hi,

I'm using SGE 6.0u8 and am following these directions for configuration of 
LAM tight integration (based on Reuti's previous work on the same):

 http://wiki.gridengine.info/wiki/index.php/Tight-LAM-Integration-Notes

Currently I cannot get mpihello (or the hello program included with LAM) 
to run under Grid Engine. It does run properly using the standard 
lamboot/mpirun/lamhalt procedure. Also, mpich and openmp jobs are working 
fine in Grid Engine (using the appropriate parallel environments).

Running the mpihello example, I get the following error in the job 
output:

 error: ERROR! invalid option argument "-n"
 The lamboot agent timed out while waiting for the newly-booted process
 to call back and indicated that it had successfully booted.

 (...stock LAM/MPI admonishment to read the FAQ...)

 error: error reading returncode of remote command

...from the qmaster's logfile:

 E|invalid job object in job submission from user "slaton", commproc 
 "qsub" on host "qln01"
 E|can not remove file pe task spool file: 
 jobs/00/0000/0050/1-4096/1/1.qcn13
 E|tightly integrated parallel task 50.1 task 1.qcn13 failed - killing job

...from the selected slave nodes' logfiles:

 W|reaping job "50" ptf complains: Job does not exist

...from the job's output file:

 -catch_rsh /opt/sge/default/spool/qcn13/active_jobs/50.1/pe_hostfile
 qcn13
 qcn14
 qcn15
 qcn16
 /opt/sge/bin/lx24-amd64/qrsh -V -inherit -n -p 32795 qcn13 exec 
 '/opt/sge/utilbin/lx24-amd64/qrsh_starter' 
 '/opt/sge/default/spool/qcn13/active_jobs/50.1/1.qcn13'

I requested 4 processors, and the slave nodes SGE picked for this job were 
qcn13, qcn14, qcn15 and qcn16. However, the job only appears to run on one 
of these nodes (qcn13) before eventually erroring out. The other nodes 
remain idle. Perhaps the nodelist/machinefile is not being properly passed 
to startlam.sh?
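
For comparison, here is a minimal sketch of the hostfile-to-machinefile 
conversion that a start script of this kind needs to perform. It is not the 
article's startlam.sh. The function name pe_hostfile_to_lam is made up for 
illustration, and the sketch assumes the usual pe_hostfile layout (host, slot 
count, queue, processor range on each line) and LAM's "host cpu=N" boot 
schema syntax:

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or LAM): turn an SGE pe_hostfile
# into a LAM boot schema, one "host cpu=N" line per allocated host.
# Assumes each input line looks like: host slots queue processor-range
pe_hostfile_to_lam() {
    # $1: path to the pe_hostfile handed over by SGE
    awk 'NF >= 2 { print $1 " cpu=" $2 }' "$1"
}

# Possible use in a PE start script, where $1 would be the pe_hostfile
# argument and TMPDIR the per-job scratch directory (both assumptions):
# pe_hostfile_to_lam "$1" > "$TMPDIR/machines"
```

Comparing the output of such a conversion against the machines file that 
lamboot actually receives would show whether all four hosts make it through.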

Output of ps -e f -o pid,ppid,pgrp,command --cols=80 on the one working 
node looks like this:
 
 1987  1296  1987  \_ in.rlogind 
 1988  1987  1988      \_ login -- root
 1989  1988  1989          \_ -bash
 2060  1989  2060              \_ ps -e f -o pid,ppid,pgrp,command
 1345     1  1345 /opt/sge/bin/lx24-amd64/sge_execd
 1932  1345  1932  \_ sge_shepherd-51 -bg
 1933  1932  1933  |   \_ /bin/sh /opt/sge/lam_tight_qrsh/startlam.sh -catch_rsh
 1973  1933  1933  |       \_ lamboot /tmp/51.1.testing/machines
 1978  1345  1978  \_ sge_shepherd-51 -bg
 1979  1978  1979      \_ sge_shepherd-51 -bg


Per the article, my PE is defined as follows (with startlam.sh and 
stoplam.sh being customized per the article):

 # qconf -sp lam_tight_qrsh
 pe_name           lam_tight_qrsh
 slots             18
 user_lists        NogalesLab
 xuser_lists       NONE
 start_proc_args   /opt/sge/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
 stop_proc_args    /opt/sge/lam_tight_qrsh/stoplam.sh
 allocation_rule   $round_robin
 control_slaves    TRUE
 job_is_first_task FALSE
 urgency_slots     min

The 'invalid option argument "-n"' error is being generated by qrsh. 
Somehow qrsh is being passed a '-n' arg, although a cursory look through 
the rsh wrapper, startlam.sh and stoplam.sh does not reveal a place where 
this would happen.
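
A more systematic check than a cursory read could look like the sketch 
below. It only searches the scripts for a standalone -n token and makes no 
claim about where the option really comes from. The function name 
find_dash_n is hypothetical, and the location of the rsh wrapper is not 
shown because it depends on the installation:

```shell
#!/bin/sh
# Hypothetical helper: print file:line:text for every line in the given
# scripts that contains "-n" as a standalone option (so "-nostdin" or
# "foo-n" are not reported).
find_dash_n() {
    # /dev/null is appended so grep always prefixes matches with the file name.
    grep -nE -e '(^|[^[:alnum:]_-])-n([^[:alnum:]_-]|$)' "$@" /dev/null
}

# Example, using the script paths from the PE definition:
# find_dash_n /opt/sge/lam_tight_qrsh/startlam.sh /opt/sge/lam_tight_qrsh/stoplam.sh
```

If nothing is reported for the wrapper and the start/stop scripts, the -n 
is presumably being added by whatever program invokes the wrapper rather 
than by the scripts themselves.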

Any suggestions would be greatly appreciated.

thanks
slaton

Slaton Lipscomb
Nogales Lab, Howard Hughes Medical Institute
http://cryoem.berkeley.edu
