[GE users] Problems with LAM tight integration
slaton at berkeley.edu
Wed Aug 2 07:43:52 BST 2006
I'm using SGE 6.0u8 and am following these directions for configuration of
LAM tight integration (based on Reuti's previous work on the same):
Currently I cannot get the mpihello program (or the hello program included
with LAM) to run under grid engine. It does run properly using the standard
lamboot/mpirun/lamhalt procedure. Also, mpich and openmp jobs are working
fine in grid engine (using the appropriate parallel environments).
Running the mpihello example, I get the following errors for the job:
error: ERROR! invalid option argument "-n"
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.
(...stock LAM/MPI admonishment to read the FAQ...)
error: error reading returncode of remote command
...from the qmaster's logfile:
E|invalid job object in job submission from user "slaton", commproc
"qsub" on host "qln01"
E|can not remove file pe task spool file:
E|tightly integrated parallel task 50.1 task 1.qcn13 failed - killing job
...from the selected slave nodes' logfiles:
W|reaping job "50" ptf complains: Job does not exist
...from the job's output file:
/opt/sge/bin/lx24-amd64/qrsh -V -inherit -n -p 32795 qcn13 exec
I requested 4 processors, and the slave nodes SGE picked for this job were
qcn13, qcn14, qcn15 and qcn16. However, the job only appears to run on one
of these nodes (qcn13) before eventually erroring out. The other nodes
remain idle. Perhaps the nodelist/machinefile is not being passed properly.
Output of ps -e f -o pid,ppid,pgrp,command --cols=80 on the one working
node looks like this:
 1987  1296  1987  \_ in.rlogind
 1988  1987  1988      \_ login -- root
 1989  1988  1989          \_ -bash
 2060  1989  2060              \_ ps -e f -o pid,ppid,pgrp,command
 1345     1  1345 /opt/sge/bin/lx24-amd64/sge_execd
 1932  1345  1932  \_ sge_shepherd-51 -bg
 1933  1932  1933  |   \_ /bin/sh /opt/sge/lam_tight_qrsh/startlam.sh -catch_rsh
 1973  1933  1933  |       \_ lamboot /tmp/51.1.testing/machines
 1978  1345  1978  \_ sge_shepherd-51 -bg
 1979  1978  1979      \_ sge_shepherd-51 -bg
Per the article, my PE is defined as follows (with startlam.sh and
stoplam.sh being customized per the article):
# qconf -sp lam_tight_qrsh
start_proc_args /opt/sge/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
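
To make the hostfile handling concrete, here is a minimal sketch of how a start script might turn the $pe_hostfile into a LAM boot schema. It is not the code from startlam.sh; the function name pe_hostfile_to_lam is hypothetical, and the sketch assumes each line of the hostfile has the form "<host> <slots> <queue> <processor range>" and that the boot schema wants "<host> cpu=<slots>" per line.

```shell
# Hypothetical helper (not taken from startlam.sh): convert an SGE
# $pe_hostfile into a LAM boot schema, one "<host> cpu=<slots>" per line.
# Assumes each input line reads "<host> <slots> <queue> <processor range>".
pe_hostfile_to_lam() {
    # $1: path to the pe_hostfile handed over via start_proc_args
    while read -r host slots _rest; do
        # Skip blank lines
        [ -n "$host" ] || continue
        echo "$host cpu=$slots"
    done < "$1"
}
```

Comparing this output with the machines file that lamboot is actually given (/tmp/51.1.testing/machines in the ps listing above) would show whether all four hosts reach lamboot.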
The 'invalid option argument "-n"' error is being generated by qrsh.
Somehow qrsh is being passed a '-n' arg, although a cursory look through
the rsh wrapper, startlam.sh and stoplam.sh does not reveal a place where
this would happen.
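
As a debugging aid, and only as a sketch, the wrapper could drop a standalone "-n" before the arguments reach qrsh (with rsh, -n conventionally redirects stdin from /dev/null). The function name strip_minus_n is hypothetical and is not part of the rsh wrapper, startlam.sh or stoplam.sh; whether removing the flag is the right fix is an assumption that would need testing.

```shell
# Hypothetical helper: print the given arguments with any standalone "-n"
# removed. The result is a flat, space-separated string, so arguments that
# themselves contain spaces are not preserved as single arguments.
strip_minus_n() {
    out=""
    for arg in "$@"; do
        # Drop the rsh-style "-n" flag that qrsh rejects
        [ "$arg" = "-n" ] && continue
        out="${out:+$out }$arg"
    done
    printf '%s\n' "$out"
}
```

Applied to the arguments seen in the job's output file, `strip_minus_n -V -inherit -n -p 32795 qcn13 exec` prints `-V -inherit -p 32795 qcn13 exec`.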
Any suggestions would be greatly appreciated.
Nogales Lab, Howard Hughes Medical Institute