[GE users] Problems with LAM tight integration

Reuti reuti at staff.uni-marburg.de
Wed Aug 2 13:01:44 BST 2006


Hi,

On 02.08.2006 at 08:43, slaton wrote:

> Hi,
>
> I'm using SGE 6.0u8 and am following these directions for configuration of
> LAM tight integration (based on Reuti's previous work on the same):
>
>  http://wiki.gridengine.info/wiki/index.php/Tight-LAM-Integration-Notes
>
> Currently I cannot get the mpihello program (or the hello program included
> with LAM) to run using grid engine. It does run properly using the standard
> lamboot/mpirun/lamhalt procedure. Also, mpich and openmp jobs are working
> fine in grid engine (using the appropriate parallel environments).
>
> Performing the mpihello example, I get the following error in the job
> output:
>
>  error: ERROR! invalid option argument "-n"
>  The lamboot agent timed out while waiting for the newly-booted process
>  to call back and indicated that it had successfully booted.
>
>  (...stock LAM/MPI admonishment to read the FAQ...)
>
>  error: error reading returncode of remote command
>
> ...from the qmaster's logfile:
>
>  E|invalid job object in job submission from user "slaton", commproc
>  "qsub" on host "qln01"

Is the same SGE version running on all machines? Is the qmaster also
running on "qln01"?

>  E|can not remove file pe task spool file:
>  jobs/00/0000/0050/1-4096/1/1.qcn13
>  E|tightly integrated parallel task 50.1 task 1.qcn13 failed - killing job
>
> ...from the selected slave nodes' logfiles:
>
>  W|reaping job "50" ptf complains: Job does not exist
>
> ...from the job's output file:
>
>  -catch_rsh /opt/sge/default/spool/qcn13/active_jobs/50.1/pe_hostfile
>  qcn13
>  qcn14
>  qcn15
>  qcn16
>  /opt/sge/bin/lx24-amd64/qrsh -V -inherit -n -p 32795 qcn13 exec

There is the -n, and -p (priority) is also not the intended option
for qrsh, I think. Is this the echo from your rsh-wrapper? Can you
please check whether you are using the correct rsh-wrapper, i.e.
whether the link created in $TMPDIR points to the rsh-wrapper for
lam_tight_qrsh?
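
One way to check the link from inside the job could look like the
sketch below. The helper name show_rsh_link is made up for
illustration, and it assumes that readlink is available on the nodes
and that the wrapper is linked under the name "rsh".

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or LAM): report what the "rsh"
# entry in a directory points to. Inside a job, pass "$TMPDIR".
show_rsh_link() {
    dir=$1
    if [ -L "$dir/rsh" ]; then
        # Symbolic link: print its target, which should be the wrapper
        readlink "$dir/rsh"
    elif [ -e "$dir/rsh" ]; then
        # A regular file or directory, not the expected link
        echo "not-a-link"
    else
        # Nothing named rsh was created in this directory
        echo "missing"
    fi
}
```

Calling show_rsh_link "$TMPDIR" from the job script would then print
either the link target or one of the two diagnostic words.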

Was "/opt/sge/bin/lx24-amd64/qrsh -V -inherit" compiled into LAM as
the rsh program by accident, so that it bypasses the rsh-wrapper
(otherwise the rsh-wrapper would filter out the -n)?
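
To illustrate what filtering out the -n means, here is a minimal
sketch. It is not the wrapper from the SGE distribution or from the
wiki article, and the function name strip_minus_n is made up. It only
shows the idea of dropping a bare -n from the argument list before the
remaining arguments are handed on to qrsh.

```shell
#!/bin/sh
# Sketch only: NOT the actual rsh-wrapper. strip_minus_n prints its
# arguments one per line, leaving out any bare "-n".
strip_minus_n() {
    for arg in "$@"; do
        case "$arg" in
            -n) ;;                        # drop the -n that qrsh rejects
            *)  printf '%s\n' "$arg" ;;   # keep every other argument
        esac
    done
}
```

With the arguments "-n qcn13 exec" only "qcn13" and "exec" remain. If
LAM calls qrsh directly, no such step runs and the -n reaches qrsh
unchanged.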

-- Reuti


>  '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
>  '/opt/sge/default/spool/qcn13/active_jobs/50.1/1.qcn13'
>
> I requested 4 processors, and the slave nodes SGE picked for this job were
> qcn13, qcn14, qcn15 and qcn16. However, the job only appears to run on one
> of these nodes (qcn13) before eventually erroring out. The other nodes
> remain idle. Perhaps the nodelist/machinefile is not being properly passed
> to startlam.sh?
>
> Output of ps -e f -o pid,ppid,pgrp,command --cols=80 on the one working
> node looks like this:
>
>  1987  1296  1987  \_ in.rlogind
>  1988  1987  1988      \_ login -- root
>  1989  1988  1989          \_ -bash
>  2060  1989  2060              \_ ps -e f -o pid,ppid,pgrp,command
>  1345     1  1345 /opt/sge/bin/lx24-amd64/sge_execd
>  1932  1345  1932  \_ sge_shepherd-51 -bg
>  1933  1932  1933  |   \_ /bin/sh /opt/sge/lam_tight_qrsh/startlam.sh -catch_rsh
>  1973  1933  1933  |       \_ lamboot /tmp/51.1.testing/machines
>  1978  1345  1978  \_ sge_shepherd-51 -bg
>  1979  1978  1979      \_ sge_shepherd-51 -bg
>
>
> Per the article, my PE is defined as follows (with startlam.sh and
> stoplam.sh being customized per the article):
>
>  # qconf -sp lam_tight_qrsh
>  pe_name           lam_tight_qrsh
>  slots             18
>  user_lists        NogalesLab
>  xuser_lists       NONE
>  start_proc_args   /opt/sge/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
>  stop_proc_args    /opt/sge/lam_tight_qrsh/stoplam.sh
>  allocation_rule   $round_robin
>  control_slaves    TRUE
>  job_is_first_task FALSE
>  urgency_slots     min
>
> The 'invalid option argument "-n"' error is being generated by qrsh.
> Somehow qrsh is being passed a '-n' arg, although a cursory look through
> the rsh wrapper, startlam.sh and stoplam.sh does not reveal a place where
> this would happen.
>
> Any suggestions would be greatly appreciated.
>
> thanks
> slaton
>
> Slaton Lipscomb
> Nogales Lab, Howard Hughes Medical Institute
> http://cryoem.berkeley.edu
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
