[GE users] Re: LAM SGE Integration issues with rocks 4.1

Srividya Valivarthi srividya.v at gmail.com
Wed Jan 18 21:15:58 GMT 2006



On 1/18/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> Am 18.01.2006 um 21:33 schrieb Srividya Valivarthi:
>
> > Thanks so much for your patience.
> >
> >  Have stopped all the deamons using the lamhalt command and
> > attached the log.
> >
>
> Fine!
>
> > Thanks again,
> > Srividya
> > On 1/18/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >> Srividya:
> >>
> >> Am 18.01.2006 um 16:50 schrieb Srividya Valivarthi:
> >>
> >>> Hi,
> >>>
> >>>    Thanks so much for the prompt responses.  I would like to go over
> >>> again the commands that i have used and the error logs more clearly,
> >>> so that i can get some help on this problem.
> >>>
> >>> 1) Firstly i have aliased rsh to ssh. will this cause any issues?
> >>>
> >>
> >> for a loose integration this should work, but not for any of the
> >> qrsh-based setups of the LAM/MPI integration (where rsh will be
> >> caught by SGE and routed to a qrsh command).
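
(A side note: shell aliases are only expanded in interactive shells, so an
rsh-to-ssh alias will not be seen by the PE start script or by lamboot
itself. If rsh really has to be mapped to ssh, it seems safer to tell LAM
directly which remote agent to use. A minimal sketch, assuming the usual
LAM 7.x knobs apply on this installation:

    # per-user default remote agent for LAM
    export LAMRSH="ssh -x"

    # or per invocation, via an SSI parameter of the rsh boot module
    lamboot -v -ssi boot rsh -ssi boot_rsh_agent "ssh -x" hostfile

Either way, passwordless ssh between all nodes is still required.)
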
> >>
> >>> 2) On my first login into the system, I ran the following command
> >>> to have the lamd daemon running on all nodes:
> >>>     # lamboot -v -ssi boot rsh hostfile
> >>>       and the host file contains
> >>>       	medusa.lab.ac.uab.edu cpu=4
> >>> 	compute-0-0.local cpu=4
> >>> 	compute-0-1.local cpu=4
> >>> 	compute-0-2.local cpu=4
> >>> 	compute-0-3.local cpu=4
> >>> 	compute-0-4.local cpu=4
> >>> 	compute-0-5.local cpu=4
> >>> 	compute-0-6.local cpu=4
> >>> 	compute-0-7.local cpu=4
> >>>
> >>
> >> Again: please stop the daemons! Then come back and we'll go to the
> >> next point. - Reuti
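
To double-check that the hand-started universe is really gone before the
next test, something along these lines should do (a sketch; it assumes the
LAM binaries are in the PATH, and lamwipe is only the fallback if lamhalt
cannot reach a node):

    lamhalt                              # orderly shutdown of the running universe
    lamwipe -v hostfile                  # brute-force cleanup with the same hostfile
    ssh compute-0-0 ps -ef | grep lamd   # spot-check a node for a leftover lamd
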
> >>
> >>> 3) Then, on compiling and running the mpihello program with the
> >>> LAM binaries, I get the expected results.
> >>>       [srividya at medusa ~]$ /opt/lam/gnu/bin/mpirun -np 2 /home/srividya/mpihello
> >>> 	Hello World from Node 0.
> >>> 	Hello World from Node 1.
> >>>
>
> This was working outside of SGE - ok.
>
> >>> 4) Now, in order to be able to submit jobs through SGE, I defined
> >>> the PE through qmon as follows:
> >>>      [srividya at medusa ~]$ qconf -sp lam_loose_rsh
> >>> 	pe_name           lam_loose_rsh
> >>> 	slots             4
> >>> 	user_lists        NONE
> >>> 	xuser_lists       NONE
> >>> 	start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh $pe_hostfile
> >>> 	stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
> >>> 	allocation_rule   $round_robin
> >>> 	control_slaves    FALSE
> >>> 	job_is_first_task TRUE
> >>> 	urgency_slots     min
> >>>
> >>>      Have also added this PE to the queue list through qmon.
> >>>
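
For the record, the same setup can also be done from the command line
instead of qmon. A sketch, assuming the cluster queue is the stock all.q
and the PE definition above is saved in a file named lam_loose_rsh.txt
(both names are just examples):

    # create the parallel environment from the saved definition
    qconf -Ap lam_loose_rsh.txt

    # verify it
    qconf -sp lam_loose_rsh

    # add the PE to the queue's pe_list
    qconf -aattr queue pe_list lam_loose_rsh all.q
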
>
> Near the end of the startlam.sh used by lam_loose_rsh are the lines:
>
> ...
> #
> # Extra LAM statement(s)
> #
> if [ -z "`which lamboot 2>/dev/null`" ] ; then
>      export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
> fi
> lamboot $machines
> ...
>
> Please adjust them to reflect the PATH to your LAM installation.
> Sometimes lamboot is already found by a user's default PATH (as set
> in their .bashrc or similar), so this was just a safety net to find
> it.
>
>

The path has already been set earlier.

> >>> 5) Have modified the corresponding startlam.sh as suggested, from
> >>> hostname to hostname.local.
> >>>
>
> Okay, this we need for ROCKS.
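
If I read the Howto's start script correctly, the piece that matters here
is the PeHostfile2MachineFile() helper, which turns SGE's $pe_hostfile
into the machine file handed to lamboot. A rough sketch with the ROCKS
change applied (details of the stock script may differ):

    PeHostfile2MachineFile()
    {
       cat $1 | while read line; do
          # first field of $pe_hostfile is the host, second the slot count
          host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
          nslots=`echo $line | cut -f2 -d" "`
          i=1
          while [ $i -le $nslots ]; do
             # ROCKS compute nodes resolve as <name>.local, hence the suffix
             echo $host.local
             i=`expr $i + 1`
          done
       done
    }
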
>
> Now submit a test job with:
>
> #!/bin/sh
> lamnodes
> exit 0
>
> and request the LAM PE as you did below (with whatever number of
> slots you like). In the .po file you should find just the LAM copyright
> notice twice, and in the .o file a confirmation of the selected nodes.
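
In other words, save those three lines as e.g. lamtest.sh (the name is
just an example) and submit it against the PE:

    qsub -pe lam_loose_rsh 2 lamtest.sh
    # after it has run, inspect both output files
    cat lamtest.sh.o<jobid> lamtest.sh.po<jobid>
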
>
> It might be necessary to put a line like:
>
> export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
>
> in your .profile and/or .bashrc (of course with your actual location
> of the LAM installation).
>
> Once this works, we'll go to the next step. - Reuti
>
>

I get the .o and the .po files as follows:
[srividya at medusa ~]$ cat simple.script.o96
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
n0      compute-0-7.local:1:origin,this_node
n1      compute-0-3.local:1:
[srividya at medusa ~]$ cat simple.script.po96
/opt/gridengine/default/spool/compute-0-7/active_jobs/96.1/pe_hostfile
compute-0-7.local
compute-0-3.local

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University


LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
----
I hope this is fine.

> >>> 6) Now, I have defined the script file as follows:
> >>>       [srividya at medusa ~]$ cat tester1.sh
> >>> 		#!/bin/sh
> >>> 		/opt/lam/gnu/bin/mpirun C /home/srividya/mpihello
> >>>
> >>> 7) On running the script file as follows
> >>>        [srividya at medusa ~]$ qsub -pe lam_loose_rsh 2 tester1.sh
> >>> 		Your job 79 ("tester1.sh") has been submitted.
> >>> 	[srividya at medusa ~]$ qstat
> >>> 	job-ID  prior   name       user     state submit/start at     queue  slots ja-task-ID
> >>> 	---------------------------------------------------------------------------------------
> >>> 	     79 0.00000 tester1.sh srividya qw    01/18/2006 09:37:12             2
> >>>
> >>> 8) And I obtain the following results in tester1.sh.e79:
> >>>
> >>>      [srividya at medusa ~]$ cat tester1.sh.e79
> >>> 	/home/srividya/mpihello: error while loading shared libraries:
> >>> 	liblamf77mpi.so.0: cannot open shared object file: No such file or
> >>> 	directory
> >>> -----------------------------------------------------------------------------
> >>> It seems that [at least] one of the processes that was started with
> >>> mpirun did not invoke MPI_INIT before quitting (it is possible that
> >>> more than one process did not invoke MPI_INIT -- mpirun was only
> >>> notified of the first one, which was on node n0).
> >>>
> >>> mpirun can *only* be used with MPI programs (i.e., programs that
> >>> invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec"
> >>> program
> >>> to run non-MPI programs over the lambooted nodes.
> >>> -----------------------------------------------------------------------------
> >>> /home/srividya/mpihello: error while loading shared libraries:
> >>> liblamf77mpi.so.0: cannot open shared object file: No such file or
> >>> directory
> >>>
> >>> I am not sure why the path information is not being read by
> >>> SGE... The LD_LIBRARY_PATH env variable has the required path...
> >>> Is there something else that I am missing?
> >>>
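
One plausible explanation (not verified on this cluster): a batch job
started by SGE does not necessarily source the same startup files as an
interactive login, so an LD_LIBRARY_PATH set only in .bashrc may never
reach the job on the execution host. A defensive sketch, assuming the LAM
libraries live under /opt/lam/gnu/lib to match the /opt/lam/gnu/bin
binaries used above:

    #!/bin/sh
    #$ -S /bin/sh
    #$ -pe lam_loose_rsh 2
    # make sure the runtime linker can find liblamf77mpi.so.0 and friends
    export LD_LIBRARY_PATH=/opt/lam/gnu/lib:$LD_LIBRARY_PATH
    /opt/lam/gnu/bin/mpirun -np $NSLOTS /home/srividya/mpihello
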
> >>> 9) On changing the script to sge.lam.script as follows (the main
> >>> difference being the LAM_MPI_SOCKET_SUFFIX):
> >>>    #cat sge.lam.script
> >>>     #!/bin/sh
> >>>    #$ -N mpihello
> >>>    #$ -cwd
> >>>    #$ -j y
> >>>    #
> >>>    # pe request for LAM. Set your number of processors here.
> >>>   #$ -pe lam_loose_rsh 2
> >>>   #
> >>>   # Run job through bash shell
> >>>   #$ -S /bin/bash
> >>>   # This MUST be in your LAM run script, otherwise
> >>>   # multiple LAM jobs will NOT RUN
> >>>   export LAM_MPI_SOCKET_SUFFIX=$JOB_ID.$JOB_NAME
> >>>  #
> >>>  # Use full pathname to make sure we are using the right mpirun
> >>> /opt/lam/gnu/bin/mpirun -np $NSLOTS /home/srividya/mpihello
> >>>
> >>> 10) And submitting it to the queue:
> >>>         #qsub sge.lam.script
> >>>
> >>> 11) I obtain the following error message:
> >>>         [srividya at medusa ~]$ cat mpihello.o80
> >>> -----------------------------------------------------------------------------
> >>> It seems that there is no lamd running on the host
> >>> compute-0-6.local.
> >>>
> >>> This indicates that the LAM/MPI runtime environment is not
> >>> operating.
> >>> The LAM/MPI runtime environment is necessary for the "mpirun"
> >>> command.
> >>>
> >>> Please run the "lamboot" command to start the LAM/MPI runtime
> >>> environment.  See the LAM/MPI documentation for how to invoke
> >>> "lamboot" across multiple machines.
> >>> -----------------------------------------------------------------------------
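
A quick way to see whether SGE's start script actually booted a universe
for this job is to look at its .po file (as was done successfully for job
96 above) and to spot-check the node named in the error. A sketch:

    # did startlam.sh run and print the LAM banner for job 80?
    cat mpihello.po80

    # is a lamd for this job running on the node named in the error?
    ssh compute-0-6 ps -ef | grep lamd
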
> >>>
> >>> And this is the message that I was sending out earlier. I am new
> >>> to the SGE-LAM environment, and thanks so much for your patience.
> >>> Any help will be greatly appreciated.
> >>>
> >>> Thanks,
> >>> Srividya
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>>> Am 11.01.2006 um 20:45 schrieb Srividya Valivarthi:
> >>>>
> >>>>> The change in the startlam.sh from
> >>>>> echo host
> >>>>> to
> >>>>> echo host.local
> >>>>>
> >>>>> after stopping and booting the lamuniverse does not seem to solve
> >>>>> the
> >>>>
> >>>> No - stop the LAM universe. Don't boot it by hand! Just start a
> >>>> parallel job like the mpihello.c I mentioned, and post the
> >>>> error/log files of this job. Is your rsh connection also working
> >>>> between the nodes for a passwordless invocation? - Reuti
> >>>>
> >>>>> problem either..
> >>>>>
> >>>>> Thanks again,
> >>>>> Srividya
> >>>>>
> >>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>>>>> Am 11.01.2006 um 19:53 schrieb Srividya Valivarthi:
> >>>>>>
> >>>>>>> The pe is defined as follows:
> >>>>>>>
> >>>>>>> #qconf -sp lam_loose_rsh
> >>>>>>> pe_name           lam_loose_rsh
> >>>>>>> slots             4
> >>>>>>> user_lists        NONE
> >>>>>>> xuser_lists       NONE
> >>>>>>> start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh $pe_hostfile
> >>>>>>> stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
> >>>>>>> allocation_rule   $round_robin
> >>>>>>> control_slaves    FALSE
> >>>>>>> job_is_first_task TRUE
> >>>>>>> urgency_slots     min
> >>>>>>>
> >>>>>>
> >>>>>> Okay, fine. As you use ROCKS, please change the following in
> >>>>>> startlam.sh's PeHostfile2MachineFile():
> >>>>>>
> >>>>>>           echo $host
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>>           echo $host.local
> >>>>>>
> >>>>>> As we don't have ROCKS here, I don't know whether this is
> >>>>>> necessary. Then just try as outlined in the Howto with the
> >>>>>> included mpihello.c, just to test the distribution to the nodes
> >>>>>> (after shutting down the started LAM universe). - Reuti
> >>>>>>
> >>>>>>
> >>>>>>> Thanks so much,
> >>>>>>> Srividya
> >>>>>>>
> >>>>>>> On 1/11/06, Srividya Valivarthi <srividya.v at gmail.com> wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>    I did define the PE for loose rsh using qmon, and also
> >>>>>>>> added this PE to the queue list using the queue manager
> >>>>>>>> provided by qmon.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Srividya
> >>>>>>>>
> >>>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>>>>>>>> Hi again.
> >>>>>>>>>
> >>>>>>>>> Am 11.01.2006 um 19:34 schrieb Srividya Valivarthi:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>>    Thanks for your prompt response. I am sorry if I was not
> >>>>>>>>>> clear in the earlier mail. I did not start the lamd daemons
> >>>>>>>>>> by hand prior to submitting the job. What I was trying to
> >>>>>>>>>> convey was that the lamd daemons are running on the compute
> >>>>>>>>>> nodes, possibly started by SGE itself, but somehow are not
> >>>>>>>>>> registered with LAM/MPI?
> >>>>>>>>>>
> >>>>>>>>>>     And also the hostfile that is used during lamboot
> >>>>>>>>>> #lamboot -v -ssi boot rsh hostfile
> >>>>>>>>>
> >>>>>>>>> lamboot will start the daemons, which isn't necessary. Also
> >>>>>>>>> with a loose integration, SGE will start the daemons on its
> >>>>>>>>> own (just by rsh, in contrast to qrsh with a Tight
> >>>>>>>>> Integration).
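
For contrast, a tight integration hands the slave tasks over to SGE's
control. A rough sketch only of how the PE would differ; the exact
start/stop scripts and settings should be taken from the Howto, and the
placeholders below are just that:

    pe_name           lam_tight_qrsh
    slots             32
    start_proc_args   <tight-integration startlam.sh from the Howto> $pe_hostfile
    stop_proc_args    <tight-integration stoplam.sh from the Howto>
    allocation_rule   $round_robin
    control_slaves    TRUE    # slave processes are started via qrsh under SGE control
    job_is_first_task FALSE   # see the Howto for the exact value to use
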
> >>>>>>>>>
> >>>>>>>>> LAM/MPI is in some way SGE-aware, and will look for some
> >>>>>>>>> special information in the SGE-created directories on all
> >>>>>>>>> the slave nodes.
> >>>>>>>>>
> >>>>>>>>> But anyway: how did you define the PE - loose with rsh or
> >>>>>>>>> qrsh? -
> >>>>>>>>> Reuti
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> is as follows, which already has the .local suffix:
> >>>>>>>>>> medusa.lab.ac.uab.edu cpu=4
> >>>>>>>>>> compute-0-0.local cpu=4
> >>>>>>>>>> compute-0-1.local cpu=4
> >>>>>>>>>> compute-0-2.local cpu=4
> >>>>>>>>>> compute-0-3.local cpu=4
> >>>>>>>>>> compute-0-4.local cpu=4
> >>>>>>>>>> compute-0-5.local cpu=4
> >>>>>>>>>> compute-0-6.local cpu=4
> >>>>>>>>>> compute-0-7.local cpu=4
> >>>>>>>>>>
> >>>>>>>>>> Any further ideas to solve this issue will be very helpful.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Srividya
> >>>>>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> Am 11.01.2006 um 18:55 schrieb Srividya Valivarthi:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>>     I am working with a Pentium III ROCKS cluster which
> >>>>>>>>>>>> has LAM/MPI version 7.1.1 and SGE version 6.0. I am trying
> >>>>>>>>>>>> to get the loose integration mechanism with rsh working
> >>>>>>>>>>>> with SGE and LAM as suggested by the following post on
> >>>>>>>>>>>> this mailing list:
> >>>>>>>>>>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
> >>>>>>>>>>>>
> >>>>>>>>>>>> However, on submitting the jobs to the queue, I get the
> >>>>>>>>>>>> following error message:
> >>>>>>>>>>>> -----------------------------------------------------------------------------
> >>>>>>>>>>>> It seems that there is no lamd running on the host
> >>>>>>>>>>>> compute-0-5.local.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This indicates that the LAM/MPI runtime environment is not
> >>>>>>>>>>>> operating.
> >>>>>>>>>>>> The LAM/MPI runtime environment is necessary for the
> >>>>>>>>>>>> "mpirun"
> >>>>>>>>>>>> command.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Please run the "lamboot" command to start the LAM/MPI
> >>>>>>>>>>>> runtime environment.  See the LAM/MPI documentation for
> >>>>>>>>>>>> how to invoke "lamboot" across multiple machines.
> >>>>>>>>>>>> -----------------------------------------------------------------------------
> >>>>>>>>>>>> But the lamnodes command shows all the nodes on the
> >>>>>>>>>>>> system, and I can also see the lamd daemon running on the
> >>>>>>>>>>>> local compute nodes.  Any ideas on what the issue could be
> >>>>>>>>>>>> are greatly appreciated.
> >>>>>>>>>>>
> >>>>>>>>>>> there is no need to start up any daemon by hand beforehand.
> >>>>>>>>>>> In fact, it will not work. SGE takes care of starting a
> >>>>>>>>>>> private daemon for each job on all the nodes selected for
> >>>>>>>>>>> this particular job.
> >>>>>>>>>>>
> >>>>>>>>>>> One issue with ROCKS might be similar to this (change the
> >>>>>>>>>>> start script to include .local for the nodes in the
> >>>>>>>>>>> "machines" file):
> >>>>>>>>>>>
> >>>>>>>>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14170
> >>>>>>>>>>>
> >>>>>>>>>>> Just let me know whether it worked after adjusting the
> >>>>>>>>>>> start script.
> >>>>>>>>>>>
> >>>>>>>>>>> -- Reuti
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Srividya
> >>>>>>>>>>>>
> >> <logsge-lam.txt>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list