[GE users] Re: LAM SGE Integration issues with rocks 4.1

Srividya Valivarthi srividya.v at gmail.com
Wed Jan 18 20:33:52 GMT 2006



Thanks so much for your patience.

 I have stopped all the daemons with the lamhalt command and attached the log.
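
For reference, the shutdown amounts to something like this on the head
node (lamwipe being the fallback I understand is available if lamhalt
cannot reach a node):

	# halt the LAM universe that was booted by hand
	lamhalt -v
	# if lamhalt hangs, lamwipe with the same boot schema cleans up
	lamwipe -v hostfile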

Thanks again,
Srividya
On 1/18/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> Srividya:
>
> Am 18.01.2006 um 16:50 schrieb Srividya Valivarthi:
>
> > Hi,
> >
> >    Thanks so much for the prompt responses. I would like to go over
> > the commands I have used and the error logs once more, in more detail,
> > so that I can get some help with this problem.
> >
> > 1) First, I have aliased rsh to ssh. Will this cause any issues?
> >
>
> for a loose integration this should work, but not for any of the
> qrsh-based setups of the LAM/MPI integration (where rsh will be
> caught by SGE and routed to a qrsh command).
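>
> (Side note, only as an illustration: instead of aliasing rsh you can
> point LAM at ssh directly, either through the environment or via the
> boot SSI parameter, e.g.
>
>      export LAMRSH="ssh -x"
>      # or, equivalently, on the lamboot command line:
>      lamboot -ssi boot rsh -ssi boot_rsh_agent "ssh -x" hostfile
>
> though for the loose integration the daemons shouldn't be booted by
> hand at all.)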
>
> > 2) On my first login to the system I ran the following command to get
> > the lamd daemon running on all nodes:
> >     # lamboot -v -ssi boot rsh hostfile
> >       and the host file contains
> >       	medusa.lab.ac.uab.edu cpu=4
> > 	compute-0-0.local cpu=4
> > 	compute-0-1.local cpu=4
> > 	compute-0-2.local cpu=4
> > 	compute-0-3.local cpu=4
> > 	compute-0-4.local cpu=4
> > 	compute-0-5.local cpu=4
> > 	compute-0-6.local cpu=4
> > 	compute-0-7.local cpu=4
> >
>
> Again: please stop the daemons! Then come back and we'll go to the
> next point. - Reuti
>
> > 3) Then, on compiling and running the mpihello program with the LAM
> > binaries, I get the expected results:
> >       [srividya at medusa ~]$ /opt/lam/gnu/bin/mpirun -np 2 /home/srividya/mpihello
> > 	Hello World from Node 0.
> > 	Hello World from Node 1.
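> >
> > (For completeness: mpihello is the usual MPI hello-world, compiled
> > with the LAM wrapper compiler along these lines - the compiler path
> > is only assumed from where mpirun lives:
> >
> > 	/opt/lam/gnu/bin/mpicc -o /home/srividya/mpihello mpihello.c
> > )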
> >
> > 4) Now, in order to be able to submit jobs through SGE, I defined the
> > PE through qmon as follows:
> >      [srividya at medusa ~]$ qconf -sp lam_loose_rsh
> > 	pe_name           lam_loose_rsh
> > 	slots             4
> > 	user_lists        NONE
> > 	xuser_lists       NONE
> > 	start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh $pe_hostfile
> > 	stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
> > 	allocation_rule   $round_robin
> > 	control_slaves    FALSE
> > 	job_is_first_task TRUE
> > 	urgency_slots     min
> >
> >      I have also added this PE to the queue list through qmon.
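> >
> >      (I think the command-line equivalent would be roughly the
> >      following - "all.q" is just my guess at the cluster queue name,
> >      i.e. whatever queue the jobs actually run in:
> >
> >      	qconf -ap lam_loose_rsh                        # create the PE
> >      	qconf -aattr queue pe_list lam_loose_rsh all.q # attach it to the queue
> >      )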
> >
> > 5) I have modified the corresponding startlam.sh as suggested, from
> > hostname to hostname.local.
> >
> > 6) Now, I have defined the script file as follows:
> >       [srividya at medusa ~]$ cat tester1.sh
> > 		#!/bin/sh
> > 		/opt/lam/gnu/bin/mpirun C /home/srividya/mpihello
> >
> > 7) On running the script file as follows:
> >        [srividya at medusa ~]$ qsub -pe lam_loose_rsh 2 tester1.sh
> > 		Your job 79 ("tester1.sh") has been submitted.
> > 	[srividya at medusa ~]$ qstat
> > 	job-ID  prior    name        user      state  submit/start at       queue  slots  ja-task-ID
> > 	---------------------------------------------------------------------------------------------
> > 	     79 0.00000  tester1.sh  srividya  qw     01/18/2006 09:37:12          2
> >
> > 8) And I obtain the following results in tester1.sh.e79:
> >
> >      [srividya at medusa ~]$ cat tester1.sh.e79
> > 	/home/srividya/mpihello: error while loading shared libraries:
> > 	liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> > -----------------------------------------------------------------------------
> > It seems that [at least] one of the processes that was started with
> > mpirun did not invoke MPI_INIT before quitting (it is possible that
> > more than one process did not invoke MPI_INIT -- mpirun was only
> > notified of the first one, which was on node n0).
> >
> > mpirun can *only* be used with MPI programs (i.e., programs that
> > invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
> > to run non-MPI programs over the lambooted nodes.
> > -----------------------------------------------------------------------------
> > /home/srividya/mpihello: error while loading shared libraries:
> > liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> >
> > I am not sure why the path information is not being picked up under
> > SGE... the LD_LIBRARY_PATH env variable has the required path. Is
> > there something else that I am missing?
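> >
> > (One thing I plan to try is exporting the path explicitly inside the
> > job script, in case the batch job does not inherit my interactive
> > environment - the lib directory below is only my guess from the LAM
> > install prefix:
> >
> > 	export LD_LIBRARY_PATH=/opt/lam/gnu/lib:$LD_LIBRARY_PATH
> >
> > or, alternatively, passing the whole environment along with "qsub -V".)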
> >
> > 9) On changing the script to sge.lam.script as follows (the only
> > difference being the LAM_MPI_SOCKET_SUFFIX):
> >    # cat sge.lam.script
> >    #!/bin/sh
> >    #$ -N mpihello
> >    #$ -cwd
> >    #$ -j y
> >    #
> >    # pe request for LAM. Set your number of processors here.
> >    #$ -pe lam_loose_rsh 2
> >    #
> >    # Run job through bash shell
> >    #$ -S /bin/bash
> >    # This MUST be in your LAM run script, otherwise
> >    # multiple LAM jobs will NOT RUN
> >    export LAM_MPI_SOCKET_SUFFIX=$JOB_ID.$JOB_NAME
> >    #
> >    # Use full pathname to make sure we are using the right mpirun
> >    /opt/lam/gnu/bin/mpirun -np $NSLOTS /home/srividya/mpihello
> >
> > 10) and submitting it to the queue:
> >         # qsub sge.lam.script
> >
> > 11) I obtain the following error message:
> >         [srividya at medusa ~]$ cat mpihello.o80
> > -----------------------------------------------------------------------------
> > It seems that there is no lamd running on the host compute-0-6.local.
> >
> > This indicates that the LAM/MPI runtime environment is not operating.
> > The LAM/MPI runtime environment is necessary for the "mpirun" command.
> >
> > Please run the "lamboot" command to start the LAM/MPI runtime
> > environment.  See the LAM/MPI documentation for how to invoke
> > "lamboot" across multiple machines.
> > -----------------------------------------------------------------------------
> >
> > This is the message I was sending out earlier. I am new to the
> > SGE/LAM environment, so thanks so much for your patience. Any help
> > will be greatly appreciated.
> >
> > Thanks,
> > Srividya
> >
> >
> >
> >
> >
> > On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >> Am 11.01.2006 um 20:45 schrieb Srividya Valivarthi:
> >>
> >>> The change in startlam.sh from
> >>> echo $host
> >>> to
> >>> echo $host.local
> >>>
> >>> after stopping and rebooting the LAM universe does not seem to solve
> >>> the
> >>
> >> No - stop the LAM universe. Don't boot it by hand! Just start a
> >> parallel job with the mpihello.c I mentioned, and post the error/log
> >> files of this job. Is your rsh connection between the nodes also
> >> working for passwordless invocation? - Reuti
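> >>
> >> (A quick check from the head node, e.g.
> >>
> >>      rsh compute-0-0.local hostname
> >>
> >> should print the node name without any password prompt.)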
> >>
> >>> problem either..
> >>>
> >>> Thanks again,
> >>> Srividya
> >>>
> >>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>>> Am 11.01.2006 um 19:53 schrieb Srividya Valivarthi:
> >>>>
> >>>>> The pe is defined as follows:
> >>>>>
> >>>>> #qconf -sp lam_loose_rsh
> >>>>> pe_name           lam_loose_rsh
> >>>>> slots             4
> >>>>> user_lists        NONE
> >>>>> xuser_lists       NONE
> >>>>> start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh $pe_hostfile
> >>>>> stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
> >>>>> allocation_rule   $round_robin
> >>>>> control_slaves    FALSE
> >>>>> job_is_first_task TRUE
> >>>>> urgency_slots     min
> >>>>>
> >>>>
> >>>> Okay, fine. As you use ROCKS, please change, in startlam.sh's
> >>>> PeHostfile2MachineFile():
> >>>>
> >>>>           echo $host
> >>>>
> >>>> to
> >>>>
> >>>>           echo $host.local
> >>>>
> >>>> As we don't have ROCKS here, I don't know whether this is
> >>>> necessary. Then just try the included mpihello.c as outlined in
> >>>> the Howto, just to test the distribution to the nodes (after
> >>>> shutting down the LAM universe you started). - Reuti
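> >>>>
> >>>> (If your startlam.sh still follows the stock pattern from the
> >>>> Howto, the whole function would then look roughly like this - only
> >>>> a sketch, the surrounding script may differ in details:
> >>>>
> >>>> PeHostfile2MachineFile()
> >>>> {
> >>>>    cat $1 | while read line; do
> >>>>       # first field of $pe_hostfile is the host, second the slot count
> >>>>       host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
> >>>>       nslots=`echo $line|cut -f2 -d" "`
> >>>>       i=1
> >>>>       while [ $i -le $nslots ]; do
> >>>>          echo $host.local        # was: echo $host
> >>>>          i=`expr $i + 1`
> >>>>       done
> >>>>    done
> >>>> }
> >>>>
> >>>> The cut on "." strips any domain part, which would explain why the
> >>>> .local suffix has to be re-appended on ROCKS.)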
> >>>>
> >>>>
> >>>>> Thanks so much,
> >>>>> Srividya
> >>>>>
> >>>>> On 1/11/06, Srividya Valivarthi <srividya.v at gmail.com> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>>    I did define the PE for loose rsh using qmon, and I also
> >>>>>> added this PE to the queue list using the queue manager provided
> >>>>>> by qmon.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Srividya
> >>>>>>
> >>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>>>>>> Hi again.
> >>>>>>>
> >>>>>>> Am 11.01.2006 um 19:34 schrieb Srividya Valivarthi:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>    Thanks for your prompt response. I am sorry if I was not
> >>>>>>>> clear in the earlier mail. I did not start the lamd daemons by
> >>>>>>>> hand prior to submitting the job. What I was trying to convey
> >>>>>>>> was that the lamd daemons are running on the compute nodes,
> >>>>>>>> possibly started by SGE itself, but somehow they are not
> >>>>>>>> registered with LAM/MPI??!!
> >>>>>>>>
> >>>>>>>>     And also the hostfile that is used during lamboot
> >>>>>>>> #lamboot -v -ssi boot rsh hostfile
> >>>>>>>
> >>>>>>> lamboot will start the daemons, which isn't necessary. Also with
> >>>>>>> a loose integration, SGE will start the daemons on its own (just
> >>>>>>> by rsh, in contrast to qrsh with a Tight Integration).
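> >>>>>>>
> >>>>>>> (Concretely, in the loose setup the start_proc_args script builds
> >>>>>>> a machine file from $pe_hostfile and boots the per-job universe
> >>>>>>> itself, along these lines - the machine-file name below is only
> >>>>>>> indicative, your startlam.sh may call it something else:
> >>>>>>>
> >>>>>>>     lamboot -v -ssi boot rsh $TMPDIR/machines
> >>>>>>> )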
> >>>>>>>
> >>>>>>> LAM/MPI is in some way SGE-aware, and will look for some special
> >>>>>>> information in the SGE-created directories on all the slave
> >>>>>>> nodes.
> >>>>>>>
> >>>>>>> But anyway: how did you define the PE - loose with rsh or qrsh?
> >>>>>>> - Reuti
> >>>>>>>
> >>>>>>>
> >>>>>>>> is as follows, and it already has the .local suffix:
> >>>>>>>> medusa.lab.ac.uab.edu cpu=4
> >>>>>>>> compute-0-0.local cpu=4
> >>>>>>>> compute-0-1.local cpu=4
> >>>>>>>> compute-0-2.local cpu=4
> >>>>>>>> compute-0-3.local cpu=4
> >>>>>>>> compute-0-4.local cpu=4
> >>>>>>>> compute-0-5.local cpu=4
> >>>>>>>> compute-0-6.local cpu=4
> >>>>>>>> compute-0-7.local cpu=4
> >>>>>>>>
> >>>>>>>> Any further ideas to solve this issue will be very helpful.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Srividya
> >>>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Am 11.01.2006 um 18:55 schrieb Srividya Valivarthi:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>>     I am working with a Pentium III ROCKS cluster which has
> >>>>>>>>>> LAM/MPI version 7.1.1 and SGE version 6.0. I am trying to get
> >>>>>>>>>> the loose integration mechanism with rsh working with SGE and
> >>>>>>>>>> LAM, as suggested by the following howto from this mailing
> >>>>>>>>>> list:
> >>>>>>>>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
> >>>>>>>>>>
> >>>>>>>>>> However, on submitting the jobs to the queue, I get the
> >>>>>>>>>> following error message:
> >>>>>>>>>> ----------------------------------------------------------------
> >>>>>>>>>> It seems that there is no lamd running on the host
> >>>>>>>>>> compute-0-5.local.
> >>>>>>>>>>
> >>>>>>>>>> This indicates that the LAM/MPI runtime environment is not
> >>>>>>>>>> operating.
> >>>>>>>>>> The LAM/MPI runtime environment is necessary for the "mpirun"
> >>>>>>>>>> command.
> >>>>>>>>>>
> >>>>>>>>>> Please run the "lamboot" command to start the LAM/MPI
> >>>>>>>>>> runtime
> >>>>>>>>>> environment.  See the LAM/MPI documentation for how to invoke
> >>>>>>>>>> "lamboot" across multiple machines.
> >>>>>>>>>> ----------------------------------------------------------------
> >>>>>>>>>> But the lamnodes command shows all the nodes in the system,
> >>>>>>>>>> and I can also see the lamd daemon running on the compute
> >>>>>>>>>> nodes. Any ideas on what the issue could be are greatly
> >>>>>>>>>> appreciated.
> >>>>>>>>>
> >>>>>>>>> there is no need to start up any daemon by hand beforehand. In
> >>>>>>>>> fact, it will not work. SGE takes care of starting a private
> >>>>>>>>> daemon for each job on all the nodes selected for this
> >>>>>>>>> particular job.
> >>>>>>>>>
> >>>>>>>>> One issue with ROCKS might be similar to this one (change the
> >>>>>>>>> start script to include .local for the nodes in the
> >>>>>>>>> "machines" file):
> >>>>>>>>>
> >>>>>>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14170
> >>>>>>>>>
> >>>>>>>>> Just let me know whether it worked after adjusting the start
> >>>>>>>>> script.
> >>>>>>>>>
> >>>>>>>>> -- Reuti
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Srividya
> >>>>>>>>>>


    [ Part 2, Text/PLAIN (Name: "logsge-lam.txt") 228 lines. ]
    [ Unable to print this part. ]


    [ Part 3: "Attached Text" ]

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



More information about the gridengine-users mailing list