[GE users] Re: LAM SGE Integration issues with rocks 4.1

Srividya Valivarthi srividya.v at gmail.com
Wed Jan 18 15:50:40 GMT 2006



Hi,

   Thanks so much for the prompt responses.  I would like to go over
the commands I have used and the error logs once more, in more detail,
so that I can get some help with this problem.

1) First, I have aliased rsh to ssh. Will this cause any issues?
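   As far as I understand it, a shell alias is only seen by interactive
   shells, so a program that exec()s rsh directly would bypass it. For
   reference, this is roughly how I believe LAM 7.x itself can be pointed
   at ssh instead (the boot_rsh_agent SSI parameter name and the LAMRSH
   variable are from memory, not verified against the docs):

      # tell LAM's rsh boot module to use ssh (assumed parameter name)
      lamboot -v -ssi boot rsh -ssi boot_rsh_agent "ssh -x" hostfile
      # or the older environment-variable route
      export LAMRSH="ssh -x"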

2) On my first login to the system, I ran the following command to get
the lamd daemon running on all nodes (a quick verification sketch
follows the host file below):
    # lamboot -v -ssi boot rsh hostfile
      and the host file contains:
      	medusa.lab.ac.uab.edu cpu=4
	compute-0-0.local cpu=4
	compute-0-1.local cpu=4
	compute-0-2.local cpu=4
	compute-0-3.local cpu=4
	compute-0-4.local cpu=4
	compute-0-5.local cpu=4
	compute-0-6.local cpu=4
	compute-0-7.local cpu=4
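
   For completeness, this is the quick sanity check I do by hand after
   lamboot (lamnodes is also mentioned further down in this thread; the
   exact output format may differ slightly from what I remember):

      lamboot -v -ssi boot rsh hostfile
      lamnodes        # should list all nine hosts with their CPU counts
      lamhalt         # shut the hand-booted universe down again before
                      # testing through SGE, as Reuti suggested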

3) Compiling and running the mpihello program with the LAM binaries
then gives the expected results:
      [srividya at medusa ~]$ /opt/lam/gnu/bin/mpirun -np 2 /home/srividya/mpihello
	Hello World from Node 0.
	Hello World from Node 1.

4) Now, in order to be able to submit jobs through SGE, I defined the
PE through qmon as follows:
     [srividya at medusa ~]$ qconf -sp lam_loose_rsh
	pe_name           lam_loose_rsh
	slots             4
	user_lists        NONE
	xuser_lists       NONE
	start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh \
         			$pe_hostfile
	stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
	allocation_rule   $round_robin
	control_slaves    FALSE
	job_is_first_task TRUE
	urgency_slots     min

     I have also added this PE to the queue's PE list through qmon.
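
     For what it's worth, this is how I double-checked that the PE is
     actually attached to the queue (the queue name "all.q" is just the
     ROCKS/SGE default here; substitute the real queue name):

        qconf -spl                       # list all parallel environments
        qconf -sq all.q | grep pe_list   # should include lam_loose_rsh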

5) I have also modified the corresponding startlam.sh as suggested,
changing hostname to hostname.local (see the fragment below).
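
   The relevant part of startlam.sh now looks roughly like this; the loop
   is the stock PeHostfile2MachineFile() from the howto (reproduced from
   memory), and only the ".local" suffix on the echo line is my change:

      PeHostfile2MachineFile()
      {
         cat $1 | while read line; do
            host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
            nslots=`echo $line | cut -f2 -d" "`
            i=1
            while [ $i -le $nslots ]; do
               echo $host.local        # was: echo $host
               i=`expr $i + 1`
            done
         done
      }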

6) Next, I defined the job script as follows:
      [srividya at medusa ~]$ cat tester1.sh
		#!/bin/sh
		/opt/lam/gnu/bin/mpirun C /home/srividya/mpihello

7) I submitted the script as follows:
       [srividya at medusa ~]$ qsub -pe lam_loose_rsh 2 tester1.sh
		Your job 79 ("tester1.sh") has been submitted.
	[srividya at medusa ~]$ qstat
	job-ID  prior    name        user       state  submit/start at       queue    slots  ja-task-ID
	------------------------------------------------------------------------------------------------
	    79  0.00000  tester1.sh  srividya   qw     01/18/2006 09:37:12              2

8) And I obtain the following in tester1.sh.e79:

     [srividya at medusa ~]$ cat tester1.sh.e79
	/home/srividya/mpihello: error while loading shared libraries:
	liblamf77mpi.so.0: cannot open shared object file: No such file or directory
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
/home/srividya/mpihello: error while loading shared libraries:
liblamf77mpi.so.0: cannot open shared object file: No such file or directory

I am not sure why the library path information is not being picked up
when the job runs under SGE. The LD_LIBRARY_PATH environment variable
contains the required path in my interactive shell. Is there something
else that I am missing?
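
What I plan to try next, in case batch jobs simply do not inherit my
interactive environment (the /opt/lam/gnu/lib directory is only my guess,
based on the /opt/lam/gnu/bin path above):

     ldd /home/srividya/mpihello   # see which shared libraries fail to resolve

     # and in the job script, either forward the whole submission environment
     #$ -V
     # or set the library path explicitly
     export LD_LIBRARY_PATH=/opt/lam/gnu/lib:${LD_LIBRARY_PATH}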

9) I then changed the script to sge.lam.script as follows, the main
additions being the SGE directives and the LAM_MPI_SOCKET_SUFFIX setting:
   #cat sge.lam.script
   #!/bin/sh
   #$ -N mpihello
   #$ -cwd
   #$ -j y
   #
   # pe request for LAM. Set your number of processors here.
   #$ -pe lam_loose_rsh 2
   #
   # Run job through bash shell
   #$ -S /bin/bash
   # This MUST be in your LAM run script, otherwise
   # multiple LAM jobs will NOT RUN
   export LAM_MPI_SOCKET_SUFFIX=$JOB_ID.$JOB_NAME
   #
   # Use full pathname to make sure we are using the right mpirun
   /opt/lam/gnu/bin/mpirun -np $NSLOTS /home/srividya/mpihello

10) I then submitted it to the queue:
        #qsub sge.lam.script

11) I obtain the following error message:
        [srividya at medusa ~]$ cat mpihello.o80
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host compute-0-6.local.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "mpirun" command.

Please run the "lamboot" command the start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
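
For my own understanding (please correct me if this is wrong): with this
loose integration the lamboot from step 2 does not help the job at all,
because SGE is supposed to boot a per-job LAM universe itself via the PE
start script. Very roughly, I would expect startlam.sh and stoplam.sh to
do something like the following (a sketch only, details from the howto as
I remember them):

     # startlam.sh (fragment): build a machine file from the host list
     # SGE passes in ($1 is the $pe_hostfile) and boot LAM on those nodes
     machines=$TMPDIR/machines
     PeHostfile2MachineFile $1 > $machines
     lamboot -ssi boot rsh $machines

     # stoplam.sh (fragment): tear the per-job universe down again
     lamhalt || lamwipe $TMPDIR/machines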

This is the error message that I was referring to in my earlier mail. I
am new to the SGE-LAM environment; thanks so much for your patience. Any
help will be greatly appreciated.

Thanks,
Srividya





On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> Am 11.01.2006 um 20:45 schrieb Srividya Valivarthi:
>
> > The change in the startlam.sh from
> > echo host
> > to
> > echo host.local
> >
> > after stopping and booting the lamuniverse does not seem to solve the
>
> No - stop the lamuniverse. Don't boot it by hand! Just start a
> parallel job (e.g. the mpihello.c I mentioned), and post the error/log-
> files of this job. Is your rsh connection between the nodes also
> working for a passwordless invocation? - Reuti
>
> > problem either..
> >
> > Thanks again,
> > Srividya
> >
> > On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >> Am 11.01.2006 um 19:53 schrieb Srividya Valivarthi:
> >>
> >>> The pe is defined as follows:
> >>>
> >>> #qconf -sp lam_loose_rsh
> >>> pe_name           lam_loose_rsh
> >>> slots             4
> >>> user_lists        NONE
> >>> xuser_lists       NONE
> >>> start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh \
> >>>                   $pe_hostfile
> >>> stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
> >>> allocation_rule   $round_robin
> >>> control_slaves    FALSE
> >>> job_is_first_task TRUE
> >>> urgency_slots     min
> >>>
> >>
> >> Okay, fine. As you use ROCKS, please change in the startlam.sh in
> >> PeHostfile2MachineFile():
> >>
> >>           echo $host
> >>
> >> to
> >>
> >>           echo $host.local
> >>
> >> As we have no ROCKS, I don't know whether this is necessary. Then
> >> just try as outlined in the Howto with the included mpihello.c, just
> >> to test the distribution to the nodes (after shutting down the
> >> started LAM universe). - Reuti
> >>
> >>
> >>> Thanks so much,
> >>> Srividya
> >>>
> >>> On 1/11/06, Srividya Valivarthi <srividya.v at gmail.com> wrote:
> >>>> Hi,
> >>>>
> >>>>    I did define the pe for loose rsh using qmon. and also added
> >>>> this
> >>>> pe to the queue list using the queue manager provided by qmon.
> >>>>
> >>>> Thanks,
> >>>> Srividya
> >>>>
> >>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>>>> Hi again.
> >>>>>
> >>>>> Am 11.01.2006 um 19:34 schrieb Srividya Valivarthi:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>    Thanks for your prompt response. I am sorry if i was not
> >>>>>> clear on
> >>>>>> the earlier mail. I did not  start the lamd deamons prior to
> >>>>>> submitting the job by hand. What I was trying to convey was that
> >>>>>> the
> >>>>>> lamd deamons are running on the compute nodes possibly started
> >>>>>> by SGE
> >>>>>> itself, but somehow is not registered with LAM/MPI??!!
> >>>>>>
> >>>>>>     And also the hostfile that is used during lamboot
> >>>>>> #lamboot -v -ssi boot rsh hostfile
> >>>>>
> >>>>> lamboot will start the daemons, which isn't necessary. Also with a
> >>>>> loose integration, SGE will start the daemons on its own (just by
> >>>>> rsh
> >>>>> in contrast to qrsh with a Tight Integration).
> >>>>>
> >>>>> LAM/MPI is in some way SGE aware, and will look for some special
> >>>>> information in the SGE created directories on all the slave nodes.
> >>>>>
> >>>>> But anyway: how did you define the PE - loose with rsh or qrsh? -
> >>>>> Reuti
> >>>>>
> >>>>>
> >>>>>> is as follows, which already had the .local suffix as
> >>>>>> medusa.lab.ac.uab.edu cpu=4
> >>>>>> compute-0-0.local cpu=4
> >>>>>> compute-0-1.local cpu=4
> >>>>>> compute-0-2.local cpu=4
> >>>>>> compute-0-3.local cpu=4
> >>>>>> compute-0-4.local cpu=4
> >>>>>> compute-0-5.local cpu=4
> >>>>>> compute-0-6.local cpu=4
> >>>>>> compute-0-7.local cpu=4
> >>>>>>
> >>>>>> Any further ideas to solve this issue will be very helpful.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Srividya
> >>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Am 11.01.2006 um 18:55 schrieb Srividya Valivarthi:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>     I am working with a pentium III rocks cluster which has
> >>>>>>>> LAM/MPI
> >>>>>>>> version 7.1.1 and SGE version 6.0. I am trying to get the loose
> >>>>>>>> integration mechanism with rsh working with SGE and LAM as
> >>>>>>>> suggested
> >>>>>>>> by the following post on this mailing list
> >>>>>>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
> >>>>>>>>
> >>>>>>>> However, on submitting the jobs to the queue, i get the
> >>>>>>>> following
> >>>>>>>> error message
> >>>>>>>> -----------------------------------------------------------------------------
> >>>>>>>> It seems that there is no lamd running on the host
> >>>>>>>> compute-0-5.local.
> >>>>>>>>
> >>>>>>>> This indicates that the LAM/MPI runtime environment is not
> >>>>>>>> operating.
> >>>>>>>> The LAM/MPI runtime environment is necessary for the "mpirun"
> >>>>>>>> command.
> >>>>>>>>
> >>>>>>>> Please run the "lamboot" command the start the LAM/MPI runtime
> >>>>>>>> environment.  See the LAM/MPI documentation for how to invoke
> >>>>>>>> "lamboot" across multiple machines.
> >>>>>>>> -----------------------------------------------------------------------------
> >>>>>>>> But, lamnodes  command shows all the nodes on the system and i
> >>>>>>>> can
> >>>>>>>> also see the lamd deamon running on the local compute
> >>>>>>>> nodes.  Any
> >>>>>>>> ideas on the what the issue could be are greatly appreciated.
> >>>>>>>
> >>>>>>> there is no need to startup any daemon on your own by hand
> >>>>>>> before. In
> >>>>>>> fact, it will not work. SGE takes care of starting a private
> >>>>>>> daemon
> >>>>>>> for each job on all the selected nodes for this particular job.
> >>>>>>>
> >>>>>>> One issue with ROCKS might be similar to this (change the
> >>>>>>> startscript
> >>>>>>> to include .local for the nodes in the "machines"-file):
> >>>>>>>
> >>>>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14170
> >>>>>>>
> >>>>>>> Just let me know, whether it worked after adjusting the start
> >>>>>>> script.
> >>>>>>>
> >>>>>>> -- Reuti
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Srividya
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



