[GE users] Re: LAM SGE Integration issues with rocks 4.1

Srividya Valivarthi srividya.v at gmail.com
Wed Jan 11 19:45:29 GMT 2006


Changing the line in startlam.sh from

          echo $host

to

          echo $host.local

(and stopping and re-booting the LAM universe afterwards) does not seem to
solve the problem either.
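
For reference, the adjusted function now looks roughly as below; this is only
a sketch following the usual PeHostfile2MachineFile() loop from the Howto's
start script, with nothing changed apart from the echo line:

          PeHostfile2MachineFile()
          {
              # loop over the $pe_hostfile that SGE passes in as $1
              cat $1 | while read line; do
                  host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
                  nslots=`echo $line | cut -f2 -d" "`
                  i=1
                  while [ $i -le $nslots ]; do
                      # emit the ROCKS-internal node name, hence the .local suffix
                      echo $host.local
                      i=`expr $i + 1`
                  done
              done
          }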

Thanks again,
Srividya

On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> Am 11.01.2006 um 19:53 schrieb Srividya Valivarthi:
>
> > The pe is defined as follows:
> >
> > #qconf -sp lam_loose_rsh
> > pe_name           lam_loose_rsh
> > slots             4
> > user_lists        NONE
> > xuser_lists       NONE
> > start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh \
> >                   $pe_hostfile
> > stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
> > allocation_rule   $round_robin
> > control_slaves    FALSE
> > job_is_first_task TRUE
> > urgency_slots     min
> >
>
> Okay, fine. As you use ROCKS, please change, in startlam.sh's
> PeHostfile2MachineFile():
>
>           echo $host
>
> to
>
>           echo $host.local
>
> As we have no ROCKS here, I don't know whether this is necessary. Then
> try it as outlined in the Howto with the included mpihello.c, just to
> test the distribution to the nodes (after shutting down the already
> started LAM universe). - Reuti
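>
> For illustration only (not taken verbatim from the Howto), such a test
> job could be submitted with a script roughly like the following, assuming
> mpihello.c was compiled with LAM's mpicc into ./mpihello:
>
>           #!/bin/sh
>           # request the loose-rsh PE with 4 slots; run from the current dir
>           #$ -pe lam_loose_rsh 4
>           #$ -cwd
>           # LAM's mpirun: "C" schedules one process per granted CPU
>           mpirun C ./mpihello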
>
>
> > Thanks so much,
> > Srividya
> >
> > On 1/11/06, Srividya Valivarthi <srividya.v at gmail.com> wrote:
> >> Hi,
> >>
> >>    I did define the PE for loose rsh using qmon, and also added this
> >> PE to the queue's PE list using the queue manager provided by qmon.
> >>
> >> Thanks,
> >> Srividya
> >>
> >> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>> Hi again.
> >>>
> >>> Am 11.01.2006 um 19:34 schrieb Srividya Valivarthi:
> >>>
> >>>> Hi,
> >>>>
> >>>>    Thanks for your prompt response. I am sorry if I was not
> >>>> clear in the earlier mail. I did not start the lamd daemons by
> >>>> hand prior to submitting the job. What I was trying to convey was
> >>>> that the lamd daemons are running on the compute nodes, possibly
> >>>> started by SGE itself, but are somehow not registered with LAM/MPI?
> >>>>
> >>>>     And also the hostfile that is used during lamboot
> >>>> #lamboot -v -ssi boot rsh hostfile
> >>>
> >>> lamboot will start the daemons, which isn't necessary. Even with a
> >>> loose integration, SGE will start the daemons on its own (just by
> >>> rsh, in contrast to qrsh with a Tight Integration).
> >>>
> >>> LAM/MPI is in some way SGE-aware, and will look for some special
> >>> information in the SGE-created directories on all the slave nodes.
> >>>
> >>> But anyway: how did you define the PE - loose with rsh or qrsh? -
> >>> Reuti
> >>>
> >>>
> >>>> is as follows, and it already has the .local suffix:
> >>>> medusa.lab.ac.uab.edu cpu=4
> >>>> compute-0-0.local cpu=4
> >>>> compute-0-1.local cpu=4
> >>>> compute-0-2.local cpu=4
> >>>> compute-0-3.local cpu=4
> >>>> compute-0-4.local cpu=4
> >>>> compute-0-5.local cpu=4
> >>>> compute-0-6.local cpu=4
> >>>> compute-0-7.local cpu=4
> >>>>
> >>>> Any further ideas to solve this issue will be very helpful.
> >>>>
> >>>> Thanks,
> >>>> Srividya
> >>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Am 11.01.2006 um 18:55 schrieb Srividya Valivarthi:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>     I am working with a Pentium III ROCKS cluster which has
> >>>>>> LAM/MPI version 7.1.1 and SGE version 6.0. I am trying to get the
> >>>>>> loose integration mechanism with rsh working with SGE and LAM, as
> >>>>>> described in the following Howto:
> >>>>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
> >>>>>>
> >>>>>> However, on submitting jobs to the queue, I get the following
> >>>>>> error message:
> >>>>>> ----------------------------------------------------------------------
> >>>>>> It seems that there is no lamd running on the host
> >>>>>> compute-0-5.local.
> >>>>>>
> >>>>>> This indicates that the LAM/MPI runtime environment is not
> >>>>>> operating.
> >>>>>> The LAM/MPI runtime environment is necessary for the "mpirun"
> >>>>>> command.
> >>>>>>
> >>>>>> Please run the "lamboot" command to start the LAM/MPI runtime
> >>>>>> environment.  See the LAM/MPI documentation for how to invoke
> >>>>>> "lamboot" across multiple machines.
> >>>>>> ----------------------------------------------------------------------
> >>>>>> But the lamnodes command shows all the nodes in the system, and I
> >>>>>> can also see the lamd daemon running on the local compute nodes.
> >>>>>> Any ideas on what the issue could be are greatly appreciated.
> >>>>>
> >>>>> there is no need to start up any daemon by hand beforehand. In
> >>>>> fact, it will not work. SGE takes care of starting a private
> >>>>> daemon on all the selected nodes for each particular job.
> >>>>>
> >>>>> One issue with ROCKS might be similar to this one (change the start
> >>>>> script to include .local for the nodes in the "machines" file):
> >>>>>
> >>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14170
> >>>>>
> >>>>> Just let me know whether it worked after adjusting the start
> >>>>> script.
> >>>>>
> >>>>> -- Reuti
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Srividya
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>
