[GE users] Re: LAM SGE Integration issues with rocks 4.1

Srividya Valivarthi srividya.v at gmail.com
Wed Jan 11 18:53:32 GMT 2006



The PE is defined as follows:

#qconf -sp lam_loose_rsh
pe_name           lam_loose_rsh
slots             4
user_lists        NONE
xuser_lists       NONE
start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh \
                  $pe_hostfile
stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min
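
For reference, startlam.sh and stoplam.sh follow the start/stop pattern from the LAM loose-integration howto; what follows is only a minimal sketch of that pattern (not necessarily my exact scripts), with the .local handling for Rocks discussed further down:

#!/bin/sh
# startlam.sh -- sketch of a loose-integration start_proc_args script:
# convert SGE's $pe_hostfile (one "host slots queue processors" line
# per node) into a LAM boot schema and boot the per-job LAM universe.
pe_hostfile=$1

machines=$TMPDIR/machines
rm -f "$machines"

while read host nslots rest; do
    # On Rocks it may be necessary to append .local here if the
    # pe_hostfile carries only the short node names.
    echo "$host cpu=$nslots" >> "$machines"
done < "$pe_hostfile"

lamboot -ssi boot rsh "$machines" || exit 1

#!/bin/sh
# stoplam.sh -- sketch of the matching stop_proc_args script:
# shut down the LAM universe that was booted for this job.
lamhalt

A job then only requests the PE (e.g. "qsub -pe lam_loose_rsh 4 job.sh") and calls mpirun inside the job script; lamboot is not run by hand beforehand.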

Thanks so much,
Srividya

On 1/11/06, Srividya Valivarthi <srividya.v at gmail.com> wrote:
> Hi,
>
>    I did define the pe for loose rsh using qmon. and also added this
> pe to the queue list using the queue manager provided by qmon.
>
> Thanks,
> Srividya
>
> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi again.
> >
> > Am 11.01.2006 um 19:34 schrieb Srividya Valivarthi:
> >
> > > Hi,
> > >
> > >    Thanks for your prompt response. I am sorry if I was not clear in
> > > the earlier mail. I did not start the lamd daemons by hand prior to
> > > submitting the job. What I was trying to convey was that the lamd
> > > daemons are running on the compute nodes, possibly started by SGE
> > > itself, but somehow are not registered with LAM/MPI??!!
> > >
> > >     And also the hostfile that is used during lamboot
> > > #lamboot -v -ssi boot rsh hostfile
> >
> > lamboot will start the daemons, which isn't necessary. Even with a
> > loose integration, SGE will start the daemons on its own (just via rsh,
> > in contrast to qrsh with a Tight Integration).
> >
> > LAM/MPI is in some way SGE-aware, and will look for some special
> > information in the SGE-created directories on all the slave nodes.
> >
> > But anyway: how did you define the PE - loose with rsh or qrsh? - Reuti
> >
> >
> > > is as follows (it already has the .local suffix):
> > > medusa.lab.ac.uab.edu cpu=4
> > > compute-0-0.local cpu=4
> > > compute-0-1.local cpu=4
> > > compute-0-2.local cpu=4
> > > compute-0-3.local cpu=4
> > > compute-0-4.local cpu=4
> > > compute-0-5.local cpu=4
> > > compute-0-6.local cpu=4
> > > compute-0-7.local cpu=4
> > >
> > > Any further ideas to solve this issue will be very helpful.
> > >
> > > Thanks,
> > > Srividya
> > > On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> > >> Hi,
> > >>
> > >> Am 11.01.2006 um 18:55 schrieb Srividya Valivarthi:
> > >>
> > >>> Hi,
> > >>>
> > >>>     I am working with a Pentium III Rocks cluster which has LAM/MPI
> > >>> version 7.1.1 and SGE version 6.0. I am trying to get the loose
> > >>> integration mechanism with rsh working with SGE and LAM, as described
> > >>> in the following howto:
> > >>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
> > >>>
> > >>> However, on submitting jobs to the queue, I get the following
> > >>> error message:
> > >>> -----------------------------------------------------------------------
> > >>> It seems that there is no lamd running on the host compute-0-5.local.
> > >>>
> > >>> This indicates that the LAM/MPI runtime environment is not operating.
> > >>> The LAM/MPI runtime environment is necessary for the "mpirun" command.
> > >>>
> > >>> Please run the "lamboot" command to start the LAM/MPI runtime
> > >>> environment.  See the LAM/MPI documentation for how to invoke
> > >>> "lamboot" across multiple machines.
> > >>> -----------------------------------------------------------------------
> > >>> But the lamnodes command shows all the nodes in the system, and I can
> > >>> also see the lamd daemon running on the compute nodes.  Any ideas on
> > >>> what the issue could be are greatly appreciated.
> > >>
> > >> there is no need to start up any daemons by hand beforehand. In
> > >> fact, it will not work. SGE takes care of starting a private daemon
> > >> for each job on all the nodes selected for that particular job.
> > >>
> > >> One issue with Rocks might be similar to this one (change the start
> > >> script to include .local for the node names in the "machines" file):
> > >>
> > >> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14170
> > >>
> > >> Just let me know whether it worked after adjusting the start script.
> > >>
> > >> -- Reuti
> > >>
> > >>
> > >>>
> > >>> Thanks,
> > >>> Srividya
> > >>>
> > >>
> > >
> >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



