[GE users] Re: LAM SGE Integration issues with rocks 4.1

Reuti reuti at staff.uni-marburg.de
Wed Jan 11 20:07:42 GMT 2006


On 11.01.2006, at 20:45, Srividya Valivarthi wrote:

> The change in startlam.sh from
> echo $host
> to
> echo $host.local
>
> after stopping and booting the LAM universe does not seem to solve the

No - stop the LAM universe and don't boot it by hand! Just start a
parallel job, e.g. the mpihello.c I mentioned, and post the error/log
files of this job. Is your rsh connection between the nodes also working
for passwordless invocation? - Reuti
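
For example, the test could look like this (the node name, file names and
slot count are only placeholders, adjust them to your setup):

    # compile the mpihello.c from the Howto
    mpicc -o mpihello mpihello.c

    # check passwordless rsh between the nodes: this should print the
    # remote hostname without asking for a password
    rsh compute-0-0.local hostname

    # job script mpihello.sh:
    #   #!/bin/sh
    #   mpirun C ./mpihello

    # submit it through the loose PE
    qsub -pe lam_loose_rsh 4 mpihello.sh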

> problem either..
>
> Thanks again,
> Srividya
>
> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>> On 11.01.2006, at 19:53, Srividya Valivarthi wrote:
>>
>>> The pe is defined as follows:
>>>
>>> #qconf -sp lam_loose_rsh
>>> pe_name           lam_loose_rsh
>>> slots             4
>>> user_lists        NONE
>>> xuser_lists       NONE
>>> start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh \
>>>                   $pe_hostfile
>>> stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
>>> allocation_rule   $round_robin
>>> control_slaves    FALSE
>>> job_is_first_task TRUE
>>> urgency_slots     min
>>>
>>
>> Okay, fine. As you use ROCKS, please change in startlam.sh, in
>> PeHostfile2MachineFile(), the line
>>
>>           echo $host
>>
>> to
>>
>>           echo $host.local
>>
>> As we have no ROCKS here, I don't know whether this is necessary. Then
>> just try it as outlined in the Howto with the included mpihello.c, just
>> to test the distribution to the nodes (after shutting down the
>> hand-started LAM universe). - Reuti
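
In the Howto's startlam.sh this echo sits inside the PeHostfile2MachineFile()
helper (the same helper as in SGE's stock mpi/startmpi.sh); with the .local
change it would look roughly like this - a sketch only, the actual script may
differ in detail:

    PeHostfile2MachineFile()
    {
       cat $1 | while read line; do
          # pe_hostfile columns: hostname, slots, queue, processor range
          host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
          nslots=`echo $line | cut -f2 -d" "`
          i=1
          while [ $i -le $nslots ]; do
             # ROCKS: append .local so the cluster-internal name is used
             echo $host.local
             i=`expr $i + 1`
          done
       done
    }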
>>
>>
>>> Thanks so much,
>>> Srividya
>>>
>>> On 1/11/06, Srividya Valivarthi <srividya.v at gmail.com> wrote:
>>>> Hi,
>>>>
>>>>    I did define the PE for loose rsh using qmon, and also added this
>>>> PE to the queue list using the queue manager provided by qmon.
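
For reference, the same attachment can be done on the command line; assuming
the cluster queue is named all.q (adjust to your setup), e.g.:

    # show which PEs the queue currently references
    qconf -sq all.q | grep pe_list

    # add the PE to the queue's pe_list
    qconf -aattr queue pe_list lam_loose_rsh all.q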
>>>>
>>>> Thanks,
>>>> Srividya
>>>>
>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>> Hi again.
>>>>>
>>>>> On 11.01.2006, at 19:34, Srividya Valivarthi wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>    Thanks for your prompt response. I am sorry if I was not clear in
>>>>>> the earlier mail. I did not start the lamd daemons by hand prior to
>>>>>> submitting the job. What I was trying to convey was that the lamd
>>>>>> daemons are running on the compute nodes, possibly started by SGE
>>>>>> itself, but are somehow not registered with LAM/MPI??!!
>>>>>>
>>>>>>     And also the hostfile that is used during lamboot
>>>>>> #lamboot -v -ssi boot rsh hostfile
>>>>>
>>>>> lamboot will start the daemons, which isn't necessary. Even with a
>>>>> loose integration, SGE will start the daemons on its own (just by rsh,
>>>>> in contrast to qrsh with a Tight Integration).
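
Put differently: with the loose setup it is the start_proc_args script itself
that boots a per-job LAM universe from SGE's $pe_hostfile, roughly along these
lines (a simplified sketch, not the literal startlam.sh from the Howto):

    #!/bin/sh
    # SGE passes the pe_hostfile as the first argument (see start_proc_args)
    machines=$TMPDIR/machines

    # one "host cpu=N" bhost line per granted node
    cat $1 | while read host nslots rest; do
       echo "$host cpu=$nslots"
    done > $machines

    # boot the per-job LAM universe over plain rsh (loose integration)
    lamboot -ssi boot rsh $machines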
>>>>>
>>>>> LAM/MPI is in some way SGE-aware, and will look for some special
>>>>> information in the SGE-created directories on all the slave nodes.
>>>>>
>>>>> But anyway: how did you define the PE - loose with rsh or qrsh? -
>>>>> Reuti
>>>>>
>>>>>
>>>>>> is as follows; the compute nodes already have the .local suffix:
>>>>>> medusa.lab.ac.uab.edu cpu=4
>>>>>> compute-0-0.local cpu=4
>>>>>> compute-0-1.local cpu=4
>>>>>> compute-0-2.local cpu=4
>>>>>> compute-0-3.local cpu=4
>>>>>> compute-0-4.local cpu=4
>>>>>> compute-0-5.local cpu=4
>>>>>> compute-0-6.local cpu=4
>>>>>> compute-0-7.local cpu=4
>>>>>>
>>>>>> Any further ideas on how to solve this issue would be very helpful.
>>>>>>
>>>>>> Thanks,
>>>>>> Srividya
>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 11.01.2006, at 18:55, Srividya Valivarthi wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>     I am working with a Pentium III ROCKS cluster which has LAM/MPI
>>>>>>>> version 7.1.1 and SGE version 6.0. I am trying to get the loose
>>>>>>>> integration mechanism with rsh working with SGE and LAM, as suggested
>>>>>>>> by the following post on this mailing list:
>>>>>>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
>>>>>>>>
>>>>>>>> However, on submitting the jobs to the queue, I get the following
>>>>>>>> error message:
>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>> It seems that there is no lamd running on the host compute-0-5.local.
>>>>>>>>
>>>>>>>> This indicates that the LAM/MPI runtime environment is not operating.
>>>>>>>> The LAM/MPI runtime environment is necessary for the "mpirun" command.
>>>>>>>>
>>>>>>>> Please run the "lamboot" command to start the LAM/MPI runtime
>>>>>>>> environment.  See the LAM/MPI documentation for how to invoke
>>>>>>>> "lamboot" across multiple machines.
>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>> But the lamnodes command shows all the nodes on the system, and I
>>>>>>>> can also see the lamd daemon running on the local compute nodes. Any
>>>>>>>> ideas on what the issue could be are greatly appreciated.
>>>>>>>
>>>>>>> there is no need to start up any daemon by hand beforehand. In
>>>>>>> fact, it will not work. SGE takes care of starting a private daemon
>>>>>>> for each job on all the nodes selected for that job.
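
An easy way to see this per-job universe (instead of the one booted by hand)
is to query it from inside the job script itself, e.g. (mpihello from the
Howto assumed here):

    #!/bin/sh
    # the lamnodes output here shows the universe that the PE start
    # script booted for this particular job
    lamnodes
    mpirun C ./mpihello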
>>>>>>>
>>>>>>> One issue with ROCKS might be similar to this (change the start
>>>>>>> script to include .local for the nodes in the "machines" file):
>>>>>>>
>>>>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14170
>>>>>>>
>>>>>>> Just let me know whether it worked after adjusting the start
>>>>>>> script.
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Srividya

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list