[GE users] Re: LAM SGE Integration issues with rocks 4.1

Reuti reuti at staff.uni-marburg.de
Wed Jan 11 19:06:12 GMT 2006


On 11.01.2006, at 19:53, Srividya Valivarthi wrote:

> The PE is defined as follows:
>
> #qconf -sp lam_loose_rsh
> pe_name           lam_loose_rsh
> slots             4
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh \
>                   $pe_hostfile
> stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
> allocation_rule   $round_robin
> control_slaves    FALSE
> job_is_first_task TRUE
> urgency_slots     min
>

Okay, fine. Since you are using ROCKS, please change in startlam.sh,
in PeHostfile2MachineFile():

          echo $host

to

          echo $host.local
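
For reference, after this change the whole function should look
roughly like the following sketch (based on the startmpi.sh template
from which the Howto's startlam.sh is derived; the details in your
copy may differ):

   PeHostfile2MachineFile()
   {
      cat $1 | while read line; do
         # first field of a $pe_hostfile line is the hostname,
         # the second field is the number of slots on that host
         host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
         nslots=`echo $line|cut -f2 -d" "`
         i=1
         while [ $i -le $nslots ]; do
            # append .local, so that the entry matches the name by
            # which the ROCKS compute nodes are reachable
            echo $host.local
            i=`expr $i + 1`
         done
      done
   }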

As I have no ROCKS installation here, I don't know for sure whether
this is necessary. Then just try it as outlined in the Howto with the
included mpihello.c, to test the distribution of the processes to the
nodes (after shutting down the LAM universe you started by hand). - Reuti
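
P.S. For the test, a minimal job script could look like the following
sketch (the PE name is taken from your setup; the slot count of 4 is
just an example; compile the program first, e.g. with
"mpicc mpihello.c -o mpihello"):

   #!/bin/sh
   #$ -S /bin/sh
   #$ -cwd
   #$ -pe lam_loose_rsh 4
   # startlam.sh has already booted a private LAM universe for this
   # job, so mpirun can be called directly; "C" means one process per
   # scheduled CPU
   mpirun C ./mpihello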


> Thanks so much,
> Srividya
>
> On 1/11/06, Srividya Valivarthi <srividya.v at gmail.com> wrote:
>> Hi,
>>
>>    I did define the PE for loose rsh using qmon, and also added this
>> PE to the queue list using the queue manager provided by qmon.
>>
>> Thanks,
>> Srividya
>>
>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>> Hi again.
>>>
>>> On 11.01.2006, at 19:34, Srividya Valivarthi wrote:
>>>
>>>> Hi,
>>>>
>>>>    Thanks for your prompt response. I am sorry if I was not clear in
>>>> the earlier mail. I did not start the lamd daemons by hand prior to
>>>> submitting the job. What I was trying to convey was that the lamd
>>>> daemons are running on the compute nodes, possibly started by SGE
>>>> itself, but somehow are not registered with LAM/MPI?
>>>>
>>>>     And also the hostfile that is used during lamboot
>>>> #lamboot -v -ssi boot rsh hostfile
>>>
>>> lamboot will start the daemons, which isn't necessary here. Even with
>>> a loose integration, SGE will start the daemons on its own (just via
>>> rsh, in contrast to qrsh with a tight integration).
>>>
>>> LAM/MPI is to some extent SGE-aware, and will look for special
>>> information in the SGE-created directories on all the slave nodes.
>>>
>>> But anyway: how did you define the PE - loose with rsh or qrsh? -  
>>> Reuti
>>>
>>>
>>>> is as follows, and already has the .local suffix:
>>>> medusa.lab.ac.uab.edu cpu=4
>>>> compute-0-0.local cpu=4
>>>> compute-0-1.local cpu=4
>>>> compute-0-2.local cpu=4
>>>> compute-0-3.local cpu=4
>>>> compute-0-4.local cpu=4
>>>> compute-0-5.local cpu=4
>>>> compute-0-6.local cpu=4
>>>> compute-0-7.local cpu=4
>>>>
>>>> Any further ideas to solve this issue will be very helpful.
>>>>
>>>> Thanks,
>>>> Srividya
>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>> Hi,
>>>>>
>>>>> On 11.01.2006, at 18:55, Srividya Valivarthi wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>     I am working with a Pentium III ROCKS cluster running LAM/MPI
>>>>>> version 7.1.1 and SGE version 6.0. I am trying to get the loose
>>>>>> integration mechanism with rsh working between SGE and LAM, as
>>>>>> suggested by the following Howto from this mailing list:
>>>>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
>>>>>>
>>>>>> However, on submitting the jobs to the queue, I get the following
>>>>>> error message:
>>>>>> ---------------------------------------------------------------------
>>>>>> It seems that there is no lamd running on the host
>>>>>> compute-0-5.local.
>>>>>>
>>>>>> This indicates that the LAM/MPI runtime environment is not
>>>>>> operating.
>>>>>> The LAM/MPI runtime environment is necessary for the "mpirun"
>>>>>> command.
>>>>>>
>>>>>> Please run the "lamboot" command to start the LAM/MPI runtime
>>>>>> environment.  See the LAM/MPI documentation for how to invoke
>>>>>> "lamboot" across multiple machines.
>>>>>> ---------------------------------------------------------------------
>>>>>> But the lamnodes command shows all the nodes of the system, and I
>>>>>> can also see the lamd daemon running on the local compute nodes. Any
>>>>>> ideas on what the issue could be are greatly appreciated.
>>>>>
>>>>> there is no need to start any daemon on your own by hand beforehand.
>>>>> In fact, it will not work. SGE takes care of starting a private
>>>>> daemon on all the selected nodes for each particular job.
>>>>>
>>>>> One issue with ROCKS might be similar to this one (change the start
>>>>> script to include .local for the nodes in the "machines" file):
>>>>>
>>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14170
>>>>>
>>>>> Just let me know whether it worked after adjusting the start script.
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Srividya
>>>>>>
>>>>>
>>>>
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list