[GE users] Re: LAM SGE Integration issues with rocks 4.1

Reuti reuti at staff.uni-marburg.de
Wed Jan 18 20:11:44 GMT 2006


Srividya:

On 18.01.2006 at 16:50, Srividya Valivarthi wrote:

> Hi,
>
>    Thanks so much for the prompt responses.  I would like to go over
> again the commands that I have used and the error logs more clearly,
> so that I can get some help on this problem.
>
> 1) First, I have aliased rsh to ssh. Will this cause any issues?
>

For a loose integration this should work, but not for any of the
qrsh-based setups of the LAM/MPI integration (where rsh will be
caught by SGE and routed to a qrsh command).
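A side note on the alias itself: an alias defined in an interactive
shell is generally not seen by the non-interactive shells that SGE and
lamboot spawn. If ssh is really meant to replace rsh for the loose
setup, it is safer to tell LAM explicitly which remote agent to use,
e.g. in the howto's startlam.sh where the per-job universe gets booted.
A minimal sketch, assuming the stock LAM 7.x rsh boot module (adjust to
your local copy of the script):

    # make LAM call ssh instead of rsh when starting its daemons
    export LAMRSH="ssh -x"
    # or, equivalently, hand it to lamboot as an SSI parameter
    # (the machine file name below is the one assumed from the howto's script)
    lamboot -ssi boot rsh -ssi boot_rsh_agent "ssh -x" $TMPDIR/machines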

> 2) On my first login into the system I ran the following command to have
> the lamd daemon running on all nodes, as follows:
>     # lamboot -v -ssi boot rsh hostfile
>       and the host file contains
>       	medusa.lab.ac.uab.edu cpu=4
> 	compute-0-0.local cpu=4
> 	compute-0-1.local cpu=4
> 	compute-0-2.local cpu=4
> 	compute-0-3.local cpu=4
> 	compute-0-4.local cpu=4
> 	compute-0-5.local cpu=4
> 	compute-0-6.local cpu=4
> 	compute-0-7.local cpu=4
>

Again: please stop the daemons! Then come back and we will go on to
the next point. - Reuti
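
A hand-booted universe can be taken down again with the LAM shutdown
commands before the next test run, e.g. -- a sketch using the install
path from the commands below:

    /opt/lam/gnu/bin/lamhalt                # clean shutdown of the running LAM universe
    /opt/lam/gnu/bin/lamwipe -v hostfile    # heavier fallback if lamhalt cannot reach a node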

> 3) Then on compiling and running the mpihello program with the LAM
> binaries I get the expected results.
>       [srividya at medusa ~]$ /opt/lam/gnu/bin/mpirun -np 2 /home/srividya/mpihello
> 	Hello World from Node 0.
> 	Hello World from Node 1.
>
> 4) Now, in order to be able to submit jobs through SGE, I defined the
> PE through qmon as follows:
>      [srividya at medusa ~]$ qconf -sp lam_loose_rsh
> 	pe_name           lam_loose_rsh
> 	slots             4
> 	user_lists        NONE
> 	xuser_lists       NONE
> 	start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh \
>          			$pe_hostfile
> 	stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
> 	allocation_rule   $round_robin
> 	control_slaves    FALSE
> 	job_is_first_task TRUE
> 	urgency_slots     min
>
>      I have also added this PE to the queue list through qmon.
>
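Outside of qmon, a quick command-line check should confirm that the PE
exists and is really attached to the cluster queue -- a sketch, assuming
the ROCKS default queue name all.q:

    qconf -spl                      # list all defined parallel environments
    qconf -sq all.q | grep pe_list  # lam_loose_rsh should show up in the queue's pe_list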
> 5) I have modified the corresponding startlam.sh as suggested, from
> hostname to hostname.local.
>
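For reference, the part of the howto's startlam.sh that this touches is
the PeHostfile2MachineFile() function, which turns SGE's $pe_hostfile
into the machine file handed to lamboot. A sketch of roughly how it
looks after the change -- variable names may differ slightly in your
copy of the script:

    PeHostfile2MachineFile()
    {
       cat $1 | while read line; do
          # each $pe_hostfile line is: <hostname> <slots> <queue> <processor range>
          host=`echo $line | cut -f1 -d" "`
          nslots=`echo $line | cut -f2 -d" "`
          i=1
          while [ $i -le $nslots ]; do
             # append .local so the entries match the internal names the
             # ROCKS compute nodes use for each other
             echo $host.local
             i=`expr $i + 1`
          done
       done
    }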
> 6) Now, I have defined the job script as follows:
>       [srividya at medusa ~]$ cat tester1.sh
> 		#!/bin/sh
> 		/opt/lam/gnu/bin/mpirun C /home/srividya/mpihello
>
> 7) On running the script file as follows
>        [srividya at medusa ~]$ qsub -pe lam_loose_rsh 2 tester1.sh
> 		Your job 79 ("tester1.sh") has been submitted.
> 	[srividya at medusa ~]$ qstat
> 	job-ID  prior    name        user      state  submit/start at      queue  slots  ja-task-ID
> 	---------------------------------------------------------------------------------------------
> 	    79  0.00000  tester1.sh  srividya  qw     01/18/2006 09:37:12          2
>
> 8) And I obtain the following results in tester1.sh.e79:
>
>      [srividya at medusa ~]$ cat tester1.sh.e79
> 	/home/srividya/mpihello: error while loading shared libraries:
> 	liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------------
> /home/srividya/mpihello: error while loading shared libraries:
> liblamf77mpi.so.0: cannot open shared object file: No such file or
> directory
>
> I am not sure why the path information is not being picked up under
> SGE... The LD_LIBRARY_PATH env variable has the required path... Is
> there something else that I am missing?
>
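A likely explanation: SGE starts the job in a fresh, non-interactive
shell, so an LD_LIBRARY_PATH set in the interactive login environment
does not automatically reach the job. Two common first steps, sketched
with the LAM prefix used above -- the lib directory is only assumed from
the /opt/lam/gnu/bin location of the binaries:

    # either export the whole submit-time environment into the job ...
    qsub -V -pe lam_loose_rsh 2 tester1.sh
    # ... or set the path explicitly inside the job script itself:
    export LD_LIBRARY_PATH=/opt/lam/gnu/lib:$LD_LIBRARY_PATH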
> 9) On changing the script to sge.lam.script as follows (the only
> difference being the LAM_MPI_SOCKET_SUFFIX):
>    # cat sge.lam.script
>    #!/bin/sh
>    #$ -N mpihello
>    #$ -cwd
>    #$ -j y
>    #
>    # pe request for LAM. Set your number of processors here.
>    #$ -pe lam_loose_rsh 2
>    #
>    # Run job through bash shell
>    #$ -S /bin/bash
>    # This MUST be in your LAM run script, otherwise
>    # multiple LAM jobs will NOT RUN
>    export LAM_MPI_SOCKET_SUFFIX=$JOB_ID.$JOB_NAME
>    #
>    # Use full pathname to make sure we are using the right mpirun
>    /opt/lam/gnu/bin/mpirun -np $NSLOTS /home/srividya/mpihello
>
> 10) And submitting it to the queue:
>         #qsub sge.lam.script
>
> 11) I obtain the following error message:
>         [srividya at medusa ~]$ cat mpihello.o80
> -----------------------------------------------------------------------------
> It seems that there is no lamd running on the host compute-0-6.local.
>
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for the "mpirun" command.
>
> Please run the "lamboot" command to start the LAM/MPI runtime
> environment.  See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------------
>
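When the "no lamd running" message comes up under SGE, one way to see
what the PE start script actually handed to LAM is to dump the
generated host list from inside the job before calling mpirun -- a
debugging sketch, assuming the howto's startlam.sh writes its machine
file to $TMPDIR/machines:

    # hosts granted by SGE for this job, and the machine file built from them
    cat $PE_HOSTFILE
    cat $TMPDIR/machines
    # ask the per-job LAM universe which nodes it actually knows about
    /opt/lam/gnu/bin/lamnodes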
> And this is the message that I was sending out earlier. I am new to
> the SGE-LAM environment, and thanks so much for your patience. Any help
> will be greatly appreciated.
>
> Thanks,
> Srividya
>
>
>
>
>
> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>> On 11.01.2006 at 20:45, Srividya Valivarthi wrote:
>>
>>> The change in the startlam.sh from
>>> echo host
>>> to
>>> echo host.local
>>>
>>> after stopping and booting the lamuniverse does not seem to solve the
>>
>> No - stop the lamuniverse. Don't boot it by hand! Just start a
>> parallel job like I mentioned the mpihello.c, and post the error/log-
>> files of this job. Your rsh connection is also working between the
>> nodes for a passwordless invocation? - Reuti
>>
>>> problem either..
>>>
>>> Thanks again,
>>> Srividya
>>>
>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>> On 11.01.2006 at 19:53, Srividya Valivarthi wrote:
>>>>
>>>>> The pe is defined as follows:
>>>>>
>>>>> #qconf -sp lam_loose_rsh
>>>>> pe_name           lam_loose_rsh
>>>>> slots             4
>>>>> user_lists        NONE
>>>>> xuser_lists       NONE
>>>>> start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh \
>>>>>                   $pe_hostfile
>>>>> stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
>>>>> allocation_rule   $round_robin
>>>>> control_slaves    FALSE
>>>>> job_is_first_task TRUE
>>>>> urgency_slots     min
>>>>>
>>>>
>>>> Okay, fine. As you use ROCKS, please change in the startlam.sh in
>>>> PeHostfile2MachineFile():
>>>>
>>>>           echo $host
>>>>
>>>> to
>>>>
>>>>           echo $host.local
>>>>
>>>> As we have no ROCKS, I don't know whether this is necessary. Then
>>>> just try as outlined in the Howto with the included mpihello.c,  
>>>> just
>>>> to test the distribution to the nodes (after shutting down the
>>>> started LAM universe). - Reuti
>>>>
>>>>
>>>>> Thanks so much,
>>>>> Srividya
>>>>>
>>>>> On 1/11/06, Srividya Valivarthi <srividya.v at gmail.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>>    I did define the pe for loose rsh using qmon. and also added
>>>>>> this
>>>>>> pe to the queue list using the queue manager provided by qmon.
>>>>>>
>>>>>> Thanks,
>>>>>> Srividya
>>>>>>
>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>>> Hi again.
>>>>>>>
>>>>>>> On 11.01.2006 at 19:34, Srividya Valivarthi wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>    Thanks for your prompt response. I am sorry if I was not
>>>>>>>> clear in the earlier mail. I did not start the lamd daemons
>>>>>>>> prior to submitting the job by hand. What I was trying to
>>>>>>>> convey was that the lamd daemons are running on the compute
>>>>>>>> nodes, possibly started by SGE itself, but somehow are not
>>>>>>>> registered with LAM/MPI??!!
>>>>>>>>
>>>>>>>>     And also the hostfile that is used during lamboot
>>>>>>>> #lamboot -v -ssi boot rsh hostfile
>>>>>>>
>>>>>>> lamboot will start the daemons, which isn't necessary. Also with
>>>>>>> a loose integration, SGE will start the daemons on its own (just
>>>>>>> by rsh in contrast to qrsh with a Tight Integration).
>>>>>>>
>>>>>>> LAM/MPI is in some way SGE aware, and will look for some special
>>>>>>> information in the SGE created directories on all the slave  
>>>>>>> nodes.
>>>>>>>
>>>>>>> But anyway: how did you define the PE - loose with rsh or  
>>>>>>> qrsh? -
>>>>>>> Reuti
>>>>>>>
>>>>>>>
>>>>>>>> is as follows, which already had the .local suffix as
>>>>>>>> medusa.lab.ac.uab.edu cpu=4
>>>>>>>> compute-0-0.local cpu=4
>>>>>>>> compute-0-1.local cpu=4
>>>>>>>> compute-0-2.local cpu=4
>>>>>>>> compute-0-3.local cpu=4
>>>>>>>> compute-0-4.local cpu=4
>>>>>>>> compute-0-5.local cpu=4
>>>>>>>> compute-0-6.local cpu=4
>>>>>>>> compute-0-7.local cpu=4
>>>>>>>>
>>>>>>>> Any further ideas to solve this issue will be very helpful.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Srividya
>>>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 11.01.2006 at 18:55, Srividya Valivarthi wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>     I am working with a Pentium III ROCKS cluster which has
>>>>>>>>>> LAM/MPI version 7.1.1 and SGE version 6.0. I am trying to get
>>>>>>>>>> the loose integration mechanism with rsh working with SGE and
>>>>>>>>>> LAM as suggested by the following post on this mailing list:
>>>>>>>>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
>>>>>>>>>>
>>>>>>>>>> However, on submitting the jobs to the queue, I get the
>>>>>>>>>> following error message:
>>>>>>>>>> ---------------------------------------------------------------------------
>>>>>>>>>> It seems that there is no lamd running on the host
>>>>>>>>>> compute-0-5.local.
>>>>>>>>>>
>>>>>>>>>> This indicates that the LAM/MPI runtime environment is not
>>>>>>>>>> operating.
>>>>>>>>>> The LAM/MPI runtime environment is necessary for the "mpirun"
>>>>>>>>>> command.
>>>>>>>>>>
>>>>>>>>>> Please run the "lamboot" command to start the LAM/MPI runtime
>>>>>>>>>> environment.  See the LAM/MPI documentation for how to invoke
>>>>>>>>>> "lamboot" across multiple machines.
>>>>>>>>>> ---------------------------------------------------------------------------
>>>>>>>>>> But the lamnodes command shows all the nodes on the system,
>>>>>>>>>> and I can also see the lamd daemon running on the local
>>>>>>>>>> compute nodes.  Any ideas on what the issue could be are
>>>>>>>>>> greatly appreciated.
>>>>>>>>>
>>>>>>>>> there is no need to start up any daemon by hand beforehand. In
>>>>>>>>> fact, it will not work. SGE takes care of starting a private
>>>>>>>>> daemon for each job on all the selected nodes for this
>>>>>>>>> particular job.
>>>>>>>>>
>>>>>>>>> One issue with ROCKS might be similar to this (change the
>>>>>>>>> start script to include .local for the nodes in the
>>>>>>>>> "machines"-file):
>>>>>>>>>
>>>>>>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14170
>>>>>>>>>
>>>>>>>>> Just let me know whether it worked after adjusting the start
>>>>>>>>> script.
>>>>>>>>>
>>>>>>>>> -- Reuti
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Srividya
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



