[GE users] Re: LAM SGE Integration issues with rocks 4.1

Reuti reuti at staff.uni-marburg.de
Wed Jan 18 20:56:49 GMT 2006


On 18.01.2006, at 21:33, Srividya Valivarthi wrote:

> Thanks so much for your patience.
>
> Have stopped all the daemons using the lamhalt command and
> attached the log.
>

Fine!

> Thanks again,
> Srividya
> On 1/18/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>> Srividya:
>>
>> On 18.01.2006, at 16:50, Srividya Valivarthi wrote:
>>
>>> Hi,
>>>
>>>    Thanks so much for the prompt responses.  I would like to go over
>>> the commands that I have used and the error logs again more clearly,
>>> so that I can get some help on this problem.
>>>
>>> 1) Firstly, I have aliased rsh to ssh. Will this cause any issues?
>>>
>>
>> For a loose integration this should work, but not for any of the
>> qrsh-based setups of the LAM/MPI integration (where rsh will be
>> caught by SGE and routed to a qrsh command).
>>
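
Side note: instead of shell-aliasing rsh to ssh, LAM itself can be told
which remote agent to use for a loose setup. If I recall the LAM 7.x
syntax correctly - please check the LAM documentation before relying on
it - this would be something like:

    export LAMRSH="ssh -x"

or, on the lamboot command line:

    lamboot -ssi boot_rsh_agent "ssh -x" hostfile
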
>>> 2) On my first login into the system I ran the following command to
>>> have the lamd daemon running on all nodes, as follows:
>>>     # lamboot -v -ssi boot rsh hostfile
>>>       and the host file contains
>>>       	medusa.lab.ac.uab.edu cpu=4
>>> 	compute-0-0.local cpu=4
>>> 	compute-0-1.local cpu=4
>>> 	compute-0-2.local cpu=4
>>> 	compute-0-3.local cpu=4
>>> 	compute-0-4.local cpu=4
>>> 	compute-0-5.local cpu=4
>>> 	compute-0-6.local cpu=4
>>> 	compute-0-7.local cpu=4
>>>
>>
>> Again: please stop the daemons! Then come back and we'll go on to the
>> next point. - Reuti
>>
>>> 3) Then, on compiling and running the mpihello program with the LAM
>>> binaries, I get the expected results.
>>>       [srividya at medusa ~]$ /opt/lam/gnu/bin/mpirun -np 2 /home/srividya/mpihello
>>> 	Hello World from Node 0.
>>> 	Hello World from Node 1.
>>>

This was working outside of SGE - ok.

>>> 4) Now, in order to be able to submit jobs through SGE, I defined
>>> the PE through qmon as follows:
>>>      [srividya at medusa ~]$ qconf -sp lam_loose_rsh
>>> 	pe_name           lam_loose_rsh
>>> 	slots             4
>>> 	user_lists        NONE
>>> 	xuser_lists       NONE
>>> 	start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh $pe_hostfile
>>> 	stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
>>> 	allocation_rule   $round_robin
>>> 	control_slaves    FALSE
>>> 	job_is_first_task TRUE
>>> 	urgency_slots     min
>>>
>>>      Have also added this PE to the queue list through qmon.
>>>

Near the end of the startlam.sh for the lam_loose_rsh setup are the lines:

...
#
# Extra LAM statement(s)
#
if [ -z "`which lamboot 2>/dev/null`" ] ; then
     export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
fi
lamboot $machines
...

Please adjust them to reflect the PATH to your LAM installation.
Sometimes lamboot is already found by the default PATH of a user (e.g.
set in their .bashrc or similar), so this is just a safety net to find
it.
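
For your cluster this would presumably become (assuming your LAM
installation is the /opt/lam/gnu tree that your mpirun path points to -
please verify):

...
#
# Extra LAM statement(s)
#
if [ -z "`which lamboot 2>/dev/null`" ] ; then
     export PATH=/opt/lam/gnu/bin:$PATH
fi
lamboot $machines
...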


>>> 5) Have modified the corresponding startlam.sh as suggested, from
>>> hostname to hostname.local.
>>>

Okay, this we need for ROCKS.

Now submit a test job with:

#!/bin/sh
lamnodes
exit 0

and request the LAM PE as you did below (with a different number of
requested slots). In the .po file you should find just the LAM copyright
notice twice, and in the .o file a confirmation of the selected nodes.
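
For example (the script name lamtest.sh is just a placeholder, and the
slot count is up to you):

    qsub -pe lam_loose_rsh 4 lamtest.sh

and after the job finished:

    cat lamtest.sh.po<job_id>   # the LAM copyright notice, once from the
                                # start and once from the stop script
    cat lamtest.sh.o<job_id>    # the nodes reported by lamnodes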

It might be necessary to put a line like:

export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH

in your .profile and/or .bashrc (of course with the actual location of
your LAM installation).

Once this works, we'll go on to the next step. - Reuti


>>> 6) Now, have defined the script file as follows
>>>       [srividya at medusa ~]$ cat tester1.sh
>>> 		#!/bin/sh
>>> 		/opt/lam/gnu/bin/mpirun C /home/srividya/mpihello
>>>
>>> 7) On running the script file as follows
>>>        [srividya at medusa ~]$ qsub -pe lam_loose_rsh 2 tester1.sh
>>> 		Your job 79 ("tester1.sh") has been submitted.
>>> 	[srividya at medusa ~]$ qstat
>>> 	job-ID  prior    name        user      state  submit/start at      queue  slots  ja-task-ID
>>> 	--------------------------------------------------------------------------------------------
>>> 	    79  0.00000  tester1.sh  srividya  qw     01/18/2006 09:37:12          2
>>>
>>> 8) And obtain the following results in tester1.sh.e79:
>>>
>>>      [srividya at medusa ~]$ cat tester1.sh.e79
>>> 	/home/srividya/mpihello: error while loading shared libraries:
>>> 	liblamf77mpi.so.0: cannot open shared object file: No such file or
>>> 	directory
>>> -----------------------------------------------------------------------------
>>> It seems that [at least] one of the processes that was started with
>>> mpirun did not invoke MPI_INIT before quitting (it is possible that
>>> more than one process did not invoke MPI_INIT -- mpirun was only
>>> notified of the first one, which was on node n0).
>>>
>>> mpirun can *only* be used with MPI programs (i.e., programs that
>>> invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec"  
>>> program
>>> to run non-MPI programs over the lambooted nodes.
>>> -----------------------------------------------------------------------------
>>> /home/srividya/mpihello: error while loading shared libraries:
>>> liblamf77mpi.so.0: cannot open shared object file: No such file or
>>> directory
>>>
>>> I am not sure why the path information is not being read by SGE...
>>> The LD_LIBRARY_PATH env variable has the required path... Is there
>>> something else that I am missing?
>>>
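
A note on the shared library error in 8): the shell which SGE starts for
the job on the compute nodes will not necessarily pick up the
LD_LIBRARY_PATH from your interactive login. Two usual ways around it are
to set it in the job script itself, or to forward your environment at
submission time with qsub -V. The library directory below is only my
guess for your installation, so please adjust it:

    # inside the job script:
    export LD_LIBRARY_PATH=/opt/lam/gnu/lib:$LD_LIBRARY_PATH

    # or, alternatively, forward the submission environment:
    qsub -V -pe lam_loose_rsh 2 tester1.sh
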
>>> 9) On changing the script to sge.lam.script as follows... the only
>>> difference being the LAM_MPI_SOCKET_SUFFIX:
>>>    #cat sge.lam.script
>>>    #!/bin/sh
>>>    #$ -N mpihello
>>>    #$ -cwd
>>>    #$ -j y
>>>    #
>>>    # pe request for LAM. Set your number of processors here.
>>>    #$ -pe lam_loose_rsh 2
>>>    #
>>>    # Run job through bash shell
>>>    #$ -S /bin/bash
>>>    # This MUST be in your LAM run script, otherwise
>>>    # multiple LAM jobs will NOT RUN
>>>    export LAM_MPI_SOCKET_SUFFIX=$JOB_ID.$JOB_NAME
>>>    #
>>>    # Use full pathname to make sure we are using the right mpirun
>>>    /opt/lam/gnu/bin/mpirun -np $NSLOTS /home/srividya/mpihello
>>>
>>> 10) and submitting to the queue
>>>         #qsub sge.lam.script
>>>
>>> 11) Obtain the following error message
>>>         [srividya at medusa ~]$ cat mpihello.o80
>>> -----------------------------------------------------------------------------
>>> It seems that there is no lamd running on the host  
>>> compute-0-6.local.
>>>
>>> This indicates that the LAM/MPI runtime environment is not  
>>> operating.
>>> The LAM/MPI runtime environment is necessary for the "mpirun"  
>>> command.
>>>
>>> Please run the "lamboot" command to start the LAM/MPI runtime
>>> environment.  See the LAM/MPI documentation for how to invoke
>>> "lamboot" across multiple machines.
>>> -----------------------------------------------------------------------------
>>>
>>> And this is the message that I was sending out earlier. I am new to
>>> the SGE-LAM environment, and thanks so much for your patience. Any
>>> help will be greatly appreciated.
>>>
>>> Thanks,
>>> Srividya
>>>
>>>
>>>
>>>
>>>
>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>> On 11.01.2006, at 20:45, Srividya Valivarthi wrote:
>>>>
>>>>> The change in the startlam.sh from
>>>>> echo host
>>>>> to
>>>>> echo host.local
>>>>>
>>>>> after stopping and booting the lamuniverse does not seem to solve
>>>>> the
>>>>
>>>> No - stop the LAM universe. Don't boot it by hand! Just start a
>>>> parallel job like the mpihello.c I mentioned, and post the error/log
>>>> files of this job. Is your rsh connection also working between the
>>>> nodes for a passwordless invocation? - Reuti
>>>>
>>>>> problem either..
>>>>>
>>>>> Thanks again,
>>>>> Srividya
>>>>>
>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>> On 11.01.2006, at 19:53, Srividya Valivarthi wrote:
>>>>>>
>>>>>>> The pe is defined as follows:
>>>>>>>
>>>>>>> #qconf -sp lam_loose_rsh
>>>>>>> pe_name           lam_loose_rsh
>>>>>>> slots             4
>>>>>>> user_lists        NONE
>>>>>>> xuser_lists       NONE
>>>>>>> start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh $pe_hostfile
>>>>>>> stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
>>>>>>> allocation_rule   $round_robin
>>>>>>> control_slaves    FALSE
>>>>>>> job_is_first_task TRUE
>>>>>>> urgency_slots     min
>>>>>>>
>>>>>>
>>>>>> Okay, fine. As you use ROCKS, please change in the startlam.sh in
>>>>>> PeHostfile2MachineFile():
>>>>>>
>>>>>>           echo $host
>>>>>>
>>>>>> to
>>>>>>
>>>>>>           echo $host.local
>>>>>>
>>>>>> As we don't run ROCKS here, I don't know whether this is necessary.
>>>>>> Then just try, as outlined in the Howto, the included mpihello.c,
>>>>>> to test the distribution to the nodes (after shutting down the
>>>>>> started LAM universe). - Reuti
>>>>>>
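
For reference, a minimal sketch of what such a PeHostfile2MachineFile()
could look like after the change - this is not the exact script from the
Howto, it just shows where the .local suffix goes. It assumes the usual
$pe_hostfile format with the hostname in the first column and the granted
slots in the second:

PeHostfile2MachineFile()
{
   cat $1 | while read line; do
      # first column: hostname (strip any domain part),
      # second column: number of granted slots
      host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
      nslots=`echo $line | cut -f2 -d" "`
      # ROCKS compute nodes resolve as <name>.local
      # (the frontend entry may need different handling)
      echo "$host.local cpu=$nslots"
   done
}
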
>>>>>>
>>>>>>> Thanks so much,
>>>>>>> Srividya
>>>>>>>
>>>>>>> On 1/11/06, Srividya Valivarthi <srividya.v at gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>    I did define the PE for loose rsh using qmon, and also added
>>>>>>>> this PE to the queue list using the queue manager provided by qmon.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Srividya
>>>>>>>>
>>>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>>>>> Hi again.
>>>>>>>>>
>>>>>>>>> On 11.01.2006, at 19:34, Srividya Valivarthi wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>    Thanks for your prompt response. I am sorry if I was not
>>>>>>>>>> clear in the earlier mail. I did not start the lamd daemons by
>>>>>>>>>> hand prior to submitting the job. What I was trying to convey
>>>>>>>>>> was that the lamd daemons are running on the compute nodes,
>>>>>>>>>> possibly started by SGE itself, but somehow are not registered
>>>>>>>>>> with LAM/MPI??!!
>>>>>>>>>>
>>>>>>>>>>     And also the hostfile that is used during lamboot
>>>>>>>>>> #lamboot -v -ssi boot rsh hostfile
>>>>>>>>>
>>>>>>>>> lamboot will start the daemons, which isn't necessary. Also with
>>>>>>>>> a loose integration, SGE will start the daemons on its own (just
>>>>>>>>> by rsh in contrast to qrsh with a Tight Integration).
>>>>>>>>>
>>>>>>>>> LAM/MPI is in some way SGE-aware, and will look for some special
>>>>>>>>> information in the SGE-created directories on all the slave
>>>>>>>>> nodes.
>>>>>>>>>
>>>>>>>>> But anyway: how did you define the PE - loose with rsh or
>>>>>>>>> qrsh? -
>>>>>>>>> Reuti
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> is as follows, and it already has the .local suffix:
>>>>>>>>>> medusa.lab.ac.uab.edu cpu=4
>>>>>>>>>> compute-0-0.local cpu=4
>>>>>>>>>> compute-0-1.local cpu=4
>>>>>>>>>> compute-0-2.local cpu=4
>>>>>>>>>> compute-0-3.local cpu=4
>>>>>>>>>> compute-0-4.local cpu=4
>>>>>>>>>> compute-0-5.local cpu=4
>>>>>>>>>> compute-0-6.local cpu=4
>>>>>>>>>> compute-0-7.local cpu=4
>>>>>>>>>>
>>>>>>>>>> Any further ideas to solve this issue will be very helpful.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Srividya
>>>>>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On 11.01.2006, at 18:55, Srividya Valivarthi wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>     I am working with a Pentium III ROCKS cluster which has
>>>>>>>>>>>> LAM/MPI version 7.1.1 and SGE version 6.0. I am trying to get
>>>>>>>>>>>> the loose integration mechanism with rsh working with SGE and
>>>>>>>>>>>> LAM, as suggested by the following post on this mailing list:
>>>>>>>>>>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
>>>>>>>>>>>>
>>>>>>>>>>>> However, on submitting the jobs to the queue, I get the
>>>>>>>>>>>> following error message:
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> It seems that there is no lamd running on the host
>>>>>>>>>>>> compute-0-5.local.
>>>>>>>>>>>>
>>>>>>>>>>>> This indicates that the LAM/MPI runtime environment is not
>>>>>>>>>>>> operating.
>>>>>>>>>>>> The LAM/MPI runtime environment is necessary for the  
>>>>>>>>>>>> "mpirun"
>>>>>>>>>>>> command.
>>>>>>>>>>>>
>>>>>>>>>>>> Please run the "lamboot" command to start the LAM/MPI runtime
>>>>>>>>>>>> environment.  See the LAM/MPI documentation for how to invoke
>>>>>>>>>>>> "lamboot" across multiple machines.
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> But the lamnodes command shows all the nodes in the system,
>>>>>>>>>>>> and I can also see the lamd daemon running on the local compute
>>>>>>>>>>>> nodes.  Any ideas on what the issue could be are greatly
>>>>>>>>>>>> appreciated.
>>>>>>>>>>>
>>>>>>>>>>> there is no need to start up any daemon by hand beforehand. In
>>>>>>>>>>> fact, it will not work. SGE takes care of starting a private
>>>>>>>>>>> daemon on all the selected nodes for each particular job.
>>>>>>>>>>>
>>>>>>>>>>> One issue with ROCKS might be similar to this (change the start
>>>>>>>>>>> script to include .local for the nodes in the "machines" file):
>>>>>>>>>>>
>>>>>>>>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14170
>>>>>>>>>>>
>>>>>>>>>>> Just let me know whether it worked after adjusting the start
>>>>>>>>>>> script.
>>>>>>>>>>>
>>>>>>>>>>> -- Reuti
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Srividya
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>> <logsge-lam.txt>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



