[GE users] Re: LAM SGE Integration issues with rocks 4.1

Reuti reuti at staff.uni-marburg.de
Wed Jan 18 21:34:58 GMT 2006


Am 18.01.2006 um 22:15 schrieb Srividya Valivarthi:

<snip>

>> Now submit a test job with:
>>
>> #!/bin/sh
>> lamnodes
>> exit 0
>>
>> and request the LAM PE as you did below (with a different number of
>> requested slots). In the .po file you should just find the LAM
>> copyright notice twice, and in the .o file a confirmation of the
>> selected nodes.
>>
>> It might be necessary to put a line like:
>>
>> export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
>>
>> in your .profile and/or .bashrc (of course with your actual location
>> of the LAM installation).
>>
>> If we got this - we go to the next step. - Reuti
>>
>>
>
> I get the .o and the .po files as follows:
> [srividya at medusa ~]$ cat simple.script.o96
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> n0      compute-0-7.local:1:origin,this_node
> n1      compute-0-3.local:1:
> [srividya at medusa ~]$ cat simple.script.po96
> /opt/gridengine/default/spool/compute-0-7/active_jobs/96.1/pe_hostfile
> compute-0-7.local
> compute-0-3.local
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
> ----
> I hope this is fine.

This is perfect!

BTW: What are your "shell_start_mode" and "shell" settings in the
queue definition?

Unless you need the hard-coded csh, I'd suggest setting them to:

shell                 /bin/sh
shell_start_mode      unix_behavior

This way the first line of the script is honored and you get the /bin/sh
specified there (the "Warning: no access to tty (Bad file descriptor)"
message comes from csh, AFAIK).
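
A minimal sketch of how to apply this, assuming the usual default
cluster queue name all.q (use your actual queue name):

   # open the cluster queue configuration in an editor and
   # change the "shell" and "shell_start_mode" lines shown above
   qconf -mq all.q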

>
>>>>> 6) Now, have defined the script file as follows
>>>>>       [srividya at medusa ~]$ cat tester1.sh
>>>>> 		#!/bin/sh
>>>>> 		/opt/lam/gnu/bin/mpirun C /home/srividya/mpihello
>>>>>

As we now know that the job-specific LAM universe is started by SGE:
are you still getting errors here?
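
If the liblamf77mpi.so.0 error quoted below persists, it is most likely
the library search path on the slave nodes. Analogous to the PATH line
above, a minimal sketch (assuming LAM lives under /opt/lam/gnu - use
your actual location) would be to add to your .bashrc:

   # let the rsh-started processes on the compute nodes find the LAM libraries
   export LD_LIBRARY_PATH=/opt/lam/gnu/lib:$LD_LIBRARY_PATH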

- Reuti


>>>>> 7) On running the script file as follows
>>>>>        [srividya at medusa ~]$ qsub -pe lam_loose_rsh 2 tester1.sh
>>>>> 		Your job 79 ("tester1.sh") has been submitted.
>>>>> 	[srividya at medusa ~]$ qstat
>>>>> 	job-ID  prior   name       user         state submit/start at     queue  slots ja-task-ID
>>>>> 	------------------------------------------------------------------------------------------
>>>>> 	    79 0.00000 tester1.sh srividya     qw    01/18/2006 09:37:12            2
>>>>>
>>>>> 8) And obtain the following results in tester1.sh.e79:
>>>>>
>>>>>      [srividya at medusa ~]$ cat tester1.sh.e79
>>>>> 	/home/srividya/mpihello: error while loading shared libraries:
>>>>> liblamf77mpi.so.0: cannot open shared object file: No such file or directory
>>>>> -----------------------------------------------------------------------------
>>>>> It seems that [at least] one of the processes that was started with
>>>>> mpirun did not invoke MPI_INIT before quitting (it is possible that
>>>>> more than one process did not invoke MPI_INIT -- mpirun was only
>>>>> notified of the first one, which was on node n0).
>>>>>
>>>>> mpirun can *only* be used with MPI programs (i.e., programs that
>>>>> invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
>>>>> to run non-MPI programs over the lambooted nodes.
>>>>> -----------------------------------------------------------------------------
>>>>> /home/srividya/mpihello: error while loading shared libraries:
>>>>> liblamf77mpi.so.0: cannot open shared object file: No such file or directory
>>>>>
>>>>> I am not sure why the path information is not being read by SGE...
>>>>> The LD_LIBRARY_PATH env variable has the required path... Is there
>>>>> something else that I am missing?
>>>>>
>>>>> 9) On changing the script to sge.lam.script as follows (the only
>>>>> difference being the LAM_MPI_SOCKET_SUFFIX):
>>>>>    #cat sge.lam.script
>>>>>    #!/bin/sh
>>>>>    #$ -N mpihello
>>>>>    #$ -cwd
>>>>>    #$ -j y
>>>>>    #
>>>>>    # pe request for LAM. Set your number of processors here.
>>>>>    #$ -pe lam_loose_rsh 2
>>>>>    #
>>>>>    # Run job through bash shell
>>>>>    #$ -S /bin/bash
>>>>>    # This MUST be in your LAM run script, otherwise
>>>>>    # multiple LAM jobs will NOT RUN
>>>>>    export LAM_MPI_SOCKET_SUFFIX=$JOB_ID.$JOB_NAME
>>>>>    #
>>>>>    # Use full pathname to make sure we are using the right mpirun
>>>>>    /opt/lam/gnu/bin/mpirun -np $NSLOTS /home/srividya/mpihello
>>>>>
>>>>> 10) And submitting it to the queue:
>>>>>         #qsub sge.lam.script
>>>>>
>>>>> 11) I obtain the following error message:
>>>>>         [srividya at medusa ~]$ cat mpihello.o80
>>>>> -----------------------------------------------------------------------------
>>>>> It seems that there is no lamd running on the host compute-0-6.local.
>>>>>
>>>>> This indicates that the LAM/MPI runtime environment is not operating.
>>>>> The LAM/MPI runtime environment is necessary for the "mpirun" command.
>>>>>
>>>>> Please run the "lamboot" command to start the LAM/MPI runtime
>>>>> environment.  See the LAM/MPI documentation for how to invoke
>>>>> "lamboot" across multiple machines.
>>>>> -----------------------------------------------------------------------------
>>>>>
>>>>> And this is the message that I was sending out earlier. I am new to
>>>>> the SGE-LAM environment, and thanks so much for your patience. Any
>>>>> help will be greatly appreciated.
>>>>>
>>>>> Thanks,
>>>>> Srividya
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>> Am 11.01.2006 um 20:45 schrieb Srividya Valivarthi:
>>>>>>
>>>>>>> The change in the startlam.sh from
>>>>>>> echo host
>>>>>>> to
>>>>>>> echo host.local
>>>>>>>
>>>>>>> after stopping and booting the lamuniverse does not seem to  
>>>>>>> solve
>>>>>>> the
>>>>>>
>>>>>> No - stop the lamuniverse. Don't boot it by hand! Just start a
>>>>>> parallel job like the mpihello.c I mentioned, and post the error/log
>>>>>> files of this job. Is your rsh connection between the nodes also
>>>>>> working for a passwordless invocation? - Reuti
>>>>>>
>>>>>>> problem either..
>>>>>>>
>>>>>>> Thanks again,
>>>>>>> Srividya
>>>>>>>
>>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>>>> Am 11.01.2006 um 19:53 schrieb Srividya Valivarthi:
>>>>>>>>
>>>>>>>>> The pe is defined as follows:
>>>>>>>>>
>>>>>>>>> #qconf -sp lam_loose_rsh
>>>>>>>>> pe_name           lam_loose_rsh
>>>>>>>>> slots             4
>>>>>>>>> user_lists        NONE
>>>>>>>>> xuser_lists       NONE
>>>>>>>>> start_proc_args   /home/srividya/scripts/lam_loose_rsh/startlam.sh \
>>>>>>>>>                   $pe_hostfile
>>>>>>>>> stop_proc_args    /home/srividya/scripts/lam_loose_rsh/stoplam.sh
>>>>>>>>> allocation_rule   $round_robin
>>>>>>>>> control_slaves    FALSE
>>>>>>>>> job_is_first_task TRUE
>>>>>>>>> urgency_slots     min
>>>>>>>>>
>>>>>>>>
>>>>>>>> Okay, fine. As you use ROCKS, please change in startlam.sh, in
>>>>>>>> PeHostfile2MachineFile():
>>>>>>>>
>>>>>>>>           echo $host
>>>>>>>>
>>>>>>>> to
>>>>>>>>
>>>>>>>>           echo $host.local
>>>>>>>>
>>>>>>>> As we have no ROCKS, I don't know whether this is necessary. Then
>>>>>>>> just try as outlined in the Howto with the included mpihello.c,
>>>>>>>> just to test the distribution to the nodes (after shutting down
>>>>>>>> the started LAM universe). - Reuti
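
For context, PeHostfile2MachineFile() in startlam.sh converts the SGE
pe_hostfile into a machine file for LAM. A rough sketch of the usual
pe_hostfile-parsing pattern (details may differ in your copy of the
howto's script, e.g. it may emit "cpu=N" counts instead of one line per
slot), with the ROCKS change applied:

   PeHostfile2MachineFile()
   {
      # $1 is the path to the pe_hostfile passed via start_proc_args
      cat $1 | while read line; do
         # first field: host name (domain stripped), second field: slot count
         host=`echo $line | cut -f1 -d" " | cut -f1 -d"."`
         nslots=`echo $line | cut -f2 -d" "`
         i=1
         while [ $i -le $nslots ]; do
            echo $host.local      # was: echo $host
            i=`expr $i + 1`
         done
      done
   }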
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks so much,
>>>>>>>>> Srividya
>>>>>>>>>
>>>>>>>>> On 1/11/06, Srividya Valivarthi <srividya.v at gmail.com> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>    I did define the PE for loose rsh using qmon, and also added
>>>>>>>>>> this PE to the queue list using the queue manager provided by
>>>>>>>>>> qmon.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Srividya
>>>>>>>>>>
>>>>>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>>>>>>> Hi again.
>>>>>>>>>>>
>>>>>>>>>>> Am 11.01.2006 um 19:34 schrieb Srividya Valivarthi:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>    Thanks for your prompt response. I am sorry if I was not
>>>>>>>>>>>> clear in the earlier mail. I did not start the lamd daemons by
>>>>>>>>>>>> hand prior to submitting the job. What I was trying to convey
>>>>>>>>>>>> was that the lamd daemons are running on the compute nodes,
>>>>>>>>>>>> possibly started by SGE itself, but somehow are not registered
>>>>>>>>>>>> with LAM/MPI?
>>>>>>>>>>>>
>>>>>>>>>>>>     And also the hostfile that is used during lamboot
>>>>>>>>>>>> #lamboot -v -ssi boot rsh hostfile
>>>>>>>>>>>
>>>>>>>>>>> lamboot will start the daemons, which isn't necessary. Also with
>>>>>>>>>>> a loose integration, SGE will start the daemons on its own (just
>>>>>>>>>>> by rsh, in contrast to qrsh with a Tight Integration).
>>>>>>>>>>>
>>>>>>>>>>> LAM/MPI is in some way SGE-aware, and will look for some special
>>>>>>>>>>> information in the SGE-created directories on all the slave
>>>>>>>>>>> nodes.
>>>>>>>>>>>
>>>>>>>>>>> But anyway: how did you define the PE - loose with rsh or qrsh?
>>>>>>>>>>> - Reuti
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> is as follows, and it already has the .local suffix:
>>>>>>>>>>>> medusa.lab.ac.uab.edu cpu=4
>>>>>>>>>>>> compute-0-0.local cpu=4
>>>>>>>>>>>> compute-0-1.local cpu=4
>>>>>>>>>>>> compute-0-2.local cpu=4
>>>>>>>>>>>> compute-0-3.local cpu=4
>>>>>>>>>>>> compute-0-4.local cpu=4
>>>>>>>>>>>> compute-0-5.local cpu=4
>>>>>>>>>>>> compute-0-6.local cpu=4
>>>>>>>>>>>> compute-0-7.local cpu=4
>>>>>>>>>>>>
>>>>>>>>>>>> Any further ideas to solve this issue will be very helpful.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Srividya
>>>>>>>>>>>> On 1/11/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am 11.01.2006 um 18:55 schrieb Srividya Valivarthi:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     I am working with a Pentium III ROCKS cluster which has
>>>>>>>>>>>>>> LAM/MPI version 7.1.1 and SGE version 6.0. I am trying to get
>>>>>>>>>>>>>> the loose integration mechanism with rsh working with SGE and
>>>>>>>>>>>>>> LAM as suggested by the following post on this mailing list:
>>>>>>>>>>>>>> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, on submitting the jobs to the queue, I get the
>>>>>>>>>>>>>> following error message:
>>>>>>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>>>>>>> It seems that there is no lamd running on the host
>>>>>>>>>>>>>> compute-0-5.local.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This indicates that the LAM/MPI runtime environment is not
>>>>>>>>>>>>>> operating.  The LAM/MPI runtime environment is necessary for
>>>>>>>>>>>>>> the "mpirun" command.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please run the "lamboot" command to start the LAM/MPI runtime
>>>>>>>>>>>>>> environment.  See the LAM/MPI documentation for how to invoke
>>>>>>>>>>>>>> "lamboot" across multiple machines.
>>>>>>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>>>>>>> But the lamnodes command shows all the nodes on the system,
>>>>>>>>>>>>>> and I can also see the lamd daemon running on the local
>>>>>>>>>>>>>> compute nodes. Any ideas on what the issue could be are
>>>>>>>>>>>>>> greatly appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> there is no need to start up any daemon on your own by hand
>>>>>>>>>>>>> beforehand. In fact, it will not work. SGE takes care of
>>>>>>>>>>>>> starting a private daemon for each job on all the selected
>>>>>>>>>>>>> nodes for this particular job.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One issue with ROCKS might be similar to this (change the
>>>>>>>>>>>>> start script to include .local for the nodes in the
>>>>>>>>>>>>> "machines" file):
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14170
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just let me know whether it worked after adjusting the start
>>>>>>>>>>>>> script.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Srividya
>>>>>>>>>>>>>>
>>>> <logsge-lam.txt>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list