[GE users] lam tight and host names

Reuti reuti at staff.uni-marburg.de
Wed Feb 28 16:51:11 GMT 2007


On 28.02.2007 at 17:12, Davide Cittaro wrote:

> Hi again, I've reinstalled everything according to
>
> http://wiki.gridengine.info/wiki/index.php/Tight-LAM-Integration-Notes
>
> the scripts and the PE are pretty much the same as the ones I used... let's go on..
>>
>> can you please try to use mpirun directly instead of mpiexec?
>
> I'm testing with mpihello, so that I can be sure we are all looking
> at something common.
> Even using mpirun:
>
> #!/bin/sh
>
> #$ -S /bin/bash
> #$ -cwd
> #$ -N MPIHELLO
>
> mpirun C ./mpihello
>
> I have lamd daemons hanging on the SLAVE nodes. I also get this:
>
> $ cat MPIHELLO.e3179
> mpirun: cannot start ./mpihello on n1: invalid address tag
>
> $ cat MPIHELLO.pe3179
> -----------------------------------------------------------------------------
> *** Oops -- cannot find the help that you're supposed to get.
> *** Using the following help file:
> ***
> ***    /etc/lam-mpi/lam-helpfile
> ***
> *** You were supposed to get help on the program "rhreq"
> *** about the topic "timeout"
> *** But it doesn't seem to be in that file.
> ***
> *** Sorry!
> -----------------------------------------------------------------------------
>
> I'm googling to understand a bit more
>
>>
>> What were the complete command options to mpiexec - number of  
>> nodes or just C?
>>
>
> $ cat ../script_tests/mb-mpi/mbtest2.sh
> #/bin/bash
>
> #$ -S /bin/bash
> #$ -cwd -v PATH
> #$ -pe lam_mpi 10
> mpiexec /usr/bin/mb-mpi mb_comm.txt
>
>
>> What is the name of your PE?
>>
>
> lam_mpi...
>
> $ qconf -sp lam_mpi
> pe_name           lam_mpi
> slots             999
> user_lists        s-comp
> xuser_lists       l-bioinf t-xtal
> start_proc_args   /opt/sge/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
> stop_proc_args    /opt/sge/lam_tight_qrsh/stoplam.sh -catch_rsh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> Is "lam_tight_qrsh" somehow hardcoded into the integration scripts?

Yes, in the lamd_wrapper. You have to adjust it to your PE name.
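
One way to make that adjustment could look like the sketch below. The helper name retarget_pe is hypothetical, and the path in the usage line is only an assumption based on the start_proc_args directory shown in the PE definition; check where lamd_wrapper actually lives on your installation. It also assumes the PE names contain only letters, digits and underscores, since they are passed to sed as a pattern.

```shell
#!/bin/sh
# Hypothetical helper (not part of the integration scripts):
# replace a hardcoded PE name in a wrapper script.
# Usage: retarget_pe OLD_NAME NEW_NAME FILE
retarget_pe() {
    old=$1
    new=$2
    file=$3

    # Print the lines that mention the old name so the change can be reviewed.
    # If nothing matches, stop without touching the file.
    grep -n -- "$old" "$file" || {
        echo "no occurrence of $old in $file" >&2
        return 1
    }

    # Keep a backup as FILE.orig, then rewrite FILE from the backup.
    # Assumes OLD_NAME and NEW_NAME have no characters special to sed.
    cp -p -- "$file" "$file.orig" &&
        sed "s/$old/$new/g" "$file.orig" > "$file"
}

# Example call (the path is an assumption, adjust as needed):
#   retarget_pe lam_tight_qrsh lam_mpi /opt/sge/lam_tight_qrsh/lamd_wrapper
```

The backup copy makes it easy to diff the result against the original before submitting the next test job.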

-- Reuti


>
> d
>
>> -- Reuti
>>
>>
>>> [..]
>>>
>>> but
>>>
>>> rsh node5.sge.ifom-ieo-campus.it ps -efHwww
>>> [..]
>>> dcittaro  5981     1  0 15:14 ?        00:00:00   /usr/bin/lamd -H 85.239.175.25 -P 50589 -n 8 -o 0 -sessionsuffix sge-3171-undefined
>>> dcittaro  5992     1  0 15:16 ?        00:00:00   /usr/bin/lamd -H 85.239.175.21 -P 40954 -n 8 -o 0 -sessionsuffix sge-3172-undefined
>>>
>>> and so on for each SLAVE node. It seems that my process has been  
>>> spawned but not started...
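
To see at a glance which jobs left a lamd behind, the ps output above can be reduced to PID and session suffix. This is only a rough sketch: list_lamd_sessions is a hypothetical helper, and it assumes `ps -ef` style columns (PID in the second field) as in the listing above. The suffix appears to encode the job id and the array task id, with "undefined" for a non-array job.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` style lines on stdin and print
# "PID SESSION_SUFFIX" for every lamd process started with -sessionsuffix.
list_lamd_sessions() {
    awk '
        # Match "lamd" as a whole command name, with or without a path.
        $0 ~ "(^|[ /])lamd( |$)" {
            for (i = 1; i < NF; i++)
                if ($i == "-sessionsuffix") {
                    # Field 2 is the PID in ps -ef output.
                    print $2, $(i + 1)
                    break
                }
        }'
}

# Example call, using the command from the message above:
#   rsh node5.sge.ifom-ieo-campus.it ps -efHwww | list_lamd_sessions
```

The job ids in the suffix can then be compared with what qstat still lists, to tell stale daemons from those of running jobs.
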
>>>
>>> d
>>> /*
>>> Davide Cittaro
>>> HPC and Bioinformatics Systems @ Informatics Core
>>>
>>> IFOM - Istituto FIRC di Oncologia Molecolare
>>> via adamello, 16
>>> 20139 Milano
>>> Italy
>>>
>>> tel.: +39(02)574303007
>>> e-mail: davide.cittaro at ifom-ieo-campus.it
>>> */
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>
