[GE users] lam tight and host names

Davide Cittaro davide.cittaro at ifom-ieo-campus.it
Wed Feb 28 16:12:39 GMT 2007


Hi again, I've reinstalled everything according to

http://wiki.gridengine.info/wiki/index.php/Tight-LAM-Integration-Notes

The scripts and the PE are pretty much the same as the ones I used before... let's go on.
>
> can you please try to use mpirun directly instead of mpiexec?

I'm testing with mpihello, so that I can be sure we are all looking at something common.
Even using mpirun:

#!/bin/sh

#$ -S /bin/bash
#$ -cwd
#$ -N MPIHELLO

mpirun C ./mpihello

I have lamd daemons hanging on the SLAVE nodes. I also get this:

$ cat MPIHELLO.e3179
mpirun: cannot start ./mpihello on n1: invalid address tag

$ cat MPIHELLO.pe3179
-----------------------------------------------------------------------------
*** Oops -- cannot find the help that you're supposed to get.
*** Using the following help file:
***
***    /etc/lam-mpi/lam-helpfile
***
*** You were supposed to get help on the program "rhreq"
*** about the topic "timeout"
*** But it doesn't seem to be in that file.
***
*** Sorry!
-----------------------------------------------------------------------------
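To see which lamd daemons are left behind on a slave node, and whether the SGE job they belong to still exists, something like the following could be run from the master node. It is only a sketch: it assumes bash, that rsh to the nodes works (as in the `rsh node5... ps -efHwww` check quoted in this thread), that the session suffix has the form `sge-<job id>-<task id>` seen in that ps output, and that `qstat -j` exits non-zero for jobs that no longer exist. The function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for spotting leftover lamd daemons on one node.

# Read `ps -ef`-style lines on stdin and print "PID SESSION_SUFFIX"
# for every lamd process (field 2 of `ps -ef` output is the PID).
lamd_sessions() {
    awk '/\/lamd / {
        for (i = 1; i < NF; i++)
            if ($i == "-sessionsuffix") print $2, $(i + 1)
    }'
}

# Extract the SGE job ID from a suffix such as "sge-3171-undefined"
# (assumed format: sge-<job id>-<task id>); fail on anything else.
suffix_job_id() {
    case "$1" in
        sge-*-*) rest=${1#sge-}; echo "${rest%%-*}" ;;
        *) return 1 ;;
    esac
}

# Succeeds if qstat still knows the job (assumes a non-zero exit
# status from `qstat -j` for job IDs that do not exist).
job_exists() {
    qstat -j "$1" >/dev/null 2>&1
}

# Usage (node name taken from this thread):
#   rsh node5.sge.ifom-ieo-campus.it ps -efHwww | lamd_sessions |
#   while read -r pid suffix; do
#       id=$(suffix_job_id "$suffix") || continue
#       job_exists "$id" || echo "stale lamd: pid $pid, job $id"
#   done
```

A lamd whose job is already gone from qstat would point at the stop script (stoplam.sh) not being run, or not reaching the slave nodes.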

I'm googling to understand this a bit better.

>
> What were the complete command options to mpiexec - number of nodes or just C?
>

$ cat ../script_tests/mb-mpi/mbtest2.sh
#!/bin/bash

#$ -S /bin/bash
#$ -cwd -v PATH
#$ -pe lam_mpi 10
mpiexec /usr/bin/mb-mpi mb_comm.txt
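When comparing mpirun with mpiexec, it may help to log what SGE actually granted and what the LAM universe contains before the MPI program starts. A sketch, assuming the lam_mpi start script has already booted LAM (so `lamnodes` works inside the job); the slot count of 4 is an arbitrary example, and the `JOB_ID` guard is only there so the helper can be tried outside SGE.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N MPIHELLO
#$ -pe lam_mpi 4

# Print the allocation SGE handed to this job (NSLOTS and PE_HOSTFILE
# are set by SGE inside parallel jobs).
show_allocation() {
    echo "NSLOTS=${NSLOTS:-unset}"
    echo "PE_HOSTFILE=${PE_HOSTFILE:-unset}"
    if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
        cat "$PE_HOSTFILE"
    fi
}

show_allocation

# Only run the MPI part inside a real SGE job.
if [ -n "${JOB_ID:-}" ]; then
    lamnodes                          # nodes known to the LAM universe
    mpirun -np "$NSLOTS" ./mpihello   # explicit process count instead of "C"
fi
```

If `lamnodes` already disagrees with the PE hostfile, the problem is in the start script rather than in mpirun or mpiexec.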


> What is the name of your PE?
>

lam_mpi...

$ qconf -sp lam_mpi
pe_name           lam_mpi
slots             999
user_lists        s-comp
xuser_lists       l-bioinf t-xtal
start_proc_args   /opt/sge/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
stop_proc_args    /opt/sge/lam_tight_qrsh/stoplam.sh -catch_rsh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
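For reference, start_proc_args hands `$pe_hostfile` to startlam.sh, whose job is essentially to turn SGE's host list into a LAM boot schema before LAM is booted. The conversion step on its own could look like this. It is a sketch that assumes the usual pe_hostfile layout (host, slots, queue, processor range per line) and LAM's `host cpu=N` boot-schema syntax; it is not the actual code of startlam.sh, and the output path in the usage comment is hypothetical.

```shell
#!/bin/bash
# Convert an SGE pe_hostfile (columns: host slots queue processors)
# into a LAM boot schema with one "host cpu=N" line per host.
pe_to_lam_schema() {
    awk 'NF >= 2 && $2 ~ /^[0-9]+$/ { print $1, "cpu=" $2 }' "$1"
}

# Usage inside a start script (the real script may pass more options):
#   pe_to_lam_schema "$PE_HOSTFILE" > "$TMPDIR/lamhosts"
#   lamboot "$TMPDIR/lamhosts"
```

Comparing this output with the hosts that actually have a lamd running is a quick check of whether the start script saw the same allocation as the job.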

Is "lam_tight_qrsh" somehow hard-coded into the integration scripts?
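One quick way to check is to search the integration directory for the name. A minimal sketch; the function name is hypothetical, and a hit only shows that the string appears in a file, not whether changing it would break anything.

```shell
#!/bin/bash
# List every line (file:line:text) under a directory that mentions
# a fixed string, e.g. absolute paths baked into PE start/stop scripts.
find_hardcoded() {
    dir=$1
    needle=$2
    grep -rnF -- "$needle" "$dir"
}

# Usage:
#   find_hardcoded /opt/sge/lam_tight_qrsh lam_tight_qrsh
```

Searching the whole directory (not just the *.sh files) also catches any rsh wrapper that -catch_rsh relies on.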

d

> -- Reuti
>
>
>> [..]
>>
>> but
>>
>> rsh node5.sge.ifom-ieo-campus.it ps -efHwww
>> [..]
>> dcittaro  5981     1  0 15:14 ?        00:00:00   /usr/bin/lamd -H 85.239.175.25 -P 50589 -n 8 -o 0 -sessionsuffix sge-3171-undefined
>> dcittaro  5992     1  0 15:16 ?        00:00:00   /usr/bin/lamd -H 85.239.175.21 -P 40954 -n 8 -o 0 -sessionsuffix sge-3172-undefined
>>
>> and so on for each SLAVE node. It seems that my process has been spawned but not started...
>>
>> d
>> /*
>> Davide Cittaro
>> HPC and Bioinformatics Systems @ Informatics Core
>>
>> IFOM - Istituto FIRC di Oncologia Molecolare
>> via adamello, 16
>> 20139 Milano
>> Italy
>>
>> tel.: +39(02)574303007
>> e-mail: davide.cittaro at ifom-ieo-campus.it
>> */
>>
>>
>





