[GE users] LAM/MPI and SGE : tight_integration

Reuti reuti at staff.uni-marburg.de
Wed Jan 11 18:34:36 GMT 2006


Hi,

On 11.01.2006 at 19:27, christophe.caron at jouy.inra.fr wrote:

> Hello,
>
>> Hi Christophe,
>>
>> Is the PE assigned to a queue?
>
> Yes, it was! Thanks for all the replies about this HOWTO, which seems to be
> obsolete now.
>
> So after a break I decided to look at the latest LAM/SGE HOWTO:
> http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
> I now have LAM 7.1.1 (instead of 7.0.2) and I've applied all the
> modifications in the "Tight integration using qrsh" section.
>
> My PE configuration:
> $ qconf -sp lam711
> pe_name           lam711

As you used a different name for this PE: did you also adjust the
lamd_wrapper to test against this name? - Reuti
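
To illustrate the kind of check I mean, here is a minimal sketch. It is not the HOWTO's actual lamd_wrapper; the variable EXPECTED_PE and the function pe_matches are names made up for this example. It only assumes that SGE exports the name of the granted PE to the job as $PE:

```shell
#!/bin/sh
# Hypothetical sketch, NOT the wrapper shipped with the HOWTO.
# EXPECTED_PE is an assumed name; set it to the pe_name shown by `qconf -sp`.
EXPECTED_PE="lam711"

# Returns 0 when the job runs under the expected PE, 1 otherwise.
# SGE sets $PE in the job environment for parallel jobs.
pe_matches() {
    [ "${PE:-}" = "$EXPECTED_PE" ]
}

if pe_matches; then
    echo "PE '$PE' matches: the tight-integration branch would be taken"
else
    echo "PE '${PE:-unset}' does not match '$EXPECTED_PE': wrapper would fall through"
fi
```

If the wrapper still compares against the PE name used in the HOWTO, a job submitted with -pe lam711 would end up in the second branch.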

> slots             8
> user_lists        test
> xuser_lists       deadlineusers
> start_proc_args   /opt/lam711-sge/lam_tight_qrsh/startlam.sh -catch_rsh \
>                   $pe_hostfile
> stop_proc_args    /opt/lam711-sge/lam_tight_qrsh/stoplam.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
>
> I'm using qrsh with ssh (qrsh ls /tmp works)
>
> Now it seems I have some problems dispatching jobs to all nodes.
> # qsub -pe lam711 8 test_lam.sh
> launches lamd on the first node:
>  /usr/local/public/lam/bin/lamd_binary -d -H 192.168.1.56 -P 32849 -n 0 -o 0 -sessionsuffix sge-122589-undefined
>
>
> But not on the other nodes, with this error:
> # more lam.err
> -----------------------------------------------------------------------------
> The selected RPI failed to initialize during MPI_INIT.  This is a
> fatal error; I must abort.
>
> This occurred on host n57 (n2).
> The PID of failed process was 12494 (MPI_COMM_WORLD rank: 4)
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code.  This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 11964 failed on node n0 (192.168.1.56) with exit status 1.
> -----------------------------------------------------------------------------
> mkdir: No such file or directory
>
>
>
> I've been searching again and again for some hours without any success
> (I had other problems, but this is the last one).
>
> Any clue?
>
> thanks
>
> cc
>
>
> Please note my new address: christophe.caron at jouy.inra.fr
>
> ***********************************************************
>  Christophe Caron - INRA
>  Mathematique, Informatique et Genome
>  Domaine de Vilvert 78350 Jouy-en-Josas
>  Web: http://migale.jouy.inra.fr/
>  Tel: 01-34-65-28-88  Email: christophe.caron at jouy.inra.fr  
> ***********************************************************
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
