[GE users] LAM/MPI and SGE : tight_integration

christophe.caron at jouy.inra.fr
Wed Jan 11 18:27:23 GMT 2006


Hello,

> Hi Christophe,
>
> Is the PE assigned to a queue?

Yes, it was! Thanks for all the replies about this HOWTO, which seems to
be obsolete now.

So after a break I decided to look at the latest LAM/SGE HOWTO:
http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
I now have LAM 7.1.1 (versus 7.0.2) and I have applied all the
modifications from the "Tight integration using qrsh" section.

My PE configuration:
$ qconf -sp lam711
pe_name           lam711
slots             8
user_lists        test
xuser_lists       deadlineusers
start_proc_args   /opt/lam711-sge/lam_tight_qrsh/startlam.sh -catch_rsh \
                   $pe_hostfile
stop_proc_args    /opt/lam711-sge/lam_tight_qrsh/stoplam.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
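
For reference, the $pe_hostfile handed to startlam.sh should list one
granted host per line. Below is a minimal sketch of a helper that prints
the hosts and slot counts SGE actually granted. It assumes the layout
described in sge_pe(5) (host, slots, queue, processor range); the function
name summarize_pe_hostfile is hypothetical and not part of the HOWTO
scripts.

```shell
#!/bin/sh
# Hypothetical helper (not part of the HOWTO scripts): summarize a PE hostfile.
# Assumes the sge_pe(5) layout, one line per host:
#   <hostname> <slots> <queue> <processor range>
summarize_pe_hostfile() {
    # Print "<host> <slots>" for each line, then the total number of slots.
    awk '{ printf "%s %d\n", $1, $2; total += $2 }
         END { printf "total %d\n", total }' "$1"
}
```

Inside a job script this could be called as
summarize_pe_hostfile "$PE_HOSTFILE" to see whether the 8 requested slots
are really spread over several nodes.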


I'm using qrsh with ssh (qrsh ls /tmp works)

Now I seem to have a problem dispatching jobs to all nodes.
# qsub -pe lam711 8 test_lam.sh
launches lamd on the first node only:
  /usr/local/public/lam/bin/lamd_binary -d -H 192.168.1.56 -P 32849 -n 0
    -o 0 -sessionsuffix sge-122589-undefined


But not on the other nodes, where I get this error:
# more lam.err
-----------------------------------------------------------------------------
The selected RPI failed to initialize during MPI_INIT.  This is a
fatal error; I must abort.

This occurred on host n57 (n2).
The PID of failed process was 12494 (MPI_COMM_WORLD rank: 4)
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 11964 failed on node n0 (192.168.1.56) with exit status 1.
-----------------------------------------------------------------------------
mkdir: No such file or directory



I've been searching again and again for hours without any success
(I had other problems, but this is the last one).

Any clue?

Thanks

cc


Please note my new address: christophe.caron at jouy.inra.fr

***********************************************************
  Christophe Caron - INRA
  Mathematique, Informatique et Genome
  Domaine de Vilvert 78350 Jouy-en-Josas
  Web: http://migale.jouy.inra.fr/
  Tel: 01-34-65-28-88  Email: christophe.caron at jouy.inra.fr 
***********************************************************
