[GE users] Open MPI tight integration in HOWTO page

Reuti reuti at staff.uni-marburg.de
Thu Feb 1 22:46:22 GMT 2007


Hi Todd,

On 01.02.2007 at 22:57, Heywood, Todd wrote:

> Hi Andreas,
>
>
>
> I set up a simple test parallel environment (adapting
> $SGE_ROOT/mpi/startmpi.sh to simply create a machines file under
> $TMPDIR), and ran the script you sent me, which is:
>
>
>
>
>
> #!/bin/sh
>
> #$ -S /bin/sh
>
>
>
> echo "starting $NSLOTS at host `hostname`"
>
>
>
> I=1
>
> while [ $I -le $NSLOTS ]; do
>
>     host=`sed -n "${I}p" $TMPDIR/machines`
>
>     cmd="$SGE_ROOT/bin/$ARC/qrsh -nostdin -noshell -inherit $host sleep 2"
>
>     echo $cmd
>
>     $cmd &
It's not advisable to use & inside an SGE job.

If you are accessing all nodes with qrsh -inherit, be sure to set  
"job_is_first_task FALSE". Otherwise the local qrsh isn't allowed by  
SGE.
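
A minimal sketch of how one might verify those PE settings, assuming the PE definition is piped in from "qconf -sp <pe_name>". The function name check_pe is made up for illustration, and the control_slaves check reflects my understanding that qrsh -inherit also needs "control_slaves TRUE":

```shell
#!/bin/sh
# check_pe (hypothetical helper): reads a PE definition, as printed by
# "qconf -sp <pe_name>", on stdin and reports settings that would prevent
# starting every task with "qrsh -inherit".
check_pe() {
    awk '
        $1 == "control_slaves"    { cs = toupper($2) }
        $1 == "job_is_first_task" { jf = toupper($2) }
        END {
            ok = 1
            if (cs != "TRUE")  { print "control_slaves should be TRUE";     ok = 0 }
            if (jf != "FALSE") { print "job_is_first_task should be FALSE"; ok = 0 }
            exit (ok ? 0 : 1)
        }'
}

# Usage (the PE name is a placeholder):
#   qconf -sp mytestpe | check_pe
```

The function prints nothing and returns 0 when both settings are as expected; otherwise it names the offending setting and returns 1.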
>     I=`expr $I + 1`
>
> done
>
> wait
>
>
>
> exit 0
>
>
>
>
>
> I get the same "GMSH" error messages. Here's the output for a
> 16-way job:
>
>
>
> [heywood@bhmnode2 test]$ ls *82*
>
> -rw-r--r--  1 heywood itstaff  963 Feb  1 16:20 testsge.sh.e8292
>
> -rw-r--r--  1 heywood itstaff 1229 Feb  1 16:20 testsge.sh.o8292
>
> -rw-r--r--  1 heywood itstaff    0 Feb  1 16:20 testsge.sh.pe8292
>
> -rw-r--r--  1 heywood itstaff  210 Feb  1 16:20 testsge.sh.po8292
>
> [heywood@bhmnode2 test]$ more testsge.sh.e8292
>
> error: commlib error: can't read general message size header (GMSH) (closing "blade193.bluehelix.cshl.edu/execd/1")
>
>
Are you using any special communication library (Myrinet, InfiniBand, ...)?
>
>
>
>
> I've also installed MPICH2 (SMPD daemon-based), and found some
> issues with the .smpd configuration file being corrupted when too
> many tasks access it at once. Having solved that to run large jobs
> successfully outside of SGE, I tried Reuti's tight integration, and
> found that smpd daemons hang when started with the "-d 0" option
> (hard coded into the start_mpich2 program). But that's another story.
The -d 0 option I got from the MPICH2 developers. Its purpose is to
keep the daemons from forking (and thereby leaving the process tree). The
forking is instead handled by the start_mpich2 program. So: what exactly
do you mean by "hang"? After the daemons are started by the
PE start_proc_args, they should stay there, still bound to the
shepherd, and wait for connections.
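
One way to check whether a daemon is still bound to the shepherd is to walk its parent chain in a process listing. The following is only a sketch: the function name has_ancestor is made up, it expects "ps -eo pid,ppid,comm" output on stdin, and it assumes the shepherd shows up with a command name starting with "sge_shepherd".

```shell
#!/bin/sh
# has_ancestor PID NAME (hypothetical helper): reads "ps -eo pid,ppid,comm"
# output on stdin and returns 0 if some ancestor of PID has a command
# name starting with NAME, 1 otherwise.
has_ancestor() {
    awk -v pid="$1" -v name="$2" '
        NR > 1 { parent[$1] = $2; comm[$1] = $3 }   # skip the header line
        END {
            # Walk up the parent chain; the hop limit guards against loops.
            for (hops = 0; (pid in parent) && hops < 1000; hops++) {
                pid = parent[pid]
                if (index(comm[pid], name) == 1) exit 0
            }
            exit 1
        }'
}

# Usage on an execution host (the PID is a placeholder):
#   ps -eo pid,ppid,comm | has_ancestor 12345 sge_shepherd && echo "bound"
```

A daemon that has forked itself away from the shepherd would typically show init (PID 1) as its parent, and the function would return 1 for it.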

-- Reuti

