[GE users] Open MPI tight integration in HOWTO page

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Fri Feb 2 10:20:13 GMT 2007


On Thu, 1 Feb 2007, Reuti wrote:

> Hi Todd,
>
> On 01.02.2007 at 22:57, Heywood, Todd wrote:
>
>> Hi Andreas,
>>
>> I set up a simple test parallel environment (adapting
>> $SGE_ROOT/mpi/startmpi.sh to simply create a machines file under $TMPDIR),
>> and ran the script you sent me, which is:
>>
>> #!/bin/sh
>> #$ -S /bin/sh
>>
>> echo "starting $NSLOTS at host `hostname`"
>>
>> I=1
>> while [ $I -le $NSLOTS ]; do
>>    host=`sed -n "${I}p" $TMPDIR/machines`
>>    cmd="$SGE_ROOT/bin/$ARC/qrsh -nostdin -noshell -inherit $host sleep 2"
>>    echo $cmd
>>    $cmd &
> It's not advisable to use & inside an SGE job.

True, but in this case it shouldn't be an issue thanks to the 'wait' after the loop.
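
As a minimal sketch of why the trailing 'wait' makes this safe: every task is started in the background and the script then blocks until all of them have exited, so nothing is left running when the job script ends. Plain sleep commands stand in for the qrsh calls here, and the function name run_tasks is made up for illustration.

```shell
#!/bin/sh
# Sketch only: "sleep 1" is a stand-in for the qrsh -inherit calls,
# and run_tasks is a hypothetical helper name.
run_tasks() {
    ntasks=$1
    I=1
    while [ "$I" -le "$ntasks" ]; do
        # Start each task in the background, as "$cmd &" does in the job script.
        ( sleep 1; echo "task $I done" ) &
        I=`expr $I + 1`
    done
    # Block until every background child of this shell has exited.
    wait
    echo "all tasks finished"
}

run_tasks 3
```

The "task N done" lines may appear in any order, but "all tasks finished" is only printed after the last background task has exited.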

Regards,
Andreas


>
> If you are accessing all nodes with qrsh -inherit, be sure to set 
> "job_is_first_task FALSE". Otherwise the local qrsh isn't allowed by SGE.
>>    I=`expr $I + 1`
>> done
>> wait
>>
>> exit 0
>>
>> I get the same "GMSH" error messages. Here's the output for a 16-way job:
>>
>> [heywood at bhmnode2 test]$ ls *82*
>> -rw-r--r--  1 heywood itstaff  963 Feb  1 16:20 testsge.sh.e8292
>> -rw-r--r--  1 heywood itstaff 1229 Feb  1 16:20 testsge.sh.o8292
>> -rw-r--r--  1 heywood itstaff    0 Feb  1 16:20 testsge.sh.pe8292
>> -rw-r--r--  1 heywood itstaff  210 Feb  1 16:20 testsge.sh.po8292
>> [heywood at bhmnode2 test]$ more testsge.sh.e8292
>> error: commlib error: can't read general message size header (GMSH) (closing "blade193.bluehelix.cshl.edu/execd/1")
>>
> Are you using any special communication library (Myrinet, InfiniBand, ...)?
>>
>> I've also installed MPICH2 (SMPD daemon-based), and found some issues with
>> the .smpd configuration file being corrupted when too many tasks access it
>> at once. Having solved that to run large jobs successfully outside of SGE,
>> I tried Reuti's tight integration, and found that smpd daemons hang when
>> started with the "-d 0" option (hard coded into the start_mpich2 program).
>> But that's another story.
> I got the -d 0 option from the MPICH2 developers. Its purpose is to keep
> the daemons from forking (and thereby leaving the process tree). The forking
> is instead handled by the start_mpich2 program. So: what exactly do you mean
> by "hang"? After the daemons are started by the PE start_proc_args, they
> should stay there, still bound to the shepherd, and wait for connections.
>
> -- Reuti
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
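
For reference, the "job_is_first_task FALSE" setting mentioned above belongs to the parallel environment (PE) definition. What follows is a hedged sketch of such a definition written out from a shell script. The PE name testpe, the slot count and the script paths are placeholders rather than values from this thread, and the exact set of fields can differ between Grid Engine versions.

```shell
#!/bin/sh
# Sketch only: PE name, slot count and script paths are hypothetical.
write_pe_file() {
    # Quoted 'EOF' keeps $pe_hostfile and $round_robin literal for SGE.
    cat > "$1" <<'EOF'
pe_name            testpe
slots              64
user_lists         NONE
xuser_lists        NONE
start_proc_args    /path/to/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /path/to/sge/mpi/stopmpi.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
EOF
}

pe_file=${TMPDIR:-/tmp}/testpe.$$
write_pe_file "$pe_file"
# On a real cluster an administrator could then register the PE with:
#   qconf -Ap "$pe_file"
```

control_slaves TRUE is what permits the qrsh -inherit calls, and job_is_first_task FALSE means the job script itself does not count against the granted slots, so a qrsh to the local host is still allowed.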
