[GE users] Open MPI tight integration in HOWTO page

Heywood, Todd heywood at cshl.edu
Fri Feb 2 19:50:37 GMT 2007


Hi,

If you recall, I had two classes of errors: (1) the GMSH error while jobs
still produce output, and (2) complete failure once the number of MPI tasks
is large enough, sometimes with a grab bag of error messages (see the first
post in this thread), and sometimes with no output at all but qstat reporting
"critical error: unrecoverable error - contact systems manager. Aborted".
The second case might be related to LDAP, as I found the following messages
in /var/log/messages on the job nodes:

Feb  2 14:06:16 blade183 sge_execd: nss_ldap: reconnecting to LDAP server...
Feb  2 14:06:16 blade183 sge_execd: nss_ldap: reconnected to LDAP server after 1 attempt(s)
Feb  2 14:06:16 blade183 sge_shepherd-9194: nss_ldap: reconnecting to LDAP server...
Feb  2 14:06:16 blade183 sge_shepherd-9194: nss_ldap: reconnected to LDAP server after 1 attempt(s)
Feb  2 14:06:17 blade183 sge_shepherd-9194: nss_ldap: reconnecting to LDAP server...
Feb  2 14:06:17 blade183 sge_shepherd-9194: nss_ldap: reconnected to LDAP server after 1 attempt(s)
Feb  2 14:07:19 blade183 sge_shepherd-9194: nss_ldap: reconnecting to LDAP server...
Feb  2 14:07:19 blade183 sge_shepherd-9194: nss_ldap: reconnected to LDAP server after 1 attempt(s)

Googling LDAP together with various cluster/MPI/scalability terms turns up
nothing.
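
If the failures really are LDAP lookups timing out under the burst of
simultaneous shepherd/qrsh starts, one possible mitigation (my assumption,
not something confirmed anywhere in this thread) would be to cache NSS
lookups with nscd on the execution nodes and soften nss_ldap's retry
behaviour in /etc/ldap.conf, along these lines (illustrative values only;
the nss_reconnect_* options need a reasonably recent nss_ldap):

    # /etc/ldap.conf (nss_ldap) -- illustrative values only
    bind_policy soft           # fail soft instead of retrying the bind indefinitely
    bind_timelimit 10          # seconds to wait for the bind to the LDAP server
    nss_reconnect_tries 2      # limit reconnect attempts
    nss_reconnect_sleeptime 1  # seconds between reconnect attempts

    # cache passwd/group lookups locally on each node (RHEL-style commands)
    service nscd start
    chkconfig nscd on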

Todd


-----Original Message-----
From: Andreas.Haas at Sun.COM [mailto:Andreas.Haas at Sun.COM] 
Sent: Friday, February 02, 2007 5:20 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Open MPI tight integration in HOWTO page

On Thu, 1 Feb 2007, Reuti wrote:

> Hi Todd,
>
> Am 01.02.2007 um 22:57 schrieb Heywood, Todd:
>
>> Hi Andreas,
>> 
>> 
>> 
>> I set up a simple test parallel environment (adapting
>> $SGE_ROOT/mpi/startmpi.sh to simply create a machines file under $TMPDIR;
>> a minimal sketch of such a start script follows the job script below), and
>> ran the script you sent me, which is:
>>
>> #!/bin/sh
>> #$ -S /bin/sh
>>
>> echo "starting $NSLOTS at host `hostname`"
>>
>> I=1
>> while [ $I -le $NSLOTS ]; do
>>    host=`sed -n "${I}p" $TMPDIR/machines`
>>    cmd="$SGE_ROOT/bin/$ARC/qrsh -nostdin -noshell -inherit $host sleep 2"
>>    echo $cmd
>>    $cmd &
> it's not advisable to use & inside an SGE job.

True, but in this case it shouldn't be an issue thanks to the 'wait'
after the loop.

Regards,
Andreas


>
> If you are accessing all nodes with qrsh -inherit, be sure to set
> "job_is_first_task FALSE". Otherwise the local qrsh isn't allowed by SGE.
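
For reference, that setting lives in the parallel environment definition and
can be inspected or changed with qconf (the PE name "mpi_test" below is only
a placeholder for whatever PE the job requests):

    qconf -spl            # list the configured parallel environments
    qconf -sp mpi_test    # show the PE; look for the job_is_first_task line
    qconf -mp mpi_test    # edit the PE and set:  job_is_first_task  FALSE
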
>>    I=`expr $I + 1`
>> done
>>
>> wait
>>
>> exit 0
>>
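
For completeness, a minimal start_proc_args script of the kind described
above might look like the sketch below. This is an illustration only, not
the stock startmpi.sh, and it assumes the usual "hostname slots queue range"
layout of $PE_HOSTFILE:

    #!/bin/sh
    # Expand $PE_HOSTFILE (one line per host) into $TMPDIR/machines
    # with one hostname per granted slot, as the job script expects.
    : > $TMPDIR/machines
    while read host nslots rest; do
        i=1
        while [ $i -le $nslots ]; do
            echo $host >> $TMPDIR/machines
            i=`expr $i + 1`
        done
    done < $PE_HOSTFILE
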
>> I get the same "GMSH" error messages. Here's the output for a 16-way job:
>>
>> [heywood at bhmnode2 test]$ ls *82*
>> -rw-r--r--  1 heywood itstaff  963 Feb  1 16:20 testsge.sh.e8292
>> -rw-r--r--  1 heywood itstaff 1229 Feb  1 16:20 testsge.sh.o8292
>> -rw-r--r--  1 heywood itstaff    0 Feb  1 16:20 testsge.sh.pe8292
>> -rw-r--r--  1 heywood itstaff  210 Feb  1 16:20 testsge.sh.po8292
>>
>> [heywood at bhmnode2 test]$ more testsge.sh.e8292
>> error: commlib error: can't read general message size header (GMSH)
>> (closing "blade193.bluehelix.cshl.edu/execd/1")
>>
> Are you using any special communication lib? Myrinet, Infiniband,... ?
>> 
>> 
>> 
>> 
>> I've also installed MPICH2 (SMPD daemon-based), and found some issues with
>> the .smpd configuration file being corrupted when too many tasks access it
>> at once. Having solved that to run large jobs successfully outside of SGE,
>> I tried Reuti's tight integration, and found that the smpd daemons hang
>> when started with the "-d 0" option (hard-coded into the start_mpich2
>> program). But that's another story.
> The -d 0 option I got from the MPICH2 developers. Its purpose is to avoid
> the forking of the daemons (and so leaving the process tree); the forking
> is handled by the start_mpich2 program instead. So: what do you mean by
> "hang", exactly? After being started by the PE start_proc_args, the daemons
> should stay there, still bound to the shepherd, and wait for connections.
>
> -- Reuti
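
One way to check that (a suggestion on my part, not something prescribed
above): while the job is running, look at the process tree on an execution
node and confirm the smpd daemons still sit under the sge_shepherd rather
than having re-parented to init, e.g.:

    pgrep -fl sge_shepherd          # find the shepherd PID(s) for the job
    pstree -ap <shepherd_pid>       # smpd should appear as a child of it
    ps axf | grep -A2 sge_shepherd  # alternative: BSD-style process forest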
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




