[GE users] Why does my job's accounting information always indicate "failed: 12 before pestop"

Eric Zhang maillistbox at 126.com
Tue Mar 13 01:38:17 GMT 2007



Hi, GE users:

I am using SGE 6.0u9, and my PE configuration is:

==================================================
pe_name mpich
slots 99
user_lists NONE
xuser_lists NONE
start_proc_args /home/sge6/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args /bin/sge6/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task TRUE
urgency_slots min
==================================================

My job submission script is:

==================================================
#!/bin/sh
#
#$ -S /bin/sh
# ---------------------------
# our name
#$ -N EricPi
#$ -j y
#
# output path
#$ -o /home/eric/output
#$ -e /home/eric/output

# pe request
#$ -pe mpich 2
#
#$ -v P4_RSHCOMMAND=rsh
#$ -v MPICH_PROCESS_GROUP=no
# ---------------------------

#
# needs in
# $NSLOTS
# the number of tasks to be used
# $TMPDIR/machines
# a valid machine file to be passed to mpirun

# export NSLOTS=4

# enables $TMPDIR/rsh to catch rsh calls if available
export PATH=$TMPDIR:$PATH

/usr/local/mpich-ifort/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines \
    /home/eric/testcodes/pi3f90
=========================================================================

I have three questions here:

1. The job runs fine, but the "failed" field in my job's accounting
information always shows "12: before pestop". Why?

2. I have read the article "Tight MPICH Integration in Grid Engine", and
in my job script I set "-v MPICH_PROCESS_GROUP=no" to achieve tight
integration. Is this correct?

3. Is the "-catch_rsh" option in my PE configuration necessary? The PE
template shipped with SGE, "mpi.template", does not set this option. As I
understand it, "startmpi.sh" places a link in $TMPDIR that points to
SGE's rsh wrapper, so that the application uses the wrapper to dispatch
its processes. That would mean the option cannot be omitted. Is this
correct?
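
To make question 3 concrete, here is a minimal sketch of the mechanism as I understand it. It does not use SGE's real files: the temporary directory stands in for the job's $TMPDIR, the stub script stands in for the link I assume "startmpi.sh -catch_rsh" creates, and "node01" is a made-up host name.

```sh
#!/bin/sh
# Stand-in for the job's $TMPDIR (hypothetical; SGE creates the real one per job).
demo_tmpdir=$(mktemp -d)

# Stand-in for the rsh link that startmpi.sh -catch_rsh is assumed to create.
cat > "$demo_tmpdir/rsh" <<'EOF'
#!/bin/sh
echo "wrapper called with: $*"
EOF
chmod +x "$demo_tmpdir/rsh"

# Prepend the directory so that a bare "rsh" resolves to the wrapper first.
PATH=$demo_tmpdir:$PATH
export PATH

# Check which rsh is found, then call it the way mpirun would.
resolved=$(command -v rsh)
echo "rsh resolves to: $resolved"
output=$(rsh node01 hostname)
echo "$output"

# Remove the stand-in directory again.
rm -rf "$demo_tmpdir"
```

If this picture is right, then without "-catch_rsh" there would be no rsh entry in $TMPDIR, and the bare "rsh" would fall through to the system rsh.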




