[GE users] Yet another MPICH tight-integration problem

David S. dgs at gs.washington.edu
Tue Sep 7 05:07:07 BST 2004

I've installed SGE V60u1 and MPICH 1.2.6 on x86 Linux 2.4.  I've got 
a cluster of 39 dual-processor boxes.  I'm using the "ch_p4" MPICH 
device over Ethernet, and a slightly modified version of the script 
'mpi/mpi_cp.sh' from the SGE distribution to test my MPICH parallel 
environment:

	# our name
	#$ -N MPI_Job
	# pe request
	#$ -pe mpich 2-78
	# Specify the location of the output
	#$ -o /home/dgs/work/adhoc
	#$ -e /home/dgs/work/adhoc
	#setenv P4_RSHCOMMAND /usr/bin/rsh
	echo "Got $NSLOTS slots."
	/mnt/local/mpich/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines /mnt/local/mpich/examples/cpi

Using tight integration, I can't get the test to run on more than
39 slots, or half of what I have available.  For example, setting
the PE request to "#$ -pe mpich 2-40", the output file will report

	Got 40 slots.
	/mnt/local/sge/bin/lx24-x86/qrsh -V -inherit -nostdin eee013 /mnt/local/mpich/examples/cpi eee012.grid.gs.washington.edu 48402 \-p4amslave \-p4yourname eee013 \-p4rmrank 1
	/mnt/local/sge/bin/lx24-x86/qrsh -V -inherit -nostdin eee015 /mnt/local/mpich/examples/cpi eee012.grid.gs.washington.edu 48402 \-p4amslave \-p4yourname eee015 \-p4rmrank 2

	/mnt/local/sge/bin/lx24-x86/qrsh -V -inherit -nostdin eee013 /mnt/local/mpich/examples/cpi eee012.grid.gs.washington.edu 48402 \-p4amslave \-p4yourname eee013 \-p4rmrank 39
	p0_3843:  p4_error: Child process exited while making connection to remote process on eee013: 0

It seems that when the system tries to start more than one slave job on 
a node, like "eee013" above, it fails and aborts the job.  From the
standard error output:

	error: executing task of job 27 failed:
	/mnt/local/mpich/bin/mpirun: line 1:  3843 Broken pipe             /mnt/local/mpich/examples/cpi -p4pg /home/dgs/PI2928 -p4wd /home/dgs

If I ask for half or fewer of the available slots, so that I don't
get more than one slave process on any node, it works fine.  Likewise,
loose integration works when specifying any number of nodes.  The 
problem seems to be in tight integration.
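
A quick way to confirm which hosts are handed more than one slot would be
to print per-host counts from the machine file just before the mpirun line
in the job script.  This is only a sketch: it assumes $TMPDIR/machines holds
one hostname per line, repeated once per granted slot, and the function
names slots_per_host and multi_slot_hosts as well as the MACHINEFILE
override are made up for illustration.

```shell
#!/bin/sh
# Print how many slots each host received, one "count host" pair per line.
# Assumption: the machine file lists one hostname per line, once per slot.
slots_per_host() {
    sort "$1" | uniq -c | awk '{ print $1, $2 }'
}

# Print only the hosts that appear more than once in the machine file.
multi_slot_hosts() {
    slots_per_host "$1" | awk '$1 > 1 { print $2 }'
}

# MACHINEFILE is a hypothetical override; the default is the path
# passed to mpirun in the job script.
machinefile="${MACHINEFILE:-$TMPDIR/machines}"
if [ -r "$machinefile" ]; then
    echo "Slots per host:"
    slots_per_host "$machinefile"
    echo "Hosts with more than one slot:"
    multi_slot_hosts "$machinefile"
fi
```

If the hosts listed under "more than one slot" match the hosts named in the
p4_error messages, that would support the guess that the second qrsh start
on a node is what fails.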

Note that I get the same behavior whether the "Job is first task"
parameter is set to "true" or "false".

I'd be grateful for any advice.

David S.
