[GE users] Yet another MPICH tight-integration problem

Andreas Haas Andreas.Haas at Sun.COM
Tue Sep 7 12:37:06 BST 2004


What allocation_rule are you using with the 'mpich' PE? Does
it foresee more than two tasks running when it breaks with the

   error: executing task of job 27 failed:

error message?
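
Both points can be checked from the command line. The following is only
a minimal sketch in POSIX sh, not a tested recipe for your site. It
assumes the PE is named 'mpich' as in your script, that 'qconf -sp'
prints one "name value" pair per line, and that $TMPDIR/machines holds
one host name per slot. The helper names pe_allocation_rule and
slots_per_host are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of SGE or MPICH.

# Print the allocation_rule value from a PE definition read on stdin
# (assumes the "name value" layout printed by `qconf -sp <pe_name>`).
pe_allocation_rule() {
    awk '$1 == "allocation_rule" { print $2 }'
}

# Print "count host" for each host in a machine file read on stdin
# (assumes one host name per line, repeated once per slot).
slots_per_host() {
    awk 'NF { n[$1]++ } END { for (h in n) print n[h], h }' | sort -k2
}

# On a live cluster (assumes the PE is named "mpich"):
#   qconf -sp mpich | pe_allocation_rule
# Inside a running job:
#   slots_per_host < "$TMPDIR/machines"
```

Any host listed with a count above 1 is one on which more than one
task of the job would be started.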

Regards,
Andreas

On Mon, 6 Sep 2004, David S. wrote:

>
> I've installed SGE V60u1 and MPICH 1.2.6 on x86 Linux 2.4.  I've got
> a cluster of 39 dual processor boxes.  I'm using the "ch_p4" MPICH
> device over ethernet, and a slightly modified version of the script
> 'mpi/mpi_cp.sh' from the SGE distribution to test my MPICH parallel
> environment:
>
> 	# our name
> 	#$ -N MPI_Job
> 	#
> 	# pe request
> 	#$ -pe mpich 2-78
> 	#
>
> 	# Specify the location of the output
> 	#$ -o /home/dgs/work/adhoc
> 	#$ -e /home/dgs/work/adhoc
>
> 	setenv P4_RSHCOMMAND $TMPDIR/rsh
> 	#setenv P4_RSHCOMMAND /usr/bin/rsh
>
> 	echo "Got $NSLOTS slots."
>
> 	/mnt/local/mpich/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines /mnt/local/mpich/examples/cpi
>
> Using tight integration, I can't get the test to run on more than
> 39 slots, or half of what I have available.  For example, setting
> the PE request to "#$ -pe mpich 2-40", the output file will report
>
> 	Got 40 slots.
> 	/mnt/local/sge/bin/lx24-x86/qrsh -V -inherit -nostdin eee013 /mnt/local/mpich/examples/cpi eee012.grid.gs.washington.edu 48402 \-p4amslave \-p4yourname eee013 \-p4rmrank 1
> 	/mnt/local/sge/bin/lx24-x86/qrsh -V -inherit -nostdin eee015 /mnt/local/mpich/examples/cpi eee012.grid.gs.washington.edu 48402 \-p4amslave \-p4yourname eee015 \-p4rmrank 2
> 	.
> 	.
> 	.
>
> 	/mnt/local/sge/bin/lx24-x86/qrsh -V -inherit -nostdin eee013 /mnt/local/mpich/examples/cpi eee012.grid.gs.washington.edu 48402 \-p4amslave \-p4yourname eee013 \-p4rmrank 39
> p0_3843:  p4_error: Child process exited while making connection to remote process on eee013: 0
>
> It seems that when the system tries to start more than one slave job on
> a node, like "eee013" above, it fails and aborts the job.  From the
> standard error output:
>
> 	error: executing task of job 27 failed:
> 	/mnt/local/mpich/bin/mpirun: line 1:  3843 Broken pipe             /mnt/local/mpich/examples/cpi -p4pg /home/dgs/PI2928 -p4wd /home/dgs
>
> If I ask for half or less of the available slots, so that I don't
> get more than one slave process on any node, it works fine.  Likewise,
> loose integration works when specifying any number of nodes.  The problem
> seems to be in tight integration.
>
> Note that I get the same behavior whether the "Job is first task"
> parameter is set to "true" or "false".
>
> I'd be grateful for any advice.
>
> David S.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
