[GE users] MPICH2 job deletion

Alan Carriou Alan.Carriou at jet.uk
Thu Apr 28 11:56:02 BST 2005

Hi Reuti,

Thanks for your answer. I've changed the settings, and now the problem
is different: the head node cannot connect to the slaves, though I can
manually connect via ssh to all nodes.

Any idea?


[acarrio at testgrid-2 mpi-tests] $ qconf -sconf | grep [rs]sh
rlogin_daemon                /usr/sbin/sshd -i
rsh_daemon                   /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
[acarrio at testgrid-2 mpi-tests] $ qconf -sp mpich2_smpd_rsh
pe_name           mpich2_smpd_rsh
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/sge-6.0/mpich2_smpd_rsh/startmpich2.sh \
                   -catch_rsh $pe_hostfile
stop_proc_args    /usr/local/sge-6.0/mpich2_smpd_rsh/stopmpich2.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
[acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh
#$ -S /bin/sh
export MPIEXEC_RSH=rsh
export PATH=/usr/local/mpich2_smpd/bin:$PATH

mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines 

exit 0
[acarrio at testgrid-2 mpi-tests] $ qsub -pe mpich2_smpd_rsh 2 
Your job 65 ("mpich2-daemonless.sh") has been submitted.
[acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh.po65
[acarrio at testgrid-2 mpi-tests] $ ssh testgrid-1 ps -e f -o pid,ppid,pgrp,command --cols=80
  3488     1  3488 /usr/local/sge-6.0/bin/lx24-x86/sge_execd
  1974  3488  1974  \_ sge_shepherd-65 -bg
  1997  1974  1997      \_ -sh 
  2017  1997  1997          \_ mpiexec -rsh -nopm -n 2 -machinefile 
  2018  2017  1997              \_ mpiexec -rsh -nopm -n 2 -machinefile 
  2019  2017  1997              \_ rsh testgrid-1 env PMI_RANK=0 
  2020  2017  1997              \_ rsh testgrid-3 env PMI_RANK=1 
[acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh.e65
connect to address Connection refused
connect to address Connection refused
trying normal rsh (/usr/bin/rsh)
connect to address Connection refused
connect to address Connection refused
trying normal rsh (/usr/bin/rsh)
testgrid-1.jet.uk: Connection refused
testgrid-3.jet.uk: Connection refused
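
The "trying normal rsh (/usr/bin/rsh)" lines above suggest that a system
rsh binary, rather than the wrapper link set up by -catch_rsh, is what
mpiexec ends up running. One way to check this from inside the job script
is sketched below. It is only a sketch: first_in_path is a hypothetical
helper, not part of SGE or MPICH2, and the link location $TMPDIR/rsh is an
assumption about what startmpich2.sh creates.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or MPICH2): print the first
# executable named $1 found in the colon-separated search path $2
# (defaults to $PATH). Returns 1 if nothing is found.
first_in_path() {
    cmd=$1
    search=${2:-$PATH}
    old_ifs=$IFS
    IFS=:
    for dir in $search; do
        IFS=$old_ifs
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$cmd" ] && [ -x "$dir/$cmd" ]; then
            printf '%s\n' "$dir/$cmd"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Possible use in the job script, just before the mpiexec line
# (assumption: the wrapper link is $TMPDIR/rsh):
#   echo "rsh resolves to: $(first_in_path rsh)" >&2
#   echo "expected link:   $TMPDIR/rsh" >&2
```

If the two paths printed to the job's error file differ, something earlier
in PATH is shadowing the wrapper, and the calls would bypass qrsh.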

Reuti wrote:
> Alan,
> I refer to the setting in the SSH-Howto:
> http://gridengine.sunsource.net/howto/qrsh_ssh.html
> After setting this up, you have to set "MPIEXEC_RSH=rsh" again. This way 
> the MPICH2 program will call rsh -> rsh-wrapper -> qrsh -> ssh.
> It's just a cosmetic issue that you are calling rsh and using ssh in the
> end. If you don't like it, you can of course change the link created in
> startmpich2.sh to be called ssh and avoid the setting of
> When you run the test job, please check whether all processes of the 
> program are children of the qrsh_starter (with the used "ps" command in 
> the MPICH2-Howto), and not using any other ssh logins outside of SGE. 
> SGE will kill the whole process group of the job,  and the processes 
> created with a conventional ssh will not be killed. Please post the 
> output of the "ps..." on the head- and one slave-node, if it's not working.
> Cheers - Reuti
> Alan Carriou wrote:
>> Hi Reuti,
>> I did not set the "MPIEXEC_RSH".
>>  > you set up SGE to use ssh in its config,
>> What parameter are you referring to ?
>> Thanks,
>> Alan
>> Reuti wrote:
>>> Hi Alan,
>>> you set up SGE to use ssh in its config, and/or did you just avoid
>>> setting "MPIEXEC_RSH=rsh"?
>>> CU - Reuti
>>> Alan Carriou wrote:
>>>> Hi
>>>> On our grid, we have SGE 6.0u3 and MPICH2 1.0.1.
>>>> Using the smpd daemonless startup, we have a problem: when we
>>>> delete a running MPI job, the MPI processes are not killed.
>>>> The slots are freed, the job is said to be finished, the mpiexec and 
>>>> ssh processes on the first node are killed, but the MPI processes 
>>>> themselves are still alive. This happens both with qdel and qmon. 
>>>> The qmaster/messages says just:
>>>> 04/27/2005 15:49:07|qmaster|testgrid-3|W|job 51.1 failed on host 
>>>> testgrid-4.jet.uk assumedly after job because: job 51.1 died through 
>>>> signal KILL (9)
>>>> If this may explain something, we use ssh instead of rsh to connect 
>>>> to other hosts.
>>>> Using the daemon-based startup, job deletion works fine. And with
>>>> both startup methods, the normal end of an MPI job causes no problem.
>>>> Does anyone have an idea?
>>>> Thanks,
>>>> Alan
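
Reuti's check quoted above (every MPI process should sit below
qrsh_starter in the process tree) can be automated roughly as follows.
This is a sketch under assumptions: descends_from is a hypothetical helper
name, and it expects input in the "pid ppid pgrp command" column layout
produced by the ps call used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read "pid ppid pgrp command..." lines on stdin and
# succeed (exit 0) if process $1 has an ancestor whose command text
# contains the string $2; fail (exit 1) otherwise.
descends_from() {
    awk -v pid="$1" -v want="$2" '
        {
            parent[$1] = $2
            line = $0
            # Drop the three numeric columns, keep only the command text.
            sub(/^[ \t]*[0-9]+[ \t]+[0-9]+[ \t]+[0-9]+[ \t]*/, "", line)
            cmd[$1] = line
        }
        END {
            # Walk up the parent chain until it runs out; the hop limit
            # guards against a malformed, cyclic listing.
            p = parent[pid]
            for (hops = 0; p != "" && hops < 1000; hops++) {
                if (index(cmd[p], want) > 0) exit 0
                p = parent[p]
            }
            exit 1
        }'
}

# Possible use on a slave node, with $pid set to one MPI process:
#   ps -e f -o pid,ppid,pgrp,command | descends_from "$pid" qrsh_starter \
#       || echo "process $pid was started outside of SGE control" >&2
```

A process that fails this check was started by a login outside of SGE, so
by Reuti's explanation it would survive a qdel.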
