[GE users] MPICH2 job deletion

Alan Carriou Alan.Carriou at jet.uk
Thu Apr 28 11:56:02 BST 2005


Hi Reuti,

Thanks for your answer. I've changed the settings, and now the problem is 
different: the head node cannot connect to the slaves, even though I can 
connect to all of the nodes manually via ssh.
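
In case it helps with the diagnosis, this is roughly what I intend to add to 
the job script, just before the mpiexec call, to see which rsh it actually 
picks up (only a sketch; I'm assuming startmpich2.sh -catch_rsh places its 
rsh wrapper in the job's $TMPDIR, as described in the MPICH2 howto):

# hypothetical debugging lines, not part of the original script
echo "PATH=$PATH"
ls -l $TMPDIR     # should contain the rsh wrapper created by -catch_rsh
which rsh         # should resolve to the wrapper, not /usr/bin/rsh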

Any idea?

Alan

[acarrio at testgrid-2 mpi-tests] $ qconf -sconf | grep [rs]sh
rlogin_daemon                /usr/sbin/sshd -i
rsh_daemon                   /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
[acarrio at testgrid-2 mpi-tests] $ qconf -sp mpich2_smpd_rsh
pe_name           mpich2_smpd_rsh
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/sge-6.0/mpich2_smpd_rsh/startmpich2.sh \
                   -catch_rsh $pe_hostfile
stop_proc_args    /usr/local/sge-6.0/mpich2_smpd_rsh/stopmpich2.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
[acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh
#!/bin/sh
#$ -S /bin/sh
export MPIEXEC_RSH=rsh
export PATH=/usr/local/mpich2_smpd/bin:$PATH

mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines /home/acarrio/mpi-tests/mpihello/mpihello

exit 0
[acarrio at testgrid-2 mpi-tests] $ qsub -pe mpich2_smpd_rsh 2 mpich2-daemonless.sh
Your job 65 ("mpich2-daemonless.sh") has been submitted.
[acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh.po65
-catch_rsh /usr/local/sge-6.0/testgrid/spool/testgrid-1/active_jobs/65.1/pe_hostfile
testgrid-1
testgrid-3
[acarrio at testgrid-2 mpi-tests] $ ssh testgrid-1 ps -e f -o pid,ppid,pgrp,command --cols=80
   PID  PPID  PGRP COMMAND
(...)
  3488     1  3488 /usr/local/sge-6.0/bin/lx24-x86/sge_execd
  1974  3488  1974  \_ sge_shepherd-65 -bg
  1997  1974  1997      \_ -sh /usr/local/sge-6.0/testgrid/spool/testgrid-1/job_s
  2017  1997  1997          \_ mpiexec -rsh -nopm -n 2 -machinefile /tmp/65.1.all
  2018  2017  1997              \_ mpiexec -rsh -nopm -n 2 -machinefile /tmp/65.1
  2019  2017  1997              \_ rsh testgrid-1 env PMI_RANK=0 PMI_SIZE=2 PMI_K
  2020  2017  1997              \_ rsh testgrid-3 env PMI_RANK=1 PMI_SIZE=2 PMI_K
(...)
[acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh.e65
connect to address 145.239.31.70: Connection refused
connect to address 145.239.31.70: Connection refused
trying normal rsh (/usr/bin/rsh)
connect to address 145.239.31.72: Connection refused
connect to address 145.239.31.72: Connection refused
trying normal rsh (/usr/bin/rsh)
testgrid-1.jet.uk: Connection refused
testgrid-3.jet.uk: Connection refused
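
As far as I understand the howto (and Reuti's explanation below), the rsh 
processes spawned by mpiexec are supposed to be the wrapper placed in 
$TMPDIR, which hands the connection over to SGE. A minimal sketch of that 
idea (not the actual rsh-wrapper shipped with SGE, which parses the options 
more carefully) would be:

#!/bin/sh
# sketch only: forward "rsh <host> <command>" into SGE's tight integration
host=$1
shift
exec qrsh -inherit "$host" "$@"

The "trying normal rsh (/usr/bin/rsh)" lines above make me suspect that the 
plain system rsh is being reached instead of such a wrapper.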

Reuti wrote:
> Alan,
> 
> I refer to the setting in the SSH-Howto:
> 
> http://gridengine.sunsource.net/howto/qrsh_ssh.html
> 
> After setting this up, you have to set "MPIEXEC_RSH=rsh" again. This way 
> the MPICH2 program will call rsh -> rsh-wrapper -> qrsh -> ssh.
> 
> It's just a cosmetic issue that you call rsh but end up using ssh. If you 
> don't like it, you can of course change the link created in startmpich2.sh 
> to be called ssh and avoid setting "MPIEXEC_RSH".
> 
> When you run the test job, please check whether all processes of the 
> program are children of the qrsh_starter (using the "ps" command from the 
> MPICH2 howto) and are not started via any ssh logins outside of SGE. SGE 
> will kill the whole process group of the job, and processes created with a 
> conventional ssh will not be killed. Please post the output of the "ps..." 
> on the head node and one slave node if it's not working.
> 
> Cheers - Reuti
> 
> 
> Alan Carriou wrote:
> 
>> Hi Reuti,
>>
>> I did not set the "MPIEXEC_RSH".
>>
>>  > Did you set up SGE to use ssh in its config,
>> What parameter are you referring to?
>>
>> Thanks,
>> Alan
>>
>> Reuti wrote:
>>
>>> Hi Alan,
>>>
>>> Did you set up SGE to use ssh in its config, or did you just avoid 
>>> setting "MPIEXEC_RSH=rsh"?
>>>
>>> CU - Reuti
>>>
>>>
>>> Alan Carriou wrote:
>>>
>>>> Hi
>>>>
>>>> On our grid, we have SGE 6.0u3 and MPICH2 1.0.1.
>>>> Using the smpd daemonless startup, we have a problem: when we delete a 
>>>> running MPI job, the MPI processes are not killed. The slots are freed, 
>>>> the job is reported as finished, and the mpiexec and ssh processes on 
>>>> the first node are killed, but the MPI processes themselves are still 
>>>> alive. This happens both with qdel and qmon. The qmaster/messages log 
>>>> just says:
>>>>
>>>> 04/27/2005 15:49:07|qmaster|testgrid-3|W|job 51.1 failed on host 
>>>> testgrid-4.jet.uk assumedly after job because: job 51.1 died through 
>>>> signal KILL (9)
>>>>
>>>> In case it explains something: we use ssh instead of rsh to connect 
>>>> to the other hosts.
>>>>
>>>> With the daemon-based startup, job deletion works fine. And with 
>>>> either startup method, an MPI job that ends normally causes no problem.
>>>>
>>>> Does anyone have an idea?
>>>>
>>>> Thanks,
>>>> Alan
>>>>
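
PS: Regarding Reuti's alternative of naming the link ssh so that 
MPIEXEC_RSH isn't needed, I take it the change in startmpich2.sh would be 
along these lines (sketch only; the real location of the wrapper script in 
our installation may differ):

# create the catch link under the name "ssh" instead of "rsh"
ln -s /path/to/rsh-wrapper $TMPDIR/ssh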

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net