[GE users] MPICH2 job deletion

Reuti reuti at staff.uni-marburg.de
Thu Apr 28 12:15:18 BST 2005


Hello Alan,

Alan Carriou wrote:
> Hi Reuti,
> 
> Thanks for your answer. I've changed the settings, now the problem is 
> different. Now the head-node cannot connect to the slaves, though I can 
> manually connect via ssh to all nodes.
> 
> Any idea ?
> 
> Alan
> 
> [acarrio at testgrid-2 mpi-tests] $ qconf -sconf | grep [rs]sh
> rlogin_daemon                /usr/sbin/sshd -i
> rsh_daemon                   /usr/sbin/sshd -i
> rsh_command                  /usr/bin/ssh
> rlogin_command               /usr/bin/ssh
> [acarrio at testgrid-2 mpi-tests] $ qconf -sp mpich2_smpd_rsh
> pe_name           mpich2_smpd_rsh
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/local/sge-6.0/mpich2_smpd_rsh/startmpich2.sh \
>                   -catch_rsh $pe_hostfile
> stop_proc_args    /usr/local/sge-6.0/mpich2_smpd_rsh/stopmpich2.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
> [acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh
> #!/bin/sh
> #$ -S /bin/sh
> export MPIEXEC_RSH=rsh
> export PATH=/usr/local/mpich2_smpd/bin:$PATH
> 
> mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines \
>     /home/acarrio/mpi-tests/mpihello/mpihello
> 
> exit 0
> [acarrio at testgrid-2 mpi-tests] $ qsub -pe mpich2_smpd_rsh 2 
> mpich2-daemonless.sh
> Your job 65 ("mpich2-daemonless.sh") has been submitted.
> [acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh.po65
> -catch_rsh 
> /usr/local/sge-6.0/testgrid/spool/testgrid-1/active_jobs/65.1/pe_hostfile
> testgrid-1
> testgrid-3
> [acarrio at testgrid-2 mpi-tests] $ ssh testgrid-1 ps -e f -o 
> pid,ppid,pgrp,command --cols=80
>   PID  PPID  PGRP COMMAND
> (...)
>  3488     1  3488 /usr/local/sge-6.0/bin/lx24-x86/sge_execd
>  1974  3488  1974  \_ sge_shepherd-65 -bg
>  1997  1974  1997      \_ -sh 
> /usr/local/sge-6.0/testgrid/spool/testgrid-1/job_s
>  2017  1997  1997          \_ mpiexec -rsh -nopm -n 2 -machinefile 
> /tmp/65.1.all
>  2018  2017  1997              \_ mpiexec -rsh -nopm -n 2 -machinefile 
> /tmp/65.1
>  2019  2017  1997              \_ rsh testgrid-1 env PMI_RANK=0 
> PMI_SIZE=2 PMI_K
>  2020  2017  1997              \_ rsh testgrid-3 env PMI_RANK=1 
> PMI_SIZE=2 PMI_K
> (...)

Here the conventional rsh is used, and not the SGE qrsh command (which 
would now call ssh). Did you also change the link that startmpich2.sh 
creates in $TMPDIR so that it is named ssh? Then you would no longer 
need to set "export MPIEXEC_RSH=rsh".
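
If not, the change is only the name of the link that startmpich2.sh 
creates. As a rough sketch (assuming the link points to the rsh-wrapper 
shipped with SGE; adjust the path to your installation):

# in startmpich2.sh, where the wrapper link is put into $TMPDIR
# assumed original line:
#   ln -s $SGE_ROOT/mpi/rsh $TMPDIR/rsh
# renamed, so that a plain ssh call from mpiexec hits the wrapper:
ln -s $SGE_ROOT/mpi/rsh $TMPDIR/ssh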

Can you please put the following commands before the mpiexec call and 
post the output:

ls $TMPDIR
which rsh
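
For illustration, the adjusted job script could then look like this (a 
sketch based on your mpich2-daemonless.sh, with nothing else changed):

#!/bin/sh
#$ -S /bin/sh
export MPIEXEC_RSH=rsh
export PATH=/usr/local/mpich2_smpd/bin:$PATH

# debugging output: what did startmpich2.sh put into $TMPDIR,
# and which rsh is found first in the PATH?
ls $TMPDIR
which rsh

mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines \
    /home/acarrio/mpi-tests/mpihello/mpihello

exit 0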

BTW: Unless you need ssh between the nodes to guard against possible 
attackers inside your cluster, you can use the SGE-provided qrsh and 
disable rsh and ssh on the nodes completely. For each qrsh, an rshd of 
its own is started on the slave node, on a random port and only for 
the duration of the job. This way the conventional rsh is not used at 
all with SGE. - Reuti
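
PS: If you want to go that route, it is just a matter of undoing the 
ssh setup from the howto. A minimal sketch (the values shown are only 
an assumption of the usual defaults; please compare with sge_conf(5) 
for your version before changing anything):

# revert the remote-startup entries so SGE uses its own mechanism again
qconf -mconf
#   rsh_daemon       none
#   rsh_command      none
# with these defaults SGE starts its own rshd per qrsh call on the
# slave node, on a random port and only for the lifetime of the job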

> [acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh.e65
> connect to address 145.239.31.70: Connection refused
> connect to address 145.239.31.70: Connection refused
> trying normal rsh (/usr/bin/rsh)
> connect to address 145.239.31.72: Connection refused
> connect to address 145.239.31.72: Connection refused
> trying normal rsh (/usr/bin/rsh)
> testgrid-1.jet.uk: Connection refused
> testgrid-3.jet.uk: Connection refused
> 
> 
> 
> 
> 
> 
> Reuti wrote:
> 
>> Alan,
>>
>> I refer to the setting in the SSH-Howto:
>>
>> http://gridengine.sunsource.net/howto/qrsh_ssh.html
>>
>> After setting this up, you have to set "MPIEXEC_RSH=rsh" again. This 
>> way the MPICH2 program will call rsh -> rsh-wrapper -> qrsh -> ssh.
>>
>> It's just a cosmetic issue that you call rsh but use ssh in the end. 
>> If you don't like it, you can of course change the link created in 
>> startmpich2.sh to be called ssh and avoid setting "MPIEXEC_RSH".
>>
>> When you run the test job, please check whether all processes of the 
>> program are children of the qrsh_starter (with the "ps" command used 
>> in the MPICH2 howto), and that no other ssh logins outside of SGE are 
>> used. SGE will kill the whole process group of the job, but processes 
>> created with a conventional ssh will not be killed. Please post the 
>> output of the "ps ..." command on the head node and on one slave node 
>> if it's not working.
>>
>> Cheers - Reuti
>>
>>
>> Alan Carriou wrote:
>>
>>> Hi Reuti,
>>>
>>> I did not set the "MPIEXEC_RSH".
>>>
>>>  > Did you set up SGE to use ssh in its config,
>>> What parameter are you referring to ?
>>>
>>> Thanks,
>>> Alan
>>>
>>> Reuti wrote:
>>>
>>>> Hi Alan,
>>>>
>>>> Did you set up SGE to use ssh in its config, and/or did you just 
>>>> avoid setting "MPIEXEC_RSH=rsh"?
>>>>
>>>> CU - Reuti
>>>>
>>>>
>>>> Alan Carriou wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> On our grid, we have SGE 6.0u3 and MPICH2 1.0.1.
>>>>> Using the smpd daemonless startup, we have a problem: when we 
>>>>> delete a running MPI job, the MPI processes are not killed.
>>>>> The slots are freed, the job is reported as finished, and the 
>>>>> mpiexec and ssh processes on the first node are killed, but the 
>>>>> MPI processes themselves are still alive. This happens both with 
>>>>> qdel and qmon. The qmaster messages file says just:
>>>>>
>>>>> 04/27/2005 15:49:07|qmaster|testgrid-3|W|job 51.1 failed on host 
>>>>> testgrid-4.jet.uk assumedly after job because: job 51.1 died 
>>>>> through signal KILL (9)
>>>>>
>>>>> In case this explains something: we use ssh instead of rsh to 
>>>>> connect to the other hosts.
>>>>>
>>>>> Using the daemon-based startup, job deletion works fine. And with 
>>>>> both startups, the normal end of an MPI job causes no problem.
>>>>>
>>>>> Does anyone have an idea ?
>>>>>
>>>>> Thanks,
>>>>> Alan
>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



