[GE users] MPICH2 job deletion

Alan Carriou Alan.Carriou at jet.uk
Fri Apr 29 11:55:44 BST 2005


Hi

Finally, we'll be using rsh instead of ssh. So now rlogin_daemon is 
/usr/sbin/in.rlogind, and the rlogin command, rsh daemon and rsh command 
are unset.

With that, the daemon-based startup works fine with qrsh_starter, but the 
daemonless startup still uses a non-SGE rsh. After adding an "echo $PATH" 
to the script, I found that it is a conflict with the Kerberos settings: 
the /etc/profile.d/krb5.[c]sh scripts add the path to the Kerberos rsh 
_before_ $TMPDIR/.

-------------
[acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh
#!/bin/sh
#$ -S /bin/sh
export MPIEXEC_RSH=rsh
export PATH=/usr/local/mpich2_smpd/bin:$PATH

echo "ls $TMPDIR"
ls $TMPDIR
echo PATH=$PATH
which rsh
mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines /home/acarrio/mpi-tests/mpitest
exit 0

[acarrio at testgrid-2 gridengine-testscripts] $ head -5 mpich2-daemonless.sh.o85
ls /tmp/85.1.all.q
machines
rsh
PATH=/usr/local/mpich2_smpd/bin:/usr/local/mpich2_smpd/bin:/usr/local/sge-6.0/bin/lx24-x86:/usr/kerberos/bin:/tmp/85.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/usr/X11R6/bin
/usr/kerberos/bin/rsh
-------------
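
For reference, the Fedora profile script seems to do roughly this (quoted 
from memory, so take it as an approximation of the real file, not an exact 
copy):
-------------
# /etc/profile.d/krb5.sh (approximate contents, from memory)
# Prepends the Kerberos client tools, so the Kerberos rsh ends up
# ahead of anything put on the PATH earlier (including $TMPDIR).
if ! echo ${PATH} | /bin/grep -q /usr/kerberos/bin ; then
        PATH=/usr/kerberos/bin:${PATH}
fi
export PATH
-------------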

In order to keep Kerberos working for normal use of the computers, 
/usr/kerberos/bin has to stay before /usr/bin. But since $TMPDIR seems to 
be added to the PATH before /etc/profile is read by bash, there is a 
problem. Do you have any idea, other than explicitly removing the Kerberos 
directory from the PATH (somewhere in startmpich2.sh, maybe)?
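
One workaround I am considering (just a sketch, not tested yet) is to push 
$TMPDIR back to the front of the PATH in the job script itself, right 
before mpiexec, so the SGE-created rsh wrapper wins again no matter what 
the profile scripts did:
-------------
#!/bin/sh
#$ -S /bin/sh
export MPIEXEC_RSH=rsh
export PATH=/usr/local/mpich2_smpd/bin:$PATH

# Untested idea: put the SGE-created wrapper directory back in front,
# so that $TMPDIR/rsh is found before /usr/kerberos/bin/rsh.
export PATH=$TMPDIR:$PATH
which rsh     # should now report $TMPDIR/rsh
mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines /home/acarrio/mpi-tests/mpitest
exit 0
-------------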

By the way, we use Fedora Core 3; I do not know whether other distros have 
the same scripts. They belong (at least) to the following packages:
-------------
[acarrio at testgrid-2 ~] $ rpm -q -f /etc/profile.d/krb5.sh
krb5-workstation-1.3.4-7
krb5-devel-1.3.4-7
-------------


Regards,
Alan

Reuti wrote:
> Hello Alan,
> 
> Alan Carriou wrote:
> 
>> Hi Reuti,
>>
>> Thanks for your answer. I've changed the settings, and now the problem 
>> is different: the head node cannot connect to the slaves, though I can 
>> manually connect via ssh to all nodes.
>>
>> Any idea ?
>>
>> Alan
>>
>> [acarrio at testgrid-2 mpi-tests] $ qconf -sconf | grep [rs]sh
>> rlogin_daemon                /usr/sbin/sshd -i
>> rsh_daemon                   /usr/sbin/sshd -i
>> rsh_command                  /usr/bin/ssh
>> rlogin_command               /usr/bin/ssh
>> [acarrio at testgrid-2 mpi-tests] $ qconf -sp mpich2_smpd_rsh
>> pe_name           mpich2_smpd_rsh
>> slots             999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /usr/local/sge-6.0/mpich2_smpd_rsh/startmpich2.sh \
>>                   -catch_rsh $pe_hostfile
>> stop_proc_args    /usr/local/sge-6.0/mpich2_smpd_rsh/stopmpich2.sh
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>> [acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh
>> #!/bin/sh
>> #$ -S /bin/sh
>> export MPIEXEC_RSH=rsh
>> export PATH=/usr/local/mpich2_smpd/bin:$PATH
>>
>> mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines 
>> /home/acarrio/mpi-tests/mpihello/mpihello
>>
>> exit 0
>> [acarrio at testgrid-2 mpi-tests] $ qsub -pe mpich2_smpd_rsh 2 
>> mpich2-daemonless.sh
>> Your job 65 ("mpich2-daemonless.sh") has been submitted.
>> [acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh.po65
>> -catch_rsh 
>> /usr/local/sge-6.0/testgrid/spool/testgrid-1/active_jobs/65.1/pe_hostfile
>> testgrid-1
>> testgrid-3
>> [acarrio at testgrid-2 mpi-tests] $ ssh testgrid-1 ps -e f -o 
>> pid,ppid,pgrp,command --cols=80
>>   PID  PPID  PGRP COMMAND
>> (...)
>>  3488     1  3488 /usr/local/sge-6.0/bin/lx24-x86/sge_execd
>>  1974  3488  1974  \_ sge_shepherd-65 -bg
>>  1997  1974  1997      \_ -sh 
>> /usr/local/sge-6.0/testgrid/spool/testgrid-1/job_s
>>  2017  1997  1997          \_ mpiexec -rsh -nopm -n 2 -machinefile 
>> /tmp/65.1.all
>>  2018  2017  1997              \_ mpiexec -rsh -nopm -n 2 -machinefile 
>> /tmp/65.1
>>  2019  2017  1997              \_ rsh testgrid-1 env PMI_RANK=0 
>> PMI_SIZE=2 PMI_K
>>  2020  2017  1997              \_ rsh testgrid-3 env PMI_RANK=1 
>> PMI_SIZE=2 PMI_K
>> (...)
> 
> 
> here the conventional rsh is used, not the SGE qrsh command (which 
> will now use ssh). Did you also change the link created in 
> startmpich2.sh so that it is created as ssh in $TMPDIR? Then you 
> wouldn't need to set "export MPIEXEC_RSH=rsh" any longer.
> 
> Can you please put the commands:
> 
> ls $TMPDIR
> which rsh
> 
> before the mpiexec and post the result.
> 
> BTW: Unless you need ssh between the nodes because of possible hackers 
> in your cluster, you can use the SGE-provided qrsh and disable rsh and 
> ssh on the nodes completely. The rshd started by qrsh is created 
> separately for each qrsh, with a random port number, on each slave node, 
> and only for the duration of the job. The conventional rsh is not used 
> at all this way with SGE. - Reuti
> 
>> [acarrio at testgrid-2 mpi-tests] $ cat mpich2-daemonless.sh.e65
>> connect to address 145.239.31.70: Connection refused
>> connect to address 145.239.31.70: Connection refused
>> trying normal rsh (/usr/bin/rsh)
>> connect to address 145.239.31.72: Connection refused
>> connect to address 145.239.31.72: Connection refused
>> trying normal rsh (/usr/bin/rsh)
>> testgrid-1.jet.uk: Connection refused
>> testgrid-3.jet.uk: Connection refused
>>
>>
>>
>>
>>
>>
>> Reuti wrote:
>>
>>> Alan,
>>>
>>> I refer to the setting in the SSH-Howto:
>>>
>>> http://gridengine.sunsource.net/howto/qrsh_ssh.html
>>>
>>> After setting this up, you have to set "MPIEXEC_RSH=rsh" again. This 
>>> way the MPICH2 program will call rsh -> rsh-wrapper -> qrsh -> ssh.
>>>
>>> It's just a cosmetic issue that you are calling rsh and using ssh in 
>>> the end. If you don't like it, you can of course change the link 
>>> created in startmpich2.sh to be called ssh and avoid setting 
>>> "MPIEXEC_RSH".
>>>
>>> When you run the test job, please check whether all processes of the 
>>> program are children of the qrsh_starter (with the "ps" command used 
>>> in the MPICH2-Howto), and that no ssh logins outside of SGE are being 
>>> used. SGE will kill the whole process group of the job, and processes 
>>> created with a conventional ssh will not be killed. Please post the 
>>> output of the "ps..." on the head node and on one slave node if it's 
>>> not working.
>>>
>>> Cheers - Reuti
>>>
>>>
>>> Alan Carriou wrote:
>>>
>>>> Hi Reuti,
>>>>
>>>> I did not set the "MPIEXEC_RSH".
>>>>
>>>>  > did you set up SGE to use ssh in its config,
>>>> What parameter are you referring to ?
>>>>
>>>> Thanks,
>>>> Alan
>>>>
>>>> Reuti wrote:
>>>>
>>>>> Hi Alan,
>>>>>
>>>>> did you set up SGE to use ssh in its config, and/or did you just 
>>>>> avoid setting "MPIEXEC_RSH=rsh"?
>>>>>
>>>>> CU - Reuti
>>>>>
>>>>>
>>>>> Alan Carriou wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> On our grid, we have SGE 6.0u3 and MPICH2 1.0.1.
>>>>>> Using the smpd daemonless startup, we have a problem: when we 
>>>>>> delete a running MPI job, the MPI processes are not killed.
>>>>>> The slots are freed, the job is reported as finished, and the 
>>>>>> mpiexec and ssh processes on the first node are killed, but the MPI 
>>>>>> processes themselves are still alive. This happens both with qdel 
>>>>>> and with qmon. The qmaster/messages file just says:
>>>>>>
>>>>>> 04/27/2005 15:49:07|qmaster|testgrid-3|W|job 51.1 failed on host 
>>>>>> testgrid-4.jet.uk assumedly after job because: job 51.1 died 
>>>>>> through signal KILL (9)
>>>>>>
>>>>>> In case it explains something: we use ssh instead of rsh to 
>>>>>> connect to other hosts.
>>>>>>
>>>>>> With the daemon-based startup, job deletion works fine. And with 
>>>>>> both startup methods, the normal end of an MPI job causes no problem.
>>>>>>
>>>>>> Does anyone have an idea ?
>>>>>>
>>>>>> Thanks,
>>>>>> Alan
>>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net