[GE users] MPICH2 job deletion

Reuti reuti at staff.uni-marburg.de
Wed Apr 27 16:35:15 BST 2005


Alan,

I refer to the setting in the SSH-Howto:

http://gridengine.sunsource.net/howto/qrsh_ssh.html

After setting this up, you have to set "MPIEXEC_RSH=rsh" again. This way 
the MPICH2 program will call rsh -> rsh-wrapper -> qrsh -> ssh.
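
To make the chain a bit more concrete, here is a minimal sketch of how a wrapper named rsh, placed first in PATH, intercepts the call. Everything in it is a stand-in: the directory from mktemp plays the role of the job's $TMPDIR, node01 is a made-up host name, and the wrapper only prints a qrsh command line instead of running it. The real rsh-wrapper shipped with SGE does more than this, so treat it as an illustration only.

```shell
#!/bin/sh
# Stand-in for the job's private directory (SGE would provide $TMPDIR itself).
demo_dir=$(mktemp -d)

# Hypothetical wrapper: a real one would hand the call over to qrsh;
# this one only prints the command line it would run.
cat > "$demo_dir/rsh" <<'EOF'
#!/bin/sh
echo "qrsh -inherit $*"
EOF
chmod +x "$demo_dir/rsh"

# Put the wrapper first in PATH and ask mpiexec to start remote
# processes through "rsh", as described in the mail.
PATH="$demo_dir:$PATH"
MPIEXEC_RSH=rsh
export PATH MPIEXEC_RSH

# Roughly what mpiexec would do for one remote process
# (node01 is a made-up host name):
chain_output=$("$MPIEXEC_RSH" node01 hostname)
echo "$chain_output"
```

Because the lookup goes through PATH, the program never needs to know that it is not talking to the system rsh.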

It's just a cosmetic issue that you call rsh but end up using ssh. If 
you don't like it, you can of course change the link created in 
startmpich2.sh so that it is named ssh, and skip setting 
"MPIEXEC_RSH".
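
A rough sketch of that renaming, under some assumptions: the temporary directory stands in for the job's $TMPDIR, the file rsh-wrapper is a hypothetical placeholder for the real wrapper, and it is taken for granted (as the mail implies) that mpiexec looks for ssh when "MPIEXEC_RSH" is unset. The exact line to change in startmpich2.sh is not shown here.

```shell
#!/bin/sh
# Stand-ins so the sketch runs anywhere; in a real job SGE sets $TMPDIR.
job_tmp=$(mktemp -d)
wrapper="$job_tmp/rsh-wrapper"   # hypothetical placeholder for the real wrapper
printf '#!/bin/sh\necho "wrapper called as $0"\n' > "$wrapper"
chmod +x "$wrapper"

# Name the link "ssh" instead of "rsh" ...
ln -s "$wrapper" "$job_tmp/ssh"

# ... and leave MPIEXEC_RSH unset.
unset MPIEXEC_RSH
PATH="$job_tmp:$PATH"
export PATH

# A plain "ssh" now resolves to the link, not to the system ssh.
resolved=$(command -v ssh)
echo "$resolved"
```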

When you run the test job, please check that all processes of the 
program are children of the qrsh_starter (using the "ps" command from 
the MPICH2-Howto) and that none of them were started through an ssh 
login outside of SGE. SGE kills the whole process group of the job, so 
processes created with a conventional ssh will not be killed. If it's 
still not working, please post the output of the "ps..." from the head 
node and one slave node.
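
If it helps, the check can be sketched roughly like this. The ps options below are plain POSIX ones and are an assumption, not necessarily the command from the MPICH2-Howto, and the helper names show_processes and is_descendant_of are made up for this example.

```shell
#!/bin/sh
# List every process with its parent PID and process group.
show_processes() {
    ps -e -o pid,ppid,pgrp,args
}

# Succeed if process $1 has process $2 somewhere among its ancestors.
# Walks the parent chain with ps and stops at the top of the tree
# (parent PID 0) or when a process has already gone away.
is_descendant_of() {
    walk_pid=$(ps -o ppid= -p "$1" | tr -d ' ')
    while [ -n "$walk_pid" ] && [ "$walk_pid" -gt 0 ]; do
        [ "$walk_pid" -eq "$2" ] && return 0
        walk_pid=$(ps -o ppid= -p "$walk_pid" | tr -d ' ')
    done
    return 1
}

# Usage on a node (both PIDs are placeholders read off the listing):
#   show_processes
#   is_descendant_of <pid of MPI process> <pid of qrsh_starter> && echo "inside SGE"
```

An MPI process for which this fails against every qrsh_starter on the node was most likely started outside of SGE and would survive a qdel.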

Cheers - Reuti


Alan Carriou wrote:
> Hi Reuti,
> 
> I did not set the "MPIEXEC_RSH".
> 
>  > you set up SGE to use ssh in its config,
> What parameter are you referring to?
> 
> Thanks,
> Alan
> 
> Reuti wrote:
> 
>> Hi Alan,
>>
>> you set up SGE to use ssh in its config, and/or did you just avoid 
>> setting "MPIEXEC_RSH=rsh"?
>>
>> CU - Reuti
>>
>>
>> Alan Carriou wrote:
>>
>>> Hi
>>>
>>> On our grid, we have SGE 6.0u3 and MPICH2 1.0.1.
>>> Using the smpd daemonless startup, we have a problem: when we delete 
>>> a running MPI job, the MPI processes are not killed.
>>> The slots are freed, the job is said to be finished, the mpiexec and 
>>> ssh processes on the first node are killed, but the MPI processes 
>>> themselves are still alive. This happens both with qdel and qmon. The 
>>> qmaster/messages says just:
>>>
>>> 04/27/2005 15:49:07|qmaster|testgrid-3|W|job 51.1 failed on host 
>>> testgrid-4.jet.uk assumedly after job because: job 51.1 died through 
>>> signal KILL (9)
>>>
>>> If this may explain something, we use ssh instead of rsh to connect 
>>> to other hosts.
>>>
>>> Using the daemon-based startup, the job deletion works fine. And with 
>>> both startup methods, the normal end of an MPI job causes no problem.
>>>
>>> Does anyone have an idea?
>>>
>>> Thanks,
>>> Alan
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net