[GE users] Mvapich processes not killed on qdel

Reuti reuti at staff.uni-marburg.de
Wed May 9 19:24:57 BST 2007


On 09.05.2007 at 18:58, Mike Hanby wrote:

> Hmm, I changed the mpirun command to mpirun_rsh -rsh and submitted the
> job. It started and failed with a bunch of "connection refused" errors.
> By default Rocks disables RSH.
>
> Does tight integration only work with rsh? If so, I'll see if I can get
> that enabled and try again.

Yes - no!

If you need a tight ssh integration, you could compile 6.1 on your
own, as a tight ssh integration is available there, but not in the
provided binaries.

OTOH: SGE will not use the default rsh daemons, so you need no
running rshd or any setting in xinetd.conf at all. SGE will start an
rshd for each qrsh call on a randomly chosen port, dedicated just
to this one call. What you might be observing is an active firewall
between the nodes blocking these connections. Often all nodes are on a
private network without any connection to the outside world at all, so
there would be no risk in disabling the firewall on the nodes (except
on the headnode of course, with its two network cards).
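
Since each qrsh call gets its own rshd, what matters is that mpirun_rsh
really ends up in SGE's rsh wrapper (which calls qrsh -inherit) and not
in a system rsh or ssh. A rough check that could go into the job script
is sketched below. It assumes that startmpi.sh -catch_rsh links the
wrapper as $TMPDIR/rsh and that $TMPDIR comes first in the job's PATH;
check_rsh_wrapper is a made-up helper name, not part of SGE.

```shell
# Sketch: report whether "rsh" resolves to the wrapper linked into $TMPDIR.
# check_rsh_wrapper is a hypothetical helper, not something SGE ships.
check_rsh_wrapper()
{
   # command -v prints the path the shell would execute for "rsh"
   found=$(command -v rsh 2>/dev/null) || found=""
   if [ -n "$TMPDIR" ] && [ "$found" = "$TMPDIR/rsh" ]; then
      echo "rsh wrapper in use: $found"
      return 0
   fi
   echo "rsh resolves to: ${found:-nothing}"
   return 1
}
```

Calling check_rsh_wrapper right before the mpirun_rsh line shows in the
job's output file which rsh would be picked up.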

Also worth noting for ROCKS: the command hostname will give the
FQDN, not only the hostname as in other distributions. So you might
have to add a .local to all the entries in the machinefile generated
by the startmpi.sh script of the PE (this could be added in the
PeHostfile2MachineFile() procedure).
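
A minimal sketch of such a change follows. It is not the function as
shipped in startmpi.sh, only a stand-in with the same name, and it
assumes that each line of the pe_hostfile starts with the hostname
followed by the number of slots. Any domain part is stripped first, so
a name that already ends in .local does not get the suffix twice.

```shell
# Sketch of a PeHostfile2MachineFile variant that emits <short name>.local.
# Assumes each pe_hostfile line starts with "<host> <slots> ...".
PeHostfile2MachineFile()
{
   while read -r host nslots rest; do
      # skip blank lines
      [ -n "$host" ] || continue
      # strip any domain part, then append .local
      short=${host%%.*}
      i=1
      # one machinefile entry per granted slot
      while [ "$i" -le "$nslots" ]; do
         echo "${short}.local"
         i=$((i + 1))
      done
   done < "$1"
}
```

Inside startmpi.sh it would be called as before, with the pe_hostfile
as the only argument and the output redirected to the machinefile.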

-- Reuti

> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wednesday, May 09, 2007 11:27
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Mvapich processes not killed on qdel
>
> Hi,
>
> can you please post the process tree (master and slave) of a running
> job on a node by using the ps command:
>
> ps -e f -o pid,ppid,pgrp,command
>
> Are you sure that the SGE rsh-wrapper is used, given that you
> mentioned mpirun_ssh?
>
> -- Reuti
>
>
> On 09.05.2007 at 17:43, Mike Hanby wrote:
>
>> Howdy,
>>
>> I have GE 6.0u8 on a Rocks 4.2.1 cluster with Infiniband and the
>> Topspin roll (which includes mvapich).
>>
>>
>>
>> When I qdel an mvapich job, the job is immediately removed from the
>> queue, however most of the processes on the nodes do not get
>> killed. It appears that the mpirun_ssh process does get killed,
>> however all of the actual job executables (sander.MPI) don't.
>>
>>
>>
>> I followed the directions for tight integration of Mvapich:
>>
>> http://gridengine.sunsource.net/project/gridengine/howto/mvapich/MVAPICH_Integration.html
>>
>>
>>
>> The job runs fine, but again it doesn't kill off processes when
>> qdel'd.
>>
>>
>>
>> Here's the pe:
>>
>> $ qconf -sp mvapich
>> pe_name           mvapich
>> slots             9999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /share/apps/gridengine/mvapich/startmpi.sh -catch_rsh \
>>                   $pe_hostfile
>> stop_proc_args    /share/apps/gridengine/mvapich/stopmpi.sh
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>>
>>
>>
>> The only modification made to the startmpi.sh script was to change
>> the location of the hostname and rsh scripts from $SGE_ROOT to
>> /share/apps/gridengine/mvapich
>>
>>
>>
>> Any suggestions on what I should look for?
>>
>>
>>
>> Thanks, Mike
>>
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>




