[GE users] mpich2/mpd TI - slave processes not properly accounted for

reuti reuti at staff.uni-marburg.de
Wed Aug 19 08:46:08 BST 2009


On 19.08.2009 at 02:44, cwchan wrote:

> Thus spoke reuti:
>
>> Hi,
>>
>> On 19.08.2009 at 00:32, cwchan wrote:
>>
>>> Hello,
>>>
>>> We have a small cluster with 20 nodes and 256 x86_64 CPU cores, using
>>> SGE 6.1u2 as the DRMS with ssh tight integration instead of rsh.
>>
>> Did you recompile SGE with -tight-ssh? If not, that would explain
>> exactly what you observe: the supplied rsh adds an additional group ID
>> that is used to track resource consumption, and the specially compiled
>> ssh does the same. So you don't have a private network for your
>> cluster and have to use ssh?
>>
>> -- Reuti
>
> Yes, the sshd binary was compiled with tight integration and is
> in
>
> /usr/share/sge/6.1/bin/lx26-amd64/sshd

When you create an interactive session, do you see an additional
group ID?

$ qrsh id
uid=1001(reuti) gid=25000(orgqui) groups=1000(operator),20040,25000(orgqui)

(here the 20040)
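
If you want to check this from a script instead of by eye, something like the following might do. It is only a sketch: the function name extra_gids is made up, and it assumes the additional group ID is the one entry in the groups= list that has no name in parentheses (as in my output above) and that no group name contains a space.

```shell
# Hypothetical helper: print the supplementary group IDs that appear
# without a group name in a line of `id` output.
extra_gids() {
    # $1: one line of `id` output
    printf '%s\n' "$1" |
        sed -n 's/.*groups=\([^ ]*\).*/\1/p' |   # keep only the groups= list
        tr ',' '\n' |                            # one group per line
        grep -E '^[0-9]+$'                       # entries that are a bare number
}
```

It could be called as extra_gids "$(qrsh id)". With my output above it prints 20040; if it prints nothing, no nameless group ID was found in the session.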

>
> When invoked, that sshd outputs

MPICH2 uses rsh by default. That call is caught by SGE's rsh-wrapper, and
SGE then uses ssh in the end, right? At this point the name is only a name.
Someone could compile MPICH2 to call "fubar" as its rsh client and
adjust startmpich2.sh to create a link "fubar" in $TMPDIR, and SGE
could still use ssh in the end.
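
To illustrate the "only a name" point, here is a rough sketch of the kind of link creation I mean. It is not the actual content of startmpich2.sh: the function name link_launcher is invented, and the wrapper path in the usage line is an assumption that depends on where your rsh-wrapper lives.

```shell
# Hypothetical helper: make "<dir>/<name>" a symbolic link to the wrapper
# script, so that whatever command name MPICH2 was built to call ends up
# in the wrapper.
link_launcher() {
    name=$1      # command name MPICH2 calls, e.g. rsh or fubar
    wrapper=$2   # path of the rsh-wrapper script (assumed, site specific)
    dir=$3       # directory searched first in the job's PATH, e.g. $TMPDIR
    ln -sf "$wrapper" "$dir/$name"
}
```

In a start script this might be used as link_launcher fubar "$SGE_ROOT/mpich2_mpd/rsh" "$TMPDIR", assuming $TMPDIR comes before the real rsh in the job's PATH.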


> set_admin_username() with zero length username: No such file or directory
>
> The cluster is on a private network with a login node which has
> a separate interface accessible from the public network.  We use
> sshd within the cluster for its features, such as X11 forwarding,
> and also because of HIPAA concerns.

Ok, I see.


> Interactive jobs run via qrsh do have their resource usage properly
> accounted for, and the mpich2 jobs are running as children of the
> qrsh process, e.g.
>
> |-sge_shepherd---sshd---sshd---qrsh_starter---tcsh---python2.4---2*[python2.4---mpilscg]

Does the following show the path of the sshd that is used, and is it
the correct one?

$ ps -e f

(the f without a leading -)

-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213003
