[GE users] mpich2/mpd TI - slave processes not properly accounted for

reuti reuti at staff.uni-marburg.de
Wed Aug 19 19:17:13 BST 2009


On 19.08.2009, at 18:28, cwchan wrote:

> Thus spoke reuti:
>
>> When you create an interactive session, do you see an additional group
>> id:
>>
>> $ qrsh id
>> uid=1001(reuti) gid=25000(orgqui) groups=1000(operator),20040,25000
>> (orgqui)
>>
>> (here the 20040)
>
> Yes, there is an extra GID.  If I do two "qrsh id" in quick
> succession, the second GID is incremented by 1 from the first.
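
Good - this additional group id is taken from the gid_range in the
cluster configuration; SGE tags every process of the job with it so
that the usage of all children can be accounted. Just as a pointer,
you can look at the configured range with the command below (the
range shown here is only an example, your site will have its own):

$ qconf -sconf | grep gid_range
gid_range                    20000-20100
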
>
>> MPICH2 uses rsh by default, which is caught by SGE's rsh wrapper,
>> and then SGE uses ssh in the end? The name is at this point only a
>> name: someone could compile MPICH2 to call "fubar" as its rsh
>> client, adjust startmpich2.sh to create a link "fubar" in $TMPDIR,
>> and SGE could still use ssh in the end.
>
> Yes, the rsh wrapper script is configured to call qrsh with
> the just_wrap option off.
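
For the archives, such a wrapper boils down to something like the
sketch below. This is not your exact script, only a stripped-down
version along the lines of the stock $SGE_ROOT/mpi/rsh wrapper; the
qrsh -inherit/-nostdin options are the standard ones, the ssh
fallback path is just illustrative:

    #!/bin/sh
    # minimal sketch: first argument is the remote host, the rest is
    # the command to start there
    just_wrap=0
    rhost=$1
    shift
    if [ $just_wrap -eq 0 ]; then
        # tight integration: start the remote process via SGE, so it
        # runs under the job's additional group id and is accounted
        exec qrsh -inherit -nostdin "$rhost" "$@"
    else
        # plain pass-through to the native remote shell
        exec /usr/bin/ssh "$rhost" "$@"
    fi
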
>
>>> Interactive jobs run via qrsh do have their resource usage properly
>>> accounted for, and the mpich2 jobs are running as children of the
>>> qrsh process, e.g.
>>>
>>> |-sge_shepherd---sshd---sshd---qrsh_starter---tcsh---python2.4---2*[python2.4---mpilscg]
>>
>> Is the path of the sshd that is actually used the correct one? You
>> can check with:
>>
>> $ ps -e f
>>
>> (f w/o -)
>>
>> -- Reuti
>
> Yes, it is the SGE sshd in /usr/share/sge/6.1/bin/lx26-amd64.
>
> The compute nodes do not allow direct ssh logins by non-root users;
> /usr/sbin/sshd reads the /etc/ssh/sshd_config file which has
> "AllowUsers root" set.  The cluster config has the rsh/rlogin
> command set to /usr/bin/ssh and the rshd/rlogind command set to
>
> /usr/share/sge/6.1/bin/lx26-amd64/sshd -i -f /usr/share/sge/etc/sshd_config
>
> which allows access by all users.  This is meant to force accounting
> of all usage on the cluster, including interactive logins.
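
(For reference, in the cluster configuration - qconf -mconf - this
corresponds roughly to the following entries; the parameter names are
the usual ones from sge_conf, the values are taken from your mail:

    rsh_command              /usr/bin/ssh
    rsh_daemon               /usr/share/sge/6.1/bin/lx26-amd64/sshd -i -f /usr/share/sge/etc/sshd_config
    rlogin_command           /usr/bin/ssh
    rlogin_daemon            /usr/share/sge/6.1/bin/lx26-amd64/sshd -i -f /usr/share/sge/etc/sshd_config
)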

All looks fine. Did the job shut down all processes and daemons in a
nice way, and were the accounting records written? How many entries
do you have in qacct for such a job? It should be one for the
jobscript (with near zero consumption), and one for each started
daemon per node.
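
You can check this with e.g.:

$ qacct -j <jobid>

(<jobid> being the id of one finished MPICH2 job); each printed
record also shows the host and the consumed cpu time and memory of
that part of the job.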

-- Reuti
