[GE users] CPU limit in mpi jobs

Reuti reuti at staff.uni-marburg.de
Wed Aug 30 14:43:55 BST 2006


Hi,

On 30.08.2006 at 11:14, Rui Ramos wrote:

> Hi, sorry for reviving this thread; only now have I had time to get back
> to this issue.
>
>  It seems this problem is caused by a bad Tight Integration of LAM/MPI. As
> such, the CPU time recorded is only that of the mpirun process.
>
>  I've redone the tight integration of LAM/MPI from scratch,
> following Reuti's howto.
>
>  I'm using LAM version lam-7.1.2, so there is no need to
> modify hboot.c.
>  These were the steps I followed:
>    - compiled lam, prefix=/opt/lam
>    - copied the integration scripts to SGE_ROOT
>    - copied the lamd_wrapper and created the symlink
>    - Created the lam_tight_qrsh PE and added it to the pe_list of test.q
>
>    - Submitted a test job
>       qsub -cwd -q test.q -pe lam_tight_qrsh 1 mpihello.sh
>
>    - where mpihello.sh is:
>      #!/bin/sh
>      #$ -cwd
>      #$ -N MPIHELLO
>
>      /opt/lam/bin/mpirun C ./mpihello
>
>    - Checking with ps, I get:
>
> 10135     1 10135 /opt/n1ge/bin/lx26-amd64/sge_execd
> 31876 10135 31876  \_ sge_shepherd-5177 -bg
> 31907 31876 31907  |   \_ /bin/sh /n1ge/grid003/job_scripts/5177
> 31908 31907 31907  |       \_ /opt/lam/bin/mpirun C ./mpihello
> 31898 10135 31898  \_ sge_shepherd-5177 -bg
> 31899 31898 31899      \_ sshd: rramos [priv]
> 31904 31899 31899          \_ sshd: rramos@notty
> 31905 31904 31905              \_ /opt/n1ge/utilbin/lx24-amd64/qrsh_starter /n1ge/grid003/active_jobs/5177.1/1.grid003
> 31906 31905 31906                  \_ lamd_binary -H 193.137.51.3 -P 48215 -n 0 -o 0 -sessionsuffix sge-5177-undefined
> 31909 31906 31906                      \_ ./mpihello
> 12077     1 12077 cupsd
> 31896     1 31896 /opt/n1ge/bin/lx24-amd64/qrsh -V -inherit -nostdin grid003.up.pt lamd_binary -H 193.137.51.3 -P 48215 -n 0 -o 0 -sessionsuffix sge-5177-undefined
> 31900 31896 31896  \_ /usr/bin/ssh -n -p 48219 grid003.up.pt exec '/opt/n1ge/utilbin/lx24-amd64/qrsh_starter' '/n1ge/grid003/active_jobs/5177.1/1.grid003'
>
>     - I've also followed the howto for qrsh over ssh, so I have these
> settings in the global configuration. I don't know if that could be the
> issue.
>
>       qconf -sconf global
>
> load_report_time             00:00:40
> max_unheard                  00:05:00
> reschedule_unknown           00:00:00
> loglevel                     log_warning
> administrator_mail           gridup-admin at iric.up.pt
> set_token_cmd                none
> pag_cmd                      none
> token_extend_time            none
> shepherd_cmd                 none
> qmaster_params               none
> execd_params                 none
> reporting_params             accounting=true reporting=true \
>                              flush_time=00:00:05 joblog=true sharelog=00:00:00
> finished_jobs                100
> gid_range                    20000-20100
>
> qlogin_command               /opt/n1ge/utilbin/qlogin
> qlogin_daemon                /usr/sbin/sshd -i
> rlogin_daemon                /usr/sbin/sshd -i
> max_aj_instances             2000
> max_aj_tasks                 75000
> max_u_jobs                   0
> max_jobs                     0
> auto_user_oticket            0
> auto_user_fshare             100
> auto_user_default_project    none
> auto_user_delete_time        INFINITY
> delegated_file_staging       true
> rsh_command                  /usr/bin/ssh
> rlogin_command               /usr/bin/ssh
> rsh_daemon                   /usr/sbin/sshd -i
> reprioritize                 true
>
>  Can anyone give me a clue on how to debug this situation?

AFAIK you have to recompile SGE to support Tight Integration using  
SSH with:

-tight-ssh        -> compile SSH daemon with tight SGE integration

SGE will assign a special additional group ID through its own rshd and
monitor the job's usage by this additional group ID. With the default sshd,
this additional group ID won't be used, so the processes started through it
are not counted towards the job. So you need the special sshd version from SGE.
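
One way to check whether a given process carries the additional group ID
on Linux is to look at the Groups: line of /proc/<pid>/status. A minimal
sketch, assuming the gid_range of 20000-20100 from the configuration quoted
above; the function names are made up for illustration and are not part of
SGE:

```shell
#!/bin/sh
# Sketch: report whether a process carries a supplementary GID inside
# SGE's gid_range (assumed to be 20000-20100, as in the quoted config).

# has_gid_in_range "<space-separated gids>" <low> <high>
# Prints "yes" if any GID falls inside [low, high], otherwise "no".
has_gid_in_range() {
    for gid in $1; do
        if [ "$gid" -ge "$2" ] && [ "$gid" -le "$3" ]; then
            echo yes
            return 0
        fi
    done
    echo no
}

# groups_of_pid <pid>
# Prints the supplementary group IDs of a process, read from Linux's /proc.
groups_of_pid() {
    sed -n 's/^Groups:[[:space:]]*//p' "/proc/$1/status"
}

# Usage: sh check_gid.sh <pid>
if [ -n "$1" ]; then
    has_gid_in_range "$(groups_of_pid "$1")" 20000 20100
fi
```

Run it on the execution host against the PID of ./mpihello while the job is
still running. I would expect "yes" for a process started through a working
tight integration, and "no" would suggest that it was started without the
additional group ID.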

-- Reuti

