[GE users] CPU limit in mpi jobs

Rui Ramos rramos at iric.up.pt
Wed Aug 30 10:14:00 BST 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Hi, sorry for reviving this thread; only now have I had time to get back to this issue.

 It seems this problem is caused by a bad Tight Integration of LAM/MPI. As such, the CPU time reported is only that of the mpirun process.

 I've redone the tight integration of LAM/MPI from scratch, following Reuti's howto.

 I'm using LAM version 7.1.2 (lam-7.1.2), so there is no need to modify hboot.c.
 These were the steps I followed:
   - compiled LAM with prefix=/opt/lam
   - copied the integration scripts to SGE_ROOT
   - copied the lamd_wrapper and created the symlink
   - created the lam_tight_qrsh PE and added it to the pe_list of test.q
   
   - Submitted a test job  
      qsub -cwd -q test.q -pe lam_tight_qrsh 1 mpihello.sh
   
   - where mpihello.sh is:
     #!/bin/sh
     #$ -cwd
     #$ -N MPIHELLO 

     /opt/lam/bin/mpirun C ./mpihello

   - and checking with ps I get:

10135     1 10135 /opt/n1ge/bin/lx26-amd64/sge_execd
31876 10135 31876  \_ sge_shepherd-5177 -bg
31907 31876 31907  |   \_ /bin/sh /n1ge/grid003/job_scripts/5177
31908 31907 31907  |       \_ /opt/lam/bin/mpirun C ./mpihello
31898 10135 31898  \_ sge_shepherd-5177 -bg
31899 31898 31899      \_ sshd: rramos [priv]
31904 31899 31899          \_ sshd: rramos@notty
31905 31904 31905              \_ /opt/n1ge/utilbin/lx24-amd64/qrsh_starter /n1ge/grid003/active_jobs/5177.1/1.grid003
31906 31905 31906                  \_ lamd_binary -H 193.137.51.3 -P 48215 -n 0 -o 0 -sessionsuffix sge-5177-undefined
31909 31906 31906                      \_ ./mpihello
12077     1 12077 cupsd
31896     1 31896 /opt/n1ge/bin/lx24-amd64/qrsh -V -inherit -nostdin grid003.up.pt lamd_binary -H 193.137.51.3 -P 48215 -n 0 -o 0 -sessionsuffix sge-5177-undefined
31900 31896 31896  \_ /usr/bin/ssh -n -p 48219 grid003.up.pt exec '/opt/n1ge/utilbin/lx24-amd64/qrsh_starter' '/n1ge/grid003/active_jobs/5177.1/1.grid003'

    - I've also followed the howto on using ssh for qrsh, so I have the settings below in the global configuration. I don't know whether that could be the issue.

      qconf -sconf global

load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           gridup-admin at iric.up.pt
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=true \
                             flush_time=00:00:05 joblog=true sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100

qlogin_command               /opt/n1ge/utilbin/qlogin
qlogin_daemon                /usr/sbin/sshd -i
rlogin_daemon                /usr/sbin/sshd -i
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             100
auto_user_default_project    none
auto_user_delete_time        INFINITY
delegated_file_staging       true
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i
reprioritize                 true
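
 Besides the global configuration above, the parallel environment definition itself is worth checking, since SGE only controls and accounts slave tasks when the PE has control_slaves set to TRUE. The following is a minimal sketch, not part of the howto: check_tight_pe is a hypothetical helper name, and it only assumes the usual "name value" layout printed by qconf -sp.

```shell
#!/bin/sh
# Hypothetical helper: reads `qconf -sp <pe_name>` output on stdin and
# reports whether the PE lets SGE control slave tasks (control_slaves TRUE).
check_tight_pe() {
    awk '
        $1 == "control_slaves"    { cs = toupper($2) }
        $1 == "job_is_first_task" { jf = toupper($2) }
        END {
            if (cs == "TRUE") {
                print "control_slaves=TRUE job_is_first_task=" jf
                exit 0
            }
            print "control_slaves=" cs " (tight integration needs TRUE)"
            exit 1
        }'
}
```

 It could be used as: qconf -sp lam_tight_qrsh | check_tight_pe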

 Can anyone give me a clue on how to debug this situation?

                                                                Thanks in advance
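
 One way to test whether the slave tasks are being accounted is the qacct -j check suggested in the quoted reply below: with a working tight integration there should be one entry per qrsh call, each with its own cpu value. The sketch below adds those values up. sum_qacct_cpu is a hypothetical helper name, and it assumes each accounting entry carries a line of the form "cpu <seconds>".

```shell
#!/bin/sh
# Hypothetical helper: reads `qacct -j <jobid>` output on stdin, counts the
# accounting entries that carry a "cpu" line and sums their values.
sum_qacct_cpu() {
    awk '
        $1 == "cpu" { total += $2; entries++ }
        END { printf "entries=%d cpu_total=%.3f\n", entries, total }'
}
```

 It could be used as: qacct -j 5177 | sum_qacct_cpu
 A single entry for a parallel job would suggest that only the master script (the mpirun process) was accounted.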

On Fri, 9 Jun 2006 23:13:59 +0200
Reuti <reuti at staff.uni-marburg.de> wrote:

> Hi again,
> 
> the CPU limit is working in principle, but for now there is a  
> possible race condition:
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1960
> 
> The job will disappear, but some slaves keep on running.
> 
> 
> To the usage: the usage of a parallel job is working for me. Can you  
> try after a normal finished job:
> 
> qacct -j <jobid>
> 
> which should show also one entry for each qrsh call.
> 
> -- Reuti
> 
> 
> Am 08.06.2006 um 20:22 schrieb Rui Ramos:
> 
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> >
> >  Hi all,
> >
> >  Well, I've tried setting up the tight MPI integration with no luck.  
> > I've also followed the LAM/MPI integration howto and set it up with tight  
> > integration. It seems to work for some jobs; I still have to run some  
> > more tests. Anyway, the reported cpu time of the mpi jobs is always  
> > 00:00:00.
> >
> >  Does anybody have CPU limits working with mpi jobs ?
> >
> >                                                    Appreciate any  
> > help :)
> >
> > PS: Yes, it is a tight integration of LAM/MPI, as explained in  
> > Reuti's howto.
> >
> > On Fri, 2 Jun 2006 16:55:05 +0100
> > Rui Ramos <rramos at iric.up.pt> wrote:
> >
> >>
> >>  Well, I guess I don't have the tight integration. I'm reading your  
> >> howto and the symptoms match the ones it describes.
> >>
> >>     http://gridengine.sunsource.net/howto/mpich-integration.html
> >>
> >>                                                                   
> >> Regards, going to try it out
> >>
> >> On Fri, 2 Jun 2006 17:40:50 +0200
> >> Reuti <reuti at staff.uni-marburg.de> wrote:
> >>
> >>> Hi,
> >>>
> >>> Am 02.06.2006 um 17:39 schrieb Rui Ramos:
> >>>
> >>>>
> >>>>  Hi all,
> >>>>
> >>>>  I've set CPU limits on some of my queues, but there is something
> >>>> that worries me. When submitting an mpi job, is this CPU limit applied
> >>>> to each mpi instance or to the sum of all instances?
> >>>>  Another thing: when running qstat I get
> >>>>
> >>>> usage    1:                 cpu=00:00:00, mem=0.00050 GBs,
> >>>> io=0.00000, vmem=121.828M, maxvmem=121.828M
> >>>>
> >>>>  And the cpu time is always 00:00:00. Is the CPU limit really
> >>>> working with mpi jobs?
> >>>
> >>> is it a Tightly Integrated setup?- Reuti
> >>>
> >>>>                                                    thanks in  
> >>>> advance
> >>>>
> >> -- 
> >> ============================================
> >>  Rui Manuel dos Santos Ramos
> >>
> >>  Instituto de Recursos e Iniciativas Comuns
> >>  Praca Gomes Teixeira, 4099-002 Porto, Portugal
> >>
> >>  phone : +351 223 401 571
> >>  e-mail: rramos[at]iric.up.pt
> >>     web: http://ruiramos.homeip.net
> >> ============================================
> >>
> >>
> >
> >
> > - --
> > ============================================
> >  Rui Manuel dos Santos Ramos
> >
> >  Instituto de Recursos e Iniciativas Comuns
> >  Praca Gomes Teixeira, 4099-002 Porto, Portugal
> >
> >  phone : +351 223 401 571
> >  e-mail: rramos[at]iric.up.pt
> >     web: http://ruiramos.homeip.net
> > ============================================
> >
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.2.2 (GNU/Linux)
> >
> > iQEVAwUBRIhqz71uR0bdnTWSAQIA3Qf/Xhh3qXS+tDaGNY4Jb3p7a1dBbiYeBk11
> > qPDCrX31GxNndfE5H6TWrIZbXwk1eCQQud8eShOyFeEWJYx95J43uE46NL5L7rqZ
> > IXh2ZgqyaB+aG8AUU3Q/B/TItZz3TfiJmyAQHFVPn1+chQtnGKbloOnk+Cf11Cp+
> > u0bPe/hfeyRsTVP4UPGwCFO4B0Q9buanvPvwwvyPi2VNL6pINLc6ym54hQTubDqP
> > 3pxzKCzCvs3BkFk3NpzQIXpNPRkEnFaQSXiDZi/5K4mEBhbi9PvJNfS6zej7NlTW
> > dGSqcyMSgn3prjVF2RFpRrXWh2OsMndgt8sxkQ5KSQGIg4wCdpD+dQ==
> > =q47k
> > -----END PGP SIGNATURE-----
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> 


- -- 
============================================
 Rui Manuel dos Santos Ramos

 Instituto de Recursos e Iniciativas Comuns
 Praca Gomes Teixeira, 4099-002 Porto, Portugal

 phone : +351 223 401 571
 e-mail: rramos[at]iric.up.pt
    web: http://ruiramos.homeip.net
============================================

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iQEVAwUBRPVW2L1uR0bdnTWSAQL9rAf7B/OXlbpXi/UkA7/xGLEgeEcJ8SpxKyg0
5mGee0S1plXZvk5u92R+2EUFWASSOmE+8SsMoZImXNHb3cN8a6+YpuXHOyy028cD
EUKNZMgE8+/a3dXh1rPOvUTDA4qxkpV2H/3XKo8Qrgmeeb4KdEgfVq/jfGkDlOqv
Cz2M6A6qWYAYMeXU8Ml8ap3MOVjABDGIJ4qVkDM1ZGUOPDokY/mEqiTiW3KiKACZ
rE4kRYyispvnP8I7XSamAGfqboXtiQB+MnikQSZEGOB07e3xdizUGAQT552XwbTw
rFTXcTstA4/qBL+XCOs7LnC11dAwmiygZ28LYtsu8XpZy8e1gLCfiw==
=ZEa3
-----END PGP SIGNATURE-----
