[GE users] About submitting serial and parallel jobs

Reuti reuti at staff.uni-marburg.de
Tue Aug 8 11:07:15 BST 2006



Hi,

On 08.08.2006, at 11:51, mashaojie163 wrote:

> Dear Sir:
>             Our cluster is composed of 64 nodes, and there are 2
> dual-core Opteron CPUs on each node (4 CPUs per node). Many of the
> jobs are serial, and the others are parallel. We want to put 4
> serial jobs on one node (that is to say, to send serial jobs to the
> nodes that are not full but are already partly occupied), and we
> need to submit the parallel jobs to the empty nodes.
>             I want to use the complex variables cpu and np_load_avg,
> so I modified the variable attributes.

I think you can get the desired behavior with "fill up host":

http://blogs.sun.com/roller/page/sgrell/20050405
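
In short (a minimal sketch of what that post describes, assuming SGE
6.x - check sched_conf(5) and the post itself before relying on it):
edit the scheduler configuration with

    qconf -msconf

and set

    queue_sort_method          load
    load_formula               slots

With "queue_sort_method load" the host with the lowest value of the
load formula is chosen first, and with the consumable "slots" as the
formula a partly used host should sort before an empty one. Serial
jobs then get packed onto already occupied nodes, while parallel jobs
that request whole nodes still go to empty ones. (If your version
evaluates "slots" as used rather than remaining slots, the sign has
to be inverted, i.e. "-slots" - I haven't tested this on your setup.)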

-- Reuti


>            I execute the command:
> [masj@teracluster ~]$ qconf -sc
> #name               shortcut   type        relop requestable  consumable default  urgency
> #-----------------------------------------------------------------------------------------
> arch                a          RESTRING    ==    YES          NO         NONE     0
> calendar            c          RESTRING    ==    YES          NO         NONE     0
> cpu                 cpu        DOUBLE      ==    YES          NO         0        0
> h_core              h_core     MEMORY      <=    YES          NO         0        0
> h_cpu               h_cpu      TIME        <=    YES          NO         0:0:0    0
> h_data              h_data     MEMORY      <=    YES          NO         0        0
> h_fsize             h_fsize    MEMORY      <=    YES          NO         0        0
> h_rss               h_rss      MEMORY      <=    YES          NO         0        0
> h_rt                h_rt       TIME        <=    YES          NO         0:0:0    0
> h_stack             h_stack    MEMORY      <=    YES          NO         0        0
> h_vmem              h_vmem     MEMORY      <=    YES          NO         0        0
> hostname            h          HOST        ==    YES          NO         NONE     0
> load_avg            la         DOUBLE      <=    YES          NO         0        0
> load_long           ll         DOUBLE      >=    NO           NO         0        0
> load_medium         lm         DOUBLE      >=    NO           NO         0        0
> load_short          ls         DOUBLE      >=    YES          NO         0        0
> mem_free            mf         MEMORY      <=    YES          NO         0        0
> mem_total           mt         MEMORY      <=    YES          NO         0        0
> mem_used            mu         MEMORY      >=    YES          NO         0        0
> min_cpu_interval    mci        TIME        <=    NO           NO         0:0:0    0
> myrinet             myrinet    BOOL        ==    YES          NO         FALSE    0
> np_load_avg         nla        DOUBLE      >=    YES          NO         0        0
> np_load_long        nll        DOUBLE      >=    NO           NO         0        0
> np_load_medium      nlm        DOUBLE      >=    NO           NO         0        0
> np_load_short       nls        DOUBLE      >=    YES          NO         0        0
> num_proc            p          INT         ==    YES          NO         0        0
> qname               q          RESTRING    ==    YES          NO         NONE     0
> rerun               re         BOOL        ==    NO           NO         0        0
> s_core              s_core     MEMORY      <=    YES          NO         0        0
> s_cpu               s_cpu      TIME        <=    YES          NO         0:0:0    0
> s_data              s_data     MEMORY      <=    YES          NO         0        0
> s_fsize             s_fsize    MEMORY      <=    YES          NO         0        0
> s_rss               s_rss      MEMORY      <=    YES          NO         0        0
> s_rt                s_rt       TIME        <=    YES          NO         0:0:0    0
> s_stack             s_stack    MEMORY      <=    YES          NO         0        0
> s_vmem              s_vmem     MEMORY      <=    YES          NO         0        0
> seq_no              seq        INT         ==    NO           NO         0        0
> slots               s          INT         <=    YES          YES        1        1000
> swap_free           sf         MEMORY      <=    YES          NO         0        0
> swap_rate           sr         MEMORY      >=    YES          NO         0        0
> swap_rsvd           srsv       MEMORY      >=    YES          NO         0        0
> swap_total          st         MEMORY      <=    YES          NO         0        0
> swap_used           su         MEMORY      >=    YES          NO         0        0
> tmpdir              tmp        RESTRING    ==    NO           NO         NONE     0
> virtual_free        vf         MEMORY      <=    YES          NO         0        0
> virtual_total       vt         MEMORY      <=    YES          NO         0        0
> virtual_used        vu         MEMORY      >=    YES          NO         0        0
> # >#< starts a comment but comments are not saved across edits --------
>
> We can see that the two variables cpu and np_load_avg are both
> requestable, and that the relation of np_load_avg is ">=" while
> that of cpu is "==".
> In my opinion, for parallel jobs I just need to submit them as
> follows:
> qsub -l cpu=0 -pe mpich <cpunum> myscript     (cpunum a multiple of 4)
>
> Everything is OK for the parallel jobs: they always wait for nodes
> that are completely empty.
>
>
> For serial work, I think I need to submit the jobs as follows:
> qsub -soft -l np_load_avg=0.2 -pe mpich 1 myscript
>
> I need to use -soft because, if there are no partly occupied nodes,
> the serial job should still be able to go to an empty node. However,
> if there are nodes that are occupied but not full, the job should be
> sent to one of those. There are 4 CPUs on every node, so if 1 CPU is
> occupied, np_load_avg should be about 0.25. I therefore try to send
> the serial job to the nodes whose np_load_avg >= 0.2.
>
> But I find that SGE does not place the serial job the way I want.
> It still sends serial jobs to the empty nodes, although there are
> nodes that are occupied but not full.
>
> When I execute the following command:
> [masj@teracluster ~]$ qhost -l np_load_avg=0.2
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> compute-0-18            lx26-amd64      4  0.00    3.8G  109.9M  996.2M  144.0K
> compute-0-19            lx26-amd64      4  0.00    3.8G   97.9M  996.2M   19.9M
> compute-0-20            lx26-amd64      4  0.00    3.8G  103.9M  996.2M     0.0
> compute-0-21            lx26-amd64      4  0.00    3.8G  104.4M  996.2M     0.0
> compute-0-22            lx26-amd64      4  0.00    3.8G   97.8M  996.2M   18.5M
> compute-0-25            lx26-amd64      4  0.00    3.8G  108.1M  996.2M     0.0
> compute-0-26            lx26-amd64      4  0.01    3.8G  114.9M  996.2M     0.0
> compute-0-27            lx26-amd64      4     -    3.8G       -  996.2M       -
> compute-0-32            lx26-amd64      4  0.00    3.8G  110.2M  996.2M    6.4M
> compute-0-34            lx26-amd64      4  0.00    3.8G  103.1M  996.2M     0.0
> compute-0-35            lx26-amd64      4  0.00    3.8G  110.2M  996.2M     0.0
> compute-0-38            lx26-amd64      4  0.00    3.8G  111.4M  996.2M  144.0K
> compute-0-44            lx26-amd64      4  0.00    3.8G  107.3M  996.2M     0.0
> compute-0-45            lx26-amd64      4  0.00    3.8G  121.8M  996.2M     0.0
> compute-0-50            lx26-amd64      4  0.01    3.8G  103.4M  996.2M  144.0K
> compute-0-54            lx26-amd64      4     -    3.8G       -  996.2M       -
> compute-0-59            lx26-amd64      4  0.00    3.8G  105.9M  996.2M     0.0
> compute-0-64            lx26-amd64      2     -    3.9G       -  996.2M       -
> compute-0-7             lx26-amd64      4  0.00    2.9G  112.9M  996.2M  144.0K
>
> It seems to display the nodes whose np_load_avg <= 0.2.
>
> That is contrary to the ">=" relation of np_load_avg.
>
> I modified the complex variable load_avg as follows:
>
> [masj@teracluster ~]$ qconf -sc|grep load_avg
> load_avg            la         DOUBLE      <=    YES          NO         0        0
> np_load_avg         nla        DOUBLE      >=    YES          NO         0        0
>
> The relation of the variable load_avg is now "<=".
>
> However, when I execute the command:
> [masj@teracluster ~]$ qhost -l load_avg=0.8
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> compute-0-0             lx26-amd64      4  1.00    3.8G  122.3M  996.2M   14.3M
> compute-0-1             lx26-amd64      4  4.00    3.8G  599.7M  996.2M     0.0
> compute-0-10            lx26-amd64      4  1.00    3.8G  897.4M  996.2M     0.0
> compute-0-11            lx26-amd64      4  4.00    3.8G  906.9M  996.2M     0.0
> compute-0-12            lx26-amd64      4  3.94    3.8G  432.9M  996.2M     0.0
> compute-0-13            lx26-amd64      4  4.02    3.8G  926.6M  996.2M   15.6M
> compute-0-14            lx26-amd64      4  4.00    3.8G  220.4M  996.2M     0.0
> compute-0-15            lx26-amd64      4  4.00    3.8G  616.2M  996.2M     0.0
> compute-0-16            lx26-amd64      4  4.00    3.8G  211.8M  996.2M     0.0
> compute-0-17            lx26-amd64      4  4.00    3.8G  357.4M  996.2M   56.0K
> compute-0-2             lx26-amd64      4  4.00    3.8G    1.2G  996.2M   15.6M
> compute-0-23            lx26-amd64      4  4.00    3.8G  204.4M  996.2M  144.0K
> compute-0-24            lx26-amd64      4  4.01    3.8G 1018.3M  996.2M  144.0K
> compute-0-27            lx26-amd64      4     -    3.8G       -  996.2M       -
> compute-0-28            lx26-amd64      4  4.00    3.8G  360.6M  996.2M     0.0
> compute-0-29            lx26-amd64      4  3.94    3.8G  441.6M  996.2M     0.0
> compute-0-3             lx26-amd64      4  4.00    3.8G  611.4M  996.2M     0.0
> compute-0-30            lx26-amd64      4  4.03    3.8G  961.0M  996.2M     0.0
> compute-0-31            lx26-amd64      4  1.00    3.8G    2.6G  996.2M  144.0K
> compute-0-33            lx26-amd64      4  3.97    3.8G  404.2M  996.2M  144.0K
> compute-0-36            lx26-amd64      4  4.00    3.8G  567.9M  996.2M  144.0K
> compute-0-37            lx26-amd64      4  1.00    3.8G  136.1M  996.2M     0.0
> compute-0-39            lx26-amd64      4  4.01    3.8G  587.6M  996.2M   16.7M
> compute-0-4             lx26-amd64      4  4.01    3.8G  616.9M  996.2M     0.0
> compute-0-40            lx26-amd64      4  4.01    3.8G  461.0M  996.2M     0.0
> compute-0-41            lx26-amd64      4  4.02    3.8G  914.8M  996.2M     0.0
> compute-0-42            lx26-amd64      4  4.01    3.8G  905.4M  996.2M     0.0
> compute-0-43            lx26-amd64      4  4.00    3.8G  655.7M  996.2M     0.0
> compute-0-46            lx26-amd64      4  1.02    3.8G    2.5G  996.2M     0.0
> compute-0-47            lx26-amd64      4  4.00    3.8G  621.3M  996.2M     0.0
> compute-0-48            lx26-amd64      4  4.00    3.8G  347.7M  996.2M  144.0K
> compute-0-49            lx26-amd64      4  4.00    3.8G  346.6M  996.2M   15.4M
> compute-0-5             lx26-amd64      4  4.01    3.8G  361.4M  996.2M     0.0
> compute-0-51            lx26-amd64      4  4.00    3.8G  587.6M  996.2M  112.0K
> compute-0-52            lx26-amd64      4  4.00    3.8G  375.1M  996.2M     0.0
> compute-0-53            lx26-amd64      4  1.02    3.8G  102.0M  996.2M  264.0K
> compute-0-54            lx26-amd64      4     -    3.8G       -  996.2M       -
> compute-0-55            lx26-amd64      4  4.01    3.8G  205.9M  996.2M     0.0
> compute-0-56            lx26-amd64      4  4.00    3.8G    1.0G  996.2M     0.0
> compute-0-57            lx26-amd64      4  4.00    3.8G  857.5M  996.2M     0.0
> compute-0-58            lx26-amd64      4  4.00    3.8G  342.0M  996.2M  144.0K
> compute-0-6             lx26-amd64      4  4.00    3.8G  585.1M  996.2M     0.0
> compute-0-60            lx26-amd64      4  1.01    3.8G    2.6G  996.2M     0.0
> compute-0-61            lx26-amd64      4  4.00    3.8G 1010.8M  996.2M     0.0
> compute-0-62            lx26-amd64      4  4.00    3.8G  629.1M  996.2M     0.0
> compute-0-63            lx26-amd64      4  4.02    3.8G  967.8M  996.2M  144.0K
> compute-0-64            lx26-amd64      2     -    3.9G       -  996.2M       -
> compute-0-8             lx26-amd64      4  4.00    3.8G  622.0M  996.2M  512.0K
> compute-0-9             lx26-amd64      4  3.94    3.8G  468.3M  996.2M     0.0
>
> It seems to display the nodes whose load_avg >= 0.8, which is still
> contrary to the "<=" relation of load_avg. I am confused.
>
>
> I use the command:
> qsub -soft -l load_avg=0.8 -pe mpich 1 myscript
>
> But it still goes to the empty nodes, although there are several
> nodes whose load is greater than 0 and less than 4.
>
> What should I do? Is this a bug in SGE, or do I misunderstand it?
> Best Regards
> **************************************************
> Shaojie Ma
> Institute of Nano Science
> Nanjing University of Aeronautics and Astronautics
> mashaojie at nuaa.edu.cn
> Nanjing 210016, China
> **************************************************
