[GE users] exclusive queues - mutually subordinating

James Coomer jamesc at streamline-computing.com
Thu Nov 3 18:59:13 GMT 2005



Yes, the parallel jobs are distributed across multiple hosts. I wasn't using
the monitor option before, but I am now and have pasted the output below;
this should make things clearer.
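
For reference, the monitoring was switched on roughly like this (just a
sketch; the cell name "default" is assumed):

  # qconf -msconf
    ...
    params                            monitor=1
    ...
  # tail -f $SGE_ROOT/default/common/schedule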

(1) One parallel job is running, a second parallel job is queued, and several serial jobs are queued:

sccomp@test:~/EXAMPLE/serial> qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
master.q@test.grid.cluster     P     1/8       0.00     lx24-amd64
    525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp00.grid.cluster P     1/1       0.03     lx24-amd64
    525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp01.grid.cluster P     1/1       0.03     lx24-amd64
    525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp02.grid.cluster P     1/1       0.07     lx24-amd64
    525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp03.grid.cluster P     1/1       0.03     lx24-amd64
    525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
serial.q@comp00.grid.cluster   BI    0/2       0.03     lx24-amd64    S
----------------------------------------------------------------------------
serial.q@comp01.grid.cluster   BI    0/2       0.03     lx24-amd64    S
----------------------------------------------------------------------------
serial.q@comp02.grid.cluster   BI    0/2       0.07     lx24-amd64    S
----------------------------------------------------------------------------
serial.q@comp03.grid.cluster   BI    0/2       0.03     lx24-amd64    S

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    526 1000.51000 PMB-MPI1.s sccomp       qw    11/03/2005 18:44:28     5
    527 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
    528 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
    529 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
    530 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
    531 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
    532 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
    533 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
    534 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
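
(For context: the two queue types subordinate each other. The relevant
subordinate_list entries look roughly like the sketch below; suspend
thresholds and all other queue attributes are left out.)

  # qconf -sq parallel.q | grep subordinate
  subordinate_list      serial.q
  # qconf -sq serial.q | grep subordinate
  subordinate_list      parallel.q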


(2) I qdel the running parallel job (525) and then run qstat -f again:

sccomp@test:~/EXAMPLE/serial> qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
master.q@test.grid.cluster     P     1/8       0.00     lx24-amd64
    526 1000.51000 PMB-MPI1.s sccomp       t     11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp00.grid.cluster P     1/1       0.28     lx24-amd64    S
    526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp01.grid.cluster P     1/1       0.28     lx24-amd64    S
    526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp02.grid.cluster P     1/1       0.31     lx24-amd64    S
    526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp03.grid.cluster P     1/1       0.28     lx24-amd64    S
    526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp00.grid.cluster   BI    2/2       0.28     lx24-amd64    S
    527 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
    533 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp01.grid.cluster   BI    2/2       0.28     lx24-amd64    S
    529 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
    531 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp02.grid.cluster   BI    2/2       0.31     lx24-amd64    S
    530 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
    534 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp03.grid.cluster   BI    2/2       0.28     lx24-amd64    S
    528 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
    532 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
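
(To double-check which of these jobs really are suspended at this point, the
suspended jobs can also be listed directly; a sketch:)

  # qstat -s s -u sccomp     (list only the suspended jobs)
  # qstat -j 526             (full details for the new parallel job)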



And here is the corresponding log from the scheduler monitor (tail of
$SGE_ROOT/default/common/schedule):
::::::::
525:1:RUNNING:1131043467:600:P:score:slots:5.000000
525:1:RUNNING:1131043467:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:STARTING:1131043527:600:P:score:slots:5.000000
526:1:STARTING:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
527:1:STARTING:1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000
528:1:STARTING:1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000
529:1:STARTING:1131043527:600:Q:serial.q@comp01.grid.cluster.ac.uk:slots:1.000000
530:1:STARTING:1131043527:600:Q:serial.q@comp02.grid.cluster.ac.uk:slots:1.000000
531:1:STARTING:1131043527:600:Q:serial.q@comp01.grid.cluster.ac.uk:slots:1.000000
532:1:STARTING:1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000
533:1:STARTING:1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000
534:1:STARTING:1131043527:600:Q:serial.q@comp02.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
526:1:SUSPENDED:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
526:1:SUSPENDED:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:RUNNING:1131043527:600:P:score:slots:5.000000
526:1:RUNNING:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
::::::::
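
As suggested in the reply below, the next step is to compare the assembled
dispatch priorities and the policy weights directly, roughly along these
lines (a sketch):

  # qstat -pri -u sccomp
  # qstat -urg -u sccomp
  # qconf -ssconf | egrep "weight_(priority|ticket|urgency)"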




> On Thu, 3 Nov 2005, James Coomer wrote:
>
>> Thanks for the quick response.
>>
>> The urgency values are set so that slots gets the default of 1000. I have
>> increased weight_urgency to be far greater than weight_ticket and
>> weight_priority to see if this makes a difference.
>
> Actually this addresses only the problem of a sequential job being
> dispatched before a parallel job within the same scheduling cycle. But it
> should make a difference to the job dispatch priority. Just use
> qstat -pri to see whether the difference is large enough to overrule
> your ticket policy in this concrete case. By the way, how can you tell it
> occurs within the same cycle? Are you using
>
>  # qconf -ssconf | grep "^params"
>  params                            monitor=1
>
> and tail -f $SGE_ROOT/default/common/schedule?
>
>> But the jobs still jump
>> on to all the queues simultaneously as before.
>
> Hm ... to be honest I can't explain it right now. Actually
> subordination occurs immediately, even within a single scheduling
> cycle, so your double subordination should work.
>
> Are these parallel jobs distributed over multiple hosts?
>
> Regards,
> Andreas


-- 


Dr James Coomer
HPC and Grid Solutions
Streamline Computing


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



