[GE users] exclusive queues - mutually subordinating

James Coomer jamesc at streamline-computing.com
Wed Nov 9 10:42:51 GMT 2005



Andreas,

OK, I'll submit a bug report just as soon as I work out how.

Thanks
James
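
PS for anyone reading along: the setup being discussed is two cluster queues
that subordinate each other, so that on any given host only one of them runs
jobs at a time. A minimal sketch of such a configuration, using the queue
names from the qstat output quoted below (the "=1" thresholds are only an
illustration, not something confirmed in this thread), would be:

  # qconf -sq parallel.q | grep subordinate_list
  subordinate_list      serial.q=1
  # qconf -sq serial.q | grep subordinate_list
  subordinate_list      parallel.q=1

i.e. as soon as one slot is busy in parallel.q on a host, the serial.q
instance on that host should be suspended, and vice versa; the attribute is
edited with "qconf -mq parallel.q" and "qconf -mq serial.q".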

> Hi James,
>
> I can reproduce that behaviour, but I cannot yet say why it is broken.
> Thanks for reporting! Could you file a bug for this?
>
> Regards,
> Andreas
>
> On Thu, 3 Nov 2005, James Coomer wrote:
>
>> Yes, the parallel jobs are distributed across multiple hosts. I wasn't
>> using the monitor option, but I am now and have pasted the output below;
>> this should make things clearer.
>>
>> (1) A parallel job is running, another is queued, and serial jobs are queued
>>
>> sccomp@test:~/EXAMPLE/serial> qstat -f
>> queuename                      qtype used/tot. load_avg arch          states
>> ----------------------------------------------------------------------------
>> master.q@test.grid.cluster     P     1/8       0.00     lx24-amd64
>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>> ----------------------------------------------------------------------------
>> parallel.q@comp00.grid.cluster P     1/1       0.03     lx24-amd64
>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>> ----------------------------------------------------------------------------
>> parallel.q@comp01.grid.cluster P     1/1       0.03     lx24-amd64
>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>> ----------------------------------------------------------------------------
>> parallel.q@comp02.grid.cluster P     1/1       0.07     lx24-amd64
>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>> ----------------------------------------------------------------------------
>> parallel.q@comp03.grid.cluster P     1/1       0.03     lx24-amd64
>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>> ----------------------------------------------------------------------------
>> serial.q@comp00.grid.cluster   BI    0/2       0.03     lx24-amd64    S
>> ----------------------------------------------------------------------------
>> serial.q@comp01.grid.cluster   BI    0/2       0.03     lx24-amd64    S
>> ----------------------------------------------------------------------------
>> serial.q@comp02.grid.cluster   BI    0/2       0.07     lx24-amd64    S
>> ----------------------------------------------------------------------------
>> serial.q@comp03.grid.cluster   BI    0/2       0.03     lx24-amd64    S
>>
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> ############################################################################
>>     526 1000.51000 PMB-MPI1.s sccomp       qw    11/03/2005 18:44:28     5
>>     527 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
>>     528 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
>>     529 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
>>     530 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
>>     531 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
>>     532 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
>>     533 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
>>     534 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
>>
>>
>> (2) I qdel the running parallel job and then do qstat -f
>>
>> sccomp@test:~/EXAMPLE/serial> qstat -f
>> queuename                      qtype used/tot. load_avg arch          states
>> ----------------------------------------------------------------------------
>> master.q@test.grid.cluster     P     1/8       0.00     lx24-amd64
>>     526 1000.51000 PMB-MPI1.s sccomp       t     11/03/2005 18:45:27     1
>> ----------------------------------------------------------------------------
>> parallel.q@comp00.grid.cluster P     1/1       0.28     lx24-amd64    S
>>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
>> ----------------------------------------------------------------------------
>> parallel.q@comp01.grid.cluster P     1/1       0.28     lx24-amd64    S
>>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
>> ----------------------------------------------------------------------------
>> parallel.q@comp02.grid.cluster P     1/1       0.31     lx24-amd64    S
>>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
>> ----------------------------------------------------------------------------
>> parallel.q@comp03.grid.cluster P     1/1       0.28     lx24-amd64    S
>>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
>> ----------------------------------------------------------------------------
>> serial.q@comp00.grid.cluster   BI    2/2       0.28     lx24-amd64    S
>>     527 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>     533 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>> ----------------------------------------------------------------------------
>> serial.q@comp01.grid.cluster   BI    2/2       0.28     lx24-amd64    S
>>     529 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>     531 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>> ----------------------------------------------------------------------------
>> serial.q@comp02.grid.cluster   BI    2/2       0.31     lx24-amd64    S
>>     530 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>     534 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>> ----------------------------------------------------------------------------
>> serial.q@comp03.grid.cluster   BI    2/2       0.28     lx24-amd64    S
>>     528 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>     532 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>
>>
>>
>> And here is the log from the scheduler monitor:
>> ::::::::
>> 525:1:RUNNING:1131043467:600:P:score:slots:5.000000
>> 525:1:RUNNING:1131043467:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
>> 525:1:RUNNING:1131043467:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
>> 525:1:RUNNING:1131043467:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
>> 525:1:RUNNING:1131043467:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
>> 525:1:RUNNING:1131043467:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
>> ::::::::
>> 526:1:STARTING:1131043527:600:P:score:slots:5.000000
>> 526:1:STARTING:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
>> 526:1:STARTING:1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
>> 526:1:STARTING:1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
>> 526:1:STARTING:1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
>> 526:1:STARTING:1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
>> 527:1:STARTING:1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000
>> 528:1:STARTING:1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000
>> 529:1:STARTING:1131043527:600:Q:serial.q@comp01.grid.cluster.ac.uk:slots:1.000000
>> 530:1:STARTING:1131043527:600:Q:serial.q@comp02.grid.cluster.ac.uk:slots:1.000000
>> 531:1:STARTING:1131043527:600:Q:serial.q@comp01.grid.cluster.ac.uk:slots:1.000000
>> 532:1:STARTING:1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000
>> 533:1:STARTING:1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000
>> 534:1:STARTING:1131043527:600:Q:serial.q@comp02.grid.cluster.ac.uk:slots:1.000000
>> ::::::::
>> 526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
>> 526:1:SUSPENDED:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
>> ::::::::
>> 526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
>> 526:1:SUSPENDED:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
>> ::::::::
>> 526:1:RUNNING:1131043527:600:P:score:slots:5.000000
>> 526:1:RUNNING:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
>> 526:1:RUNNING:1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
>> 526:1:RUNNING:1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
>> 526:1:RUNNING:1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
>> 526:1:RUNNING:1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
>> ::::::::
>>
>>
>>
>>
>> > On Thu, 3 Nov 2005, James Coomer wrote:
>> >
>> >> Thanks for the quick response.
>> >>
>> >> The urgency values are set so that slots gets the default urgency of 1000.
>> >> I have increased weight_urgency to be far greater than weight_ticket and
>> >> weight_priority to see if this makes a difference.
>> >
>> > Actually this addresses only the problem of dispatching sequential
>> > jobs before a parallel job within the same scheduling cycle. But it
>> > must make a difference to the job dispatch priority. You should just use
>> > qstat -pri to see whether the difference is large enough to overrule
>> > your ticket policy in this concrete case. Btw, how can you say it
>> > occurs within the same cycle? Are you using
>> >
>> >  # qconf -ssconf | grep "^params"
>> >  params                            monitor=1
>> >
>> > and tail -f $SGE_ROOT/default/common/schedule?
>> >
>> >> But the jobs still jump
>> >> on to all the queues simultaneously as before.
>> >
>> > Hm ... to be honest, I can't explain it right now. Actually,
>> > subordination occurs immediately, even within a single scheduling
>> > cycle, so your double subordination should work.
>> >
>> > Are these parallel jobs distributed over multiple hosts?
>> >
>> > Regards,
>> > Andreas
>> >
>>
>>
>> --
>>
>>
>> Dr James Coomer
>> HPC and Grid Solutions
>> Streamline Computing
>>
>>
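
For completeness, the checks Andreas suggests further up in the quoted thread
look roughly like this (a sketch only; the exact columns printed by
"qstat -pri" vary a little between SGE versions):

  # qstat -pri
  # qconf -ssconf | egrep "weight_urgency|weight_ticket|weight_priority"

qstat -pri shows the normalised urgency and ticket contributions alongside the
final dispatch priority of each job, and the second command pulls the policy
weights out of the scheduler configuration; the weights themselves are changed
with "qconf -msconf".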


-- 


Dr James Coomer
HPC and Grid Solutions
Streamline Computing


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



