[GE users] exclusive queues - mutually subordinating

Andreas Haas Andreas.Haas at Sun.COM
Fri Nov 4 16:36:14 GMT 2005


Hi James,

I can reproduce that behaviour, but I cannot yet say why it is broken.
Thanks for reporting! Could you file a bug for this?

Regards,
Andreas

On Thu, 3 Nov 2005, James Coomer wrote:

> Yes, the parallel jobs are distributed across multiple hosts. I wasn't using
> the monitor option, but I am now and have pasted the output below; this
> should make things clearer.
>
> (1) A parallel job is running, another is queued, and serial jobs are queued
>
> sccomp at test:~/EXAMPLE/serial> qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> master.q at test.grid.cluster P     1/8       0.00     lx24-amd64
>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
> ----------------------------------------------------------------------------
> parallel.q at comp00.grid.cluster P     1/1       0.03     lx24-amd64
>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
> ----------------------------------------------------------------------------
> parallel.q at comp01.grid.cluster P     1/1       0.03     lx24-amd64
>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
> ----------------------------------------------------------------------------
> parallel.q at comp02.grid.cluster P     1/1       0.07     lx24-amd64
>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
> ----------------------------------------------------------------------------
> parallel.q at comp03.grid.cluster P     1/1       0.03     lx24-amd64
>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
> ----------------------------------------------------------------------------
> serial.q at comp00.grid.cluster BI    0/2       0.03     lx24-amd64    S
> ----------------------------------------------------------------------------
> serial.q at comp01.grid.cluster BI    0/2       0.03     lx24-amd64    S
> ----------------------------------------------------------------------------
> serial.q at comp02.grid.cluster BI    0/2       0.07     lx24-amd64    S
> ----------------------------------------------------------------------------
> serial.q at comp03.grid.cluster BI    0/2       0.03     lx24-amd64    S
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>     526 1000.51000 PMB-MPI1.s sccomp       qw    11/03/2005 18:44:28     5
>     527 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
>     528 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
>     529 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
>     530 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
>     531 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
>     532 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
>     533 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
>     534 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
>
>
> (2) I qdel the running parallel job and then do qstat -f
>
> sccomp at test:~/EXAMPLE/serial> qstat -f
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> master.q at test.grid.cluster P     1/8       0.00     lx24-amd64
>     526 1000.51000 PMB-MPI1.s sccomp       t     11/03/2005 18:45:27     1
> ----------------------------------------------------------------------------
> parallel.q at comp00.grid.cluster P     1/1       0.28     lx24-amd64    S
>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
> ----------------------------------------------------------------------------
> parallel.q at comp01.grid.cluster P     1/1       0.28     lx24-amd64    S
>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
> ----------------------------------------------------------------------------
> parallel.q at comp02.grid.cluster P     1/1       0.31     lx24-amd64    S
>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
> ----------------------------------------------------------------------------
> parallel.q at comp03.grid.cluster P     1/1       0.28     lx24-amd64    S
>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
> ----------------------------------------------------------------------------
> serial.q at comp00.grid.cluster BI    2/2       0.28     lx24-amd64    S
>     527 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>     533 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
> ----------------------------------------------------------------------------
> serial.q at comp01.grid.cluster BI    2/2       0.28     lx24-amd64    S
>     529 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>     531 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
> ----------------------------------------------------------------------------
> serial.q at comp02.grid.cluster BI    2/2       0.31     lx24-amd64    S
>     530 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>     534 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
> ----------------------------------------------------------------------------
> serial.q at comp03.grid.cluster BI    2/2       0.28     lx24-amd64    S
>     528 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>     532 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>
>
>
> And here is the log from the scheduler monitor:
> ::::::::
> 525:1:RUNNING:1131043467:600:P:score:slots:5.000000
> 525:1:RUNNING:1131043467:600:Q:master.q at test.grid.cluster.ac.uk:slots:1.000000
> 525:1:RUNNING:1131043467:600:Q:parallel.q at comp00.grid.cluster.ac.uk:slots:1.000000
> 525:1:RUNNING:1131043467:600:Q:parallel.q at comp02.grid.cluster.ac.uk:slots:1.000000
> 525:1:RUNNING:1131043467:600:Q:parallel.q at comp03.grid.cluster.ac.uk:slots:1.000000
> 525:1:RUNNING:1131043467:600:Q:parallel.q at comp01.grid.cluster.ac.uk:slots:1.000000
> ::::::::
> 526:1:STARTING:1131043527:600:P:score:slots:5.000000
> 526:1:STARTING:1131043527:600:Q:master.q at test.grid.cluster.ac.uk:slots:1.000000
> 526:1:STARTING:1131043527:600:Q:parallel.q at comp00.grid.cluster.ac.uk:slots:1.000000
> 526:1:STARTING:1131043527:600:Q:parallel.q at comp02.grid.cluster.ac.uk:slots:1.000000
> 526:1:STARTING:1131043527:600:Q:parallel.q at comp03.grid.cluster.ac.uk:slots:1.000000
> 526:1:STARTING:1131043527:600:Q:parallel.q at comp01.grid.cluster.ac.uk:slots:1.000000
> 527:1:STARTING:1131043527:600:Q:serial.q at comp00.grid.cluster.ac.uk:slots:1.000000
> 528:1:STARTING:1131043527:600:Q:serial.q at comp03.grid.cluster.ac.uk:slots:1.000000
> 529:1:STARTING:1131043527:600:Q:serial.q at comp01.grid.cluster.ac.uk:slots:1.000000
> 530:1:STARTING:1131043527:600:Q:serial.q at comp02.grid.cluster.ac.uk:slots:1.000000
> 531:1:STARTING:1131043527:600:Q:serial.q at comp01.grid.cluster.ac.uk:slots:1.000000
> 532:1:STARTING:1131043527:600:Q:serial.q at comp03.grid.cluster.ac.uk:slots:1.000000
> 533:1:STARTING:1131043527:600:Q:serial.q at comp00.grid.cluster.ac.uk:slots:1.000000
> 534:1:STARTING:1131043527:600:Q:serial.q at comp02.grid.cluster.ac.uk:slots:1.000000
> ::::::::
> 526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
> 526:1:SUSPENDED:1131043527:600:Q:master.q at test.grid.cluster.ac.uk:slots:1.000000
> ::::::::
> 526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
> 526:1:SUSPENDED:1131043527:600:Q:master.q at test.grid.cluster.ac.uk:slots:1.000000
> ::::::::
> 526:1:RUNNING:1131043527:600:P:score:slots:5.000000
> 526:1:RUNNING:1131043527:600:Q:master.q at test.grid.cluster.ac.uk:slots:1.000000
> 526:1:RUNNING:1131043527:600:Q:parallel.q at comp00.grid.cluster.ac.uk:slots:1.000000
> 526:1:RUNNING:1131043527:600:Q:parallel.q at comp02.grid.cluster.ac.uk:slots:1.000000
> 526:1:RUNNING:1131043527:600:Q:parallel.q at comp03.grid.cluster.ac.uk:slots:1.000000
> 526:1:RUNNING:1131043527:600:Q:parallel.q at comp01.grid.cluster.ac.uk:slots:1.000000
> ::::::::
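>
> (My reading of the schedule file format, so treat the field names as an
> assumption: each line is
>
>  jobid:taskid:state:start_time:duration:level:object:resource:amount
>
> where level P is the slot booking against the parallel environment, Q is a
> booking against a single queue instance, and the "::::::::" lines separate
> scheduling runs.)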
>
>
>
>
> > On Thu, 3 Nov 2005, James Coomer wrote:
> >
> >> Thanks for the quick response.
> >>
> >> The urgency values are set so that slots gets a default of 1000. I have
> >> increased weight_urgency to be far greater than weight_ticket and
> >> weight_priority to see if this makes a difference.
> >
> > Actually this addresses only the problem of dispatching a sequential
> > job before a parallel job within the same scheduling cycle. But it
> > should make a difference to the job dispatch priority. Just use
> > qstat -pri to see whether the difference is large enough to overrule
> > your ticket policy in this concrete case. Btw, how can you tell it
> > occurs within the same cycle? Are you using
> >
> >  # qconf -ssconf | grep "^params"
> >  params                            monitor=1
> >
> > and tail -f $SGE_ROOT/default/common/schedule?
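> >
> > (As a quick sanity check of the weights themselves, something along these
> > lines should show them; the egrep pattern and the -u filter are only a
> > sketch, adjust as needed:
> >
> >  # qconf -ssconf | egrep "^weight_(priority|urgency|ticket)"
> >  # qstat -pri -u sccomp
> >
> > and then compare how large the urgency contribution is relative to the
> > ticket and POSIX priority contributions.)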
> >
> >> But the jobs still jump
> >> onto all the queues simultaneously, as before.
> >
> > Hm ... to be honest I can't explain it right now. Actually,
> > subordination occurs immediately, even within a single scheduling
> > cycle, so your double subordination should work.
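> >
> > (For reference, the kind of mutually subordinating setup I have in mind is
> > roughly the following; I am assuming you configured it via subordinate_list
> > on both cluster queues, with the suspend thresholds left at their defaults:
> >
> >  # qconf -sq parallel.q | grep subordinate_list
> >  subordinate_list      serial.q
> >  # qconf -sq serial.q | grep subordinate_list
> >  subordinate_list      parallel.q
> >
> > With that in place a queue instance should be suspended as soon as the
> > other queue on the same host fills up.)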
> >
> > Are these parallel jobs distributed over multiple hosts?
> >
> > Regards,
> > Andreas
> >
>
>
> --
>
>
> Dr James Coomer
> HPC and Grid Solutions
> Streamline Computing
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



