[GE users] exclusive queues - mutually subordinating

James Coomer jamesc at streamline-computing.com
Fri Nov 11 11:02:18 GMT 2005



Many thanks to all for all the good help. I have now filed the issue.

James

> Hi James,
>
> To report a bug you just have to be a "known" user on
> http://gridengine.sunsource.net; the basic process is:
>
> - register a user account; join the "gridengine" project as an observer
> - once you are logged in you can go to
> http://gridengine.sunsource.net/servlets/ProjectIssues and file a
> "DEFECT" report
>
> Someone else can open the bug report for you if you don't want to go
> through the registration process.  Since others have been able to
> reproduce it, there is enough basic information in the email thread to
> open the ticket.
>
> One of the advantages of being "known" on the site is that you can
> make your interest in certain issues known to Bugzilla and get email
> updates etc. when the status changes.
>
> Regards,
> Chris
>
>
>
> On Nov 9, 2005, at 5:42 AM, James Coomer wrote:
>
>> Andreas,
>>
>> OK I'll submit a bug report just as soon as I work out how.
>>
>> Thanks
>> James
>>
>>> Hi James,
>>>
>>> I can reproduce that behaviour, but I can not yet say why it is
>>> broken.
>>> Thanks for reporting! Could you file a bug for this?
>>>
>>> Regards,
>>> Andreas
>>>
>>> On Thu, 3 Nov 2005, James Coomer wrote:
>>>
>>>> Yes, the parallel jobs are distributed across multiple hosts. I wasn't
>>>> using the monitor option, but I am now and have pasted the output
>>>> below; this should make things clearer.
>>>>
>>>> (1) A parallel job is running, another is queued, and serial jobs are
>>>> queued:
>>>>
>>>> sccomp at test:~/EXAMPLE/serial> qstat -f
>>>> queuename                        qtype used/tot. load_avg arch          states
>>>> ----------------------------------------------------------------------------
>>>> master.q at test.grid.cluster      P     1/8       0.00     lx24-amd64
>>>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>>>> ----------------------------------------------------------------------------
>>>> parallel.q at comp00.grid.cluster  P     1/1       0.03     lx24-amd64
>>>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>>>> ----------------------------------------------------------------------------
>>>> parallel.q at comp01.grid.cluster  P     1/1       0.03     lx24-amd64
>>>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>>>> ----------------------------------------------------------------------------
>>>> parallel.q at comp02.grid.cluster  P     1/1       0.07     lx24-amd64
>>>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>>>> ----------------------------------------------------------------------------
>>>> parallel.q at comp03.grid.cluster  P     1/1       0.03     lx24-amd64
>>>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>>>> ----------------------------------------------------------------------------
>>>> serial.q at comp00.grid.cluster    BI    0/2       0.03     lx24-amd64    S
>>>> ----------------------------------------------------------------------------
>>>> serial.q at comp01.grid.cluster    BI    0/2       0.03     lx24-amd64    S
>>>> ----------------------------------------------------------------------------
>>>> serial.q at comp02.grid.cluster    BI    0/2       0.07     lx24-amd64    S
>>>> ----------------------------------------------------------------------------
>>>> serial.q at comp03.grid.cluster    BI    0/2       0.03     lx24-amd64    S
>>>>
>>>> ############################################################################
>>>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>>> ############################################################################
>>>>     526 1000.51000 PMB-MPI1.s sccomp      qw    11/03/2005 18:44:28     5
>>>>     527 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
>>>>     528 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
>>>>     529 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
>>>>     530 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
>>>>     531 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
>>>>     532 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
>>>>     533 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
>>>>     534 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
>>>>
>>>>
>>>> (2) I qdel the running parallel job and then do qstat -f:
>>>>
>>>> sccomp at test:~/EXAMPLE/serial> qstat -f
>>>> queuename                        qtype used/tot. load_avg arch          states
>>>> ----------------------------------------------------------------------------
>>>> master.q at test.grid.cluster      P     1/8       0.00     lx24-amd64
>>>>     526 1000.51000 PMB-MPI1.s sccomp      t     11/03/2005 18:45:27     1
>>>> ----------------------------------------------------------------------------
>>>> parallel.q at comp00.grid.cluster  P     1/1       0.28     lx24-amd64    S
>>>>     526 1000.51000 PMB-MPI1.s sccomp      St    11/03/2005 18:45:27     1
>>>> ----------------------------------------------------------------------------
>>>> parallel.q at comp01.grid.cluster  P     1/1       0.28     lx24-amd64    S
>>>>     526 1000.51000 PMB-MPI1.s sccomp      St    11/03/2005 18:45:27     1
>>>> ----------------------------------------------------------------------------
>>>> parallel.q at comp02.grid.cluster  P     1/1       0.31     lx24-amd64    S
>>>>     526 1000.51000 PMB-MPI1.s sccomp      St    11/03/2005 18:45:27     1
>>>> ----------------------------------------------------------------------------
>>>> parallel.q at comp03.grid.cluster  P     1/1       0.28     lx24-amd64    S
>>>>     526 1000.51000 PMB-MPI1.s sccomp      St    11/03/2005 18:45:27     1
>>>> ----------------------------------------------------------------------------
>>>> serial.q at comp00.grid.cluster    BI    2/2       0.28     lx24-amd64    S
>>>>     527 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>>     533 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>> ----------------------------------------------------------------------------
>>>> serial.q at comp01.grid.cluster    BI    2/2       0.28     lx24-amd64    S
>>>>     529 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>>     531 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>> ----------------------------------------------------------------------------
>>>> serial.q at comp02.grid.cluster    BI    2/2       0.31     lx24-amd64    S
>>>>     530 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>>     534 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>> ----------------------------------------------------------------------------
>>>> serial.q at comp03.grid.cluster    BI    2/2       0.28     lx24-amd64    S
>>>>     528 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>>     532 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>>
>>>>
>>>>
>>>> And here is the log from the scheduler monitor:
>>>> ::::::::
>>>> 525:1:RUNNING:1131043467:600:P:score:slots:5.000000
>>>> 525:1:RUNNING:1131043467:600:Q:master.q at test.grid.cluster.ac.uk:slots:1.000000
>>>> 525:1:RUNNING:1131043467:600:Q:parallel.q at comp00.grid.cluster.ac.uk:slots:1.000000
>>>> 525:1:RUNNING:1131043467:600:Q:parallel.q at comp02.grid.cluster.ac.uk:slots:1.000000
>>>> 525:1:RUNNING:1131043467:600:Q:parallel.q at comp03.grid.cluster.ac.uk:slots:1.000000
>>>> 525:1:RUNNING:1131043467:600:Q:parallel.q at comp01.grid.cluster.ac.uk:slots:1.000000
>>>> ::::::::
>>>> 526:1:STARTING:1131043527:600:P:score:slots:5.000000
>>>> 526:1:STARTING:1131043527:600:Q:master.q at test.grid.cluster.ac.uk:slots:1.000000
>>>> 526:1:STARTING:1131043527:600:Q:parallel.q at comp00.grid.cluster.ac.uk:slots:1.000000
>>>> 526:1:STARTING:1131043527:600:Q:parallel.q at comp02.grid.cluster.ac.uk:slots:1.000000
>>>> 526:1:STARTING:1131043527:600:Q:parallel.q at comp03.grid.cluster.ac.uk:slots:1.000000
>>>> 526:1:STARTING:1131043527:600:Q:parallel.q at comp01.grid.cluster.ac.uk:slots:1.000000
>>>> 527:1:STARTING:1131043527:600:Q:serial.q at comp00.grid.cluster.ac.uk:slots:1.000000
>>>> 528:1:STARTING:1131043527:600:Q:serial.q at comp03.grid.cluster.ac.uk:slots:1.000000
>>>> 529:1:STARTING:1131043527:600:Q:serial.q at comp01.grid.cluster.ac.uk:slots:1.000000
>>>> 530:1:STARTING:1131043527:600:Q:serial.q at comp02.grid.cluster.ac.uk:slots:1.000000
>>>> 531:1:STARTING:1131043527:600:Q:serial.q at comp01.grid.cluster.ac.uk:slots:1.000000
>>>> 532:1:STARTING:1131043527:600:Q:serial.q at comp03.grid.cluster.ac.uk:slots:1.000000
>>>> 533:1:STARTING:1131043527:600:Q:serial.q at comp00.grid.cluster.ac.uk:slots:1.000000
>>>> 534:1:STARTING:1131043527:600:Q:serial.q at comp02.grid.cluster.ac.uk:slots:1.000000
>>>> ::::::::
>>>> 526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
>>>> 526:1:SUSPENDED:1131043527:600:Q:master.q at test.grid.cluster.ac.uk:slots:1.000000
>>>> ::::::::
>>>> 526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
>>>> 526:1:SUSPENDED:1131043527:600:Q:master.q at test.grid.cluster.ac.uk:slots:1.000000
>>>> ::::::::
>>>> 526:1:RUNNING:1131043527:600:P:score:slots:5.000000
>>>> 526:1:RUNNING:1131043527:600:Q:master.q at test.grid.cluster.ac.uk:slots:1.000000
>>>> 526:1:RUNNING:1131043527:600:Q:parallel.q at comp00.grid.cluster.ac.uk:slots:1.000000
>>>> 526:1:RUNNING:1131043527:600:Q:parallel.q at comp02.grid.cluster.ac.uk:slots:1.000000
>>>> 526:1:RUNNING:1131043527:600:Q:parallel.q at comp03.grid.cluster.ac.uk:slots:1.000000
>>>> 526:1:RUNNING:1131043527:600:Q:parallel.q at comp01.grid.cluster.ac.uk:slots:1.000000
>>>> ::::::::
>>>>
>>>>
>>>>
>>>>
>>>>> On Thu, 3 Nov 2005, James Coomer wrote:
>>>>>
>>>>>> Thanks for the quick response.
>>>>>>
>>>>>> The urgency values are set so that slots gets 1000 default. I have
>>>>>> increased the weight_urgency to be far greater than the
>>>> weight_ticket,
>>>>>> weight_priority to see if this makes a difference.
>>>>>
>>>>> Actually this addresses only the problem of dispatching a sequential
>>>>> job before a parallel job within the same scheduling cycle. But it
>>>>> should make a difference to the job dispatch priority. You should use
>>>>> qstat -pri to see whether the difference is large enough to overrule
>>>>> your ticket policy in this concrete case. Btw., how can you tell it
>>>>> occurs within the same cycle? Are you using
>>>>>
>>>>>  # qconf -ssconf | grep "^params"
>>>>>  params                            monitor=1
>>>>>
>>>>> and tail -f $SGE_ROOT/default/common/schedule?
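If monitoring is not already switched on, a minimal way to enable it and to
compare the policy weights (a sketch, assuming a standard SGE 6.x install
with qconf on the PATH):

  # open the scheduler configuration in $EDITOR and set "params  monitor=1"
  qconf -msconf

  # see how the urgency, ticket and POSIX priority policies are weighted
  qconf -ssconf | egrep "^weight_(urgency|ticket|priority)"

  # per-job priority contributions (urgency, tickets, priority)
  qstat -pri

  # then follow the scheduler's dispatch and suspend decisions
  tail -f $SGE_ROOT/default/common/schedule
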
>>>>>
>>>>>> But the jobs still jump onto all the queues simultaneously, as before.
>>>>>
>>>>> Hm ... to be honest I can't explain it right now. Actually,
>>>>> subordination occurs immediately, even within a single scheduling
>>>>> cycle, so your double subordination should work.
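For reference, this kind of double subordination is configured via the
subordinate_list attribute of each cluster queue. With the queue names from
the qstat output above, the relevant lines would look roughly like this
(a sketch, not necessarily the poster's actual configuration):

  qconf -sq parallel.q | grep subordinate_list
  subordinate_list      serial.q

  qconf -sq serial.q | grep subordinate_list
  subordinate_list      parallel.q
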
>>>>>
>>>>> Are these parallel jobs distributed over multiple hosts?
>>>>>
>>>>> Regards,
>>>>> Andreas
>>>>
>


-- 


Dr James Coomer
HPC and Grid Solutions
Streamline Computing


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net

