[GE users] exclusive queues - mutually subordinating

Chris Dagdigian dag at sonsorol.org
Wed Nov 9 11:01:20 GMT 2005


Hi James,

To report a bug you just have to be a "known" user on
http://gridengine.sunsource.net. The basic process is:

- register a user account; join the "gridengine" project as an observer
- once you are logged in you can go to
  http://gridengine.sunsource.net/servlets/ProjectIssues and file a
  "DEFECT" report

Someone else can open the bug report for you if you don't want to go
through the registration process. Since others have been able to
reproduce it, there is enough basic info in the email thread to open
the ticket.

One of the advantages of being "known" on the site is that you can
register your interest in particular issues in Bugzilla and get email
updates when their status changes.

Regards,
Chris



On Nov 9, 2005, at 5:42 AM, James Coomer wrote:

> Andreas,
>
> OK I'll submit a bug report just as soon as I work out how.
>
> Thanks
> James
>
>> Hi James,
>>
>> I can reproduce that behaviour, but I cannot yet say why it is broken.
>> Thanks for reporting! Could you file a bug for this?
>>
>> Regards,
>> Andreas
>>
>> On Thu, 3 Nov 2005, James Coomer wrote:
>>
>>> Yes, the parallel jobs are distributed across multiple hosts. I wasn't
>>> using the monitor option, but I am now and have pasted the output
>>> below; this should make things clearer.
>>>
>>> (1) A parallel job is running, another is queued, and serial jobs are
>>> queued:
>>>
>>> sccomp@test:~/EXAMPLE/serial> qstat -f
>>> queuename                      qtype used/tot. load_avg arch          states
>>> ----------------------------------------------------------------------------
>>> master.q@test.grid.cluster     P     1/8       0.00     lx24-amd64
>>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>>> ----------------------------------------------------------------------------
>>> parallel.q@comp00.grid.cluster P     1/1       0.03     lx24-amd64
>>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>>> ----------------------------------------------------------------------------
>>> parallel.q@comp01.grid.cluster P     1/1       0.03     lx24-amd64
>>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>>> ----------------------------------------------------------------------------
>>> parallel.q@comp02.grid.cluster P     1/1       0.07     lx24-amd64
>>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>>> ----------------------------------------------------------------------------
>>> parallel.q@comp03.grid.cluster P     1/1       0.03     lx24-amd64
>>>     525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
>>> ----------------------------------------------------------------------------
>>> serial.q@comp00.grid.cluster   BI    0/2       0.03     lx24-amd64    S
>>> ----------------------------------------------------------------------------
>>> serial.q@comp01.grid.cluster   BI    0/2       0.03     lx24-amd64    S
>>> ----------------------------------------------------------------------------
>>> serial.q@comp02.grid.cluster   BI    0/2       0.07     lx24-amd64    S
>>> ----------------------------------------------------------------------------
>>> serial.q@comp03.grid.cluster   BI    0/2       0.03     lx24-amd64    S
>>>
>>> ############################################################################
>>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>> ############################################################################
>>>     526 1000.51000 PMB-MPI1.s sccomp       qw    11/03/2005 18:44:28     5
>>>     527 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
>>>     528 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
>>>     529 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
>>>     530 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
>>>     531 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
>>>     532 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
>>>     533 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
>>>     534 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
>>>
>>>
>>> (2) I qdel the running parallel job and then do qstat -f
>>>
>>> sccomp@test:~/EXAMPLE/serial> qstat -f
>>> queuename                      qtype used/tot. load_avg arch          states
>>> ----------------------------------------------------------------------------
>>> master.q@test.grid.cluster     P     1/8       0.00     lx24-amd64
>>>     526 1000.51000 PMB-MPI1.s sccomp       t     11/03/2005 18:45:27     1
>>> ----------------------------------------------------------------------------
>>> parallel.q@comp00.grid.cluster P     1/1       0.28     lx24-amd64    S
>>>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
>>> ----------------------------------------------------------------------------
>>> parallel.q@comp01.grid.cluster P     1/1       0.28     lx24-amd64    S
>>>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
>>> ----------------------------------------------------------------------------
>>> parallel.q@comp02.grid.cluster P     1/1       0.31     lx24-amd64    S
>>>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
>>> ----------------------------------------------------------------------------
>>> parallel.q@comp03.grid.cluster P     1/1       0.28     lx24-amd64    S
>>>     526 1000.51000 PMB-MPI1.s sccomp       St    11/03/2005 18:45:27     1
>>> ----------------------------------------------------------------------------
>>> serial.q@comp00.grid.cluster   BI    2/2       0.28     lx24-amd64    S
>>>     527 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>     533 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>> ----------------------------------------------------------------------------
>>> serial.q@comp01.grid.cluster   BI    2/2       0.28     lx24-amd64    S
>>>     529 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>     531 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>> ----------------------------------------------------------------------------
>>> serial.q@comp02.grid.cluster   BI    2/2       0.31     lx24-amd64    S
>>>     530 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>     534 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>> ----------------------------------------------------------------------------
>>> serial.q@comp03.grid.cluster   BI    2/2       0.28     lx24-amd64    S
>>>     528 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>     532 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
>>>
>>>
>>>
>>> And here is the log from the scheduler monitor:
>>> ::::::::
>>> 525:1:RUNNING:1131043467:600:P:score:slots:5.000000
>>> 525:1:RUNNING:1131043467:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
>>> 525:1:RUNNING:1131043467:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
>>> 525:1:RUNNING:1131043467:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
>>> 525:1:RUNNING:1131043467:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
>>> 525:1:RUNNING:1131043467:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
>>> ::::::::
>>> 526:1:STARTING:1131043527:600:P:score:slots:5.000000
>>> 526:1:STARTING:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
>>> 526:1:STARTING:1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
>>> 526:1:STARTING:1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
>>> 526:1:STARTING:1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
>>> 526:1:STARTING:1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
>>> 527:1:STARTING:1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000
>>> 528:1:STARTING:1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000
>>> 529:1:STARTING:1131043527:600:Q:serial.q@comp01.grid.cluster.ac.uk:slots:1.000000
>>> 530:1:STARTING:1131043527:600:Q:serial.q@comp02.grid.cluster.ac.uk:slots:1.000000
>>> 531:1:STARTING:1131043527:600:Q:serial.q@comp01.grid.cluster.ac.uk:slots:1.000000
>>> 532:1:STARTING:1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000
>>> 533:1:STARTING:1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000
>>> 534:1:STARTING:1131043527:600:Q:serial.q@comp02.grid.cluster.ac.uk:slots:1.000000
>>> ::::::::
>>> 526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
>>> 526:1:SUSPENDED:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
>>> ::::::::
>>> 526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
>>> 526:1:SUSPENDED:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
>>> ::::::::
>>> 526:1:RUNNING:1131043527:600:P:score:slots:5.000000
>>> 526:1:RUNNING:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
>>> 526:1:RUNNING:1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
>>> 526:1:RUNNING:1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
>>> 526:1:RUNNING:1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
>>> 526:1:RUNNING:1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
>>> ::::::::
>>>
>>>
>>>
>>>
>>>> On Thu, 3 Nov 2005, James Coomer wrote:
>>>>
>>>>> Thanks for the quick response.
>>>>>
>>>>> The urgency values are set so that slots gets 1000 by default. I have
>>>>> increased weight_urgency to be far greater than weight_ticket and
>>>>> weight_priority to see if this makes a difference.
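>>>>>
>>>>> For reference, those weights live in the scheduler configuration and
>>>>> can be checked with something like this (the numbers below are only
>>>>> illustrative, not my exact settings):
>>>>>
>>>>>  # qconf -ssconf | egrep "weight_(urgency|ticket|priority)"
>>>>>  weight_urgency                    1000.000000
>>>>>  weight_ticket                     0.010000
>>>>>  weight_priority                   1.000000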
>>>>
>>>> Actually this addresses only the problem of dispatching sequential
>>>> jobs before a parallel job within the same scheduling cycle. But it
>>>> must make a difference to the job dispatch priority. You should just
>>>> use qstat -pri to see whether the difference is large enough to
>>>> overrule your ticket policy in this concrete case. Btw, how can you
>>>> tell it occurs within the same cycle? Are you using
>>>>
>>>>  # qconf -ssconf | grep "^params"
>>>>  params                            monitor=1
>>>>
>>>> and tail -f $SGE_ROOT/default/common/schedule?
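>>>>
>>>> If monitoring is not switched on yet, it can be enabled by editing the
>>>> scheduler configuration, e.g. something like
>>>>
>>>>  # qconf -msconf        (set "params    monitor=1" in the editor)
>>>>
>>>> after which the schedule file gets one block per scheduling run,
>>>> separated by "::::::::" lines.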
>>>>
>>>>> But the jobs still jump onto all the queues simultaneously as before.
>>>>
>>>> Hm ... to be honest I can't explain it right now. Actually,
>>>> subordination occurs immediately, even within a single scheduling
>>>> cycle, so your double subordination should work.
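>>>>
>>>> (With "double subordination" I mean the mutual setup from the subject
>>>> line, presumably configured along these lines:
>>>>
>>>>  # qconf -sq parallel.q | grep subordinate_list
>>>>  subordinate_list      serial.q=1
>>>>  # qconf -sq serial.q | grep subordinate_list
>>>>  subordinate_list      parallel.q=1
>>>>
>>>> so whichever queue gets a job first should suspend the other.)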
>>>>
>>>> Are these parallel jobs distributed over multiple hosts?
>>>>
>>>> Regards,
>>>> Andreas
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



