[GE users] All queues dropped because of overload or full

Alexandre Racine Alexandre.Racine at mhicc.org
Thu Dec 13 14:53:38 GMT 2007



Oops, sorry, I did confuse -mconf and -msconf. In -msconf it does read:
   schedd_job_info                   true

So it is already set to true.
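(For the record, the setting can be verified without opening an editor; `qconf -ssconf` dumps the current scheduler configuration, while `qconf -msconf` opens it for editing. The grep output shown assumes the value is already true, as above:)

$ qconf -ssconf | grep schedd_job_info
schedd_job_info                   true
$ qconf -msconf        # opens the scheduler configuration in $EDITOR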

My statisticians are currently running some jobs, and I am getting the same message again.

$ qstat -j 181
[...]
script_file:                script.sh
job-array tasks:            1-24:1
usage    2:                 cpu=01:12:23, mem=175.73819 GBs, io=0.00000, vmem=53.320M, maxvmem=53.465M
usage    3:                 cpu=00:27:27, mem=66.77060 GBs, io=0.00000, vmem=53.305M, maxvmem=53.453M
usage    4:                 cpu=00:27:27, mem=66.86123 GBs, io=0.00000, vmem=53.363M, maxvmem=53.512M
usage    5:                 cpu=00:54:31, mem=132.65321 GBs, io=0.00000, vmem=53.320M, maxvmem=53.469M
usage    6:                 cpu=01:11:02, mem=172.81396 GBs, io=0.00000, vmem=53.410M, maxvmem=53.559M
usage    7:                 cpu=00:27:22, mem=66.61948 GBs, io=0.00000, vmem=53.340M, maxvmem=53.496M
usage    8:                 cpu=00:27:18, mem=66.48247 GBs, io=0.00000, vmem=53.336M, maxvmem=53.488M
usage   10:                 cpu=01:12:33, mem=176.25866 GBs, io=0.00000, vmem=53.328M, maxvmem=53.473M
usage   11:                 cpu=00:28:11, mem=68.54915 GBs, io=0.00000, vmem=53.301M, maxvmem=53.449M
usage   12:                 cpu=00:28:03, mem=68.32510 GBs, io=0.00000, vmem=53.355M, maxvmem=53.355M
usage   13:                 cpu=00:54:04, mem=131.54787 GBs, io=0.00000, vmem=53.309M, maxvmem=53.375M
usage   14:                 cpu=00:28:05, mem=68.33146 GBs, io=0.00000, vmem=53.301M, maxvmem=53.367M
usage   15:                 cpu=00:28:06, mem=68.42011 GBs, io=0.00000, vmem=53.340M, maxvmem=53.488M
usage   16:                 cpu=00:54:34, mem=132.87146 GBs, io=0.00000, vmem=53.344M, maxvmem=53.344M
usage   18:                 cpu=01:12:08, mem=175.33734 GBs, io=0.00000, vmem=53.223M, maxvmem=53.367M
usage   19:                 cpu=00:47:27, mem=115.43399 GBs, io=0.00000, vmem=53.320M, maxvmem=53.469M
usage   20:                 cpu=00:29:15, mem=71.27488 GBs, io=0.00000, vmem=53.359M, maxvmem=53.559M
usage   21:                 cpu=00:29:12, mem=71.05423 GBs, io=0.00000, vmem=53.305M, maxvmem=53.492M
usage   22:                 cpu=00:48:10, mem=117.18612 GBs, io=0.00000, vmem=53.316M, maxvmem=53.465M
usage   23:                 cpu=00:47:21, mem=115.29290 GBs, io=0.00000, vmem=53.348M, maxvmem=53.496M
usage   24:                 cpu=01:11:24, mem=173.82426 GBs, io=0.00000, vmem=53.445M, maxvmem=53.488M
scheduling info:            queue instance "all.q at wasabi01.statgen.local" dropped because it is full
                            queue instance "all.q at oregano.statgen.local" dropped because it is full
                            queue instance "all.q at PAPRIKA" dropped because it is full
                            queue instance "testalpha.q at wasabi01.statgen.local" dropped because it is full
                            All queues dropped because of overload or full


>> Q1- So what you are actually saying is that everything is fine, and
>> that SGE is just saying that there is no more slots left?
>
> No.


All slots are in use, and 2 tasks are pending. So this "All queues dropped ..." message is simply the result of all queues being full, and there is nothing more to it?
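(As a quick sanity check, `qstat -g c` gives a per-cluster-queue summary of used versus available slots; the numbers below are only an illustration of what I would expect to see, not actual output:)

$ qstat -g c
CLUSTER QUEUE      CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
all.q                0.95     22      0     22      0      0
testalpha.q          0.55      4      0      4      0      0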

Could it be that the processors are overloaded because they are hyperthreading? For example, the oregano server has two (2) dual-core (x2) hyperthreaded (x2) CPUs. From what I understand, hyperthreading only provides virtual processors...
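(One way to check this, assuming the hosts run Linux: /proc/cpuinfo counts logical processors, so the NCPU column in qhost includes the hyperthreads, and the scheduler decides a host is "overloaded" from the queue's load_thresholds, which defaults to np_load_avg=1.75. On oregano I would expect something like:)

$ grep -c ^processor /proc/cpuinfo     # logical CPUs, hyperthreads included
8
$ qconf -sq all.q | grep load_thresholds
load_thresholds       np_load_avg=1.75

Since np_load_avg divides the load average by that same logical CPU count, hyperthreaded "virtual" processors lower the normalized load, so if anything they make overload less likely to be reported, not more.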




More info -----------------

$ qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
PAPRIKA                 lx24-amd64     16 14.20   30.4G    2.5G    1.9G     0.0
oregano                 lx24-amd64      8  8.47   15.7G    3.7G    1.9G     0.0
wasabi01                lx24-amd64      8  4.39   14.6G  502.5M    2.0G     0.0


$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q at PAPRIKA                  BIP   14/14     14.18    lx24-amd64    
    167 0.55000 All_RLS_Me asseling     r     12/12/2007 15:40:07     1        
    180 0.55000 pprd-sw_sn asseling     r     12/12/2007 16:45:40     1 3
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 3
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 4
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 7
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 8
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 11
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 12
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 14
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 15
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 18
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 20
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 21
    182 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:38     1 3
----------------------------------------------------------------------------
all.q at oregano.statgen.local    BIP   8/8       8.45     lx24-amd64    
    166 0.55000 SIME_RLS_M asseling     r     12/12/2007 15:40:07     1        
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 5
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 13
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 16
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 19
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:09     1 22
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:09     1 23
    182 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:23     1 2
----------------------------------------------------------------------------
all.q at wasabi01.statgen.local   BIP   0/0       4.35     lx24-amd64    
----------------------------------------------------------------------------
testalpha.q at wasabi01.statgen.l BIP   4/4       4.35     lx24-amd64    
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 2
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 6
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:08     1 10
    181 0.55000 rls-pbat35 asseling     r     12/13/2007 08:31:23     1 24

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    184 0.56000 testSGE-W0 racinea      qw    12/13/2007 09:45:21     1 1
    182 0.55000 rls-pbat35 asseling     qw    12/13/2007 08:31:23     1 4-24:1
    183 0.55000 rls-pbat35 asseling     qw    12/13/2007 08:32:11     1 1-24:1



Thanks...



Alexandre

-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Wed 2007-12-12 18:06
To: users at gridengine.sunsource.net
Subject: Re: [GE users] All queues dropped because of overload or full
 
Am 12.12.2007 um 22:46 schrieb Alexandre Racine:

> Actually, the schedd_job_info line was not even there.

AFAIK it's always there. Did you mix up -mconf and -msconf? - Reuti


> So I'll activate it for next time.
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wed 2007-12-12 16:00
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] All queues dropped because of overload or full
>
> Am 12.12.2007 um 20:28 schrieb Alexandre Racine:
>
>> Mmmm, do you mean the scheduler tuning profile chosen when installing?
>> It is set to normal.
>
> No, I meant:
>
> schedd_job_info                   true
>
> in qconf -msconf and it was false the last time.
>
>>
>> Q1- So what you are actually saying is that everything is fine, and
>> that SGE is just saying that there is no more slots left?
>>
>> Q2- Also, looking in top, I had some programs that were in the
>> state "'D' = uninterruptible sleep". Would this be related?
>
> No.
>
> -- Reuti
>
>>
>>
>> Thanks.
>>
>>
>>
>>
>> Alexandre Racine
>> Projets spéciaux
>> 514-461-1300 poste 3304
>> alexandre.racine at mhicc.org
>>
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Wed 2007-12-12 13:40
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] All queues dropped because of overload or full
>>
>> Am 12.12.2007 um 19:05 schrieb Alexandre Racine:
>>
>>> Yes, all slots were used, but I did not have that message while
>>> running another test which had something like 100,000 jobs pending.
>>> Why do I get this error message this time?
>>> Here is the qstat -f...
>>
>> Maybe the scheduler info wasn't turned on the last time? - Reuti
>>
>>> $ qstat -f
>>> queuename                      qtype used/tot. load_avg arch          states
>>> ----------------------------------------------------------------------------
>>> all.q at PAPRIKA                  BIP   14/14     14.20    lx24-amd64
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 2
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 3
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 4
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 5
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 8
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 12
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 13
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 16
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 20
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 21
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:31     1 2
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:31     1 3
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:47     1 7
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:44:56     1 8
>>> ----------------------------------------------------------------------------
>>> all.q at oregano.statgen.local    BIP   8/8       8.95     lx24-amd64
>>>     131 0.55500 All_RLS_Me asseling     r     12/11/2007 08:52:54     1
>>>     132 0.55500 SIME_RLS_M asseling     r     12/11/2007 08:53:25     1
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 7
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 11
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 15
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 19
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 23
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:32     1 5
>>> ----------------------------------------------------------------------------
>>> all.q at wasabi01.statgen.local   BIP   8/8       8.12     lx24-amd64
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 6
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 10
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 14
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 18
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 22
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 24
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:31     1 4
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:32     1 6
>>>
>>> ############################################################################
>>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>> ############################################################################
>>>     140 0.55500 rls-pbat35 asseling     qw    12/11/2007 14:25:31     1 9-24:1
>>>     141 0.55500 rls-pbat35 asseling     qw    12/11/2007 14:31:11     1 1-17:8
>>>     142 0.55500 pprd-sw_sn asseling     qw    12/12/2007 09:36:21     1 3
>>>
>>>
>>>
>>>
>>>
>>> Alexandre Racine
>>> Projets spéciaux
>>> 514-461-1300 poste 3304
>>> alexandre.racine at mhicc.org
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Wed 2007-12-12 11:36
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] All queues dropped because of overload or
>>> full
>>>
>>> Hi,
>>>
>>> Am 12.12.2007 um 16:36 schrieb Alexandre Racine:
>>>
>>>> Looking with "top", the processors are working and there is a lot of
>>>> memory available... qhost seems alright... I don't see why I get this.
>>>> Maybe only the mem field in "qstat -j" sounds impossible. Or is this
>>>> the total amount of memory that has been used? (used and freed). The
>>>> only references that I found in the archives are from 2004... Thanks.
>>>
>>> What is `qstat -f` saying? And what does the queue configuration look like?
>>>
>>> -- Reuti
>>>
>>>
>>>>
>>>> More details:
>>>>
>>>> $ qhost
>>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>>> -------------------------------------------------------------------------------
>>>> global                  -               -     -       -       -       -       -
>>>> server1                 lx24-amd64     16 14.26   30.4G    2.6G    1.9G     0.0
>>>> server2                 lx24-amd64      8  8.39   15.7G    6.8G    1.9G     0.0
>>>> server3                 lx24-amd64      8  8.12   14.6G  577.7M    2.0G     0.0
>>>>
>>>> $ qstat -j 139
>>>> [...]
>>>> script_file:                script.sh
>>>> job-array tasks:            1-24:1
>>>> usage    2:                 cpu=20:06:52, mem=3467.62004 GBs, io=0.00000, vmem=60.770M, maxvmem=62.227M
>>>> usage    3:                 cpu=07:04:53, mem=1250.25426 GBs, io=0.00000, vmem=62.016M, maxvmem=63.266M
>>>> usage    4:                 cpu=07:02:46, mem=1247.70159 GBs, io=0.00000, vmem=62.156M, maxvmem=63.492M
>>>> usage    5:                 cpu=07:04:38, mem=1249.53348 GBs, io=0.00000, vmem=62.008M, maxvmem=63.316M
>>>> usage    6:                 cpu=16:03:12, mem=2834.15624 GBs, io=0.00000, vmem=62.113M, maxvmem=63.023M
>>>> usage    7:                 cpu=15:17:48, mem=2707.94392 GBs, io=0.00000, vmem=62.156M, maxvmem=62.578M
>>>> usage    8:                 cpu=07:02:46, mem=1247.34336 GBs, io=0.00000, vmem=62.148M, maxvmem=63.484M
>>>> usage   10:                 cpu=20:09:24, mem=3475.70453 GBs, io=0.00000, vmem=60.832M, maxvmem=62.266M
>>>> usage   11:                 cpu=14:32:42, mem=2568.06738 GBs, io=0.00000, vmem=62.016M, maxvmem=63.016M
>>>> usage   12:                 cpu=07:14:50, mem=1283.31948 GBs, io=0.00000, vmem=62.156M, maxvmem=63.504M
>>>> usage   13:                 cpu=07:15:51, mem=1282.46496 GBs, io=0.00000, vmem=62.012M, maxvmem=63.359M
>>>> usage   14:                 cpu=15:56:08, mem=2813.61103 GBs, io=0.00000, vmem=62.125M, maxvmem=63.430M
>>>> usage   15:                 cpu=14:38:33, mem=2592.12483 GBs, io=0.00000, vmem=62.156M, maxvmem=63.312M
>>>> usage   16:                 cpu=07:17:23, mem=1290.37961 GBs, io=0.00000, vmem=62.137M, maxvmem=63.574M
>>>> usage   18:                 cpu=20:09:23, mem=3482.93681 GBs, io=0.00000, vmem=60.832M, maxvmem=62.289M
>>>> usage   19:                 cpu=14:26:19, mem=2549.31135 GBs, io=0.00000, vmem=62.016M, maxvmem=63.324M
>>>> usage   20:                 cpu=07:22:26, mem=1305.89071 GBs, io=0.00000, vmem=62.160M, maxvmem=63.617M
>>>> usage   21:                 cpu=07:23:30, mem=1304.96487 GBs, io=0.00000, vmem=62.004M, maxvmem=63.328M
>>>> usage   22:                 cpu=15:08:08, mem=2672.33798 GBs, io=0.00000, vmem=62.117M, maxvmem=63.551M
>>>> usage   23:                 cpu=14:23:55, mem=2548.95621 GBs, io=0.00000, vmem=62.148M, maxvmem=63.609M
>>>> usage   24:                 cpu=15:04:51, mem=2669.49002 GBs, io=0.00000, vmem=62.246M, maxvmem=63.523M
>>>> scheduling info:            queue instance "all.q at oregano.statgen.local" dropped because it is full
>>>>                             queue instance "all.q at wasabi01.statgen.local" dropped because it is full
>>>>                             queue instance "all.q at PAPRIKA" dropped because it is full
>>>>                             All queues dropped because of overload or full
>>>>
>>>>
>>>>
>>>>
>>>> Alexandre Racine
>>>> Projets spéciaux
>>>> 514-461-1300 poste 3304
>>>> alexandre.racine at mhicc.org
>>>>
>>>>




