[GE users] All queues dropped because of overload or full

Reuti reuti at staff.uni-marburg.de
Wed Dec 12 23:06:58 GMT 2007



On 12.12.2007, at 22:46, Alexandre Racine wrote:

> Actually, the schedd_job_info line was not even there.

AFAIK it's always there. Did you mix up -mconf and -msconf? - Reuti
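
For reference, schedd_job_info is part of the scheduler configuration, not the cluster configuration, so a rough sketch of the relevant commands (assuming a standard SGE 6.x setup) is:

    # show the current scheduler configuration read-only
    qconf -ssconf | grep schedd_job_info

    # open the scheduler configuration in an editor and set
    #   schedd_job_info    true
    qconf -msconf

    # qconf -mconf, by contrast, edits the global cluster
    # configuration, which does not contain schedd_job_info
    qconf -mconf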


> So I'll activate it for next time.
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wed 2007-12-12 16:00
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] All queues dropped because of overload or full
>
> On 12.12.2007, at 20:28, Alexandre Racine wrote:
>
>> Mmmm, you mean the scheduler tuning profile chosen during installation?
>> It is set to normal.
>
> No, I meant:
>
> schedd_job_info                   true
>
> in qconf -msconf and it was false the last time.
>
>>
>> Q1- So what you are actually saying is that everything is fine, and
>> that SGE is just saying that there are no more slots left?
>>
>> Q2- Also, looking in top, I had some programs that were in the
>> state "'D' = uninterruptible sleep". Would this be related?
>
> No.
>
> -- Reuti
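
Regarding Q1: one quick cross-check that it really is just a matter of every slot being occupied is the cluster queue summary (queue name taken from the listings further down; adjust if yours differs):

    # per-cluster-queue overview: used vs. total slots and load
    qstat -g c

    # show how many slots the queue is configured with
    qconf -sq all.q | grep slots
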
>
>>
>>
>> Thanks.
>>
>>
>>
>>
>> Alexandre Racine
>> Projets spéciaux
>> 514-461-1300 poste 3304
>> alexandre.racine at mhicc.org
>>
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Wed 2007-12-12 13:40
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] All queues dropped because of overload or full
>>
>> On 12.12.2007, at 19:05, Alexandre Racine wrote:
>>
>>> Yes, all slots were used, but I did not have that message while
>>> doing another test which had about 100,000 jobs pending. Why do I
>>> have that error message this time?
>>> Here is the qstat -f...
>>
>> Maybe the scheduler info wasn't turned on the last time? - Reuti
>>
>>> $ qstat -f
>>> queuename                      qtype used/tot. load_avg arch          states
>>> ----------------------------------------------------------------------------
>>> all.q at PAPRIKA                  BIP   14/14     14.20    lx24-amd64
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 2
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 3
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 4
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 5
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 8
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 12
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 13
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 16
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 20
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 21
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:31     1 2
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:31     1 3
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:47     1 7
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:44:56     1 8
>>> ----------------------------------------------------------------------------
>>> all.q at oregano.statgen.local    BIP   8/8       8.95     lx24-amd64
>>>     131 0.55500 All_RLS_Me asseling     r     12/11/2007 08:52:54     1
>>>     132 0.55500 SIME_RLS_M asseling     r     12/11/2007 08:53:25     1
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 7
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 11
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 15
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 19
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 23
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:32     1 5
>>> ----------------------------------------------------------------------------
>>> all.q at wasabi01.statgen.local   BIP   8/8       8.12     lx24-amd64
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 6
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 10
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 14
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 18
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 22
>>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 24
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:31     1 4
>>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:32     1 6
>>>
>>> ############################################################################
>>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>> ############################################################################
>>>     140 0.55500 rls-pbat35 asseling     qw    12/11/2007 14:25:31     1 9-24:1
>>>     141 0.55500 rls-pbat35 asseling     qw    12/11/2007 14:31:11     1 1-17:8
>>>     142 0.55500 pprd-sw_sn asseling     qw    12/12/2007 09:36:21     1 3
>>>
>>>
>>>
>>>
>>>
>>> Alexandre Racine
>>> Projets spéciaux
>>> 514-461-1300 poste 3304
>>> alexandre.racine at mhicc.org
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Wed 2007-12-12 11:36
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] All queues dropped because of overload or full
>>>
>>> Hi,
>>>
>>> On 12.12.2007, at 16:36, Alexandre Racine wrote:
>>>
>>>> Looking with "top", the processors are working and there is a lot of
>>>> memory available... qhost seems alright... I don't see why I get this.
>>>> There is maybe only the mem field in "qstat -j" that sounds
>>>> impossible. Or is this the total amount of memory that has been
>>>> used (used and freed)? The only references that I found in the
>>>> archives are from 2004... Thanks.
>>>
>>> What is `qstat -f` saying? What does the queue configuration look like?
>>>
>>> -- Reuti
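
As a minimal sketch of the second question, assuming the cluster queue is the default all.q seen in the listings below:

    # dump the full queue configuration (slots, load_thresholds, ...)
    qconf -sq all.q

    # list the execution hosts known to the qmaster
    qconf -sel
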
>>>
>>>
>>>>
>>>> More details:
>>>>
>>>> $ qhost
>>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>>> -------------------------------------------------------------------------------
>>>> global                  -               -     -       -       -       -       -
>>>> server1                 lx24-amd64     16 14.26   30.4G    2.6G    1.9G     0.0
>>>> server2                 lx24-amd64      8  8.39   15.7G    6.8G    1.9G     0.0
>>>> server3                 lx24-amd64      8  8.12   14.6G  577.7M    2.0G     0.0
>>>>
>>>> $ qstat -j 139
>>>> [...]
>>>> script_file:                script.sh
>>>> job-array tasks:            1-24:1
>>>> usage    2:                 cpu=20:06:52, mem=3467.62004 GBs, io=0.00000, vmem=60.770M, maxvmem=62.227M
>>>> usage    3:                 cpu=07:04:53, mem=1250.25426 GBs, io=0.00000, vmem=62.016M, maxvmem=63.266M
>>>> usage    4:                 cpu=07:02:46, mem=1247.70159 GBs, io=0.00000, vmem=62.156M, maxvmem=63.492M
>>>> usage    5:                 cpu=07:04:38, mem=1249.53348 GBs, io=0.00000, vmem=62.008M, maxvmem=63.316M
>>>> usage    6:                 cpu=16:03:12, mem=2834.15624 GBs, io=0.00000, vmem=62.113M, maxvmem=63.023M
>>>> usage    7:                 cpu=15:17:48, mem=2707.94392 GBs, io=0.00000, vmem=62.156M, maxvmem=62.578M
>>>> usage    8:                 cpu=07:02:46, mem=1247.34336 GBs, io=0.00000, vmem=62.148M, maxvmem=63.484M
>>>> usage   10:                 cpu=20:09:24, mem=3475.70453 GBs, io=0.00000, vmem=60.832M, maxvmem=62.266M
>>>> usage   11:                 cpu=14:32:42, mem=2568.06738 GBs, io=0.00000, vmem=62.016M, maxvmem=63.016M
>>>> usage   12:                 cpu=07:14:50, mem=1283.31948 GBs, io=0.00000, vmem=62.156M, maxvmem=63.504M
>>>> usage   13:                 cpu=07:15:51, mem=1282.46496 GBs, io=0.00000, vmem=62.012M, maxvmem=63.359M
>>>> usage   14:                 cpu=15:56:08, mem=2813.61103 GBs, io=0.00000, vmem=62.125M, maxvmem=63.430M
>>>> usage   15:                 cpu=14:38:33, mem=2592.12483 GBs, io=0.00000, vmem=62.156M, maxvmem=63.312M
>>>> usage   16:                 cpu=07:17:23, mem=1290.37961 GBs, io=0.00000, vmem=62.137M, maxvmem=63.574M
>>>> usage   18:                 cpu=20:09:23, mem=3482.93681 GBs, io=0.00000, vmem=60.832M, maxvmem=62.289M
>>>> usage   19:                 cpu=14:26:19, mem=2549.31135 GBs, io=0.00000, vmem=62.016M, maxvmem=63.324M
>>>> usage   20:                 cpu=07:22:26, mem=1305.89071 GBs, io=0.00000, vmem=62.160M, maxvmem=63.617M
>>>> usage   21:                 cpu=07:23:30, mem=1304.96487 GBs, io=0.00000, vmem=62.004M, maxvmem=63.328M
>>>> usage   22:                 cpu=15:08:08, mem=2672.33798 GBs, io=0.00000, vmem=62.117M, maxvmem=63.551M
>>>> usage   23:                 cpu=14:23:55, mem=2548.95621 GBs, io=0.00000, vmem=62.148M, maxvmem=63.609M
>>>> usage   24:                 cpu=15:04:51, mem=2669.49002 GBs, io=0.00000, vmem=62.246M, maxvmem=63.523M
>>>> scheduling info:            queue instance "all.q at oregano.statgen.local" dropped because it is full
>>>>                             queue instance "all.q at wasabi01.statgen.local" dropped because it is full
>>>>                             queue instance "all.q at PAPRIKA" dropped because it is full
>>>>                             All queues dropped because of overload or full
>>>>
>>>>
>>>>
>>>>
>>>> Alexandre Racine
>>>> Projets spéciaux
>>>> 514-461-1300 poste 3304
>>>> alexandre.racine at mhicc.org
>>>>
>>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



