[GE users] All queues dropped because of overload or full

Alexandre Racine Alexandre.Racine at mhicc.org
Wed Dec 12 21:46:17 GMT 2007



Actually, the schedd_job_info line was not even there.

So I'll activate it for next time.


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Wed 2007-12-12 16:00
To: users at gridengine.sunsource.net
Subject: Re: [GE users] All queues dropped because of overload or full
 
On 12.12.2007 at 20:28, Alexandre Racine wrote:

> Mmmm, you mean the scheduler tuning profile when installing? It is  
> set to normal.

No, I meant:

schedd_job_info                   true

in qconf -msconf and it was false the last time.
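
A minimal sketch for checking the current value without opening the editor. It assumes that `qconf -ssconf` (show scheduler configuration) prints one `name value` pair per line; the helper name `schedd_job_info_value` is made up for illustration.

```shell
# Hypothetical helper: reads scheduler configuration text on stdin and
# prints the value of the schedd_job_info parameter, if it is present.
schedd_job_info_value() {
  awk '$1 == "schedd_job_info" { print $2 }'
}

# Intended usage on a host with Grid Engine installed (not executed here):
#   qconf -ssconf | schedd_job_info_value
```

If nothing is printed, the parameter is missing from the output and would have to be added with `qconf -msconf`.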

>
> Q1- So what you are actually saying is that everything is fine, and  
> that SGE is just saying that there are no more slots left?
>
> Q2- Also, looking in top, I had some programs that were in the  
> state "'D' = uninterruptible sleep". Would this be related?

No.

-- Reuti

>
>
> Thanks.
>
>
>
>
> Alexandre Racine
> Projets spéciaux
> 514-461-1300 poste 3304
> alexandre.racine at mhicc.org
>
>
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wed 2007-12-12 13:40
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] All queues dropped because of overload or full
>
> On 12.12.2007 at 19:05, Alexandre Racine wrote:
>
>> Yes, all slots were used, but I did not get that message while
>> running other tests which had about 100,000 jobs pending. Why do I
>> get that error message this time?
>> Here is the qstat -f....
>
> Maybe the scheduler info wasn't turned on the last time? - Reuti
>
>> $ qstat -f
>> queuename                      qtype used/tot. load_avg arch          states
>> ----------------------------------------------------------------------------
>> all.q@PAPRIKA                  BIP   14/14     14.20    lx24-amd64
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 2
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 3
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 4
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 5
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 8
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 12
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 13
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 16
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 20
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 21
>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:31     1 2
>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:31     1 3
>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:47     1 7
>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:44:56     1 8
>> ----------------------------------------------------------------------------
>> all.q@oregano.statgen.local    BIP   8/8       8.95     lx24-amd64
>>     131 0.55500 All_RLS_Me asseling     r     12/11/2007 08:52:54     1
>>     132 0.55500 SIME_RLS_M asseling     r     12/11/2007 08:53:25     1
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 7
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 11
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 15
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 19
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 23
>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:32     1 5
>> ----------------------------------------------------------------------------
>> all.q@wasabi01.statgen.local   BIP   8/8       8.12     lx24-amd64
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 6
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 10
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 14
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:43     1 18
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 22
>>     139 0.55500 rls-pbat35 asseling     r     12/11/2007 14:23:44     1 24
>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:31     1 4
>>     140 0.55500 rls-pbat35 asseling     r     12/11/2007 14:25:32     1 6
>>
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> ############################################################################
>>     140 0.55500 rls-pbat35 asseling     qw    12/11/2007 14:25:31     1 9-24:1
>>     141 0.55500 rls-pbat35 asseling     qw    12/11/2007 14:31:11     1 1-17:8
>>     142 0.55500 pprd-sw_sn asseling     qw    12/12/2007 09:36:21     1 3
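
To confirm from such output that every queue instance really is full, the slot column can be checked mechanically. This is only a sketch: it assumes queue instance lines of the form `queue@host qtype used/tot. ...` as printed by `qstat -f` (it takes the last two numbers of the slot field, in case a version prints an extra leading count), and the function name `free_slots` is hypothetical.

```shell
# Hypothetical helper: reads `qstat -f` output on stdin and prints each
# queue instance together with its number of free slots (total - used).
free_slots() {
  awk '$1 ~ /@/ && $3 ~ /^[0-9]+(\/[0-9]+)+$/ {
    n = split($3, s, "/")        # last two parts are used and total
    print $1, s[n] - s[n - 1]
  }'
}

# Intended usage on a submit host (not executed here):
#   qstat -f | free_slots
```

A free-slot count of 0 for every instance is consistent with the "dropped because it is full" messages: the pending tasks are simply waiting for slots.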
>>
>>
>>
>>
>>
>> Alexandre Racine
>> Projets spéciaux
>> 514-461-1300 poste 3304
>> alexandre.racine at mhicc.org
>>
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Wed 2007-12-12 11:36
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] All queues dropped because of overload or  
>> full
>>
>> Hi,
>>
>> On 12.12.2007 at 16:36, Alexandre Racine wrote:
>>
>>> Looking with "top", the processors are working, there is a lot of memory
>>> available... qhost seems alright... I don't see why I get this.
>>> There is only maybe the mem field in the "qstat -j" output that looks
>>> impossible. Or is this the total amount of memory that has been
>>> used (used and freed)? The only references that I found in the
>>> archives are from 2004... Thanks.
>>
>> What does `qstat -f` say? What does the queue configuration look like?
>>
>> -- Reuti
>>
>>
>>>
>>> More details:
>>>
>>> $ qhost
>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global                  -               -     -       -       -       -       -
>>> server1                 lx24-amd64     16 14.26   30.4G    2.6G    1.9G     0.0
>>> server2                 lx24-amd64      8  8.39   15.7G    6.8G    1.9G     0.0
>>> server3                 lx24-amd64      8  8.12   14.6G  577.7M    2.0G     0.0
>>>
>>> $ qstat -j 139
>>> [...]
>>> script_file:                script.sh
>>> job-array tasks:            1-24:1
>>> usage    2:                 cpu=20:06:52, mem=3467.62004 GBs, io=0.00000, vmem=60.770M, maxvmem=62.227M
>>> usage    3:                 cpu=07:04:53, mem=1250.25426 GBs, io=0.00000, vmem=62.016M, maxvmem=63.266M
>>> usage    4:                 cpu=07:02:46, mem=1247.70159 GBs, io=0.00000, vmem=62.156M, maxvmem=63.492M
>>> usage    5:                 cpu=07:04:38, mem=1249.53348 GBs, io=0.00000, vmem=62.008M, maxvmem=63.316M
>>> usage    6:                 cpu=16:03:12, mem=2834.15624 GBs, io=0.00000, vmem=62.113M, maxvmem=63.023M
>>> usage    7:                 cpu=15:17:48, mem=2707.94392 GBs, io=0.00000, vmem=62.156M, maxvmem=62.578M
>>> usage    8:                 cpu=07:02:46, mem=1247.34336 GBs, io=0.00000, vmem=62.148M, maxvmem=63.484M
>>> usage   10:                 cpu=20:09:24, mem=3475.70453 GBs, io=0.00000, vmem=60.832M, maxvmem=62.266M
>>> usage   11:                 cpu=14:32:42, mem=2568.06738 GBs, io=0.00000, vmem=62.016M, maxvmem=63.016M
>>> usage   12:                 cpu=07:14:50, mem=1283.31948 GBs, io=0.00000, vmem=62.156M, maxvmem=63.504M
>>> usage   13:                 cpu=07:15:51, mem=1282.46496 GBs, io=0.00000, vmem=62.012M, maxvmem=63.359M
>>> usage   14:                 cpu=15:56:08, mem=2813.61103 GBs, io=0.00000, vmem=62.125M, maxvmem=63.430M
>>> usage   15:                 cpu=14:38:33, mem=2592.12483 GBs, io=0.00000, vmem=62.156M, maxvmem=63.312M
>>> usage   16:                 cpu=07:17:23, mem=1290.37961 GBs, io=0.00000, vmem=62.137M, maxvmem=63.574M
>>> usage   18:                 cpu=20:09:23, mem=3482.93681 GBs, io=0.00000, vmem=60.832M, maxvmem=62.289M
>>> usage   19:                 cpu=14:26:19, mem=2549.31135 GBs, io=0.00000, vmem=62.016M, maxvmem=63.324M
>>> usage   20:                 cpu=07:22:26, mem=1305.89071 GBs, io=0.00000, vmem=62.160M, maxvmem=63.617M
>>> usage   21:                 cpu=07:23:30, mem=1304.96487 GBs, io=0.00000, vmem=62.004M, maxvmem=63.328M
>>> usage   22:                 cpu=15:08:08, mem=2672.33798 GBs, io=0.00000, vmem=62.117M, maxvmem=63.551M
>>> usage   23:                 cpu=14:23:55, mem=2548.95621 GBs, io=0.00000, vmem=62.148M, maxvmem=63.609M
>>> usage   24:                 cpu=15:04:51, mem=2669.49002 GBs, io=0.00000, vmem=62.246M, maxvmem=63.523M
>>> scheduling info:            queue instance "all.q@oregano.statgen.local" dropped because it is full
>>>                             queue instance "all.q@wasabi01.statgen.local" dropped because it is full
>>>                             queue instance "all.q@PAPRIKA" dropped because it is full
>>>                             All queues dropped because of overload or full
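
On the `mem` values in the usage lines: as far as I understand the accounting(5) man page, `mem` is the integral memory usage in GB multiplied by CPU seconds (hence the unit "GBs"), not an amount of memory held at one time. Under that assumption, dividing `mem` by the CPU time gives a rough average footprint. The sketch below does that arithmetic; the function name `avg_mem_mb` is made up, it assumes CPU time in plain HH:MM:SS form, and it treats 1 GB as 1024 MB.

```shell
# Hypothetical helper: average memory in MB from a usage line.
#   $1 = cpu time as HH:MM:SS
#   $2 = mem value in GB * seconds (the number before "GBs")
avg_mem_mb() {
  awk -v cpu="$1" -v mem="$2" 'BEGIN {
    split(cpu, t, ":")
    secs = t[1] * 3600 + t[2] * 60 + t[3]   # CPU time in seconds
    printf "%.1f\n", mem / secs * 1024      # GB*s / s -> GB -> MB
  }'
}

# Task 2 of job 139: cpu=20:06:52, mem=3467.62004 GBs
avg_mem_mb 20:06:52 3467.62004
```

For task 2 this comes out at about 49 MB, which is in the same range as the reported vmem of roughly 60 MB, so the large `mem` numbers do not by themselves indicate a problem.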
>>>
>>>
>>>
>>>
>>> Alexandre Racine
>>> Projets spéciaux
>>> 514-461-1300 poste 3304
>>> alexandre.racine at mhicc.org
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net