[GE users] Scheduler Configuration

reuti reuti at staff.uni-marburg.de
Tue Dec 23 17:11:09 GMT 2008


On 23.12.2008 at 18:08, Robert Healey wrote:

> It tosses it back into the negatives.  Is it possible to restart
> qmaster without aborting what's currently running?  I'm not as
> familiar with SGE's quirks as I'd like to be.

Yes, it's safe to do so.

-- Reuti
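For reference, a rough sketch of how such a restart is usually done. The running jobs stay under control of the execution daemons on the compute nodes, so they are not touched; the exact startup script location depends on the installation (the cell name "default" is assumed here):

    # on the qmaster host (cell name "default" assumed)
    qconf -km                                   # shut the running qmaster down cleanly
    $SGE_ROOT/default/common/sgemaster start    # start it again via the installed startup script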


> Bob
>
> reuti wrote:
>> On 23.12.2008 at 17:54, Robert Healey wrote:
>>
>>> I defined all host configs before I opened the cluster to users.
>>> The reason for the low load average is that the parallel users are
>>> not submitting anything until I solve the problem, so I've been
>>> submitting mpirun /bin/sleep 1200 to the queue.  Removing my
>>> parallel sleep from the queue, the slot count goes back to 2 with
>>> 6 serial jobs running on the node.  My job was submitted using
>>> $fill_up for PE allocation.
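A rough sketch of what such a test submission might look like; the PE name "mpi" is only a placeholder for whatever parallel environment is attached to the queue:

    # inspect the parallel environment; allocation_rule should read $fill_up
    qconf -sp mpi

    # sleep.sh - minimal parallel test job
    #$ -pe mpi 8
    #$ -cwd
    mpirun -np $NSLOTS /bin/sleep 1200

    # submit it
    qsub sleep.sh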
>>
>> Well, 2 left with 6 running seems fine. If you submit a parallel job
>> again to this particular node, does it pull the slot count below zero
>> again? This shouldn't happen, of course. Sometimes a stop/start of
>> the qmaster helps in such cases.
>>
>> Independent from the used allocation_rule, it should never drop below
>> zero.
>>
>> -- Reuti
>>
>>> reuti wrote:
>>>> On 23.12.2008 at 17:16, Robert Healey wrote:
>>>>
>>>>> reuti wrote:
>>>>>> On 23.12.2008 at 09:44, Robert Healey wrote:
>>>>>>
>>>>> <snip>
>>>>>> Is "qhost -F" showing negative values for the slots entry?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>> <snip>
>>>>>
>>>>> It's currently showing -6 slots remaining.
>>>>>
>>>>>
>>>>> compute-8-24.local      lx24-amd64      8  6.00    7.8G  377.8M    8.0G     0.0
>>>>>     hl:arch=lx24-amd64
>>>>>     hl:num_proc=8.000000
>>>>>     hl:mem_total=7.799G
>>>>>     hl:swap_total=7.997G
>>>>>     hl:virtual_total=15.797G
>>>>>     hl:load_avg=6.000000
>>>>>     hl:load_short=6.000000
>>>>>     hl:load_medium=6.000000
>>>>>     hl:load_long=5.930000
>>>>>     hl:mem_free=7.430G
>>>>>     hl:swap_free=7.997G
>>>>>     hl:virtual_free=15.428G
>>>>>     hl:mem_used=377.828M
>>>>>     hl:swap_used=0.000
>>>>>     hl:virtual_used=377.828M
>>>>>     hl:cpu=75.200000
>>>>>     hl:np_load_avg=0.750000
>>>>>     hl:np_load_short=0.750000
>>>>>     hl:np_load_medium=0.750000
>>>>>     hl:np_load_long=0.741250
>>>>>     hc:slots=-6.000000
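To look at just the slots consumable instead of the full -F output, something along these lines can be used (hostname taken from the listing above):

    # host-level value of the slots consumable
    qhost -h compute-8-24.local -F slots

    # the same consumable per queue instance on that host
    qstat -F slots -q '*@compute-8-24.local'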
>>>> Then the internal accounting got out of sync. After all processes
>>>> left the node it should normalize.  Did you define the exechost
>>>> slots value while jobs were already in the system?
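For reference, a sketch of how the host-level slots value is usually inspected and set; the hostname is taken from the output above, and the value 8 simply matches num_proc:

    # show the exec host configuration, including complex_values
    qconf -se compute-8-24.local

    # set slots in complex_values non-interactively
    # (the same can be done in an editor with: qconf -me compute-8-24.local)
    qconf -mattr exechost complex_values slots=8 compute-8-24.local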
>>>>
>>>> One strange thing I notice is the load: with 8+6=14 processes
>>>> running in total, the load should be much higher.
>>>>
>>>> -- Reuti
>>>>
>>> -- 
>>> Bob Healey
>>> Systems Administrator
>>> Physics Department, RPI
>>> healer at rpi.edu
>>>
>>
>
> -- 
> Bob Healey
> Systems Administrator
> Physics Department, RPI
> healer at rpi.edu
>
