[GE users] Scheduler Configuration

Robert Healey healer at rpi.edu
Tue Dec 23 17:08:12 GMT 2008


It tosses it back into the negatives.  Is it possible to restart qmaster
without aborting what's currently running?  I'm not as familiar with
SGE's quirks as I'd like to be.
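
Would something along these lines be safe, assuming the standard
startup script location under $SGE_ROOT/$SGE_CELL/common, and that
running jobs stay under the control of the execds on the nodes?

   # stop and restart only the qmaster/scheduler; the execds and
   # their jobs keep running on the compute nodes
   $SGE_ROOT/$SGE_CELL/common/sgemaster stop
   $SGE_ROOT/$SGE_CELL/common/sgemaster start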

Bob

reuti wrote:
> On 23.12.2008, at 17:54, Robert Healey wrote:
> 
>> I defined all host configs before I opened the cluster to users.  The
>> reason for the low load average is that the parallel users are not
>> submitting anything until I solve the problem, so I've been submitting
>> mpirun /bin/sleep 1200 to the queue.  Removing my parallel sleep from
>> the queue, slots goes back to 2 with 6 serial jobs running on the
>> node.  My job was submitted using $fill_up for PE allocation.
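>>
>> For reference, the test submission looks roughly like this ("mpi" is
>> just a placeholder for whatever the PE is actually called):
>>
>>    # hypothetical PE name; 8 slots requested, run as a binary job
>>    qsub -pe mpi 8 -b y mpirun /bin/sleep 1200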
> 
> Well, 2 left with 6 running seems fine. If you submit a parallel job
> again to this particular node, does it pull the slot count below zero
> again? This shouldn't happen, of course. Sometimes a stop/start of
> the qmaster helps in such cases.
> 
> Independent from the used allocation_rule, it should never drop below  
> zero.
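>
> You can watch the value for just that node with something like the
> following (hostname taken from your qhost output):
>
>    # show only the slots consumable for the one exec host
>    qhost -h compute-8-24.local -F slots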
> 
> -- Reuti
> 
>> reuti wrote:
>>> On 23.12.2008, at 17:16, Robert Healey wrote:
>>>
>>>> reuti wrote:
>>>>> On 23.12.2008, at 09:44, Robert Healey wrote:
>>>>>
>>>> <snip>
>>>>> Is "qhost -F" showing negative values for the slots entry?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>> <snip>
>>>>
>>>> It's currently showing -6 slots remaining.
>>>>
>>>>
>>>> compute-8-24.local      lx24-amd64      8  6.00    7.8G  377.8M    8.0G     0.0
>>>>     hl:arch=lx24-amd64
>>>>     hl:num_proc=8.000000
>>>>     hl:mem_total=7.799G
>>>>     hl:swap_total=7.997G
>>>>     hl:virtual_total=15.797G
>>>>     hl:load_avg=6.000000
>>>>     hl:load_short=6.000000
>>>>     hl:load_medium=6.000000
>>>>     hl:load_long=5.930000
>>>>     hl:mem_free=7.430G
>>>>     hl:swap_free=7.997G
>>>>     hl:virtual_free=15.428G
>>>>     hl:mem_used=377.828M
>>>>     hl:swap_used=0.000
>>>>     hl:virtual_used=377.828M
>>>>     hl:cpu=75.200000
>>>>     hl:np_load_avg=0.750000
>>>>     hl:np_load_short=0.750000
>>>>     hl:np_load_medium=0.750000
>>>>     hl:np_load_long=0.741250
>>>>     hc:slots=-6.000000
>>> Then the internal accounting got out of sync. After all processes
>>> have left the node it should normalize.  Did you define the exechost
>>> slots value while jobs were already in the system?
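>>>
>>> If so, once the node is drained you could try re-setting the
>>> consumable, along the lines of (slots=8 assumed from your num_proc):
>>>
>>>    # redefine the slots complex value on the exec host
>>>    qconf -mattr exechost complex_values slots=8 compute-8-24.local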
>>>
>>> One strange thing I notice is the load: with 8+6=14 processes
>>> running in total, the load should be much higher.
>>>
>>> -- Reuti
>>>
>> -- 
>> Bob Healey
>> Systems Administrator
>> Physics Department, RPI
>> healer at rpi.edu
>>
> 

-- 
Bob Healey
Systems Administrator
Physics Department, RPI
healer at rpi.edu
