[GE users] Scheduler Configuration
reuti
reuti at staff.uni-marburg.de
Tue Dec 23 17:11:09 GMT 2008
On 23.12.2008 at 18:08, Robert Healey wrote:
> It tosses it back into the negatives. Is it possible to restart
> the qmaster without aborting what's currently running? I'm not as
> familiar with SGE's quirks as I'd like to be.
Yes, it's safe to do so.
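
For example, a minimal sketch assuming the default cell layout under
$SGE_ROOT/default (running jobs are controlled by the execd on each
node, so they survive a qmaster restart):

   # ask the running qmaster to shut down cleanly; execds keep going
   qconf -km
   # start it again from the cell's startup script
   $SGE_ROOT/default/common/sgemaster start
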
-- Reuti
> Bob
>
> reuti wrote:
>> On 23.12.2008 at 17:54, Robert Healey wrote:
>>
>>> I defined all the host configs before I opened the cluster to
>>> users. The reason for the low load average is that the parallel
>>> users are not submitting anything until I solve the problem, so
>>> I've been submitting mpirun /bin/sleep 1200 to the queue. When I
>>> remove my parallel sleep job from the queue, the slot count goes
>>> back to 2 with 6 serial jobs running on the node. My job was
>>> submitted using $fill_up for the PE allocation rule.
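>>>
>>> For reference, a sketch of the test submission (the PE name "mpi"
>>> is just a placeholder for whatever PE carries allocation_rule
>>> $fill_up here, and the slot count of 8 is an assumption):
>>>
>>>    qsub -pe mpi 8 -b y mpirun /bin/sleep 1200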
>>
>> Well, 2 left with 6 running seems fine. If you submit a parallel
>> job again to this particular node, does it pull the slot count
>> below zero again? This shouldn't happen, of course. Sometimes a
>> stop/start of the qmaster helps in such cases.
>>
>> Independent of the allocation_rule in use, it should never drop
>> below zero.
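>>
>> A quick way to watch the host-level slots consumable is to limit
>> qhost's output to that one resource (a sketch; the host name is
>> taken from your output below):
>>
>>    # hc:slots should stay >= 0; -h restricts the output to one host
>>    qhost -F slots -h compute-8-24.local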
>>
>> -- Reuti
>>
>>> reuti wrote:
>>>> On 23.12.2008 at 17:16, Robert Healey wrote:
>>>>
>>>>> reuti wrote:
>>>>>> On 23.12.2008 at 09:44, Robert Healey wrote:
>>>>>>
>>>>> <snip>
>>>>>> Is "qhost -F" showing negative values for the slots entry?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>> <snip>
>>>>>
>>>>> It's currently showing -6 slots remaining.
>>>>>
>>>>>
>>>>> compute-8-24.local     lx24-amd64   8  6.00  7.8G  377.8M  8.0G  0.0
>>>>> hl:arch=lx24-amd64
>>>>> hl:num_proc=8.000000
>>>>> hl:mem_total=7.799G
>>>>> hl:swap_total=7.997G
>>>>> hl:virtual_total=15.797G
>>>>> hl:load_avg=6.000000
>>>>> hl:load_short=6.000000
>>>>> hl:load_medium=6.000000
>>>>> hl:load_long=5.930000
>>>>> hl:mem_free=7.430G
>>>>> hl:swap_free=7.997G
>>>>> hl:virtual_free=15.428G
>>>>> hl:mem_used=377.828M
>>>>> hl:swap_used=0.000
>>>>> hl:virtual_used=377.828M
>>>>> hl:cpu=75.200000
>>>>> hl:np_load_avg=0.750000
>>>>> hl:np_load_short=0.750000
>>>>> hl:np_load_medium=0.750000
>>>>> hl:np_load_long=0.741250
>>>>> hc:slots=-6.000000
>>>> Then the internal accounting got out of sync. After all
>>>> processes have left the node it should normalize. Did you define
>>>> the exec host slots value while jobs were already in the system?
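>>>>
>>>> You can check what's configured there (a sketch; adjust the host
>>>> name as needed):
>>>>
>>>>    # show the exec host config; complex_values should carry the
>>>>    # slots consumable, e.g. slots=8
>>>>    qconf -se compute-8-24.local | grep complex_values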
>>>>
>>>> One strange thing I notice is the load: with 8+6=14 processes
>>>> running in total, the load should be much higher.
>>>>
>>>> -- Reuti
>>>>
>>> --
>>> Bob Healey
>>> Systems Administrator
>>> Physics Department, RPI
>>> healer at rpi.edu
>>>
>>
>
> --
> Bob Healey
> Systems Administrator
> Physics Department, RPI
> healer at rpi.edu
>