[GE users] scheduling weirdness in 6.0u3

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Fri Apr 8 11:36:26 BST 2005


Well, I do not know why the schedule_interval should change anything. You
configured on-demand scheduling anyway, and the scheduler is only triggered
when it is idle.

How long do your jobs run? Maybe you have a problem with the internal load
adjustments. From your configuration I read:

job_load_adjustments              np_load_avg=1.0,mem_free=900M
load_adjustment_decay_time        0:7:30

Are your jobs running longer than 7:30 min? You have a load interval
of 0:40 min. If you need the job_load_adjustments, it may help to set
load_adjustment_decay_time to a smaller value, such as 0:2:0 or so.

You configured your scheduler to use not only the reported load values
but the internally predicted ones as well.
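
To illustrate why the decay time matters, here is a minimal Python sketch of the
predicted load. It assumes the linear decay described in sched_conf(5): the
adjustment counts fully at job start and drops to zero once
load_adjustment_decay_time has passed. The function names are made up for this
example and are not part of Grid Engine, and the real scheduler may differ in
detail.

```python
def parse_hms(text):
    """Convert an h:m:s string such as '0:7:30' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def remaining_adjustment(adjustment, elapsed_s, decay_s):
    """Artificial load still charged for one job.

    Assumption: linear decay from the full adjustment at job start
    to zero after decay_s seconds.
    """
    elapsed_s = max(elapsed_s, 0)
    if decay_s <= 0 or elapsed_s >= decay_s:
        return 0.0
    return adjustment * (1.0 - elapsed_s / decay_s)


if __name__ == "__main__":
    # np_load_avg=1.0 is the adjustment from the posted configuration.
    for decay in ("0:7:30", "0:2:0"):
        decay_s = parse_hms(decay)
        for elapsed in (0, 40, 120, 300):
            extra = remaining_adjustment(1.0, elapsed, decay_s)
            print(f"decay={decay} elapsed={elapsed:>3}s "
                  f"extra np_load_avg={extra:.2f}")
```

Under this assumption, a job started 40 seconds ago still adds about 0.91 to
np_load_avg with a decay time of 0:7:30, but only about 0.67 with 0:2:0, and
nothing at all after two minutes.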

Cheers,
Stephan





Sean Dilda wrote:

>Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>
>>Sean Dilda wrote:
>>
>>>The strange thing is that I'm getting different results on my production 
>>>cluster and test cluster.  The test cluster was seeing problems only 
>>>with parallel jobs, and the above patch seems to have fixed it.  The 
>>>production cluster is having issues with the parallel and non-parallel 
>>>jobs, and that patch didn't seem to change anything.  I'll do some more 
>>>testing and let you know if I can figure out why it's going haywire.
>>>
>>Well, could it be that you use a slightly different queue and scheduler
>>configuration? Could you post your configuration? The list might be able
>>to help you. Or just compare the configurations between the two grids.
>>
>
>After continued failure to reproduce the error on my test cluster, I 
>started to wonder if cluster size was an issue.  The test cluster has 8 
>compute nodes.  The production cluster has over 300 execute hosts 
>registered.
>
>On a hunch, I changed the schedule_interval on the production cluster 
>from 0:0:15 to 0:1:15.  I'm hesitant to call things fixed, but I haven't 
>seen the error on the production cluster since I made that change.  This 
>makes me wonder if sge_schedd has some kind of timeout for its 
>scheduling run that is tied to the schedule_interval.  Even if it does, 
>that doesn't seem quite right, as sge_schedd was hardly using any CPU 
>time, even with the 15 second schedule_interval.  And when I temporarily 
>turned on profiling for schedd, it was reporting only a little over one 
>second to do a run.
>
>Thanks,
>
>Sean
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>For additional commands, e-mail: users-help at gridengine.sunsource.net