[GE users] SGE6 does not backfill

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Wed Apr 13 12:14:19 BST 2005





Stephan Grell - Sun Germany - SSG - Software Engineer wrote:

>Juha Jäykkä wrote:
>
>  
>
>>>We know the problem, but were not able to replicate it ourselves. What
>>>kind of hardware are you using? How big is your grid?
>>>   
>>>
>>>      
>>>
>>12 2-way amd64-nodes plus 2-way front end (which belongs to the cluster
>>queue, but has slots=0). Front end is the only submit host. Hardware is HP
>>DL145's for all nodes (identical, except one has one hard disc from a
>>different manufacturer since the original died) and HP DL585 for front
>>end.
>>
>> 
>>
>>    
>>
>>>It could be a file descriptor limit problem. During startup, the
>>>qmaster logs the number of file descriptors it will use.
>>>This number has to be larger than the number of execds + average
>>>connected clients.
>>>   
>>>
>>>      
>>>
>>Hmm... can I check this number somewhere and increase it to see what
>>happens?
>>
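
On checking the number: a minimal Python sketch, not a definitive tool, of the comparison described above (descriptor limit versus number of execds plus average connected clients). It reads the open-file limit of the process that runs it, so it only reflects the qmaster if it is run as the same user and from the same environment that starts the qmaster, which is an assumption. The number the qmaster logs at startup remains the authoritative one. The figures of 13 execds (12 nodes plus the front end) and 20 average clients are illustrative assumptions, and fd_headroom is a hypothetical helper name.

```python
import resource


def fd_headroom(num_execds, avg_clients, soft_limit=None):
    """Return descriptors left over after execds and clients are counted.

    A result of zero or less means the limit is too small by the rule
    quoted above. Returns None when the limit is unlimited.
    """
    if soft_limit is None:
        # Soft limit on open files for the *current* process.
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return soft_limit - (num_execds + avg_clients)


if __name__ == "__main__":
    # Assumed values: 13 execds, 20 average connected clients.
    headroom = fd_headroom(num_execds=13, avg_clients=20)
    if headroom is None:
        print("open-file limit is unlimited")
    elif headroom > 0:
        print("limit leaves", headroom, "descriptors of headroom")
    else:
        print("limit is too small by", -headroom + 1, "descriptors")
```

If the headroom turns out small or negative, raising the open-file limit in the environment that starts the qmaster and restarting it would be the thing to try.
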
>>Note that with SGE6.0u3 this *never* occurred during the couple of weeks
>>we used it, so it is most likely related to some change between u3 and
>>last Monday's CVS version (which is what I used).
>>
>>    
>>
>The reports we received were for 6.0u3 and u4, and they always involved
>Linux on amd64 machines.
>
I should say Linux 2.6.

Stephan

>>>When a job vanishes, it usually takes very long for that job to
>>>start up; in a previously reported case, 2 minutes. Somehow the
>>>qmaster sends a signal to one of the execds which has not received
>>>the job yet and therefore reports it as unknown. The qmaster
>>>then thinks that the job has failed and removes it.
>>>   
>>>
>>>      
>>>
>>I did not notice this previously, but obviously the job goes to the exec
>>host, yet for some reason qmaster thinks it did not. First, it thinks
>>execd does not know of the job and then, after half a minute, it thinks
>>that the job is on the exec host, though it is not supposed to be (since
>>the execd just 30 secs ago did not know of the job!):
>>
>>04/13/2005 13:18:52|qmaster|topaasi|W|job 299.1 failed on host compute-0-11.local in recognizing job because: execd doesn't know this job
>>04/13/2005 13:19:24|qmaster|topaasi|E|execd at compute-0-11.local reports running job (299.1/master) in queue "all.q at compute-0-11.local" that was not supposed to be there - killing
>>
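
For anyone wanting to measure that gap across a whole messages file, here is a rough Python sketch. It assumes only what the two entries above show: pipe-separated fields (timestamp, daemon, host, level, message) with month/day/year dates, one entry per line. The function names and the matched message fragments are my own choices for illustration, not part of SGE.

```python
import re
from datetime import datetime

# timestamp|daemon|host|level|message, as in the two entries quoted above.
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|([^|]*)\|([^|]*)\|([A-Z])\|(.*)$"
)
# Matches both "job 299.1" and "job (299.1/master)".
JOB_RE = re.compile(r"\bjob \(?(\d+\.\d+)")


def parse_line(line):
    """Split one log entry into its fields, or return None if it does not fit."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp, daemon, host, level, message = match.groups()
    job = JOB_RE.search(message)
    return {
        "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        "daemon": daemon,
        "host": host,
        "level": level,
        "job": job.group(1) if job else None,
        "message": message,
    }


def unknown_to_kill_gap(lines, job_id):
    """Seconds between the 'execd doesn't know' entry and the kill entry."""
    unknown = killed = None
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry["job"] != job_id:
            continue
        if unknown is None and "execd doesn't know this job" in entry["message"]:
            unknown = entry["time"]
        elif killed is None and "not supposed to be there" in entry["message"]:
            killed = entry["time"]
    if unknown is None or killed is None:
        return None
    return (killed - unknown).total_seconds()


# The two entries from this thread, each joined onto a single line.
SAMPLE = [
    "04/13/2005 13:18:52|qmaster|topaasi|W|job 299.1 failed on host "
    "compute-0-11.local in recognizing job because: execd doesn't know this job",
    '04/13/2005 13:19:24|qmaster|topaasi|E|execd at compute-0-11.local reports '
    'running job (299.1/master) in queue "all.q at compute-0-11.local" that was '
    "not supposed to be there - killing",
]

if __name__ == "__main__":
    print(unknown_to_kill_gap(SAMPLE, "299.1"))
```

For the two entries above this gives 32 seconds, which matches the "half a minute" described.
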
>>
>>On a side note, I get this after a successful LAM job. Is it serious?
>>
>>04/13/2005 13:21:21|qmaster|topaasi|W|job 297.1 failed on host compute-0-1.local in pestop because: 
>>04/13/2005 13:21:21 [400:12861]: exit_status of pe_stop = 2
>>
>>
>> 
>>
>>    
>>
>>>One of the possible signals could be the reprioritization. Is it
>>>turned off in your case?
>>>   
>>>
>>>      
>>>
>>Yes, it is off. Though it WAS NOT off previously... should I change it
>>back?
>>
>>If the job goes to unknown state, could I not use reschedule_unknown to
>>reschedule it after it fails?
>>
>>    
>>
>Hm, good idea. We should try it.
>
>  
>
>> 
>>
>>    
>>
>>>You can use qping -dump to monitor the communication of the
>>>qmaster with its execds. Could you try to capture a failing pe job?
>>>   
>>>
>>>      
>>>
>>You mean qping -dump topaasi.local 537 execd 1, where 537 is the port my
>>qmaster sits on?
>>
>>    
>>
>qping -dump master_host $SGE_QMASTER_PORT qmaster 1
>
>This way you will see all communication between the qmaster and the
>clients (scheduler, execd, ...)
>
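
If the capture is to be scripted, the command line above can be assembled as in this small Python sketch. It only builds the argument list in the form given above. The host name topaasi.local and port 537 are the values mentioned earlier in this thread, and qping_dump_command is a hypothetical helper name.

```python
import os


def qping_dump_command(master_host, port=None, environ=None):
    """Build: qping -dump <master_host> <port> qmaster 1

    The port falls back to SGE_QMASTER_PORT from the environment.
    """
    env = os.environ if environ is None else environ
    if port is None:
        port = env.get("SGE_QMASTER_PORT")
    if port is None:
        raise ValueError("no port given and SGE_QMASTER_PORT is not set")
    # int() rejects anything that is not a plain port number.
    return ["qping", "-dump", master_host, str(int(port)), "qmaster", "1"]


if __name__ == "__main__":
    # Host and port as mentioned earlier in this thread.
    print(" ".join(qping_dump_command("topaasi.local", 537)))
```

The resulting list can be handed to subprocess.run on a machine where qping is installed, with the output redirected to a file while a failing pe job is submitted.
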
>Stephan
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>For additional commands, e-mail: users-help at gridengine.sunsource.net