[GE users] SGE6 does not backfill

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Wed Apr 13 11:58:12 BST 2005





Juha Jäykkä wrote:

>>we know the problem, but were not able to replicate it ourselves. What
>>kind of hardware are you using? How big is your grid?
>>    
>>
>
>12 2-way amd64 nodes plus a 2-way front end (which belongs to the cluster
>queue, but has slots=0). The front end is the only submit host. Hardware is
>HP DL145s for all nodes (identical, except one has a hard disc from a
>different manufacturer since the original died) and an HP DL585 for the
>front end.
>
>  
>
>>It could be a file descriptor limit problem. During the startup of the
>>qmaster, the qmaster logs the number of file descriptors it will use.
>>This number has to be larger than the number of execds plus the average
>>number of connected clients.
>>    
>>
>
>Hmm... can I check this number somewhere and increase it to see what
>happens?
>
>Note that with SGE 6.0u3 this *never* occurred during the couple of weeks
>we used it, so it is most likely related to some change between u3 and
>last Monday's CVS version (which is what I used).
>
The reports we got were on 6.0u3 and u4, and it was always Linux on
amd64 machines.
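
To answer the question about checking the number: the qmaster logs its
file descriptor budget into its messages file at startup, and the limit
of the shell that starts the daemon can be raised with ulimit. A rough
sketch, assuming the default spool location under $SGE_ROOT/$SGE_CELL
(the grep pattern is only approximate):

    # the qmaster logs how many file descriptors it will use at startup
    grep -i "descriptor" $SGE_ROOT/$SGE_CELL/spool/qmaster/messages

    # current per-process limit of the shell starting sge_qmaster
    ulimit -n

    # raise the limit and restart the qmaster (as root)
    ulimit -n 8192
    $SGE_ROOT/$SGE_CELL/common/sgemaster restart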

>
>  
>
>>In the case of a vanishing job, it usually takes very long for that
>>job to start up - in a previously reported case, 2 minutes. Somehow the
>>qmaster sends a signal to one of the execds, which has not gotten
>>the job yet and therefore reports it as unknown. The qmaster
>>then thinks that the job has failed and removes it.
>>    
>>
>
>I did not notice this previously, but obviously the job goes to the exec
>host, but for some reason the qmaster thinks it did not. First, it thinks
>the execd does not know of the job, and then, after half a minute, it
>thinks the job is on the exec host, though it's not supposed to be (since
>the execd did not know of the job just 30 secs ago!):
>
>04/13/2005 13:18:52|qmaster|topaasi|W|job 299.1 failed on host compute-0-11.local in recognizing job
> because: execd doesn't know this job
>04/13/2005 13:19:24|qmaster|topaasi|E|execd at compute-0-11.local reports running job (299.1/master) in
> queue "all.q at compute-0-11.local" that was not supposed to be there - killing
>
>
>On a side note, I get this after a successful LAM job. Is it serious?
>
>04/13/2005 13:21:21|qmaster|topaasi|W|job 297.1 failed on host compute-0-1.local in pestop because: 
>04/13/2005 13:21:21 [400:12861]: exit_status of pe_stop = 2
>
>
>  
>
>>One of the possible signals could be the reprioritization. Is it
>>turned off in your case?
>>    
>>
>
>Yes, it is off. Though it WAS NOT off previously... should I change it
>back?
>
>If the job goes to unknown state, could I not use reschedule_unknown to
>reschedule it after it fails?
>
Hm, good idea. We should try it.
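
If you want to try it: reschedule_unknown is a parameter of the cluster
configuration (sge_conf(5)). A rough sketch, assuming you want jobs on a
host rescheduled after it has been in unknown state for 10 minutes:

    # show the current global setting
    qconf -sconf | grep reschedule_unknown

    # edit the global configuration and set, for example:
    #   reschedule_unknown     00:10:00
    qconf -mconf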

> 
>  
>
>>You can use qping -dump to monitor that communication of the
>>qmaster with its execd. Could you try to capture a failing pe job?
>>    
>>
>
>You mean qping -dump topaasi.local 537 execd 1, where 537 is the port my
>qmaster sits on?
>
qping -dump master_host $SGE_QMASTER_PORT qmaster 1

This way you will see all the communication between the qmaster and its
clients (scheduler, execd, ...).
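
To capture a failing PE job you can simply let the dump run into a file
while you resubmit the job. A rough sketch - SGE_QMASTER_PORT is assumed
to be set in your environment (otherwise use the port you quoted), and
act_qmaster holds the current master host:

    # dump all qmaster <-> client traffic until interrupted
    qping -dump `cat $SGE_ROOT/$SGE_CELL/common/act_qmaster` \
        $SGE_QMASTER_PORT qmaster 1 > /tmp/qping.dump 2>&1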

Stephan


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



